Building an audio knowledge graph with ML and TypeDB

This tutorial demonstrates how to integrate TypeDB with machine learning models to build an audio indexing backend. You’ll learn how to build a backend which performs the following steps:
- Transcribes speech using OpenAI Whisper (ASR)
- Identifies speaker segments using pyannote.audio (diarization)
- Stores everything in a TypeDB knowledge graph for querying
By the end, you will have a working pipeline that turns raw audio into structured, queryable data stored in TypeDB.
Why: Semantic audio indexing
Raw audio files are rich in information but unfortunately opaque: you cannot search them, query them, or connect them to other data.
It is often desirable to process audio recordings so that we can ask questions such as “What did Speaker 1 say?” or “Find all utterances between 30s and 60s”.
This is exactly what the semantic audio indexing backend we’re building will enable.
A small caveat: speaker diarisation identifies speaker IDs (e.g., “SPEAKER_00”, “SPEAKER_01”), not actual person names. Determining real names is out of scope for this tutorial, as it requires a more sophisticated detection mechanism!
How the pieces fit together
We combine three components:
- Whisper ASR (https://huggingface.co/openai/whisper-base): automatic speech recognition that produces timestamped text
- pyannote.audio (https://huggingface.co/pyannote/speaker-diarization-3.1): speaker diarisation that identifies which speaker is speaking when (produces speaker IDs like “SPEAKER_00”, not actual names)
- TypeDB (https://typedb.com): a knowledge graph database for storing and querying the results
Together, these form a complete audio-to-knowledge-graph pipeline.
Step 1: Prerequisites
1.1 Start TypeDB
Install TypeDB and start the server:
```sh
typedb server
```
This launches TypeDB on its default ports: 1729 for the console and language drivers, and 8000 for the HTTP API used later in this tutorial.
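If you want to confirm the server and its HTTP API are reachable before continuing, a quick sign-in request works. This is a minimal sketch that assumes the default admin/password credentials used throughout this tutorial:

```python
import requests

# Connectivity check against the TypeDB HTTP API (assumes default
# credentials "admin"/"password"; adjust if you have changed them).
response = requests.post(
    "http://localhost:8000/v1/signin",
    json={"username": "admin", "password": "password"},
)
response.raise_for_status()
print("TypeDB HTTP API is up; received an access token.")
```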
1.2 Set up the Python environment
This project uses Poetry and requires Python 3.11.8.
Install the required dependencies:
```sh
poetry add torch
poetry add jaxtyping
poetry add datasets        # Hugging Face datasets library
poetry add pyannote-audio  # Speaker diarisation library
poetry add torchcodec      # Audio codec handling
poetry add transformers    # Hugging Face models library
```
macOS Note: FFmpeg libraries must be accessible:
```sh
export DYLD_LIBRARY_PATH=/opt/homebrew/Cellar/ffmpeg@7/7.1.3/lib
```
Step 2: Build the Audio Indexer
2.1 Define the TypeDB schema
The TypeDB schema is defined in schema/schema.tql:
```typeql
define
  attribute id, value string;
  attribute start_time, value double;
  attribute end_time, value double;
  attribute text, value string;

  entity speaker, owns id @key;
  entity utterance, owns start_time, owns end_time, owns text;

  relation spoke, relates speaker_role, relates utterance_role;
  speaker plays spoke:speaker_role;
  utterance plays spoke:utterance_role;
```
This schema models:
- Speakers with unique IDs (e.g., “SPEAKER_00”, “SPEAKER_01” from diarization)
- Utterances with timing and text
- Relations connecting speakers to their utterances
To set up the database, we use a setup script in schema/setup_script.tql:
```
database delete diarisation
database create diarisation
transaction schema diarisation
source schema.tql
commit
```
Execute the setup script using TypeDB Console:
```sh
cd schema
typedb console --address localhost:1729 \
  --username admin \
  --password password \
  --tls-disabled \
  --script=setup_script.tql
```
This will:
- Delete any existing diarisation database (if present)
- Create a fresh diarisation database
- Load and commit the schema from schema.tql
2.2 Load audio data
We’ll use an audio example from the diarizers-community/voxconverse dataset, a benchmark for speaker diarisation.
Specifically, we’ll load the 5th audio sample (index 4) from the test dataset and use only the first five minutes:
```python
from datasets import load_dataset

# Load the VoxConverse test dataset
dataset = load_dataset("diarizers-community/voxconverse", split="test")

# Get the 5th sample (index 4)
sample = dataset[4]
sample = sample['audio'].get_all_samples()

# Limit to first 5 minutes (300 seconds)
sample_data = sample.data[:, :(300 * sample.sample_rate)]
```
This gives us raw audio waveform data with the sample rate, limited to 5 minutes for faster processing.
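As a quick sanity check, you can print the sample rate and the shape of the waveform. This is a minimal sketch that assumes the waveform is a (channels, samples) tensor, as returned by the code above:

```python
# Sanity check: the slice above keeps at most 300 seconds of audio.
print(f"Sample rate: {sample.sample_rate} Hz")
print(f"Waveform shape: {tuple(sample_data.shape)}")
print(f"Duration kept: {sample_data.shape[-1] / sample.sample_rate:.1f}s")
```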
2.3 Speech recognition with Whisper
Load the Whisper model and transcribe the audio:
```python
from transformers import pipeline

asr_engine = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    generate_kwargs={"language": "english"}
)

recognition = asr_engine(
    {"raw": sample_data.squeeze().numpy(), "sampling_rate": sample.sample_rate},
    return_timestamps=True
)
```
The result contains timestamped text chunks:
```python
for chunk in recognition["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:.1f}s - {end:.1f}s: {chunk['text']}")
```
2.4 Speaker diarisation with pyannote
Load the diarisation pipeline and process the audio:
```python
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

diarisation_engine = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

with ProgressHook() as hook:
    diarisation = diarisation_engine(
        {'waveform': sample_data, 'sample_rate': sample.sample_rate},
        hook=hook
    )
```
This produces speaker segments with timing. Each speaker is assigned an ID (e.g., “SPEAKER_00”, “SPEAKER_01”):
```python
for turn, speaker in diarisation.speaker_diarization:
    print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
```
2.5 Combining recognition and diarisation
Now we merge what was said (recognition) with who said it (diarisation):
```python
def combine_recognition_and_diarisation(recognition, diarisation):
    """Combine 'what was said' (recognition) with 'who said it' (diarisation)."""
    combined_utterances = []

    for turn, speaker in diarisation.speaker_diarization:
        segment_start = turn.start
        segment_end = turn.end

        # Find all text chunks that overlap with this speaker segment
        segment_text = []
        for chunk in recognition["chunks"]:
            chunk_start, chunk_end = chunk["timestamp"]
            if chunk_end is None:
                chunk_end = chunk_start + 1.0

            # Check if chunk overlaps with speaker segment
            if chunk_start < segment_end and chunk_end > segment_start:
                segment_text.append(chunk["text"])

        combined_text = "".join(segment_text).strip()
        if combined_text:
            combined_utterances.append({
                "speaker": speaker,
                "start": segment_start,
                "end": segment_end,
                "text": combined_text
            })

    return combined_utterances
```
The result is a list of utterances, each with a speaker label, timestamps, and text.
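To see what this looks like in practice, you can call the function and print the first few merged utterances. The field names below follow the dictionaries built above:

```python
utterances = combine_recognition_and_diarisation(recognition, diarisation)

# Inspect the first few merged utterances
for utterance in utterances[:5]:
    print(
        f"{utterance['speaker']} "
        f"[{utterance['start']:.1f}s - {utterance['end']:.1f}s]: "
        f"{utterance['text']}"
    )
```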
2.6 TypeDB ORM for storage
Create a client to interact with TypeDB via its HTTP API:
```python
import requests
from typing import List, Dict, Any


class TypeDBClient:
    """HTTP client for TypeDB 3.x operations."""

    def __init__(
        self,
        endpoint: str = "http://localhost:8000",
        database: str = "diarisation",
        username: str = "admin",
        password: str = "password"
    ):
        self.endpoint = endpoint
        self.database = database
        self.access_token = self._sign_in(username, password)

    def _sign_in(self, username: str, password: str) -> str:
        """Sign in and get an access token."""
        url = f"{self.endpoint}/v1/signin"
        response = requests.post(url, json={"username": username, "password": password})
        response.raise_for_status()
        return response.json()["token"]

    def execute_write(self, query: str) -> Dict[str, Any]:
        """Execute a write query using the one-shot query API."""
        url = f"{self.endpoint}/v1/query"
        body = {
            "databaseName": self.database,
            "transactionType": "write",
            "query": query
        }
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.access_token}"
        }
        response = requests.post(url, json=body, headers=headers)
        response.raise_for_status()
        return response.json()
```
Add methods for inserting data:
```python
    def insert_speaker(self, id: str) -> Dict[str, Any]:
        """Insert a speaker (idempotent with 'put')."""
        query = f'''
            put $speaker isa speaker, has id "{id}";
        '''
        return self.execute_write(query)

    def insert_utterance_with_speaker(
        self,
        speaker_id: str,
        start_time: float,
        end_time: float,
        text: str
    ) -> Dict[str, Any]:
        """Insert an utterance and link it to a speaker."""
        escaped_text = text.replace('"', '\\"')
        query = f'''
            match
                $speaker isa speaker, has id "{speaker_id}";
            insert
                $utterance isa utterance,
                    has start_time {start_time},
                    has end_time {end_time},
                    has text "{escaped_text}";
                (speaker_role: $speaker, utterance_role: $utterance) isa spoke;
        '''
        return self.execute_write(query)
```
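Here is how these pieces might be used for a single utterance. This is a minimal usage sketch that assumes the TypeDB server from Step 1 is running with the default credentials, and that the speaker ID and utterance values are placeholders:

```python
# Minimal usage sketch: insert one speaker and one utterance.
client = TypeDBClient()  # defaults: http://localhost:8000, database "diarisation"

client.insert_speaker("SPEAKER_00")
client.insert_utterance_with_speaker(
    speaker_id="SPEAKER_00",
    start_time=12.3,
    end_time=15.8,
    text="Hello, and welcome to the show.",
)
```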
2.7 Putting it all together
The complete pipeline:
```python
if __name__ == "__main__":
    # 1. Load audio data
    dataset = load_dataset("diarizers-community/voxconverse", split="test")
    sample = dataset[4]['audio'].get_all_samples()
    sample_data = sample.data[:, :(300 * sample.sample_rate)]

    # 2. Load ML models
    asr_engine = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    diarisation_engine = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

    # 3. Transcribe audio
    recognition = asr_engine(
        {"raw": sample_data.squeeze().numpy(), "sampling_rate": sample.sample_rate},
        return_timestamps=True
    )

    # 4. Identify speakers
    with ProgressHook() as hook:
        diarisation = diarisation_engine(
            {'waveform': sample_data, 'sample_rate': sample.sample_rate},
            hook=hook
        )

    # 5. Combine recognition with diarisation
    utterances = combine_recognition_and_diarisation(recognition, diarisation)

    # 6. Store in TypeDB
    store_diarisation_results(utterances)
```
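The main script calls store_diarisation_results, which loops over the combined utterances and writes them through the client. One possible implementation, sketched here using the TypeDBClient methods and typing imports from section 2.6, looks like this:

```python
def store_diarisation_results(utterances: List[Dict[str, Any]]) -> None:
    """Store combined utterances in TypeDB (one possible implementation)."""
    client = TypeDBClient()

    for utterance in utterances:
        # 'put' makes speaker insertion idempotent, so repeated IDs are safe
        client.insert_speaker(utterance["speaker"])
        client.insert_utterance_with_speaker(
            speaker_id=utterance["speaker"],
            start_time=utterance["start"],
            end_time=utterance["end"],
            text=utterance["text"],
        )

    print(f"Stored {len(utterances)} utterances in TypeDB.")
```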
Step 3: Run the pipeline
Execute the pipeline:
```sh
./main.sh

# Or manually:
DYLD_LIBRARY_PATH=/opt/homebrew/Cellar/ffmpeg@7/7.1.3/lib poetry run python -m main
```
The output shows:
- Timestamped transcription chunks
- Speaker segments with timing
- Combined utterances with speaker labels
- Confirmation of TypeDB storage
Conclusion
We have built an audio data indexer that transforms raw audio into a structured knowledge graph by combining:
- Whisper for speech-to-text with timestamps
- pyannote.audio for speaker diarisation
- TypeDB for knowledge graph storage
This enables querying conversations by speaker, time, and content, allowing both users and other systems to query and manipulate the extracted data.
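For example, a read query along the following lines retrieves everything a given speaker said within a time window. This is a sketch only: it reuses the TypeDBClient from section 2.6 and assumes the one-shot /v1/query endpoint also accepts "read" transactions; check the HTTP API docs for your TypeDB version and adjust the TypeQL as needed.

```python
import requests

client = TypeDBClient()

# Find everything SPEAKER_00 said between 30s and 60s (illustrative values).
query = '''
    match
        $speaker isa speaker, has id "SPEAKER_00";
        $utterance isa utterance,
            has start_time $start, has end_time $end, has text $text;
        (speaker_role: $speaker, utterance_role: $utterance) isa spoke;
        $start >= 30.0; $end <= 60.0;
    select $start, $end, $text;
'''
response = requests.post(
    f"{client.endpoint}/v1/query",
    json={"databaseName": client.database, "transactionType": "read", "query": query},
    headers={"Authorization": f"Bearer {client.access_token}"},
)
response.raise_for_status()
print(response.json())
```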
The same pattern applies to any audio source: podcasts, meetings, interviews, or customer calls. The knowledge graph structure makes it easy to extend with additional metadata and relationships.
Full Working Example
For a complete, working implementation of this tutorial, visit the GitHub repository.



