Building an audio knowledge graph with ML and TypeDB

This tutorial demonstrates how to integrate TypeDB with machine learning models to build an audio indexing backend.

Ganesh Hananda


You’ll learn how to build a backend that performs the following steps:

  • Transcribes speech using OpenAI Whisper (ASR)
  • Identifies speaker segments using pyannote.audio (diarisation)
  • Stores everything in a TypeDB knowledge graph for querying

By the end, you will have a working pipeline that transforms raw audio into structured, queryable data stored in TypeDB.

Why: Semantic audio indexing

Raw audio files are rich but unfortunately opaque. You cannot search them, query them, or connect them to other data.

It is often desirable to process audio recordings so that we can ask questions such as “What did Speaker 1 say?” or “Find all utterances between 30s and 60s”.

This is what the semantic audio indexing backend we’re building will enable you to do.

A small caveat – speaker diarisation identifies speaker IDs (e.g., “SPEAKER_00”, “SPEAKER_01”), not actual person names. Determining real names is out of scope for this tutorial, as it requires a more sophisticated detection mechanism!

How the pieces fit together

We combine three components:

  1. Whisper ASR: Automatic speech recognition that produces timestamped text
    https://huggingface.co/openai/whisper-base
  2. pyannote.audio: Speaker diarisation that identifies which speaker is speaking when (produces speaker IDs like “SPEAKER_00”, not actual names)
    https://huggingface.co/pyannote/speaker-diarization-3.1
  3. TypeDB: A knowledge graph database for storing and querying the results
    https://typedb.com

Together, these form a complete audio-to-knowledge-graph pipeline.

Step 1: Prerequisites

1.1 Start TypeDB

Install TypeDB and start the server:

typedb server

This launches TypeDB on its default ports: 1729 for the gRPC endpoint (used by TypeDB Console below) and 8000 for the HTTP endpoint (used by our Python client in step 2.6).

1.2 Set up the Python environment

This project uses Poetry and requires Python 3.11.8.

Install the required dependencies:

poetry add torch
poetry add jaxtyping
poetry add requests # HTTP client for the TypeDB HTTP API
poetry add datasets # Hugging Face datasets library
poetry add pyannote-audio # Speaker diarisation library
poetry add torchcodec # Audio codec handling
poetry add transformers # Hugging Face models library

macOS Note: FFmpeg libraries must be accessible:

export DYLD_LIBRARY_PATH=/opt/homebrew/Cellar/ffmpeg@7/7.1.3/lib
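
With the environment in place, you can optionally confirm that the TypeDB server from step 1.1 is reachable. The sketch below signs in over the HTTP API using the same endpoint and the default admin/password credentials that our Python client will use in step 2.6 (adjust them if your installation differs):

import requests

# Sign in against TypeDB's HTTP API to confirm the server is up.
# Assumes the default admin/password credentials used later in this tutorial.
response = requests.post(
    "http://localhost:8000/v1/signin",
    json={"username": "admin", "password": "password"},
)
response.raise_for_status()
print("TypeDB is up; received an access token of length", len(response.json()["token"]))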

Step 2: Build the Audio Indexer

2.1 Define the TypeDB schema

The TypeDB schema is defined in schema/schema.tql:

define

attribute id, value string;
attribute start_time, value double;
attribute end_time, value double;
attribute text, value string;

entity speaker,
    owns id @key;

entity utterance,
    owns start_time,
    owns end_time,
    owns text;

relation spoke,
    relates speaker_role,
    relates utterance_role;

speaker plays spoke:speaker_role;
utterance plays spoke:utterance_role;

This schema models:

  • Speakers with unique IDs (e.g., “SPEAKER_00”, “SPEAKER_01” from diarisation)
  • Utterances with timing and text
  • Relations connecting speakers to their utterances

To set up the database, we use a setup script in schema/setup_script.tql:

database delete diarisation
database create diarisation
transaction schema diarisation
source schema.tql
commit

Execute the setup script using TypeDB Console:

cd schema
typedb console --address localhost:1729 \
    --username admin \
    --password password \
    --tls-disabled \
    --script=setup_script.tql

This will:

  1. Delete any existing diarisation database (if present)
  2. Create a fresh diarisation database
  3. Load and commit the schema from schema.tql

2.2 Load audio data

We’ll use an audio example from the diarizers-community/voxconverse dataset, a benchmark for speaker diarisation.

Specifically, we’ll load the 5th audio sample (index 4) from the test dataset and use only the first five minutes:

from datasets import load_dataset
# Load the VoxConverse test dataset
dataset = load_dataset("diarizers-community/voxconverse", split="test")
# Get the 5th sample (index 4)
sample = dataset[4]
sample = sample['audio'].get_all_samples()
# Limit to first 5 minutes (300 seconds)
sample_data = sample.data[:,:(300*sample.sample_rate)]

This gives us the raw audio waveform data and its sample rate, limited to the first 5 minutes for faster processing.
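
As a quick sanity check, you can inspect the shape and duration of the trimmed audio. This is a small sketch that assumes sample.data is a 2-D (channels × samples) tensor, as returned by get_all_samples():

# Sanity check: shape and duration of the trimmed audio
num_channels, num_frames = sample_data.shape
duration = num_frames / sample.sample_rate
print(f"{num_channels} channel(s), {num_frames} samples, {duration:.1f}s at {sample.sample_rate} Hz")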

2.3 Speech recognition with Whisper

Load the Whisper model and transcribe the audio:

from transformers import pipeline

asr_engine = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-base",
    generate_kwargs={"language": "english"}
)

recognition = asr_engine(
    {"raw": sample_data.squeeze().numpy(), "sampling_rate": sample.sample_rate},
    return_timestamps=True
)

The result contains timestamped text chunks:

for chunk in recognition["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:.1f}s - {end:.1f}s: {chunk['text']}")

2.4 Speaker diarisation with pyannote

Load the diarisation pipeline and process the audio:

from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

diarisation_engine = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

with ProgressHook() as hook:
    diarisation = diarisation_engine(
        {'waveform': sample_data, 'sample_rate': sample.sample_rate},
        hook=hook
    )

This produces speaker segments with timing. Each speaker is assigned an ID (e.g., “SPEAKER_00”, “SPEAKER_01”):

for turn, speaker in diarisation.speaker_diarization:
    print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")

2.5 Combining recognition and diarisation

Now we merge what was said (recognition) with who said it (diarisation):

def combine_recognition_and_diarisation(recognition, diarisation):
    """Combine 'what was said' (recognition) with 'who said it' (diarisation)."""
    combined_utterances = []
    for turn, speaker in diarisation.speaker_diarization:
        segment_start = turn.start
        segment_end = turn.end
        # Find all text chunks that overlap with this speaker segment
        segment_text = []
        for chunk in recognition["chunks"]:
            chunk_start, chunk_end = chunk["timestamp"]
            if chunk_end is None:
                chunk_end = chunk_start + 1.0
            # Check if chunk overlaps with speaker segment
            if chunk_start < segment_end and chunk_end > segment_start:
                segment_text.append(chunk["text"])
        combined_text = "".join(segment_text).strip()
        if combined_text:
            combined_utterances.append({
                "speaker": speaker,
                "start": segment_start,
                "end": segment_end,
                "text": combined_text
            })
    return combined_utterances

The result is a list of utterances, each with a speaker label, timestamps, and text.
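
For example, you can call the function and print the merged utterances (a short usage sketch based on the dictionary keys returned above):

utterances = combine_recognition_and_diarisation(recognition, diarisation)
for u in utterances:
    print(f"{u['speaker']} [{u['start']:.1f}s - {u['end']:.1f}s]: {u['text']}")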

2.6 TypeDB ORM for storage

Create a client to interact with TypeDB via its HTTP API:

import requests
from typing import List, Dict, Any


class TypeDBClient:
    """HTTP client for TypeDB 3.x operations."""

    def __init__(
        self,
        endpoint: str = "http://localhost:8000",
        database: str = "diarisation",
        username: str = "admin",
        password: str = "password"
    ):
        self.endpoint = endpoint
        self.database = database
        self.access_token = self._sign_in(username, password)

    def _sign_in(self, username: str, password: str) -> str:
        """Sign in and get an access token."""
        url = f"{self.endpoint}/v1/signin"
        response = requests.post(url, json={"username": username, "password": password})
        response.raise_for_status()
        return response.json()["token"]

    def execute_write(self, query: str) -> Dict[str, Any]:
        """Execute a write query using the one-shot query API."""
        url = f"{self.endpoint}/v1/query"
        body = {
            "databaseName": self.database,
            "transactionType": "write",
            "query": query
        }
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.access_token}"
        }
        response = requests.post(url, json=body, headers=headers)
        response.raise_for_status()
        return response.json()

Add methods for inserting data:

def insert_speaker(self, id: str) -> Dict[str, Any]:
    """Insert a speaker (idempotent with 'put')."""
    query = f'''
    put
        $speaker isa speaker, has id "{id}";
    '''
    return self.execute_write(query)

def insert_utterance_with_speaker(
    self,
    speaker_id: str,
    start_time: float,
    end_time: float,
    text: str
) -> Dict[str, Any]:
    """Insert an utterance and link it to a speaker."""
    escaped_text = text.replace('"', '\\"')
    query = f'''
    match
        $speaker isa speaker, has id "{speaker_id}";
    insert
        $utterance isa utterance,
            has start_time {start_time},
            has end_time {end_time},
            has text "{escaped_text}";
        (speaker_role: $speaker, utterance_role: $utterance) isa spoke;
    '''
    return self.execute_write(query)
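
The main script in step 2.7 calls a store_diarisation_results helper, which is not shown in the snippets above. Here is a minimal sketch of what it could look like, built only on the client methods we just defined:

def store_diarisation_results(utterances):
    """Store speakers and their utterances in TypeDB (illustrative sketch)."""
    client = TypeDBClient()
    # Insert each speaker once ('put' makes this idempotent)
    for speaker_id in {u["speaker"] for u in utterances}:
        client.insert_speaker(speaker_id)
    # Insert every utterance and link it to its speaker
    for u in utterances:
        client.insert_utterance_with_speaker(
            speaker_id=u["speaker"],
            start_time=u["start"],
            end_time=u["end"],
            text=u["text"]
        )
    print(f"Stored {len(utterances)} utterances in TypeDB")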

2.7 Putting it all together

The complete pipeline:

if __name__ == "__main__":
    # 1. Load audio data
    dataset = load_dataset("diarizers-community/voxconverse", split="test")
    sample = dataset[4]['audio'].get_all_samples()
    sample_data = sample.data[:,:(300*sample.sample_rate)]

    # 2. Load ML models
    asr_engine = pipeline("automatic-speech-recognition", model="openai/whisper-base")
    diarisation_engine = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

    # 3. Transcribe audio
    recognition = asr_engine(
        {"raw": sample_data.squeeze().numpy(), "sampling_rate": sample.sample_rate},
        return_timestamps=True
    )

    # 4. Identify speakers
    with ProgressHook() as hook:
        diarisation = diarisation_engine(
            {'waveform': sample_data, 'sample_rate': sample.sample_rate},
            hook=hook
        )

    # 5. Combine recognition with diarisation
    utterances = combine_recognition_and_diarisation(recognition, diarisation)

    # 6. Store in TypeDB
    store_diarisation_results(utterances)

Step 3: Run the pipeline

Execute the pipeline:

./main.sh
# Or manually:
DYLD_LIBRARY_PATH=/opt/homebrew/Cellar/ffmpeg@7/7.1.3/lib poetry run python -m main

The output shows:

  1. Timestamped transcription chunks
  2. Speaker segments with timing
  3. Combined utterances with speaker labels
  4. Confirmation of TypeDB storage

Conclusion

We have built an audio data indexer that transforms raw audio into a structured knowledge graph by combining:

  • Whisper for speech-to-text with timestamps
  • pyannote.audio for speaker diarisation
  • TypeDB for knowledge graph storage

This enables querying conversations by speaker, time, and content, so that both users and other systems can query and manipulate the extracted data.
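
As an illustration, here is one way to ask “What did SPEAKER_00 say?” against the schema from step 2.1. This is a sketch that reuses the TypeDBClient from step 2.6 and assumes the one-shot /v1/query endpoint accepts read transactions in the same way it accepts writes:

def fetch_utterances_for_speaker(client: TypeDBClient, speaker_id: str):
    """Read back all utterances linked to a speaker via the 'spoke' relation."""
    query = f'''
    match
        $speaker isa speaker, has id "{speaker_id}";
        (speaker_role: $speaker, utterance_role: $utterance) isa spoke;
        $utterance has text $text, has start_time $start, has end_time $end;
    '''
    body = {
        "databaseName": client.database,
        "transactionType": "read",  # assumption: mirrors the write case in step 2.6
        "query": query
    }
    headers = {"Authorization": f"Bearer {client.access_token}"}
    response = requests.post(f"{client.endpoint}/v1/query", json=body, headers=headers)
    response.raise_for_status()
    return response.json()

# Example: fetch_utterances_for_speaker(TypeDBClient(), "SPEAKER_00")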

The same pattern applies to any audio source: podcasts, meetings, interviews, or customer calls. The knowledge graph structure makes it easy to extend with additional metadata and relationships.

Full Working Example

For a complete, working implementation of this tutorial, visit the GitHub repository.
