TypeDB Learning Center
Building Enhanced Knowledge Graphs with TypeDB
Knowledge engineering is a continuously developing field, and it has only become more relevant as emerging AI technologies demand increasingly sophisticated knowledge bases to draw on. One of the biggest advances in knowledge graph design was the shift from primitive triplestores to higher-level, more expressive graph databases in the early 2010s, but there have been no significant technological leaps since then. While AI has seen astounding improvements in the past fifteen years, the databases we use to store our knowledge graphs have remained largely the same.
Certainly, there have been architectural improvements: graph databases are faster, more scalable, and have integrated multi-model capabilities. But the underlying data model has not changed, limiting the ability of graph query languages to express high-level queries in terms of business logic. TypeDB offers a functional database programming model, with a high-level conceptual query language. In this article, we’ll explore how this can enable more expressive knowledge graphs, and improve accuracy for retrieval-augmented generation tasks.
Why knowledge graphs?
Knowledge graphs originate in the field of computational linguistics, with the term generally considered to have been coined in 1973. In this sense, knowledge graphs have always been closely tied to the field of natural language processing (NLP). There was considerable early interest in knowledge graphs for building expert systems, though they did not receive widespread recognition until 2012, with the release of Google’s Knowledge Graph. Its impact has been felt by all of us: in addition to providing results based on similarity to search terms, Google’s search engine has since displayed a helpful summary of key information on the right-hand side of the page.
Since 2012, knowledge graphs have only continued to become more relevant. The knowledge graph landscape was profoundly changed in November 2022 with the release of the large language model (LLM) ChatGPT. By January, it had become the fastest-growing software product in history, and spawned a new wave of LLM technologies and tooling. LLMs have quickly proved themselves in a variety of user-facing roles, but they have not been without their problems: problems that knowledge graphs were perfectly positioned to solve.
The problem with LLMs
One of the biggest challenges in LLM development is preventing hallucinations: inaccurate or nonsensical outputs that merely mimic a correct response. Many of the more egregious hallucinations have made international news: Bard incorrectly claiming that the James Webb Space Telescope was the first to photograph exoplanets, Gemini generating racially diverse images of Second World War era German soldiers, and most recently, Google AI Search advising users to add glue to their pizza to improve the consistency of the cheese.
Many hallucinations are not so obviously erroneous, appearing plausible without further investigation. These are the more problematic kind, as they have the potential to be taken as fact by the user. In June 2023, a case heard at a US District Court gained media attention when the plaintiff’s lawyers submitted material citing case law precedent for which the cases were entirely nonexistent, having been hallucinated by ChatGPT. Early versions of ChatGPT were notorious for falsely referencing information, and this is still an area in which LLMs struggle.
Where do hallucinations come from?
Typically, hallucinations arise from problems with the training process: inaccurate data, biased data, and overfitting can all contribute to hallucinations. As an emergent property of LLMs, hallucinations are difficult to tackle without negatively affecting correct outputs. This arises from the fundamental nature of LLMs: both training and response generation are probabilistic processes, so it is not possible to guarantee that errors will not occur. As a result, managing hallucinations has become a key area of research as the LLM industry matures.
Retrieval-augmented generation
By far, the most popular method of reducing hallucinations is to use retrieval-augmented generation (RAG). In a RAG system, the language model uses a database as a source of truth. User prompts are broken down by the LLM into one or more queries that are issued to the database. The results returned by the database are then parsed by the LLM in order to generate a response to the original prompt. This is very similar to the way Google’s Knowledge Graph works: the search phrase is broken down to identify the key terms, which are then queried for in the data store, with the results returned to the user.
RAG combines the LLM’s ability to process ad-hoc natural language inputs with the permanence and robust accuracy provided by a database, overcoming the LLM’s tendency to generate inaccurate data without the user having to write any database queries. RAG also provides additional benefits, such as the ability to make ad-hoc updates to the knowledge base without retraining the LLM.
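In code, the RAG loop described above is short; the complexity lives entirely in the quality of the generated queries. Here is a minimal Python sketch, where StubLLM and StubDB are hypothetical stand-ins for the real model and database clients:

```python
class StubLLM:
    """Hypothetical stand-in for a language model client."""
    def generate_queries(self, prompt):
        # A real LLM would decompose the prompt into database queries.
        return ["MATCH (c:Country) RETURN c.name"]

    def synthesise_answer(self, prompt, results):
        # A real LLM would phrase a natural-language response.
        return f"Based on the database: {results}"

class StubDB:
    """Hypothetical stand-in for a database client."""
    def run(self, query):
        return ["Argentina"]

def answer_with_rag(prompt, llm, db):
    """The RAG loop: prompt -> queries -> results -> grounded answer."""
    queries = llm.generate_queries(prompt)
    results = [db.run(q) for q in queries]
    return llm.synthesise_answer(prompt, results)
```

The database results, rather than the model’s parametric memory, become the factual basis of the answer; this is what curbs hallucination.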
Structured vs unstructured data
RAG has consistently been found to outperform fine-tuning as a method for improving LLM response quality, especially for tasks concerning specific facts or knowledge. Using structured data for RAG, in the form of knowledge graphs, generally produces more accurate outputs than using unstructured data, such as books or articles. One reason for this is the more precise delineation between facts.
Individual facts in a knowledge graph are generally encoded as well-defined semantic triples, the most atomic form of knowledge representation. This is in contrast to unstructured data, where the scope of a fact may vary throughout: an individual fact might be captured by just a single clause in a sentence, or an entire paragraph of text. Likewise, multiple facts might overlap within the same block of text. Transformation of unstructured data into structured data is a popular way of improving responses, though this can be a costly process.
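To make the contrast concrete, here is how a few facts might be encoded as semantic triples and retrieved; a minimal sketch using plain tuples (real triplestores use IRIs and a formal vocabulary):

```python
# Each fact is one atomic (subject, predicate, object) triple.
facts = [
    ("Neo4j",   "instance_of", "graph database"),
    ("Neo4j",   "released_in", 2007),
    ("ChatGPT", "released_in", 2022),
]

def lookup(subject, predicate):
    """Return the objects of every triple matching subject and predicate."""
    return [o for s, p, o in facts if s == subject and p == predicate]
```

Because each triple carries exactly one fact, retrieval is unambiguous: there is no need to decide where one fact ends and the next begins, as there would be with free text.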
Using graph databases for RAG
Knowledge graphs for RAG systems are stored in graph databases, which have likewise seen a boom in popularity in the last decade. The idea of graph-based data stores had existed since the early 1990s, and the first commercially viable versions were released in the mid-2000s, notably Neo4j in 2007, the most popular graph database today. Graph databases are based on graph theory and implemented as labeled property graphs (LPGs), which are built up from semantic triples at the fundamental level. However, unlike triplestores, which require users to directly manipulate triples, graph databases provide high-level abstractions for working with data, inspired by Chen’s 1976 entity-relationship (ER) model.
However, graph databases come with a number of caveats. While their data is considered structured by RAG standards, they are considered only semi-structured in the context of database models, because graph databases are schemaless. In a previous article series, we explored the problems that arise from modelling complex, interconnected data without a sufficiently expressive schema. Most notably, the lack of a schema that can describe high-level model constraints means that we cannot leverage such constraints when querying, leading to queries that are more imperative than declarative. In this article, we will explore how this impedes the ability of RAG systems to accurately generate queries, and how TypeDB provides a higher-level querying syntax that LLMs can capitalise on when generating queries.
Social media graph
For this exercise, we’ll use the example of a social media graph that captures a number of facts about different entities:
- People and their relationships: friends, relatives, parents, children, siblings, partners, fiancés, and spouses.
- Organisations of different types: companies, charities, schools, colleges, and universities, and the people who have been employed by or attended them.
- Groups and their members.
- The interactions by users with different pages (personal and organisation profiles, in addition to groups): viewing, posting, sharing, commenting, reacting, and responding to polls.
- Geographic locations, their hierarchies, and the things located there: people, organisations, life events, and posts.
We’ll see how we can go about implementing three sample queries using both Neo4j and TypeDB. The queries showcased are simple to phrase but difficult to answer, requiring diverse data spread over large sub-graphs to be aggregated in order to reach the correct result.
Common countries
What are the names of countries that both “Emilio Ravenna” and “Gabriel Medina” have been to?
Let’s start off with something fairly easy. For this query, we’ll need to understand how the knowledge graph can tell us whether a user has been to a particular country. The country might be listed as the user’s location, they might have a recorded life event that took place there, or they might have posted from there. The complexity arises from the fact that the location listed in any of these cases might not be a country: it might be listed as a state, city, or even a specific building (e.g. the Sydney Opera House).
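The crux of the query is the recursive walk up the location hierarchy until a country is reached. As a language-neutral sketch, with a hypothetical parent map standing in for the graph’s location edges:

```python
# Hypothetical location hierarchy: each place maps to its parent place.
PARENT = {
    "Sydney Opera House": "Sydney",
    "Sydney": "New South Wales",
    "New South Wales": "Australia",
}
COUNTRIES = {"Australia"}

def country_of(place):
    """Follow the hierarchy upward until a country (or nothing) is reached."""
    while place is not None and place not in COUNTRIES:
        place = PARENT.get(place)
    return place
```

A graph database performs this same walk declaratively, as we see next with Cypher’s variable-length relationships.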
When implementing this in Neo4j, the recursive search through locations is conveniently handled by Cypher’s variable-length relationship syntax. Compared to a relational database where we would have to use something like a recursive CTE, this is much more concise. Let’s see how we use it to express the above query.
MATCH
(emilio:Person {name:"Emilio Ravenna"}),
(gabriel:Person {name: "Gabriel Medina"}),
(country:Country)
WHERE
(
(emilio)-[:LOCATED_IN*]->(country)
OR
(emilio)-[:BORN_IN]->(country)
OR
(emilio)-[:BORN_IN]->(:Place)-[:LOCATED_IN*]->(country)
OR
(emilio)<-[:FIANCE]-(:Engagement)-[:LOCATED_IN*]->(country)
OR
(emilio)<-[:SPOUSE]-(:Marriage)-[:LOCATED_IN*]->(country)
OR
(emilio)-[:AUTHORED]->(:Post)-[:LOCATED_IN*]->(country)
) AND (
(gabriel)-[:LOCATED_IN*]->(country)
OR
(gabriel)-[:BORN_IN]->(country)
OR
(gabriel)-[:BORN_IN]->(:Place)-[:LOCATED_IN*]->(country)
OR
(gabriel)<-[:FIANCE]-(:Engagement)-[:LOCATED_IN*]->(country)
OR
(gabriel)<-[:SPOUSE]-(:Marriage)-[:LOCATED_IN*]->(country)
OR
(gabriel)-[:AUTHORED]->(:Post)-[:LOCATED_IN*]->(country)
)
RETURN country.name
While fairly easy to follow, this Cypher query is still significantly more complicated than the original natural-language query. This is because we have to specify every way in which the users might be recorded as having been to a country. For an LLM in a RAG system, identifying all these ways can prove difficult, potentially leading to some countries being omitted from the results.
Engagement rates
What are the tags on the five posts made by user “Mario Santos” in the “Knowledge Graphs” group that have the highest engagement rates?
This query is simple enough to understand, and is clearly very useful for analysing social impact. A post’s engagement rate is equal to the number of people who have engaged with the post divided by the total number of viewers. But determining the number of engagements is tricky: an engagement on a post could come in the form of a share, a comment, a reaction, or a poll response. Not only that, but commenting on or reacting to an existing comment on the post would also count as an engagement, potentially in a deeply nested comment tree. In order to query this in Neo4j, we could use the following Cypher query.
MATCH
(mario:Person {name: "Mario Santos"}),
(kg_group:Group {name: "Knowledge Graphs"}),
(mario)-[:AUTHORED]->(post:Post)-[:POSTED_TO]->(kg_group)
WITH post
MATCH
(engager:Profile)
WHERE (
(post)<-[:SHARE_OF]-(:Post)<-[:AUTHORED]-(engager)
) OR (
(post)<-[:REPLY_TO*]-(:Comment)<-[:AUTHORED]-(engager)
) OR (
(post)<-[:REACTED_TO]-(engager)
) OR (
(post)<-[:REPLY_TO*]-(:Comment)<-[:REACTED_TO]-(engager)
) OR (
(post:PollPost)<-[:RESPONDED_TO]-(engager)
)
WITH post, count(DISTINCT engager) AS engager_count
MATCH
(viewer:Profile)-[:VIEWED]->(post)
WITH post, engager_count, count(viewer) AS viewer_count
RETURN collect(post.tag) AS tags, toFloat(engager_count) / viewer_count AS engagement_rate
ORDER BY engagement_rate DESC
LIMIT 5
The five ways a user could interact with a post are captured by the disjunction in the WHERE clause of the query:
- Sharing the post in a new post.
- Commenting on the post, at any depth in the comment tree.
- Reacting to the post.
- Reacting to a comment on the post, at any depth in the comment tree.
- If the post is a poll post, responding to the poll.
We can use the same variable-length relationship syntax as previously to perform the depth searching of comment trees. But again, the Cypher query is far from the original natural-language query. In a RAG system, generating this query would prove difficult for an LLM. It might not be able to accurately identify every way in which a post could be engaged with. Additionally, the recursive cases here are more complex than in the previous query, and require particular insight of a kind that LLMs don’t generally have. To further complicate the task, we have irrelevant nodes of the Engagement type in our graph, which the LLM must ignore. As we saw in the previous example, these represent engagements between fiancés, and have nothing to do with this query. If any element of the generated query is incorrect, the LLM could very easily report inaccurate engagement rates.
Possible friends
What are the names of the five users that share the most social connections with “Pablo Lamponne”, and that also have overlapping education or employment with him?
This query is designed to find people that Pablo likely knows but hasn’t added as friends. Again, this query is straightforward at face value but complex to execute against the graph. Let’s see how we implement it in Cypher.
MATCH
(pablo:Person {name: "Pablo Lamponne"}),
(possible_friend:Person),
(pablo)-[event_1]-(organisation:Organisation),
(possible_friend)-[event_2]-(organisation)
WHERE
(
(
"EducationalInstitute" IN labels(organisation)
AND
type(event_1) = "ATTENDED"
AND
type(event_2) = "ATTENDED"
) OR (
type(event_1) = "EMPLOYED"
AND
type(event_2) = "EMPLOYED"
)
) AND (
(
event_1.start <= event_2.start
AND
event_2.start < event_1.end
) OR (
event_2.start <= event_1.start
AND
event_1.start < event_2.end
)
)
WITH pablo, possible_friend
MATCH
(mutual:Person)
WHERE
(
(pablo)-[:FRIEND_OF|PARTNER_OF|RELATIVE_OF|PARENT_OF|SIBLING_OF]-(mutual)
OR
(pablo)<-[:FIANCE]-(:Engagement)-[:FIANCE]->(mutual)
OR
(pablo)<-[:SPOUSE]-(:Marriage)-[:SPOUSE]->(mutual)
) AND (
(possible_friend)-[:FRIEND_OF|PARTNER_OF|RELATIVE_OF|PARENT_OF|SIBLING_OF]-(mutual)
OR
(possible_friend)<-[:FIANCE]-(:Engagement)-[:FIANCE]->(mutual)
OR
(possible_friend)<-[:SPOUSE]-(:Marriage)-[:SPOUSE]->(mutual)
)
RETURN possible_friend.name, count(mutual) AS mutuals
ORDER BY mutuals DESC
LIMIT 5
There are two sources of complexity in this query: the identification of overlapping events, and the range of social connections that users can have. The overlap of two events is tricky to express, while searching the range of social connections presents a similar problem to the previous two queries.
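The overlap condition itself is the standard half-open interval test: two events overlap when each starts before the other ends. With events as (start, end) pairs, the overlap logic used in this query looks like this in Python:

```python
def have_overlap(event_1, event_2):
    """True if two (start, end) intervals share any span of time."""
    start_1, end_1 = event_1
    start_2, end_2 = event_2
    # Each interval must start before the other one ends (half-open).
    return (start_1 <= start_2 < end_1) or (start_2 <= start_1 < end_2)
```

Touching endpoints do not count: under this test, an employment ending in 2014 does not overlap one starting in 2014.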
There is a further nuance: why is it that friendships, partnerships, family relations, parentships, and siblingships are modelled as relationships, but engagements and marriages are modelled as nodes? If we examine the first query again, we see that we need to record the locations of engagements and marriages. Because Neo4j does not support nested relationships, we must reify them into nodes so that they can act as endpoints for the LOCATED_IN relationships.
The Cypher implementation of this query also differs structurally from the natural-language query: in the original version, the overlapping events form the end of the sentence, whereas the structure of Cypher requires that the events are constrained before we consider the number of social relations.
All of these factors contribute to making this query difficult for an LLM to implement. Compared to the original query, the Cypher implementation has a very imperative structure. Without knowledge of the data model’s abstractions and true rational insight, there isn’t sufficient context to correctly build this query, and there is a high chance that the LLM will generate a faulty query leading to incorrect results.
The functional solution
TypeDB is a modern database with a powerful and intuitive query language, TypeQL, based on award-winning research in type theory. One of the guiding design principles is that TypeQL should read as closely to natural language as possible, while still providing the safety guarantees and high-level expressivity of a strongly typed programming language. Here, we see how we can go about implementing the above queries in TypeQL.
What are the names of countries that both “Emilio Ravenna” and “Gabriel Medina” have been to?
match
$emilio isa person, has name "Emilio Ravenna";
$gabriel isa person, has name "Gabriel Medina";
$country isa country, has name $name;
has-been-to($emilio, $country) == true;
has-been-to($gabriel, $country) == true;
fetch
$name;
What are the tags on the five posts made by user “Mario Santos” in the “Knowledge Graphs” group that have the highest engagement rates?
match
$mario isa person, has name "Mario Santos";
$kg-group isa group, has name "Knowledge Graphs";
(post: $post, author: $mario, page: $kg-group) isa posting;
$engagement = engagement-rate($post);
fetch
$post: tag;
$engagement;
sort $engagement desc;
limit 5;
What are the names of the five users that share the most social connections with “Pablo Lamponne”, and that also have overlapping education or employment with him?
match
$pablo isa person, has name "Pablo Lamponne";
$possible-friend isa person, has name $name;
$mutuals = mutual-relation-count($pablo, $possible-friend);
{ have-education-overlap($pablo, $possible-friend) == true; }
or
{ have-employment-overlap($pablo, $possible-friend) == true; };
fetch
$name;
$mutuals;
sort $mutuals desc;
limit 5;
Unlike the Cypher queries, the TypeQL queries need no explanation: even without knowledge of the query language, it is easy to tell what the intent of the queries is. This makes TypeQL queries easy to write and maintain, but more importantly, it means they are very close to the original prompts, and will be generated by the LLM with greater accuracy.
Understanding functions
The ability for these queries to be so simple comes from one of TypeQL’s newest features introduced with TypeDB 3.0: functions. In the above queries, we directly make use of five functions:
has-been-to($person: person, $place: place) -> bool
engagement-rate($content: content) -> double
mutual-relation-count($person-1: person, $person-2: person) -> int
have-education-overlap($person-1: person, $person-2: person) -> bool
have-employment-overlap($person-1: person, $person-2: person) -> bool
The full definitions of these functions are given below, along with any nested functions they call.
Functions for “common countries” query
fun locations-of($object) -> { place }:
match
{ (place: $place, located: $object) isa location; }
or
{
(place: $place, located: $child-place) isa location;
$child-place in locations-of($object);
};
return { $place };
fun has-been-to($person: person, $place: place) -> bool:
match
{ $place in locations-of($person); }
or
{
$event links ($person);
{ $event isa birth; }
or
{ $event isa engagement; }
or
{ $event isa marriage; };
$place in locations-of($event);
} or {
(post: $post, author: $person) isa posting;
$place in locations-of($post);
};
return check;
Functions for “engagement rates” query
fun engagements($content: content) -> { content-engagement }:
match
{ $engagement isa content-engagement, links (content: $content); }
or
{
(parent: $content, post: $post) isa posting;
$engagement in engagements($post);
} or {
(parent: $content, comment: $comment) isa commenting;
$engagement in engagements($comment);
};
return { $engagement };
fun engagers($content: content) -> { profile }:
match
$engagement in engagements($content);
$engagement links (author: $profile);
return { $profile };
fun engager-count($content: content) -> int:
match
$engager in engagers($content);
return count($engager);
fun viewer-count($content: content) -> int:
match
(viewer: $viewer, viewed: $content) isa viewing;
return count($viewer);
fun engagement-rate($content: content) -> double:
match
$engager-count = engager-count($content);
$viewer-count = viewer-count($content);
$engagement-rate = $engager-count / $viewer-count;
return $engagement-rate;
Functions for “possible friends” query
fun mutual-relations($person-1: person, $person-2: person) -> { person }:
match
($person-1, $mutual) isa social-relation;
($person-2, $mutual) isa social-relation;
return { $mutual };
fun mutual-relation-count($person-1: person, $person-2: person) -> int:
match
$mutual in mutual-relations($person-1, $person-2);
return count($mutual);
fun have-overlap($event-1, $event-2) -> bool:
match
$event-1 has start-date $start-1, has end-date $end-1;
$event-2 has start-date $start-2, has end-date $end-2;
{
$start-1 <= $start-2;
$start-2 < $end-1;
} or {
$start-2 <= $start-1;
$start-1 < $end-2;
};
return check;
fun have-education-overlap($person-1: person, $person-2: person) -> bool:
match
$education-1 isa education, links (institute: $institute, attendee: $person-1);
$education-2 isa education, links (institute: $institute, attendee: $person-2);
have-overlap($education-1, $education-2) == true;
return check;
fun have-employment-overlap($person-1: person, $person-2: person) -> bool:
match
$employment-1 isa employment, links (employer: $employer, employee: $person-1);
$employment-2 isa employment, links (employer: $employer, employee: $person-2);
have-overlap($employment-1, $employment-2) == true;
return check;
The complexity in the queries is properly contained within the logic of the functions, which provide abstractions for the high-level constraints that we place on the data to be retrieved. If these functions are predefined in the schema and indexed by the LLM, they will be accessible for RAG tasks. As queries that use the functions are closer to their natural language representations, query generation that leverages them will have a higher success rate.
To achieve this, we define the functions that the RAG system will need in advance, but we can also have the system dynamically generate functions on the go. TypeQL queries can use functions defined at both the schema level and the individual query level. Using query-level functions is ideal for modular RAG architectures, which permit the LLM to recursively break the prompt down into a series of sequential or parallel tasks.
Using polymorphism in functions
TypeQL’s polymorphic properties also lend themselves to writing more reusable, modular functions. Let’s take a look at an example. The engagement-rate function calls a function engagements that returns a stream of content-engagement objects.
fun engagements($content: content) -> { content-engagement }:
match
{ $engagement isa content-engagement, links (content: $content); }
or
{
(parent: $content, post: $post) isa posting;
$engagement in engagements($post);
} or {
(parent: $content, comment: $comment) isa commenting;
$engagement in engagements($comment);
};
return { $engagement };
This function makes use of inheritance polymorphism, which allows it to take multiple different types of content as input, and return multiple different engagement types as output:
- The input variable $content is of type content, which can be a page, a post, or a comment. These also have further subtypes! The concrete subtypes of page are person, organisation (comprising company, charity, school, college, and university), and group, and the subtypes of post are text-post, share-post, image-post, video-post, live-video-post, and poll-post.
- The output is a stream of $engagement variables of type content-engagement, which can be any of posting, sharing, commenting, reaction, or response.
It is also a recursive function, which allows it to search through comment trees. We can also use polymorphic functions that don’t require specific types to operate on, for example the have-overlap function.
fun have-overlap($event-1, $event-2) -> bool:
match
$event-1 has start-date $start-1, has end-date $end-1;
$event-2 has start-date $start-2, has end-date $end-2;
{
$start-1 <= $start-2;
$start-2 < $end-1;
} or {
$start-2 <= $start-1;
$start-1 < $end-2;
};
return check;
The arguments $event-1 and $event-2 don’t have specific types defined, so TypeDB will infer them from the context of the function: as long as a data instance has start dates and end dates, it is a valid argument. This will check if any two event-like data instances have an overlap, regardless of type!
Comparing functions in Neo4j
But Cypher has functions too; couldn’t we do the same there? We certainly could, but unlike TypeQL’s first-class functions, functions in Neo4j have a number of drawbacks:
- They must be written in Java rather than natively in Cypher.
- New functions cannot be used without restarting the database.
- Functions are imperative and are not optimized by the query planner.
- Memory and dependencies on other functions must be manually managed.
Compared to native TypeQL functions, Neo4j’s imported Java functions are not well-suited for RAG applications. The architectural overhead required to correctly write and implement them makes generating functions a harder task for LLMs. Additionally, because new functions cannot be deployed without restarting the database, they cannot be defined at the query level (as TypeQL functions can), meaning that modular RAG solutions can’t generate and leverage new functions on the go. Rather than being able to break down the problem into modular and reusable steps, a RAG system built over Neo4j must contain the entire solution within the space of a single query.
The future of knowledge graphs
It is clear that the future of human-machine interactions will be dominated by LLMs, acting as translators between natural languages and programming languages. Indeed, this is a practical requirement for us to build artificial general intelligences (AGIs). Facilitating the translation between the high-level concepts of natural language and the low-level formats in which we encode information requires a new kind of knowledge graph.
Graph databases are powerful tools, but their query languages still operate in terms of low-level encodings: nodes and edges, only fractionally more conceptual than directly using semantic triples. In order to better enable AIs to understand our requests, we need them to be able to process ideas at a higher level. This forms the basis for the functional database programming model, implemented by TypeDB. With TypeQL’s higher-level declarative functions, with built-in polymorphism, we can build RAG systems that are smarter and more capable than basic graph-based ones, and gradually push the boundaries of LLMs towards AGI.