TypeDB Fundamentals
TypeDB's Data Model: a Comparison
The key components of the PERA data model underlying TypeDB are derived from a small set of fundamental modeling primitives. In this article, we describe how these primitives are not exclusive to the PERA model but apply to many other data models. This reveals a close kinship between existing models and the PERA model, with a welcome practical benefit: migration to TypeDB from other databases often becomes a simple and straightforward task.
In our discussion, we will focus on three prominent models: the relational model, the graph data model, and the document data model. In each case, we first inspect the concepts and dependencies that govern data structures in the model. We then illustrate with examples how these structures are commonly used, and how they can be translated into TypeDB.
Relational data model
The key organizing concepts
In the well-known SQL-like approach to the relational model, data is organized into tables with columns, which represent the named concepts of the model (e.g. a table labeled Person with a column labeled Name comprises two named concepts: that of persons and that of names). Instances of a given table concept are called rows. Instances of a given column concept are either values of some literal value type (like strings or integers), in which case we speak of an attribute column, or values that represent references to (keys of) rows in another table, in which case we speak of a foreign key column.
Concept dependency
Having defined the concepts of the relational data model, what about concept dependencies? Certainly, creating an instance in a column (whether this is an attribute or foreign key column) must refer to a row for the instance to live in, and so columns will depend on their respective tables as concepts. Moreover, in the case of foreign key columns, each instance of the column must be defined with an explicit reference to a row in another table. Thus foreign key columns depend both on their own table and the foreign table they reference. Let’s illustrate this with a simple example!
A simple relational database
Consider an SQL schema comprising a single table as follows:
CREATE TABLE employee (
    name text,
    team text
);
In this case, we have defined three concepts in total: the concept of employees (the table employee), that of names (the column name), and that of teams (the column team).
We can now create respective instances for each of these concepts. Creating a new employee instance will add a new row to the employee table (let us refer to this instance as $a). Of course, without also creating a name and a team of that employee, the row would at first be “empty”, yielding a table as follows:
name | team
-----+------
NULL | NULL <-- new row representing employee $a
Next, creating a name instance requires us to first choose an employee, and then add a literal value in the corresponding row and column. This similarly applies to instances of team. If we do so for our employee $a constructed earlier, our table might end up looking something like this:
name | team
------+------------
Ana | Engineering <-- our fully specified employee $a
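The two steps above can be reproduced in a small runnable sketch. It uses Python's built-in sqlite3 module, with SQLite standing in for a generic SQL database; that choice, and the use of SQLite's implicit rowid to pick out the row for $a, are assumptions of this sketch rather than part of the original example.

```python
import sqlite3

# In-memory SQLite database; the table matches the single-table schema above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name text, team text)")

# Creating an employee instance adds an "empty" row (all columns NULL).
cur = conn.execute("INSERT INTO employee DEFAULT VALUES")
row_a = cur.lastrowid  # SQLite's implicit rowid identifies our employee $a
empty_row = conn.execute("SELECT name, team FROM employee").fetchall()

# Creating name and team instances means filling in values for a chosen row.
conn.execute(
    "UPDATE employee SET name = ?, team = ? WHERE rowid = ?",
    ("Ana", "Engineering", row_a),
)
filled_row = conn.execute("SELECT name, team FROM employee").fetchall()

print(empty_row)   # [(None, None)]
print(filled_row)  # [('Ana', 'Engineering')]
```

Note that the name and team values cannot be written without first choosing a row, which is exactly the dependency of columns on their table.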
Introducing foreign keys
The example is of course overly simplistic and not very useful as is. Let’s improve it a little bit: for instance, the fact that we are keeping track of teams using literal string values is likely not a good idea, as it duplicates work and invites spelling mistakes. Consider the following slightly extended (but certainly still simplistic) schema.
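-- The referenced table is created first, with team_ID as its key.
CREATE TABLE team (
    team_ID integer PRIMARY KEY,
    team_name text
);
-- team_membership is a foreign key column pointing at rows of team.
CREATE TABLE employee (
    name text,
    team_membership integer REFERENCES team(team_ID)
);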
In contrast to our first schema, we have now created an independent concept of teams, a new concept of team_name dependent on teams, and a new concept of team_membership dependent on both employees and teams. Thus, in order to create an instance of the concept “team membership of $a in $t”, we would have to make reference to an instance in the team table (represented, in our schema, by its unique team_ID key, which we do not count as a concept since it only serves to identify rows in the team table). A database conforming to the schema could, for instance, look as follows.
EMPLOYEE
name | team_membership
-----+----------------
Ana  | 1
Bob  | 2

TEAM
team_ID | team_name
--------+------------
1       | Engineering
2       | Marketing
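The dependency of the foreign key column on both tables can be observed directly. The following sketch again uses Python's sqlite3 module as a stand-in SQL engine (an assumption of this sketch), with team_ID declared as the primary key of the team table. The employee "Cat" and the team id 3 are hypothetical values, introduced only to show a reference to a team that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE team (team_ID integer PRIMARY KEY, team_name text)")
conn.execute(
    "CREATE TABLE employee ("
    " name text,"
    " team_membership integer REFERENCES team(team_ID))"
)

conn.executemany("INSERT INTO team VALUES (?, ?)", [(1, "Engineering"), (2, "Marketing")])
conn.executemany("INSERT INTO employee VALUES (?, ?)", [("Ana", 1), ("Bob", 2)])

# A team_membership instance must point at an existing row of the team table.
try:
    conn.execute("INSERT INTO employee VALUES (?, ?)", ("Cat", 3))
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

# Following the foreign key recovers which team each employee belongs to.
members = conn.execute(
    "SELECT e.name, t.team_name FROM employee e"
    " JOIN team t ON e.team_membership = t.team_ID ORDER BY e.name"
).fetchall()

print(dangling_rejected)  # True
print(members)            # [('Ana', 'Engineering'), ('Bob', 'Marketing')]
```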
Migrating relational data to TypeDB
Having explained all parts of our relational schema in terms of concepts and concept dependencies, we can translate the schema into the PERA model in a straightforward manner. In TypeQL, the schema corresponding to the preceding example could be represented as follows.
define
employee sub entity,
    owns name,
    plays team_membership:member;
team sub entity,
    owns team_name,
    plays team_membership:team;
team_membership sub relation,
    relates member,
    relates team;
name sub attribute, value string;
team_name sub attribute, value string;
While slightly more verbose, the PERA schema gives explicit insight into the dependencies of all concepts at work. This can be beneficial in many ways: for example, we can now easily model an employee being a member of multiple teams (simply by creating appropriate instances of the team_membership concept). In contrast, in the relational model, this would require a schema modification: turning our foreign key into an associative entity (a.k.a. join table).
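To make the data side of such a migration concrete, here is a minimal sketch that turns rows of the two relational tables into the text of a TypeQL insert query. The helpers typeql_string and relational_to_typeql are hypothetical names, the escaping rule for string literals is an assumption, and the generated text follows the insert syntax used later in this article; it has not been validated against a TypeDB server.

```python
def typeql_string(value: str) -> str:
    """Quote a Python string as a double-quoted literal (assumed escaping rules)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def relational_to_typeql(teams, employees):
    """Build one insert query from team rows (team_ID, team_name)
    and employee rows (name, team_membership)."""
    lines = ["insert"]
    for team_id, team_name in teams:
        # team_ID is not migrated as data; it only names the query variable.
        lines.append(f"$t{team_id} isa team, has team_name {typeql_string(team_name)};")
    for i, (name, team_id) in enumerate(employees):
        lines.append(f"$e{i} isa employee, has name {typeql_string(name)};")
        if team_id is not None:  # a NULL foreign key means no membership
            lines.append(f"(member: $e{i}, team: $t{team_id}) isa team_membership;")
    return "\n".join(lines)


query = relational_to_typeql(
    teams=[(1, "Engineering"), (2, "Marketing")],
    employees=[("Ana", 1), ("Bob", 2)],
)
print(query)
```

Each foreign key value becomes one team_membership instance, so recording a second team for an employee would only add one more relation line to the query, with no change to the schema.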
Associative entities are not entities
There is one small caveat to the discussion above: the usual modeling ambiguity of concept vs. dependencies still applies, i.e., we may need to decide whether some dependencies should rather be concepts or, conversely, whether certain concepts should rather be dependencies. In the particular case of the relational model, this gives two alternatives when conceptualizing foreign key columns.
- First, as described above, we might choose to think of tables as entities, and of foreign key columns as corresponding relation concepts. For example, the column team_membership in the table of employees, which records the team they work in, would become its own relation concept, relating employees and teams.
- However, in some situations we might prefer to “de-conceptualize”, and turn concepts into dependencies. For example, if our schema contains a join table of team_memberships with columns member and team, then, rather than thinking of the table as an entity concept and its columns as relation concepts, it would likely be preferable to deconceptualize: this means turning member and team back into “mere” role dependencies of a relation concept team_membership.
To summarize, as in the PERA model itself, we often have several options to design our database schema for a given data domain.
Graph data model
The key organizing concepts
In the modern incarnation of the property graph data model, data is organized using labels and property keys as the named concepts of the model, though neither needs to be specified in a schema ahead of time when working with schemaless graph databases. Labels fall into two categories: node labels (e.g. Person) and edge labels (e.g. KNOWS), and the respective instances of these are, unsurprisingly, labeled nodes and edges. In contrast, data instances of property keys (e.g. name) are literal values (such as the string "Ana").
Concept dependency
The dependency of concepts in the graph model is relatively simple to see. Defining an instance of an edge label, i.e. creating a new labeled edge, requires making reference to two (possibly identical) labeled nodes, which describe the start and end points of that edge. Thus, edge labels conceptually depend on the nodes they connect. Similarly, in order to define an instance of a property key, i.e. create a new key-value pair, we first need to choose a node or edge to which we assign the pair. Thus, property keys conceptually depend on nodes and edges in order to be definable and interpretable. Let’s illustrate this with an example!
A simple graph database
Consider the following simple graph database (note that we are using Cypher-like syntax here):
CREATE (a:Employee {name:'Ana'})
CREATE (b:Employee {name:'Bob'})
CREATE (e:Team {team_name:'Engineering'})
CREATE (m:Team {team_name:'Marketing'})
CREATE (a)-[:MEMBER_OF {role:'director'}]->(e)
CREATE (b)-[:MEMBER_OF {role:'manager'}]->(m)
This graph database thus comprises the following concepts:
- A (node label) concept Employee.
- A (node label) concept Team.
- An (edge label) concept MEMBER_OF of membership of employees in teams (even though in a schemaless graph database this exact dependency is hard to enforce: a priori, nothing prevents us from adding CREATE (a)-[:MEMBER_OF]->(b) to the above, which would create a membership edge between two employees).
- Three property key concepts: name, team_name, and role information for employees who are members of teams.
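The dependencies listed above can be mimicked with a toy in-memory property graph. The classes Node, Edge and PropertyGraph below are hypothetical and written only for illustration; they are not the API of any graph database. The sketch enforces the one dependency the model itself guarantees (an edge needs two existing nodes) and, being schemaless, nothing more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    start: Node  # an edge cannot exist without its two endpoint nodes
    end: Node
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A toy schemaless property graph (illustration only)."""

    def __init__(self):
        self.nodes = []
        self.edges = []

    def create_node(self, label, **properties):
        node = Node(label, properties)
        self.nodes.append(node)
        return node

    def create_edge(self, start, label, end, **properties):
        # The only dependency enforced: both endpoints must already be nodes
        # of this graph. Node labels are not checked, since there is no schema.
        known = lambda node: any(n is node for n in self.nodes)
        if not (known(start) and known(end)):
            raise ValueError("edge endpoints must be existing nodes")
        edge = Edge(label, start, end, properties)
        self.edges.append(edge)
        return edge


g = PropertyGraph()
a = g.create_node("Employee", name="Ana")
b = g.create_node("Employee", name="Bob")
e = g.create_node("Team", team_name="Engineering")
m = g.create_node("Team", team_name="Marketing")
g.create_edge(a, "MEMBER_OF", e, role="director")
g.create_edge(b, "MEMBER_OF", m, role="manager")

# Nothing stops a membership edge between two employees:
odd = g.create_edge(a, "MEMBER_OF", b)
print(odd.start.label, odd.label, odd.end.label)  # Employee MEMBER_OF Employee
```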
Understanding the limits of the graph model
This design for our graph database makes it easy to record, say, when an employee is a member of two or more teams. For example, we could add to the above that
CREATE (a)-[:MEMBER_OF {role:'advisor'}]->(m)
stating that the employee Ana is not only a director in the Engineering team, but also an advisor to the Marketing team.
Despite this flexibility, we also quickly reach the boundaries of the graph model: one fundamental modeling obstruction arises from the fact that in the graph model we have a simple “two level” model of concept dependencies: edges (“level 1”) depend on the nodes (“level 0”) that they attach to. Thus, if we had another concept that depended on a level 1 concept, there would be no way to directly capture this in the graph model since we can only represent dependencies of depth 1—this is also known as the issue of reification.
For instance, in our above example, consider turning the property key role into its own node concept (in other words, we normalize the role property key). In order to represent that roles are taken up by members in teams, we would now have to allow edges between role nodes and MEMBER_OF edges. Since the graph model does not feature these “level 2” edges, this is impossible to represent directly, and we must instead resort to other “hacks” of our schema (e.g., splitting edges into two and thereby creating a new node).
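The edge-splitting workaround mentioned above can be sketched as follows. The helper reify_membership and the labels Membership, Role, MEMBER, TEAM and HAS_ROLE are made up for this illustration; graphs are represented as plain dictionaries to keep the sketch self-contained.

```python
# Before: a single edge carrying the role as a property.
edge = {
    "label": "MEMBER_OF",
    "start": "ana",
    "end": "engineering",
    "properties": {"role": "director"},
}


def reify_membership(edge, node_id):
    """Replace one MEMBER_OF edge by a Membership node plus three edges.

    The former edge becomes a node, so that a Role node can be attached to it
    with an ordinary edge instead of an (unsupported) edge-to-edge link.
    """
    title = edge["properties"]["role"]
    membership = {"id": node_id, "label": "Membership"}
    role = {"id": f"role:{title}", "label": "Role", "properties": {"title": title}}
    new_edges = [
        {"label": "MEMBER", "start": node_id, "end": edge["start"]},
        {"label": "TEAM", "start": node_id, "end": edge["end"]},
        {"label": "HAS_ROLE", "start": node_id, "end": role["id"]},
    ]
    return [membership, role], new_edges


new_nodes, new_edges = reify_membership(edge, "membership:1")
for new_edge in new_edges:
    print(new_edge["start"], new_edge["label"], new_edge["end"])
```

The price of the workaround is visible in the result: the membership is no longer an edge at all, so every query that used to traverse MEMBER_OF has to be rewritten to pass through the new node.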
Migrating graph data to TypeDB
As in the case of migrating relational data, the migration from graph data is straightforward: each concept converts to a type in the corresponding PERA schema with concept dependencies as described above. The resulting PERA schema is similar to that in the relational case, up to one small addition:
define
employee ... # as before
team ... # as before
team_membership... # as before
name ... # as before
team_name ... # as before
role sub attribute, value string;
team_membership owns role;
We can now go ahead and create data instances reproducing exactly our earlier graph database as follows:
insert
$a isa employee, has name 'Ana';
$b isa employee, has name 'Bob';
$e isa team, has team_name 'Engineering';
$m isa team, has team_name 'Marketing';
($a, $e) isa team_membership, has role 'director';
($b, $m) isa team_membership, has role 'manager';
(Note, in the last two lines we omit the interface types of the relation as this can be inferred by the type inference engine of TypeDB!)
Property normalization in TypeQL
So how does our earlier problem of normalizing the role property key play out in the PERA model? Well, it is not a problem at all. We can simply have an independent concept of “roles”, obtaining a schema along the following lines:
define
... # employee, team, name, team_name as before
role sub entity,
    owns title,
    plays team_member_role:role;
team_membership sub relation,
    relates member,
    relates team,
    plays team_member_role:membership;
team_member_role sub relation,
    relates membership,
    relates role;
title sub attribute, value string;
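As a rough illustration of what data for this normalized schema could look like, the sketch below generates the text of an insert query from (employee, team, role title) triples. The function normalized_membership_query and the variable naming scheme are hypothetical, values are assumed to contain no double quotes, and the query text mirrors the syntax of the other insert examples in this article without having been run against a TypeDB server.

```python
def normalized_membership_query(memberships):
    """Build an insert query from (employee, team, role title) triples."""
    lines = ["insert"]
    employees, teams, roles = {}, {}, {}

    def var(table, prefix, value, statement):
        # Create each employee, team and role once, then reuse its variable.
        if value not in table:
            table[value] = f"${prefix}{len(table)}"
            lines.append(statement.format(var=table[value], value=value))
        return table[value]

    for i, (employee, team, title) in enumerate(memberships):
        e = var(employees, "e", employee, '{var} isa employee, has name "{value}";')
        t = var(teams, "t", team, '{var} isa team, has team_name "{value}";')
        r = var(roles, "r", title, '{var} isa role, has title "{value}";')
        # The membership relation is bound to a variable so that it can
        # itself play the membership role in team_member_role.
        lines.append(f"$ms{i} (member: {e}, team: {t}) isa team_membership;")
        lines.append(f"(membership: $ms{i}, role: {r}) isa team_member_role;")
    return "\n".join(lines)


query = normalized_membership_query([
    ("Ana", "Engineering", "director"),
    ("Bob", "Marketing", "manager"),
    ("Ana", "Marketing", "advisor"),
])
print(query)
```

The point to notice is the last line generated for each triple: a team_membership instance appears as a role player of another relation, which is the “level 2” dependency that the graph model could not express directly.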
Document data model
The key organizing concepts
The document model is, in essence, a generalization of the relational model (tables correspond to collections, and rows to documents) but drops any requirements on normalization of data, meaning data can be nested, and generally has much looser requirements on referential integrity. There are two kinds of named concepts in the document model: collections of documents (e.g. a collection of Person documents) and dictionary keys in documents (e.g. a Name key). Instances of the former are documents in a given collection, while instances of the latter are “subdocuments”, which could comprise a simple literal value or further nested dictionaries and lists.
Concept dependency
Discussing concept dependency in the context of the document model is somewhat more subtle due to the frequent occurrence of duplicate data in several places of the database (without a unified reference mechanism between these semantically related occurrences of data). However, analogous to the relational case, the following observations can be made. Documents in collections (just like rows in tables) can, a priori, be created without reference to instances of any other concept and are thus independent concepts.
In contrast, dictionary keys represent concepts that depend on their immediate parent key (if applicable) or their parent document (otherwise). Additionally, if the key’s subdocument makes reference to documents in other collections then this creates further dependencies of the key on the referenced collections. If a key only holds a literal value in its subdocument, then we speak of an attribute key, and otherwise of a subdocument key.
Let’s see an example of how this works in practice!
A simple document database
We consider a document database with two collections: Employees and Teams. A document in the Employees collection could look like this (note that we indicate document ids with a $ prefix):
{
    _id: $a,
    name: "Ana",
    address: {
        street: "Ant Street",
        city: "Austin"
    }
}
… or like this:
{
    _id: $b,
    name: "Bob",
    address: {
        street: "Bee Street",
        city: "Boston"
    }
}
In order to extract the named concepts from this document collection, note that the given documents contain four distinct keys: name and address, both of which depend on the employee document they live in, as well as street and city, both of which depend on their parent address instance. Note that we ignore _id as a key here, as it merely serves to identify document instances.
Next, we consider documents in the Teams collection, which could be of the following form:
{
    _id: $e,
    team_name: "Engineering",
    team_member_list: [
        $a
    ]
},
{
    _id: $m,
    team_name: "Marketing",
    team_member_list: [
        $a,
        $b
    ]
}
The documents thus exhibit two new concepts: that of team_name and that of team_member_list. The former depends on the team, represented by the parent document, while the latter depends both on the team and on a (variadic!) collection of employees.
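The extraction of key concepts and their parents described in this section can be sketched as a short recursive walk over a document. The function key_concepts is a hypothetical helper, and documents are written as Python dictionaries with ids as plain strings.

```python
def key_concepts(document, parent):
    """Yield (key, parent) pairs for every dictionary key in a document.

    Keys nested inside a subdocument depend on their parent key; top-level
    keys depend on the document itself. `_id` is skipped, as in the text.
    """
    for key, value in document.items():
        if key == "_id":
            continue
        yield key, parent
        if isinstance(value, dict):
            yield from key_concepts(value, parent=key)


employee_doc = {
    "_id": "$a",
    "name": "Ana",
    "address": {"street": "Ant Street", "city": "Austin"},
}
team_doc = {"_id": "$m", "team_name": "Marketing", "team_member_list": ["$a", "$b"]}

employee_keys = list(key_concepts(employee_doc, parent="employee"))
team_keys = list(key_concepts(team_doc, parent="team"))

print(employee_keys)
# [('name', 'employee'), ('address', 'employee'), ('street', 'address'), ('city', 'address')]
print(team_keys)
# [('team_name', 'team'), ('team_member_list', 'team')]
```

The walk only recovers structural dependencies. The further dependency of team_member_list on employees is not detected, because nothing in the document itself marks the list entries as references to another collection; that knowledge has to come from whoever performs the migration.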
Migrating document data to TypeDB
Just like in the case of relational and graph data, the hard part of migrating document data into the PERA model was to understand the concepts and their dependencies that are implicitly used to organize the data in the model. Having analyzed the concepts of our document database example, we obtain the following PERA schema:
define
employee sub entity,
    owns name,
    plays address:owner,
    plays team_member_list:member;
address sub relation,
    relates owner,
    owns street,
    owns city;
name sub attribute, value string;
street sub attribute, value string;
city sub attribute, value string;
team sub entity,
    owns team_name,
    plays team_member_list:team;
team_member_list sub relation,
    relates team,
    relates member;
team_name sub attribute, value string;
We remark that the above schema is the result of our general translation procedure, and that, for practical purposes, the result can likely be improved in various ways (this also highlights certain shortcomings of the original document database). For example, we may want to record role information for team members individually, which would require a schema more similar to that in our graph database example. Or, we may want to record addresses by attributes of structured value type (a natural feature of the PERA model, and one that is also firmly on the roadmap for TypeDB). Those beauty spots aside, the above schema is a great starting point for our migration!
Having defined the schema implicit in our document database, let us now create data as follows:
insert
$a isa employee,
    has name "Ana";
$addr_a (owner: $a) isa address,
    has street "Ant Street",
    has city "Austin";
$b isa employee,
    has name "Bob";
$addr_b (owner: $b) isa address,
    has street "Bee Street",
    has city "Boston";
$e isa team,
    has team_name "Engineering";
(team: $e, member: $a) isa team_member_list;
$m isa team,
    has team_name "Marketing";
(team: $m, member: $a, member: $b) isa team_member_list;
Easy! We should add, though, that our remark about “deconceptualization” in the relational case applies here as well. This means that, just like “join tables”, some collections will in fact be better understood as dependent concepts, but this design choice will have to be made on a case-by-case basis.
Adding polymorphism to the picture
In the previous sections we’ve seen detailed deconstructions of the relational, graph, and document models into their fundamental concepts and concept dependencies. Then, exploiting the fact that the PERA model builds directly on a conceptual model, we were able to easily migrate our data. Our insights from each of these migrations are summarized in the diagrams below: here, for each model, we illustrate its respective terminology for concepts (depicted using boxes) as well as how these concepts may depend on each other (depicted by arrows).
The graphic illustrates the following important point. Across the relational, graph, and document models, we find a successive increase in flexibility regarding “what can depend on what” (for example, in the graph model edges may have properties, which is not the case for foreign keys in the relational model). Beyond this observation, however, the diagrams for all three models differ very little in their fundamental approach to understanding data via conceptualization and concept dependency.
Summary
The PERA model not only refines existing models, capturing each part of the above diagrams in their fully general form, but it also naturally infuses the diagram with a new idea: polymorphism. In more detail, this is the idea of abstracting dependencies using interface polymorphism, and the idea of organizing concepts using inheritance polymorphism. This represents a further step in the above succession of data models, yielding a polymorphic conceptual data model (we refer the interested reader to our article on the need for polymorphism in databases for a discussion of the impact that native, zero-cost polymorphism can have on database engineering). This positions the PERA model as a powerful generalization of existing data modeling paradigms.