Type Theory as the Unifying Foundation for Modern Databases
If the term “type theory” sounds exciting to you, chances are you are either a theoretical computer scientist or a mathematician. But with several high-profile successes of the field over the last decade or two, it is becoming evident that type theory has a lot to teach us about designing the foundations of practical modern technology. We are already seeing many powerful type-theoretic ideas crop up in languages like, say, Rust (including, e.g., the affine type system that governs borrowing, or the pattern matching used to destructure algebraic data types). Transferring these fundamental ideas from mathematics to practical computer science can yield substantial benefits in terms of expressivity, interoperability, and safety of our systems and applications. Surprisingly, it can also make our systems behave more intuitively, as we will learn.
So what about database languages? How could we integrate modern type-theoretic ideas as we see it happening with programming languages? As a matter of fact, type-theoretic thinking has so far found little to no application in the field, at least when it comes to the design of database models and languages. Traditional models are built on rigid foundations that are not easily amenable to integrating type-theoretical language. This leaves much room for improvement and modernization, but it means that we must re-think databases at the foundational level:
With new type-theoretic mathematics at our disposal, how would we re-think and re-design the foundations of modern databases in the best possible way?
In this article, we want to give a first comprehensive answer to this question, which just recently started to draw more and more interest from both academia and industry. To do so, we will have to start from the very beginning, taking you on a whirlwind tour through the classical mathematical foundations for databases, modern type theory, conceptual database modeling, and a grand synthesis of all of these ideas.
The modernization of mathematics
Scientific thinking evolves over time, and when we look back a hundred years we often find that scientists thought very differently about the fundamental ingredients of any specific field of research. Of course, the same applies to mathematics and theoretical computer science: the question “What are the foundations of mathematics?” itself has seen a dramatic evolution in the past century. In this section, we start with a brief survey of this evolution, which will then also help us understand where traditional database paradigms stem from.
The original inspiration for relational databases
The first successful attempts at nailing down the formal foundations of mathematics emerged at the beginning of the 20th century in the work of Russell and Whitehead, even though there was no “precise statement of the syntax of the formalism” at that time (as Gödel pointed out). Nonetheless, this early work paved the way for what was to come: mathematicians identified sets and relations (also called predicates by mathematicians) as a starting point from which all other mathematics was to be constructed by certain axiomatic set operations (like unions or powersets). For example, functions F(x) = y (say, Date-of-Event(e) = d) were conceived as specific relations F(x,y) in which, for all x, there exists exactly one y such that F(x,y) holds true. In contrast, general relations R(x,y) (say, Marriage-of-Persons(p1,p2)) need not have this property.
This line of thinking was, most certainly, hugely influential in the inception of relational algebra by Codd in the 1970s. Codd precisely envisioned relations as predicates, and all information was to be encoded in such relations. The key algebraic operations provided by Codd, moreover, were inspired by those of predicate logic: Codd’s main theorem states precisely that relational algebra is just as expressive as domain-independent predicate logic. While expressive, the relational algebra approach is also extremely restrictive: the idea that ‘everything is a set or relation’ wasn’t satisfactory to many computer scientists, as many data structures are not natively modeled by sets and relations, and the search for other database paradigms continued. In fact, even vanilla SQL does not follow the relational paradigm exactly: in contrast to Codd’s set semantics, modern RDBMSs work with bag semantics, which allows the row (x,y) to appear multiple times in a relation R. This choice, of course, reflects the needs of practical data modeling.
But what do mathematicians think of set theory? Did the theory of sets and relations solve the foundations of mathematics once and for all? The short answer is “no”. Indeed, while mathematicians have spent considerable effort in making set-theoretical foundations practical, the reality of mathematical practice turned out to be very far from “just sets and relations”. Very similar to the concerns of many computer scientists, mathematicians found that most of the ideas they care about cannot be natively modeled by sets and relations. As a result, the search for better foundations of mathematics soon continued, and we made substantial progress on the question in the latter part of the 20th century. This success was based on a rather different line of thought than what was originally envisioned by set theorists.
Mathematics moved on to compositional systems
Type theory is about types and their interactions. In contrast to just sets, this adds a new level of descriptiveness to our thinking: for example, in type theory, functions between types become first-class citizens as they capture how types interact. Types and their functions now organize into a new kind of structure: a category (not a set)! Nowadays, almost all areas of modern mathematics work in the language of category theory. Categories provide a bird’s eye perspective on type interaction, and on how interactions compose. This has led some mathematicians to call category and type theory the “compositional foundation” of mathematics.
Types are about composability
B. Milewski, “Category Theory for Programmers”
Importantly, it is not just mere functions of types that compose in type theory. The bird’s-eye perspective taken by type theorists provided us with a clear picture of the integration of classical logic (say, predicate logic) and type theory: this is now known as the propositions-as-types paradigm. In this paradigm, classical predicates become so-called dependent types: while a predicate R(x,y) describes when a relation is true depending on the given x and y, a dependent type R(x,y) describes which terms z : R(x,y) belong to some type depending on x and y. Not only does this immediately resolve our earlier issue with “bag semantics” (we can have multiple “row” terms z1, z2, z3, ... in a “relation” R(x,y) for the same x and y), but it also allows us to compose dependencies. Indeed, while classical predicates depend on sets, dependent types depend on, well, types again! We can use this iteratively and consider types depending on types depending on types, etc. This leads to composed dependent expressions such as z : R(x,y); w : S(z). From a more “functional” perspective, this composition could also be written as w : S(z : R(x,y)). Such composed dependencies turn out to be an extremely natural and ubiquitous data structure, as we will soon learn. As a first example, consider the composition d : Date-of-Event(m : Marriage-of-Persons(p1, p2)) … but more on this soon!
Dependent types, and the propositions-as-types paradigm, are key ingredients of modern type-theoretic “compositional” mathematics. The approach has celebrated many recent successes.
To most of us, all this still probably sounds a bit abstract. As we will now learn, it is not: each of us already uses dependent types on a daily basis.
Crash course in type theory
If you have read until here, have never heard of “type theory” before, and have no idea what the previous section was going on about, do not give up just yet. Here is a crash course in type theory, not for mathematicians, but for English speakers.
What is a type?
Types appear in many different domains in computer science, ranging from conceptual data modeling to programming languages to formal proof assistants. As a starting point for our journey we consider the following broad definition of what a type is:
A type is a description of a domain that a variable can range over.
The definition is general enough to encompass interesting examples from a variety of domains. Let’s focus on examples from both programming languages and natural language. The former comes with familiar examples, such as the following.
- Consider the typing x : Integer, stating that a variable x has the type of integers Integer. This type precisely describes the domain that x can range over, namely that x can be any integral number like 1, -2, or 13.
- Similarly, the typing x : String tells us that x will range over the domain of strings, such as "abc", "John Doe", or "typedb.com".
- The typing x : Divisor-of(y) is more interesting: here, the domain that x can range over depends on a variable y : Integer. This precisely is a dependent type! If we initialize the variable to, say, y = 26, then this yields an independent domain description of x, namely that x can be among the integers 1, 2, 13, and 26.
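To make the divisor example a little more tangible, here is one way (among several) to spell it out in the same informal notation used above, restricting to positive divisors as in the example:

Divisor-of(y) = { x : Integer | x > 0 and x divides y },   so that   Divisor-of(26) = { 1, 2, 13, 26 }.

The point is simply that the domain described for x genuinely varies with the value chosen for y.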
Our definition of what a type is also (in a more loose sense) applies to natural language. Let’s look at some examples.
- Consider the typing x : Person. This again describes the domain that x can range over, namely, x can be any object identified as a “person”. You might argue that this domain is not as “universally” described as, say, the domain of integers, but let us fix a universe and time to make it precise (where, of course, by a universe, we will mean a database!).
- Similarly, the typing x : Pet-of-Person(y) is a valid description of a domain for x, but (as in our earlier case of divisors of an integer variable) our description here contains a variable y (which we take to be of type Person). So this is a dependent type. For any given person y, x will range over the domain of objects identified as “pets of person y”.
- Just for the sake of mention, the standard example from database theory: consider the type x : Marriage-of-Persons(y,z). This, again, is a dependent type, now depending on two variables y and z. The variable x ranges over the domain of objects identified as “marriages between y and z”. We make two important observations. First, “objects” can refer to abstract concepts like “marriage”. Second, recall that types can contain more than one object: in particular, the same two persons y and z may have had multiple marriages (it happens!).
So, what should we take away from the above? Well, first of all, that the basic idea behind types and dependent types is really not that abstract. And secondly, that dependent types are, in fact, very much ubiquitous. Indeed, as we have just learned, whenever you say “John is one of my best friends”, you are secretly initializing the dependent type Best-friend-of-Person(y) with y set to yourself, and John is a term in the resulting type!
How types compose
The real power of type theory derives from the fact that term and type expressions compose. This need not be explained in complicated mathematical terms either. It can be understood by analogy with natural language.
We have just learned that, from the perspective of natural language, types are descriptions of domains for variables… and such descriptions compose. One fundamental way of composition, which we have already met in a previous example, is the chaining of descriptions. For example, let’s go back to our earlier typing d : Date-of-Event(m : Marriage-of-Persons(p1 : Person, p2 : Person)): this composes the descriptions “p1 and p2 are persons”, “m is a marriage of p1 and p2”, and “d is a (or the) date of m” into a single combined description “d is a date of the marriage m of persons p1 and p2”. Type theorists have a fancy term for this composed type: you are looking at a dependent pair type. Type theorists may write this as:
Σ[ p1 ∈ Person ] Σ[ p2 ∈ Person ] Σ[ m ∈ Marriage p1 p2 ] Σ[ d ∈ Date m ] ⊤
But, in all honesty, that looks pretty hard to read! And it need not be: getting ahead of ourselves, here is a “near-natural language” pattern in the query language TypeQL, which does the very same thing:
$p1 isa person;
$p2 isa person;
$m ($p1, $p2) isa marriage;
$m has date $d;
Reading this is a piece of cake!
Besides sequentially combining descriptions, we may also parallelize them, and this yields a second important way of composing types. For example, the type x : Cat (“x is a cat”) and the type x : Dog (“x is a dog”) can be “parallelized” in natural language by the description “x is a cat or a dog”. A type theorist would speak of a sum type and write x : Cat + Dog. Again, in terms of intuitiveness and readability, we believe that we can do much better than that. Here is a TypeQL pattern that expresses this type:
{ $x isa cat; } or { $x isa dog; };
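These two modes of composition, chaining and parallelizing, can be mixed freely. As a sketch that anticipates types defined later in this article (person, cat, dog, and the pet-ownership relation), here is a pattern describing a person together with a pet that is either a cat or a dog:

$p isa person;
($p, $pet) isa pet-ownership;
{ $pet isa cat; } or { $pet isa dog; };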
At this point, you have already understood a substantial part of the idea behind “composability of types”. The fact that most of our explanation can be expressed in natural language goes to show, well, how natural the approach taken by type theory is.
A brief remark, for those in the know. A final class of type operations involves types of functions between types. In dependent type theory, these are generalized by the notion of dependent function types, which type theorists write using the symbol Π (in a way, dependent pair types, Σ, are only “half” of dependent type theory, with the other half being dependent function types). While important for representing and reasoning with implications, these function types are “much bigger” than pair types, which makes them less suitable for efficiently describing data. However, we do absolutely see a role for dependent function types in databases, too: their elements capture the logical implication rules that can infer new data from existing data. We will get back to this later and you may, for now, forget that we have mentioned functions at all.
Polymorphism-enhanced conceptual data models
In the last section, we defined types as “descriptions” of data domains that variables can range over, and we learned how the composability of types naturally gave rise to descriptions of complex domains. We even saw first examples of our query language, TypeQL, and how its design directly mirrored the “natural language” perspective on basic type-theoretic ideas. But having identified our “type-theoretic query language” is only one part of the picture. The other is the construction of a suitable data model that governs user-defined types (like Person or Marriage) and their dependencies (like Marriage-of-Persons(p1,p2)), which we can then query for.
The categorization of language
We set out to build a generalist database that can natively express and translate between diverse data and models. There is a long-established subfield of classical database theory that has worked precisely towards the goal of “generalism” for a long time: the field of conceptual modeling and, in particular, of entity-relationship modeling. Both are vast fields of research, but here we are mainly interested in only the most fundamental structures exhibited by the pioneers in the field. Later we will see that type theory provides yet another powerful perspective on these structures.
The idea for “entity-relation-attribute” (ERA) models, which underlies most conceptual modeling approaches as well, was born out of the observation that language can be usefully divided into three categories. Up to slight simplifications, these categories may be summarized as follows.
- The first category of language comprises common nouns, such as “person”, “car”, or “marriage”. These describe independent domains (and thus types!) of objects. Specific objects, such as a specific person p, are referred to as ‘proper nouns’. A conceptual modeler would call a proper noun an entity, and speak of an entity type when referring to a well-described domain of individual entities.
- Next up, we have verbs, which connect nouns by some relationship, as captured e.g. by the usual “subject-verb-object” statements. For example, we could say that “a person marries another person”. A conceptual modeler would speak of a relation between these objects; in contrast, a relation type would be an abstract description of the domain of specific relations (e.g. the domain of all “marriages”).
- In the third category, we find qualities of nouns or verbs: in the former case, we speak of adjectives, and in the latter, of adverbs. Speaking of a “young person” is an example of the former, while “a person marries another person today” is an example of the latter. In conceptual modeling, these qualities of specific entities and relations are called attributes, whereas the abstract description of the domain of specific attributes would be an attribute type (such as the “dates of an event”).
The above three categories are summarized in the table below.
| Natural Language | Conceptual modeling |
|---|---|
| noun (a person) | entity (the person p) |
| verb (a person marries a person) | relation (the marriage m between persons p1 and p2) |
| adjective/adverb (a person marries a person today) | attribute (the date d of the event e) |
But how does this relate to our earlier discussion of type theory? Well, the above table gives a first hint by introducing “variables” as part of our example descriptions in the conceptual modeling column. Let’s see in detail how type theory comes back into play.
The type-theoretic perspective
Looking at the conceptual modeling examples in the previous table, we find a striking structural similarity to our earlier discussion of types and type dependencies. Indeed, the “conceptual distinction” into entities, relations, and attributes made in conceptual modeling can also be recovered from a purely type-theoretic point of view.
- Entity and relation types are types containing objects, meaning “abstract” concepts that themselves do not have intrinsic structure. In a database, we often represent individual objects only by “addresses” or (in database theory lingo) by object identifiers. Entity types are independent types (like “person”), while relation types are dependent types (like “marriage of persons p1 and p2”).
- Attribute types are types containing values, meaning quantifiable concepts with intrinsic structure (such as integers, whose algebraic structure allows for various forms of representation, like sequences of binary bits or decimal digits). Attribute types are traditionally considered to be dependent types (like “date of event e”).
The next table summarizes the correspondence between conceptual modeling and type theory.
| Conceptual modeling | Type theory |
|---|---|
| entity type (containing the person p) | type without dependencies, containing objects |
| relation type (containing the marriages m between persons p1 and p2) | type with dependencies, containing objects |
| attribute type (containing the dates d of events e) | type with dependencies, containing values |
While attributes are traditionally considered to be dependent on other concepts (namely the concepts that a given attribute associates a quality to), in the 2 by 2 matrix of “independent/dependent” types and types containing “objects/values”, this leaves a gap: what are the independent analogs of attributes? This turns out to encompass a natural class of concepts too!
There is no good term for these “independent attributes” in classical modeling, but the rigorous perspective of type-theoretic mathematics leads us to fill in gaps in the language. In fact, the resulting notion of “independent attributes” can be very useful in the context of a database, as it allows for the easy storage (and manipulation) of “global constants”, which can be used to directly represent data pertaining to the domain itself (like “the current date”).
The next table illustrates the resulting “completed” table of concepts:
| | Objects | Values |
|---|---|---|
| Independent | entities | global constants |
| Dependent | relations | attributes |
Let us briefly add one important remark. We are not linguists, and we also have no intention of splitting hairs. Whether a concept is independent or not depends on the perspective of the data modeler. Consider, for example, our earlier dependent type Pet-of-Person(p) capturing pet animals of a given person p. We can now identify this as a unary relation type. But other conceptions of that type are possible: you could argue that pets should be considered independent entities (though maybe in this case the type Animal would be a more accurate description of the model’s intent, since the word Pet usually implies some form of ownership by a person).
Putting type polymorphism to use
In our above examples, we have seen an attribute type “date of event e” and a relation type “marriage of persons p1 and p2”. But why is it the case that we can speak of the “date of marriages”? What’s the connection between “dated events” and “marriages” here?
The answer to this question stems from type polymorphism, a ubiquitous idea in type theory and (relatedly) in object-oriented programming. There are three main forms of polymorphism found in the wild (for the reader’s convenience, let’s switch to “object-oriented lingo”; you may replace “classes” with “types” throughout):
- The most well-known form of polymorphism is inheritance polymorphism. This lets classes inherit essential functionality from “parent” classes. This is used to model the “specialization” and “generalization” of concepts.
- Interface polymorphism lets classes implement features specified by interfaces. This is used to model specified “traits” or “capabilities” of objects, without capturing “everything” there is to these concepts. In particular, while an object will have a single parent, it may have multiple traits. And it will inherit all traits from its parent.
- In parametric polymorphism we write programs (or queries!) in a general fashion without reference to specific classes. This can be used to express statements or questions which are of “general form”, and apply to a variety of classes at the same time.
We can now answer the question we set out to solve: what is the relation of marriages and dated events, precisely? Well, the type Marriage has the trait of being an “event with an associated Date”. So, here, we employ a form of interface polymorphism!
However, all three forms of polymorphism provide powerful functionality for our data model and query language. In our fundamentally re-designed database, we want them all! Indeed, while inheritance hierarchies are an already well-established part of database design, interfaces often allow us to capture the true logic underlying our data much more directly (as we have seen in our example above). Finally, using parametric polymorphism will enable a highly powerful feature of our approach: by writing general “polymorphic” queries, we will “decouple” our queries from our database schema. This drastically reduces the query maintenance required during structural updates to our database!
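To preview how this plays out in practice, here is a small, hedged TypeQL sketch. It gets slightly ahead of the schema definitions given later in this article (which introduce marriage and date), and it adds a made-up graduation relation purely for illustration: both relation types implement the ownership interface of the date attribute, and a separate parametric match query can then range over every type that owns a date, whatever that type happens to be.

define
graduation sub relation,
relates graduate,
owns date;

match
# a parametric pattern over all "dated events"
$event-type owns date;
$event isa $event-type, has date $d;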
The synthesis
In the previous three sections, we have seen:
- How type theory powers modern “composable” mathematics.
- How types and type composition can be easily understood in natural language.
- How classical conceptual ERA modeling can be cast into type theory, where it can then be enhanced with type polymorphism.
In this final section, let us fuse the above insights into the design and construction of a novel kind of database. This will comprise two main components: the type-theoretic query language, TypeQL, and its underlying polymorphic entity-relation-attribute (PERA) data model, which TypeQL seamlessly integrates with. The result will be a truly generalist, extensible, type-theoretic foundation for modern databases.
Declarative type-theoretic querying … in TypeQL!
Let us start with our query language, TypeQL. We have already seen an example of its near-natural syntax earlier, and really there is not much more to it! TypeQL is a conceptually simple but highly expressive query language, with a composable and fully declarative syntax. Let’s start with an example, building on some of the ideas and examples that we saw earlier. Consider the following data retrieval query:
match
$p1 isa person, has name "Henry VIII";
$m (spouse: $p1, spouse: $p2) isa marriage, has date $d;
$d >= 1535-01-01;
get $p2;
Note that, except for the additional keywords match and get, which indicate data retrieval, the body of the query looks very similar to our earlier “natural language” descriptions of composed types. The purpose of the query is almost self-explanatory: it retrieves all persons p2 who have been married to Henry VIII after the beginning of the year 1535. But, in order to really connect the dots with our earlier discussion of type theory, let us go through the code of the above query’s body line by line:
- The first line ($p1 isa person, has name "Henry VIII";) really contains two statements which have been “shortened” into one. First, it states $p1 isa person which, as we’ve seen earlier, simply declares the variable $p1 to be of (entity) type person. Second, we have the statement $p1 has name "Henry VIII": this syntax is specific to attribute types, and states that "Henry VIII" is a term in the type name depending on $p1 (so, using our earlier syntax, you may think of this as assigning the value "Henry VIII" to an implicit variable of type “name of p1”).
- The second line ($m ($p1, $p2) isa marriage, has date $d;) also implicitly contains two statements. First, we declare the variable $m to be of type “marriage of persons p1 and p2”. Importantly, note how the type person of $p2 may be inferred purely from the context it is used in! We will revisit type inference in the next section. Second, we once more have a statement $m has date $d that refers to an attribute type, stating that $d is a term in the type date depending on $m.
- The final line ($d >= 1535-01-01;) imposes a condition on $d using the predicate x >= y. At first glance, this may look somewhat different from the structure of the first two lines, but really it is not: predicates (via “propositions-as-types”) naturally fit our syntax here.
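As an aside, data matching this query would be written with an insert query, whose statements compose in exactly the same way as match patterns. Here is a minimal sketch (the person and marriage types are defined in a later section; the second spouse’s name and the date are, of course, just illustrative):

insert
$p1 isa person, has name "Henry VIII";
$p2 isa person, has name "Jane Seymour";
(spouse: $p1, spouse: $p2) isa marriage, has date 1536-05-30;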
Our retrieval query is, of course, rather simplistic, but it illustrates the point of type composition in data retrieval queries. Note that we haven’t yet discussed at all where types like person or marriage are defined; this will be the content of the next section! Before that, let us briefly look at another illustrative, and slightly more complex, example. Consider the following query:
match
$ownership (pet: $pet, owner: $owner) isa pet-ownership;
# retrieve only pets that were adopted
$ownership has by-adoption true;
# retrieve only specific cats or dogs
{ $pet isa cat, has lives-left $lives;
$lives > 3; }
or
{ $pet isa dog, has age $age;
$age > 3; };
# exclude Flintstones' pets
not {
{ $owner has name "Fred Flintstone"; }
or
{ $owner has name "Wilma Flintstone"; };
};
get $pet, $owner;
While similar in style to the previous query, this example illustrates several novelties over the first one.
- Unlike our earlier relation type marriage, which relates two identical spouse interfaces, our relation pet-ownership features asymmetric interfaces for pet and owner terms. Interfaces of relation types are called role types, and in general, a relation type may depend on any number of distinct role types!
- The above query contains the keyword or, which links two pattern blocks (enclosed in curly braces {...}). As we have briefly mentioned, this should be understood as describing two cases of patterns in parallel. For our example, the first or means that we are retrieving, for $pet, either cats with more than 3 lives left or dogs older than 3 years in age.
- The second instance of an or block is enclosed by the keyword not. This indicates that the type described inside the not block is not inhabited by any terms (again, this has a type-theoretic foundation: see propositional truncation or bracket types if you are interested). For our specific query, this is easy to understand: we exclude the case of pet owners named Fred or Wilma Flintstone from our data retrieval!
This all looks nice and readable, but what if instead of “queries for user-specified cases”, we want to write queries that cover a range of cases that are selected based on some criteria? For example, how do we query for types of animals that can have specific attributes? This is where the real power of type-polymorphic queries comes into play. Consider the following example:
match
# let's consider subtypes of animal
$animal-type sub animal;
# we are interested only in types which implement
# ownership of the personality indicator type
$animal-type owns personality-indicator;
# Now we retrieve pet ownerships
(pet: $animal, owner: $owner) isa pet-ownership;
# where the variable $animal is a $animal-type
# (note '!' prevents the variable $animal from living in a
# proper subtype of $animal-type)
$animal isa! $animal-type;
# Finally, retrieve pet ownerships in which pet
# and owner have the same personality indicator
$animal has personality-indicator $i;
$owner has personality-indicator $i;
get $animal, $owner;
The crucial innovation of this query over the previous examples is the usage of the type variable $animal-type. This allows us to consider a range of animal types, based on the criterion that we want a personality-indicator attribute to be associated with that type. The query then retrieves pet ownerships in which pet and owner have the same personality indicator.
The variabilization of types follows a larger design pattern which we fully subscribe to here:
Everything has a type so everything can be a variable
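Taken to its logical conclusion, this principle lets us variabilize nearly everything in a pattern at once. As a purely illustrative sketch (making no assumptions about the schema), the following query asks which types of things own which types of attributes in our data:

match
$x isa $t;
$x has $a;
$a isa $at;
get $t, $at;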
The resulting synthesis of variabilization, (inheritance and interface) polymorphism, and conceptual modeling with type dependencies has many powerful repercussions. We do not have the space to list them comprehensively here, but now that we have a first idea of the style and expressiveness of fully declarative, type-theoretic querying, we are ready to dive yet deeper. We will now learn about the practical database schemas which underlie queries like the ones above!
A fully composable data model
TypeQL is not only a data query language (as illustrated in the previous section using match ... get queries) and a general data manipulation language, but is also a data definition language, which allows us to define schemas based on our data model. All three aspects of the language are closely related in syntax and style, and follow the same fundamental type-theoretic ideas.
So what definitions will we need to express in TypeQL? We already have learned about a small selection of key ingredients of our polymorphism-enhanced entity-relation-attribute model which we need to be able to specify, including:
- Entity types, as the types of concepts that do not depend on other concepts.
- Relation types, together with their dependencies on (potentially multiple) role types that they relate.
- Attribute types, whose terms derive from a specified value type, together with an implicit dependency on a (single) ownership type.
- Inheritance hierarchies for all of the above, which means defining types to be subtypes of other types.
- Interface implementations, which, concretely, means that types are defined to play roles and own attributes.
All of these components can be defined intuitively and concisely in TypeQL. To illustrate how, let’s look at a first example of a data definition query (which always starts with the keyword define). The following definition provides a schema suitable for running our marriage query from the previous section:
define
person sub entity,
owns name,
plays marriage:spouse;
marriage sub relation,
relates spouse,
owns date;
name sub attribute,
value string;
date sub attribute,
value datetime;
Note the three reserved “root types”: entity, relation, and attribute. A new type in one of the respective categories is introduced simply by subtyping the corresponding root type. Note that the role dependencies of relation types are indicated using the keyword relates. Conversely, if another type implements such a “role interface”, then this is indicated using the keyword plays. Similarly, if a type implements the “ownership interface” of an attribute, then this is indicated using the keyword owns.
Simple enough! Let’s look at a second example, this one providing a schema for our earlier “polymorphic” pet-ownership query (here, we simply “add” the schema definition below to the one above; in particular, we assume person to already be defined):
define
animal sub entity,
plays pet-ownership:pet,
owns age;
cat sub animal,
owns personality-indicator;
dog sub animal,
owns personality-indicator;
tardigrade sub animal; # tardigrades are cute but don't have personalities
pet-ownership sub relation,
relates pet,
relates owner;
person plays pet-ownership:owner,
owns personality-indicator;
personality-indicator sub attribute,
value string;
age sub attribute,
value long;
Under the hood, there is much more going on here than in our first example. To begin, note that the schema definition introduces user-defined subtypes: namely, the types cat, dog, and tardigrade are all defined to be subtypes of the entity type animal. As a consequence, all of these subtypes can automatically play the role of pet in pet ownerships, and own the age attribute. In contrast, the attribute personality-indicator is only specified for specific animal types. As an aside, note that definitions like person plays pet-ownership:owner can be added at any time, and need not be given together with the definition of person.
So what happens to our polymorphic query if we change our schema? Say, we add the following definition:
define
chicken sub animal,
owns personality-indicator;
fish sub animal; # fish cannot take the Myers-Briggs test since they have no hands
The answer is: the query still does exactly what it should do. While previously our type variable $animal-type, which represents types of animals that own a personality indicator, ranged over cat and dog, after the schema modification it will range over cat, dog, and chicken! The need to carefully maintain our queries against the database schema has thereby been (essentially) eliminated by our type-theoretic query approach.
Our example nicely illustrates the intuitiveness of interfaces and type inheritance in the PERA model. None of the above behavior is surprising, and none is difficult to read or understand. While they are arguably very simple examples, they nonetheless serve to illustrate some of the key reasons why we would design our database as described, and not any other way.
Native logical reasoning in TypeDB
There is much more to the PERA model, and it’s impossible to cover all aspects of it in one sitting. Let us nonetheless give the spotlight briefly to one more native type-theoretic feature, which relates to our earlier remark about “function types” in our crash course on type theory.
Namely, type theory is built to represent and reason with logical implications. This is also integrated into our data model and resolves the practical problem of directly integrating application logic and database management. Concretely, TypeQL allows the definition of rules which can be used to infer new data from existing data. Consider the following (very) simple example:
# Whenever an adoption date is given
# mark pet-ownership as by-adoption
rule assign-adoption-status: when {
$o isa pet-ownership, has adoption-date $d;
} then {
$o has by-adoption true;
};
This guarantees that our earlier by-adoption attribute will be automatically set to true whenever an adoption-date is known.
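In practice, this means that with rule inference enabled, a query like the following (a minimal sketch) also returns ownerships for which by-adoption true was never explicitly inserted, but merely follows from a recorded adoption-date:

match
$o isa pet-ownership, has by-adoption true;
get $o;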
As a slightly more complicated example, consider the following “edge transitivity” rule:
rule share-pets-by-marriage: when {
$m (spouse: $p1, spouse: $p2) isa marriage;
(pet: $a, owner: $p1) isa pet-ownership;
not { $m has divorce-date $d; };
} then {
(pet: $a, owner: $p2) isa pet-ownership;
};
When utilizing symbolic reasoning, new fact generation occurs at query time, ensuring that the inferred data is never stale. By using rules to govern data dependencies within the database, all inferred data can be made to have a single source of truth, preventing data conflicts from ever occurring. This ensures that the database is always in a consistent and current state, and removes the need for pre-computation cycles.
Of course, defining rules is only half of the required work. The other half is implementing an efficient “reasoning engine” that can infer new facts by triggering rules sequentially or in parallel. The polymorphic database TypeDB implements such an engine, based on a concurrent actor model particularly suitable for large-scale databases!
Modeling relational, graph, document, and more
At this point, you may ask: if the PERA data model implemented by TypeDB and its query language TypeQL are so different from existing models, won’t that be an issue for interoperability with those models? Quite the opposite is the case: the PERA model was designed precisely as a unification of existing ideas, with data model compatibility built into it.
Consider, for instance, the relational model, which stores data in tables of rows (rows that may occur multiple times when using bag semantics). (1) Each table in our relational schema translates to an entity type in the PERA model. (2) The entity type will own an attribute type for each column in the table. (3) Foreign keys from one table to another will be modeled by relation types. And that yields a very straightforward comparison between the two models! Of course, the experienced SQL developer may now ask: aren’t some tables better modeled as relation types? They are, but our choices here reflect the important observation that the relational model flattens dependencies, so that without additional semantic context we cannot know whether some of our entity types are secretly associative entity types (also known as join tables): if this information is known to us then, indeed, it is preferable to model associative entities as relations! But disregarding this detail, let’s see a simple example of the comparison in action.
define
# We define a table 'student' with two columns
# as well as a column _containing_ a foreign key:
student sub entity,
owns subject-of-study,
owns start-date,
plays supervision:supervisee;
# We define a table of 'professor' with three columns.
# It also _provides_ a foreign key to another table:
professor sub entity,
owns name,
owns number-of-students,
owns taking-new-PhDs,
plays supervision:supervisor;
# The foreign key establishes a relation between the
# two tables, captured by the following type:
supervision sub relation,
relates supervisee,
relates supervisor;
# Finally, we add types to our columns:
subject-of-study sub attribute, value string;
start-date sub attribute, value datetime;
name sub attribute, value string;
number-of-students sub attribute, value long;
taking-new-PhDs sub attribute, value boolean;
So what is happening here? We are defining a simple relational schema with two tables. The first is a table of “students” with columns for their “subject of study” and the “start date” of their studies, as well as a foreign key column for “supervision”. The second is a table of “professors” with columns for their “name”, the “number of students” that they supervise, and a column indicating whether they are currently “taking new PhD students”. The table of “professors” provides the key that this foreign key references.
Note that, a priori, relations need not be n-to-1 as implied by this example. Using n-to-n relations can be a powerful way to directly model multi-valued foreign keys (a student may have multiple supervising professors!). In fact, the possibility of having multi-valued columns also applies to attribute columns. In other words, we can easily represent databases that are not normalized! However, an in-depth discussion of good modeling practice for (multi-valued) foreign keys and attributes will have to be left to a future instalment of this series!
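For instance, without any change to the schema above, the supervision relation already accommodates several supervisors per student. A query along the following lines (a sketch) would find students with at least two distinct supervisors:

match
$s isa student;
(supervisee: $s, supervisor: $p1) isa supervision;
(supervisee: $s, supervisor: $p2) isa supervision;
not { $p1 is $p2; };
get $s;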
For now, let’s briefly look at another interesting example: labeled property graphs, as used in many knowledge graphs. They, too, can be natively worked with in the PERA model. Roughly speaking, node labels correspond to entity types, while edge types correspond to relation types. Properties of both nodes and edges are modeled using attributes of these types. Rules, such as the transitivity rule above, can be used to generate custom closures of edges under composition, akin to queries with specific “graph patterns” in graph databases. Let’s look at a simplified example:
define
# First, we define the node types we are interested in.
# For this example, this will be 'city' type nodes!
city sub entity,
owns name;
name sub attribute, value string;
# Next, we define our edge types
flight sub relation,
relates departure-city,
relates arrival-city,
owns departure-time,
owns arrival-time;
# For simplicity, let's give departure and arrival times
# of daily flights as a "minutes since midnight" integer
departure-time sub attribute, value long;
arrival-time sub attribute, value long;
# Finally, let's define the logic of paths. In this
# example we are only interested in paths of length 2.
two-leg-flight sub relation,
relates first-leg,
relates second-leg;
flight plays two-leg-flight:first-leg,
plays two-leg-flight:second-leg;
rule possible-two-leg-flight: when {
$flight1 (departure-city: $start, arrival-city: $connect) isa flight,
has arrival-time $time1;
$flight2 (departure-city: $connect, arrival-city: $end) isa flight,
has departure-time $time2;
$time2 > $time1 + 30; # allow at least 30 minutes to connect
} then {
(first-leg: $flight1, second-leg: $flight2) isa two-leg-flight;
};
In the above example, we are considering a simple graph of “city” type nodes related by “flight” type edges. Of course in a more complicated graph, we could have many more node and edge types, and these could further be organized by an inheritance hierarchy. Observe that each type can come with customized attributes, which makes the above a schema-based approach to graph modeling.
After defining node and edge types, we turn to paths. Importantly, paths in a graph are not inserted data (i.e., part of the graph’s definition); instead, they are inferred from inserted data. Thus, we employ type-theoretic reasoning to infer paths. For our example above, we only consider paths of fixed length 2, by way of a rule “possible-two-leg-flight” that combines two “flight” type edges. (Note that arbitrary-length paths can be dealt with by recursive rules. But as before, we will leave a detailed discussion of the interesting topic of “paths of relations” to a further instalment of this series.) The point of our simple example is that the reasoning capabilities of our model give us highly expressive control over our intended path logic: in the above example, we do not connect just any two flights that are composable at the same city node in our graph; instead, we only connect those which allow for a layover of at least 30 minutes! This usage of type-theoretic reasoning gives users fine-grained control over which types of paths in our graphs we really need to consider. Reasoning in the PERA model is a highly generalizable tool, and can be adapted to many other real-world scenarios.
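To close the loop, here is how the inferred paths could then be queried; the city names are, of course, purely hypothetical:

match
$start isa city, has name "London";
$end isa city, has name "Singapore";
$leg1 (departure-city: $start) isa flight;
$leg2 (arrival-city: $end) isa flight;
(first-leg: $leg1, second-leg: $leg2) isa two-leg-flight;
get $leg1, $leg2;

Note that the pattern itself never mentions the connecting city or the layover time: the two-leg-flight relation, inferred by the rule above, already guarantees a valid connection.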
All of the above gives only a small taste of what can be represented in the PERA model. Plenty of other data and knowledge representation features fall within its unifying scope, ranging from hierarchical document structures and RDF with reification (via higher-level relations) to many common object-oriented features.
The future
You made it to the end! That was a lot to take in, but we hope you enjoyed the read. The TypeDB journey certainly has only started, and we are excited to finally share all our thoughts on the “unified theory of databases” with you. There will be much more to come, so if you’d like to stay in touch, please do so: subscribe to our TypeDB newsletter, join us on Discord, and follow us on Github!
References
This article is an expanded version of our ACM CIKM talk:
2023. Type Theory as a Unifying Paradigm for Modern Databases. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 5238–5239. https://doi.org/10.1145/3583780.3615999