Officially out now: The TypeDB 3.0 Roadmap

TypeDB Fundamentals

The Confluence of Database Languages and Type Theory


TypeDB is a novel kind of database: it combines type-theoretic data structures and language with classical database concepts. Its query language, TypeQL, is a direct reflection of this confluence: in TypeQL, database engineers think about variables and their types, and de-structure their queries into type-theoretic patterns. The result is generalist, extensible, type-theoretic paradimg for querying and developing modern databases. In this article, we will shed light on some of the basic principles of this confluence, focusing on the design of TypeQL and the novel data model of TypeDB, which we call the polymorphic entity-relation-attribute (PERA) model.

Declarative type-theoretic querying

Let us start with our query language, TypeQL. You may have already seen some example of its near-natural syntax in other article. If these examples made sense to you, there is not much more required to understanding the language! TypeQL is conceptually simple but highly expressive query language, with a composable and fully declarative syntax. Let’s start with an example, building on some of the ideas and examples that we saw earlier. Consider the following data retrieval query:

match
    $p1 isa person, has name "Henry VIII";
    $m (spouse: $p1, spouse: $p2) isa marriage, has date $d;
    $d >= 1535-01-01;
get $p2;

Note that, except for the additional keywords match and get which indicate data retrieval, the body of the query looks very similar to our earlier “natural language” descriptions of composed types. The purpose of the query is almost self-explanatory: it retrieves all persons p2 which have been married to Henry VIII after the beginning of the year 1535. But, in order to really connect the dots with our earlier discussion of type theory, let us go through the code of the above query’s body line by line:

  1. The first line p1 isa person, has name "Henry VIII"; really contains two statements which have been “shortened” into one. First, it states $p1 isa person which, as we’ve seen earlier, simply declares the variable p1 to be of (entity) type person. Second, we have the statement $p1 has name "Henry VIII": this syntax is specific to attribute types, and states that "Henry VIII" is a term in the type name depending on $p1 (so, using our earlier syntax, you may think of this as assigning the value “Henry VII” to an implicit variable of type “name of p2”).
  2. The second line $m ($p1, $p2) isa marriage, has date $d; also implicitly contains two statements. First, we declare the variable $m to be of type “marriage of persons p1 and p2”. Importantly, note how the type person of $p2 may be inferred purely from the context it is used in! We will revisit type inference in the next section. Second, we once more have a statement $m has date $d that refers to an attribute type, stating that $d is a term in the type date depending on $m.
  3. The final line $d >= 1535-01-01; imposes a condition on $d using the predicate x < y. At first glance, this may look somewhat different from the structure of the first two lines, but really it is not: predicates (via “predicates-as-types”) naturally fit our syntax here.

This, of course, is a rather simplistic example, but it illustrates the point of type composition in data retrieval queries. Note that we haven’t yet discussed at all where types like person or marriage are defined, and this will be the content of the next section! Before that, let us briefly look at another illustrative, and slightly more complex example. Consider the following query:

match
    $ownership (animal: $pet, owner: $owner) isa pet-ownership;
    # retrieve only pets that were adopted
    $ownership has by-adoption true;
    
    # retrieve only specific cats or dogs
    { $pet isa cat, has lives-left $lives; 
      $lives > 3; } 
    or 
    { $pet isa dog, has age $age; 
      $age > 3; };
    
    # exclude Flintstones' pets 
    not { 
    	{ $owner has name "Fred Flintstone"; }
    	or 
    	{ $owner has name "Wilma Flintstone"; };
    };
get $pet, $owner;

While similarly in style to the previous query, this illustrates several novelties over the first example.

  1. Unlike our earlier relation type marriage which relates identical spouse interfaces, our relation pet-ownership features asymmetric interfaces for pet and owner terms. Interfaces of relation types are called role types, and in general, a relation type may depend on any number of distinct role types!
  2. The above query contains the keyword or which links two pattern blocks (enclosed in brackets {...}). As we have briefly mentioned, this should be understood as describing two cases of patterns in parallel. For our example, the first or means that we are retrieving for $pet either cats with more than 3 lives left, or dogs older than 3 years in age.
  3. The second instance of an or block is enclosed by the keyword not. This indicates that that type described on the inside of the not block is not inhabited by any terms (again, this has a type-theoretic foundation: see propositional truncation or bracket types if you are interested). For our specific query, this is easy to understand: we exclude the case of pet owners named Fred or Wilma Flintstone from our data retrieval!

This all looks nice and readable, but what if instead of “queries for user-specified cases”, we want to write queries that cover a range of cases that are selected based on some criteria? For example, how do we query for types of animals that can have specific attributes? This is where the real power of type-polymorphic queries comes into play. Consider the following example:

match
    # let's consider subtypes of animal
    $animal-type sub animal;
    # we are interested only in types which implement
    # ownership of the personality indicator type
    $animal-type owns personality-indicator;
    
    # Now we retrieve pets ownerships
    (pet: $animal, owner: $owner) isa pet-ownership;
    
    # where the variable $animal is a $animal-type
    # (note '!' avoids the variable $a living in a 
    # proper subtype of $animal-type )
    $animal isa! $animal-type;  
    
    # Finally, retrieve pet ownerships in which pet 
    # and owner have the same personality indicator
    $animal has personality-indicator $i;
    $owner has personality-indicator $i;
get $animal, $owner;

The crucial innovation of this query over the previous examples is the usage of the type variable $animal-type. This allows us to consider a range of animal types, based on the criterion that to we want a personality-indicator attribute associated to that type. The query than queries for pet-ownership in which pet and owner have the same such personality indication.

The variabilization of types follows a larger design pattern which we fully subscribe to here:

Everything has a type so everything can be a variable

The resulting synthesis of variabilization, (inheritance and interface) polymorphism, and conceptual modeling with type dependencies has many powerful repercussions. We do not have the space here to comprehensively list them, but now that we have a first idea of the style and expressiveness of fully declarative, type-theoretic querying, we a ready to dive yet deeper. We will now learn about the practical database schemas which underlie queries like the one above!

A fully composable data model

TypeQL is not only a data query language (as illustrated in the previous section using match ... get queries) and a general data manipulation language, but is also a data definition language, which allows us to define schemas based on our data model. All three aspects of the language are closely related in syntax and style, and follow the same fundamental type-theoretic ideas.

So what definitions will we need to express in TypeQL? We already have learned about a small selection of key ingredients of our polymorphism-enhanced entity-relation-attribute model which we need to be able to specify, including:

  • Entity types, as the types of concepts that do not depend on other concepts.
  • Relation types, together with their dependencies on (potentially multiple) role types that they relate.
  • Attribute types whose terms will derive from a specified value, together with an implicit dependency on a (single) ownership type.
  • Inheritance hierarchies for all of the above, which mean types being defined to be subtypes of other types.
  • Interface implementations which, concretely, means types are defined to play roles and own attributes.

All of these components are can be defined intuitively and concisely in TypeQL. To illustrate how, let’s look at a first example of a data definition query (which always starts with the keyword define). The following definition provides a schema suitable for running our marriage query from the previous section:

define
    person sub entity,
        owns name,
        plays marriage:spouse;
    marriage sub relation,
        relates spouse,
        owns date;
    name sub attribute, 
        value string;
    date sub attribute, 
        value datetime;

Note the three reserved “root-types” entity, relation, and attribute. A new type in one of the respective categories is introduced simply by accordingly subtyping these root-types. Note that the role dependencies of relation types are indicated using the keyword relates. Conversely, if another type implements such a “role interface”, then this is indicated using the keyword plays. Similarly, if a type implement the “ownership interface” of an attribute, then this is indicated using the keyword owns.

Simple enough! Let’s look at a second example, this one providing a schema for our earlier “polymorphic” pet-ownership query (here, we simply “add” the schema definition below to the one above, in particular, we assume person to already be defined):

define
    animal sub entity,
        plays pet-ownership:pet,
        owns age;
    cat sub animal,
        owns personality-indicator;
    dog sub animal,
        owns personality-indicator;
    tardigrade sub animal; # tardigrades are cute but don't have personalities
    pet-ownership sub relation,
        relates pet,
        relates owner;
    person plays pet-ownership:owner,
        owns personality-indicator;
    personality-indicator sub attribute,
        value string;
    age sub attribute,
        value long;

Under the hood, there is much more going on here than in our first example. To begin, note that the schema definition introduces user-defined subtypes: namely, the types cat, dog, and tardigrade are all defined to be entity subtypes of the entity type animal. As a consequence of this, all of these subtypes can automatically play the role of pet in pet ownerships, and own the age attribute. In contrast, the attribute personality-indicator is only specified for specific animal types. As an aside, note that definitions like person plays pet-ownership:owner can be added at any time, and need not be given together with the definition of person.

So what happens to our polymorphic query if we change our schema? Say, we add the following definition:

define
    chicken sub animal,
        owns personality-indicator;
    fish sub animal; # fish cannot take the Myers-Briggs test since they have no hands

The answer is: the query still does exactly what it should do. While previously our type variable $animal-type that represents types of animals that own a personality indicator ranged over cat and dog, now after the schema modification, it will range over cat, dog, and chicken! The need for carefully maintaining our queries based on the database schema has thereby been (essentially) eliminated via our type-theoretic query approach.

Our example nicely illustrates the intuitiveness of interfaces and type inheritance in the PERA model. None of the above behavior is surprising, and none is difficult to read or understand. While they are arguably very simple examples, they nonetheless serve to illustrate some of the key reasons why we would design our database as described, and not any other way.

Native logical reasoning

There is much more to the PERA model, and it’s impossible to cover all aspects of it in one sitting. Let us nonetheless give the spotlight briefly to one more native type-theoretic feature, which relates to our earlier remark about “function types” in our crash course on type theory.

Namely, type theory is built to represent and reason with logical implications. This is also integrated into our data model and resolves the practical problem of directly integrating application logic and database management. Concretely, TypeQL allows the definition of rules which can be used to infer new data from existing data. Consider the following (very) simple example:

# Whenever an adoption date is given
# mark pet-ownership as by-adoption

rule assign-adoption-status: when {
    $o isa pet-ownership, has adoption-date $d;
} then {
    $o has by-adoption true;
};

This guarantees that our earlier by-adoption attribute will be automatically set to true whenever an adoption-date is known.

As a slightly more complicated example, consider the following “edge transitivity” rule:

rule share-pets-by-marriage: when {
    $m (spouse: $p1, spouse: $p2) isa marriage;
    (pet: $a, owner: $p1) isa pet-ownership;
    not { $m has divorce-date $d; };
} then {
    (pet: $a, owner: $p2) isa pet-ownership;
};

When utilizing symbolic reasoning, new fact generation occurs at query time, ensuring that the inferred data is never stale. By using rules to govern data dependencies within the database, all inferred data can be made to have a single source of truth, preventing data conflicts from ever occurring. This ensures that the database is always in a consistent and current state, preventing the need for pre-computation cycles.

Of course, defining rules is only half of the required work. The other half is implementing an efficient “reasoning engine” that can infer new facts by triggering rules sequentially or in parallel. The polymorphic database TypeDB implements such an engine, based on a concurrent actor model particularly suitable for large-scale databases!

Modeling relational, graph, document, and more

At this point, you may ask: if the PERA data model implemented by TypeDB and its query language TypeQL are so different from existing models, won’t that be an issue for the interoperability with these models? The converse is the case: the PERA model was precisely designed to be a unification of existing ideas, with data model compatibility built into it.

Consider, for instance, the relational model which stores data in tables and rows (that may occur multiple times when using bag semantics). (1) Each table in our relational schema translates to an entity type in the PERA model. (2) The entity type will own an attribute type for each column in the table. (3) Foreign keys from one table in another will be modeled by relation types. And that yields a very straight-forward comparison between the two models! Of course, the experienced SQL developer may now ask: aren’t some tables better modeled as relation types? They are, but our choices here reflect the important observation that the relational model flattens dependencies, so that a without additional semantic context we cannot know if some of some of our entity types are secretly associative entity types (also known as join tables): if this information is known to us then, indeed, it is preferable to model associative entities as relations! But disregarding this detail, let’s see a simple example of the comparison in action.

define 
    # We define a table 'student' with two columns
    # as well as a column _containing_ a foreign key:
    student sub entity,
        owns subject-of-study,
        owns start-date,
        plays supervision:supervisee;

    # We define a table of 'professor' with three columns.
    # It also _provides_ a foreign key to another table:
    professor sub entity,
        owns name,
        owns number-of-students,
        owns taking-new-PhDs,
        plays supervision:supervisor;

    # The foreign key establishes a relation between the
    # two table, captured by the following type:
    supervision sub relation,
        relates supervisee,
        relates supervisor;

    # Finally, we add types to our columns:
    subject-of-study sub attribute, value string;
    start-date sub attribute, value datetime;
    name sub attribute, value string;
    number-of-students sub attribute, value long;
    taking-new-PhDs sub attribute, value boolean;

So what is happening here? We are defining a simple relational schema with two tables. The first is a table of students with columns for their “subject of study” and the “start date” of their studies, as well as a foreign “supervision” key. The second is a table of “professors” with columns for their “name”, the “number of students” that they supervise, a column indicating whether they are currently “taking new PhD students”. The table of “professors” provides a key.

Note that, a priori, relations need not be n-to-1 as implied by this example. Using n-to-n relation can be a powerful way to directly model multi-value foreign key (a student may have multiple supervising professors!). In fact, the possibility of having multi-value columns also applies to attribute columns. In other words, we can easily represent databases that are not normalized! However, an in-depth discussion of good modeling practice the interesting topic of (multi-value) foreign keys and attributes will have to be left to a future instalment of this series!

For now, let’s briefly look at another interesting example: labeled property knowledge graphs. They, too, can be natively worked with in the PERA model. Roughly speaking, node labels correspond to entity types, while edge types correspond to relation types. Properties of both nodes and edges are modeled using attributes of these types. Rules, such as the transitivity rule above, can be used to generate custom closures of edges under composition, akin to queries with specific “graph patterns” in graph databases. Let’s look at a simplified example:

define
    # First, we define the node types we are interested in.
    # For this example, this will be 'city' type nodes!
    city sub entity,
        owns name;
    name sub attribute, value string;
    
    # Next, we define our edge types
    flight sub relation,
        relates departure-city,
        relates arrival-city,
        owns departure-time,
        owns arrival-time;
    # For simplicity, let's give departure and arrival times 
    # of daily flights as a "minutes since midnight" integer
    departure-time sub attribute, value long;
    arrival-time sub attribute, value long;
    
    # Finally, let's define the logic of paths. In this 
    # example we are only interesting in path of length 2.
    two-leg-flight sub relation,
        relates first-leg;
        relates second-leg;
    flight plays two-leg-flight:first-leg,
        plays two-leg-flight:second-leg;
    
    rule possible-two-leg-flight: when {
        $flight1 (departure-city: $start, arrival-city: $connect) isa flight,
            has arrival-time $time1;
        $flight2 (departure-city: $connect, arrival-city: $end) isa flight,
            has departure-time $time2;
        $time2 > $time1 + 30; # allow at least 30 minutes to connect
    } then {
        (first-leg: $flight1, second-leg: $flight2) isa two-leg-flight;
    }

In the above example, we are considering a simple graph of “city” type nodes related by “flight” type edges. Of course in a more complicated graph, we could have many more node and edge types, and these could further be organized by an inheritance hierarchy. Observe that each type can come with customized attributes, which makes the above a schema-based approach to graph modeling.

After defining node and edge types, we turn to paths. Important, paths in graph are not inserted data (i.e. part of the graphs definition); instead, they are inferred from inserted data. Thus, we employ type-theoretic reasoning to infer path. For our above example, we only consider paths of fixed length 2 by a rule “possible-two-leg-flight” that combines two “flight” type edges. (Note that arbitrary length paths can be dealt with by recursive rules. But as before, we will leave a detailed discussion of the interesting topic of “paths of relations” to a further instalment of this series.) The point of our simple example is that the reasoning capabilities of our model give us highly expressive control over our intended path logic: in the above example, we do not connect any two flights that are composable on the same city node in our graph; instead, we only connect those which allow for an at least 30 minute lay-over! This usage of our type-theoretic reasoning gives users fine-grained control which type of paths in our graphs we really need to consider. Reasoning in the PERA is a highly generalizable tools, and can be adapted to many other real-world scenarios.

Summary

All of the above gives only a small taste of how data can be represented in the PERA model, the data model of TypeDB, and how it can be queried with its query language, TypeQL. Plenty of other data and knowledge representation features fall within the unifying scope of the PERA model, ranging from hierarchical document structures, RDF with reification, and many common object-oriented features. For a more in-depth discussion of the model (and many more example of the TypeQL) check out our other fundamentals articles! And if you really want to dive into the mathematical foundations of TypeQL, then you should also have a look at our research papers section.

Share this article

TypeDB Newsletter

Stay up to date with the latest TypeDB announcements and events.

Subscribe to Newsletter

Further Learning

Type Theory Fundamentals

Discover the powerful unification of paradigms backed by modern type-theoretic mathematics, laying a novel foundation for modern database applications.

Read article

Type-Theoretic Language

TypeQL is the type-theoretic query language of TypeDB, allowing for highly expressive querying power beyond the capabilities of other modern database paradigms.

Read article

TypeQL ACM Paper

Read the preprint of the paradigm-defining paper behind TypeQL, due to be published in Proceedings of the ACM on Management of Data.

Read paper

Feedback