TypeDB Learning Center

The Polymorphic Data Model With Types


The TypeDB database is based on a novel, highly expressive, and highly typed data model called the polymorphic entity-relation-attribute (PERA) model. In contrast to other data models, the PERA model combines three distinct themes in database and programming language design into a single simple model.

  1. First, the PERA model builds on core ideas from conceptual and semantic data modeling, which ensures that the model is structurally close to, and just as intuitive as, natural language.
  2. Second, the model integrates polymorphic thinking, which organizes concepts into inheritance type hierarchies and models the “behavioral traits” of these types through interfaces, which other types depend on.
  3. And third, the PERA model is accessible through a robust type-theoretic querying paradigm, in which composite types act as declarative queries. The result is a novel high-level query language: TypeQL.

In this article, we will give an in-depth discussion of the PERA model, providing an overview of its core concepts and of TypeDB’s language, TypeQL, which is tailored to the PERA model.

To get us started, let us briefly review the larger picture, including the theory and vision, the pragmatic motivation, and the practical application of TypeDB as a high-level database for building robust modern applications—this will help us to put our discussion of the PERA model into context.

The theory and vision

TypeDB’s design, following the emerging trend of modern languages providing high-level “zero-cost” abstractions, is firmly rooted in the theory of types. This stands in stark contrast to common existing databases, which are based on classical predicate-based and imperative thinking. In these classical approaches, types only arise as an afterthought, which means that many of the benefits of typed programming cannot be fully realized. Type systems provide an expressive way of declaring and controlling the behavior of programs. This makes applications safer, easier to compose, and easier to maintain. In our article on the type-theoretic paradigm for modern databases we discuss how type theory allows us to re-think databases: types are declarations of “domains of data” and thus a great starting point to design a truly declarative query language. Moreover, since types provide the primary framework for polymorphism, and since dependencies in data can be elegantly represented with type dependencies, we argue there that there is really no way around type theory for building the next generation of databases.

The pragmatic motivation

Modern programming languages, often driven by scientific and open-source movements, have evolved dramatically over the last several decades. In contrast, databases seem to be stuck with, by now, rather aged paradigms; this phenomenon may be partially explained by the more commercialized environment that drives database engineering, or possibly by the much higher requirements for robustness, maturity, and theoretical understanding of database systems. As a result of these different speeds of development, today we regularly deal with mismatches between modern type- and object-oriented programming on one hand and database languages on the other. In our article on the need for a polymorphic database we discussed these mismatches and shortcomings in more detail, and the negative consequences that they have for guaranteeing the semantic integrity, modularity, and maintainability of database applications. We also observed that attempts to implicitly translate between high-level programming languages and database languages (e.g. via ORMs) often result in costly abstractions, in the form of non-optimal queries or compute overhead for generating object representations. These mismatches compound at scale, with custom-built solutions providing a rather expensive last resort for many customers.

The practical implementation

TypeDB puts the theory and vision of type-theoretic databases to use in order to address the aforementioned pain points. As a database, TypeDB builds directly on a polymorphic conceptual data model, comprising entity, relation, and attribute types, as well as their inheritance hierarchies and interfaces. TypeDB comes with a high-level, type-theoretic query language, TypeQL, that lets users directly interact with the PERA data model. The type system of TypeDB ensures semantic integrity of data, and its language is compositional, modular, and intuitive, which ensures that even highly complex applications remain maintainable and can be fearlessly modified at any level of granularity without having to introduce breaking changes. All this and more is discussed in detail in our fundamentals article introducing TypeDB.

Introducing the polymorphic data model of TypeDB

Describing the data model is the first step in the design of any database, and so the PERA model plays a particularly important role in understanding the inner workings of TypeDB. Keeping the larger picture of TypeDB, its theory, and its motivation in mind, in today’s article we will give an in-depth introduction to the PERA data model, which is positioned to power a novel class of natively polymorphic databases.

1. Preliminaries about concepts and types

Databases are a vast topic and thus, as a first point of order, it will be useful to get our terminology straight: when defining the PERA model and when comparing it with other models, we will be using terms such as concepts, types, instances, dependencies, inheritance, interfaces, objects, attributes, etc. Let’s make precise how we should think about each of these terms.

Concepts, instances, dependencies

For data to be effectively used and queried, it needs to be organized and interpreted. At the surface level, there seem to be many ways to go about this; and indeed, many different database models co-exist nowadays. However, deeper down, we will find that most models are, in fact, governed by very similar elementary principles. These principles are data conceptualization on one hand and concept dependency on the other.

Conceptualization describes the process of introducing named concepts (like “person”, “car”, or “marriage”), which are understood by the user, in order to interpret our data: we identify specific data or data structures as instances belonging to these concepts. Concept dependency describes the relations between the resulting instances and concepts. Namely, some of our concepts might not be interpretable in isolation but only in the presence of instances of other concepts. Concretely, this means that in order to meaningfully create instances of the former concept we need to reference instances of the latter concept; we then say that the former concept depends on the latter concept.

As a simple example of dependency, consider the concepts of “persons” and of “names” in a relational database set-up: we could choose rows in a Person table to be the data structure that represents instances of a “person”. Names could be represented by a column in that table: each entry in the column is “a name”, i.e., an instance of the name concept. But note, in order to create an instance of a name (i.e. an entry in the Name column) we need to choose a pre-existing row for that instance to live in. In other words, in this example, the concept of names depends on persons that own them—this is intuitively clear since, after all, a name should name something. While the example may seem slightly artificial, it illustrates a general organizational principle found across essentially all database paradigms as we will see.

The type-theoretic perspective

In type theory we formally study type systems, comprising types and terms of types, which provide descriptions of specific collections of data, as well as typed programs or operations, which allow passing data between types. Type theory allows us to make the above discussion of conceptualization and concept dependency precise in simple and direct terms:

  1. Independent concepts, such as the concept Person, correspond to plain types in our type system;
  2. Dependent concepts, such as “name(s) of a person x”, correspond to so-called dependent types, often written using functional notation as Name(x).

We will forgo a more in-depth review of type-theoretic ideas here (and it will not be necessary to know type theory to understand the main ideas of the PERA model). Readers interested in a more detailed discussion can have a look at our earlier article on type theory!

A fundamental operation in type systems is subtyping, which allows us to cast instances of one type into instances of another type. For example, considering a concept (i.e., type) Employee as a subtype of Person means that each instance e of the type Employee can be cast into an instance cast(e) of Person. We usually keep casting operations implicit in our notation, writing the person cast(e) simply as e.

When casting one concept type into another concept type, as in our example, we speak of inheritance polymorphism. There is also a second, different kind of subtype casting, which concerns concept (i.e., type) dependencies. Namely, for any dependent concept we can abstract the type of the instances that the concept can be instantiated with. For example, for our dependent type Name(x) of “names of x”, you can think of this as introducing an abstract interface type of “name owners” with the purpose of collecting all possible x. Then, one or more concept types could be cast into that abstract type of name owners: for example, person instances x of type Person may be cast into name owners, but so might, say, cities x belonging to some other concept City. Observe how this allows us to conceptualize both “names of persons” and “names of cities” in a unified way! Unlike subtyping one concept by another concept, subtyping abstracted dependencies by concepts in this way will be referred to as interface polymorphism.

Object vs. attribute types

There is an important distinction to be made when it comes to the instances of the types in our type system. For example, we think of instances in the type Person and the type Name(x) for some person x in a fundamentally different way. Let’s explain this.

Recall from our earlier example, in a relational approach, we could think of instances of persons as rows in a Person table, while names may be string values in the Name column of that table. But while we may add any number of new rows to our table, we cannot invent new strings. In other words, person instances can be freely created while instances of names are drawn from an existing data type comprising literal values, such as strings, integers, or booleans (or composites thereof). Note that this distinction is not just terminological, but it implies fundamentally different behavior of these kinds of types:

  1. If we define that “p is a person who has name 'Foo Bar'” and later define “q is another person who has name 'Foo Bar'” then we will have created two instances of persons, with the same name.
  2. In contrast, if we define that “p is a person with name 'Foo Bar'” and later define that “p is a person who additionally has the name 'Foo Bar'”, then the second definition will not create anything! In this sense, literal value creation is idempotent.

Types like, say, Person, Employee, or City, whose instances can be freely created will be referred to as object types, and their instances will be referred to as objects. In contrast, types like, say, Name or Age, whose instances derive from system-defined types of values, will be referred to as attribute types.
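
To make the two behaviors above concrete, here is a minimal sketch in TypeQL 2.x syntax (which is only introduced properly in Section 2). It assumes a schema in which person is an entity type that owns a string-valued name attribute, and the three queries are meant to be run one after another.

```typeql
# Queries 1 and 2: each insert creates a new person object,
# so we end up with two distinct persons sharing the name "Foo Bar".
insert $p isa person, has name "Foo Bar";
insert $q isa person, has name "Foo Bar";

# Query 3: assign the same name again to persons that already have it.
# Nothing new is created, since literal value creation is idempotent.
match $p isa person, has name "Foo Bar";
insert $p has name "Foo Bar";
```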

We note that while our “object” terminology highlights an analogy with the object-oriented case, this analogy should be taken with a grain of salt: for example, in our simple type-theoretic approach we will have no need for intricate constructors of objects (in particular, no values ever need to be supplied to create a new object), nor does there need to be an analog of “methods” for classes in the PERA model. Instead, at the root of TypeDB lies a conceptual data model of entities, relations, and attributes, abstracting and generalizing ideas, e.g., of relational databases. Indeed, we will later describe the kinship of the PERA, relational, graph, and document models, which are all fundamentally distinct from the object-oriented approach.

Conceptualizing concept dependencies

It is worth emphasizing the following practice-oriented remark: in general, the role distribution between concepts and dependencies is not always “black and white”. Namely, concept dependencies themselves can be conceptualized—or conversely, we may turn a concept into a dependency.

Let us illustrate this. Consider the concepts of Employee and Team. We are faced with two options when modeling these concepts:

  1. We could decide that the concept of a team, in fact, can only make sense if we make explicit reference to the set of employees which constitute the team. In this case, Team would be the concept dependent on that of Employee, and we could thus speak of the concept of “team(s) with specific employees a, b, and c”.
  2. Alternatively, we could conceptualize the aforementioned dependency of teams on employees. This first means that Team becomes an independent concept, i.e. team instances can now be created without reference to employees. Second, however, we make our earlier dependency of teams on employees into its own dependent concept: let’s call this the concept of Team_Membership. Note, Team_Membership now depends both on the concepts of Employee and that of Team. To create an instance of team membership, we must reference an employee e in a team t.

Both options are, a priori, acceptable ways to model the involved concepts. However, in specific situations, one could imagine arguments for choosing one over the other. For example, it might be common for employees to frequently switch between teams, but we do not want to update a team instance every time there are changes to its set of members. Or, it could be very uncommon that the concept “team of a specific set of employees a, b, and c” would ever have more than one instance; and so it might ultimately not be that useful or interesting.

These arguments illustrate reasons for favoring option 2 over option 1. However, the situation could easily be reversed: for example, if we replace “Team of employees” by a “Marriage of persons” in our example, then we might in fact never want to think of a marriage as remaining the same instance if we switch out its participants, or we might always want to create a marriage instance with reference to its spouses (moreover, some spouses marry multiple times, in which case we might actually want to record multiple marriage objects m1, m2, … in the same type “marriage between $p1 and $p2”). For practical modeling, the following rule of thumb summarizes the key observation:

If, for a concept in our application, instances should never be created without reference to other objects, then make the concept dependent! Otherwise, it is preferable to make the concept independent, and conceptualize possible connections to other concepts.

2. Exploring the PERA model with TypeQL

In this section we will journey through the individual components of the polymorphic entity-relation-attribute (PERA) model, while also learning how these components can be addressed in TypeQL.

We have already met many of the key ideas underlying the PERA model in the previous section—recall that in the last section we, in particular, learned:

  • How concepts and concept dependencies are key principles for the organization of data,
  • How type systems can capture concepts as types and concept dependencies as type dependencies,
  • How inheritance and interface polymorphisms are two basic features of type systems with type dependencies,
  • How there is a fundamental difference in the instantiation of object and attribute types.

The PERA model combines all of the above, and further refines our preliminary ideas. To get us started, we give a brief overview of the terms that we will discuss in more detail in the following sections.

In the PERA model, we distinguish three kinds of types. First, we speak of entity types to mean independent object types, i.e. types whose instances are objects and which do not depend on any other types. Complementing this, we speak of relation types when referring to dependent object types. Recall from our discussion of interface polymorphism that we generally consider type dependencies on “abstract interface types”, which specific concept types may then be cast into. In the case of relation types, we refer to these abstract interface types as role types (or simply “roles”), and their instances as role players. A relation may depend on one or more role types.

The PERA model also features attribute types: note that, in the context of the PERA model, we will require all our attribute types to be dependent concepts. The abstract interface types of attribute types are referred to as ownership types (or simply “ownerships”). Importantly, the model requires each attribute type to have exactly one such ownership type. In the 2×2 matrix of (independent/dependent)×(object/attribute) types, this leaves only the combination of “independent attribute types”. These global constant types do not feature very prominently in the model, but we will briefly address them later in the special case of PERA attribute types “without specified owner”.

Based on the above, the PERA model comprises two elementary ingredients.

  1. A “database schema” of types, falling into the above categories, and subtyping information about these types, describing inheritance hierarchies and interface implementations.
  2. Appropriate “data” instances in each of these types.

In the following, we discuss in detail how both of these components can be defined. As a definitional language we will use TypeQL, which will allow us to directly translate what we learned into practice!

A brief aside: it is somewhat curious that we arrived at the distinction between entities, relations, and attributes purely from the perspective of concepts (and formally, type theory). Nonetheless, the resulting model structurally resembles the well-known entity-relationship approach to conceptual, a.k.a. semantic, data modeling. We will not go into a detailed comparison with topics from classical semantic data modeling here, but we do take it as a good sign that the model reproduces well-established classical insights from a novel type-theoretic perspective!

Disclaimer: the upcoming release of TypeQL 3.0 will introduce several changes to TypeQL, which will be reflected in a future version of this article. This version of the article is based on TypeQL 2.x syntax. Of course, the fundamental role of the PERA model will remain unchanged.

Entities

Entity types are the independent concepts in our database, and as such, they are the simplest to define in our database schema. In TypeQL a new entity type is specified by a statement of the following form:

define person sub entity;

The statement should be read as defining the type person to be a subtype of an abstract super-type entity, which is the default “root” type for entity types, i.e. all other entity types inherit from entity.

In general, types may also inherit from other defined entity types. In TypeQL this is specified by writing, for example:

define employee sub person;

Recall, employee being a subtype of person means that when we introduce an instance of employee later on, then it can be cast into a person as well!
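
As a small illustration (a sketch in TypeQL 2.x syntax, using the match and get keywords that are not otherwise covered in this article), a query for persons is expected to also return employees, since every employee can be cast into a person:

```typeql
# Matches all instances of person, including instances of its subtype employee.
match
$p isa person;
get $p;
```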

Importantly, when defining inheritance hierarchies for any kind of types (i.e. for entity, relation, or attribute types) the PERA model requires each type to have exactly one super-type, which may be the default root type. This condition is also known as single-inheritance, and lets us avoid the diamond problem (which becomes particularly acute when working with dependent types, since dependencies will be inherited from supertypes).

We also mention that types may be designated as abstract types. For example, for our entity type person defined earlier we may further specify that

define person abstract;

Designating the type person as abstract in this way means that no instance of the type can ever be created directly in person. In other words, instances of person will only be obtainable by casting instances of subtypes of person into a person. The three root types, entity, relation, and attribute, are all examples of abstract types, and any new type can be defined to be abstract.
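
The effect of abstractness can be sketched as follows, assuming the person and employee types from above. The schema definition and the data insert are separate queries, the rejected insert is shown only as a comment, and the exact error reported by TypeDB is not reproduced here.

```typeql
# Schema query: person is abstract, its subtype employee is not.
define
person sub entity, abstract;
employee sub person;

# Data query (allowed): employee is not abstract,
# and the new object can be cast into a person.
insert $e isa employee;

# Data query (not allowed): person is abstract,
# so no object can be created directly in it.
# insert $p isa person;
```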

Relations

Relation types are the dependent analogs of entity types. Since a relation may have multiple roles (recall, these are the abstract interface types that relation types depend on), TypeQL has the additional keyword relates to specify the roles that a specific relation may depend on. For example, we could specify

define 
marriage sub relation,
    relates spouse;

in order to define a marriage relation type, which depends on the role spouse.

As before, types may be specified to inherit from other types, but in contrast to entity types, this now also requires dealing with the inheritance of roles. We illustrate this with the following example of three different subtypings of the marriage type:

# case 1: role overwriting
define
hetero_marriage sub marriage,
    relates husband as spouse,
    relates wife as spouse;

# case 2: role inheritance
define
religious_marriage sub marriage;

# case 3: role extension
define 
witnessed_marriage sub marriage,
    relates witness;

The first of the above cases specifies a new relation type hetero_marriage which subtypes marriage, but overwrites the spouse role by two (more specific) roles: husband and wife. Note that, in general, we can overwrite roles with one, two, or more roles. In the second case, we specify a relation type religious_marriage which does not overwrite the spouse role and thus inherits it directly: this means that religious_marriage (like marriage) may depend on any instance that can be cast into the spouse role. In the third and final case, we define a relation type witnessed_marriage which, like religious_marriage, inherits the spouse role since it is not overwritten. In addition, it also extends role dependencies to another role, called witness. All three cases of overwriting, inheriting, and extending roles may be combined when defining subtypes of relation types.
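
As a hedged sketch of such a combination, the following definition overwrites an inherited role and adds a new one in a single statement. The names civil_partnership, partner, and officiant are hypothetical and chosen only for illustration.

```typeql
define
civil_partnership sub marriage,
    relates partner as spouse,   # overwrites the inherited spouse role
    relates officiant;           # extends the relation with a new role
```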

Attributes

Unlike entity and relation types, attribute types have instances that are literal values of a specific, pre-defined type. In TypeQL this “value type” of an attribute type is indicated using the keyword value. For example, consider the specification

define
name sub attribute, value string;

which defines an attribute type name and states that instances of name will be literal values of type string. We may further restrict the value type to a subtype with appropriate expressions. For example, replacing the above by

define
name sub attribute, value string, regex "^(Ana|Bob)$";

would mean that instances of name have to be among the set of strings {Ana, Bob}. Note that, unlike relation types, there is no explicit specification of the ownership interface of the attribute type. Indeed, since each attribute type has a single unique ownership interface, in TypeQL we leave this interface implicit!

We remark that the condition for attribute types to have exactly one interface is a very reasonable design choice: in many situations, 𝑛-ary attributes can create ambiguity for data modelers as they may often instead be conceived as unary attributes of 𝑛-ary relations. For example, the binary attribute concept of “distance between x and y” could instead be understood as the unary attribute concept of “length of p” where p is an instance of the binary relation concept of “shortest paths between x and y”.

Just like entity and relation types, attribute types, too, may be organized in inheritance hierarchies. For example, we may replace our earlier TypeQL specification by

define
identifier sub attribute, abstract, value string;
name sub identifier;

which defines the attribute type name as a subtype of some abstract supertype identifier. Note that attribute subtypes inherit the value type of their supertype: e.g., instances of name will be string values like those of identifier. We remark that in TypeQL 2.x, all attribute supertypes must be marked as abstract types: this is meant to avoid confusion when the same literal value is being used at different levels of an attribute type hierarchy.

Interface implementation

Since dependent types depend on abstract interface types, we still need to specify which types implement which interfaces in order to be able to actually instantiate dependent types. In essence, any such specification can be made when defining our database schema, with the simple rule that role players and owners must be objects (“objectified interfaces”). In other words, only object types can implement interfaces. Note that, without this rule, we might end up in situations in which our model loses track of the user’s intention due to the idempotency of literal value creation.

Roles

Let us first consider the implementation of roles in a relation type. In TypeQL, the implementing types are specified using the plays keyword. For example, we could specify:

define person plays marriage:spouse;

This specifies that we can cast instances of the type person into instances of the abstract interface type spouse of marriage. In natural language, we say that persons can play the role of spouse in marriages. The definition thus allows instantiation of the concept “marriage of spouses x and y” for persons x and y.

Note that our plays specification above uses the “scoped” notation marriage:spouse in order to refer to the spouse role; this is because, generally in TypeQL, role identifiers (like spouse) are required to only be unique within the scope of their relation type hierarchy.

Importantly, since relation types are object types, they, too, can play roles. For example, we could have:

define
civil_servant sub person;
registry_entry sub relation,
    relates registrar,
    relates event;
civil_servant plays registry_entry:registrar;
marriage plays registry_entry:event;

Ownerships

Next, let’s consider implementing ownerships of attribute types. In TypeQL, implementations are specified using the keyword owns. For example, continuing our previous examples, we could write:

define person owns name;

This specifies that instances of type person can be cast into the abstract interface type of “name owners” (recall that this interface is implicit and left unnamed in TypeDB). In natural language, persons can own names. As a result of the definition, we can now instantiate the concept of “name(s) of x” for a person x.

Any number of types can be specified to have a given ownership (and similarly, any number of types can play a given role). For example, in addition to the above, we could specify:

define 
city sub entity;
city owns name;

in order to allow cities to own names as well.
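
Interface polymorphism can also be observed at the level of types. The following sketch, in TypeQL 2.x syntax, asks for all types that implement the ownership of name; with the definitions above, the answers would be expected to include person and city, along with any subtypes that inherit the ownership.

```typeql
# $t ranges over types (not instances) that can own the attribute type name.
match
$t owns name;
get $t;
```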

And finally, since all object types (including relations) can own attributes, we could further extend our example as follows:

define
date sub attribute, value datetime;
marriage owns date;

Implementations are inherited

We remark that if an object type implements a role in relations (or an ownership of attributes) then this will be passed on to all its subtypes in the evident way: type-theoretically, we simply compose the respective casting operations. For example, since employee is a subtype of person, and since person implements name ownership, so does employee. Indeed, any employee instance can be cast into a person instance, and any person instance can be cast into a name owner.
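
A minimal sketch of this inheritance, assuming the schema built up so far (employee sub person, person owns name, person plays marriage:spouse): nothing was declared for employee directly, yet employees can own names and act as spouses.

```typeql
insert
# employee inherits the ownership of name from person
$e isa employee, has name "Ana";
$f isa employee, has name "Bob";
# employee also inherits the ability to play marriage:spouse
(spouse: $e, spouse: $f) isa marriage;
```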

Inserting data instances

In the previous sections we illustrated the key ingredients of database schemas in the PERA model, comprising entity, relation, and attribute types, as well as their inheritance hierarchies and their interface dependencies. In this section, we briefly describe how data instances can be created in the types specified by the schema in the PERA model.

As before, we will give all our specifications using TypeQL, which provides a simple and intuitive syntax for data creation, revolving around the keywords isa and has. Continuing with the example schema informally described in the previous sections, consider the following data insert query in TypeQL:

insert
$a isa civil_servant;
$a has name "Ana";
$b1 isa person, has name "Bob";
$b2 isa person, has name "Bob";
$m (spouse: $b1, spouse: $b2) isa marriage, has date 2004-05-17;
(event: $m, registrar: $a) isa registry_entry;

Let’s go through the above example line by line.

  1. The first line creates a new civil_servant object, and assigns that object to the variable $a.
  2. In the second line, we create the value "Ana" in the attribute type “name of $a”.
  3. The third line is similar to the first two, but note that statements with the same subject can be concatenated!
  4. The fourth line creates yet another person object, assigned to $b2. Importantly, note that $b1 and $b2 are not the same person object!
  5. The fifth line creates a new object in the type “marriage of spouses $b1 and $b2”, and assigns the object to the variable $m. It then also creates the date value 2004-05-17 in the type “date of $m”.
  6. Finally, the last line creates a new object in the type “registry_entry of event $m by registrar $a”. Note that the newly created object is not assigned to any variable here: it can be left implicit!

The example, in essence, covers all basic cases of data instantiation in the PERA model. But, of course, there is a little bit of fine print to the topic. We address some of it in the following sections.
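
To check what was inserted, the data can be read back with a match query. The following is a sketch in TypeQL 2.x syntax (the match and get keywords are not covered elsewhere in this article); it retrieves the name of the registrar and the date of the registered marriage.

```typeql
match
# find registry entries together with their event and registrar
(event: $m, registrar: $a) isa registry_entry;
# the event must be a marriage with a date, the registrar must have a name
$m isa marriage, has date $d;
$a has name $n;
get $n, $d;
```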

Cardinality and variadicity

We earlier saw how to create a new marriage instance between two spouses. However, we never specified that a marriage should have exactly two spouses. A priori, the cardinalities of roles in the PERA model are variadic, meaning any number of role players can be given when instantiating a relation type (as long as at least one role player for the relation type is given). That means, for example, we could also create:

insert
$a isa person, has name "Austin";
(spouse: $a) isa marriage;
# indeed, the other spouse could be unknown!

which would create a marriage with a single spouse. Variadicity is highly useful in order to record partial information, in this case describing the case where one of the spouses in a marriage relation is unknown.

However, without care, variadicity could also allow us to record a marriage with three or more spouses, which might violate the intended semantics of the type. For this reason, in the formal PERA model, cardinalities of role players are bounded (e.g. to indicate that a marriage should have no more than two spouses) and bounds are preserved when inherited. The ability to express precise cardinality constraints is firmly on our roadmap for TypeQL 3.0!

Intentionality

A rather subtle condition that data inserts need to satisfy is that of intentionality. In brief, the condition ensures that there is no ambiguity in interpreting the user’s intention as to which roles or ownerships an object is cast into when instantiating a dependent type. To illustrate this, we consider an example of how not to satisfy intentionality, starting with the following database schema:

define
companionship sub relation,
    relates companion;
marriage sub companionship,
    relates spouse as companion;
friendship sub companionship,
    relates friend as companion;
person sub entity;
person plays friendship:friend;
person plays marriage:spouse;

This (purely illustrative) schema describes various forms of companionship, defining that a person can play the role of either a friend in a friendship or a spouse in a marriage. Since we didn’t make companionship an abstract type, we could now attempt to insert data as follows:

insert
$p isa person;
$q isa person;
(companion: $p, companion: $q) isa companionship;

At first glance, this insert looks fine: indeed, $p and $q are objects of the type person and thus can be cast into either friends or spouses, and thus also into the role of companions. Nonetheless, the above would be invalid in the PERA model as it creates unwanted ambiguity. Indeed, the schema of our database tells us that persons can be precisely either a friend or a spouse. So in the insert clause above, which one of those two are $p and $q? The answer is ambiguous, and the intention of the user is unclear here. The intentionality condition prevents these and similar ambiguities by requiring that dependent types can only be instantiated with objects whose types are explicitly defined to implement the needed interfaces (using plays or owns specifications). That means the above data creation would be valid if we had also specified person plays companionship:companion; in the schema.
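The intentionality condition can be pictured as a simple lookup: a player may fill a role only if its type explicitly declares that role. The following Python sketch models this with a plain dictionary; the names plays and can_play are hypothetical and do not correspond to any TypeDB API.

```python
# Explicit `plays` declarations from the schema above, as a dictionary.
plays = {"person": {"friendship:friend", "marriage:spouse"}}


def can_play(plays, player_type, role):
    """A player type may fill a role only if the role is explicitly declared."""
    return role in plays.get(player_type, set())


# The insert above fails: person never declares companionship:companion,
# even though both declared roles specialize it.
print(can_play(plays, "person", "companionship:companion"))  # False

# Adding the explicit declaration makes the same insert valid.
extended = {"person": plays["person"] | {"companionship:companion"}}
print(can_play(extended, "person", "companionship:companion"))  # True
```

Note that the check deliberately ignores the fact that spouse and friend override companion: only explicit declarations count.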

Globality of attributes

When referring to an attribute (as opposed to “attribute type”) we usually don’t mean a bare literal value instance in an attribute type, but rather the tuple of both literal value and attribute type. For example, if some object has a name with value "Austin" (as created by our previous example) then we refer to the serialization name:"Austin" of attribute type identifier and a literal value as an attribute. In particular, if another object is created it may have the same attribute:

insert
$a isa person, has name "Austin";
$c isa city, has name "Austin";

Our terminology here reflects the following design choice: in TypeDB, we do not store duplicates of attributes. That is, if "Austin" is the name of some person and of some city, the database only needs to record one name:"Austin" attribute, with appropriate reference to both the person and the city. This can save both valuable space and time in our database application (note that the choice concerns only the implementation of the PERA model, not the definition of the model itself). We refer to this design choice as working with global attributes.
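As a rough illustration of global attributes, the Python sketch below stores each (attribute type, value) pair once and attaches owners to it. This is an assumption-laden model of the design choice, not a description of TypeDB's storage layer; the AttributeStore class is hypothetical.

```python
class AttributeStore:
    """Stores each (attribute type, value) pair once, shared by all owners."""

    def __init__(self):
        self._owners = {}  # (attr_type, value) -> set of owner identifiers

    def put(self, owner, attr_type, value):
        """Attach an owner to the attribute, creating the attribute if needed."""
        key = (attr_type, value)
        self._owners.setdefault(key, set()).add(owner)
        return key

    def owners(self, attr_type, value):
        """Return the set of owners of the given attribute."""
        return self._owners.get((attr_type, value), set())

    def __len__(self):
        """Number of distinct attributes stored."""
        return len(self._owners)


store = AttributeStore()
store.put("person-1", "name", "Austin")
store.put("city-1", "name", "Austin")

print(len(store))                              # 1: a single name:"Austin" attribute
print(sorted(store.owners("name", "Austin")))  # ['city-1', 'person-1']
```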

The idea of globality for attributes is reflected in another feature: we allow attributes that are global constants. This leverages the isa keyword as before, but this time used to create instances in attribute types. Consider for example:

insert
$n "Ana" isa name;
1970-01-01 isa date;
$a isa person;
$a has name $n;

Line by line, this data specification does the following:

  1. In the first line, we create the literal value "Ana" in the type “name without owner”, and assign the value to a variable $n.
  2. We then create the value 1970-01-01 in the type “date without owner”.
  3. Reusing variables as before, in the last two lines we create a new person object, and then use our value "Ana" in the variable $n to create a new instance in the type “name of $a”.

Thus, unlike in the case of relation types where at least one roleplayer needed to be instantiated, we do allow attribute types to be instantiated with no owners. This can be used to create global constants which are not associated with any owner, even though the feature is rarely used.

3. Comparison to other models

In Section 2, we’ve discussed in detail the individual components of the PERA model. Importantly, up to a handful of minor design choices, these components are derived from a set of simple and fundamental modeling primitives, described in Section 1. As we will now describe, these primitives are not exclusive to the PERA model but govern many other data models, albeit often being implemented in a less general way. This will unveil the close kinship between existing models and the PERA model, which brings about welcome benefits: we often find that migration to TypeDB from other databases becomes a simple and straightforward task.

In our discussion, we will focus on three prominent models: the relational model, the graph data model, and the document data model. In each case, we first inspect the concepts and dependencies that govern data structures in the model. We then illustrate with examples how these structures are commonly used, and how they can be translated into TypeDB.

Relational data model

In the well-known SQL-like approach to the relational model, data is organized, of course, into tables with columns, which represent the named concepts of the model (e.g. a table labeled Person with a column labeled Name comprises two named concepts: that of persons and that of names). Instances of a given table concept are called rows, and instances of a given column concept are either values in some literal value type (like strings or integers), in which case we speak of an attribute column, or they are values that represent references to (keys of) rows in another table, in which case we speak of a foreign key column.

Having defined the concepts of the relational data model, what about concept dependencies? Certainly, creating an instance in a column (whether this is an attribute or foreign key column) must refer to a row for the instance to live in, and so columns will depend on their respective tables as concepts. Moreover, in the case of foreign key columns, each instance of the column must be defined with an explicit reference to a row in another table. Thus foreign key columns depend both on their own table and the foreign table they reference. Let’s illustrate this with a simple example!

A simple relational database

Consider an SQL schema comprising a single table as follows:

CREATE TABLE employee (
    name text,
    team text
);

In this case, we thus defined three concepts in total: the concept of employees (the table employee), that of names (the column name), and that of teams (the column team).

We can now create respective instances for each of these concepts. Creating a new employee instance will add a new row to the employee table (let us refer to this instance as $a). Of course, without also creating a name and a team of that employee, the row would at first be “empty”, yielding a table as follows:

name | team
-----+------
NULL | NULL	<-- new row representing employee $a

Next, creating a name instance requires us to first choose an employee, and then add a literal value in the corresponding row and column. This similarly applies to instances of team. If we do so for our employee $a constructed earlier, our table might end up looking something like this:

 name |   team
------+------------
 Ana  | Engineering	<-- our fully specified employee $a

The example is of course overly simplistic and not very useful as is. Let’s improve it a little bit: for instance, the fact that we are keeping track of teams using literal string values is likely not a good idea, as it duplicates work and invites spelling mistakes. Consider the following slightly extended (but certainly still simplistic) schema.

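-- The referenced table must exist first, and its key must be unique.
CREATE TABLE team (
    team_ID integer PRIMARY KEY,
    team_name text
);
-- team_membership is a foreign key column pointing at rows of team.
CREATE TABLE employee (
    name text,
    team_membership integer REFERENCES team(team_ID)
);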

In contrast to our first schema, we have now created an independent concept of teams, a new concept of team_name dependent on teams, and a new concept of team_membership dependent on both employees and teams. Thus, in order to create an instance of the concept “team membership of $a in $t” we would have to make reference to an instance in the team table (represented, in our schema, by its unique team_ID key, which we exclude from being a concept as it only serves to identify rows in the team table). A database conforming to the schema could for instance look as follows.

---EMPLOYEE-----------
 name | team_membership
----------------------
 Ana  |   1
 Bob  |   2

-------TEAM----------
 team_ID | team_name
----------------------
    1    | Engineering
    2    | Marketing

Migrating relational data to TypeDB

Having explained all parts of our relational schema in terms of concepts and concept dependencies, we can translate the schema into the PERA model in a straightforward manner. In TypeQL, the schema corresponding to the preceding example could be represented as follows.

define
employee sub entity,
    owns name,
    plays team_membership:member;
team sub entity,
    owns team_name,
    plays team_membership:team;
team_membership sub relation,
    relates member,
    relates team;
name sub attribute, value string;
team_name sub attribute, value string;

While becoming slightly more verbose, the PERA schema gives explicit insight into the dependencies of all concepts at work. This can be beneficial in many ways: for example, we can now easily model an employee being a member of multiple teams (simply by creating appropriate instances of the team_membership concept). In contrast, in the relational model, this would require a schema modification: turning our foreign key into an associative entity (a.k.a. join table).

There is one small caveat to the discussion above: our earlier remark about the “conceptualization of concept dependencies” in Section 1 still applies! This means, we may need to decide whether some dependencies should rather be concepts or, conversely, whether certain concepts should rather be dependencies. In the particular case of the relational model, this gives two alternatives when conceptualizing foreign key columns.

  1. First, as described above, we might choose to think of tables as entities, and of foreign key columns as corresponding relation concepts. For example, the column team_membership in the table of employees which records the team they work in would become its own relation concept, relating employees and teams.
  2. However, in some situations we might prefer to “de-conceptualize”, and turn concepts into dependencies. For example, if our schema contains a join table of team_memberships with columns member and team, then, rather than thinking of the table as an entity concept and its columns as relation concepts, it would likely be preferable to deconceptualize: this means turning member and team back into “mere” role dependencies of a relation concept team_membership.

To summarize, as in the PERA model itself, we often have several options to design our database schema for a given data domain.
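The first conceptualization above (foreign key columns become relation concepts) can be sketched as a small data transformation. The following Python snippet is illustrative only: it uses the example rows from the tables above, identifies entities by their names for brevity (an assumption; real migrations would use proper identifiers), and the migrate function is a hypothetical helper.

```python
# Rows of the two example tables, as dictionaries.
employees = [
    {"name": "Ana", "team_membership": 1},
    {"name": "Bob", "team_membership": 2},
]
teams = [
    {"team_ID": 1, "team_name": "Engineering"},
    {"team_ID": 2, "team_name": "Marketing"},
]


def migrate(employees, teams):
    """Turn each non-NULL foreign key into a team_membership relation instance."""
    team_by_id = {t["team_ID"]: t["team_name"] for t in teams}
    memberships = []
    for employee in employees:
        foreign_key = employee["team_membership"]
        if foreign_key is not None:  # NULL foreign keys yield no relation
            memberships.append(
                {"member": employee["name"], "team": team_by_id[foreign_key]}
            )
    return memberships


for membership in migrate(employees, teams):
    print(membership)
```

Since the result is a list of relation instances rather than a column, an employee with several memberships simply contributes several entries, with no change to the schema.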

Graph data model

In the modern incarnation of the property graph data model, data is organized using labels and property keys as the named concepts of the model, though neither needs to be specified in a schema ahead of time when working with schemaless graph databases. Labels fall into two categories: node labels (e.g. Person) and edge labels (e.g. KNOWS), and the respective instances of these are, unsurprisingly, labeled nodes and edges. In contrast, data instances of property keys (e.g. name) are literal values (such as the string "Ana").

The dependency of concepts in the graph model is relatively simple to see. Defining an instance of an edge label, i.e. creating a new labeled edge, requires making reference to two (possibly identical) labeled nodes which describe the start and the end point of that edge. Thus, edge labels conceptually depend on the nodes they connect. Similarly, in order to define an instance of a property key, i.e. create a new key-value pair, we first need to choose a node or edge to which we assign the pair. Thus, property keys conceptually depend on nodes and edges in order to be definable/interpretable. Let’s illustrate this with an example!

A simple graph database

Consider the following simple graph database (note, we are using Cypher-like syntax here)

CREATE (a:Employee {name:'Ana'})
CREATE (b:Employee {name:'Bob'})
CREATE (e:Team {team_name:'Engineering'})
CREATE (m:Team {team_name:'Marketing'})
CREATE (a)-[:MEMBER_OF {role:'director'}]->(e)
CREATE (b)-[:MEMBER_OF {role:'manager'}]->(m)

This graph database thus comprises the following concepts:

  • A (node label) concept Employee.
  • A (node label) concept Team.
  • An (edge label) concept MEMBER_OF of membership of employees in teams (even though in a schemaless graph database this exact dependency is hard to enforce: a priori nothing can prevent us from adding CREATE (a)-[:MEMBER_OF]->(b) to the above, which would create a membership edge between two employees).
  • Three property key concepts: name, team_name, and role information for employees who are members of teams.

This design for our graph database makes it easy to record, say, when an employee is a member of two or more teams. For example, we could add to the above that

CREATE (a)-[:MEMBER_OF {role:'advisor'}]->(m)

stating that the employee Ana is not only a director in the Engineering team, but also an advisor to the Marketing team.

Despite this flexibility, we also quickly reach the boundaries of the graph model: one fundamental modeling obstruction arises from the fact that in the graph model we have a simple “two level” model of concept dependencies: edges (“level 1”) depend on the nodes (“level 0”) that they attach to. Thus, if we had another concept that depended on a level 1 concept, there would be no way to directly capture this in the graph model, since we can only represent dependencies of depth 1. This is also known as the issue of reification. For instance, in our above example, consider turning the property key role into its own node concept (in other words, we normalize the role property key). In order to represent that roles are taken up by members in teams we would now have to allow edges between role nodes and MEMBER_OF edges. Since the graph model does not feature these “level 2” edges, this is impossible to represent directly, and we must instead resort to other “hacks” of our schema (e.g., splitting edges into two and thereby creating a new node).
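The edge-splitting workaround mentioned above can be sketched in a few lines of Python. This is a toy model of reification on plain tuples, not the API of any graph database; the reify function and the new edge labels MEMBER and TEAM are hypothetical names chosen for illustration.

```python
# Edges of the example graph: (source, label, target, properties).
edges = [
    ("Ana", "MEMBER_OF", "Engineering", {"role": "director"}),
    ("Bob", "MEMBER_OF", "Marketing", {"role": "manager"}),
]


def reify(edges):
    """Replace each edge by a new node plus two edges attached to it.

    The new node carries the original edge's properties and can itself be
    the endpoint of further edges, which the original edge could not.
    """
    nodes, new_edges = [], []
    for index, (source, label, target, properties) in enumerate(edges):
        node_id = f"{label.lower()}_{index}"
        nodes.append((node_id, properties))
        new_edges.append((source, "MEMBER", node_id))  # hypothetical label
        new_edges.append((node_id, "TEAM", target))    # hypothetical label
    return nodes, new_edges


nodes, new_edges = reify(edges)
print(nodes)
print(new_edges)
```

The price of the workaround is visible in the output: every membership now costs one extra node and one extra edge, and the original MEMBER_OF concept no longer appears as an edge at all.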

Migrating graph data to TypeDB

As in the case of migrating relational data, the migration from graph data is very much straightforward: each concept converts to a type in the corresponding PERA schema with concept dependencies as described above. The resulting PERA schema is similar to that in the relational case, up to one small addition:

define
employee ...		# as before
team ... 		# as before
team_membership...	# as before
name ...		# as before
team_name ...		# as before
role sub attribute, value string;
team_membership owns role;

We can now go ahead and create data instances reproducing exactly our earlier graph database as follows:

insert
$a isa employee, has name 'Ana';
$b isa employee, has name 'Bob';
$e isa team, has team_name 'Engineering';
$m isa team, has team_name 'Marketing';
($a, $e) isa team_membership, has role 'director';
($b, $m) isa team_membership, has role 'manager';

(Note, in the last two lines we omit the interface types of the relation as this can be inferred by the type inference engine of TypeDB!)

So how does our earlier problem of normalizing the role property key play out in the PERA model? Well, it is not a problem at all. We can simply have an independent concept of “roles”, obtaining a schema along the following lines:

define
... # employee, team, name, team_name as before
role sub entity,
    owns title,
    plays team_member_role:role;
team_membership sub relation,
    relates member,
    relates team,
    plays team_member_role:membership; 
team_member_role sub relation,
    relates membership,
    relates role;
title sub attribute, value string;

Document data model

The document model is, in essence, a generalization of the relational model (tables correspond to collections, and rows to documents) but drops any requirements on normalization of data, meaning data can be nested, and generally has much looser requirements on referential integrity. There are two kinds of named concepts in the document model: collections of documents (e.g. a collection of Person documents) and dictionary keys in documents (e.g. a Name key). Instances of the former are documents in a given collection, while instances of the latter are “subdocuments”, which could comprise a simple literal value or further nested dictionaries and lists.

Discussing concept dependency in the context of the document model is somewhat more subtle due to the frequent occurrence of duplicate data in several places of the database (without a unified reference mechanism between these semantically related occurrences of data). However, analogous to the relational case, the following observations can be made. Documents in collections (just like rows in tables) can, a priori, be created without reference to instances of any other concept and are thus independent concepts. In contrast, dictionary keys represent concepts that depend on their immediate parent key (if applicable) or their parent document (otherwise). Additionally, if the key’s subdocument makes reference to documents in other collections then this creates further dependencies of the key on the referenced collections. If a key only holds a literal value in its subdocument, then we speak of an attribute key, and otherwise of a subdocument key.

Let’s see an example of how this works in practice!

A simple document database

We consider a document database with two collections: Employee and Team. A document in the Employee collection could look like this (note that we indicate document ids with a $ prefix):

{
    _id: $a,
    name: "Ana",
    address: {
        street: "Ant Street",
        city: "Austin"
    }
}

… or like this:

{
    _id: $b,
    name: "Bob",
    address: {
        street: "Bee Street",
        city: "Boston"
    }
}

In order to extract the named concepts from this document collection, note that the given documents contain four distinct keys: name and address, both of which depend on the employee document they live in, as well as street and city, both of which depend on their parent address instance. Note that we ignore _id as a key here, as it merely serves to identify document instances.

Next, we consider documents in the Team collection, which could be of the following form

{
    _id: $e,
    team_name: "Engineering",
    team_member_list: [
        $a
    ]
},
{
    _id: $m,
    team_name: "Marketing",
    team_member_list: [
        $a,
        $b
    ]
}

The documents thus exhibit two new concepts: that of team_name and that of team_member_list. The former depends on the team, represented by the parent document, while the latter depends both on the team as well as a (variadic!) collection of employees.
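To see how the variadic team_member_list dependency can be read off a document, consider the following Python sketch. It turns each team document into a list of (role, player) pairs; the to_relation helper is hypothetical, and document ids are written as plain strings rather than with the $ prefix used above.

```python
# The two Team documents from above, with ids as plain strings.
team_docs = [
    {"_id": "e", "team_name": "Engineering", "team_member_list": ["a"]},
    {"_id": "m", "team_name": "Marketing", "team_member_list": ["a", "b"]},
]


def to_relation(doc):
    """Read a team document as one relation instance with variadic members."""
    role_players = [("team", doc["_id"])]
    role_players += [("member", employee) for employee in doc["team_member_list"]]
    return role_players


for doc in team_docs:
    print(to_relation(doc))
```

Each document yields exactly one relation instance, with one team player and as many member players as the list contains.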

Migrating document data to TypeDB

Just like in the case of relational and graph data, the hard part of migrating document data into the PERA model was to understand the concepts and their dependencies that are implicitly used to organize the data in the model. Having analyzed the concepts of our document database example, we obtain the following PERA schema:

define
employee sub entity,
    owns name,
    plays address:owner,
    plays team_member_list:member;
address sub relation,
    relates owner,
    owns street,
    owns city;
name sub attribute, value string;
street sub attribute, value string;
city sub attribute, value string;

team sub entity,
    owns team_name,
    plays team_member_list:team;
team_member_list sub relation,
    relates team,
    relates member;
team_name sub attribute, value string;

We remark that the above schema is the result of our general translation procedure, and that, for practical purposes, the result can likely be improved in various ways (this also highlights certain shortcomings of the original document database). For example, we may want to record role information for team members individually, which would require a schema more similar to that in our graph database example. Or, we may want to record addresses by attributes of structured value type (a natural feature of the PERA model, and one that is also firmly on the roadmap for TypeDB). Those beauty spots aside, the above schema is a great starting point for our migration!

Having defined the schema implicit to our document database, let us now create data as follows:

insert
$a isa employee,
has name "Ana";
$addr_a (owner: $a) isa address,
has street "Ant Street",
has city "Austin";

$b isa employee,
has name "Bob";
$addr_b (owner: $b) isa address,
has street "Bee Street",
has city "Boston";

$e isa team,
has team_name "Engineering";
(team: $e, member: $a) isa team_member_list;

$m isa team,
has team_name "Marketing";
(team: $m, member: $a, member: $b) isa team_member_list;

Easy! Though, we should add that our remark about the “deconceptualization” in the relational case still applies here as well. This means that just like “join tables”, some collections will, in fact, be better understood as dependent concepts, but this design choice will have to be made on a case-by-case basis.

Adding polymorphism to the picture

In the previous sections we’ve seen detailed dissections of the relational, graph, and document models into their fundamental concepts and concept dependencies. Then, exploiting the fact that the PERA model builds directly on a conceptual model, we were able to easily migrate our data. Our insights from each of these migrations are summarized in the diagrams below: here, for each model, we illustrate its respective terminology for concepts (depicted using boxes) as well as how these concepts may be interdependent on each other (depicted by arrows).

The graphic powerfully illustrates the following important point. Across relational, graph, and document model, we find a successive increase of flexibility in “what can depend on what” (for example, in the graph model edges may have properties, which is not the case for foreign-keys in the relational model). Besides this observation however, the diagrams for all three models do not differ very much in their simple fundamental approach to understanding data via conceptualization and concept dependency.

The PERA model, in contrast, not only further refines these existing models, capturing each part of the above diagrams in their fully general form, it also naturally infuses the diagram with a new idea: abstracting dependencies using interface polymorphism, and organizing concepts using inheritance polymorphism. This represents a further step in the above succession of data models, yielding a polymorphic conceptual data model. We refer the interested reader to our article on the need for polymorphism in databases for a discussion of the immense impact that native, zero-cost polymorphism can have on database engineering!

4. Beyond the basics

The polymorphic conceptual framework provided by the PERA model is a large step up from the existing data modeling paradigms in terms of expressivity. Now, equipped with a novel polymorphic and type-theoretic toolset, we can go even some steps further. To end this article, let us briefly mention two directions of exciting developments enabled by the PERA model.

  1. The type-theoretic underpinnings of the PERA model also lend themselves to design a query language, which led to the inception of TypeQL. Indeed, so far we’ve only seen how to use TypeQL to define schema and insert data into our database. However, TypeQL also naturally acts as a query language for data retrieval, tailored to support polymorphism.
  2. Type theory is a natural framework for working with logical derivations. TypeDB integrates a powerful rule engine, which can be used to construct inferred data and views, and which is also natively accessible through TypeQL.

We now give a small taste of these ideas, but defer more in-depth discussion of either point to the future.

Querying with interface polymorphism

Queries in TypeDB use the same syntax as schema definition and data insertion statements, but all objects, values or types in these statements may now be variablized. As an example, consider the following simple query

match
$eng_team isa team, has team_name "Engineering";
(team: $eng_team, member: $m) isa team_membership;
$m has name $n;
get $m, $n;

Let us inspect the query line by line:

  1. The first line introduces a variable $eng_team of type team. The object(s) represented by that variable are required to have the name "Engineering";
  2. The second line, reusing the variable $eng_team, introduces another variable $m. This variable is required to be (to represent objects that are) in team membership relation with $eng_team.
  3. In the last line of the match clause, we introduce yet another variable, $n, which is a name of $m.

In summary, the query matches and returns all employees in the engineering team and their names!

Now, consider extending our employee-team schema from earlier as follows.

define
contractor sub entity, 
    owns name,
    plays team_membership:member;

This adds to our schema the concept of contractor. Like employees, contractors can own names, and be counted as members of teams. So what happens to our earlier query? It still works exactly the same! This time around, the query matches and returns all employees or contractors in the engineering team and their names!

Our example illustrates one way in which interface polymorphism can be extremely useful. Interfaces often lead to conceptually simpler and more flexible code, e.g. when emulating a foreign key to multiple tables.
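The reason the query keeps working can be modeled in a few lines of Python: the match constrains the role being played, never the type of the player. The data below is made up for illustration (the names ana and carl, and the members_of helper, are hypothetical), and the sketch does not reflect how TypeDB evaluates queries internally.

```python
# Objects and their types; carl is a contractor added after the query was written.
types = {"ana": "employee", "carl": "contractor"}

# team_membership instances, recorded by the roles their players take.
memberships = [
    {"team": "Engineering", "member": "ana"},
    {"team": "Engineering", "member": "carl"},
    {"team": "Marketing", "member": "ana"},
]


def members_of(team_name):
    """Return everyone playing the member role for the given team.

    The player's type is never inspected: any type that plays the
    member role is matched.
    """
    return sorted(m["member"] for m in memberships if m["team"] == team_name)


print(members_of("Engineering"))  # ['ana', 'carl']
```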

Querying with inheritance polymorphism

The question of how to represent and model inheritance is a long-known issue for practical databases. In fact, during the rise of object-oriented programming, it led to the inception of a whole new type of object-oriented databases, though the approach, arguably, turned out to be not simple and “conceptual” enough in order to scale and compete with other database paradigms. Based on its type-theoretic foundations, the PERA model integrates inheritance via subtyping in a simple and pure form.

As an example consider the following variation of our earlier schema of employees and teams from the previous section:

define
person sub entity,
    owns name;
employee sub person,
    owns EmployeeID,
    plays team_membership:member;
director sub employee,
    plays team_leadership:leader;
team_membership sub relation,
    relates member,
    relates team;
team_leadership sub team_membership,
    relates leader as member;
name sub attribute, value string;
EmployeeID sub attribute, value string;

Now we may consider the following query.

match
$e isa employee, has EmployeeID $id;
(team: $t, leader: $e) isa team_leadership;
get $e, $id, $t;

As before, let’s go through the query line by line:

  1. The first line defines $e to be a variable representing an object of type employee, and a variable $id representing an employee ID of the employee $e. Note, since director is a subtype of employee, $e could, in particular, represent a director object.
  2. In the second line, we declare that $e is the leader of a team $t.

The query thus returns all employees $e with ID $id who lead some team $t. Note, based on the inheritance structure defined in our schema, this surely means that the query will only return directors for the variable $e!
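The subtype reasoning used in this query can be modeled as a walk up the type hierarchy. The Python sketch below encodes the entity hierarchy from the schema above in a dictionary; SUPERTYPE and is_subtype are hypothetical names, and the code is only a model of the idea, not of TypeDB's type checker.

```python
# Direct supertypes from the schema above (None marks a root of our model).
SUPERTYPE = {"person": None, "employee": "person", "director": "employee"}


def is_subtype(type_name, ancestor):
    """Return True if type_name equals ancestor or inherits from it."""
    current = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = SUPERTYPE.get(current)
    return False


# `$e isa employee` matches directors as well as plain employees ...
print(is_subtype("director", "employee"))  # True
# ... but not persons in general.
print(is_subtype("person", "employee"))    # False
```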

Rules: inferring new data on-the-fly

A final aspect of the PERA model concerns the ability to reason over data, meaning we may infer new instances of concepts from existing ones. This can be thought of as a (much more general) form of, say, constructing views in the relational model. The PERA model naturally incorporates rules due to its type-theoretic foundations: in fact, in TypeQL, rules use essentially the same syntax as queries or data specifications!

As a short and simple example, consider the following rule:

define
rule director-team-members-are-leader:
when {
    $d isa director;
    (team: $t, member: $d) isa team_membership;
} then {
    (team: $t, leader: $d) isa team_leadership;
};

The rule states simply that, whenever a director is a member of a team, then they will also be considered a leader of that same team.

Reasoning with rules can be sequentially chained or reasoning can branch over multiple derivations of instances. This makes reasoning a powerful tool that can capture complex application logic in concise form!
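To give a feeling for chained reasoning, here is a toy Python sketch that applies rules repeatedly until no new facts appear. It is a generic fixpoint loop written for illustration, with made-up facts and hypothetical function names; it is not a description of how TypeDB's rule engine evaluates rules.

```python
def director_members_are_leaders(known):
    """Model of the rule above: director members of a team also lead it."""
    directors = {f[1] for f in known if f[0] == "isa" and f[2] == "director"}
    return {
        ("team_leadership", f[1], f[2])
        for f in known
        if f[0] == "team_membership" and f[2] in directors
    }


def infer(facts, rules):
    """Apply all rules until a fixpoint is reached, so rules can chain."""
    known = set(facts)
    while True:
        new = {fact for rule in rules for fact in rule(known)} - known
        if not new:
            return known
        known |= new


# Facts are tuples: ("isa", object, type) or (relation, team, player).
facts = {
    ("isa", "dana", "director"),
    ("isa", "eve", "employee"),
    ("team_membership", "Engineering", "dana"),
    ("team_membership", "Engineering", "eve"),
}

inferred = infer(facts, [director_members_are_leaders]) - facts
print(inferred)  # {('team_leadership', 'Engineering', 'dana')}
```

Because infer loops until nothing new is derived, the output of one rule can serve as the input of another, which is the chaining behavior described above.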

Summary

While we have seen that the PERA model is fundamentally compatible with existing database paradigms, the model provides a substantially more elegant and unified approach to the organization of data by concepts. Moreover, it integrates modern polymorphic programming paradigms through its type-theoretic perspective on conceptual modeling. This addresses several pragmatic pain points of modern database engineering, increases application robustness, and reduces maintenance costs. TypeDB implements the PERA model, and comes with an intuitive query language, TypeQL, which makes it easy to get started!
