From Concepts to Types in Databases


Databases are a vast topic and thus it is generally useful to get one’s terminology straight. In particular, when defining the PERA model and when comparing it with other models, we often use terms such as concepts, types, instances, dependencies, inheritance, interfaces, objects, attributes, etc. In this article we will make precise how we should think about each of these terms.

Concepts, instances, dependencies

For data in any database to be effectively used and queried, it needs to be organized in an interpretable way. At the surface level, there seem to be many ways to go about this; and indeed, many different database models co-exist nowadays. However, deeper down, we will find that most models are, in fact, governed by very similar elementary principles. These principles are data conceptualization on one hand and concept dependency on the other.

Categorizing data by concepts

Conceptualization describes the process of introducing named concepts (like “person”, “car”, or “marriage”), which are understood by the user, in order to categorize our data: this means that we identify specific data or data structures as instances belonging to the concepts.

Data depending on other data

Concept dependency describes the relations between the resulting instances and concepts. Namely, a concept might not be interpretable in isolation, but only in the presence of instances of another concept. Concretely, this means that in order to meaningfully create instances of the former concept, we need to reference instances of the latter. We then say that the former concept depends on the latter.

A simple example

As a simple example of dependency, consider the concepts of “persons” and of “names” in a relational database setup. We could choose rows in a Person table to be the data structure that represents instances of a “person”. Names could be represented by a column in that table: each entry in the column is “a name”, i.e., an instance of the name concept. Note, however, that in order to create an instance of a name (i.e., an entry in the Name column) we need to choose a pre-existing row for that instance to live in. In other words, in this example, the concept of names depends on the persons that own them. This is intuitively clear since, after all, a name should name something.
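The dependency of names on persons can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the table layout, the `id` column, and the sample values are hypothetical and not taken from any TypeDB schema.

```python
import sqlite3

# Hypothetical schema for illustration: one row per person, one Name column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, Name TEXT)")

# A person instance (a row) can be created without referencing anything else.
cur = conn.execute("INSERT INTO Person (Name) VALUES (NULL)")
person_id = cur.lastrowid

# A name instance (a column entry) needs a pre-existing row to live in.
conn.execute("UPDATE Person SET Name = ? WHERE id = ?", ("Foo Bar", person_id))

# Aiming a name at a row that does not exist creates nothing.
orphan = conn.execute("UPDATE Person SET Name = ? WHERE id = ?", ("Nobody", 999))
orphan_rowcount = orphan.rowcount

rows = conn.execute("SELECT id, Name FROM Person").fetchall()
```

After running the sketch, the table holds a single person row carrying the name, and the update aimed at the missing row has affected no rows.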

While the example may seem slightly artificial, it illustrates a general organizational principle found across essentially all database paradigms as we will see.

The type-theoretic perspective

Type systems

In type theory we formally study type systems, comprising types and terms of types, which provide descriptions of specific collections of data, as well as typed programs (or operations), which allow passing data between types. Type theory allows us to make the above discussion of conceptualization and concept dependency precise in simple and direct terms:

  1. Independent concepts, such as the concept Person, correspond to plain types in our type system;
  2. Dependent concepts, such as “name(s) of a person x”, correspond to so-called dependent types, often written using functional notation as Name(x).
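Python has no dependent types, but the two cases above can be loosely approximated with classes. In the sketch below (all class and field names are assumptions made for illustration), `Person` plays the role of a plain type, while `Name` can only be instantiated by referencing a person, mimicking Name(x).

```python
from dataclasses import dataclass


class Person:
    """Independent concept: instances are created without any reference."""


@dataclass(frozen=True)
class Name:
    """Stand-in for the dependent type Name(x): the owner x is mandatory."""
    owner: Person
    value: str


x = Person()
name_of_x = Name(owner=x, value="Foo Bar")

# Trying to create a name without the person it depends on fails.
try:
    Name(value="Nobody")
    created_without_owner = True
except TypeError:
    created_without_owner = False
```

The required `owner` field is what encodes the dependency here: a name cannot exist in this model unless some person x is supplied.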

We will forgo a more in-depth review of type-theoretic ideas here, and it will not be necessary to know type theory to understand the main ideas of the PERA model. Readers interested in a more detailed discussion can have a look at our earlier article on type theory!

Subtyping

A fundamental operation in type systems is subtyping, which allows us to cast instances of one type into instances of another type. For example, considering a concept (i.e., type) Employee as a subtype of Person means that each instance e of the type Employee can be cast into an instance cast(e) of Person. We usually keep casting operations implicit in our notation, writing the person cast(e) simply as e.

Inheritance vs. interface polymorphism

When casting one concept type into another concept type, as in our example, we speak of inheritance polymorphism. There is also a second, different kind of subtype casting, which concerns concept (i.e., type) dependencies. Namely, for any dependent concept we can abstract the type of the instances that the concept can be instantiated with. For example, for our dependent type Name(x) of “names of x”, you can think of this as introducing an abstract interface type of “name owners” with the purpose of collecting all possible x. Then, one or more concept types could be cast into that abstract type of name owners: for example, person instances x of type Person may be cast into name owners, but so might, say, cities x belonging to some other concept City. Observe how this allows us to conceptualize both “names of persons” and “names of cities” in a unified way! Unlike subtyping one concept by another concept, subtyping abstracted dependencies by concepts in this way will be referred to as interface polymorphism.
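The two kinds of subtyping can be told apart in a small Python sketch. It is an analogy only, with hypothetical class names: ordinary subclassing stands in for inheritance polymorphism, and registering classes with an abstract base class (the standard library's `ABC.register`) stands in for casting concept types into the abstract interface type of name owners.

```python
from abc import ABC


class NameOwner(ABC):
    """Abstract interface type: collects every x that Name(x) may depend on."""


class Person:
    pass


class Employee(Person):  # inheritance polymorphism: Employee is a Person
    pass


class City:
    pass


# Interface polymorphism: both concept types can be cast into name owners.
NameOwner.register(Person)
NameOwner.register(City)


class Name:
    def __init__(self, owner, value):
        if not isinstance(owner, NameOwner):
            raise TypeError("owner must be a name owner")
        self.owner = owner
        self.value = value


e = Employee()
names = [Name(e, "Foo Bar"), Name(City(), "London")]
owner_kinds = [type(n.owner).__name__ for n in names]
```

An employee counts as a name owner because it is implicitly cast to a person first, so “names of persons” and “names of cities” are handled by the same `Name` class.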

Object vs. attribute types

There is an important distinction to be made when it comes to the instances of the types in our type system. For example, we think of instances in the type Person and the type Name(x) for some person x in a fundamentally different way. Let’s explain this.

Modes of term instantiation

Recall from our earlier example that, in a relational approach, we could think of instances of persons as rows in a Person table, while names may be string values in the Name column of that table. But while we may add any number of new rows to our table, we cannot invent new strings. In other words, person instances can be freely created, while instances of names are drawn from a pre-defined data type: strings. Strings are, of course, just a particular example chosen here: more generally, we will say value types when referring to pre-defined data types. These can comprise values such as strings, integers, or booleans, or composites thereof. Note that the distinction between types collecting objects and types collecting values is not just terminological. It implies fundamentally different behavior of these kinds of types:

  1. If we define that “p is a person who has name 'Foo Bar'” and later define “q is another person who has name 'Foo Bar'” then we will have created two instances of persons, with the same name.
  2. In contrast, if we define that “p is a person with name 'Foo Bar'” and later define that “p is a person who additionally has the name 'Foo Bar'”, then the second definition will not create anything! In this sense, literal value creation is idempotent.
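The two behaviors can be imitated in a few lines of Python. The sketch assumes, purely for illustration, that ownership facts are kept in a set of (owner, attribute type, value) triples; this is not a description of how TypeDB stores data.

```python
class Person:
    """Freely creatable: every call yields a new, distinct instance."""


# Hypothetical store of ownership facts as (owner, attribute type, value).
has = set()

p = Person()
q = Person()

# Case 1: two persons, each with the name 'Foo Bar'.
has.add((p, "name", "Foo Bar"))
has.add((q, "name", "Foo Bar"))
facts_after_two_persons = len(has)

# Case 2: stating again that p has the name 'Foo Bar' creates nothing new.
has.add((p, "name", "Foo Bar"))
facts_after_repeat = len(has)
```

Creating `q` adds a second person even though it shares its name with `p`, while repeating the name fact for `p` leaves the store unchanged.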

More examples

Types like, say, Person, Employee, or City, whose instances can be freely created will be referred to as object types, and their instances will be referred to as objects. In contrast, types like, say, Name or Age, whose instances derive from system-defined types of values, will be referred to as attribute types.

Comparison to object-oriented approach

We note that while our “object” terminology highlights an analogy with the object-oriented case, this analogy should be taken with a grain of salt: for example, in our simple type-theoretic approach we will have no need for intricate constructors of objects (in particular, no values ever need to be supplied to create a new object), nor does there need to be an analog of “methods” for classes in the PERA model. Instead, at the root of TypeDB lies a conceptual data model of entities, relations, and attributes, abstracting and generalizing ideas, e.g., of relational databases. Indeed, we will later describe the kinship of the PERA, relational, graph, and document models, which are all fundamentally distinct from the object-oriented approach.

Design paradigms: a brief overview

At this point, we need to make an important, practice-oriented remark. In modeling, the function of concepts and dependencies is not always “black and white”. Namely, concept dependencies themselves can be conceptualized—or conversely, we may turn a concept into a dependency.

Conceptualizing concept dependencies

Let us illustrate how this works. Consider the concepts of Employee and Team. We are faced with two options when modeling these concepts:

  1. We could decide that the concept of a team, in fact, can only make sense if we make explicit reference to the set of employees which constitute the team. In this case, Team would be the concept dependent on that of Employee, and we could thus speak of the concept of “team(s) with specific employees a, b, and c”.
  2. Alternatively, we could conceptualize the aforementioned dependency of teams on employees. This first means that Team becomes an independent concept, i.e., team instances can now be created without reference to employees. Second, however, we make our earlier dependency of teams on employees into its own dependent concept: let’s call this the concept of Team_Membership. Note, Team_Membership now depends both on the concept of Employee and on that of Team. To create an instance of team membership, we must reference both an employee e and a team t.
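The two options can be contrasted in a short Python sketch. The class names `TeamOf` and `TeamMembership` are hypothetical stand-ins chosen for this illustration, and Python classes only approximate the dependent types discussed above.

```python
from dataclasses import dataclass


class Employee:
    """Independent concept in both options."""


# Option 1: a team can only be created by referencing its set of employees.
@dataclass(frozen=True, eq=False)
class TeamOf:
    members: frozenset


# Option 2: Team is independent; the dependency becomes its own concept.
class Team:
    """Can be created without reference to any employee."""


@dataclass(frozen=True, eq=False)
class TeamMembership:
    employee: Employee
    team: Team


a, b, c = Employee(), Employee(), Employee()

# Option 1: a change of members means a new instance over a different set.
team_abc = TeamOf(frozenset({a, b, c}))
team_ab = TeamOf(frozenset({a, b}))

# Option 2: the team instance stays the same while memberships come and go.
team = Team()
memberships = [TeamMembership(e, team) for e in (a, b, c)]
memberships = [m for m in memberships if m.employee is not c]
```

In option 2, removing c from the team only removes one membership instance, whereas in option 1 the team over a, b, and c has to be replaced by a different instance over a and b.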

Pros and cons of conceptualization

Both options are, a priori, acceptable ways to model the involved concepts. However, in specific situations, one could imagine arguments for choosing one over the other. For example, employees might frequently switch between teams, and we may not want to update a team instance every time its set of members changes. Or, it could be very uncommon for the concept “team of a specific set of employees a, b, and c” to ever have more than one instance, so it might ultimately not be that useful or interesting.

These arguments illustrate reasons for favoring option 2 over option 1. However, the situation could easily be reversed. For example, if we replace “Team of employees” by a “Marriage of persons” in our example, then we might in fact never want to think of a marriage as remaining the same instance if we switch out its participants, or we might always want to create a marriage instance with reference to its spouses. Moreover, some spouses marry each other multiple times, in which case we might actually want to record multiple marriage objects m1, m2, … in the same type “marriage between $p1 and $p2”.

The type-theoretic paradigm

The following rule of thumb summarizes the key observation from a type-theoretically inspired perspective.

If, for a concept in our application, instances should never be created without reference to other objects, then make the concept dependent! Otherwise, it is preferable to make the concept independent, and conceptualize possible connections to other concepts.

The hypergraph paradigm

While the above rule of thumb is, in a way, idiomatic to the PERA model and TypeDB, it should be considered an advanced data modeling technique. Indeed, for many day-to-day engineering tasks, a “flat” model design provides a simpler approach to organizing data. In a flat model, dependencies cannot be nested. As a result, we only have two layers of dependencies: the independent types (a.k.a. the nodes) and the dependent types (a.k.a. the hyper-edges), which may reference (and thus link) nodes. This paradigm is captured by the following rule of thumb:

If, for a concept in our application, instances should never be deleted when another instance is deleted, then take this concept to be a node type in our database. In contrast, if its instances should be deleted when instances of other concepts are deleted, then make it a hyper-edge type (in other words, hyper-edges cannot have “dangling endpoints”: if you delete an endpoint, you need to delete the hyper-edge).
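The “no dangling endpoints” rule can be sketched as a toy in-memory hypergraph in Python. The class `FlatHypergraph`, its methods, and the sample labels are all assumptions made for this illustration; they do not correspond to any TypeDB API.

```python
class FlatHypergraph:
    """Toy store with two layers only: nodes and hyper-edges over nodes."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}  # edge label -> frozenset of endpoint nodes

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, label, endpoints):
        endpoints = frozenset(endpoints)
        # Flat model: endpoints must be existing nodes, never other edges.
        if not endpoints <= self.nodes:
            raise ValueError("every endpoint must be an existing node")
        self.edges[label] = endpoints

    def delete_node(self, node):
        self.nodes.discard(node)
        # No dangling endpoints: drop each hyper-edge that referenced the node.
        self.edges = {
            label: eps for label, eps in self.edges.items() if node not in eps
        }


g = FlatHypergraph()
for n in ("alice", "bob", "team-1"):
    g.add_node(n)
g.add_edge("membership-1", ["alice", "team-1"])
g.add_edge("membership-2", ["bob", "team-1"])

# Deleting a node cascades to the hyper-edges it took part in.
g.delete_node("alice")
remaining_edges = sorted(g.edges)
```

Deleting the node for alice removes her membership edge but leaves the other nodes, and the edge between bob and the team, untouched. Because `add_edge` only accepts existing nodes as endpoints, an edge between edges cannot be expressed in this flat sketch.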

In the hypergraph paradigm, we would of course want to avoid having edges between edges. But such higher-order dependencies can be represented in the PERA model, and can come in handy for the advanced modeler.

Summary

Concepts and concept dependency are key ingredients in knowledge and data representation. In this article we’ve learned how types provide a (mathematically formal) way of capturing and working with concepts. The resulting approach turns out to be highly flexible, supporting the integration of subtypes and polymorphism, as well as the possibility of working with different kinds of data modeling paradigms. This type-theoretic perspective provides the foundation on which we build TypeDB, its data model, and its query language.
