TypeDB Fundamentals
From Concepts to Types in Databases
Databases are a vast topic and thus it is generally useful to get one’s terminology straight. In particular, when defining the PERA model and when comparing it with other models, we often use terms such as concepts, types, instances, dependencies, inheritance, interfaces, objects, attributes, etc. In this article we will make precise how we should think about each of these terms.
Concepts, instances, dependencies
For data in any database to be effectively used and queried, it needs to be organized in an interpretable way. At the surface level, there seem to be many ways to go about this; and indeed, many different database models co-exist nowadays. However, deeper down, we will find that most models are, in fact, governed by very similar elementary principles. These principles are data conceptualization on one hand and concept dependency on the other.
Categorizing data by concepts
Conceptualization describes the process of introducing named concepts (like “person”, “car”, or “marriage”), which are understood by the user, in order to categorize our data: this means that we identify specific data or data structures as instances belonging to the concepts.
Data depending on other data
Concept dependency describes the relations between the resulting instances and concepts. Some of our concepts might not be interpretable in isolation, but only in the presence of instances of other concepts. Concretely, in order to meaningfully create instances of the former concept, we need to reference instances of the latter concept; we then say that the former concept depends on the latter concept.
A simple example
As a simple example of dependency, consider the concepts of “persons” and of “names” in a relational database set-up: we could choose rows in a Person table to be the data structure that represents instances of a “person”. Names could be represented by a column in that table: each entry in the column is “a name”, i.e., an instance of the name concept. But note that, in order to create an instance of a name (i.e., an entry in the Name column), we need to choose a pre-existing row for that instance to live in. In other words, in this example, the concept of names depends on the persons that own them. This is intuitively clear since, after all, a name should name something.
While the example may seem slightly artificial, it illustrates a general organizational principle found across essentially all database paradigms as we will see.
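The person/name example can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table layout, the id column, and the sample values are assumptions, not something the article prescribes. It shows that a person row can be created on its own, while a name only comes into existence by pointing at a row that already exists.

```python
import sqlite3

# Hypothetical schema: the id column and the sample values are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INTEGER PRIMARY KEY, Name TEXT)")

# Creating a person instance: a new row, with no name required.
person_id = conn.execute("INSERT INTO Person (Name) VALUES (NULL)").lastrowid

# Creating a name instance: we must choose a pre-existing row for it to live in.
conn.execute("UPDATE Person SET Name = ? WHERE id = ?", ("Alice", person_id))

# Pointing at a row that does not exist creates nothing at all.
missing = conn.execute("UPDATE Person SET Name = ? WHERE id = ?", ("Bob", 999))

rows = conn.execute("SELECT id, Name FROM Person").fetchall()
```

Here missing.rowcount is 0, since no name instance could be created without an owning row.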
The type-theoretic perspective
Type systems
In type theory we formally study type systems, comprising types and terms of types, which provide descriptions of specific collections of data, as well as typed programs (or operations), which allow passing data between types. Type theory allows us to make the above discussion of conceptualization and concept dependency precise in simple and direct terms:
- Independent concepts, such as the concept Person, correspond to plain types in our type system;
- Dependent concepts, such as “name(s) of a person x”, correspond to so-called dependent types, often written using functional notation as Name(x).
We will forgo a more in-depth review of type-theoretic ideas here, and it will not be necessary to know type theory to understand the main ideas of the PERA model. Readers interested in a more detailed discussion can have a look at our earlier article on type theory!
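Python has no dependent types, so the following is only a loose approximation of the two kinds of types described above. Person plays the role of a plain type, and the hypothetical Name class stands in for the dependent type Name(x) by requiring a reference to its owner x at construction time. The helper names_of is likewise an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # equality by identity: each Person() is a distinct instance
class Person:
    """A plain (independent) type: instances need no reference to anything."""


@dataclass(frozen=True)
class Name:
    """Stand-in for the dependent type Name(x): the owner x is mandatory."""
    owner: Person
    value: str


def names_of(x, names):
    """Collect the values of all name instances living in Name(x)."""
    return [n.value for n in names if n.owner is x]
```

Calling Name(value="Alice") without an owner raises a TypeError, which mirrors the idea that a name cannot be instantiated in isolation.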
Subtyping
A fundamental operation in type systems is subtyping, which allows us to cast instances of one type into instances of another type. For example, considering a concept (i.e., type) Employee as a subtype of Person means that each instance e of the type Employee can be cast into an instance cast(e) of Person. We usually keep casting operations implicit in our notation, writing the person cast(e) simply as e.
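As a rough analogy, class inheritance in Python behaves like the subtyping described here. In this sketch the function cast is written out explicitly only to mirror the notation cast(e); the helper count_persons is a hypothetical name used for illustration.

```python
class Person:
    """An object type whose instances carry identity only."""


class Employee(Person):
    """A subtype of Person: every employee is also a person."""


def cast(e: Employee) -> Person:
    """Explicit cast from Employee to Person; it returns the same instance."""
    return e


def count_persons(instances) -> int:
    """Count the instances that can be used as a Person."""
    return sum(isinstance(x, Person) for x in instances)
```

Because cast(e) is the very same instance as e, leaving the cast implicit loses no information, which is why the notation can safely drop it.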
Inheritance vs Interface polymorphism
When we cast one concept type into another concept type, as in our example, we speak of inheritance polymorphism. There is also a second, different kind of subtype casting, which concerns concept (i.e., type) dependencies. Namely, for any dependent concept we can abstract the type of the instances that the concept can be instantiated with. For example, for our dependent type Name(x) of “names of x”, you can think of this as introducing an abstract interface type of “name owners” with the purpose of collecting all possible x. Then, one or more concept types could be cast into that abstract type of name owners: for example, person instances x of type Person may be cast into name owners, but so might, say, cities x belonging to some other concept City. Observe how this allows us to conceptualize both “names of persons” and “names of cities” in a unified way! Unlike subtyping one concept by another concept, subtyping abstracted dependencies by concepts in this way will be referred to as interface polymorphism.
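Interface polymorphism can be approximated in Python with a structural typing.Protocol. In this sketch the protocol NameOwner plays the role of the abstract “name owners” type; Person and City are unrelated classes that both fit it, while the hypothetical Car class does not. All class and function names here are assumptions made for illustration.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NameOwner(Protocol):
    """Abstract interface type collecting every x that may own names."""
    names: set


class Person:
    def __init__(self):
        self.names = set()


class City:
    def __init__(self):
        self.names = set()


class Car:
    """Not a name owner in this sketch: it has no names attribute."""


def add_name(x: NameOwner, value: str) -> None:
    """Create a name of x, for any x that can be cast into a name owner."""
    x.names.add(value)
```

Person and City share no common superclass, yet add_name treats “names of persons” and “names of cities” uniformly.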
Object vs. attribute types
There is an important distinction to be made when it comes to the instances of the types in our type system. For example, we think of instances of the type Person and of the type Name(x), for some person x, in fundamentally different ways. Let’s explain this.
Modes of term instantiation
Recall from our earlier example that, in a relational approach, we could think of instances of persons as rows in a Person table, while names may be string values in the Name column of that table. But while we may add any number of new rows to our table, we cannot invent new strings. In other words, person instances can be freely created, while instances of names are drawn from a predefined data type: strings. Strings are, of course, just a particular example chosen here: more generally, we will say value types when referring to predefined data types. These can comprise values such as strings, integers, or booleans, or composites thereof. Note that the distinction between types collecting objects and types collecting values is not just terminological: it implies fundamentally different behavior of these kinds of types:
- If we define that “p is a person who has name 'Foo Bar'” and later define “q is another person who has name 'Foo Bar'”, then we will have created two instances of persons with the same name.
- In contrast, if we define that “p is a person with name 'Foo Bar'” and later define that “p is a person who additionally has the name 'Foo Bar'”, then the second definition will not create anything! In this sense, literal value creation is idempotent.
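The two behaviors can be contrasted in a small in-memory sketch. The Database class below is hypothetical and is not how TypeDB stores data; it only assumes that objects are tracked by identity while name ownerships are kept in a set, so that adding the same ownership twice has no effect.

```python
class Database:
    """Toy store: objects are created freely, attribute ownership is a set."""

    def __init__(self):
        self.persons = []      # objects: every creation adds a new instance
        self.has_name = set()  # (owner, value) pairs: duplicates collapse

    def create_person(self, name):
        p = object()           # a fresh object; no value is needed to create it
        self.persons.append(p)
        self.add_name(p, name)
        return p

    def add_name(self, p, value):
        # Re-adding an existing (owner, value) pair changes nothing.
        self.has_name.add((p, value))
```

Creating two persons named 'Foo Bar' yields two objects and two ownerships, whereas giving the same person the name 'Foo Bar' a second time leaves the store unchanged.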
More examples
Types like, say, Person, Employee, or City, whose instances can be freely created, will be referred to as object types, and their instances will be referred to as objects. In contrast, types like, say, Name or Age, whose instances derive from system-defined types of values, will be referred to as attribute types.
Comparison to object-oriented approach
We note that while our “object” terminology highlights an analogy with the object-oriented case, this analogy should be taken with a grain of salt: for example, our simple type-theoretic approach has no need for intricate constructors of objects (in particular, no values ever need to be supplied to create a new object), nor does there need to be an analog of “methods” for classes in the PERA model. Instead, at the root of TypeDB lies a conceptual data model of entities, relations, and attributes, abstracting and generalizing ideas, e.g., of relational databases. Indeed, we will later describe the kinship of the PERA, relational, graph, and document models, which are all fundamentally distinct from the object-oriented approach.
Design paradigms: a brief overview
At this point, we need to make an important, practice-oriented remark. In modeling, the function of concepts and dependencies is not always “black and white”. Namely, concept dependencies themselves can be conceptualized—or conversely, we may turn a concept into a dependency.
Conceptualizing concept dependencies
Let us illustrate how this works. Consider the concepts of Employee and Team. We are faced with two options when modeling these concepts:
- We could decide that the concept of a team, in fact, can only make sense if we make explicit reference to the set of employees which constitute the team. In this case, Team would be the concept dependent on that of Employee, and we could thus speak of the concept of “team(s) with specific employees a, b, and c”.
- Alternatively, we could conceptualize the aforementioned dependency of teams on employees. This first means that Team becomes an independent concept, i.e., team instances can now be created without reference to employees. Second, however, we make our earlier dependency of teams on employees into its own dependent concept: let’s call this the concept of Team_Membership. Note that Team_Membership now depends both on the concept of Employee and on that of Team. To create an instance of team membership, we must reference an employee e in a team t.
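The two modeling options can be placed side by side in a short Python sketch. The class names DependentTeam and TeamMembership and the helper move are assumptions chosen for illustration; they are not TypeDB constructs.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Employee:
    """An independent object type."""


# Option 1: the team depends on its employees and cannot exist without them.
@dataclass(eq=False, frozen=True)
class DependentTeam:
    members: frozenset  # mandatory reference to the constituting employees


# Option 2: the team is independent; the dependency becomes its own concept.
@dataclass(eq=False)
class Team:
    """Can be created without referencing any employee."""


@dataclass(eq=False, frozen=True)
class TeamMembership:
    employee: Employee  # depends on an employee e ...
    team: Team          # ... and on a team t


def move(memberships, employee, new_team):
    """Switch an employee to another team without touching any Team instance."""
    kept = [m for m in memberships if m.employee is not employee]
    return kept + [TeamMembership(employee, new_team)]
```

Under option 2, moving an employee only replaces a membership instance, whereas under option 1 a change of members means a different DependentTeam instance.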
Pros and cons of conceptualization
Both options are, a priori, acceptable ways to model the involved concepts. However, in specific situations, one could imagine arguments for choosing one over the other. For example, employees might frequently switch between teams, and we may not want to update a team instance every time its set of members changes. Or it could be very uncommon for the concept “team of a specific set of employees a, b, and c” to ever have more than one instance, so that concept might ultimately not be that useful or interesting.
These arguments illustrate reasons for favoring option 2 over option 1. However, the situation could easily be reversed: for example, if we replace “Team of employees” by a “Marriage of persons” in our example, then we might in fact never want to think of a marriage as remaining the same instance if we switch out its participants, and we might always want to create a marriage instance with reference to its spouses. Moreover, some spouses marry multiple times, in which case we might actually want to record multiple marriage objects m1, m2, … in the same type “marriage between $p1 and $p2”.
The type-theoretic paradigm
The following rule of thumb summarizes the key observation from a type-theoretically inspired perspective.
If, for a concept in our application, instances should never be created without reference to other objects, then make the concept dependent! Otherwise, it is preferable to make the concept independent, and conceptualize possible connections to other concepts.
The hypergraph paradigm
While the above rule of thumb is, in a way, idiomatic to the PERA model and TypeDB, it should be considered an advanced data modeling technique. Indeed, for many day-to-day engineering tasks, a “flat” model design provides a simpler approach to organizing data. In a flat model, dependencies cannot be nested. As a result, we only have two layers of dependencies: the independent types (a.k.a. the nodes) and the dependent types (a.k.a. the hyper-edges), which may reference (and thus link) other nodes. This paradigm is captured by the following rule of thumb:
If, for a concept in our application, instances should never be deleted when another instance is deleted, then take this concept to be a node type in our database. In contrast, if its instances should be deleted whenever certain other instances are deleted, then make it a hyper-edge type (in other words, hyper-edges cannot have “dangling endpoints”: if you delete the endpoints, you need to delete the hyper-edge).
In the hypergraph paradigm, we would of course want to avoid having edges between edges. But such higher-order dependencies can be represented in the PERA model, and can come in handy for the advanced modeler.
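The “no dangling endpoints” rule of the flat paradigm can be sketched as a tiny in-memory hypergraph. The Hypergraph class and its method names are hypothetical; the sketch assumes that hyper-edges may only reference nodes (never other edges) and that deleting a node cascades to every hyper-edge touching it.

```python
class Hypergraph:
    """Flat model: independent nodes plus hyper-edges that reference nodes."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}  # edge label -> frozenset of endpoint nodes

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, label, endpoints):
        endpoints = frozenset(endpoints)
        # A hyper-edge is dependent: all of its endpoints must already exist.
        if not endpoints <= self.nodes:
            raise ValueError("hyper-edge references unknown nodes")
        self.edges[label] = endpoints

    def delete_node(self, node):
        self.nodes.discard(node)
        # Cascade: drop every hyper-edge that would be left dangling.
        self.edges = {k: v for k, v in self.edges.items() if node not in v}
```

Deleting a hyper-edge never removes a node, which is the asymmetry the rule of thumb uses to tell node types from hyper-edge types.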
Summary
Concepts and concept dependency are key ingredients in knowledge and data representation. In this article we’ve learned how types provide a (mathematically formal) way of capturing and working with concepts. The resulting approach turns out to be highly flexible, supporting the integration of subtypes and polymorphism, as well as the possibility of working with different kinds of data modeling paradigms. This type-theoretic perspective provides the foundation on which we build TypeDB, its data model, and its query language.