What it means to be more strongly-typed than SQL and NoSQL
Unlike existing database paradigms, TypeDB is designed based on modern ideas from type theory. If you want to learn how this works in detail, you may be interested in our article on type-theoretic databases in our learning section. In this post, we will discuss a crucial difference between TypeDB and other database paradigms, which is a direct result of these type-theoretic foundations of TypeDB. But don’t worry, we won’t assume any familiarity with type theory and will keep our discussion as high level as possible!
As a warm-up we will give a brief overview of various database paradigms and how these relate and organize. We then recall and discuss some of the key ideas from type theory and conceptual modeling that led to the design of TypeDB. Finally, this will set the stage for understanding our main claim in this post: that TypeDB is more strongly-typed than other popular databases!
A common language for data models
A big challenge in the field of database design is the large variety of data that we encounter on a daily basis. For this reason, many different data models, each with their own specific modeling and querying paradigms, have emerged over the past decades. Among the most prominent such models, one often finds the following mentioned:
- The relational model. In the relational model we group data into tuples of relations, which can then be queried by a set of basic operations that manipulate relations. SQL is a query language standard implementing this paradigm, and is in many ways extending the original expressive strength of the model.
- The triplestore model. The triplestore (or RDF) data model is based on the idea of storing representations of natural language statements of form “subject-predicate-object”. The dominant query language, SPARQL, takes much inspiration from SQL… after all, triples are just specific relations.
- The graph model. Thinking about triples as the edges of a graph, the graph model conceptually extends the triplestore model by many useful graph operations, including e.g. transitive graph closures which cannot be easily expressed in standard SQL (as a trade-off, the graph model usually drops direct support of other powerful features of triplestore, like iterated reification).
- The key/value model. The retrieval of values by keys is among the simplest database paradigms and doesn’t really comprise an interesting ‘data model’ in itself (i.e. a schematic organization of data). It is nonetheless pervasive in many different data storage systems: for instance, in vector databases we can query for approximate value vectors by key vectors relative to a pre-defined embedding map.
- The document model. The document model extends the simple storage of key/value pairs by allowing for key/value mappings to themselves become values associated to a key. The result are data formats like XML or JSON, and these are implemented by several popular database.
TypeDB falls into none of the above categories: instead, it is based on its own “polymorphic conceptual” data model that, in particular, is meant to provide a common generalization of many of the above models. (Note: this stands in contrast to existing “multi-model” approaches, in which multiple different data models are implemented in parallel.) Indeed, conceptual data models are among the most expressive, high-level data models, and they often provide numerous ways of describing how data relates and is structured. For this reason, conceptual data models still feature centrally in database theory and are often considered a “very important phase in designing a successful database application” [Chapter 3, “Fundamentals of database systems”, 2016]. Famous examples of conceptual data models include Chen’s entity-relation model (ER model), and Abiteboul-Hull’s generic semantic model (GSM).
Unfortunately, a frequent issue with high-level conceptual data models is their lack of practical mathematical formalizations. In the reliability-sensitive field of database theory we need a workable theory to make correctness guarantees for query planning, optimization and transactionality (in contrast, lack of careful theoretical verification can lead to serious issues that frequently affect NoSQL databases in particular). This is where type theory comes into play, a sub-field of mathematical logic. TypeDB is built around the realization that mathematical ideas central to type theory are very well-suited to make precise the informal notions from classical conceptual data modeling. Along this line of thought, the topic of “type-theoretic databases” has recently gained much traction in both academia and industry.
So: what is exactly is the role of type theory for TypeDB, and how does it relate to conceptual data models more generally?
From variables to type theory
TypeDB’s design philosophy centers around the variabilization of natural language. Consider the sentence:
- Today I ate an apple.
A variabilized version of this statement could, for instance, read:
- Today I ate an
Variables represent information unknown to us. However, the space of all information is intractably big, and so it makes sense to constrain the unknown further: this is the purpose of types. Together, variables and types make up the glue of much of mathematics: this begins, for instance, with elementary equational statements such as “solve
x^n + y^n = z^n for positive integers
x, y, z, n where
n > 2”: here, both the constraints “positive integer” and “
> 2” describe the types of the variables at issue.
In TypeDB, variables and types fulfill a very similar purpose: we use types to enable the user to specify which operations are possible (for instance, which types can play which roles in other types, or which types may have which attributes associated to them), and then use composite type constraints to build queries and reason about variables. Importantly, for the purposes of efficient data modeling we found that we do not need the full power of existing type theories: for example, much of every day reasoning happens at the level of first-order functional dependencies and so higher-order functions are absent in TypeDB. These choices guarantee optimizability and performance of the database. The following slogan summarizes our approach:
TypeDB follows a type-theoretic conceptual data model enhanced with type polymorphism.
TypeDB’s type theoretical conceptual modeling primitives
Let us briefly go into more detail about what me mean by designing a language close to the “classical primitives of conceptual data modeling” while at the same time taking inspiration from type theory. Our main conceptual data model, for reference, will be Chen’s entity-relation model. Common primitives of ER-modeling include, for instance, entitites, relations, and attributes. One way to motivate these primitives is by their usage in natural language. Importantly, however, they may alternatively be motivated from a purely type-theoretic perspective as well.
- Entity types are a class of types which (as types) do not depend on any other types, and whose terms are non-composite expressions (or ‘structureless’ constants, if you wish). Entities must therefore be identified by the information that they are connected to, i.e. by their relations to other data and by their attribute values.
- Relation types are, like entity types, types whose terms are non-composite expressions. However, unlike entity types, they may depend on other types. The type
marriage(s) of $x and $yis an example very much in the spirit of our earlier discussion of variabilization: the variables
$yare unknowns that the type of
marriage(s)depends on. The functional programmers among us may know that binary relations are generalized by so-called profunctors (as used e.g. in Haskell), and this is precisely the right way to think about relation types in TypeDB (generalized to the
n > 0).
- Value types, unlike entities or relations, have structure. For example, numbers have structure (e.g. addition is a structure, making e.g.
3 = 2 + 1) and so do strings: a term
'abc'in the type of strings is precisely the composite of its individual characters. The presence of structure means that value types have meaningful (namely, structure-preserving) interpretations across different systems: put plainly, the concept of numbers makes sense across many different systems, and no one is surprised that most programming languages support it.
- Attribute types, like relations, are dependent types but their terms are required to be values (i.e. terms sourced from value types); as an example, consider the type
names of $x. (Note, while theoretically we could have attributes depending on multiple other types, like e.g. the binary attribute
measured distance between $x and $y, in TypeDB we made the choice of letting attributes only depend on at most one type, called the attribute’s owner: this avoids ambiguity, since n-ary attributes may also be modeled as attributes of n-ary relations, say, a
lengthattribute of a binary relation
path between $x and $y.)
There are many other aspects of conceptual modeling that can be type-theoretically addressed in a similar way (including functional dependencies, cardinality constraints, key attributes, and many more). Moreover, and importantly, the query language TypeQL of TypeDB also adds additional powerful features that are motivated not by the theory of conceptual data models but from a point of view of type theory: these, for example, include formal modes of data deduction as well as type-polymorphic querying, which we will now discuss in more detail.
Everything has a type
The term strongly-typed is, in general, not very precisely defined. Usually, one finds definitions along the following lines: a language is strongly typed if at the time of compilation (or interpretation) each expression, i.e. each variable or composite expression, is given a well-defined type. One issue (of several) with this definition is the notion of expressions. For example, while the expression
if A then True else False would likely count as a typable expression (you may agree the type is
Boolean), the expression
while A do B is harder to give a type to, due to its fundamentally imperative nature. In contrast, in the world of type theory (and this also extends to functional programming languages) essentially everything can be understood to have a type, even entire programs. While we are on this topic, here is one of our favorite quotes (by Ulrik Buchholtz, professor in functional programming at Nottingham University).
To be is to be the element of a type.Ulrik Buchholtz
Following this mantra, instead of our earlier definition, one may informally define “strong(er) typing” to mean the following: we say a language is (more) strongly-typed if (more of) its syntactic components can be given types, either implicitly or explicitly. This interpretation of the term ‘strongly-typed’ is, in particular, closely related to the usage of the word in the context of interpreting one system in another type system.
Hold on for a second, everything has a type? What about types, do types have types? Yes, even types have types; in fact, this is a crucial feature of modern type theories allowing for mathematics to ‘make statements about itself’. And, for us in TypeQL, the same holds true. By allowing for essentially all components of our languages to be variabilized, we obtain a much more expressive class of queries that can ask questions about the database structure itself. For example, consider the statement:
$eventis a marriage between wife Andrea and wife Beatrice.
In TypeQL, all parts of this statement can be variabilized (and in each case this turns the statement into an appropriate query); for instance we could variabilize the above as:
$eventis a marriage between
$roleAndrea and wife Beatrice.
(As an aside: while the above are variabilized statements in natural language, you will find that the actual TypeQL code doesn’t look that different at all!)
In contrast to the above, type-theoretic design principles play a rather sub-ordinate role in most other existing data models and their query languages. For instance, in the relational model, relation tuples (i.e. rows) are given types, namely the “type of their relation” (i.e. of their table). In particular, unary tuples (i.e. rows with a single column) are usually typed with a datatype specified for that column. So, if we only care about relation tuples as expressions then it is true that SQL qualifies as a ‘strongly-typed’ language. But there are, of course, many other components to SQL as a language, and most of those components cannot be usefully treated as typed. Thus, they cannot be variabilized or queried for. As a example related to the above, we cannot have variables that represent columns in tables themselves, and so we cannot query, say, for an attribute that fulfils specific conditions:
Andrea has an
$attributethat contains the string
To repeat: in SQL we cannot write a query of the above form, which would have to return columns of, say, some table
person with row
Andrea. This is one of several ways in which TypeQL is strictly more expressive than SQL. Similar remarks apply to most other data models. Of course, one should always be careful to not compare apples and oranges, and often systems are designed for very domain-specific purposes without the goal of creating a highly generalizable query language. But importantly, the approach taken by TypeQL outlined above, makes it possible for many other data models to be naturally subsumed in it.
This concludes our brief discussion of the “very strong” typing in TypeQL. As a query language, TypeQL aims to strike a new powerful balance between intuivity, generalizability, and optimizability, and this must include the ability to variabilize types. We built TypeQL firmly on the principle: everything has a type, and so everything can be a variable.