TypeDB Blog


The need for subtyping and polymorphism in databases

Shane Johnson


In this article, we’ll explain how a new revolution in databases overcomes the limitations of relational, document and graph databases by using subtyping and polymorphism — core object-oriented (OO) programming principles — to create higher abstractions and achieve greater expressivity.

This revolution grants developers and engineers the ability to create semantically meaningful data models and write concise, powerful queries. It’s built on a fundamentally new system of modeling and querying data that embraces core developer beliefs, makes working with databases a joy and opens up entirely new realms of possibilities.

Specifically, it’s designed to accelerate development and unblock innovation via:

  • Intuitive yet extensible data models,
  • Precise yet adaptive queries.

Modern applications require expressive building blocks

The more expressive a programming language is, the easier it is to use — and the more productive we are when using it. As developers, we prefer languages that allow us to express any intent (e.g. solution), and to do so in a clean, clear and concise manner. It’s why we choose languages such as Java, TypeScript, Go, Kotlin and Rust to build modern applications. Their support for strong typing and core OO programming principles provides us with the expressivity needed to solve difficult problems.

If we were building a PaaS, we would need to model repositories. We would not create a HashMap, convert all of a repository’s properties to strings, and then put them in the HashMap. No, we would create an abstract Repository class with two concrete implementations: GitHubRepository and GitLabRepository. These classes are types, and types provide us with abstraction, inheritance and polymorphism.
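To make that concrete, here is a minimal sketch of the idea in TypeScript (the method and property names, like cloneUrl, are invented for illustration and not taken from any real PaaS SDK):

```typescript
// Illustrative sketch: an abstract base type with two concrete subtypes.
abstract class Repository {
  constructor(public readonly owner: string, public readonly name: string) {}
  // Each subtype supplies its own implementation: polymorphism at work.
  abstract cloneUrl(): string;
}

class GitHubRepository extends Repository {
  cloneUrl(): string {
    return `https://github.com/${this.owner}/${this.name}.git`;
  }
}

class GitLabRepository extends Repository {
  cloneUrl(): string {
    return `https://gitlab.com/${this.owner}/${this.name}.git`;
  }
}

// Callers work against the abstraction, not the concrete types.
const repos: Repository[] = [
  new GitHubRepository("typedb", "typedb"),
  new GitLabRepository("example", "project"),
];
const urls = repos.map((r) => r.cloneUrl());
```

Because callers depend only on the abstract Repository type, adding a BitbucketRepository later requires no changes to the code that consumes repositories.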

When we take advantage of types in expressive programming languages, our code becomes much more compact and flexible — and of course, easier to interpret. There’s nothing more enjoyable than solving a difficult problem with an elegant solution and some beautiful code.

Conventional databases don’t recognize how we code

MongoDB provides us with flexibility, but anything beyond basic CRUD operations requires queries that are not only obscenely verbose, but incredibly fickle. The slightest mistake, if we even notice it, could cost us hours.

Postgres is a fantastic relational database. However, it’s still a relational database. It can’t capture the intricacy of complex domains, and SQL is not nearly as declarative as one would think. In fact, there’s little worse than troubleshooting a query with hundreds of lines of SQL (or writing one).

Neo4j helped bring graph databases to the mainstream. There is an incredible amount of value in highly interconnected data. However, it’s limited to binary relationships and is schemaless. It always comes up short the moment you need to model complex environments and processes.

By describing data as rows in a table, documents in a collection or nodes in a graph, conventional databases tell us next to nothing about the data — or how to interpret it.

The problem is that conventional databases like MongoDB, Postgres and Neo4j, while popular and broadly adopted, lack a useful type system. It makes no difference whether they’re relational, document or graph databases. They have no idea that abstraction, inheritance and polymorphism are fundamental capabilities we’ve come to expect from programming languages. The value of core OO principles like these is lost on them.

And there’s nothing we can do about it. So, we live with it.

Accepting limitations slows down new development

As a result, we’ve had to endure the painful consequences of building applications on databases that don’t share the same values as developers and engineers. While we hold our application code to a high standard — beautiful, elegant, a testament to our pursuit of perfection — we’ve come to accept a lower bar for data models and queries.

And this compromise leads to painful consequences in the long run:

  • Brittle foundation — We try to avoid extending the data model because doing so not only requires changes to the schema, but to queries as well — and thus our application code. The more rigid and stale our data model becomes, the more we fear breaking it.
  • Blurry reference — We create a generic data model in order to avoid application code changes. However, in doing so, it no longer provides us with a faithful representation of the domain — forcing us to build powerful applications on low-resolution data models.
  • Bloated code — We end up with complex data access tiers because while our applications use meaningful types, we have to build on primitive access to generic data. Thus, we end up writing long, complicated SQL queries or custom data access and transformation code.
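The last point is worth illustrating. A data-access tier built on generic storage ends up re-deriving types at runtime, as in this hypothetical TypeScript sketch (all field and type names here are made up for illustration):

```typescript
// Hypothetical generic row, as it might come back from a driver:
// every value stringly-typed, the artifact's kind stored as data.
type Row = Record<string, string>;

interface Binary { kind: "binary"; filename: string }
interface Archive { kind: "archive"; filename: string }

// The transformation layer the text describes: application code must
// re-derive types that the database never knew about.
function toArtifact(row: Row): Binary | Archive {
  switch (row["type"]) {
    case "binary":
      return { kind: "binary", filename: row["filename"] };
    case "archive":
      return { kind: "archive", filename: row["filename"] };
    default:
      // A typo in the stored type string only surfaces here, at runtime.
      throw new Error(`unknown artifact type: ${row["type"]}`);
  }
}

const a = toArtifact({ type: "binary", filename: "typedb.exe" });
```

Every new artifact type means another case here and in every similar mapping function — code that exists only because the database can’t represent the types itself.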

The net effect is we can’t innovate as fast as our employers need us to, and not nearly as fast as we’d love to. We take pride in writing elegant and novel code — and we very much want to build applications, services and features that deliver more customer value.

However, we’re forced to waste time:

  • Working around database limitations,
  • Finding a way to move forward with less ambition,
  • Putting something off until it becomes practical.

The old way leads to incomplete workarounds

Let’s examine the data model for a CI/CD platform. We have artifacts that are built from sources. In practice, there are several different types of artifacts (e.g., archives and executables). However, common practice would have us model artifacts as generically as possible so we can add new types without requiring database schema changes.

Let’s examine two approaches that demonstrate the futility of trying to create a safe and intuitive type system in Postgres, MongoDB and Neo4j.

Approach #1 — Model data by storing its type in a column, field or property.

Approach #2 — Model data by using different tables, buckets or labels for different types.

Traditional data models are unsafe

Example #1 — Find all binary artifacts built from the 3.0 branch.

Postgres with SQL

This SQL query is for a schema with the following tables:

  • The artifacts table stores artifact instances. It references the artifact_types table.
  • The artifact_types table stores a hierarchy of artifact types. It references itself.
  • The builds table stores builds. It references the artifacts and sources tables.
  • The sources table stores commits such as branches, tags and pull requests.

select filename
from artifacts a 
join artifact_types t on t.id = a.type_id
join builds b on b.artifact_id = a.id
join sources s on s.id = b.source_id
where t.name = 'binary' and
      s.branch = '3.0';

The problem is that the validity of the query depends on the data rather than the schema.

  • What if the correct name in the artifact_types table is not “binary”, but “bin”?
  • What if someone updates the artifact_types table and replaces “binary” with “exe/bin”?

MongoDB aggregation pipelines

This query for a MongoDB database is similar to the Postgres one described above. It contains the following collections: artifacts, artifactTypes, builds and sources.

db.artifactTypes.aggregate([
  {
    $match: {
      name: "binary",
    },
  },
  {
    $graphLookup: {
      from: "artifactTypes",
      startWith: "$_id",
      connectFromField: "_id",
      connectToField: "supertype",
      as: "subtypes",
    },
  },
  {
    $lookup: {
      from: "artifacts",
      localField: "subtypes._id",
      foreignField: "artifactTypeId",
      as: "artifact",
    },
  },
  {
    $unwind: {
      path: "$artifact",
    },
  },
  {
    $lookup: {
      from: "builds",
      localField: "artifact._id",
      foreignField: "artifactId",
      as: "build",
    },
  },
  {
    $unwind: {
      path: "$build",
    },
  },
  {
    $lookup: {
      from: "sources",
      localField: "build.sourceId",
      foreignField: "_id",
      as: "source",
      pipeline: [
        {
          $match: {
            $expr: {
              $eq: ["$branch", "3.0"],
            },
          },
        },
      ],
    },
  },
  {
    $unwind: {
      path: "$source",
    },
  },
  {
    $project: {
      filename: "$artifact.filename"
    },
  },
])

Like the Postgres version, the validity of this query depends on the data rather than the schema. However, the abundance of input values significantly increases the risk of making a mistake. A developer would have to provide the correct value for 24 fields. If any of them are wrong, the query will return the wrong results — and it would likely go unnoticed.

Neo4j with Cypher

This Cypher query is similar to the Postgres one, but is adapted for querying nodes and edges in a graph. First, it finds all Build and Artifact nodes connected by a CREATES edge. Next, it finds any Source nodes connected to the matched Build nodes by a TARGETS edge. Finally, like the SQL version, it checks the type and branch properties of the matched Artifact and Source nodes to further filter the results.

MATCH 
  (b:Build)-[:CREATES]-(a:Artifact), 
  (b)-[:TARGETS]-(s:Source)
WHERE 
  a.type = 'Binary' AND
  s.branch = '3.0'
RETURN a.filename

However, like both the Postgres and MongoDB versions, its validity is entirely dependent on the data rather than the schema — and in the exact same way.

Example #2a — Find all binaries + scripts (i.e., executables) built from the 3.0 branch.

Postgres with SQL

The difference between this version and the original one is that we now have to use a recursive common table expression (CTE) to find all types of executables (e.g., binary + script). It makes the query more complex, and it isn’t any safer.

with recursive subtypes as (
  select t.id
  from artifact_types as t
  where t.name = 'executable'
  union all
  select t.id
  from artifact_types as t
  join subtypes as st on st.id = t.parent_id
)
select filename
from artifacts a 
join subtypes t on t.id = a.type_id
join builds b on b.artifact_id = a.id
join sources s on s.id = b.source_id
where s.branch = '3.0';

MongoDB aggregation pipelines

Like the updated Postgres query, we have to insert a $graphLookup stage into our MongoDB aggregation pipeline in order to first identify all types of executables. It’s more complex, and it’s still incredibly unsafe.

db.artifactTypes.aggregate([
  {
    $match: {
      name: "executable",
    },
  },
  {
    $graphLookup: {
      from: "artifactTypes",
      startWith: "$_id",
      connectFromField: "_id",
      connectToField: "supertype",
      as: "subtypes",
    },
  },
  {
    $lookup: {
      from: "artifacts",
      localField: "subtypes._id",
      foreignField: "artifactTypeId",
      as: "artifact",
    },
  },
  {
    $unwind: {
      path: "$artifact",
    },
  },
  {
    $lookup: {
      from: "builds",
      localField: "artifact._id",
      foreignField: "artifactId",
      as: "build",
    },
  },
  {
    $unwind: {
      path: "$build",
    },
  },
  {
    $lookup: {
      from: "sources",
      localField: "build.sourceId",
      foreignField: "_id",
      as: "source",
      pipeline: [
        {
          $match: {
            $expr: {
              $eq: ["$branch", "3.0"],
            },
          },
        },
      ],
    },
  },
  {
    $unwind: {
      path: "$source",
    },
  },
  {
    $project: {
      filename: "$artifact.filename",
    },
  },
])

Neo4j with Cypher

The updated Cypher query uses node labels instead of properties. Specifically, when we create an artifact node, we have to specify its type and all of its supertypes as labels. For example, if binaries are types of executables, and executables are types of artifacts, then we would insert a binary artifact with the query below.

CREATE (a:Artifact:Executable:Binary 
  {
    filename: 'typedb.exe'
  })

We can now use the query below. It’s simpler and safer. However, it puts the onus on developers to ensure the correct supertypes are specified in every single insert. And to make matters worse, it would be extremely difficult to modify the type hierarchy once data is added.

MATCH 
  (b:Build)-[:CREATES]-(e:Executable), 
  (b)-[:TARGETS]-(s:Source)
WHERE 
  s.branch = '3.0'
RETURN e.filename

Note that the use of labels for type hierarchies can cause performance issues. It’s recommended to limit the number of labels per node, so the practical depth of type hierarchies implemented with labels is severely constrained.

Complex data access is fragile and troublesome

Example #2b — Find all executables (e.g., binaries and scripts) built from the 3.0 branch.

Postgres with SQL

Despite the potential for schema changes, we could opt to keep the artifacts table and create separate executables, binaries and scripts tables. However, we’re simply exchanging recursive CTEs for a combination of joins and unions in reads.

The previous query would have to be rewritten.

select a.filename
from artifacts a
join executables e on e.artifact_id = a.id
join binaries b on b.executable_id = e.id
join builds bu on bu.artifact_id = a.id
join sources so on so.id = bu.source_id
where so.branch = '3.0'
union
select a.filename
from artifacts a
join executables e on e.artifact_id = a.id
join scripts s on s.executable_id = e.id
join builds bu on bu.artifact_id = a.id
join sources so on so.id = bu.source_id
where so.branch = '3.0';

This query is safe insofar as its validity is no longer dependent on the data. However, it’s up to developers to know all of the applicable types, and if there is a change (e.g., a new type of executable), then the query will have to be updated — and the application redeployed.

Further, writes now require multiple inserts.

insert into artifacts (id, filename)
values (1, 'typedb.exe');

insert into executables (id, artifact_id)
values (1, 1);

insert into binaries (id, executable_id)
values (1, 1);

MongoDB with aggregation pipelines

We could take a similar approach with MongoDB so our database contains the following collections: artifacts, executables, binaries and scripts. However, like the Postgres alternative, we’re simply exchanging the $graphLookup stage for nested $lookup stages and one or more $unionWith stages.

The previous query would have to be rewritten.

db.binaries.aggregate([
  {
    $lookup: {
      from: "executables",
      localField: "executableId",
      foreignField: "_id",
      as: "executable",
      pipeline: [
        {
          $lookup: {
            from: "artifacts",
            localField: "artifactId",
            foreignField: "_id",
            as: "artifact",
          },
        },
        {
          $unwind: {
            path: "$artifact",
          },
        },
      ],
    },
  },
  {
    $unionWith: {
      coll: "scripts",
      pipeline: [
        {
          $lookup: {
            from: "executables",
            localField: "executableId",
            foreignField: "_id",
            as: "executable",
            pipeline: [
              {
                $lookup: {
                  from: "artifacts",
                  localField: "artifactId",
                  foreignField: "_id",
                  as: "artifact",
                },
              },
              {
                $unwind: {
                  path: "$artifact",
                },
              },
            ],
          },
        },
      ],
    },
  },
  {
    $unwind: {
      path: "$executable",
    },
  },
  {
    $lookup:
      {
        from: "builds",
        localField: "executable.artifactId",
        foreignField: "artifactId",
        as: "build",
      },
  },
  {
    $unwind:
      {
        path: "$build",
      },
  },
  {
    $lookup: {
      from: "sources",
      localField: "build.sourceId",
      foreignField: "_id",
      as: "source",
      pipeline: [
        {
          $match: {
            $expr: {
              $eq: ["$branch", "3.0"],
            },
          },
        },
      ],
    },
  },
  {
    $unwind: {
      path: "$source",
    },
  },
  {
    $project: {
      filename: "$executable.artifact.filename",
    },
  },
])

Unfortunately, it’s not any safer than the original version because MongoDB queries are excessively verbose and prone to error. And like the Postgres alternative, the burden is on developers to know all of the applicable types and update the query whenever one is added or removed.

And again, writes now require multiple inserts.

db.artifacts.insert(
   {
     _id: 1,
     filename: "typedb.exe"
   }
)

db.executables.insert(
   {
     _id: 1,
     artifactId: 1
   }
)

db.binaries.insert(
   {
     _id: 1,
     executableId: 1
   }
)

Neo4j with Cypher

The alternative to modeling type hierarchies with labels is to model them as a set of connected nodes. First, nodes are created to represent a hierarchy of artifact types: Executable, Binary and Script. Then, edges are created:

  • From the Binary node to the Executable node,
  • From the Script node to the Executable node.

When an Artifact node is created, so must a TYPE_OF edge that connects it to a node within the artifact type hierarchy (e.g., from an Artifact node to a Binary node).

MATCH 
   (a:Artifact)-[:TYPE_OF*]-(e:Executable), 
   (a)-[:CREATED_BY]-(b:Build), 
   (b)-[:TARGETS]-(s:Source)
WHERE s.branch = '3.0'
RETURN a

There are two problems with this approach:

One, it would result in the creation of a supernode, and supernodes should be avoided because they can cause performance issues. The problem occurs when a query passes through a node with countless edges and must traverse every single one of them.

Two, the type hierarchy, which is needed to enforce the correctness of data, resides within the data itself. This means the only way to identify the artifact type hierarchy is by exploring the data first — and the data can change without anyone realizing it.

  • What if the Executable node was removed?
  • What if a new node, ShellScript, was added?

It’s simply not safe to use data to define the type of other data.

Example #3 — Corrupting the integrity of data

Postgres with SQL

Regardless of the approach, we still can’t guarantee the integrity of our data.

We could create non-null, unique and foreign keys on the artifact_id column in each artifact subtype table. This would ensure every row in an artifact subtype table references exactly one row in the artifacts table, and it would prevent multiple rows in an artifact subtype table from referencing the same row in the artifacts table. However, it’s still possible for rows in different artifact subtype tables to reference the same row in the artifacts table — and that would compromise the integrity of our data.

It’s exactly what would happen if we executed the following inserts.

insert into artifacts (id, filename)
values (1, 'typedb.exe');

insert into binaries (id, artifact_id)
values (1, 1);

insert into archives (id, artifact_id)
values (1, 1);

MongoDB CRUD API

There’s even less we can do with MongoDB. That’s the price paid for doing away with schemas.

db.artifacts.insert(
   {
     _id: 1,
     filename: "typedb.exe"
   }
)

db.binaries.insert(
   {
     _id: 1,
     artifactId: 1
   }
)
db.archives.insert(
   {
     _id: 1,
     artifactId: 1
   }
)

Neo4j with Cypher

The same is true with Neo4j.

CREATE (a:Artifact
   {
      filename: 'typedb.exe'
   })

MATCH
  (a:Artifact),
  (b:Binary)
WHERE a.filename = 'typedb.exe'
CREATE (a)-[t:TYPE_OF]->(b)

MATCH
  (a:Artifact),
  (v:Archive)
WHERE a.filename = 'typedb.exe'
CREATE (a)-[t:TYPE_OF]->(v)

The old way is unsafe, and far from intuitive. In effect, we have to tell the database what to do when there are different types of things — and there always are. Worse, we won’t know if our queries will return the correct results, or if we’ve compromised the integrity of our data.

Databases can use modern programming language principles

Why not extend OO programming principles to the database?
Why not bring the safety and expressivity of strong type systems to the database?

We should and, with a strongly-typed database, we can.

A strongly-typed database provides us with a set of base constructs that a) have semantic meaning themselves and b) can be extended to create a faithful representation of our domain. The best way to do this is with an enhanced entity-relationship (EER) model. It describes data as entities, relations and attributes — all of which have types and meaning.

  • Entities represent physical or logical things whose existence is independent of anything else.
  • Relations represent the connections between one or more entities/relations.
  • Attributes represent the characteristics (or properties) of entities and relations.

When data is described as an entity, relation or attribute, it’s put into a context which provides us with a basic understanding of what it is and how to interpret it. Further, relations define roles which provide additional context for how entities/relations are related.

A fundamental advantage of EER models, compared to relational, document and graph, is the inclusion of subtypes with total or partial specialization. It’s the equivalent of inheritance with or without abstraction in OO programming. And with inheritance, we get polymorphism. EER models are the means by which we can apply these OO programming principles to the data models in strongly-typed databases.
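As a rough analogy only (these are not TypeDB’s actual constructs or API — just TypeScript types invented to mirror the EER vocabulary above), subtyping with total specialization corresponds to an abstract base class, and a relation’s roles correspond to typed participants:

```typescript
// Rough analogy: EER constructs rendered as TypeScript classes.
abstract class Entity {}    // independent things
abstract class Relation {}  // connections between things

class Source extends Entity {
  constructor(public branch: string) { super(); }  // "branch" is an attribute
}

// Total specialization: Artifact is abstract, so every artifact
// must be some concrete subtype — inheritance with abstraction.
abstract class Artifact extends Entity {
  constructor(public filename: string) { super(); }
}
class Binary extends Artifact {}
class Script extends Artifact {}

// A relation defines roles for its participants; here the roles
// are modeled (loosely) as typed constructor parameters.
class Build extends Relation {
  constructor(public built: Artifact, public target: Source) { super(); }
}

const b = new Build(new Binary("typedb.exe"), new Source("3.0"));
```

The point of the analogy: anything that accepts an Artifact automatically accepts a Binary or a Script — which is exactly the polymorphic behavior an EER-based data model gives to queries.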

In addition, strongly-typed databases provide us with query languages that are far more expressive than SQL because they allow us to query data based on its type and/or characteristics. SQL, while great for querying relations, is a poor choice because it’s neither typed nor fully declarative. Instead, strongly-typed databases implement composable, pattern-based query languages. They’re safer and more expressive than SQL, yet are also more concise and far easier to read and write.

TypeDB extends strong typing to the database

TypeDB is the world’s first strongly-typed database. A type system rooted in EER concepts allows developers and engineers to model data in an intuitive way and query it with natural language expressions. In addition, it removes the mismatch between logical and physical data models. How you would logically model your data in a diagram is exactly how you would model it in TypeDB. There’s no need to worry about normalization and reification. Further, the data model can be modified and extended at any time without having to touch a single line of application code.

Its query language, TypeQL, is fully declarative and designed for expressivity. It is based on the idea of using composable patterns to describe what the data you need looks like as opposed to telling the database where the data is and what to do with it. As a result, we can ensure our queries are not only valid, but concise and easy to interpret.

In TypeQL, the alternative to example #1, which finds all binary artifacts, is just four lines.

match 
 $b isa binary;
 $c isa source, has branch "3.0";
 ($b, $c) isa build;

And the TypeQL alternative to examples #2a and #2b, which finds all executable artifacts (i.e., binary + script artifacts), is also just four lines.

match 
 $e isa executable;
 $c isa source, has branch "3.0";
 ($e, $c) isa build;

The above query requires no joins, unions or recursive CTEs. The SQL versions are up to 4x longer. And while the SQL version of example #2b would have to be rewritten if a new type of executable were added, the TypeQL version is not only concise and intuitive, it’s adaptive. As we add new types of executables, the query will automatically include any instances of them — no query changes required, so no application code changes required.

Meanwhile, inserting data with TypeQL requires but a single line.

insert $b isa binary, has filename "typedb.exe";

And example #3, which compromises data integrity, is not possible with TypeDB because entities, relations and attributes can only have one type and one supertype. Thus, an artifact can be an executable (supertype) and a binary (subtype), but it can’t be both a binary (subtype) and an archive (subtype) — it must be one or the other. This guarantees the integrity of our data will not be compromised by queries like those in example #3.

A higher level of expressivity

The expressivity of TypeQL is unmatched. In fact, it’s so declarative, we don’t even have to tell TypeDB where or what it is we’re looking for. In conventional databases, queries must specify where the data is located (e.g., table or collection) and on what columns/fields to apply constraints. Then a full table scan or index lookup is performed to find data which matches the constraints. However, TypeDB doesn’t need to be told where the data is, or even what it is.

This query finds all artifacts which have an attribute value containing “linux”:

match 
  $a isa artifact, has $attr;
  $attr contains "linux";

The results would include artifacts whose filename or OS attributes have values containing “linux” (e.g., typedb-studio-linux-2.18.0.tar.gz). This type of polymorphic query is possible because a) TypeDB handles attributes as base types in its EER model and b) it can infer the type of variables defined in queries. It will start by inferring that $attr is an attribute with a string value. Next, it will find all attributes whose string values contain “linux”. Finally, it will find all entities that a) own one of the matched attributes and b) are subtypes of artifact.

However, queries can be even further generalized. This next one finds all entities and relations which own a “name” attribute whose value contains “TypeDB”:

match 
  $x isa $y, has name $n;
  $n contains "TypeDB";

It can return results which include different types. It’s the equivalent of finding rows in any table that has a “name” column containing “TypeDB”, or finding documents in any collection that has a “name” field containing “TypeDB” — neither of which is possible.

The results could include:

  • A source entity having a name attribute whose value is "Branch: TypeDB 3.0",
  • An artifact entity having a name attribute whose value is "TypeDB Studio 2.18.0/Windows",
  • A build relation having a name attribute whose value is "TypeDB 3.0 Build #1".

This last query removes all type constraints.

match $x isa $y, has $attr "TypeDB";

TypeDB can not only return different types, it can return different types with different attributes. This is unimaginable with any other database, regardless of whether it’s relational, document or graph. We can’t write a SQL query without specifying the table name(s) in a FROM clause and the column name(s) in a WHERE clause. It’s impossible. And yet, it’s easy to do with TypeDB.

The results could include:

  • A source entity having a branch attribute whose value is "TypeDB 3.0",
  • An artifact entity having a name attribute whose value is "TypeDB 3.0/macOS",
  • A build relation having a target attribute whose value is "TypeDB Cloud".

This degree of expressiveness is beyond the realm of conventional databases, and opens up new possibilities for developers and engineers.

Innovate faster and build the future

TypeDB is revolutionizing databases by combining strong typing and semantic meaning to provide higher level abstractions and a powerful query language based on composable patterns. It not only provides developers and engineers with semantically meaningful data models and intuitive, powerful data access, but opens up a world of applications and use cases that were previously unimaginable or too difficult to realize with conventional databases.

However, we’ve only begun to surface what’s possible with TypeDB. By integrating its type system and query language with a powerful inference engine, developers and engineers can take advantage of AI within the database to surface information that was previously difficult, if not impossible, to find. The type of information that enhances cyber threat intelligence, accelerates drug discovery, guides autonomous vehicles and prevents data breaches with smarter identity and access management.

Download TypeDB today, start building the future and redefine what’s possible.


Did you enjoy this blog post? Also check out The age of AI is upon us, where are the smart databases.

Interested in learning more about TypeDB? You can subscribe to our TypeDB newsletter or follow us on Medium and interact with the TypeDB community on Discord and Github!

Please feel free to contact us directly if you’d like to discuss the use of TypeDB in more detail.
