Why We Need a Polymorphic Database


Databases have come a long way since the navigational databases of the 1960s. The relational, document, and graph database paradigms have each profoundly impacted the way we store and query data, but many fundamental engineering challenges remain. Why is object-relational mismatch still a problem after 50 years? Why do we need ORMs to manage our data models for us? Why do queries need to be updated when the data model is extended, even if the question being asked remains the same?

Each successive database paradigm has only implemented workarounds to these problems without addressing the underlying cause: they cannot interpret declarative polymorphic queries. Programming languages have continuously become more powerful and declarative as they have evolved, but database query languages have not significantly advanced beyond SQL. This has left databases unable to express many of the OOP features that we now take for granted, like abstraction, inheritance, and polymorphism.

This article attempts to answer the above questions by conducting an in-depth exploration of the strengths and weaknesses of three of the most popular current database paradigms and how they deal with the polymorphism intrinsic to object models that represent natural data domains. It begins by examining the historical context and design philosophies behind the relational, document, and graph database paradigms to understand their capabilities. It then demonstrates the object-relational mismatch inherent in relational schemas with a simple class-inheritance structure and compares the three common enterprise design patterns used to model inheritance, along with the drawbacks of each one. This is followed by a discussion of the approaches that schemaless databases take to the same problem, with specific reference to the document and graph paradigms.

The article’s main focus is an object model for a simple filesystem, demonstrated with some lightweight application code in Java for instantiating objects and performing a few basic query-like operations. This code is compared to a relational implementation in PostgreSQL, a document implementation in MongoDB, and a graph implementation in Neo4j. Through queries that are equivalent to the application code, the significant difference in declarativity between modern programming languages and current database query languages is shown, demonstrating how current databases lead to fundamentally brittle application architectures. In this article’s conclusion, the wide-ranging problems presented throughout are summarised to define the necessary capabilities a polymorphic database would need to solve them.

An overview of current database paradigms

Programming languages are typically considered to be either general-purpose languages (GPLs) or domain-specific languages (DSLs). The distinction between these categories is not well defined, and there are many examples of DSLs gaining traction outside of their domain and becoming GPLs or, conversely, GPLs being phased out except in a few specific domains and thereby becoming DSLs. While all GPLs must necessarily be Turing-complete, this property is not a defining characteristic of GPLs, and many DSLs are also Turing-complete. However, developers can normally agree on whether a given programming language is a GPL or not, based on applicability and usability across a wide range of application domains.

A similarly loose distinction exists in database paradigms, and paradigms are generally regarded as either general-purpose or domain-specific. It is thus appropriate to make the definition that a general-purpose database paradigm is one that has applicability and usability across a wide range of data domains. This article will focus on three of the most common database paradigms considered to be general-purpose, but that are very different from each other: relational, document, and graph.

Relational databases

The first versions of relational database theory and SQL were conceived in the early 1970s, when procedural programming languages dominated computer science and information technology. Codd’s set-theoretic relational algebra (Codd, 1969) is the foundation of SQL, and describes all data as tuples within relations, which map attributes onto atomic values. These are practically implemented under the better-known terms of rows within tables, which map column names onto their values for the given row. In a relational database, a schema provides structural integrity, ensuring that data cannot be corrupted and that queries can be analyzed for correctness.

Basics of relational modeling

As relational algebra requires that individual data values be atomic, data must be normalized in order to model non-atomic dependencies, including multivalued attributes and relationships with one-to-many or many-to-many cardinalities. This is done by storing dependent data in separate tables, which reference the row containing the parent data by a foreign key. In order to retrieve the complete data, the tables must be joined on the column containing the foreign key. The result is known as a view, or a derived relation in the set-theoretic algebra. Like the structure of the tables themselves, the structure of the foreign keys is also enforced by the schema, thus maintaining referential integrity by preventing the creation of dangling references (functionally similar to dangling pointers).

Relational databases | Set-theoretic algebra
Table                | Relation
Column               | Attribute
Row                  | Tuple
View                 | Derived relation
The basic components of a relational database, as described by Codd’s set-theoretic relational algebra (1969).
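As a brief sketch of normalization in practice (the tables here are hypothetical), a one-to-many relationship between departments and staff is stored by giving each staff row a foreign key to its department, and the complete data is retrieved with a join:

CREATE TABLE departments (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE staff (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INT NOT NULL REFERENCES departments(id)
);

-- Joining on the foreign key produces a view (a derived relation).
SELECT staff.name, departments.name AS department
FROM staff
JOIN departments ON staff.department_id = departments.id;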

Relational databases were a turning point for data persistence. Previous database paradigms were navigational (though this encompassing term for pre-relational databases was not coined until later by Bachman, 1973), and accessed data imperatively. Users would have to describe how to iterate (or navigate) from each record to the next, by following links between them. In comparison, relational databases function declaratively: the user describes the qualifiers that define a record to retrieve, and the database returns all matching records.

Relations are not objects

Relational databases quickly became embedded in almost every digitized industry as the prevailing convention replacing navigational databases, becoming an ANSI standard in 1986 and an ISO standard in 1987. This was prior to the widespread adoption of object-oriented programming (OOP), which was spearheaded by the release of C++ in 1985 and not standardized by ISO until 1998. As a result, the most common general-purpose database paradigm made no account of what would become the most common general-purpose programming paradigm. This presented a cumbersome challenge to programmers: the relational and object models are fundamentally incompatible, requiring data structures to be translated from one to the other when transferring data between the application and the database. This disparity is known as object-relational mismatch (OR mismatch).

OR mismatch arises from the fact that rows in a relational database, as implementations of set-theoretic tuples, are one-dimensional, while the data structures used in OOP are multi-dimensional. The multidimensionality of object models enables the expression of powerful concepts like abstraction, inheritance, and polymorphism, all indispensable tools in modern programming that allow the natural representation of complex data domains and the native expression of their data. In contrast, relational databases are not able to natively express OOP data structures.

The complexity of data across all industries is increasing at an exponential rate, and while OR mismatch was originally simple, if cumbersome, to manage, it has become increasingly unmanageable. In some cases, as will be examined later in this article, the number of tables required to model data is combinatorial with respect to the number of object classes. This problem led to the inception of several new database paradigms under the NoSQL umbrella, which each attempted to work around OR mismatch in a different way.

Document databases

The first NoSQL database paradigm that saw widespread adoption for general-purpose applications was the document database, with the release of MongoDB in 2009. In document databases, data is encoded in a hierarchical format such as JSON or XML, with each level in the hierarchy comprising a document. Documents are stored in collections, indexed by IDs, and contain named fields mapped to either primitive data types or nested sub-documents. The same applies to the nested documents themselves, with no notional limit to the depth of nesting.
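For example, a single order might be stored as the following document (a hypothetical sketch; the fields are purely illustrative):

{
    "order_number": 1024,
    "customer": {
        "name": "Ada Lovelace",
        "address": {
            "street": "12 St James's Square",
            "city": "London"
        }
    },
    "items": [
        { "sku": "A-100", "quantity": 2 },
        { "sku": "B-205", "quantity": 1 }
    ]
}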

Basics of document modeling

In principle, document databases are identical to key-value databases with the additional abstraction of native data structure nesting. It is possible to manually replicate the function of a document database using a key-value store, by having values correspond to other keys. This abstraction makes document databases far more usable and optimized than key-value databases and has led them to be widely adopted for general-purpose data persistence, while key-value databases are typically reserved for specialized applications, such as caches and storage engines.

As in key-value databases, no constraints are placed on the values that can be stored. Because of this, document databases are referred to as semi-structured. While any single document has a definite parseable structure, there are no mandated constraints on that structure, which may be modified freely as needed. Documents in the same collection do not need to have the same structure and might be completely dissimilar, with no common fields.

Consequences of lacking structure

Due to the lack of structural constraints, document databases are schemaless. This made them very appealing to engineers working with big data, as disparate data can easily be ingested with no prior knowledge of structure. The nesting of related data also allows for highly efficient insertion and retrieval, even at very large data volumes, as all data nested in a document is stored as a single block on disk. In comparison, relational databases must perform multiple disk reads to retrieve the data required for a single query, as related data is stored in separate tables that must be joined at query-time.

At a time when relational databases were the norm, but were failing to keep up with the continually increasing complexity of contemporary data, document databases were a compelling alternative. However, document databases often fail to deliver if used indiscriminately. The hierarchical data structure becomes problematic when dealing with data that is not strictly hierarchical. If it is no longer possible to nest dependent data, it must be stored in a separate document, and referenced by some value. When querying, a lookup must then be performed on the reference value. This process is generally less performant (per result) than performing a join in a relational database because document databases cannot capitalize on the constraints provided by a schema to optimize the process. As a result, document database design patterns advocate for using nested structures wherever possible, and so are not well-suited to use cases involving strongly non-hierarchical data.
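If, for instance, the customer from the earlier sketch could not be nested, and orders instead stored a reference to a document in a separate customers collection, retrieving an order together with its customer would require a lookup on the reference value. In MongoDB, this is expressed with the $lookup aggregation stage (again a hypothetical sketch):

db.orders.aggregate( [
    { "$lookup": {
        "from": "customers",          // the referenced collection
        "localField": "customer_id",  // the reference value stored in orders
        "foreignField": "_id",
        "as": "customer"
    } }
] )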

Graph databases

Modern graph databases evolved from an earlier sub-category of navigational databases known as network databases. The idea had existed since the early 1990s but did not become widely popularised until the first commercially viable versions were released in the mid-2000s, notably Neo4j in 2007, the most popular graph database today. Graph databases are based on graph theory and implemented as labeled property graphs (LPGs), with inspiration drawn from Chen’s entity-relationship (ER) model (Chen, 1976).

Basics of graph modeling

In a graph database, data is stored as nodes (representing entities) and relationships between them, equivalent to the vertices and edges of a graph, respectively. Both nodes and relationships can be assigned labels, which represent the names of categories they fall into, and properties, which store data about them in scoped key-value stores.

Graph databases | Graph-theoretic algebra | ER modeling
Node            | Vertex                  | Entity
Relationship    | Directed edge           | Relationship
Label           | Label                   | Type
Property        | Label                   | Attribute
The basic components of a graph database, implementing a labeled property graph (LPG) model. Mathematically, properties can be considered special kinds of labels, and they can be expressed using the same algebraic functions.
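For example, in Cypher (Neo4j’s query language), two labeled nodes and a relationship between them, each carrying properties, might be created as follows (an illustrative sketch):

// A Person node, a Company node, and a WORKS_AT relationship,
// each with properties stored as scoped key-value pairs.
CREATE (alice:Person {name: "Alice"})
    -[:WORKS_AT {since: 2019}]->
    (acme:Company {name: "Acme"});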

Relationships are implemented as directed edges, equivalent to the semantic triple model, with the start-node, edge, and end-node representing the subject, predicate, and object, respectively. This allows for most semantic relationships to be naturally modeled in graph databases. There is no notional limit to the number of relationships a node can be connected with, allowing graph databases to easily represent highly interconnected data. Graph databases are schemaless, allowing any labels and properties to be assigned to nodes and relationships. It is worth noting that, while relationships can have multiple labels in graph-theoretic algebra, many LPG databases only permit relationships to have a single label, such as Neo4j (in which it is called the relationship’s type).

A big step from triplestores

The structure of LPG data is very similar to that of an earlier graph-based data model, the triplestore model, implementing the resource description framework (RDF) standard. LPG databases are strictly more expressive than triplestores, able to represent everything they can and more. The most important addition is that of node and relationship properties, which gives LPG elements internal structures. In contrast, node properties in triplestores are modeled by creating a node representing the property value and linking it to the associated node with an edge representing the property key. Relationship properties cannot easily be modeled in triplestores, as edges cannot be directly connected to other edges. In order to store them, the relationship edge must be replaced by a node, which is linked to the original edge’s two nodes by two new edges. Properties can then be assigned to the “relationship” node. This process is known as reification.

Limitations of graphs

Like document databases, graph databases appealed to engineers working with semi-structured data due to their lack of schema and ability to easily store disparate data structures. However, unlike them, graph databases are highly optimized to store non-hierarchical data involving multidimensional connections between entities. Relationships are implemented as pairs of pointers to the storage-layer addresses of the nodes they link. This allows for highly efficient queries on them, in comparison to relational and document databases, which must query relationships between tables or collections by searching for records that contain a referenced value. Choosing between document and graph databases thus generally involves consideration of the characteristics of the data domain and application.

Graph databases have seen particular popularity in the field of knowledge engineering due to their ability to ingest and represent data that is both disparate and strongly interconnected. However, graph databases are unable to natively express certain kinds of relationships, which leads to difficulties in modeling them. As relationships are implemented as directed edges, they can only model relationships that are binary (between exactly two nodes) and directed (with the two nodes specified as a start- and end-node). In order to express other kinds of relationships, such as n-ary ones (between any number of nodes), nested ones (between other relationships instead of nodes), and undirected ones (in which neither node is the start or end), relationships must be reified.

Reification of relationships in LPG graphs allows these more complex relationship types to be implemented, but storing relationships as nodes breaks from the LPG model, as nodes should represent entities rather than relationships. This creates significant challenges for integration with applications, as relationships cannot all be queried in the same way, and is particularly relevant when conducting graph analytics, such as node classification and link prediction.

Multi-model databases

A more recent trend is the multi-model database. Multi-model databases allow different data to be modeled in different ways depending on the use case and are particularly useful in applications involving the federation of different data domains. In a multi-model database, each unit of data can be stored using the modeling paradigm and format best suited to it. Data in different formats can then be linked together, essentially creating a polyglot persistence architecture within a single database. Some multi-model databases also allow data to be cast from one format to another, for instance enabling relational data to be queried as a graph.

As an example, a multi-model database might be used to model a social network. User information would be stored in relational tables so as to guarantee proper referential integrity for this critical data. Content information would be stored in document collections to easily and quickly ingest the stream of produced content pieces, which have disparate but self-contained hierarchical structures. Finally, all of the complex interconnections between the users and content pieces would be stored in a graph layer, in which the user nodes are mapped onto relational rows and the content nodes are mapped onto documents.

By combining previously incompatible data domains under a single database, the technical complexity of application architectures can be significantly reduced while still providing the flexibility required to optimize the storage of different kinds of data. In this way, multi-model databases have all the strengths and weaknesses of the models they combine.

Object-relational mismatch

OR mismatch is a problem that arises from the inability of one-dimensional relational data structures to store multi-dimensional object models. While mapping from object models to database models can also involve mismatches under the document and graph paradigms, this section of the article examines the problem with specific respect to relational databases and presents the commonly accepted solutions.

OR mismatch can manifest itself in a number of ways. One of the most well-known ones is in the modeling of inheritance structures, where a trade-off must typically be made in one or more factors in order to store data in a consistent manner. Those factors include referential integrity, semantic integrity, model maintainability, model extensibility, query maintainability, query performance, and low disk usage. Of those factors, referential integrity is the most important. Automatic referential integrity management prevents data corruption and is one of the most powerful features of relational databases. It is the primary reason they are still used today for the storage of critical data. If referential integrity must be sacrificed, it is likely that a relational database is not a good fit for the application at hand.

The conventional wisdom

Three popular enterprise design patterns have emerged to deal with inheritance, coined by Martin Fowler (2002). They are single-table inheritance, concrete-table inheritance, and class-table inheritance. Each of them trades off a different set of factors listed above. In order to analyze these design patterns, an example of a two-level hierarchy will be utilized, pertaining to employees in a company, all of whom are either permanent employees or contractors.

In this example, all employees have:

  • An employee ID.
  • A first name.
  • A last name.

Permanent employees also have:

  • A start date.
  • An end date (which is null for current employees).
  • A salary.

Contractors instead have:

  • An agency ID.
  • A contract number.
  • An hourly rate.

Pattern 1: Single-table inheritance

In this design pattern, all data is stored in one table that represents the common parent type. The table contains columns for the parent type’s attributes and columns for the attributes of every child type.

CREATE TYPE EmployeeType
AS ENUM ('permanent_employee', 'contractor');

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    employee_type EmployeeType NOT NULL,
    start_date DATE DEFAULT NULL,
    end_date DATE DEFAULT NULL,
    agency_id INT DEFAULT NULL REFERENCES agencies(id),
    contract_number INT DEFAULT NULL,
    salary MONEY DEFAULT NULL,
    hourly_rate MONEY DEFAULT NULL
);

The greatest strengths of this design pattern are the maintainability of the model and queries. The model is trivial and involves no foreign keys. By leveraging the DEFAULT NULL property of the columns specific to each employee type, insert queries only need to reference the relevant columns. Read queries are also simple, with the ability to filter on employee type or retrieve all employees at once, both without any joins or unions.

Lack of integrity constraints

However, there is an immediately obvious problem. There are no constraints on the attributes an employee has. This means that it is possible for a permanent employee to have an hourly rate, a contractor to have a salary, or even an employee to have both or neither. This can easily be fixed by adding a custom constraint to the table, enforcing that the appropriate attributes are null for each employee type. This ensures semantic integrity within the table.

ALTER TABLE employees
ADD CONSTRAINT check_nulls
CHECK ( (
    employee_type = 'permanent_employee'
    AND num_nulls(start_date, salary) = 0
    AND num_nonnulls(agency_id, contract_number, hourly_rate) = 0
    ) OR (
    employee_type = 'contractor'
    AND num_nulls(agency_id, contract_number, hourly_rate) = 0
    AND num_nonnulls(start_date, end_date, salary) = 0
) );

Harder to manage are cross-table dependencies. If another table references the employee table with a foreign key, there is no simple way to ensure that it only references appropriate employee types. For instance, if there is a table of performance reviews, it is important to ensure that only permanent employees can be referenced. As a result, while it is possible to maintain semantic integrity within a table, it is not possible across the entire database when using this design pattern.
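To sketch the problem with a hypothetical dependent table: a foreign key can only target the single employees table, so it accepts contractors just as readily as permanent employees.

CREATE TABLE performance_reviews (
    id SERIAL PRIMARY KEY,
    -- This reference accepts any employee; the schema cannot restrict
    -- it to rows where employee_type = 'permanent_employee'.
    employee_id INT NOT NULL REFERENCES employees(id),
    review_date DATE NOT NULL
);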

The custom constraints required also cause issues for extensibility, and quickly become unfeasible to manage as more employee types are added, potentially in a deep multi-level hierarchy, and likely with complex attribute interdependencies. This easily becomes unmaintainable, and makes the model very difficult to extend.

A single bottleneck

Furthermore, this design pattern can have a serious impact on performance. At small loads, queries about employees are very performant as no joins are required, in contrast to other design patterns. However, as all queries about employees access the same table, read and write locks will cause a bottleneck, preventing queries from executing performantly at large loads. Finally, a large amount of unnecessary disk space is required to store the data due to the large number of nulls, though more modern relational databases have improved compression algorithms for mitigating this.

Pattern 2: Concrete-table inheritance

In this design pattern, one table is used for each concrete (non-abstract) type in the hierarchy. Each table contains columns for the parent type’s attributes in addition to those for the attributes specific to the concrete type represented.

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

Increased query complexity

As with the single-table design pattern, the model is simple to maintain, but the same does not apply to the queries. Read queries for single employee types are trivial, but those spanning multiple employee types have to perform a union over those types, filling in nulls for columns that do not appear in all of them. This can quickly become difficult to manage as the number of employee types grows, especially if some columns are shared by only some of those types.

SELECT
    id,
    first_name,
    last_name,
    'permanent_employee' AS employee_type,
    start_date,
    end_date,
    salary,
    NULL AS agency_id,
    NULL AS contract_number,
    NULL AS hourly_rate
FROM permanent_employees
UNION
SELECT
    id,
    first_name,
    last_name,
    'contractor' AS employee_type,
    NULL AS start_date,
    NULL AS end_date,
    NULL AS salary,
    agency_id,
    contract_number,
    hourly_rate
FROM contractors;

Insert queries present a unique challenge. All employees must have globally unique IDs, but as different employee types are split over different tables, it is no longer possible to delegate their assignment to the database using the auto-incrementing SERIAL data type. As a result, the database is unable to control semantic integrity of employee IDs, requiring them to be managed outside of the database using a stateful allocation system. Extensibility is also a problem with this pattern. Because columns are repeated across tables, any new columns or changes to existing ones must be propagated to every table that has them.
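To sketch the ID problem using the earlier example data (and assuming an agency with ID 8 exists), both of the following inserts succeed, leaving two different employees sharing the same ID:

INSERT INTO permanent_employees (id, first_name, last_name, start_date, salary)
VALUES (1, 'Kima', 'Greggs', '2020-09-25', 65000.00);

-- Accepted despite reusing ID 1: each primary key is only unique per table.
INSERT INTO contractors (id, first_name, last_name, agency_id, contract_number, hourly_rate)
VALUES (1, 'Reginald', 'Cousins', 8, 301, 50.00);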

However, there are some distinct benefits. Each employee type has its own table, meaning that cross-table dependencies can be correctly enforced. There are no unnecessary nulls in the data, keeping disk usage as low as possible. Finally, as the employee types are completely split over different tables, there is less competition for read and write locks, giving superior performance at large loads in comparison to the single-table pattern.

Pattern 3: Class-table inheritance

In this design pattern, one table is used for every type in the hierarchy, abstract and concrete. Each table only contains columns for the attributes specific to the type represented.

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY REFERENCES employees(id),
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY REFERENCES employees(id),
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

Increased maintenance costs

Compared to the single- and concrete-table patterns, this pattern is more architecturally complex, and this complexity quickly worsens as the model grows, requiring the most tables of the three design patterns. Querying is also significantly more complex. Employees are no longer represented by single rows, with part of the data stored in the parent table and part stored in one of the child tables. Insert queries must insert data into multiple tables for what is conceptually the creation of a single entity, and read queries must join those tables together in order to retrieve the full information on an employee.

DO $$
DECLARE
    employee_id INT;
BEGIN
    INSERT INTO employees (id, first_name, last_name)
    VALUES (DEFAULT, 'Kima', 'Greggs')
    RETURNING id INTO employee_id;

    INSERT INTO permanent_employees (id, start_date, salary)
    VALUES (employee_id, '2020-09-25', 65000.00);
END $$;

With this pattern, it is possible to simultaneously solve the semantic integrity issues of the previous two patterns. Cross-table dependencies can be correctly enforced by referencing the table of the appropriate type, and global ID uniqueness can be enforced by storing IDs in the table at the hierarchy root and making them auto-incrementing. Despite this, another semantic integrity issue arises: it is not possible to enforce the number of concrete types an employee can have. For example, a single employee in the employees table could have entries in the tables for both permanent employees and contractors, even though an employee should only have one concrete type. As this is a form of cross-table dependency, it cannot be easily solved with native SQL constraints, and the lack of semantic integrity remains.
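To sketch the violation using the earlier example data (and assuming employee 1 and agency 8 exist), both of the following inserts are accepted:

-- Employee 1 becomes a permanent employee...
INSERT INTO permanent_employees (id, start_date, salary)
VALUES (1, '2020-09-25', 65000.00);

-- ...and simultaneously a contractor, without violating any constraint.
INSERT INTO contractors (id, agency_id, contract_number, hourly_rate)
VALUES (1, 8, 301, 50.00);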

Like the concrete-table pattern, this pattern keeps disk usage low due to the lack of unnecessary nulls. However, performance at large loads is bottlenecked by the top-level employees table, which must be accessed in almost every query. The biggest advantage of this design pattern is its extensibility. Adding new types of employee does not require refactoring of existing tables, and each column only appears in a single table, making columns easy to add and maintain.

Summary

Each of the three popular enterprise design patterns for modeling inheritance has different strengths and weaknesses, summarised in the table below. While the factors that must be optimised vary between use cases, the most important differentiators between the three patterns in the general case are query maintainability for single-table inheritance, query performance for concrete-table inheritance, and model extensibility for class-table inheritance. Achieving all three at the same time is rarely possible when modeling inheritance structures in a relational database.

While all three design patterns ensure referential integrity, none are able to enforce semantic integrity, with each one failing in a different way. This is the most serious problem. Despite the schema constraining the data model, relational databases are unable to prevent nonsensical data from being inserted. This can easily lead to long-term degradation in data quality unless semantic integrity is managed outside of the database, which introduces an additional layer of complexity to the database architecture and places the responsibility for enforcement on the engineer rather than on the database. With more realistically complex databases, it is easy for problems to be introduced and go unnoticed.

                      | Single-table | Concrete-table | Class-table
Referential integrity | ✓            | ✓              | ✓
Semantic integrity    | ✗            | ✗              | ✗
Model maintainability | ✓            | ✓              | ✗
Model extensibility   | ✗            | ✗              | ✓
Query maintainability | ✓            | ✗              | ✗
Query performance     | ✗            | ✓              | ✗
Low disk usage        | ✗            | ✓              | ✓
The strengths and weaknesses of the three enterprise design patterns for inheritance in a relational database, in the context of a general use case.

Schemaless approaches

While the many NoSQL database paradigms differ substantially, a unifying characteristic is that most of them do not have schemas, allowing for easy insertion of polymorphic data. The examples below show insert queries for the document database MongoDB and the graph database Neo4j, based on the same data model used in the previous section. In the document database example, documents for different employee types are inserted into the same employee collection, with the specific employee types recorded as data. In the graph database example, employees are given multiple labels to represent the multiple types in the employee type hierarchy.

agency_id = db.agencies.find( { "id": 8 } ).next()["_id"]

db.employees.insertMany( [
    {
        "id": 1,
        "first_name": "Kima",
        "last_name": "Greggs",
        "employee_type": "permanent_employee",
        "start_date": new Date("2020-09-25"),
        "salary": 65000.00
    },
    {
        "id": 2,
        "first_name": "Reginald",
        "last_name": "Cousins",
        "employee_type": "contractor",
        "agency": agency_id,
        "contract_number": 301,
        "hourly_rate": 50.00
    }
] )

MATCH (baltimoreAgency:Agency {id: 8} )
CREATE
    (kima:Employee:PermanentEmployee {
        id: 1,
        first_name: "Kima",
        last_name: "Greggs",
        start_date: date("2020-09-25"),
        salary: 65000.00
    } ),
    (reginald:Employee:Contractor {
        id: 2,
        first_name: "Reginald",
        last_name: "Cousins"
    } )-[:CONTRACTED_HOURLY_FROM {
        contract_number: 301,
        hourly_rate: 50.00
    } ]->(baltimoreAgency);

In both the document and graph database cases, it is simple to query either all employees or only employees of a specific type. This is achieved by filtering on the employee type field in the document case, and by matching on the label of the desired type in the graph case, as the example queries below show. This approach can be trivially extended to larger type hierarchies. In the document case, this would be achieved by replacing the single employee type string with a list of employee types, and in the graph case, simply by adding additional labels to the nodes.
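For instance, reading back the employees inserted above only requires a filter on the type field in MongoDB:

// Retrieve all employees, then only contractors.
db.employees.find( {} )
db.employees.find( { "employee_type": "contractor" } )

And in Cypher, a match on the parent or child label:

// Retrieve all employees, then only contractors.
MATCH (e:Employee) RETURN e;
MATCH (c:Contractor) RETURN c;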

The most significant problem with the schemaless paradigms is the difficulty of enforcing semantic integrity. Many NoSQL databases offer simple constraint definition and validation, but cannot give the more powerful integrity assurances possible with the foreign keys of a relational schema. This is a necessary compromise in order to be able to easily store semi-structured data.

Integrity in document databases

Enforcing the fields that a given document should have can be achieved by constraining the permitted value ranges of those fields. In document databases, constraints are defined per collection, meaning that all documents in the employee collection would be subject to the same constraints. As a result, it would not be possible to ensure that only permanent employees have salaries, for example. This could be solved by using different collections for different employee types and applying constraints on an individual basis, but then constraints that apply to parent types must be maintained across multiple collections.

db.createCollection( "permanent_employees", {
    "validator": { "$jsonSchema": {
        "bsonType": "object",
        "properties": {
            "id": { "bsonType": "int" },
            "first_name": { "bsonType": "string" },
            "last_name": { "bsonType": "string" },
            "start_date": { "bsonType": "date" },
            "end_date": { "bsonType": "date" },
            "salary": { "bsonType": "double", "multipleOf": 0.01 }
        },
        "required": [
            "id",
            "first_name",
            "last_name",
            "start_date",
            "salary"
        ],
        "additionalProperties": false
    } }
} )

db.createCollection( "contractors", {
    "validator": { "$jsonSchema": {
        "bsonType": "object",
        "properties": {
            "id": { "bsonType": "int" },
            "first_name": { "bsonType": "string" },
            "last_name": { "bsonType": "string" },
            "agency": { "bsonType": "objectId" },
            "contract_number": { "bsonType": "int" },
            "hourly_rate": { "bsonType": "double", "multipleOf": 0.01 }
        },
        "required": [
            "id",
            "first_name",
            "last_name",
            "agency_id",
            "contract_number",
            "hourly_rate"
        ],
        "additionalProperties": false
    } }
} )

The problem then becomes a trade-off of semantic integrity against model maintainability. Even then, semantic integrity can still not be completely guaranteed due to the inability to define cross-collection constraints. These two design patterns, grouping all employees in a single collection or dividing them between collections based on type, are comparable to the single-table and concrete-table inheritance patterns for relational databases. A similar pattern comparable to the class-table pattern can also be imagined, in which employee data is divided between collections for the parent and child types.

Conventional wisdom

Daniel Coupal et al. (2023) present enterprise design patterns for modeling data of this kind in a document database, in which they advocate for storing all similar documents (in this case employees) in a single collection representing the parent type, what they call the “polymorphic pattern”. In contrast, separating similar documents into separate collections is specifically described as an anti-pattern. The primary reason for this is performance, as document databases are not optimised to perform the cross-collection lookups required to collectively access data divided across multiple collections.

This consideration also manifests itself when dealing with dependent data. In the example, contractor documents reference an agency ID, corresponding to a document in the agencies collection. In order to retrieve a contractor’s data along with that of their agency, a costly cross-collection lookup must be performed. In this case, performance could be improved by instead nesting the agency document within the contractor document as a sub-document, but this comes with other problems.

{
    "id": 2,
    "first_name": "Reginald",
    "last_name": "Cousins",
    "employee_type": "contractor",
    "agency": {
        "id": 8,
        "name": "Baltimore Agency"
    },
    "contract_number": 301,
    "hourly_rate": 50.00
}

If an agency has multiple contractors, then its data must be nested in every associated contractor document. Likewise if any other documents in other collections are associated to the agency, then they must also contain a copy of the data. This leads to data redundancy issues, taking significant extra disk space in exchange for improved performance. Modifying data stored in this way is also a considerable challenge, as it requires the engineer to ensure the redundant copies remain synchronized. Without a schema to indicate where all the copies might be, potentially at different levels of nesting, this design pattern quickly becomes impractical to maintain.
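For example, renaming an agency then requires updating every document that embeds a copy of it, in every collection that might contain one (a sketch, assuming the nested structure shown above):

// Synchronize the embedded copies held by this agency's contractors.
db.employees.updateMany(
    { "agency.id": 8 },
    { "$set": { "agency.name": "New Baltimore Agency" } }
)

Any other collection embedding the same agency must be updated by a separate query, and without a schema there is no record of where such copies exist.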

Document databases are excellent at storing semi-structured data without many connections, and are highly optimized for such use cases in both query simplicity and performance. However, in order to be so well optimized, they sacrifice certain features that are necessary for more general purposes. The inability to place detailed constraints on inserted data prevents native enforcement of semantic integrity, especially when utilizing the polymorphic design pattern. The necessity of using redundant data to performantly model connections causes further integrity issues, easily leading to inconsistent and stale data. This makes document databases poorly suited to applications where data integrity is essential, unless integrity is managed externally at the application layer.

Integrity in graph databases

Graph databases are better than document databases at modeling constraints, with the ability to restrict property value ranges on a per-label basis. Unlike in document databases, where each document belongs to exactly one collection, nodes in graph databases can have multiple labels, allowing the property constraints required for the parent employee type and the child types to be simultaneously enforced without switching to a suboptimal data model.

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.id IS NODE KEY;

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.id IS TYPED INTEGER;

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.first_name IS NOT NULL;

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.first_name IS TYPED STRING;

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.last_name IS NOT NULL;

CREATE CONSTRAINT FOR (n:Employee)
REQUIRE n.last_name IS TYPED STRING;

CREATE CONSTRAINT FOR (n:PermanentEmployee)
REQUIRE n.start_date IS NOT NULL;

CREATE CONSTRAINT FOR (n:PermanentEmployee)
REQUIRE n.start_date IS TYPED DATE;

CREATE CONSTRAINT FOR (n:PermanentEmployee)
REQUIRE n.end_date IS TYPED DATE;

CREATE CONSTRAINT FOR (n:PermanentEmployee)
REQUIRE n.salary IS NOT NULL;

CREATE CONSTRAINT FOR (n:PermanentEmployee)
REQUIRE n.salary IS TYPED FLOAT;

Unfortunately, this feature does not always extend to relationships, as some graph databases do not allow relationships to have multiple labels. In such cases, it is necessary to adopt an approach similar to the relational concrete-table inheritance pattern, whereby constraints pertaining to a parent relationship type are repeated for the concrete label of each child type. In the example, this might manifest itself as different contract rates (hourly, monthly, per project), encoded as different relationship labels. Each label would have a different rate constraint, but they would all duplicate the same contract number constraint. This comes with additional work in maintaining the constraints and ensuring they do not become inconsistent with updates. It is also not possible to define constraints over all contract rate labels, for instance ensuring that every contract number is globally unique.

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_HOURLY_FROM]-()
REQUIRE r.contract_number IS RELATIONSHIP KEY;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_HOURLY_FROM]-()
REQUIRE r.contract_number IS TYPED INTEGER;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_HOURLY_FROM]-()
REQUIRE r.hourly_rate IS NOT NULL;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_HOURLY_FROM]-()
REQUIRE r.hourly_rate IS TYPED FLOAT;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_MONTHLY_FROM]-()
REQUIRE r.contract_number IS RELATIONSHIP KEY;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_MONTHLY_FROM]-()
REQUIRE r.contract_number IS TYPED INTEGER;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_MONTHLY_FROM]-()
REQUIRE r.monthly_rate IS NOT NULL;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_MONTHLY_FROM]-()
REQUIRE r.monthly_rate IS TYPED FLOAT;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_PER_PROJECT_FROM]-()
REQUIRE r.contract_number IS RELATIONSHIP KEY;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_PER_PROJECT_FROM]-()
REQUIRE r.contract_number IS TYPED INTEGER;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_PER_PROJECT_FROM]-()
REQUIRE r.project_pay IS NOT NULL;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_PER_PROJECT_FROM]-()
REQUIRE r.project_pay IS TYPED FLOAT;

Conventional wisdom

One way of getting around this issue is to make the contract itself a node by reifying it. This would result in an entirely different data structure that diverges from the semantics of the domain. In conceptual terms, the contract is the relationship between the contractor and the agency. By reifying the relationship in order to apply the necessary constraints, the data model ceases to be a true reflection of the domain, in addition to complicating application integration as the node and two attached edges must now be treated as a single indivisible unit. In fact, Barrasa (2017) writes that one of the motivations behind the LPG model was to avoid the reification necessary to model relationship properties in the triplestore model. The need to implement the strategy in an LPG demonstrates how graph databases are not built to handle non-trivial constraints.

// The contract is reified as a node attached to the contractor and
// the agency; the new relationship labels here are illustrative.
MATCH (baltimoreAgency:Agency {id: 8} )
CREATE
    (reginald:Employee:Contractor {
        id: 2,
        first_name: "Reginald",
        last_name: "Cousins"
    } )<-[:HAS_CONTRACTOR]-(:Contract {
        contract_number: 301,
        hourly_rate: 50.00
    } )-[:HAS_AGENCY]->(baltimoreAgency);

Another problem with label-based constraints is that there is no way to control the semantic grouping of labels. For instance, it is possible to create a node with the PermanentEmployee label that does not have the Employee label, or to create a node with the abstract Employee label but no label for a concrete type. Finally, there is no way to constrain the nodes that can act as the start- and end-nodes in a given relationship, for example preventing employees from being contracted from other employees (or anything else) instead of agencies. This means that there is no way to natively ensure relationship semantics.
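As a sketch of the label-grouping problem, both of the following nodes are accepted, despite each representing only half of the intended type hierarchy (the property values are illustrative):

// A concrete type without the parent Employee label.
CREATE (:PermanentEmployee {
    id: 3,
    first_name: "Cedric",
    last_name: "Daniels",
    start_date: date("2019-01-01"),
    salary: 90000.00
} );

// An "abstract" parent type with no concrete child label.
CREATE (:Employee {
    id: 4,
    first_name: "Lester",
    last_name: "Freamon"
} );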

As with document databases, graph databases are an excellent choice for storing semi-structured data but cannot enforce the constraints necessary to guarantee semantic integrity. This requires integrity to be similarly managed outside of the database, introducing additional complexities and points of failure into the application architecture, as well as complicating database design and maintenance.

Summary

Schemaless databases are unable to model the complex interdependent constraints necessary to ensure data integrity. In principle, there is nothing preventing such databases from introducing more expressive constraints; however, the lack of a schema leads to considerable practical issues. The use of schemas in relational databases allows complex constraints to be performantly enforced, because the database is able to create indexes on schema elements to optimise constraint management. Implementing constraints with the same level of expressivity in a schemaless database is not a viable solution, because performance would not be able to scale effectively.

As a result, schemaless databases can only enforce weaker constraints on inserted data, and so are unsuitable for use cases where data integrity is a priority. For other applications, they permit many conveniences that cannot easily be achieved with relational databases, particularly the storage and querying of polymorphic data. However, as will be examined in the next section, the lack of schema creates further problems that prevent data from being queried declaratively.

An object model for a simple filesystem

The previous example of a company’s employees was effective for demonstrating some of the intrinsic problems with representing object models with current database paradigms, but it was unrealistically simple. This section will examine a more practical model for a filesystem. While the filesystem being modeled is conceptually very simple, the object model (illustrated in Java) involves both inheritance and interfaces in order to naturally model its polymorphic nature, and so is necessarily complex. This ensures that the model adheres closely to the domain semantics, and permits trivial extensibility.

The model contains three top-level entity types: users, user groups, and resources (which are abstract). It also includes the additional entity types of admins, which are subtypes of users, and files, which are subtypes of resources. Each of these entity types is represented by a class in the object model.

Each entity is keyed by a different attribute: emails for users (and therefore admins), names for user groups, and paths for files. These attributes are also represented by classes, subtypes of an abstract ID class, which are then stored as fields in the classes of their associated entities. While it might seem more straightforward to model them as primitive fields, making them classes allows for more powerful operations to be performed on the filesystem, as will be demonstrated shortly. The attribute classes each expose a “key” interface that is implemented by the associated entity classes.

Finally, two relations are modeled: resource ownerships and group ownerships, which are both subtypes of an abstract ownership type. Resource ownerships are used to relate resources to their owners, which can be users (including admins) or user groups, while group ownerships are used to relate groups to their owners, which can only be admins. Each relation type is represented by a class, which exposes “owned” and “owner” interfaces.

The complete object model can be found here. It is worth acknowledging that the model is not exhaustive. For instance, there are no resources other than files, and it is impossible to make a user a member of a user group. This model is only intended to illustrate certain properties of the class structure, and so these trivial additions are excluded as they are not directly relevant to the discussion. More resource types could easily be added to the model by copying the structure of the file class, and likewise group memberships could be added by copying the structure of the ownership classes. The model is also not intended to serve as a fully-fledged application, and many important considerations (such as ensuring that everything has an appropriate access modifier, or ensuring that the re-assigning of ownerships after initial assignment is correctly handled) have been neglected for simplicity.
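Because only fragments of the model appear in the code below, here is a minimal sketch of the machinery the later operations rely on. The names follow the usage below, but the member definitions are assumptions; consult the full listing linked above for the actual code.

// A minimal sketch (each top-level class would live in its own file).
// IDs are modeled as classes, and each exposes a Key interface that is
// implemented by the entity type it identifies.
public abstract class Id {
    public final String value;

    protected Id(String value) {
        this.value = value;
    }

    // Implemented by any entity class keyed by an ID of type T.
    public interface Key<T extends Id> {
        T getId();
    }
}

public class Email extends Id {
    public Email(String value) {
        super(value);
    }

    // Implemented by the entity types keyed by an email: User and Admin.
    public interface Key extends Id.Key<Email> { }
}

// Ownership relations expose "owner" and "owned" interfaces, so an
// object's owner can be traversed without knowing its concrete type.
public abstract class Ownership {
    public interface Owner { }

    public interface Owned {
        Ownership.Owner getOwner();
        Ownership getOwnership();
    }
}

public class ResourceOwnership extends Ownership {
    // Implemented by User and UserGroup, the types that can own resources.
    public interface Owner extends Ownership.Owner { }

    // Implemented by Resource, and therefore by File.
    public interface Owned extends Ownership.Owned { }
}

// GroupOwnership follows the same pattern, with Admin as the only owner type.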

With the model in place, creating filesystem objects is quite simple.

// Instantiate an admin with the specified email.
Admin cedric = new Admin("cedric@vaticle.com");

// Instantiate a user with the specified email.
User jimmy = new User("jimmy@vaticle.com");

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
UserGroup engineers = new UserGroup("engineers", cedric);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
File benchmark = new File("/jimmy/benchmark-results.xlsx", jimmy);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
File roadmap = new File("/vaticle/feature-roadmap.pdf", engineers);

Retrieving data belonging to the objects is also simple. This will be examined through the lens of four query-like operations, each more complex than the last:

  1. Retrieve the ID of a given resource’s owner.
  2. Retrieve the IDs of all resources owned by a given user.
  3. Retrieve the types and IDs of all objects in the filesystem.
  4. Retrieve the details of every ownership in the file system, specifically:
    • The type of ownership (group or resource).
    • The type and ID of the owned object.
    • The type and ID of the object’s owner.

Query 1: Retrieve the ID of a given resource’s owner

// Retrieve the ID of the feature roadmap's owner.
System.out.println(((Id.Key<?>) roadmap.getOwner()).getId().value);

// Output:
// engineers

Now the reason behind making the attributes into classes becomes clear. In this query, we can easily retrieve the ID of the feature roadmap’s owner without knowing the type of the owner. While simple, this is immensely powerful as we can introduce new classes that implement the ResourceOwnership.Owner interface and still retrieve the IDs of all resource owners without having to modify any code (as long as everything implements a subtype of the Id.Key interface). Likewise in the following example, we retrieve the IDs of all resources owned by Jimmy without having to know their exact types.

Query 2: Retrieve the IDs of all resources owned by a given user

// Retrieve the IDs of all resources owned by Jimmy.
System.out.println(jimmy.getOwnedResources().stream()
    .map(resource -> ((Id.Key<?>) resource).getId().value)
    .collect(Collectors.toSet())
);

// Output:
// [ /jimmy/benchmark-results.xlsx ]

This makes it trivial to add new resource types to the system, as this code will still print out the IDs of Jimmy’s owned resources without having to be modified. If we only want to retrieve the IDs of a certain kind of resource, it is simple to do so by adding a filter stage to the stream pipeline. We can also perform even more abstract operations. In the next example, we retrieve the type and ID of every object in the filesystem.

Query 3: Retrieve the types and IDs of all objects in the filesystem

// Make a collection of all filesystem objects.
Set<Id.Key<?>> fileSystemObjects = Set.of(
    cedric, jimmy, engineers, benchmark, roadmap
);

// Retrieve the type and ID of all objects in the filesystem.
fileSystemObjects.stream()
    .map(object -> Map.of(
        "object-type", object.getClass().getSimpleName(),
        "object-id", ((Id.Key<?>) object).getId().value
    ))
    .forEach(System.out::println);

// Output:
// { object-id=/vaticle/feature-roadmap.pdf, object-type=File }
// { object-id=/jimmy/benchmark-results.xlsx, object-type=File }
// { object-id=engineers, object-type=UserGroup }
// { object-id=cedric@vaticle.com, object-type=Admin }
// { object-id=jimmy@vaticle.com, object-type=User }

Query 4: Retrieve the details of every ownership in the file system

// Retrieve the details of every ownership in the filesystem.
fileSystemObjects.stream()
    .filter(object -> object instanceof Ownership.Owned)
    .map(object -> (Ownership.Owned) object)
    .map(owned -> Map.of(
        "owned-id", ((Id.Key<?>) owned).getId().value,
        "owned-type", owned.getClass().getSimpleName(),
        "owner-id", (((Id.Key<?>) owned.getOwner()).getId()).value,
        "owner-type", owned.getOwner().getClass().getSimpleName(),
        "ownership-type", owned.getOwnership().getClass().getSimpleName()
    ))
    .forEach(System.out::println);

// Output:
// {
//     owned-id=/vaticle/feature-roadmap.pdf,
//     ownership-type=ResourceOwnership,
//     owner-type=UserGroup,
//     owner-id=engineers,
//     owned-type=File
// }
// {
//     owned-id=/jimmy/benchmark-results.xlsx,
//     ownership-type=ResourceOwnership,
//     owner-type=User,
//     owner-id=jimmy@vaticle.com,
//     owned-type=File
// }
// {
//     owned-id=engineers,
//     ownership-type=GroupOwnership,
//     owner-type=Admin,
//     owner-id=cedric@vaticle.com,
//     owned-type=UserGroup
// }

Because the GroupOwnership and ResourceOwnership classes are subtypes of the Ownership class, we can retrieve the set of all ownerships in the filesystem in one go. Queries 3 and 4 function without knowing any of the exact types in the filesystem as they operate only on interface implementations. No matter how many new types we add to the filesystem, these operations will continue to function and give correct results. This justifies the seemingly unnecessary technical overhead of the object model, as it is perfectly extensible without needing to refactor existing classes. Even if the filesystem increases in complexity, the design pattern of the object model will never become more complex than it already is. This is because the object model has been designed with the conceptual model of the filesystem domain in mind, which polymorphism is an integral part of.

Relational implementation

In order to implement the filesystem model in a relational database, we will use the class-table inheritance pattern, as it is the most extensible.

CREATE TABLE users (
    email TEXT PRIMARY KEY
);

CREATE TABLE admins (
    email TEXT PRIMARY KEY REFERENCES users(email)
);

CREATE TABLE user_groups (
    name TEXT PRIMARY KEY
);

CREATE TABLE resources (
    id TEXT PRIMARY KEY
);

CREATE TABLE files (
    path TEXT PRIMARY KEY REFERENCES resources(id)
);

CREATE TABLE ownerships (
    id SERIAL PRIMARY KEY
);

CREATE TABLE resource_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE user_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_id TEXT REFERENCES users(email),
    resource_id TEXT REFERENCES resources(id)
);

CREATE TABLE user_group_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_group_id TEXT REFERENCES user_groups(name),
    resource_id TEXT REFERENCES resources (id)
);

CREATE TABLE group_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE admin_of_group_ownerships (
    id INT PRIMARY KEY REFERENCES group_ownerships(id),
    admin_id TEXT REFERENCES admins(email),
    user_group_id TEXT REFERENCES user_groups(name)
);

The ownership interfaces present an interesting challenge here. As each foreign key column can only reference a column in a single other table, we have to create one table for users that own resources, one for user groups that own resources, and one for admins that own groups. The last one might seem extraneous, but it is important for extensibility, as we might add other types that can own groups in the future. With the schema defined, we can insert the data.

-- Instantiate an admin with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('cedric@vaticle.com');

    INSERT INTO admins (email)
    VALUES ('cedric@vaticle.com');
END $$;

-- Instantiate a user with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('jimmy@vaticle.com');
END $$;

-- Instantiate a user group with the specified name,
-- and assign the admin Cedric as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO user_groups (name)
    VALUES ('engineers');

    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;

    INSERT INTO group_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO admin_of_group_ownerships (id, admin_id, user_group_id)
    VALUES (ownership_id, 'cedric@vaticle.com', 'engineers');
END $$;

-- Instantiate a file with the specified path,
-- and assign the user Jimmy as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO files (path)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_of_resource_ownerships (id, user_id, resource_id)
    VALUES (ownership_id, 'jimmy@vaticle.com', '/jimmy/benchmark-results.xlsx');
END $$;

-- Instantiate a file with the specified path,
-- and assign the Engineers group as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO files (path)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_group_of_resource_ownerships (id, user_group_id, resource_id)
    VALUES (ownership_id, 'engineers', '/vaticle/feature-roadmap.pdf');
END $$;

Because creating any filesystem object (for example, an admin) involves inserting data into multiple tables (the users and admins tables for an admin), the queries need to be enclosed in transactions to ensure consistency. In the application code, constructor arguments enforce that resources and user groups cannot be created without an owner; here, the corresponding transactions enforce the same invariant. In this way, atomic operations in the application are also atomic operations in the database. The transactions are also necessary to create a scope for saving the auto-generated ownership IDs so they can be shared across tables. The second operation (creating the user Jimmy) is the only one that doesn't require a transaction wrapper, as it's already atomic, but we have given it one anyway to maintain a consistent approach to data insertion.

Query 1: Retrieve the ID of a given resource’s owner

-- Retrieve the ID of the feature roadmap's owner.
SELECT owner_id
FROM (
    SELECT
        resource_ownerships.id AS ownership_id,
        user_of_resource_ownerships.user_id AS owner_id,
        user_of_resource_ownerships.resource_id AS resource_id
    FROM resource_ownerships
    JOIN user_of_resource_ownerships USING (id)
    UNION
    SELECT
        resource_ownerships.id AS ownership_id,
        user_group_of_resource_ownerships.user_group_id AS owner_id,
        user_group_of_resource_ownerships.resource_id AS resource_id
    FROM resource_ownerships
    JOIN user_group_of_resource_ownerships USING (id)
) AS resource_ownerships
WHERE resource_id = '/vaticle/feature-roadmap.pdf';

As with the application code, we need to retrieve the ID of the roadmap’s owner without knowing the owner’s type. Because the owner could be either a user or a user group, we need to join both tables and union the results. Assuming data integrity has been maintained, we should only get one result.

Query 2: Retrieve the IDs of all resources owned by a given user

-- Retrieve the IDs of all resources owned by Jimmy.
SELECT user_of_resource_ownerships.resource_id AS resource_id
FROM resource_ownerships
JOIN user_of_resource_ownerships USING (id)
WHERE user_of_resource_ownerships.user_id = 'jimmy@vaticle.com';

This query is significantly simpler than the previous one, as we know the type of both the owner (a user) and the owned object (a resource). The class-table structure is particularly useful here, as we’re only interested in the ID, which is stored in the resources parent table.

Query 3: Retrieve the types and IDs of all objects in the filesystem

-- Retrieve the type and ID of all objects in the filesystem.
SELECT object_type, object_id
FROM (
    SELECT
        'user' AS object_type,
        email AS object_id
    FROM users
    WHERE email NOT IN (
        SELECT email
        FROM admins
    )
    UNION
    SELECT
        'admin' AS object_type,
        email AS object_id
    FROM admins
    UNION
    SELECT
        'user_group' AS object_type,
        name AS object_id
    FROM user_groups
    UNION
    SELECT
        'file' AS object_type,
        path AS object_id
    FROM files
) AS objects;

Retrieving all the objects in the filesystem requires a union of several tables, in which the object type (corresponding to the name of the table) must be hardcoded into the query. While the resources table can be ignored, as it represents an abstract entity type, the users table represents an entity type that is both concrete and extended (by the admin type). As a result, an additional condition is needed so that admins are not returned twice: once as users and once as admins. We choose to return them with the type “admin”, as it is the more specialised type.

Query 4: Retrieve the details of every ownership in the file system

-- Retrieve the details of every ownership in the filesystem.
SELECT
    ownership_type,
    owned_type,
    owned_id,
    owner_type,
    owner_id
FROM (
    SELECT
        'group_ownership' AS ownership_type,
        'user_group' AS owned_type,
        admin_of_group_ownerships.user_group_id AS owned_id,
        'admin' AS owner_type,
        admin_of_group_ownerships.admin_id AS owner_id
    FROM ownerships
    JOIN admin_of_group_ownerships USING (id)
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user' AS owner_type,
        user_of_resource_ownerships.user_id AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    WHERE user_of_resource_ownerships.user_id NOT IN (
        SELECT admins.email
        FROM admins
    )
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'admin' AS owner_type,
        admins.email AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    JOIN admins ON admins.email = user_of_resource_ownerships.user_id
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user_group' AS owner_type,
        user_group_of_resource_ownerships.user_group_id AS owner_id
    FROM ownerships
    JOIN user_group_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_group_of_resource_ownerships.resource_id
) AS ownerships;

Once again, we have to rely on unions to return all of the results. Because of the multiple implementations of the Owned and Owner interfaces in the object model, the number of unions required will be equal to the number of combinations of concrete classes that implement the interfaces, whether directly or by inheritance. We also need the special condition on users to ensure admins are not returned twice.

Ownership.Owned implementation            | Ownership.Owner implementation
UserGroup implements GroupOwnership.Owned | Admin implements GroupOwnership.Owner
File implements ResourceOwnership.Owned   | User implements ResourceOwnership.Owner
File implements ResourceOwnership.Owned   | Admin implements ResourceOwnership.Owner
File implements ResourceOwnership.Owned   | UserGroup implements ResourceOwnership.Owner

Concrete implementations of ownership interfaces in the filesystem object model, both directly and by inheritance.

Document implementation

In order to implement the filesystem model in a document database, we will use the polymorphic design pattern. Adhering to the paradigm’s best practices, ownerships will be recorded by nesting a reference to the owner in the owned object. As constraints have previously proven impractical to use effectively, we’ll leave them out of the discussion and focus on the querying.

// Instantiate an admin with the specified email.
db.users.insert( {
    "email": "cedric@vaticle.com",
    "user_type": "admin"
} )

// Instantiate a user with the specified email.
db.users.insert( {
    "email": "jimmy@vaticle.com",
    "user_type": "user"
} )

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
owner_id = db.users.find( { "email": "cedric@vaticle.com" } ).next()["_id"]
db.groups.insert( {
    "name": "engineers",
    "group_type": "user_group",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
owner_id = db.users.find( { "email": "jimmy@vaticle.com" } ).next()["_id"]
db.resources.insert( {
    "path": "/jimmy/benchmark-results.xlsx",
    "resource_type": "file",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
owner_id = db.groups.find( { "name": "engineers" } ).next()["_id"]
db.resources.insert( {
    "path": "/vaticle/feature-roadmap.pdf",
    "resource_type": "file",
    "owner": owner_id
} )

In order to store the exact type of every object, we’ve created a type field in every collection. If the type hierarchy had more levels, this would involve creating additional fields, or using a list field to contain all the appropriate types, as collections can only embody a single type. The built-in _id field is used by the database to uniquely identify each document, and to record cross-collection references and perform lookups, but it is not a meaningful piece of information in its own right. In the following discussion, the term “ID” refers to the IDs in the original object model (emails, names, paths) rather than this specific implementation’s _id field.
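As a sketch of that list-field alternative (an assumption, not how this implementation stores types), an admin document in a deeper hierarchy might record its whole type chain:

// Hypothetical variant: a list field recording every type in the
// hierarchy, rather than a single most-specialised type.
db.users.insert( {
    "email": "cedric@vaticle.com",
    "user_types": [ "user", "admin" ]
} )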

Query 1: Retrieve the ID of a given resource’s owner

// Retrieve the ID of the feature roadmap's owner.
db.resources.aggregate( [
    { "$match": { "path": "/vaticle/feature-roadmap.pdf" } },
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_owners"
    } },
    { "$lookup": {
        "from": "groups",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_group_owners"
    } },
    { "$addFields": { "owners": { "$concatArrays": [ "$user_owners", "$user_group_owners" ] } } },
    { "$unwind": "$owners" },
    { "$project": {
        "_id": false,
        "owner_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "$owners.name" }
        ] } }
    } }
] )

This query involves cross-collection lookups, so it requires an aggregation pipeline. Considered at a high level, there are two main steps. The first is to perform lookups on each applicable collection, in this case users and groups, as the owner’s collection is not known, and then concatenate the results of both into a single field. The second is to extract the owner ID from each answer with a projection, which requires a switch statement as the ID field differs between owner types. An alternative approach would be to nest the projections as sub-pipelines within the lookups, which achieves the same result with roughly equal complexity. We choose this approach because it places all of the projections within a single pipeline element, making it easier to extend and debug.

Query 2: Retrieve the IDs of all resources owned by a given user

// Retrieve the IDs of all resources owned by Jimmy.
db.users.aggregate( [
    { "$match": { "email": "jimmy@vaticle.com" } },
    { "$lookup": {
        "from": "resources",
        "localField": "_id",
        "foreignField": "owner",
        "as": "owned_resources"
    } },
    { "$unwind": "$owned_resources" },
    { "$project": {
        "_id": false,
        "resource_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owned_resources.resource_type", "file" ] }, "then": "$owned_resources.path" }
        ] } }
    } }
] )

This query is a bit shorter than the last one, as we only need to perform a lookup on one collection. With the current data model, the switch statement is actually unnecessary, as we only have a single resource type, but this could easily change in the future, so we keep it in for extensibility.
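To illustrate that extensibility, here is how the projection might look if a hypothetical repository resource type, identified by a name, were added; the "repository" type and "name" field are placeholder assumptions.

// Hypothetical projection: the switch simply gains a branch for the
// assumed repository type.
{ "$project": {
    "_id": false,
    "resource_id": { "$switch": { "branches": [
        { "case": { "$eq": [ "$owned_resources.resource_type", "file" ] }, "then": "$owned_resources.path" },
        { "case": { "$eq": [ "$owned_resources.resource_type", "repository" ] }, "then": "$owned_resources.name" }
    ] } }
} }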

Query 3: Retrieve the types and IDs of all objects in the filesystem

// Retrieve the type and ID of all objects in the filesystem.
db.users.aggregate( [
    { "$unionWith": { "coll": "groups" } },
    { "$unionWith": { "coll": "resources" } },
    { "$addFields": {
        "object_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$user_type", "user" ] }, "then": "user" },
            { "case": { "$eq": [ "$user_type", "admin" ] }, "then": "admin" },
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "user_group" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "file" }
        ] } }
    } },
    { "$project": {
        "_id": false,
        "object_type": true,
        "object_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$user_type", "user" ] }, "then": "$email" },
            { "case": { "$eq": [ "$user_type", "admin" ] }, "then": "$email" },
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "$name" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "$path" }
        ] } }
    } }
] )

To retrieve all filesystem objects, we can easily union all relevant collections together. We then need two switch statements, one to get the type of each object and one to get its ID.

Query 4: Retrieve the details of every ownership in the file system

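// Retrieve the details of every ownership in the filesystem.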
db.groups.aggregate( [
    { "$addFields": { "ownership_type": "group_ownership" } },
    { "$unionWith": {
        "coll": "resources",
        "pipeline": [ { "$addFields": { "ownership_type": "resource_ownership" } } ]
    } },
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_owners"
    } },
    { "$lookup": {
        "from": "groups",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_group_owners"
    } },
    { "$addFields": { "owners": { "$concatArrays": [ "$user_owners", "$user_group_owners" ] } } },
    { "$unwind": "$owners" },
    { "$addFields": {
        "owned_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "user_group" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "file" }
        ] } }
    } },
    { "$addFields": {
        "owned_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "$name" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "$path" }
        ] } }
    } },
    { "$addFields": {
        "owner_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "user" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "admin" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "user_group" }
        ] } }
    } },
    { "$project": {
        "_id": false,
        "ownership_type": true,
        "owned_type": true,
        "owned_id": true,
        "owner_type": true,
        "owner_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "$owners.name" }
        ] } }
    } }
] )

There’s a lot to unpack in this query. The first step is to union the groups and resources collections, as those contain the two types that can be owned. We also add a field for the ownership type at this stage, as it depends only on the parent type each collection represents, rather than on the more specialised types within it. Afterwards, we perform lookups on both the users and groups collections, and concatenate the results. Finally, we can assemble the fields required for the final projection. This involves four switch statements: one each for the type and ID of the owned object, and one each for those of the owner.

Graph implementation

With the graph implementation, there is only one sensible design pattern, due to the ability to assign multiple labels to nodes. It might need to change if we ever needed multiple labels for relationships (requiring reification, as sketched below), but that doesn’t appear to be necessary at this stage or in the foreseeable future of the model. As with the document implementation, we’ll leave constraints out of the picture for now.
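Should reification become necessary, an ownership might instead be modelled along the lines of the following sketch, in which the Ownership node and the relationship type names are assumptions rather than part of this implementation.

// Hypothetical reification: the OWNS relationship is replaced by an
// Ownership node, which can then carry multiple labels.
MATCH
    (cedric:Admin {email: "cedric@vaticle.com"}),
    (engineers:UserGroup {name: "engineers"})
CREATE
    (ownership:Ownership:GroupOwnership),
    (cedric)-[:IS_OWNER]->(ownership),
    (engineers)-[:IS_OWNED]->(ownership);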

// Instantiate an admin with the specified email.
CREATE (cedric:User:Admin {email: "cedric@vaticle.com"});

// Instantiate a user with the specified email.
CREATE (jimmy:User {email: "jimmy@vaticle.com"});

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
MATCH (cedric:Admin {email: "cedric@vaticle.com"})
CREATE
    (engineers:UserGroup {name: "engineers"}),
    (cedric)-[:OWNS {ownership_type: "GroupOwnership"}]->(engineers);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
MATCH (jimmy:User {email: "jimmy@vaticle.com"})
CREATE
    (benchmark:Resource:File {path: "/jimmy/benchmark-results.xlsx"}),
    (jimmy)-[:OWNS {ownership_type: "ResourceOwnership"}]->(benchmark);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
MATCH (engineers:UserGroup {name: "engineers"})
CREATE
    (roadmap:Resource:File {path: "/vaticle/feature-roadmap.pdf"}),
    (engineers)-[:OWNS {ownership_type: "ResourceOwnership"}]->(roadmap);

Compared to the relational and document implementations, this one is pretty straightforward and requires little explanation. The only place where we’ve had to depart from the object model is in the storage of ownership types as properties of OWNS relationships rather than as additional labels, in order to avoid reification.

Query 1: Retrieve the ID of a given resource’s owner

// Retrieve the ID of the feature roadmap's owner.
MATCH
    (roadmap:File {path: "/vaticle/feature-roadmap.pdf"}),
    (owner)-[:OWNS]->(roadmap)
UNWIND labels(owner) AS owner_type
WITH
    owner, owner_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE owner_type IN keys(id_type_map)
AND id_type_map[owner_type] IN keys(owner)
RETURN owner[id_type_map[owner_type]] AS owner_id;

This query does something quite powerful. Unlike the relational and document queries where we needed to specify the tables or collections in which to search for the roadmap’s owner, here we can retrieve the owner node without knowing anything about it other than the fact it owns the roadmap. This is very similar to the application code in which we can retrieve the owner object without knowing its type. However, while in the application code we can retrieve the owner’s ID without knowing its actual type, here we need (the equivalent of) a switch statement in order to select the right property of the owner node based on its type.

Query 2: Retrieve the IDs of all resources owned by a given user

// Retrieve the IDs of all resources owned by Jimmy.
MATCH
    (jimmy:User {email: "jimmy@vaticle.com"}),
    (jimmy)-[:OWNS]->(resource:Resource)
UNWIND labels(resource) AS resource_type
WITH
    resource, resource_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE resource_type IN keys(id_type_map)
AND id_type_map[resource_type] IN keys(resource)
RETURN resource[id_type_map[resource_type]] AS resource_id;

This query is nearly identical in structure to the previous one, except that now we specify the label of the node we’re interested in, specifically Resource.

Query 3: Retrieve the types and IDs of all objects in the filesystem

// Retrieve the type and ID of all objects in the filesystem.
MATCH (object)
UNWIND labels(object) AS object_type
WITH object, object_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE object_type IN keys(id_type_map)
AND id_type_map[object_type] IN keys(object)
AND NOT (
     object_type = "User"
     AND "Admin" IN labels(object)
)
RETURN object_type, object[id_type_map[object_type]] AS object_id;

Once again we’re doing something pretty powerful here by querying all nodes without specifying their labels, this time without any conditions. However, as with the relational implementation, we need to filter out results where admins would be returned as users, because the unwind operation will return both User and Admin as Cedric’s types, generating two results where we only want one.

Query 4: Retrieve the details of every ownership in the file system

// Retrieve the details of every ownership in the filesystem.
MATCH (owner)-[ownership:OWNS]->(owned)
UNWIND labels(owned) AS owned_type
UNWIND labels(owner) AS owner_type
WITH ownership, owned, owned_type, owner, owner_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE owned_type IN keys(id_type_map)
AND id_type_map[owned_type] IN keys(owned)
AND owner_type IN keys(id_type_map)
AND id_type_map[owner_type] IN keys(owner)
AND NOT (
    owner_type = "User"
    AND "Admin" IN labels(owner)
)
RETURN
    ownership.ownership_type AS ownership_type,
    owned_type,
    owned[id_type_map[owned_type]] AS owned_id,
    owner_type,
    owner[id_type_map[owner_type]] AS owner_id;

For this final query, we don’t need to do anything that we haven’t already done in the previous queries. We retrieve nodes without specifying their labels, we unwind the two lists of labels, we use (the equivalent of) a switch statement to retrieve the correct ID types for the two labels in each result, and we filter out users who are also admins.

Discussion

There are some common themes across the database paradigms in the implementation of the object model and the four queries examined. As logical model, schema, and constraint design have been discussed sufficiently in previous sections, this discussion will focus on data insertion and querying.

Data insertion

Data insertion is significantly more complicated in the relational implementation than in the document and graph ones due to the chosen class-inheritance pattern. Atomic objects in the object model become divided over multiple tables, requiring insertions to be enclosed in a transaction to maintain atomicity. In the worst case, rows must be inserted into five tables to create a file. Considering the simplicity of the object model, it is easy to see how this will be difficult to scale, potentially requiring hundreds of tables to be modified for a filesystem of realistic complexity. In comparison, the insert queries for the document and graph implementations seem relatively simple.

Query 1: Retrieve the ID of a given resource’s owner

Turning to the first query, the relational and document implementations both need to enumerate the tables or collections in which the owner might be found, and perform a union (or equivalent) of the results. This approach is not easily extensible: whenever new types that can own resources are added to the filesystem, these queries must be modified, or they will produce incomplete results. The graph implementation does not have this problem, as it queries the owner without specifying its type, in the same way as the application code. The query leverages interface polymorphism by finding entities that implement specific interfaces, in this case the Ownership.Owner interface via the start-node of the OWNS relationship type.

However, all three implementations need a switch statement (or equivalent) in order to retrieve the owner’s ID. This is less obvious in the relational implementation, as the projections are performed in each branch of the union. In fact, this was the alternative approach available in the document implementation, illustrating the mathematical equivalence of the two operations. The graph implementation is not able to leverage the Id.Key interface, which leaves it in the same position as the relational and document implementations. Like the union, the switch statement is not particularly extensible, requiring modification every time new resource owner types are added to the system.

Query 2: Retrieve the IDs of all resources owned by a given user

In the second query, the type of the target entity is known to be a resource, so unions are not necessary in any implementation. Both the document and graph implementations require a switch statement to identify the correct ID type, on the assumption that further resource types with different ID types will be added in the future (for instance, a repository with a name). The relational implementation does not require one: the resource IDs can be queried polymorphically because, regardless of their attribute types (denoted by column names) in each of the child tables for resource types, they are all stored in the same column of the parent table and so share a single type there. This is likewise a form of interface polymorphism.

This is a nice advantage of the class-table pattern, but it is important to note that the values stored in the parent and child tables are only values, and do not have the same object identity. As columns in the relational paradigm must have a single data type, this also restricts this form of polymorphism to cases where all object IDs have the same data type. If this is no longer the case, the IDs will need to be stored in the child tables, with a second, meaningless ID used for maintaining references, and can no longer be queried polymorphically.
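A minimal sketch of that fallback, assuming a hypothetical repository type identified by a name (neither the surrogate key pattern nor the repositories table is part of the implementation above):

-- Hypothetical fallback: the parent table holds only a surrogate key,
-- and each child table stores its own natural ID.
CREATE TABLE resources (
    id SERIAL PRIMARY KEY
);

CREATE TABLE files (
    id INT PRIMARY KEY REFERENCES resources(id),
    path TEXT UNIQUE
);

CREATE TABLE repositories (
    id INT PRIMARY KEY REFERENCES resources(id),
    name TEXT UNIQUE
);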

Query 3: Retrieve the types and IDs of all objects in the filesystem

The third query is significantly greater in scope than the previous two. While in those we knew the types of some of the elements involved, here we are instead asking for those types, along with IDs whose types we do not know. This query is much higher level, but it is no more conceptually complex than the previous two, requiring roughly the same number of words to phrase in natural language, and being only marginally more complex in the application code. However, in both the relational and document implementations, this query becomes significantly more complex. A union must be taken of all tables or collections representing entity types, and switch statements (or equivalent) are needed for both the types and the IDs of objects.

Meanwhile, the graph implementation saves on complexity by retrieving all nodes at once, though it must still utilise the switch statement. The ability to retrieve all nodes without specifying labels or any interface implementations is a form of parametric polymorphism, as the node type has been implicitly parametrised. However, this assumes that all nodes represent filesystem entities. If reified relationships are required anywhere, either to express complex relationship types (n-ary, nested, or undirected) or to allow relationships to have multiple labels (perhaps for constraint implementation), then this approach can no longer be taken, and an explicit disjunction over all valid filesystem entity types would be required, as sketched below.
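A minimal sketch of what that disjunction might look like, assuming hypothetical reified ownership nodes now share the graph with the entity nodes:

// Hypothetical: with reified relationship nodes present, every valid
// filesystem entity label must be enumerated explicitly.
MATCH (object)
WHERE object:User OR object:UserGroup OR object:Resource
RETURN object;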

The relational and graph implementations have also gained an additional condition to ensure that admins are not also returned as users. This is because admins occur in both the users and admins tables in the relational implementation, and have both the User and Admin labels in the graph implementation. We have been able to express that admins have multiple types, but neither implementation can recognise that one of those types is a subtype of the other, and so should not yield a separate result. Neither database paradigm is able to leverage inheritance polymorphism in this way. It might appear that the document database can, but this is not the case: the user_type field only contains a single value (the most specialised type), so it is likewise blind to the inheritance hierarchy. If we had instead chosen to store a list field user_types, then the same problem would occur.

Query 4: Retrieve the details of every ownership in the file system

The fourth query is somewhat more complex than the previous ones, as we are retrieving five different attributes for each result. This is reflected in the full natural language description:

“Retrieve the details of every ownership in the file system, specifically: the type of ownership (group or resource), the type and ID of the owned object, and the type and ID of the object’s owner.”

It is also reflected in the increased complexity of the application code, which must make heavy use of the Ownership.Owned interface and its methods. In each database implementation, all of the previous issues are combined and compounded. The relational and document implementations require unions across all the object types, the relational and graph implementations require the additional condition to prevent admins being returned twice, and all three require switch statements (or equivalent) in order to return the correct ID types for each object.

The relational implementation is particularly egregious. While only four branches are combined in the union for this simple filesystem, one branch is needed for each pair of concrete types that implement Ownership.Owned and Ownership.Owner, so the number of branches grows combinatorially with the number of types that can be involved in an ownership. If there were ten concrete types that implemented Ownership.Owned, and ten that implemented Ownership.Owner, then a total of one hundred branches would be required in the union.

Summary

Despite adhering to recommended design patterns for each database paradigm, all three implementations clearly exhibit several issues in their queries. This is due to the polymorphism inherent in the data model. The document database did not display any polymorphic querying capabilities as, unlike the relational and graph paradigms, the document paradigm is not based on any strong mathematical formalism. While the relational and graph paradigms exhibited some limited polymorphic querying capabilities, neither is able to generally leverage inheritance polymorphism, interface polymorphism, and parametric polymorphism across all queries and in all circumstances, as the polymorphism in their formalisms is limited in scope.

The biggest problem with the lack of polymorphic querying capabilities is that the queries are required to return polymorphic results. Taking Query 3 as an example, there are five results, in which each pair of object type and ID is a different type from the others. In order to return polymorphic results without polymorphic querying, semantic context must be hardcoded into the queries. For example, in the relational implementation of Query 3, if the query is translated into a natural language question, it would be:

Retrieve the string “user” and the user email for users in the filesystem that are not admins, the string “admin” and the user email for users in the filesystem that are admins, the string “user_group” and the user group name for user groups in the filesystem, and the string “file” and the file path for files in the filesystem.

This is completely different to the original natural language query:

Retrieve the types and IDs of all objects in the filesystem.

In effect, the query has been irreversibly translated from its original declarative form into an implementation-specific imperative form. This is the case for all four queries in all three database paradigms (to different degrees), and is in contrast to the application code, in which the query operation actually captures the natural-language expression.

This has severe consequences for the maintainability and extensibility of the database implementation. The queries only express the question being asked of the database for the specific data model at the time they were written. If the data model is modified or extended, they cease to function as intended, and possibly cease to function at all, leading to intrinsically brittle application architectures. As a result, engineers are forced to manually implement and maintain polymorphism in their database models and queries in order to maintain parity with their application code.

Conclusion

Current database paradigms have been left far behind programming languages in expressivity. While general-purpose programming languages have continually evolved to become more abstract and declarative, database query languages have been unable to keep up. The inability of contemporary databases to model or express the polymorphism inherent in natural data domains causes a number of problems for engineers:

  • Semantic integrity of data cannot be natively enforced.
  • While queries can return polymorphic results, polymorphic queries cannot be natively expressed.
  • Queries are long, complex, and imperative in the way they access data.
  • Domain semantics must be hardcoded into queries, requiring continual maintenance as the data model is extended.

The difficulty of maintaining polymorphic data models in databases has led to the widespread use of object-relational mappers (ORMs). These manage the translation between the application and database models through object-oriented APIs, ensuring semantic integrity and enabling polymorphic querying. Contrary to their name, ORMs are now available for many different kinds of database. However, ORMs have been poorly regarded by seasoned engineers for as long as they have existed. They do not offer the full expressivity of a native database query language, often cannot generate data models and queries with optimal performance, are typically tied to specific database technologies, and introduce an additional layer of complexity and overhead to the database architecture. A full discussion of these problems is outside the scope of this article, and will be given in-depth coverage in a future one.

If it is possible for semantic integrity to be enforced in an application, why should this not be possible in a database? If it is possible to natively express a natural-language polymorphic query in application code, why should this not be possible for a database query language? If it is possible to write elegant, powerful, and declarative code, why should this not be possible for queries? If application code can be made generic enough to function even when the class hierarchies are extended, why should it not be possible to do the equivalent in a database?

In order to enable all these things, a database would have to implement polymorphism as a primary component of its design. If such a database fully implemented inheritance, interface, and parametric polymorphism, then it would be possible to enforce semantic integrity and express declarative polymorphic queries, without compromising on maintainability, extensibility, or performance. Engineers would no longer need to manage polymorphism themselves, nor would they need to resort to relying on ORMs. Such a database would be a polymorphic database.

TypeDB is such a polymorphic database. It features a conceptual data model, a strong subtyping system, a symbolic reasoning engine, and a beautiful and elegant type-theoretic language: TypeQL. Together, these features enable the expressivity and safety necessary to express the polymorphism in natural data domains without compromises, bringing a new polymorphic paradigm into the range of contemporary general-purpose database paradigms.
