TypeDB Learning Center

Why We Need a Polymorphic Database


Databases have come a long way since the navigational databases of the 1960s. The relational, document, and graph database paradigms have each profoundly impacted the way we store and query data, but many fundamental engineering challenges remain. Why is object-relational mismatch still a problem after 50 years? Why do we need ORMs to manage our data models for us? Why do queries need to be updated when the data model is extended, even if the question being asked remains the same?

Each successive database paradigm has only implemented workarounds to these problems without addressing the underlying cause: none can express many of the OOP features that we take for granted, like abstraction, inheritance, and polymorphism. Programming languages have continuously become more powerful and declarative as they have evolved, but database query languages have not significantly advanced beyond SQL. This has left databases unable to directly implement object models, enforce semantic integrity of data, or resolve declarative polymorphic queries.

This article attempts to answer the above questions by conducting an in-depth exploration of the strengths and weaknesses of three of the most popular current database paradigms and how they deal with the polymorphism intrinsic to object models that represent natural data domains. It begins by examining the historical context and design philosophies behind the relational, document, and graph database paradigms to understand their capabilities and the conditions under which object model mismatches occur. It then shows how semantic integrity of polymorphic data cannot be enforced in those paradigms with a simple class inheritance hierarchy. Finally, using an object model for a lightweight filesystem application as a benchmark, the stark difference in declarativity between application code and database queries is demonstrated.

Throughout this article, code examples for relational, document, and graph databases will use PostgreSQL, MongoDB, and Neo4j syntax respectively. The application code for the filesystem is written in Java.

Object model mismatch

A wide variety of database paradigms have been conceived since the onset of the digital age, ranging from the navigational databases of the 1960s to the multi-model databases of today. Each of those paradigms has built on previous ones to account for the continually increasing scope and complexity of modern data, but many challenges in data engineering seem to evade any solution. Among those is the infamous object model mismatch, most commonly referred to in the context of relational databases as object-relational mismatch, though equivalent mismatches are prevalent across most contemporary database paradigms.

This article will focus on three paradigms that have seen widespread adoption in commercial software engineering for high-level application development: the relational, document, and graph paradigms. However, the issues that will be discussed are representative of most other paradigms. Because the paradigms have very different logical data models, object model mismatch manifests under different conditions in each, but the same fundamental problems result whenever it occurs.

Object-relational mismatch

The first versions of relational database theory and SQL were conceived in the early 1970s when procedural programming languages dominated computer science and information technology. Codd’s set-theoretic relational algebra (Codd, 1969) is the foundation of SQL, serving as both the data definition language (DDL) and data manipulation language (DML). In a relational database, a schema provides structural integrity, ensuring that data cannot be corrupted and that queries can be analyzed for correctness.

Relational databases were a turning point for data persistence. Previous database paradigms were navigational (though this encompassing term for pre-relational databases was not coined until later by Bachman, 1973), and accessed data imperatively. Users would have to describe how to iterate (or navigate) from each record to the next, by following links between them. In comparison, relational databases function declaratively: the user describes the qualifiers that define a record to retrieve, and the database returns all matching records.

Relational databases quickly became embedded in almost every digitized industry as the prevailing convention replacing navigational databases, with SQL becoming an ANSI standard in 1986 and an ISO standard in 1987. This was prior to the widespread adoption of object-oriented programming (OOP), spearheaded by the release of C++ in 1986 (itself not standardized by the ISO until 1998). As a result, the most common general-purpose database paradigm made no account of what would become the most common general-purpose programming paradigm. This presented a cumbersome challenge to programmers: the relational and object models were fundamentally incompatible, requiring data structures to be translated from one to the other when transferring data between the application and the database. This disparity is known as object-relational mismatch.

Basics of relational modeling

In relational algebra, all data is described as tuples within relations, which map attributes onto values. These are practically implemented under better-known terms, as rows within tables, which map column names onto their values for the given row. Relations can be combined and transformed using set operations like intersections (known as inner-joins), unions, cartesian products (cross-joins), and projections. The resulting sets are known as derived relations in the algebra, and as views in databases. The result of a query is always technically a view, even if it only projects all the columns of a single table.

Relational databases | Set-theoretic algebra
Table                | Relation
Column               | Attribute
Row                  | Tuple
View                 | Derived relation
The basic components of a relational database, as described by Codd’s set-theoretic relational algebra (1969).

One element of relational databases that does not originate in Codd’s algebra is the foreign key: a constraint on the values of attributes in a particular table’s column to values that already exist in another table’s column. This is used to guarantee referential integrity of data dependencies between table rows. To retrieve dependent data instances together, the two tables must then be joined on the columns linked by the foreign key, though it is not essential to leverage a foreign key when creating joins. Like the structure of the tables themselves, the structure of the foreign keys is also enforced by the schema.
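
For illustration, here is a minimal sketch of a foreign key and the join it enables; the tables here are hypothetical rather than part of the running examples.

CREATE TABLE departments (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE staff (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    -- Each value in this column must already exist in departments.id.
    department_id INT NOT NULL REFERENCES departments(id)
);

-- Retrieving dependent data requires joining on the foreign key.
SELECT staff.name AS staff_name, departments.name AS department_name
FROM staff
JOIN departments ON departments.id = staff.department_id;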

Representing multivalued attributes

Object-relational mismatch arises from the fact that tuples are one-dimensional, and so cannot represent multi-dimensional objects. The most common example of this phenomenon is in representing multivalued attributes, such as a user in a system with multiple email addresses.

INSERT INTO users (id, first_name, last_name, emails)
VALUES (DEFAULT, 'Thomas', 'Carcetti', 'thomas.carcetti@vaticle.com;thomas@vaticle.com;tommy@vaticle.com');

The data in this example is unnormalized as multiple email values have been concatenated into a single string. This makes adding, retrieving, and removing individual addresses harder to do without corrupting the data. In order to avoid such bad modeling practices, the data is normalized to ensure every value in every table row is atomic. In this particular case, a new user_emails table is created to store the emails individually, with a foreign key referencing the users table.
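
A minimal sketch of the normalized schema assumed by the following transaction (the emails column is dropped from users, and the column types are illustrative):

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);

CREATE TABLE user_emails (
    user_id INT NOT NULL REFERENCES users(id),
    email TEXT NOT NULL,
    PRIMARY KEY (user_id, email)
);

With these tables in place, inserting a user and their emails becomes a multi-statement operation.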

DO $$
DECLARE
    inserted_user_id INT;
BEGIN
    INSERT INTO users (id, first_name, last_name)
    VALUES (DEFAULT, 'Thomas', 'Carcetti')
    RETURNING id INTO inserted_user_id;
  
    INSERT INTO user_emails (user_id, email)
    VALUES
        (inserted_user_id, 'thomas.carcetti@vaticle.com'),
        (inserted_user_id, 'thomas@vaticle.com'),
        (inserted_user_id, 'tommy@vaticle.com');
END $$;

Inserting a user now requires multiple queries enclosed in a transaction, and retrieving all of a user’s data requires joining the two tables on the foreign key. While significantly safer, this implementation has a heavy cost: a single conceptual user is now represented as rows within multiple tables. Essentially, the idea of object identity has been blurred. The mapping of application objects representing users to the rows that represent them in the database is no longer a one-to-one mapping, and a mismatch between the object and relational models is created.
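
For instance, retrieving a user together with their emails might look like the following sketch. Note that the single conceptual user comes back as multiple result rows, one per email.

SELECT users.first_name, users.last_name, user_emails.email
FROM users
JOIN user_emails ON user_emails.user_id = users.id
WHERE users.last_name = 'Carcetti';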

Tuples are not objects

In this model, a user is a multidimensional data structure in that the emails can be thought of as a tuple within the row, which is itself a tuple. The unnormalized and normalized implementations are simply equivalent representations of the same two-dimensional structure. In the general case, object-relational mismatch is a problem that arises from the inability of tuples, which are one-dimensional, to store multi-dimensional objects.

The multidimensionality of object models enables the expression of powerful concepts like abstraction, inheritance, and polymorphism, all indispensable tools in modern programming that allow the natural representation of complex data domains and the native expression of their data. Multivalued attributes are just one example of a model feature that results in mismatch, with the normalized representations of one-to-many relations, many-to-many relations, inheritance hierarchies, and interface structures also leading to mismatches. Some of these occur in later examples in this article, and will be highlighted when they do.

Object-document mismatch

The document database was the first NoSQL database paradigm that saw widespread adoption for general-purpose applications, with the release of MongoDB in 2009. Like most NoSQL databases, document databases came about in response to the rapidly growing prevalence of big data, a result of increased digitization of business processes and the rise of cloud computing. For such applications, the poor performance of queries that join many tables with large numbers of rows, combined with the inflexibility of the schema-mandated structure for storing disparate data from different sources, makes relational databases a bad choice.

Document databases were designed to solve this problem by eliminating the schema and storing multi-dimensional data structures without the normalization necessary in relational databases. This allows the ingestion of semi-structured data and the storage of dependent data with its parent data, which give enormous improvements to ease of modeling and query performance when working with big data. However, this comes at the cost of not being able to rely on the schema to enforce structure or integrity, and so relational databases remained prevalent where these concerns were a priority.

In principle, document databases are identical to key-value databases with the additional abstraction of representing hierarchical data structures by natively embedding dependent structures in their parents. This abstraction makes document databases far more usable and optimized for high-level applications than key-value databases, and has led them to be widely adopted for general-purpose data persistence. Meanwhile, key-value databases are typically reserved for specialized applications such as caches and storage engines.

Basics of document modeling

In document databases, data is encoded in a serializable hierarchical format such as JSON or XML, with each level in the hierarchy comprising a document. Documents are stored in collections, indexed by IDs, and contain named fields mapped to either primitive data types or embedded sub-documents. The same applies to the embedded documents themselves, with no notional limit to the depth of embedding. This allows the trivial representation of multidimensional data structures that are purely hierarchical, leading to the convenient elimination of object model mismatch in many cases, including that of multivalued attributes.

db.users.insert( [
    {
        "first_name": "Thomas",
        "last_name": "Carcetti",
        "emails": [
            "thomas.carcetti@vaticle.com",
            "thomas@vaticle.com",
            "tommy@vaticle.com"
        ]
    },
    {
        "first_name": "Rhonda",
        "last_name": "Pearlman",
        "emails": [
            "rhonda.pearlman@vaticle.com",
            "rhonda@vaticle.com"
        ]
    }
] )

Much like in the predecessor key-value paradigm, no constraints are placed on the values that can be stored in a document database. Because of this, document databases are referred to as semi-structured. While any single document has a definite parseable structure, there are no mandated constraints on that structure, which may be modified freely as needed. Documents in the same collection do not need to have the same structure and might be completely dissimilar, with no common fields. Many document databases feature DDL syntax for defining optional constraints on inserted data, though the available functionality is limited in scope when compared to relational schemas.
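
MongoDB, for example, allows an optional JSON Schema validator to be attached to a collection. A minimal sketch for the users collection above:

db.createCollection( "users", {
    "validator": { "$jsonSchema": {
        "bsonType": "object",
        "required": [ "first_name", "last_name" ],
        "properties": {
            "first_name": { "bsonType": "string" },
            "last_name": { "bsonType": "string" },
            "emails": {
                "bsonType": "array",
                "items": { "bsonType": "string" }
            }
        }
    } }
} )

Even then, such a validator only constrains the structure of individual documents within a single collection.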

Consequences of lacking structure

The embedding of dependent data also allows for highly efficient insertion and retrieval in comparison to relational databases, even at very large data volumes, as all data nested in a document is stored as a single block on disk. However, the hierarchical data model becomes problematic when dealing with data that is not strictly hierarchical. In the following example, two documents are inserted into the projects collection, within which are embedded the previous documents from the users collection in the members list field.

db.projects.insert( [
    {
        "name": "TypeDB 3.0",
        "members": [
            {
                "first_name": "Thomas",
                "last_name": "Carcetti",
                "emails": [
                    "thomas.carcetti@vaticle.com",
                    "thomas@vaticle.com",
                    "tommy@vaticle.com"
                ]
            },
            {
                "first_name": "Rhonda",
                "last_name": "Pearlman",
                "emails": [
                    "rhonda.pearlman@vaticle.com",
                    "rhonda@vaticle.com"
                ]
            }
        ]
    },
    {
        "name": "TypeDB Cloud beta",
        "members": [
            {
                "first_name": "Rhonda",
                "last_name": "Pearlman",
                "emails": [
                    "rhonda.pearlman@vaticle.com",
                    "rhonda@vaticle.com"
                ]
            }
        ]
    }
] )

This highlights a problem: the user documents must be stored multiple times within the projects collection, and once more in the users collection for direct querying, leading to data duplication. The representation of atomic user objects as multiple documents is a form of object-document mismatch, analogous to object-relational mismatch. This is the result of a many-to-many relation between users and projects. While document databases are able to eliminate mismatches for multivalued attributes and one-to-many relations, they still occur for many-to-many relations due to the inability to represent them hierarchically.

To avoid this, it is possible to reference existing documents by an ID rather than duplicating them. When querying such a model, a lookup (equivalent to a relational join) must then be performed from the reference value onto the collection containing the referenced data. This process is significantly less performant than performing a join in a relational database because document databases cannot capitalize on the constraints provided by a schema to optimize the process. As a result, document database design patterns advocate for using embedded structures wherever possible rather than relying on references. Referencing can thus eliminate object-document mismatch for many-to-many relations, but it comes with a severe performance cost and is typically not viable in production.
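
A sketch of the reference-based approach follows: the project stores the _id values of its members, and a $lookup resolves them at read time (here, thomas_id and rhonda_id stand for previously retrieved _id values).

db.projects.insert( {
    "name": "TypeDB 3.0",
    "member_ids": [ thomas_id, rhonda_id ]
} )

db.projects.aggregate( [ {
    "$lookup": {
        "from": "users",
        "localField": "member_ids",
        "foreignField": "_id",
        "as": "members"
    }
} ] )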

Object-graph mismatch

Modern graph databases evolved from an earlier sub-category of navigational databases known as network databases. The idea had existed since the early 1990s but did not gain widespread popularity until the first commercially viable versions were released in the mid-2000s, notably Neo4j in 2007, today the most popular graph database. Graph databases are based on graph theory and implemented as labeled property graphs (LPGs), with inspiration drawn from Chen’s entity-relationship (ER) model (Chen, 1976).

The structure of LPG data is very similar to that of an earlier graph-based data model, the triplestore paradigm, which implements the RDF standard. LPG databases are strictly more expressive than triplestores, able to represent everything they can with more elegant modeling. Triplestore databases are sometimes also referred to as graph databases, but in this article the term graph is used exclusively in reference to LPGs. One of the key differences between them will be described later in this section.

Like document databases, graph databases are schemaless and appealed to engineers working with semi-structured data due to their ability to easily store disparate data structures. However, unlike document databases, graph databases are highly optimized to store and traverse non-hierarchical data involving multidimensional connections between records, which they store as pointers. Unlike in document databases, where dependent data is stored in one place, traversing these connections on disk incurs a non-negligible performance cost, but it enables the storage of highly interconnected data without duplication or performance-intensive lookups. Choosing between document and graph databases thus involves consideration of the characteristics of the data domain and application.

Basics of graph modeling

In a graph database, data is stored as nodes (representing entities) and relationships between them, equivalent to the vertices and edges of a graph respectively. Both nodes and relationships can be assigned labels, which represent the names of categories they fall into, and properties, which store data about them in scoped key-value stores. While relationships can have multiple labels in graph-theoretic algebra, many LPG databases only permit relationships to have a single label.

Graph databases | Graph-theoretic algebra | ER modeling
Node            | Vertex                  | Entity
Relationship    | Directed edge           | Relationship
Label           | Label                   | Type
Property        | Label                   | Attribute
The basic components of a graph database, implementing a labeled property graph (LPG) model. Mathematically, properties can be considered special kinds of labels, and they can be expressed using the same algebraic functions.

Relationships are implemented as directed edges, equivalent to the triplestore model, with the start-node, edge, and end-node representing the subject, predicate, and object, respectively. This allows for most semantic relationships to be naturally modeled in graph databases. There is no notional limit to the number of relationships a node can be connected with, allowing graph databases to easily represent highly interconnected data. Being schemaless, graph databases allow any labels and properties to be assigned to nodes and relationships and, like document databases, many incorporate label-based DDL syntax for constraint definition, which will be examined later in this article.

A competitive edge

The most important differentiator between graph and triplestore databases is the introduction of properties, which gives both nodes and relationships key-value attribute stores as internal structures. In contrast, node attributes in triplestore databases are modeled by creating a node representing the attribute value and linking it to the associated node with an edge representing the attribute key. Relationship attributes cannot easily be modeled in triplestore databases, as edges cannot be directly connected to other edges. In order to store them, the relationship edge must be replaced by a node, which is linked to the original edge’s start- and end-nodes by two new edges. Attributes can then be assigned to the node that represents the relationship, as in the case of node attributes.

The process of replacing a single edge with a node and pair of edges is known as reification, and results in object model mismatch. Because the atomic relationship is represented by multiple graph elements, the conceptual identity of the relationship object is lost. The inability to model attributes of relationships, a fairly common requirement, without reifying the relationships is a significant shortcoming of the triplestore model, and a major factor that affected the design of the graph paradigm. By implementing the LPG model, graph databases avoid reification, and thus the resulting mismatch. This is demonstrated in the following example, where the property temporary is assigned to a MEMBER_OF relationship without having to reify it.

CREATE
    (thomas:User {
        first_name: "Thomas",
        last_name: "Carcetti",
        emails: [ "thomas.carcetti@vaticle.com", "thomas@vaticle.com", "tommy@vaticle.com" ]
    }),
    (rhonda:User {
        first_name: "Rhonda",
        last_name: "Pearlman",
        emails: [ "rhonda.pearlman@vaticle.com", "rhonda@vaticle.com" ]
    }),
    (typedbThree:Project { name: "TypeDB 3.0" }),
    (cloudBeta:Project { name: "TypeDB Cloud beta" }),
    (thomas)-[:MEMBER_OF]->(typedbThree),
    (rhonda)-[:MEMBER_OF { temporary: true }]->(typedbThree),
    (rhonda)-[:MEMBER_OF]->(cloudBeta)

Limitations of graphs

While the design of graph databases is intended to avoid reification, there are cases where it is still a necessary process in order to express certain kinds of relationships. As relationships are implemented as directed edges, they can only model relationships that are binary (between exactly two nodes) and directed (with the two nodes specified as a start- and end-node). To express other kinds of relationships, such as n-ary ones (between any number of nodes), nested ones (acting as endpoints for other relationships), and undirected ones (in which neither node is the start or end), relationships must be reified, leading to object-graph mismatch. In this next example, the MEMBER_OF relationship from the previous example has been replaced with a Membership node in order to model it as a nested relationship, the endpoint of another ASSIGNED relationship.

MATCH
    (cedric:Admin)
WHERE
    "cedric.daniels@vaticle.com" in cedric.emails
CREATE
    (thomas:User {
        first_name: "Thomas",
        last_name: "Carcetti",
        emails: [ "thomas.carcetti@vaticle.com", "thomas@vaticle.com", "tommy@vaticle.com" ]
    }),
    (rhonda:User {
        first_name: "Rhonda",
        last_name: "Pearlman",
        emails: [ "rhonda.pearlman@vaticle.com", "rhonda@vaticle.com" ]
    }),
    (typedb_three:Project { name: "TypeDB 3.0" }),
    (cloud_beta:Project { name: "TypeDB Cloud beta" }),
    (thomas)-[:MEMBER_OF]->(typedb_three),
    (rhonda)-[:HAS]->(membership:Membership { temporary: true })-[:OF]->(typedb_three),
    (cedric)-[:ASSIGNED]->(membership),
    (rhonda)-[:MEMBER_OF]->(cloud_beta)

This demonstrates a fundamental issue with using graph-theoretic algebra for conceptual data modeling: in most data domains, not all relationships can be expressed as directed binary edges. A more in-depth exploration of when such relationships, including n-ary, nested, and undirected ones, arise in data will be the topic of a future article. Reification allows these more complex relationship types to be implemented, but at the cost of introducing object-graph mismatch to the model.

Summary

Object model mismatches are apparent in the relational, document, and graph paradigms, even with the fairly simple models used in the examples shown. While they are not too difficult to deal with at these scales, they become unmanageable as data models approach a realistic level of complexity, especially when there is polymorphism present in the model. The key reason they occur is the division between the conceptual and logical data models. While logical data models differ across database paradigms, there are always conceptual data structures that cannot be expressed with them. For this not to be the case, the logical model would need to be identical to the conceptual model. A database that directly implemented a conceptual model in this way would eliminate object model mismatch.

Semantic integrity

To maintain semantic integrity of data and prevent corruption, it is necessary to define conditions to which inserted data must conform to be considered semantically valid. While relational, document, and graph databases all include DDL syntax for constraint definition, this section will demonstrate that it is not possible to enforce semantic integrity in the general case. An example of a two-level inheritance hierarchy will be utilized throughout, pertaining to employees in a company, all of whom are either permanent employees or contractors.

In this example, all employees have:

  • An employee ID.
  • A first name.
  • A last name.

Permanent employees also have:

  • A start date.
  • An end date (which is null for current employees).
  • A salary.

Contractors instead have:

  • An agency.
  • A contract number.
  • An hourly rate.

Integrity in relational databases

In relational databases, the schema ensures referential integrity. Automatic referential integrity management prevents data corruption and is one of the most powerful features of relational databases. It is the primary reason they are still used today for the storage of critical data. However, referential integrity does not amount to semantic integrity, and it is perfectly possible to have a relational database in a valid logical state in which the data has been corrupted.

Three popular enterprise design patterns have emerged to deal with inheritance, coined by Martin Fowler (2002): single-table inheritance, concrete-table inheritance, and class-table inheritance. These design patterns have a range of advantages and disadvantages, which must be traded off depending on the requirements of the application. They are discussed in more detail in a previous article, and this article will focus only on semantic integrity.

Pattern 1: Single-table inheritance

In this design pattern, all data is stored in one table that represents the common parent type. The table contains columns for the parent type’s attributes and columns for the attributes of every child type.

CREATE TYPE EmployeeType
AS ENUM ('permanent_employee', 'contractor');

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    employee_type EmployeeType NOT NULL,
    start_date DATE DEFAULT NULL,
    end_date DATE DEFAULT NULL,
    agency_id INT DEFAULT NULL REFERENCES agencies(id),
    contract_number INT DEFAULT NULL,
    salary MONEY DEFAULT NULL,
    hourly_rate MONEY DEFAULT NULL
);

ALTER TABLE employees
ADD CONSTRAINT check_nulls
CHECK ( (
    employee_type = 'permanent_employee'
    AND num_nulls(start_date, salary) = 0
    AND num_nonnulls(agency_id, contract_number, hourly_rate) = 0
    ) OR (
    employee_type = 'contractor'
    AND num_nulls(agency_id, contract_number, hourly_rate) = 0
    AND num_nonnulls(start_date, end_date, salary) = 0
) );

Attribute semantics are controlled with the custom CHECK constraint defined in the schema. This ensures that each employee can only have the attributes permitted by their employee type. However, if another table references the employees table with a foreign key, there is no simple way to ensure that it only references appropriate employee types. For instance, if there is a table of performance reviews, it is not possible to ensure that only permanent employees can be referenced.

CREATE TABLE performance_reviews (
    id SERIAL PRIMARY KEY,
    employee_id INT NOT NULL REFERENCES employees(id),
    review_date DATE NOT NULL,
    review_score INT NOT NULL,
    comments TEXT DEFAULT NULL
);

As a result, while it is possible to maintain semantic integrity within a table using custom constraints, it is not possible across the entire database when employing this design pattern.
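
For example, assuming employee 2 is a contractor, the following insert satisfies every constraint in the schema despite being semantically invalid:

-- Accepted by the schema, even though employee 2 is a contractor.
INSERT INTO performance_reviews (id, employee_id, review_date, review_score)
VALUES (DEFAULT, 2, '2023-06-30', 4);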

Pattern 2: Concrete-table inheritance

In this design pattern, one table is used for each concrete (non-abstract) type in the hierarchy. Each table contains columns for the parent type’s attributes in addition to those for the attributes specific to the concrete type represented.

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

This pattern solves the previous issue as the performance_reviews table can now directly reference the permanent_employees table, but another problem now occurs. All employees must have globally unique IDs, but as different employee types are split over different tables, it is no longer possible to delegate their assignment to the database using the auto-incrementing SERIAL data type. As a result, the database is unable to control semantic integrity of employee IDs, so it is possible to insert two employees (of the same or different types) with the same ID.
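
For example, both of the following inserts are accepted, leaving two different employees sharing ID 1 (the values are illustrative):

INSERT INTO permanent_employees (id, first_name, last_name, start_date, salary)
VALUES (1, 'Kima', 'Greggs', '2020-09-25', 65000.00);

-- The contractors table has its own primary key, so this succeeds too.
INSERT INTO contractors (id, first_name, last_name, agency_id, contract_number, hourly_rate)
VALUES (1, 'Reginald', 'Cousins', 8, 301, 50.00);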

Pattern 3: Class-table inheritance

In this design pattern, one table is used for every type in the hierarchy, abstract and concrete. Each table only contains columns for the attributes specific to the type represented.

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY REFERENCES employees(id),
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY REFERENCES employees(id),
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

Compared to the single- and concrete-table patterns, this pattern is more architecturally complex, but it is possible to simultaneously solve the previous semantic integrity issues. The dependency of performance reviews can be correctly enforced by referencing the permanent_employees table, and global employee ID uniqueness is enforced by storing them in the root employees table. But object-relational mismatch has now been introduced to the model as employees are represented by multiple table rows, which must be inserted and read together.

DO $$
DECLARE
    employee_id INT;
BEGIN
    INSERT INTO employees (id, first_name, last_name)
    VALUES (DEFAULT, 'Kima', 'Greggs')
    RETURNING id INTO employee_id;

    INSERT INTO permanent_employees (id, start_date, salary)
    VALUES (employee_id, '2020-09-25', 65000.00);
END $$;

This causes another semantic integrity issue: it is not possible to enforce the number of types an employee can have. For example, a single employee in the employees table could have entries in both the permanent_employees and contractors tables, or even in neither! This is also seen in other cases of object-relational mismatch. For instance, in the previous example in which user_emails was made a separate table from users in order to make the email a multivalued attribute, there is no way to ensure users have at least one email. The sketch below illustrates the first problem.
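
Assuming Kima’s generated employee ID was 1, the following insert is accepted, making her simultaneously a permanent employee and a contractor (the agency ID is illustrative):

-- Kima already has a row in permanent_employees, yet nothing prevents
-- also registering her as a contractor.
INSERT INTO contractors (id, agency_id, contract_number, hourly_rate)
VALUES (1, 8, 302, 50.00);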

The root of the problem

While all three design patterns ensure referential integrity, none are able to enforce semantic integrity, each failing in a different way. This is a manifestation of a larger issue: SQL does not provide syntax for constraining cross-table dependencies other than foreign keys. In the single-table pattern, the type-specific dependency between the performance_reviews and employees tables could not be enforced. In the concrete-table pattern, the shared dependency of the permanent_employees and contractors tables on employee ID could not be enforced. In the class-table pattern, the cardinality-based dependency between the employees table and the subtype tables could not be enforced.

Some relational databases include facilities for creating type hierarchies, such as PostgreSQL’s table inheritance (sketched below), but they are limited in scope and cannot enforce semantic integrity in the general case. Regardless, the case of the class-table pattern highlights a wider-ranging problem: due to the lack of cross-table constraints, semantic integrity cannot be enforced when object-relational mismatch occurs. This means that any model that includes polymorphic features cannot have semantic integrity enforced in relational databases.
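
A minimal sketch of one such facility, PostgreSQL’s table inheritance, and its limits:

-- Child tables inherit the parent's columns...
CREATE TABLE permanent_employees (
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
) INHERITS (employees);

-- ...but not its indexes or unique constraints: the parent's primary
-- key is not enforced across child tables, so duplicate IDs can still
-- appear in different subtypes.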

Integrity in document databases

To implement the employee hierarchy in a document database, it is necessary to decide if employees will be stored in two collections, permanent_employees and contractors, or a single collection, employees. Daniel Coupal et al. (2023) present enterprise design patterns for modeling data of this kind in a document database, in which they advocate for storing all similar documents (in this case employees) in a single collection representing the parent type, which they call the “polymorphic pattern”, with the specific types recorded as data fields. In contrast, separating similar documents into separate collections is specifically described as an anti-pattern. The primary reason for this is performance, as document databases are not optimized to perform the lookups and unions required to access data divided across multiple collections.

Utilizing this pattern for the current example, an important design decision remains: whether to embed the data pertaining to contractors’ agencies in the contractor documents, or instead reference them by ID. This section will examine both approaches.

Pattern 1: Embedding dependent data

With this approach, an agency sub-document is embedded into every contractor document.

db.employees.insert( [
    {
        "id": 1,
        "first_name": "Kima",
        "last_name": "Greggs",
        "employee_type": "permanent_employee",
        "start_date": new Date("2020-09-25"),
        "salary": 65000.00
    },
    {
        "id": 2,
        "first_name": "Reginald",
        "last_name": "Cousins",
        "employee_type": "contractor",
        "agency": { "id": 8, "name": "Baltimore Agency" },
        "contract_number": 301,
        "hourly_rate": 50.00
    }
] )

As discussed in the previous section, this leads to duplication of data. If an agency has multiple contractors, then its data must be nested in every associated contractor document. Likewise, if any other documents in other collections are associated with the agency, then they must also contain a copy of the data. This leads to data redundancy, requires the engineer to ensure the redundant copies remain synchronized, and imposes a heavy performance cost on writes. Without a schema to indicate where all the copies might be, potentially at different levels of nesting, there is a significant risk of data inconsistency, and with it loss of semantic integrity.

Pattern 2: Referencing dependent data

With this approach, every contractor document references an agency ID, corresponding to the built-in _id field of a document in the agencies collection.

baltimore_id = db.agencies.find( { "id": 8 } ).next()["_id"]

db.employees.insert( [
    {
        "id": 1,
        "first_name": "Kima",
        "last_name": "Greggs",
        "employee_type": "permanent_employee",
        "start_date": new Date("2020-09-25"),
        "salary": 65000.00
    },
    {
        "id": 2,
        "first_name": "Reginald",
        "last_name": "Cousins",
        "employee_type": "contractor",
        "agency": baltimore_id,
        "contract_number": 301,
        "hourly_rate": 50.00
    }
] )

There is no data redundancy with this approach, but a costly lookup must be performed to retrieve a contractor’s data along with that of their agency. As a result, this implementation has faster writes but slower reads than the previous one. While the lack of redundancy prevents any inconsistencies, there is another problem: the agency field of contractors only contains the literal value of the agency’s _id field with no referential guarantees. This means that it does not identify which collection the agency is in, nor ensures there is a document with a matching value anywhere in the database. Unlike in relational databases where foreign keys enforce referential integrity, here it is possible to delete the referenced document without knowing if it is referenced anywhere else in the database, leading to the creation of null references and loss of semantic integrity.
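
For example, nothing prevents the referenced agency from being deleted, silently orphaning the contractor’s reference:

// Succeeds even though a contractor document still references this
// _id, leaving a dangling reference.
db.agencies.deleteOne( { "_id": baltimore_id } )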

Between a rock and a hard place

Embedding dependent data results in loss of data consistency, while referencing it results in loss of referential integrity. Because both are required for semantic integrity, neither choice can guarantee it is enforced. As with the relational case, the loss of semantic integrity here occurs when cross-document dependencies need to be enforced, either the conformity of embeddings across different documents, or the integrity of cross-document references. Custom document constraints cannot help here, as they only enforce the structure of a single document. This means that, if a data model includes a many-to-many relationship (as most do), its semantic integrity cannot be enforced when implemented in a document database.

Integrity in graph databases

When implementing the employee hierarchy in a graph database, the process is straightforward if adhering to ER modeling practices.

MATCH (baltimore:Agency {id: 8} )
CREATE
    (kima:Employee:PermanentEmployee {
        id: 1,
        first_name: "Kima",
        last_name: "Greggs",
        start_date: date("2020-09-25"),
        salary: 65000.00
    } ),
    (reginald:Employee:Contractor {
        id: 2,
        first_name: "Reginald",
        last_name: "Cousins"
    } )-[:CONTRACTED_FROM {
        contract_number: 301,
        hourly_rate: 50.00
    } ]->(baltimore);

Some of the properties of contractors become properties of their contracts. This is in line with recommendations by Barrasa (2017), who writes that one of the motivations behind the graph model was to avoid the reification necessary to model relationship properties in the triplestore model. Employees are given multiple labels to represent the multiple types in the employee type hierarchy, which can be used to define node and relationship property constraints on a per-label basis. This allows the property constraints for the parent employee type and the child types to be elegantly enforced.

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.id IS NODE KEY;

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.id IS TYPED INTEGER;

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.first_name IS NOT NULL;

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.first_name IS TYPED STRING;

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.last_name IS NOT NULL;

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.last_name IS TYPED STRING;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.start_date IS NOT NULL;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.start_date IS TYPED DATE;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.end_date IS TYPED DATE;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.salary IS NOT NULL;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.salary IS TYPED FLOAT;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.contract_number IS RELATIONSHIP KEY;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.contract_number IS TYPED INTEGER;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.hourly_rate IS NOT NULL;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.hourly_rate IS TYPED FLOAT;

Constraints are not a schema

Constraining property values is the limit of custom constraint functionality in graph databases. This leaves two major problems. First, there is no way to control the semantic grouping of labels. For instance, it is possible to create a node with the PermanentEmployee label that does not have the Employee label, a node with both the PermanentEmployee and Contractor labels, or a node with the abstract Employee label but no label for a concrete type. Second, there is no way to constrain the nodes that can act as the start- and end-nodes in a given relationship. It is possible, for example, to create a contractor that is not in any CONTRACTED_FROM relationships. It is also possible to create a CONTRACTED_FROM relationship that represents any of the following:

  • A permanent employee contracted from an agency.
  • A contractor contracted from a permanent employee.
  • An agency contracted from a contractor.
  • A permanent employee contracted from themselves!

This means that there is no way to natively ensure relationship semantics. Unlike the first problem, which only affects models in which multiple labels are involved, this problem affects all data models implemented as graph databases. In fact, these two problems can be generalized: constraints cannot be defined that control dependencies between labels. This is analogous to the relational and document paradigms, where constraints cannot enforce cross-table and cross-document dependencies.
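
As a sketch, the following data satisfies every property constraint defined above while being semantically nonsensical:

CREATE
    // A node with both concrete labels but without the parent
    // Employee label.
    (confused:PermanentEmployee:Contractor {
        start_date: date("2021-01-01"),
        salary: 55000.00
    }),
    // A permanent employee contracted from themselves.
    (kima:Employee:PermanentEmployee {
        id: 3,
        first_name: "Kima",
        last_name: "Greggs",
        start_date: date("2020-09-25"),
        salary: 65000.00
    }),
    (kima)-[:CONTRACTED_FROM { contract_number: 999, hourly_rate: 1.00 }]->(kima)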

Summary

Semantic integrity is impossible to enforce in a relational database with polymorphic features in its data model, in a document database with many-to-many relationships, and in any graph database. Document and graph databases offer simple constraint definition and validation, but cannot give the more powerful integrity assurances possible with a schema. While relational databases implement a schema, they are not able to express the multidimensional data structures required to correctly constrain polymorphic data. In all three cases, the source of the vulnerability is the inability to describe complex constraints between types of record: tables, collections, and labels respectively. In order to eliminate this vulnerability and enforce semantic integrity, a database would need more powerful DDL syntax and validation, able to express the polymorphic constraints between types, such as the inheritance hierarchies and interface structures present in natural data domains.

Polymorphic querying

Having explored the inability to express polymorphic constraints with the DDL syntax of relational, document, and graph databases, this section focuses on the polymorphic querying capabilities of each paradigm using its DML syntax. Since Codd’s seminal work on relational databases, declarativity has been considered an essential characteristic of a query language. However, contemporary query languages are not as declarative as they seem at first glance. When dealing with polymorphic data, such query languages appear more imperative in nature. This leads to brittle application architectures, as queries are not sufficiently declarative to survive model extensions and migrations without refactors, as will be shown in this section.

An object model for a simple filesystem

The previous example of a company’s employees was effective for demonstrating some of the intrinsic problems in representing object models with current database paradigms, but it was unrealistically simple. This section will use a more practical model for a DAC (discretionary access control) filesystem. While the filesystem being modeled is conceptually very simple, the object model (illustrated in Java) involves both inheritance and interfaces in order to naturally model its polymorphic nature, and so is necessarily complex. This ensures that the model adheres closely to the domain semantics, and permits trivial extensibility.

The model contains three top-level entity types: users, user groups, and resources (which are abstract). It also includes two additional entity types: admins, a subtype of users, and files, a subtype of resources. Each of these entity types is represented by a class in the object model.

Each entity is keyed by a different attribute: emails for users (and therefore admins), names for user groups, and paths for files. These attributes are also represented by classes, subtypes of an abstract ID class, which are then stored as fields in the classes of their associated entities. While it might seem more straightforward to model them as primitive fields, making them classes allows for more powerful operations to be performed on the filesystem, as will be demonstrated shortly. The attribute classes each expose a “key” interface that is implemented by the associated entity classes.

Finally, two relations are modeled: resource ownerships and group ownerships, which are both subtypes of an abstract ownership type. Resource ownerships relate resources to their owners, which can be users (including admins) or user groups, while group ownerships relate groups to their owners, which can only be admins. Each relation type is represented by a class, which exposes “owned” and “owner” interfaces.

The object model can be found here. It is worth acknowledging that the model is not complete. For instance, there are no resources other than files, and it is impossible to make a user a member of a user group. This model is only intended to illustrate certain properties of the class structure, and so these trivial additions are excluded as they are not directly relevant to the discussion. More resource types could easily be added to the model by copying the structure of the file class, and likewise group memberships could be added by copying the structure of the ownership classes. The model is also not intended to serve as a fully-fledged application, and many important considerations (such as ensuring that methods have appropriate access modifiers, or ensuring that re-assigning of ownerships after initial assignment is correctly handled) have been neglected for simplicity.
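
As a reference point, the following is a minimal sketch of how the core interfaces might be structured. It is not the full linked model, but the names match those used in the queries below.

// A minimal sketch of the interface structure described above; the
// linked model contains the complete implementations.

// Abstract ID attribute class, subclassed by the email, name, and
// path attribute classes.
abstract class Id {
    final String value;

    Id(String value) {
        this.value = value;
    }

    // The exposed "key" interface, implemented by entity classes that
    // are keyed by an ID of type T.
    interface Key<T extends Id> {
        T getId();
    }
}

// Abstract ownership relation, subclassed by ResourceOwnership and
// GroupOwnership, exposing "owner" and "owned" interfaces.
abstract class Ownership {
    interface Owner { }

    interface Owned {
        Owner getOwner();
        Ownership getOwnership();
    }
}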

With the model in place, creating filesystem objects is quite simple.

// Instantiate an admin with the specified email.
Admin cedric = new Admin("cedric@vaticle.com");

// Instantiate a user with the specified email.
User jimmy = new User("jimmy@vaticle.com");

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
UserGroup engineers = new UserGroup("engineers", cedric);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
File benchmark = new File("/jimmy/benchmark-results.xlsx", jimmy);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
File roadmap = new File("/vaticle/feature-roadmap.pdf", engineers);

Retrieving data belonging to the objects is also simple. This will be examined through the lens of a query-like operation:

Retrieve the details of every ownership in the file system, specifically: the type of ownership (group or resource), the type and ID of the owned object, the type and ID of the object’s owner.

// Retrieve the details of every ownership in the filesystem.
fileSystemObjects.stream()
    .filter(object -> object instanceof Ownership.Owned)
    .map(object -> (Ownership.Owned) object)
    .map(owned -> Map.of(
        "owned-id", ((Id.Key<?>) owned).getId().value,
        "owned-type", owned.getClass().getSimpleName(),
        "owner-id", (((Id.Key<?>) owned.getOwner()).getId()).value,
        "owner-type", owned.getOwner().getClass().getSimpleName(),
        "ownership-type", owned.getOwnership().getClass().getSimpleName()
    ))
    .forEach(System.out::println);

// Output:
// {
//     owned-id=/vaticle/feature-roadmap.pdf,
//     ownership-type=ResourceOwnership,
//     owner-type=UserGroup,
//     owner-id=engineers,
//     owned-type=File
// }
// {
//     owned-id=/jimmy/benchmark-results.xlsx,
//     ownership-type=ResourceOwnership,
//     owner-type=User,
//     owner-id=jimmy@vaticle.com,
//     owned-type=File
// }
// {
//     owned-id=engineers,
//     ownership-type=GroupOwnership,
//     owner-type=Admin,
//     owner-id=cedric@vaticle.com,
//     owned-type=UserGroup
// }

In the domain, this is a very important query. Understanding ownerships in the filesystem at a holistic level can identify vulnerabilities that could otherwise go unnoticed. Because the GroupOwnership and ResourceOwnership classes are subtypes of the Ownership class, we can retrieve the set of all ownerships in the filesystem in one go. We can also easily retrieve the ID of the feature roadmap’s owner without knowing the type of the owner by using the Id.Key interface. While simple, this is immensely powerful as we can introduce new classes that implement the interfaces of Ownership and Id, and still retrieve this information without having to modify any code.

The query functions without knowing any of the exact types in the filesystem, as it operates only on interface implementations. No matter how many new types we add to the filesystem, this operation will continue to function and give correct results. This is because the object model has been designed with the conceptual model of the filesystem domain in mind, of which polymorphism is an integral part. Three other queries are included in the example code, but not discussed in this article:

  • Retrieve the ID of a given resource’s owner.
  • Retrieve the IDs of all resources owned by a given user.
  • Retrieve the type and ID of every object in the filesystem.

They exhibit the same declarative polymorphic properties as the one featured here.
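
For instance, a hypothetical directory type could be added by mirroring the File class, and each of these queries would report directories without any modification. A sketch, assuming the constructor structure of the existing model:

// A hypothetical new resource type mirroring the File class. Because
// it implements the same interfaces (via Resource), the ownership
// query above requires no changes to account for it.
class Directory extends Resource {
    Directory(String path, ResourceOwnership.Owner owner) {
        super(path, owner);
    }
}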

Relational implementation

To implement the filesystem model in a relational database, we will use the class-table inheritance pattern, as it is the most extensible.

CREATE TABLE users (
    email TEXT PRIMARY KEY
);

CREATE TABLE admins (
    email TEXT PRIMARY KEY REFERENCES users(email)
);

CREATE TABLE user_groups (
    name TEXT PRIMARY KEY
);

CREATE TABLE resources (
    id TEXT PRIMARY KEY
);

CREATE TABLE files (
    path TEXT PRIMARY KEY REFERENCES resources(id)
);

CREATE TABLE ownerships (
    id SERIAL PRIMARY KEY
);

CREATE TABLE resource_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE user_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_id TEXT REFERENCES users(email),
    resource_id TEXT REFERENCES resources(id)
);

CREATE TABLE user_group_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_group_id TEXT REFERENCES user_groups(name),
    resource_id TEXT REFERENCES resources (id)
);

CREATE TABLE group_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE admin_of_group_ownerships (
    id INT PRIMARY KEY REFERENCES group_ownerships(id),
    admin_id TEXT REFERENCES admins(email),
    user_group_id TEXT REFERENCES user_groups(name)
);

The ownership interfaces present an interesting challenge here. As each foreign key column can only reference a column in a single other table, we have to create one table for users that own resources, one for user groups that own resources, and one for admins that own groups. Next, we insert data.

-- Instantiate an admin with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('cedric@vaticle.com');

    INSERT INTO admins (email)
    VALUES ('cedric@vaticle.com');
END $$;

-- Instantiate a user with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('jimmy@vaticle.com');
END $$;

-- Instantiate a user group with the specified name,
-- and assign the admin Cedric as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO user_groups (name)
    VALUES ('engineers');

    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;

    INSERT INTO group_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO admin_of_group_ownerships (id, admin_id, user_group_id)
    VALUES (ownership_id, 'cedric@vaticle.com', 'engineers');
END $$;

-- Instantiate a file with the specified path,
-- and assign the user Jimmy as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO files (path)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_of_resource_ownerships (id, user_id, resource_id)
    VALUES (ownership_id, 'jimmy@vaticle.com', '/jimmy/benchmark-results.xlsx');
END $$;

-- Instantiate a file with the specified path,
-- and assign the Engineers group as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO files (path)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_group_of_resource_ownerships (id, user_group_id, resource_id)
    VALUES (ownership_id, 'engineers', '/vaticle/feature-roadmap.pdf');
END $$;

Because creating any filesystem object involves inserting rows into multiple tables, the queries need to be enclosed in transactions to ensure consistency. In the application code, constructor arguments enforce that resources and user groups cannot be created without an owner; the transactions serve the same purpose here. In this way, atomic operations in the application are also atomic operations in the database. The transactions are also necessary to create a scope for saving the auto-generated object IDs so they can be shared across tables. With data in place, we can now implement the example query.

-- Retrieve the details of every ownership in the filesystem.
SELECT
    ownership_type,
    owned_type,
    owned_id,
    owner_type,
    owner_id
FROM (
    SELECT
        'group_ownership' AS ownership_type,
        'user_group' AS owned_type,
        admin_of_group_ownerships.user_group_id AS owned_id,
        'admin' AS owner_type,
        admin_of_group_ownerships.admin_id AS owner_id
    FROM ownerships
    JOIN admin_of_group_ownerships USING (id)
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user' AS owner_type,
        user_of_resource_ownerships.user_id AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    WHERE user_of_resource_ownerships.user_id NOT IN (
        SELECT admins.email
        FROM admins
    )
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'admin' AS owner_type,
        admins.email AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    JOIN admins ON admins.email = user_of_resource_ownerships.user_id
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user_group' AS owner_type,
        user_group_of_resource_ownerships.user_group_id AS owner_id
    FROM ownerships
    JOIN user_group_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_group_of_resource_ownerships.resource_id
) AS ownerships;

Retrieving all the ownerships in the filesystem requires a union of several tables, in which the ownership types (corresponding to the name of the table) must be hardcoded into the query. Because of the multiple implementations of the Owned and Owner interfaces in the object model, the number of unions required will be equal to the number of combinations of concrete classes that implement the interfaces, whether directly or by inheritance.

Ownership.Owned implementation            | Ownership.Owner implementation
UserGroup implements GroupOwnership.Owned | Admin implements GroupOwnership.Owner
File implements ResourceOwnership.Owned   | User implements ResourceOwnership.Owner
File implements ResourceOwnership.Owned   | Admin implements ResourceOwnership.Owner
File implements ResourceOwnership.Owned   | UserGroup implements ResourceOwnership.Owner
Concrete implementations of ownership interfaces in the filesystem object model, both directly and by inheritance.

While the resources table can be ignored as it represents an abstract entity type, the users table represents an entity type that is both concrete and extended (by the admin type). As a result, an additional condition needs to be added to the query so that admins are not returned twice, as users and as admins. As with the application code, we need to retrieve the ID of each object and its owner without knowing their types. Because the ID type (corresponding to the column name) differs for each entity, we need to hardcode the correct column names into each union branch.
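
For instance, if a hypothetical directories table were added alongside files, the query would need three further hardcoded union branches, one per owner type. A sketch of just the first:

    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'directory' AS owned_type,
        directories.path AS owned_id,
        'user' AS owner_type,
        user_of_resource_ownerships.user_id AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN directories ON directories.path = user_of_resource_ownerships.resource_id
    WHERE user_of_resource_ownerships.user_id NOT IN (
        SELECT admins.email
        FROM admins
    )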

Document implementation

To implement the filesystem model in a document database, we will use the polymorphic design pattern. To adhere to the paradigm’s best practices for unbounded many-to-many relationships, ownerships will be recorded by referencing the owner in a field on the owned object.

// Instantiate an admin with the specified email.
db.users.insert( {
    "email": "cedric@vaticle.com",
    "user_type": "admin"
} )

// Instantiate a user with the specified email.
db.users.insert( {
    "email": "jimmy@vaticle.com",
    "user_type": "user"
} )

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
owner_id = db.users.find( { "email": "cedric@vaticle.com" } ).next()["_id"]
db.groups.insert( {
    "name": "engineers",
    "group_type": "user_group",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
owner_id = db.users.find( { "email": "jimmy@vaticle.com" } ).next()["_id"]
db.resources.insert( {
    "path": "/jimmy/benchmark-results.xlsx",
    "resource_type": "file",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
owner_id = db.groups.find( { "name": "engineers" } ).next()["_id"]
db.resources.insert( {
    "path": "/vaticle/feature-roadmap.pdf",
    "resource_type": "file",
    "owner": owner_id
} )

In order to store each object’s exact type, we’ve created a type field in every collection. If the type hierarchy had more levels, this would involve creating additional fields or using a list field to contain all the applicable types, as a single type field can only record one type. In the following discussion, the term “ID” refers to the IDs in the original object model (emails, names, paths) rather than the built-in _id field (which carries no meaningful information).
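To illustrate, a hypothetical deeper hierarchy might store all applicable types in a list field; the user_types field name is an assumption, not part of the model used in this article.

// Hypothetical variant: record every level of the hierarchy in a list.
db.users.insert( {
    "email": "cedric@vaticle.com",
    "user_types": [ "user", "admin" ]
} )

With everything stored, we now turn to the example query.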

db.groups.aggregate( [
    { "$addFields": { "ownership_type": "group_ownership" } },
    { "$unionWith": {
        "coll": "resources",
        "pipeline": [ { "$addFields": { "ownership_type": "resource_ownership" } } ]
    } },
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_owners"
    } },
    { "$lookup": {
        "from": "groups",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_group_owners"
    } },
    { "$addFields": { "owners": { "$concatArrays": [ "$user_owners", "$user_group_owners" ] } } },
    { "$unwind": "$owners" },
    { "$addFields": {
        "owned_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "user_group" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "file" }
        ] } }
    } },
    { "$addFields": {
        "owned_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "$name" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "$path" }
        ] } }
    } },
    { "$addFields": {
        "owner_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "user" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "admin" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "user_group" }
        ] } }
    } },
    { "$project": {
        "_id": false,
        "ownership_type": true,
        "owned_type": true,
        "owned_id": true,
        "owner_type": true,
        "owner_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "$owners.name" }
        ] } }
    } }
] )

There’s a lot to unpack in this query, which requires an aggregation pipeline. The first step is to union the groups and resources collections, as these are the two types that can be owned. We also add a field for the ownership type at this stage, as it depends only on the collection’s parent type rather than on the more specialised types within the collection. Next, lookups must be performed on both the users and groups collections, as the owner’s collection is not known, before the results are concatenated into a single field. Finally, projections are used to extract the type and ID of the owned object and its owner into the same fields in every result, with one switch statement per projection to map onto the correct values. An alternative approach would be to nest the projections as sub-pipelines within the lookups, but that achieves the same result with roughly equal complexity.
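For reference, here is a sketch of that alternative for the users lookup; recent MongoDB versions (5.0+) allow a sub-pipeline alongside localField and foreignField, and the projected field names are carried over from the query above.

// Sketch: projecting the owner's type and ID inside the lookup itself.
db.groups.aggregate( [
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "pipeline": [
            { "$project": {
                "_id": false,
                "owner_type": "$user_type",
                "owner_id": "$email"
            } }
        ],
        "as": "user_owners"
    } }
] )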

Graph implementation

For the graph implementation, the logical model is fairly straightforward, with no significant design choices to be made.

// Instantiate an admin with the specified email.
CREATE (cedric:User:Admin {email: "cedric@vaticle.com"});

// Instantiate a user with the specified email.
CREATE (jimmy:User {email: "jimmy@vaticle.com"});

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
MATCH (cedric:Admin {email: "cedric@vaticle.com"})
CREATE
    (engineers:UserGroup {name: "engineers"}),
    (cedric)-[:OWNS {ownership_type: "GroupOwnership"}]->(engineers);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
MATCH (jimmy:User {email: "jimmy@vaticle.com"})
CREATE
    (benchmark:Resource:File {path: "/jimmy/benchmark-results.xlsx"}),
    (jimmy)-[:OWNS {ownership_type: "ResourceOwnership"}]->(benchmark);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
MATCH (engineers:UserGroup {name: "engineers"})
CREATE
    (roadmap:Resource:File {path: "/vaticle/feature-roadmap.pdf"}),
    (engineers)-[:OWNS {ownership_type: "ResourceOwnership"}]->(roadmap);

The only place where we’ve had to depart from the object model is in the storage of ownership types: they are recorded as properties of OWNS relationships rather than as relationship types, because a relationship can only have a single type. Modelling ownerships with multiple types directly would require reifying them as nodes, which we have avoided.
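For comparison, here is a hypothetical sketch of the reified alternative, in which each ownership becomes a node and can therefore carry multiple labels; the Ownership label and the OWNER and OWNED relationship types are assumptions for illustration only.

// Hypothetical reified form: the ownership is a node, so it can
// carry both the Ownership and GroupOwnership labels.
MATCH (cedric:Admin {email: "cedric@vaticle.com"}),
      (engineers:UserGroup {name: "engineers"})
CREATE
    (o:Ownership:GroupOwnership),
    (cedric)-[:OWNER]->(o),
    (o)-[:OWNED]->(engineers);

Now, we turn to the example query.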

// Retrieve the details of every ownership in the filesystem.
MATCH (owner)-[ownership:OWNS]->(owned)
UNWIND labels(owned) AS owned_type
UNWIND labels(owner) AS owner_type
WITH ownership, owned, owned_type, owner, owner_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE owned_type IN keys(id_type_map)
AND id_type_map[owned_type] IN keys(owned)
AND owner_type IN keys(id_type_map)
AND id_type_map[owner_type] IN keys(owner)
AND NOT (
    owner_type = "User"
    AND "Admin" IN labels(owner)
)
RETURN
    ownership.ownership_type AS ownership_type,
    owned_type,
    owned[id_type_map[owned_type]] AS owned_id,
    owner_type,
    owner[id_type_map[owner_type]] AS owner_id;

This query does something quite powerful. Unlike the relational and document queries, where we needed to specify the tables or collections in which to search for the owned object and owner, here we can retrieve the two nodes without knowing their labels. This closely mirrors the application code, in which we can retrieve them without knowing their types. However, while in the application code we can also retrieve their IDs without knowing their types, here we need (the equivalent of) a switch statement to select the right property of each node. We also need to filter out results where admins would be returned as users, because the unwind operation returns both User and Admin as types for admins, as in the relational implementation.

Discussion

There are some common themes across database paradigms in the implementation of the object model and the query examined. As modeling and data integrity have been discussed sufficiently in previous sections, this discussion will focus on querying.

Unions of types

The relational and document implementations both need to enumerate the tables or collections in which the objects and their owners might reside, and perform a union (or equivalent) of the results, with hardcoded joins or lookups from the objects to their owners. The relational implementation is particularly egregious. Only four branches are combined in the union for this simple filesystem, but as one branch is needed for each pair of types that implement Ownership.Owned and Ownership.Owner, the number of branches grows with the product of the number of implementing types. If there were ten concrete types that implemented Ownership.Owned, and ten that implemented Ownership.Owner, then a total of one hundred branches would be required in the union.

The graph implementation does not have this problem, as it queries the object and owner without specifying their types, in the same way as the application code. The query leverages interface polymorphism by finding entities that implement specific interfaces: in this case, the Ownership.Owner and Ownership.Owned interfaces, represented by the start and end nodes of the OWNS relationship type.

Switch statements

All three implementations need a switch statement (or equivalent) to retrieve IDs. This is less obvious in the relational implementation because the projections are performed within each branch of the union; in fact, that is precisely the alternative approach noted for the document implementation, illustrating the equivalence of the two operations. The graph implementation is not able to leverage the Id.Key interface in the way it leverages the Ownership interfaces, which leaves it in the same position as the relational and document implementations. This shows that graph databases can only capture interface polymorphism in queries under certain circumstances, when the logical model permits it.
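For contrast, here is a minimal Java sketch of the kind of polymorphic ID retrieval the application code permits; the interface and type names are assumptions modelled on the Id.Key interface of the object model, not its exact definition.

// Hypothetical sketch: any object exposing an Id.Key can yield its ID
// without the caller knowing its concrete type.
public class IdExample {
    interface Id {
        interface Key { String idValue(); }
    }

    // Assumed key types mirroring the filesystem's IDs.
    record Email(String idValue) implements Id.Key { }
    record Path(String idValue) implements Id.Key { }

    static String idOf(Id.Key key) {
        return key.idValue();
    }

    public static void main(String[] args) {
        System.out.println(idOf(new Email("jimmy@vaticle.com")));
        System.out.println(idOf(new Path("/jimmy/benchmark-results.xlsx")));
    }
}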

Encoding inheritance

The relational and graph implementations also include an additional condition to ensure that admins are not returned a second time as users. This is because admins occur in both the users and admins tables in the relational implementation, and carry both the User and Admin labels in the graph implementation. We have been able to express that admins have multiple types, but neither implementation can recognise that one of those types is a subtype of the other. Both database paradigms are unable to leverage inheritance polymorphism in this way. It might seem apparent that the document database can, but this is not the case. The user_type field only contains a single value (the most specialised type), so it is likewise blind to the inheritance hierarchy. If we had chosen to instead store a list field user_types, the same problem would occur, as the sketch below illustrates.
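For instance, under the hypothetical user_types list field described earlier, excluding admins would still require the hierarchy to be hardcoded into the query.

// Hypothetical: even with a list of types, the query must manually
// exclude the subtype, as the database cannot infer that admins are users.
db.users.find( {
    "$and": [
        { "user_types": "user" },
        { "user_types": { "$ne": "admin" } }
    ]
} )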

Summary

Although we adhered to recommended design patterns for each database paradigm, the queries clearly exhibit several issues. This is due to the polymorphism inherent in the data model, which cannot be generally expressed using the DML syntax of relational, document, and graph databases. In order to return polymorphic results without polymorphic querying capabilities, semantic context must be hardcoded into the queries. For example, if the relational implementation of the query is translated into a natural language question, it would be:

Retrieve the string “group_ownership”, the string “user_group”, the user group ID, the string “admin”, and the admin ID for group ownerships by admins; retrieve the string “resource_ownership”, the string “file”, the file path, the string “user”, and the user ID for resource ownerships by users where the user is not an admin; retrieve the string “resource_ownership”, the string “file”, the file path, the string “admin”, and the admin email for resource ownerships by users where the user is an admin; and retrieve the string “resource_ownership”, the string “file”, the file path, the string “user_group”, and the user group ID for resource ownerships by user groups.

This is completely different to the original natural language query:

Retrieve the details of every ownership in the file system, specifically: the type of ownership (group or resource), the type and ID of the owned object, and the type and ID of the object’s owner.

In effect, the query has been irreversibly translated from its original declarative form into an implementation-specific imperative form. This is the case for all three database paradigms to different degrees, and is in contrast to the application code, in which the query operation actually captures the natural-language expression.

This has severe consequences for the maintainability and extensibility of the database implementation. The queries only express the question being asked of the database for the specific data model in place at the time they were written. If the data model is modified or extended, they cease to function as intended, and possibly at all, leading to intrinsically brittle application architectures. To avoid this, a database would need DML syntax capable of declaratively querying the polymorphism inherent in natural data domains, and a query parser capable of interpreting and resolving such queries into their imperative forms.

Conclusion

Current database paradigms have been left far behind programming languages in expressivity. While general-purpose programming languages have continually evolved to become more abstract and declarative, database query languages have been unable to keep up. The inability of contemporary databases to model or express the polymorphism inherent in natural data domains causes three key problems for engineers:

  • Object model mismatch prevents direct translation between application code and the database.
  • Semantic integrity of inserted data cannot be enforced, allowing data corruption to occur silently.
  • Polymorphic queries must have model context hardcoded, requiring continual maintenance.

The difficulty of maintaining polymorphic data models in databases has led to the widespread use of object-relational mappers (ORMs). These manage the translation between the application and database models through object-oriented APIs, ensuring semantic integrity and enabling polymorphic querying. Contrary to their name, ORMs are now available for many different kinds of database. However, ORMs have been poorly regarded by seasoned engineers for as long as they have existed. They do not offer the full expressivity of a native database query language, often cannot generate data models and queries with optimal performance, are typically tied to specific database technologies, and introduce an additional layer of complexity and overhead to the database architecture. A full discussion of these problems is outside the scope of this article and will be covered in depth in a future one.

If it is possible to directly implement conceptual models in an application, why should this not be possible in a database? If it is possible to guarantee semantic integrity of polymorphic application data, why should this not be possible in a database? If it is possible to express natural-language polymorphic queries in application code, why should this not be possible in a database?

To enable all these things, a database would have to implement polymorphism as a primary component of its design. If such a database directly implemented a conceptual data model, could validate polymorphic constraints between types, and had query syntax capable of describing polymorphism, then it would be possible to eliminate object model mismatch, enforce semantic integrity, and express declarative polymorphic queries. Engineers would no longer need to manage polymorphism themselves, nor resort to ORMs. Such a database would be a polymorphic database.

TypeDB is such a polymorphic database. It features a conceptual data model, a strong subtyping system, a symbolic reasoning engine, and a beautiful and elegant type-theoretic language: TypeQL. Together, these features enable the expressivity and safety necessary to describe the polymorphism in natural data domains without compromises, bringing a new polymorphic paradigm into the range of contemporary general-purpose database paradigms.
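As a closing illustration, the example query from this article might be written in TypeQL along the following lines; the schema details assumed here (an ownership relation with owner and owned roles, and an abstract id attribute) are a sketch for illustration, not a worked implementation.

# Retrieve the details of every ownership in the filesystem.
match
$owned isa! $owned-type, has id $owned-id;
$owner isa! $owner-type, has id $owner-id;
(owned: $owned, owner: $owner) isa! $ownership-type;

Every combination of ownership, owned, and owner types is resolved by the database, rather than enumerated in the query.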
