Officially out now: The TypeDB 3.0 Roadmap

TypeDB Fundamentals

Why We Need a New High-level Database Programming Model


A wide variety of database paradigms have been conceived since the onset of the digital age, ranging from the navigational databases of the 1960s to the multi-model databases of today. Each of those paradigms has built on previous ones to account for the continually increasing scope and complexity of modern data, but many challenges in data engineering seem to evade any solution. Why is object-relational mismatch still a problem after 50 years? Why do we need ORMs to manage data integrity for us? Why do queries need to be updated when the data model is extended, even if the question being asked remains the same?

Each successive database paradigm has only implemented workarounds to these problems without addressing the underlying cause: they cannot express many of the OOP features that we quickly take for granted, like abstraction, inheritance, and polymorphism. Programming languages have continuously become more powerful and declarative as they have evolved, but database query languages have not significantly advanced beyond SQL. This has left databases unable to directly implement object models, enforce semantic integrity of data, or resolve declarative polymorphic queries.

This article series attempts to answer the above questions by conducting an in-depth exploration of the strengths and weaknesses of three of the most popular current database paradigms. It begins by examining the historical context and design philosophies behind the relational, document, and graph database paradigms to understand their capabilities and the conditions under which object model mismatches occur. It then shows how semantic integrity of data cannot be enforced in those paradigms with a simple class inheritance hierarchy. Finally, using an object model for a lightweight filesystem application as a benchmark, the stark difference in declarativity between application code and database queries is demonstrated.

Why Object Model Mismatch is Still a Problem

The most infamous class of data engineering challenges are object model mismatches: when the data models in the application and database differ. The most well-known kind is object-relational (impedence) mismatch in relational databases. Many NoSQL databases claim to solve object-relational mismatch, but contrary to their name, ORMs are now available for many different kinds of database out of necessity. Due to their very different logical data models, object model mismatch manifests under different conditions in different database paradigms, but it continues to be a widespread problem.

In the relational query below, a single conceptual user is represented as rows within multiple tables, in order for users to have multiple emails. Essentially, the notion of object identity has been violated. The mapping of application objects representing users to the rows that represent them in the database is no longer a one-to-one mapping, and a mismatch between the object and relational models results.

-- Insert a new user with three emails.
DO $$
DECLARE
    user_id INT;
BEGIN
    INSERT INTO users (id, first_name, last_name)
    VALUES (DEFAULT, 'Thomas', 'Carcetti')
    RETURNING id INTO inserted_user_id;
  
    INSERT INTO user_emails (user_id, email)
    VALUES
        (inserted_user_id, 'thomas.carcetti@vaticle.com'),
        (inserted_user_id, 'thomas@vaticle.com'),
        (inserted_user_id, 'tommy@vaticle.com');
END $$;

The key reason mismatches occur is the division between the conceptual and logical data models in relational, document, and graph databases. While logical data models differ across databases paradigms, there are always conceptual data structures that cannot be expressed with them. For this not to be the case, the logical model would need to be identical to the conceptual model. A database that directly implements a conceptual model in this way can eliminate object model mismatch.

You can read the full article to learn more about object model mismatch in relational, document, and graph databases.

How Semantic Integrity Loss Occurs in a Database

Semantic integrity is a key requirement for virtually all databases deployed in production. A database functions as a source of truth for an organization’s data, making it the critical point of failure in semantic integrity enforcement. For this reason, the most reliable approach to managing data integrity is to build guards directly into the database. Most modern databases include some form of DDL syntax for constraint definition, but such syntax very often falls short, leading to integrity enforcement being offloaded to the backend where it can be unintentionally circumvented.

In the graph query below, there is no way to constrain the nodes that can participate in a given relationship, so it is possible to create a contractor that is not in any CONTRACTED_FROM relationships. It is also possible to create a CONTRACTED_FROM relationship that represents any of the following:

  • A permanent employee contracted from an agency.
  • A contractor contracted from a permanent employee.
  • An agency contracted from a contractor.
  • A permanent employee contracted from themselves!
// Insert two new employees.
MATCH (baltimore:Agency {id: 8} )
CREATE
    (kima:Employee:PermanentEmployee {
        id: 1,
        first_name: "Kima",
        last_name: "Greggs",
        start_date: date("2020-09-25"),
        salary: 65000.00
    } ),
    (reginald:Employee:Contractor {
        id: 2,
        first_name: "Reginald",
        last_name: "Cousins"
    } )-[:CONTRACTED_FROM {
        contract_number: 301,
        hourly_rate: 50.00
    } ]->(baltimore);

In relational, document, and graph databases, vulnerability to corruption comes from the inability to describe complex constraints between types of data instances: tables, collections, and labels respectively. In order to eliminate this vulnerability and enforce semantic integrity, a database must have more powerful DDL syntax and validation, able to express the polymorphic constraints between types that are naturally present in data models.

You can read the full article to learn more about semantic integrity loss in relational, document, and graph databases.

When Declarative Query Languages Become Imperative

Declarativity is considered an essential characteristic of a database query language. This allows engineers to focus on abstractions of data structures without having to imperatively traverse them, as was necessary in the pre-relational era of navigational databases. However, modern query languages are not as declarative as they seem at first glance, especially when compared with modern programming languages. Polymorphic queries are not sufficiently declarative to survive model extensions and migrations without refactors, leading to brittle application architectures that require continual maintainance.

In the document query below, an aggregation pipeline is requried to retrieve the ID of a file’s owner. Lookups must be made to the users and groups collections in which the owner might be, before concatenating and unwinding the results. Finally, a switch statement is needed to select the correct field containing the ID. If new file owner types are added or existing ones are changed, the query will no longer function as intended, as the lookup collections and switch statement are hardcoded into the query.

// Retrieve the ID of the feature roadmap's owner.
db.resources.aggregate( [
    { "$match": { "path": "/vaticle/feature-roadmap.pdf" } },
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_owners"
    } },
    { "$lookup": {
        "from": "groups",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_group_owners"
    } },
    { "$addFields": { "owners": { "$concatArrays": [ "$user_owners", "$user_group_owners" ] } } },
    { "$unwind": "$owners" },
    { "$project": {
        "_id": false,
        "owner_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "$owners.name" }
        ] } }
    } }
] )

This is due to the polymorphism inherent in the original application model, which cannot be generally expressed using the query syntax of relational, document, or graph databases. In order to return polymorphic results without polymorphic querying capabilities, semantic context must be hardcoded into the queries. This has severe impacts for the maintainability and extensibility of the database implementations. In order to handle fully declarative queries, a database would need DML syntax capable of accurately describing the polymorphism, and a query execution engine capable of resolving such queries.

You can read the full article to learn more about imperative querying in relational, document, and graph databases.

Summary

Current database paradigms have been left far behind programming languages in expressivity. While general-purpose programming languages have continually evolved to become more abstract and declarative, database query languages have been unable to keep up. The inability of contemporary databases to model or express polymorphic data causes three key problems for engineers:

  • Object model mismatch prevents direct translation between application code and the database.
  • Semantic integrity of inserted data cannot be enforced, allowing data corruption to occur silently.
  • Polymorphic queries must have model context hardcoded, requiring continual maintenance.

ORMs can manage the translation between the application and database models through object-oriented APIs, ensuring semantic integrity and enabling polymorphic querying, and are now available for many different kinds of database. However, ORMs have been poorly regarded by seasoned engineers for as long as they have existed. They do not offer the full expressivity of a native database query language, often cannot generate data models and queries with optimal performance, are typically tied to specific database technologies, and introduce an additional layer of complexity and overhead to database architecture.

If it is possible to directly implement conceptual models in an application, why should this not be possible in a database? If it is possible to guarantee semantic integrity of polymorphic application data, why should this not be possible in a database? If it is possible to express natural-language polymorphic queries in application code, why should this not be possible in a database?

To enable all these things, a database would have to implement polymorphism as a primary component of its design. TypeDB is such a polymorphic database. It features a conceptual data model, a strong subtyping system, a symbolic reasoning engine, and a beautiful and elegant type-theoretic language: TypeQL. Together, these features enable TypeDB to eliminate object model mismatch, enforce semantic integrity, and express declarative polymorphic queries, without having to rely on ORMs.

Share this article

TypeDB Newsletter

Stay up to date with the latest TypeDB announcements and events.

Subscribe to Newsletter

Further Learning

Why Polymorphism

Learn how current databases lack the expressivity to natively model polymorphism, leading to object model mismatch, semantic integrity loss, and imperative querying.

Watch lecture

TypeDB's Core Features

Explore TypeDB's core features in detail, including validated polymorphic data models and declarative querying, and learn about their impact on database engineering.

Read article

The Need for Subtyping

A new revolution in databases overcomes the limitations of relational, document and graph databases by using subtyping and polymorphism  to create higher abstractions and achieve greater expressivity.

Read article

Feedback