New ACM paper, free-tier cloud, and open-source license

TypeDB Fundamentals

How Semantic Integrity Loss Occurs in a Database


Semantic integrity is a key requirement for virtually all databases deployed in production. Data without semantic integrity is functionally meaningless. When semantic integrity loss occurs, it often does so silently and in unexpected ways, making the original information irrecoverable. A database functions as a source of truth for an organization’s data, making it the critical point of failure in semantic integrity enforcement. For this reason, the most reliable approach to managing data integrity is to build guards directly into the database, preventing data anomalies from ever persisting and propagating.

To maintain semantic integrity in a database, it is necessary to define conditions for inserted data to be considered valid or invalid. This is done using a database’s data definition language (DDL). Most modern databases include some form of DDL syntax for constraint definition, but this article will demonstrate that such syntax very often falls short, leading to integrity enforcement being offloaded to the backend where it can be unintentionally circumvented. This is especially the case when dealing with polymorphic data.

This article will focus on three paradigms that have seen widespread adoption in commercial software engineering for high-level application development: the relational, document, and graph paradigms. Code examples for relational, document, and graph databases will use PostgreSQL, MongoDB, and Neo4j syntax respectively.

The example model

A simple two-level inheritance hierarchy will be used in examples, pertaining to employees in a company, all of whom are either permanent employees or contractors.

In this model, all employees have:

  • An employee ID.
  • A first name.
  • A last name.

Permanent employees also have:

  • A start date.
  • An end date (if they have left the company).
  • A salary.

Contractors instead have:

  • An agency.
  • A contract number.
  • An hourly rate.

Integrity in relational databases

In relational databases, the schema ensures referential integrity. Automatic referential integrity management prevents data corruption and is one of the most powerful features of relational databases. It is the primary reason they are still used today for the storage of critical data. However, referential integrity does not amount to semantic integrity, and it is perfectly possible to have a relational database in a valid logical state in which the data has been corrupted.

To deal with the inheritance hierarchy in our data model, we will use one of three popular enterprise design patterns, coined by Martin Fowler (2002): single-table inheritance, concrete-table inheritance, and class-table inheritance. These design patterns have a range of advantages and disadvantages, which must be traded off depending on the requirements of the application. They are discussed in more detail in a previous article, and this article will focus only on semantic integrity.

Pattern 1: Single-table inheritance

In this design pattern, all data is stored in one table that represents the common parent type. The table contains columns for the parent type’s attributes and columns for the attributes of every child type.

CREATE TYPE EmployeeType
AS ENUM ('permanent_employee', 'contractor');

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    employee_type EmployeeType NOT NULL,
    start_date DATE DEFAULT NULL,
    end_date DATE DEFAULT NULL,
    agency_id INT DEFAULT NULL REFERENCES agencies(id),
    contract_number INT DEFAULT NULL,
    salary MONEY DEFAULT NULL,
    hourly_rate MONEY DEFAULT NULL
);

ALTER TABLE employees
ADD CONSTRAINT check_nulls
CHECK ( (
    employee_type = 'permanent_employee'
    AND num_nulls(start_date, salary) = 0
    AND num_nonnulls(agency_id, contract_number, hourly_rate) = 0
    ) OR (
    employee_type = 'contractor'
    AND num_nulls(agency_id, contract_number, hourly_rate) = 0
    AND num_nonnulls(start_date, end_date, salary) = 0
) );

Attribute semantics are controlled with the custom CHECK constraint to ensure that each employee can only have the correct attributes. However, if another table references the employee table with a foreign key, there is no simple way to ensure that it only references the correct employee types. For instance, if there is a table of performance reviews for permanent employees, it is possible to insert a performance review for a contractor.

CREATE TABLE performance_reviews (
    id SERIAL PRIMARY KEY,
    employee_id INT NOT NULL REFERENCES employees(id),
    review_date DATE NOT NULL,
    review_score INT NOT NULL,
    comments TEXT DEFAULT NULL
);

Pattern 2: Concrete-table inheritance

In this design pattern, one table is used for each non-abstract type in the hierarchy. Each table contains columns for the parent type’s attributes in addition to those for the attributes specific to the actual type represented.

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

This pattern solves the previous issue as the performance_reviews table can now directly reference the permanent_employees table, however another problem now occurs. All employees must have globally unique IDs, but as different employee types are split over different tables, it is no longer possible to have the database assign IDs using the auto-incrementing SERIAL data type. As a result, it is possible to insert two employees with the same ID.

Pattern 3: Class-table inheritance

In this design pattern, one table is used for every type in the hierarchy, abstract and non-abstract. Each table only contains columns for the attributes specific to the type represented.

CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);

CREATE TABLE permanent_employees (
    id INT PRIMARY KEY REFERENCES employee(id),
    start_date DATE NOT NULL,
    end_date DATE DEFAULT NULL,
    salary MONEY NOT NULL
);

CREATE TABLE contractors (
    id INT PRIMARY KEY REFERENCES employee(id),
    agency_id INT NOT NULL REFERENCES agencies(id),
    contract_number INT NOT NULL,
    hourly_rate MONEY NOT NULL
);

Compared to the single- and concrete-table patterns, this pattern is more architecturally complex, but it is possible to simultaneously solve the previous issues. The dependency of performance reviews can be correctly enforced by referencing the permanent_employees table, and global employee ID uniqueness is enforced by storing them in the root employees table. But object-relational mismatch has now been introduced to the model as each employee is represented by multiple table rows, which must be inserted and read together.

DO $$
DECLARE
    employee_id INT;
BEGIN
    INSERT INTO employees (id, first_name, last_name)
    VALUES (DEFAULT, 'Kima', 'Greggs')
    RETURNING id INTO employee_id;

    INSERT INTO permanent_employees (id, start_date, salary)
    VALUES (employee_id, '2020-09-25', 65000.00);
END $$;

This causes another semantic integrity issue: it is not possible to enforce the number of types an employee can have. For example, a single employee in the employees table could have entries in both the permament_employees and contractors tables, or even in neither!

The root of the problem

While all three design patterns ensure referential integrity, none of the patterns are able to enforce semantic integrity, each failing in a different way. This is a result of SQL not providing syntax for constraining cross-table dependencies other than foreign keys. Some relational databases include faculties for creating type hierarchies, but they are limited in scope and cannot enforce semantic integrity in the general case. The case of the class-table pattern highlights a wider-ranging problem: due to the lack of cross-table constraints, semantic integrity cannot be enforced when object-relational mismatch occurs. This means that any model that includes polymorphic features cannot have semantic integrity enforced in a relational database.

Integrity in document databases

To implement the employee hierarchy in a document database, we will store them in a single combined employees collection. Daniel Coupal et al. (2023) advocate for storing all similar documents (in this case employees) in a single collection representing the parent type, which they call the “polymorphic design pattern”, with the specific types recorded as data fields. The primary reason for this is performance, as document databases are not optimised to perform the lookups and unions required to access data divided across multiple collections. Using this pattern for the current example, an important design decision remains. Each contractor references an agency, and we must decide whether to embed the agency data in the contractor documents, or instead reference separate agency documents by ID.

Pattern 1: Embedding dependent data

With this approach, an agency sub-document is embedded into every contractor document.

db.employees.insert( [
    {
        "id": 1,
        "first_name": "Kima",
        "last_name": "Greggs",
        "employee_type": "permanent_employee",
        "start_date": new Date("2020-09-25"),
        "salary": 65000.00
    },
    {
        "id": 2,
        "first_name": "Reginald",
        "last_name": "Cousins",
        "employee_type": "contractor",
        "agency": { "id": 8, "name": "Baltimore Agency" },
        "contract_number": 301,
        "hourly_rate": 50.00
    }
] )

This leads to duplication of data. If an agency has multiple contractors, then its data must be embedded in every associated contractor document. This leads to data redundancy, and requires engineers to ensure the redundant copies remain synchronized. Copies might also exist in other collections, and without a schema to indicate where they all are, there is a significant risk of data inconsistency, and with it loss of semantic integrity.

Pattern 2: Referencing dependent data

In the example, every contractor document references an agency ID, corresponding to the built-in _id field of a document in the agencies collection.

baltimore_id = db.agencies.find( { "id": 8 } ).next()["_id"]

db.employees.insert( [
    {
        "id": 1,
        "first_name": "Kima",
        "last_name": "Greggs",
        "employee_type": "permanent_employee",
        "start_date": new Date("2020-09-25"),
        "salary": 65000.00
    },
    {
        "id": 2,
        "first_name": "Reginald",
        "last_name": "Cousins",
        "employee_type": "contractor",
        "agency": baltimore_id,
        "contract_number": 301,
        "hourly_rate": 50.00
    }
] )

There is no data redundancy with this approach, but there is another problem: the agency field of each contractor only contains the literal value of the agency’s _id field with no referential guarantees. Unlike in relational databases where foreign keys enforce referential integrity, here it is possible to delete the agency document without knowing if it is referenced elsewhere in the database, leading to the creation of null references and loss of semantic integrity.

Between a rock and a hard place

Embedding dependent data results in loss of data consistency, while referencing it results in loss of referential integrity, so neither pattern can enforce overall semantic integrity. As with the relational case, the loss of semantic integrity occurs when cross-document dependencies need to be enforced. Custom document constraints cannot help here, as they only enforce the structure of a single document. Because many-to-many relationships can only be represented using one of these two patterns, the semantic integrity of many-to-many relationships cannot be enforced when implemented in a document database.

Integrity in graph databases

When implementing the employee hierarchy in a graph databases, the process is straightforward if adhering to traditional entity-relationship modeling practices.

MATCH (baltimore:Agency {id: 8} )
CREATE
    (kima:Employee:PermanentEmployee {
        id: 1,
        first_name: "Kima",
        last_name: "Greggs",
        start_date: date("2020-09-25"),
        salary: 65000.00
    } ),
    (reginald:Employee:Contractor {
        id: 2,
        first_name: "Reginald",
        last_name: "Cousins"
    } )-[:CONTRACTED_FROM {
        contract_number: 301,
        hourly_rate: 50.00
    } ]->(baltimore);

Some of the properties of contractors become properties of their contracts. This is in line with recommendations by Barrasa (2017), as relationships can be assigned properties without reifying them. Employees are given multiple labels to represent their employee types. This allows properties to be constrained based on the combination of each node’s labels.

CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.id IS KEY;
CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.id IS TYPED INTEGER;
CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.first_name IS NOT NULL;
CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.first_name IS TYPED STRING;
CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.last_name IS NOT NULL;
CREATE CONSTRAINT FOR (n:Employee) REQUIRE n.last_name IS TYPED STRING;

CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.start_date IS NOT NULL;
CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.start_date IS TYPED DATE;
CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.end_date IS TYPED DATE;
CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.salary IS NOT NULL;
CREATE CONSTRAINT FOR (n:PermanentEmployee) REQUIRE n.salary IS TYPED FLOAT;

CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.contract_number IS KEY;
CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.contract_number IS TYPED INTEGER;
CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.hourly_rate IS NOT NULL;
CREATE CONSTRAINT FOR ()-[r:CONTRACTED_FROM]-() REQUIRE r.hourly_rate IS TYPED FLOAT;

Constraints are not a schema

Constraining of property values is the limit of DDL syntax in graph databases. This leaves two major problems. First, there is no way to control grouping of labels, so it is possible to create a node with the PermanentEmployee label that does not have the Employee label, a node with both the PermanentEmployee and Contractor labels, or a node with the abstract Employee label but no label for a concrete type. Second, there is no way to constrain the nodes that can act as the start- and end-nodes in a given relationship, so it is possible to create a contractor that is not in any CONTRACTED_FROM relationships. It is also possible to create a CONTRACTED_FROM relationship that represents any of the following:

  • A permanent employee contracted from an agency.
  • A contractor contracted from a permanent employee.
  • An agency contracted from a contractor.
  • A permanent employee contracted from themselves!

These two problems can be generalised: constraints cannot be defined that control dependencies between labels. These cross-label dependencies are analogous to the relational and document paradigms, where constraints cannot enforce cross-table and cross-document dependencies. Because they affect all models, it is not possible to enforce semantic integrity in any graph data model.

Summary

Semantic integrity is impossible to enforce in a relational database with polymorphic features in its data model, in a document database with many-to-many relationships, and in any graph database. In all three cases, the source of the vulnerability is the inability to describe complex constraints between types of data instances: tables, collections, and labels respectively. In order to eliminate this vulnerability and enforce semantic integrity, a database must have more powerful DDL syntax and validation, able to express the polymorphic constraints between types that are naturally present in data models.

Share this article

TypeDB Newsletter

Stay up to date with the latest TypeDB announcements and events.

Subscribe to Newsletter

Further Learning

Why Polymorphism

Learn how current databases lack the expressivity to natively model polymorphism, leading to object model mismatch, semantic integrity loss, and imperative querying.

Read article

Strong Type System

The backbone of TypeDB is its strong type system, which powers the resolution of declarative polymorphic queries to determine the possible return types, and the semantic validation of data.

Read article

Concepts and Types

Learn about the general conceptual principles of data organization, and how these principles are made rigorous via types, covering the key terminology used describe TypeDB's data model.

Read article

Feedback