New ACM paper, free-tier cloud, and open-source license

TypeDB Fundamentals

When Declarative Query Languages Become Imperative


Since Codd’s seminal work on relational databases, declarativity is considered an essential characteristic of a database query language. This allows engineers to focus on abstractions of data structures without having to imperatively traverse them, as was necessary in the pre-relational era of navigational databases. This is a principle that applies equally to programming languages and query languages.

However, modern query languages are not as declarative as they seem at first glance, especially when compared with modern programming languages. When dealing with polymorphic data, most query languages are actually more imperative in nature. In those languages, polymorphic queries are not sufficiently declarative to survive model extensions and migrations without refactors, as will be shown in this article. This leads to brittle application architectures that require continual maintainance.

This article will focus on three paradigms that have seen widespread adoption in commercial software engineering for high-level application development: the relational, document, and graph paradigms. Code examples for relational, document, and graph databases will use PostgreSQL, MongoDB, and Neo4j syntax respectively.

The example model

The examples in this article will use a practical model for a DAC filesystem. While the filesystem being modeled is conceptually very simple, the object model (illustrated in Java) involves both inheritance and interfaces in order to naturally model its polymorphic nature. This ensures that the model adheres closely to the domain semantics, and permits easy extensibility.

The model describes users, user groups, and resources within the filesystem, in addition to their ownership rights. All users can own resources, but only admins can own user groups. In addition to being owned by users, resources can also be owned by user groups. There is only one type of resource: files, but the model could easily be extended to add more types. Users, user groups, and resources each have different types of ID by which they’re identified.

The object model can be found on GitHub. It is worth noting that the model is not complete. For instance, none of the objects have properties other than their IDs and ownerships, and it is impossible to make a user a member of a user group. Both would be very simple to implement, but this model is only intended to illustrate certain polymorphic data structures, and not to serve as a fully-fledged application. Many important considerations have also been neglected for simplicity, such as ensuring that methods have appropriate access modifiers, or ensuring that ownerships can be correctly re-assigned .

With the model in place, creating filesystem objects is quite simple.

// Instantiate an admin with the specified email.
Admin cedric = new Admin("cedric@vaticle.com");

// Instantiate a user with the specified email.
User jimmy = new User("jimmy@vaticle.com");

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
UserGroup engineers = new UserGroup("engineers", cedric);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
File benchmark = new File("/jimmy/benchmark-results.xlsx", jimmy);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
File roadmap = new File("/vaticle/feature-roadmap.pdf", engineers);

Retrieving data belonging to the objects is also simple, which we will consider via a query-like operation:

Retrieve the details of every ownership in the file system, specifically: the type of ownership, the type and ID of the owned object, the type and ID of the object’s owner.

// Retrieve the details of every ownership in the filesystem.
fileSystemObjects.stream()
    .filter(object -> object instanceof Ownership.Owned)
    .map(object -> (Ownership.Owned) object)
    .map(owned -> Map.of(
        "owned-id", ((Id.Key<?>) owned).getId().value,
        "owned-type", owned.getClass().getSimpleName(),
        "owner-id", (((Id.Key<?>) owned.getOwner()).getId()).value,
        "owner-type", owned.getOwner().getClass().getSimpleName(),
        "ownership-type", owned.getOwnership().getClass().getSimpleName()
    ))
    .forEach(System.out::println);

// Output:
// {
//     owned-id=/vaticle/feature-roadmap.pdf,
//     ownership-type=ResourceOwnership,
//     owner-type=UserGroup,
//     owner-id=engineers,
//     owned-type=File
// }
// {
//     owned-id=/jimmy/benchmark-results.xlsx,
//     ownership-type=ResourceOwnership,
//     owner-type=User,
//     owner-id=jimmy@vaticle.com,
//     owned-type=File
// }
// {
//     owned-id=engineers,
//     ownership-type=GroupOwnership,
//     owner-type=Admin,
//     owner-id=cedric@vaticle.com,
//     owned-type=UserGroup
// }

This is a very important query in the access control domain. Understanding ownerships in the filesystem at a holistic level can identify vulnerabilities that could otherwise go unnoticed. Because the GroupOwnership and ResourceOwnership classes are subtypes of the Ownership class, we can retrieve the set of all ownerships in the filesystem in one go. We can also easily retrieve the ID of each object’s owner without knowing the owner’s type by using the Id.Key interface. These are immensely powerful capabilities, as we can introduce new classes that implement the interfaces of Ownership and Id, and retrieve this information about the new classes without having to modify any code.

The query operation functions declaratively without knowing any of the exact types in the filesystem, as it only operates on the interfaces. No matter how many new types we add to the filesystem, this query will continue to function and give correct results. Three other queries are included in the example code, but not discussed in this article:

  • Retrieve the ID of a given resource’s owner.
  • Retrieve the IDs of all resources owned by a given user.
  • Retrieve the type and ID of every object in the filesystem.

They exhibit the same declarative properties as the one featured here.

Relational implementation

To implement the filesystem model in a relational database, we must somehow deal with the inheritance hierarchies in the data model. Of the three enterprise design patterns coined by Martin Fowler (2002), we will use the class-table inheritance pattern as it is the most extensible. The differences between these patterns are discussed in more detail in a previous article.

CREATE TABLE users (
    email TEXT PRIMARY KEY
);

CREATE TABLE admins (
    email TEXT PRIMARY KEY REFERENCES users(email)
);

CREATE TABLE user_groups (
    name TEXT PRIMARY KEY
);

CREATE TABLE resources (
    id TEXT PRIMARY KEY
);

CREATE TABLE files (
    path TEXT PRIMARY KEY REFERENCES resources(id)
);

CREATE TABLE ownerships (
    id SERIAL PRIMARY KEY
);

CREATE TABLE resource_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE user_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_id TEXT REFERENCES users(email),
    resource_id TEXT REFERENCES resources(id)
);

CREATE TABLE user_group_of_resource_ownerships (
    id INT PRIMARY KEY REFERENCES resource_ownerships(id),
    user_group_id TEXT REFERENCES user_groups(name),
    resource_id TEXT REFERENCES resources (id)
);

CREATE TABLE group_ownerships (
    id INT PRIMARY KEY REFERENCES ownerships(id)
);

CREATE TABLE admin_of_group_ownerships (
    id INT PRIMARY KEY REFERENCES group_ownerships(id),
    admin_id TEXT REFERENCES admins(email),
    user_group_id TEXT REFERENCES user_groups(name)
);

The ownership interfaces present an interesting challenge here. As each foreign key column can only reference a column in a single other table, we have to create one table for users that own resources, one for user groups that own resources, and one for admins that own groups. Next, we insert data.

-- Instantiate an admin with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('cedric@vaticle.com');

    INSERT INTO admins (email)
    VALUES ('cedric@vaticle.com');
END $$;

-- Instantiate a user with the specified email.
DO $$
BEGIN
    INSERT INTO users (email)
    VALUES ('jimmy@vaticle.com');
END $$;

-- Instantiate a user group with the specified name,
-- and assign the admin Cedric as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO user_groups (name)
    VALUES ('engineers');

    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;

    INSERT INTO group_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO admin_of_group_ownerships (id, admin_id, user_group_id)
    VALUES (ownership_id, 'cedric@vaticle.com', 'engineers');
END $$;

-- Instantiate a file with the specified path,
-- and assign the user Jimmy as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO files (path)
    VALUES ('/jimmy/benchmark-results.xlsx');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_of_resource_ownerships (id, user_id, resource_id)
    VALUES (ownership_id, 'jimmy@vaticle.com', '/jimmy/benchmark-results.xlsx');
END $$;

-- Instantiate a file with the specified path,
-- and assign the Engineers group as its owner.
DO $$
DECLARE
    ownership_id INT;
BEGIN
    INSERT INTO resources (id)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO files (path)
    VALUES ('/vaticle/feature-roadmap.pdf');
	
    INSERT INTO ownerships (id)
    VALUES (DEFAULT)
    RETURNING id INTO ownership_id;
	
    INSERT INTO resource_ownerships (id)
    VALUES (ownership_id);
	
    INSERT INTO user_group_of_resource_ownerships (id, user_group_id, resource_id)
    VALUES (ownership_id, 'engineers', '/vaticle/feature-roadmap.pdf');
END $$;

Creating almost any filesystem object involves inserting rows into multiple tables, so the queries need to be enclosed in transactions to ensure consistency. The transactions are also necessary to create a scope for storing the auto-generated ownership IDs so they can be shared across tables. With data in place, we can now implement the example query.

-- Retrieve the details of every ownership in the filesystem.
SELECT
    ownership_type,
    owned_type,
    owned_id,
    owner_type,
    owner_id
FROM (
    SELECT
        'group_ownership' AS ownership_type,
        'user_group' AS owned_type,
        admin_of_group_ownerships.user_group_id AS owned_id,
        'admin' AS owner_type,
        admin_of_group_ownerships.admin_id AS owner_id
    FROM ownerships
    JOIN admin_of_group_ownerships USING (id)
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user' AS owner_type,
        user_of_resource_ownerships.user_id AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    WHERE user_of_resource_ownerships.user_id NOT IN (
        SELECT admins.email
        FROM admins
    )
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'admin' AS owner_type,
        admins.email AS owner_id
    FROM ownerships
    JOIN user_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_of_resource_ownerships.resource_id
    JOIN admins ON admins.email = user_of_resource_ownerships.user_id
    UNION
    SELECT
        'resource_ownership' AS ownership_type,
        'file' AS owned_type,
        files.path AS owned_id,
        'user_group' AS owner_type,
        user_group_of_resource_ownerships.user_group_id AS owner_id
    FROM ownerships
    JOIN user_group_of_resource_ownerships USING (id)
    JOIN files ON files.path = user_group_of_resource_ownerships.resource_id
) AS ownerships;

Retrieving all the ownerships in the filesystem requires a union of several tables, in which the ownership types must be hardcoded into the query as strings. Because of the multiple implementations of the Owned and Owner interfaces in the object model, the number of unions required will be equal to the number of combinations of concrete classes that implement the interfaces, whether directly or by inheritance.

Ownership.Owned implementationOwnership.Owner implementation
UserGroup implements GroupOwnership.OwnedAdmin implements GroupOwnership.Owner
File implements ResourceOwnership.OwnedUser implements ResourceOwnership.Owner
File implements ResourceOwnership.OwnedAdmin implements ResourceOwnership.Owner
File implements ResourceOwnership.OwnedUserGroup implements ResourceOwnership.Owner
Concrete implementations of ownership interfaces in the filesystem object model, both directly and by inheritance.

Additionally, the users table represents a class that is both concrete and extended (by the admin type). As a result, an additional condition needs to be added to the query so that admins are not returned twice, as users and as admins. As with the application code, we need to retrieve the ID of each object and its owner without knowing their types. Because the ID type is different for each class, we need to hardcode the correct column names into each union branch.

Document implementation

For the document database implementation, we will use the polymorphic design pattern, as advocated by Daniel Coupal et al. (2023). This ensures we will not suffer from the performance costs of unnecessary lookups and unions, which document databases are not optimised to perform. Adhering to best practices for unbounded many-to-many relationships, ownerships will be recorded by referencing the owner in a field of the owned object.

// Instantiate an admin with the specified email.
db.users.insert( {
    "email": "cedric@vaticle.com",
    "user_type": "admin"
} )

// Instantiate a user with the specified email.
db.users.insert( {
    "email": "jimmy@vaticle.com",
    "user_type": "user"
} )

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
owner_id = db.users.find( { "email": "cedric@vaticle.com" } ).next()["_id"]
db.groups.insert( {
    "name": "engineers",
    "group_type": "user_group",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
owner_id = db.users.find( { "email": "jimmy@vaticle.com" } ).next()["_id"]
db.resources.insert( {
    "path": "/jimmy/benchmark-results.xlsx",
    "resource_type": "file",
    "owner": owner_id
} )

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
owner_id = db.groups.find( { "name": "engineers" } ).next()["_id"]
db.resources.insert( {
    "path": "/vaticle/feature-roadmap.pdf",
    "resource_type": "file",
    "owner": owner_id
} )

In order to store everything’s exact type, we’ve created a type field in every collection. If the type hierarchies had more levels, this would involve using a list field to contain all the appropriate types, as collections can only represent a single type. We now turn to the example query.

db.groups.aggregate( [
    { "$addFields": { "ownership_type": "group_ownership" } },
    { "$unionWith": {
        "coll": "resources",
        "pipeline": [ { "$addFields": { "ownership_type": "resource_ownership" } } ]
    } },
    { "$lookup": {
        "from": "users",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_owners"
    } },
    { "$lookup": {
        "from": "groups",
        "localField": "owner",
        "foreignField": "_id",
        "as": "user_group_owners"
    } },
    { "$addFields": { "owners": { "$concatArrays": [ "$user_owners", "$user_group_owners" ] } } },
    { "$unwind": "$owners" },
    { "$addFields": {
        "owned_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "user_group" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "file" }
        ] } }
    } },
    { "$addFields": {
        "owned_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$group_type", "user_group" ] }, "then": "$name" },
            { "case": { "$eq": [ "$resource_type", "file" ] }, "then": "$path" }
        ] } }
    } },
    { "$addFields": {
        "owner_type": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "user" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "admin" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "user_group" }
        ] } }
    } },
    { "$project": {
        "_id": false,
        "ownership_type": true,
        "owned_type": true,
        "owned_id": true,
        "owner_type": true,
        "owner_id": { "$switch": { "branches": [
            { "case": { "$eq": [ "$owners.user_type", "user" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.user_type", "admin" ] }, "then": "$owners.email" },
            { "case": { "$eq": [ "$owners.group_type", "user_group" ] }, "then": "$owners.name" }
        ] } }
    } }
] )

There’s a lot to unpack in this query, which requires an aggregation pipeline. The first step is to union the groups and resources collections, the two types that can be owned. We also add a hardcoded field for the ownership type at this stage. Next, lookups must be performed on the users and groups collections, before concatenating the results of both into a single field. Finally, projections are used to extract the type and ID of the object and owner into the final result format, involving one switch statement per projection to map onto the correct values. There are alternative approaches, but none that reduce the complexity of the query.

Graph implementation

For the graph implementation, the model is fairly straightforward when adhering to traditional entity-relationship modeling practices, with no significant design choices to be made.

// Instantiate an admin with the specified email.
CREATE (cedric:User:Admin {email: "cedric@vaticle.com"});

// Instantiate a user with the specified email.
CREATE (jimmy:User {email: "jimmy@vaticle.com"});

// Instantiate a user group with the specified name,
// and assign the admin Cedric as its owner.
MATCH (cedric:Admin {email: "cedric@vaticle.com"})
CREATE
    (engineers:UserGroup {name: "engineers"}),
    (cedric)-[:OWNS {ownership_type: "GroupOwnership"}]->(engineers);

// Instantiate a file with the specified path,
// and assign the user Jimmy as its owner.
MATCH (jimmy:User {email: "jimmy@vaticle.com"})
CREATE
    (benchmark:Resource:File {path: "/jimmy/benchmark-results.xlsx"}),
    (jimmy)-[:OWNS {ownership_type: "ResourceOwnership"}]->(benchmark);

// Instantiate a file with the specified path,
// and assign the Engineers group as its owner.
MATCH (engineers:UserGroup {name: "engineers"})
CREATE
    (roadmap:Resource:File {path: "/vaticle/feature-roadmap.pdf"}),
    (engineers)-[:OWNS {ownership_type: "ResourceOwnership"}]->(roadmap);

The only place where we’ve had to depart from the object model is in the storage of ownership types as properties of OWNS relationships rather than as additional labels, as relationships are not permitted to have multiple labels. Now, we turn to the example query.

// Retrieve the details of every ownership in the filesystem.
MATCH (owner)-[ownership:OWNS]->(owned)
UNWIND labels(owned) AS owned_type
UNWIND labels(owner) AS owner_type
WITH ownership, owned, owned_type, owner, owner_type,
    {
        User: "email",
        Admin: "email",
        UserGroup: "name",
        File: "path"
    } AS id_type_map
WHERE owned_type IN keys(id_type_map)
AND id_type_map[owned_type] IN keys(owned)
AND owner_type IN keys(id_type_map)
AND id_type_map[owner_type] IN keys(owner)
AND NOT (
    owner_type = "User"
    AND "Admin" IN labels(owner)
)
RETURN
    ownership.ownership_type AS ownership_type,
    owned_type,
    owned[id_type_map[owned_type]] AS owned_id,
    owner_type,
    owner[id_type_map[owner_type]] AS owner_id;

This query does something quite powerful. Unlike the relational and document queries where we needed to specify the tables or collections of the owned objects and their owners, here we can retrieve the two nodes without knowing their labels. This is very similar to the application code in which we can retrieve them without knowing their types. However, while in the application code we can also retrieve their IDs without knowing their types, here we need the equivalent of a switch statement to select the right properties of the nodes. We also need to filter out results so that admins are not returned twice, because the unwind operation will return both User and Admin as types of admins, as with the relational implementation.

Common pitfalls

Unions of types

The relational and document implementations both need to enumerate the tables or collections in which the objects and their owners might be, and perform a union (or equivalent) of the results, with hardcoded joins or lookups from the objects to their owners. The graph implementation does not have this problem, as it queries the object and owner without specifying their types, in the same way as the application code. The query is leveraging interface polymorphism, by finding objects that implement the interfaces Ownership.Owner and Ownership.Owned via the start- and end-node of the OWNS relationship label.

Switch statements

All three implementations need a switch statement (or equivalent) to retrieve IDs. This is less obvious in the relational implementation as projections are performed in each branch of the union instead of in a single place. The graph implementation is not able to leverage the Id.Key interface like it does the Ownership interfaces, which leaves it in the same position as the relational and document implementations. Graph databases are only able to capture interface polymorphism in queries under certain circumstances.

Encoding inheritance

The relational and graph implementations also include an additional condition to ensure that admins are not returned twice. This is because admins occur in both the users and admins table in the relational implementation, and have both the User and Admin labels in the graph implementation. We have been able to express that admins have multiple types, but neither implementation can recognise that one type is a subtype of the other. Both database paradigms are unable to leverage inheritance polymorphism in this way. Neither is the document implementation, but this is not apparent as we only stored a single type for each object. If we had chosen to store a list of types instead, then the same problem would occur.

Summary

Despite having adhered to recommended design patterns for each database paradigm, we are unable to write declarative queries. This is due to the polymorphism inherent in the original application model, which cannot be generally expressed using the query syntax of relational, document, or graph databases. In order to return polymorphic results without polymorphic querying capabilities, semantic context must be hardcoded into the queries. If the relational query is translated into a natural language question, it would be:

Retrieve the string “group_ownership”, the string “user_group”, the user group ID, the string “admin”, and the admin ID for group ownerships by admins; retrieve the string “resource_ownership”, the string “file”, the file path, the string “user”, and the user ID for resource ownerships by users where the user is not an admin; retrieve the string “resource_ownership”, the string “file”, the file path, the string “admin”, and the admin email for resource ownerships by users where the user is an admin; and retrieve the string “resource ownership”, the string “file”, the file path, the string “user_group”, and the user group ID for resource ownerships by user groups.

This is completely different to the original natural language query:

Retrieve the details of every ownership in the file system, specifically: the type of ownership, the type and ID of the owned object, and the type and ID of the object’s owner.

The query has been irreversibly translated from its original declarative form into an implementation-specific imperative form. This is the case for all three database paradigms to different degrees, and is in contrast to the application code, in which the query operation actually captures the natural-language expression.

This has severe impacts for the maintainability and extensibility of the database implementations. If the data model is modified or extended, the queries cease to function as intended, and possibly at all, leading to intrinsically brittle application architectures. In order to handle fully declarative queries, a database would need DML syntax capable of accurately describing the polymorphism, and a query execution engine capable of resolving such queries.

Share this article

TypeDB Newsletter

Stay up to date with the latest TypeDB announcements and events.

Subscribe to Newsletter

Further Learning

Why Polymorphism

Learn how current databases lack the expressivity to natively model polymorphism, leading to object model mismatch, semantic integrity loss, and imperative querying.

Read article

Type-Theoretic Language

TypeQL is the type-theoretic query language of TypeDB, allowing for highly expressive querying power beyond the capabilities of other modern database paradigms.

Read article

Types in Databases

TypeDB combines type-theoretic data structures and language with classical database concepts. Its query language, TypeQL, is a direct reflection of this confluence.

Read article

Feedback