TypeDB Fundamentals
Why Object Model Mismatch is Still a Problem
The most infamous class of data engineering challenges are object model mismatches: when the data models in the application and database differ. The most well-known kind is object-relational (impedence) mismatch in relational databases, but equivalent mismatches are prevalent across most contemporary database paradigms. The difficulty of managing object model mismatch in databases has lead to the widespread use of object-relational mappers (ORMs), which manage the translation between the application and database models.
Many NoSQL databases claim to solve object-relational mismatch, but contrary to their name, ORMs are now available for many different kinds of database out of necessity. Due to their very different logical data models, object model mismatch manifests under different conditions in different database paradigms, but it continues to be a widespread problem.
This article will focus on three paradigms that have seen widespread adoption in commercial software engineering for high-level application development: the relational, document, and graph paradigms. Code examples for relational, document, and graph databases will use PostgreSQL, MongoDB, and Neo4j syntax respectively.
Object-relational mismatch
The first versions of relational database theory and SQL were conceived in the early 1970s when procedural programming languages dominated information technology. Codd’s set-theoretic relational algebra (Codd, 1969) is the foundation of SQL, and was a turning point for data persistence. Previous database paradigms were navigational (though this encompassing term for pre-relational databases was not coined until later by Bachman, 1973), and accessed data imperatively. Users would have to describe how to iterate from each data instance to the next, by following links between them. In comparison, relational databases function declaratively: the user provides the qualifiers that describe a data instance to retrieve, and the database returns all matching instances.
Relational databases quickly became embedded in almost every digitized industry as the prevailing convention, becoming an ANSI and ISO standard in 1986 and 1987, respectively. This was prior to the wide spread of object-oriented programming (OOP), which was spearheaded by the release of C++ in 1986, not standardized by the ISO until 1998. As a result, the most common general-purpose database paradigm made no account of what would become the most common general-purpose programming paradigm. This presented a challenge to programmers: the relational and object models were fundamentally incompatible, requiring data structures to be translated from one to the other. This disparity is known as object-relational mismatch.
Basics of relational modeling
In relational algebra, all data is described as tuples within relations. These are practically implemented under more well-known terms, as rows within tables. Relations can be combined using set operations like intersections (known as inner-joins), unions, projections, and cartesian products (cross-joins). The resulting sets are known as derived relations in the algebra, and as views in databases. Relational databases are structured, and the table definitions are enforced by the database’s schema.
Relational databases | Set-theoretic algebra |
---|---|
Table | Relation |
Column | Attribute |
Row | Tuple |
View | Derived relation |
One element of relational databases that does not originate in Codd’s algebra is the foreign key: a constraint on the values of attributes in a particular table to values that already exist in another table. This is used to guarantee referential integrity of dependencies between data instances. To retrieve dependent data instances together, the two tables must then be joined on the foreign key. Like the structure of the tables themselves, the structure of the foreign keys is also enforced by the schema.
Representing multivalued attributes
Object-relational mismatch arises from the fact that tuples are one-dimensional, and so cannot represent multi-dimensional objects. The most common example of this phenomenon is in representing multivalued attributes, such as a user in a system with multiple email addresses.
INSERT INTO users (id, first_name, last_name, emails)
VALUES (DEFAULT, 'Thomas', 'Carcetti', 'thomas.carcetti@vaticle.com;thomas@vaticle.com;tommy@vaticle.com');
The data in this example is unnormalized, as multiple email values have been concatenated into a single string. This makes adding, retrieving, and removing individual addresses harder to do without corrupting the data. In order to avoid this, the data is normalized to ensure every value in every table row is atomic. In this particular case, a new user_emails
table is created to store the emails individually, with a foreign key referencing the users
table.
DO $$
DECLARE
user_id INT;
BEGIN
INSERT INTO users (id, first_name, last_name)
VALUES (DEFAULT, 'Thomas', 'Carcetti')
RETURNING id INTO inserted_user_id;
INSERT INTO user_emails (user_id, email)
VALUES
(inserted_user_id, 'thomas.carcetti@vaticle.com'),
(inserted_user_id, 'thomas@vaticle.com'),
(inserted_user_id, 'tommy@vaticle.com');
END $$;
Inserting a user now requires multiple queries enclosed in a transaction, and retrieving all a user’s data requires joining the two tables on the foreign key. While significantly safer, this implementation has a heavy cost: a single conceptual user is now represented as rows within multiple tables. Essentially, the notion of object identity has been violated. The mapping of application objects representing users to the rows that represent them in the database is no longer a one-to-one mapping, and a mismatch between the object and relational models is created.
Tuples are not Objects
In this model, a user is a multidimensional data structure: the email list can be thought of as a tuple within the row, which is itself a tuple. The unnormalized and normalized implementations are equivalent representations of the same two-dimensional structure. In the general case, object-relational mismatch arises from the inability of tuples to store multi-dimensional objects.
The multidimensionality of object models enables the expression of powerful concepts like abstraction, inheritance, and polymorphism, all indispensable tools in modern programming that allow the natural representation of complex data structures. Multivalued attributes are just one example of a model feature that results in mismatch, with the normalized representations of one-to-many relations, many-to-many relations, inheritance hierarchies, and interface structures also leading to mismatches.
Object-document mismatch
The document database was the first NoSQL database paradigm that saw widespread adoption for general-purpose applications, with the release of MongoDB in 2009. Like most NoSQL databases, document databases came about in response to the rapidly growing prevalence of big data. For such data, the inflexibility of the schema in handling disparate data from different sources leads to table bloat and poor performance.
Document databases were designed to solve this problem by eliminating the schema and storing multi-dimensional data structures without the normalization necessary in relational databases. This allows the ingestion of semi-structured data and the storage of dependent data instances together in nested structures, which give enormous improvements to ease of modeling and query performance when working with big data. However, this comes at the cost of not being able to rely on the schema to enforce structure or integrity, and so relational databases remained prevalent where these concerns were a priority.
Basics of document modeling
In document databases, data is encoded in a serializable hierarchical format such as JSON or XML, with each level in the hierarchy comprising a document. Documents are stored in collections and contain named fields mapped to either primitive data types or embedded sub-documents. The same applies to the embedded documents themselves, allowing documents to easily represent multidimensional data structures that are purely hierarchical, leading to the convenient elimination of object model mismatch in many cases, including that of multivalued attributes. Document databases are semi-structured: there are no mandated constraints on document structures, which may be modified freely as needed.
db.users.insert( [
{
"first_name": "Thomas",
"last_name": "Carcetti",
"emails": [
"thomas.carcetti@vaticle.com",
"thomas@vaticle.com",
"tommy@vaticle.com"
]
},
{
"first_name": "Rhonda",
"last_name": "Pearlman",
"emails": [
"rhonda.pearlman@vaticle.com",
"rhonda@vaticle.com"
]
}
] )
Consequences of lacking structure
The embedding of dependent data allows for highly efficient insertion and retrieval in comparison to relational databases, as all data nested in a document is stored as a single block on disk. However, the hierarchical data model becomes problematic when dealing with data that is not strictly hierarchical. In the following example, two documents are inserted into the projects
collection, within which are embedded the previous documents from the users
collection in the members
list field.
db.projects.insert( [
{
"name": "TypeDB 3.0",
"members": [
{
"first_name": "Thomas",
"last_name": "Carcetti",
"emails": [
"thomas.carcetti@vaticle.com",
"thomas@vaticle.com",
"tommy@vaticle.com"
]
},
{
"first_name": "Rhonda",
"last_name": "Pearlman",
"emails": [
"rhonda.pearlman@vaticle.com",
"rhonda@vaticle.com"
]
}
]
},
{
"name": "TypeDB Cloud beta",
"members": [
{
"first_name": "Rhonda",
"last_name": "Pearlman",
"emails": [
"rhonda.pearlman@vaticle.com",
"rhonda@vaticle.com"
]
}
]
}
] )
This highlights a problem: the user documents must be stored multiple times within the projects
collection, and once more in the users
collection for direct querying, leading to data duplication. The representation of user objects as multiple documents violates object identity, and is a form of object-document mismatch. This is the result of a many-to-many relation between users and projects. While documents databases are able to eliminate mismatches for multivalued attributes and one-to-many relations, mismatches still occur for many-to-many relations due to the inability to represent them hierarchically.
To avoid this, it is possible to reference existing documents by an ID rather than embedding duplicates. When querying references, a lookup must then be performed from the reference value onto the collection containing the referenced data. This process is significantly less performant than a join in a relational database because document databases cannot capitalize on the schema to optimize the process. As a result, it is possible to avoid object-document mismatch for many-to-many relations, but this comes with a severe performance cost and is typically not viable in production.
Object-graph mismatch
Modern graph databases evolved from an earlier sub-category of navigational databases known as network databases. The idea had existed since the early 1990s but did not become widely popularised until the first commercially viable versions were released in the mid-2000s, notably Neo4j in 2007, the most popular graph database today. Graph databases are based on graph theory and implemented as labeled property graphs (LPGs), with inspiration drawn from Chen’s entity-relationship (ER) model (Chen, 1976). The structure of LPG data is very similar to that of the earlier triplestore paradigm, implementing the RDF standard. Triplestore databases are sometimes also referred to as graph databases, but in this article the term graph is used exclusively in reference to LPGs. One of the key differences between them will be described later in this article.
Like document databases, graph databases are schemaless and designed for semi-structured data. However, unlike them, graph databases are highly optimized to store and traverse non-hierarchical data involving multidimensional connections between data instances by storing them as pointers. Compared to document databases, traversing between dependent data instances on disk incurs a non-negligible performance cost, but this enables the storage of highly interconnected data without duplication or performance-intensive lookups.
Basics of graph modeling
In a graph database, data is stored as nodes (representing entities) and relationships between them, equivalent to the vertices and edges of a graph respectively. Both nodes and relationships can be assigned labels, which represent the categories they fall into, and properties, which store data about them.
Graph databases | Graph-theoretic algebra | ER modeling |
---|---|---|
Node | Vertex | Entity |
Relationship | Directed edge | Relationship |
Label | Label | Type |
Property | Label | Attribute |
Relationships are implemented as directed edges, equivalent to the triplestore model, with the start-node, edge, and end-node representing the subject, predicate, and object, respectively. This allows for most semantic relationships to be naturally modeled in graph databases, and allows graph databases to easily represent highly interconnected data.
A competitive edge
The most important differentiator between graph and triplestore databases is the introduction of properties, which gives nodes and relationships attribute stores as internal structures. In contrast, attributes in triplestore databases are modeled by creating a node representing the attribute and linking it to the associated node with an edge. Attributes of relationships cannot easily be modeled in triplestore databases, as edges cannot be directly connected to other edges. In order to store them, the relationship edge must be replaced by a node, which is linked to the original edge’s two start- and end-nodes by two new edges..
The process of replacing a single edge with a node and pair of edges is known as reification, and results in object model mismatch. Because the relationship is represented by multiple graph elements, object identity is violated. The inability to model attributes of relationships without reifying the relationships is a significant shortcoming of the triplestore model. By implementing the LPG model, graph databases avoid reification of this kind. This is demonstrated in the following example, where the property temporary
is assigned to a MEMBER_OF
relationship without having to reify it.
CREATE
(thomas:User {
first_name: "Thomas",
last_name: "Carcetti",
emails: [ "thomas.carcetti@vaticle.com", "thomas@vaticle.com", "tommy@vaticle.com" ]
}),
(rhonda:User {
first_name: "Rhonda",
last_name: "Pearlman",
emails: [ "rhonda.pearlman@vaticle.com", "rhonda@vaticle.com" ]
}),
(typedbThree:Project { name: "TypeDB 3.0" }),
(cloudBeta:Project { name: "TypeDB Cloud beta" }),
(thomas)-[:MEMBER_OF]->(typedbThree),
(rhonda)-[:MEMBER_OF { temporary: true}]->(typedbThree),
(rhonda)-[:MEMBER_OF]->(cloudBeta)
Limitations of graphs
The design of graph databases is intended to avoid reification, but there are cases where reification is still necessary in order to express certain kinds of relationships. As relationships are implemented as directed edges, they can only model relationships that are binary (between exactly two nodes) and directed (with one start-node and one end-node). To express other kinds of relationships, such as n-ary ones (between any number of nodes), nested ones (where another relationship is an endpoint), and undirected ones (in which neither node is the start or end), relationships must be reified, leading to object-graph mismatch. In this next example, the MEMBER_OF
relationship from the previous example has been replaced with a Membership
node in order to model it as a nested relationship, the endpoint of another ASSIGNED
relationship.
MATCH
(cedric:Admin)
WHERE
"cedric.daniels@vaticle.com" in cedric.emails
CREATE
(thomas:User {
first_name: "Thomas",
last_name: "Carcetti",
emails: [ "thomas.carcetti@vaticle.com", "thomas@vaticle.com", "tommy@vaticle.com" ]
}),
(rhonda:User {
first_name: "Rhonda",
last_name: "Pearlman",
emails: [ "rhonda.pearlman@vaticle.com", "rhonda@vaticle.com" ]
}),
(typedb_three:Project { name: "TypeDB 3.0" }),
(cloud_beta:Project { name: "TypeDB Cloud beta" }),
(thomas)-[:MEMBER_OF]->(typedb_three),
(rhonda)-[:HAS]->(membership:Membership { temporary: true })-[:OF]->(typedb_three),
(cedric)-[:ASSIGNED]->(membership),
(rhonda)-[:MEMBER_OF]->(cloud_beta)
This is a fundamental issue with using graphs for conceptual data modeling: in most data domains, not all relationships can be expressed as directed binary edges. Reification allows these more complex relationship types to be implemented, but at the cost of introducing object-graph mismatch to the model.
Summary
Object model mismatches are apparent in relational, document, and graph paradigms, even with the fairly simple models used in the examples shown. While they are not too difficult to deal with at these scales, they become unmanageable as data models approach a realistic level of complexity, especially when there is polymorphism present in the model. The key reason they occur is the division between the conceptual and logical data models. While logical data models differ across databases paradigms, there are always conceptual data structures that cannot be expressed with them. For this not to be the case, the logical model would need to be identical to the conceptual model. A database that directly implements a conceptual model in this way can eliminate object model mismatch.