TypeDB Blog


Inheritance and polymorphism: where the cracks in SQL begin to show

Dr. James Whiteside


Polymorphism can be one of the most challenging things to model in a database. Not only do you want your database to be performant, you also want it to feel intuitive so that queries you write make sense and are simple to maintain, and polymorphic data can easily mess up that balance if you don’t take the time to consider how best to implement it. As polymorphism is a typical characteristic of any but the most simple data sets, every database necessarily has approaches to modeling it, and some are easier to use than others. Today we’re going to explore the different ways that relational databases handle a polymorphic dataset in the context of an identity and access management system, then compare them to TypeDB’s inference.

What is polymorphism?

Polymorphism in a data model is when different instances of the same type of data have different properties. Let’s take a look at a few examples, starting with a document listing employees:

[
 {
   "first name": "Luke",
   "last name": "Skywalker",
   "employment type": "permanent",
   "start date": "2020-09-25",
   "end date": null,
   "salary": 65000.00
 },
 {
   "first name": "Han",
   "last name": "Solo",
   "employment type": "contractor",
   "agency": "Hutt Holdings PLC",
   "hourly rate": 50.00
 }
]

In this example, we’ve got three attributes first name, last name, and employment type that are common to all employees, along with three attributes start date, end date, and salary that are only applicable to permanent employees and two agency and hourly rate that are only applicable to contractors. This is an example of attribute polymorphism, when you have entities of the same type that have different attributes. The other type of polymorphism is role player polymorphism. To demonstrate this, we’re going to look at a document listing memberships of business units:

[
 {
   "business unit": "Engineering",
   "member": "Leia Organa"
 },
 {
   "business unit": "Engineering",
   "member": "Cloud"
 }
]

Here we’ve got two different types of members of the Engineering team: people like Leia, and business sub-units like the Cloud team. In this case we have two entities of different types that are playing the same member role in the business unit membership relationship.

Why is polymorphism a challenge when designing a database?

Polymorphism makes allowed data patterns difficult to define in relational databases. The first versions of relational database theory and SQL were conceived in the early 1970s, when procedural programming languages dominated the tech industry. Object oriented programming did not become widespread until the mid 1980s, and polymorphism is a feature of data arising from the object model. As a result, SQL was not built to handle inheritance and polymorphism so, while these paradigms can be modeled, the result is ugly, difficult to work with, and easily broken. NoSQL databases don’t fare much better. While the lack of schema means that polymorphism is natural and easy to implement, it is a double-edged sword, as any and all constraints have to be implemented in the application, otherwise a user could write any nonsensical data to the database that they want.

How does SQL handle attribute polymorphism?

Let’s take a closer look at the attribute polymorphism example from earlier. How do you go about implementing it in a SQL schema while maintaining referential integrity? We’re going to cover three design patterns commonly found in SQL literature: single table inheritance, concrete table inheritance, and class table inheritance. They are sometimes alternatively known as table per hierarchy, table per concrete class, and table per type respectively. For these examples, we’ll be using PostgreSQL syntax.

Single table inheritance

CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  first_name VARCHAR(256) NOT NULL,
  last_name VARCHAR(256) NOT NULL,
  employment_type VARCHAR(256) NOT NULL,
  start_date DATE,
  end_date DATE,
  agency_id INT REFERENCES agency(id),
  salary MONEY,
  hourly_rate MONEY
);

In this design pattern, we create a single table for all our employees, including columns for all of the attributes they might have. An employment_type column indicates the type of employee (“permanent” or “contractor”). When we add a new row to this table, any non-applicable columns are filled with nulls. This approach is easy to build and query as there are no joins, unions, or foreign keys to worry about, plus the lack of joins makes queries highly performant, but this pattern quickly runs into integrity problems and fails to scale well.

Adding a new type of employee type, say “part time”, means the table has to be modified in place to add extra columns (which is always a pain), and as the number of employee types grows, the table becomes massively bloated. Worse still, the majority of values in each row will be nulls, which can take up extra space, and queries become bottlenecked because they’re all accessing the same table. There’s also no metadata describing the business logic of the columns, so it would be possible, for instance, to create an employee with both a salary and an hourly rate, or a permanent employee with no start date as non-null constraints that apply to only one employee type can’t be enforced. Overall, the lack of data integrity constraints make this pattern clearly unsuitable for a production environment.


Pros:

  • Simple to build and manage.
  • Easy to query.
  • Highly performant at small scales due to lack of joins.

Cons:

  • Adding new subtypes requires modifying the table with data already in it.
  • Sparse data can take up a lot of extra space.
  • The table can become very large, which hurts performance and bottlenecks queries.
  • No metadata defines which attributes can and can’t be owned by each entity.
  • Non-null constraints cannot be enforced.

Concrete table inheritance

CREATE TABLE permanent_employee (
  id INT PRIMARY KEY,
  first_name VARCHAR(256) NOT NULL,
  last_name VARCHAR(256) NOT NULL,
  start_date DATE NOT NULL,
  end_date DATE DEFAULT NULL,
  salary MONEY NOT NULL
);

CREATE TABLE contractor (
  id INT PRIMARY KEY,
  first_name VARCHAR(256) NOT NULL,
  last_name VARCHAR(256) NOT NULL,
  agency_id INT REFERENCES agency(id) NOT NULL,
  hourly_rate MONEY NOT NULL
);

Let’s try another approach. In this design pattern, we create separate tables for each of our employee types, with each table using only the columns it needs. Adding a new employee is as simple as adding it to the appropriate table, and we no longer need a column for employment type as that information is captured in the table name. Several of the problems with the Single Table pattern have been solved: attributes can no longer be owned by the wrong kinds of employees, we can use non-null constraints to enforce our business logic, adding a new type of employee just means we need to add a new table without any need to modify existing ones, we still never need joins in our queries, and the access load is spread evenly across the different tables.

This looks pretty good, right? Wrong: all we’ve done is trade one set of problems for another. The id field has had to be changed from a self-incrementing serial to a user-provided integer, otherwise we would get different types of employees with the same IDs. This means we now have to implement a stateful key allocation system at the application level. There’s also no record of which columns are common to all employees, or even which tables are tables of employees, and if we want to change one of these common columns it has to be modified in every table. Finally, if we want to search through all employees we have to use unions in our queries, which means that any time we add a new employee type we have to go back and add the new table to all the existing queries. So while this pattern has solved most of the data integrity problems, it’s a maintenance nightmare. Surely we can do better than this?


Pros:

  • Adding new subtypes only requires creating new tables for them.
  • No type column is necessary.
  • Attributes cannot be owned by invalid entities.
  • Mandatory attributes can have non-null constraints applied.
  • Joins are never needed.
  • Access load is spread evenly across tables.

Cons:

  • Does not enforce key uniqueness across all subtype tables, requiring a stateful key allocation system.
  • Common and subtype-specific attributes cannot be distinguished.
  • Table structure becomes repetitive due to duplication of common columns.
  • Changing a common field requires modifying it in every table.
  • Querying over all subtypes requires unions, and these queries break when new subtypes are added.

Class table inheritance

CREATE TABLE employee (
  id SERIAL PRIMARY KEY,
  first_name VARCHAR(256),
  last_name VARCHAR(256)
);

CREATE TABLE permanent_employee (
  id INT PRIMARY KEY REFERENCES employee(id),
  start_date DATE,
  end_date DATE,
  salary MONEY
);

CREATE TABLE contractor (
  id INT PRIMARY KEY REFERENCES employee(id),
  agency_id INT REFERENCES agency(id),
  hourly_rate MONEY
);

Third time’s the charm. In this design pattern, we create one top-level employee table that holds all the data common to all employees, and one more table for each type of employee we have. With this approach, we get the best of both worlds. We can apply our non-null constraints, we don’t need to modify existing tables to add new employee types, attributes can’t be erroneously owned by the wrong employees, we can distinguish common and type-specific attributes, we have no duplicated columns or sparse data, the top-level table enforces global ID uniqueness, we don’t need any unions to query all employees, and perhaps best of all: we have a direct one-to-one mapping between our database tables and our application classes. Surely this time we’ve found a solution.

But yet again, we run into issues. When we want to insert a new employee, it has to be inserted in two places. Same for updates and deletes. In fact, our data has become denormalized, and this makes it possible to insert an employee without a specific type or with more than one type if we don’t prevent it at the application level. If we want to query all of an employee’s data then we need to perform joins, but if we don’t know what kind of employee they are then we have to perform a full outer join for each subtype table, which is costly and breaks our queries if we add new employee subtypes. And once again, the access load will be concentrated on the top-level table so our queries will start to bottleneck as the data grows. This pattern is somewhere in the middle of the previous two: we get the best of both worlds, but we also get the worst of both.


Pros:

  • Mandatory attributes can have non-null constraints applied.
  • Adding new subtypes only requires creating new tables for them.
  • Attributes cannot be owned by invalid entities.
  • Common and subtype-specific attributes can be distinguished.
  • No type column is necessary.
  • No sparse data or duplicated columns.
  • Top level table enforces global ID uniqueness.
  • No unions are needed to query over all subtypes.
  • There is a direct mapping from database tables to application classes.

Cons:

  • Multiple tables must be modified when writing data, and integrity must be manually enforced.
  • There is no way to enforce the number of subtypes an entity should have.
  • Joins are required to retrieve subtype-specific attributes.
  • Querying over all subtypes requires performance-heavy full outer joins, and these queries break when new subtypes are added.
  • Supertype tables can get bottlenecked due to frequent access.

How does SQL handle role player polymorphism?

SQL’s approach to role player polymorphism directly mirrors its approach to attribute polymorphism, which is the natural result of the same data structures, tables, being used to model both entities and relationships. Let’s take a look at the role player polymorphism example from earlier.

Single table inheritance

CREATE TABLE unit_membership (
  business_unit_id INT REFERENCES business_unit(id) NOT NULL,
  member_employee_id INT REFERENCES employee(id) DEFAULT NULL,
  member_unit_id INT REFERENCES business_unit(id) DEFAULT NULL
);

Here we simply add foreign keys to the tables for both kinds of role player (employee and business_unit). Let’s quietly overlook the fact that employees might be spread over more than one table depending on what design pattern we chose for them in the previous section. It’s already getting complicated enough. Just like with the attribute polymorphism, this is the quick and dirty solution. There’s no way to enforce that each row should have exactly one member ID, moving business logic into our application. In fact, it’s not even possible to define a primary key because of the null values, without resorting to a meaningless row ID.

Concrete table inheritance

CREATE TABLE unit_membership (
  business_unit_id INT REFERENCES business_unit(id),
  member_table VARCHAR(256),
  member_id INT,
  PRIMARY KEY (business_unit_id, member_table, member_id)
);

This one’s a bit better. By using two attributes to point to the member role player (member_table and member_id) we can enforce that each row refers to exactly one member, and our primary key is well defined. But wait, where’s the foreign key? Well there isn’t one. Because a foreign key column can reference only a single column in a single other table, it’s impossible to enforce a foreign key constraint on the member ID, which again moves business logic into the application.

Class table inheritance

CREATE TABLE unit_membership (
  business_unit_id INT REFERENCES business_unit(id),
  member_id INT REFERENCES subject(id),
  PRIMARY KEY (business_unit_id, member_id)
);

CREATE TABLE subject (
  id SERIAL PRIMARY KEY
);

CREATE TABLE employee (
  id INT PRIMARY KEY REFERENCES subject(id),
  ...
  ...
);

CREATE TABLE business_unit (
  id INT PRIMARY KEY REFERENCES subject(id),
  ...
  ...
);

Okay now we’re getting somewhere. By creating an abstract subject table using the class table pattern, our unit_membership table is now in an ideal state: just two foreign keys that together make up the primary key. But all we’ve accomplished with this is shifting the complexity elsewhere. We need multiple full outer joins to get the member’s data, and we can’t guarantee that each subject has exactly one reference among the subtype tables which, you guessed it, moves the business logic into the application.

Drop SQL (but keep the schema)

All three of these design patterns have resulted in business logic being moved from the database to the application. Frameworks like Ruby on Rails and Hibernate exist that can implement polymorphism in SQL by handling the complex constraints themselves, but wouldn’t it be great if we could capture all of the logic within the database? All the information necessary to enforce it is contained within the data, so it should be possible. What’s missing?

SQL has had its time. In an era of procedural programming, it revolutionized the database industry and set out new data storage and querying techniques that have been the standard for decades, but as object-oriented programming and strong typing have become the norms, it has quickly been made apparent that SQL is not expressive or flexible enough to handle the complexities of modern data. Some RDBMSs have implemented simple inheritance functionality, but these implementations are flakey at best because they lack the robustness of something built from the ground up.

The question is then, what to use instead? The NoSQL movement has gained enormous traction in recent years, with many household names emerging (you know who I’m talking about), but forgoing a schema can often create more problems than it solves. The schema itself is not the problem, and removing it is not the answer. In fact, the schema is the solution. We just need a schema that’s expressive enough to handle polymorphic data elegantly and cleanly, and a query language that is built with object-oriented programming foremost in mind.

How does TypeDB handle polymorphism?

Being a strongly typed database, TypeDB is built to handle polymorphism and inheritance from the get-go. In the schema, you declare types and how they are allowed to interact with each other, and then insert data by instantiating those types. Let’s see it in action.

define

employee sub entity,
  owns id @key,
  owns first-name,
  owns last-name,
  plays business-unit-membership:member;

permanent-employee sub employee,
  owns start-date,
  owns end-date,
  owns salary;

contractor sub employee,
  owns hourly-rate,
  plays agency-membership:member;

business-unit sub entity,
  owns id @key,
  owns name,
  plays business-unit-membership:unit,
  plays business-unit-membership:member;

agency sub entity,
  owns id @key,
  owns name,
  plays agency-membership:agency;

business-unit-membership sub relation,
  relates unit,
  relates member;

agency-membership sub relation,
  relates agency,
  relates member;

id sub attribute, value long;
first-name sub attribute, value string;
last-name sub attribute, value string;
start-date sub attribute, value datetime;
end-date sub attribute, value datetime;
pay sub attribute, abstract, value double;
salary sub pay, value double;
hourly-rate sub pay, value double;
name sub attribute, value string;

Let’s break this down. We’ve created an employee type and two of its subtypes: permanent-employee and contractor. Both permanent employees and contractors inherit all the properties of employees thanks to TypeDB’s type inheritance, so they all have first and last names, and can be members of business units, but only permanent employees can have salaries and only contractors can have hourly rates, amongst other things. As for the business unit memberships, both employees and business sub-units can play the role of member in business unit memberships thanks to TypeDB’s in-built polymorphism. Now let’s try writing queries to insert our example data:

insert
  $luke isa permanent-employee,
    has first-name "Luke",
    has last-name "Skywalker",
    has start-date 2020-09-25,
    has salary 65000.00;
match
  $hutt isa agency, has name "Hutt Holdings PLC";
insert
  $han isa contractor,
    has first-name "Han",
    has last-name "Solo",
    has hourly-rate 50.00;
  (agency: $hutt, member: $han) isa agency-membership;

In the first insert query, we insert the permanent employee Luke, and in the second match-insert query, we insert the contractor Han and the relation that gives him membership of the agency Hutt Holdings (which already exists in the database). Both Luke and Han have first and last names, and we don’t even have to mention that they’re also employees. If we query our employees with:

match
  $employee isa employee,
    has first-name $first,
    has last-name $last,
    has $pay-type $pay;
  $pay-type sub pay;

then we’ll get both Luke and Han back automatically thanks to TypeDB’s type inference. Not only that, we’ll also get back the amount and type (“salary” or “hourly-rate”) of their pay. That’s pretty elegant. What about the business unit memberships?

match
  $leia isa employee, has first-name "Leia", has last-name "Organa";
  $engineering isa business-unit, has name "Engineering";
insert
  (unit: $engineering, member: $leia) isa business-unit-membership;
match
  $cloud isa business-unit, has name "Cloud";
  $engineering isa business-unit, has name "Engineering";
insert
  (unit: $engineering, member: $cloud) isa business-unit-membership;

In the first query, we make Leia a member of the Engineering team, and in the second, we make the Cloud team a sub-unit of the Engineering team. If we look closely at the queries, their format is exactly the same, because we’ve defined that both employees and business units can play the member role. No nulls, no joins, no unions, no foreign keys, no lack of referential integrity, no missing constraints. All business logic is implemented through the schema, and our application can focus on business processes while TypeDB does all of the legwork.

Concluding remarks

SQL can no longer meet the needs of modern data. Having tried three different enterprise-standard design patterns for implementing polymorphism, each attempt ended up causing loss of constraints and/or referential integrity, opening the door for broken data states to make their way into our database. In order to counter this, we have to move business logic into our applications, which slows them down and places the obligation of maintaining clean data on developers rather than the database architect.

With TypeDB, there’s only one way to implement polymorphism because it works. TypeDB is a cutting-edge database designed specifically to handle these difficult data needs that SQL simply can’t keep up with, no matter how many times the language standard gets updated. No need to weigh up different design patterns to try and optimize for your requirements, just write your schema and queries in TypeQL’s near-natural language and let the database do all the work for you, with a smooth and elegant type system backed up by powerful type inference capabilities. And that’s without even talking about the even more powerful rule inference. We’ll leave that for another day.

Share this article

TypeDB Newsletter

Stay up to date with the latest TypeDB announcements and events.

Subscribe to Newsletter
Feedback