Lesson 9.3: Avoiding data redundancies
Identifying redundancies
In the previous lesson, we created a first iteration of a data model for book data. In this lesson, we’ll see how we can eliminate data redundancies generated by the model.
Lesson 9.2 schema
define
entity book,
owns isbn-13,
owns isbn-10,
owns title,
owns format,
owns page-count,
owns genre,
owns price,
owns stock,
plays contribution:work,
plays publishing:published;
entity author,
owns name,
plays contribution:contributor;
entity editor,
owns name,
plays contribution:contributor;
entity illustrator,
owns name,
plays contribution:contributor;
entity other-contributor,
owns name,
plays contribution:contributor;
entity publisher,
owns name,
plays publishing:publisher;
entity city,
owns name,
plays publishing:location;
entity state,
owns name,
plays publishing:location;
entity country,
owns name,
plays publishing:location;
relation contribution,
relates contributor,
relates work;
relation publishing,
relates publisher,
relates published,
relates location,
owns year;
attribute isbn-13, value string;
attribute isbn-10, value string;
attribute title, value string;
attribute format, value string;
attribute page-count, value integer;
attribute genre, value string;
attribute price, value double;
attribute stock, value integer;
attribute name, value string;
attribute year, value integer;
Currently, when recording the location in which a book was published, we must record the city, state, and country.
match
$book isa book, has isbn-13 "9780575104419";
$publisher isa publisher, has name "Hachette Book Group";
$city isa city, has name "New York City";
$state isa state, has name "New York";
$country isa country, has name "United States";
insert
$rel isa publishing (
publisher: $publisher,
published: $book,
location: $city,
location: $state,
location: $country
), has year 2010;
However, there is a functional dependency of state on city, as knowing the city (by identity, not just by name) in which a book is published is sufficient to determine the state in which it was published. Similarly, there is a functional dependency of country on state, and thus country on city, as functional dependencies are transitive. Simply recording the city should be sufficient, as the state and country should then be inferrable using a function.
This highlights a problem with the data model. The entity types state
and country
have functional dependencies on the entity type city
. Meanwhile, city
has a functional dependency on book
via the publishing
relation type, in which they both play roles. Thus, state
and country
both have transitive function dependencies on book
. However, state
and country
also have direct functional dependencies on book
via publishing
. This means that state
and country
have duplicate dependencies on book
. When dependencies are duplicated, this creates redundancy in the data model, which can lead to data inconsistencies.
Eliminating redundancies
In order to prevent these inconsistencies, we must refactor the model in such a way that ensures there are no duplicate dependencies. In a relational database, this process is known as lossless join decomposition. To do this, we must first explicitly record the dependencies of country
on state
and of state
on city
, which are currently not captured by the data model. Then, we must remove the dependencies of country
and state
on book
. The end result will be:
-
city
depends onbook
. -
state
depends oncity
. -
country
depends onstate
.
city
, state
, and country
will then each have only a single functional dependency on book
, preventing data redundancies and inconsistencies. To be able to record the new dependencies, we’ll introduce a new locating
relation type.
define
relation locating,
relates located,
relates location;
entity city plays locating:located;
entity state plays locating:location,
plays locating:located;
entity country plays locating:location;
Next, we’ll record the locations of cities using the new type.
match
$city isa city, has name "New York City";
$state isa state, has name "New York";
$country isa country, has name "United States";
insert
locating (located: $city, location: $state);
locating (located: $state, location: $country);
Now, whenever we record the location a book is published, we only need to store the city.
match
$book isa book, has isbn-13 "9780575104419";
$publisher isa publisher, has name "Hachette Book Group";
$city isa city, has name "New York City";
insert
$rel isa publishing (publisher: $publisher, published: $book, location: $city),
has year 2010;
If we wish to know the state or the country it was published, we can easily query this information via the location
relations.