Lesson 9.3: Avoiding data redundancies
Identifying redundancies
In the previous lesson, we created a first iteration of a data model for book data. In this lesson, we’ll see how we can eliminate data redundancies generated by the model.
Data table
ISBN-13 | ISBN-10 | Title | Format | Authors | Editors | Illustrators | Other contributors | Publisher | Year | City | State | Country | Page count | Genres | Price | Stock |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9780008627843 |
0008627843 |
The Hobbit |
ebook |
J.R.R. Tolkien |
J.R.R. Tolkien |
Harper Collins |
2023 |
New York City |
New York |
United States |
310 |
fantasy;fiction |
16.99 |
|||
9780060929794 |
0060929790 |
One Hundred Years of Solitude |
paperback |
Garcia Marquez, Gabriel |
Perennial |
1998 |
New York City |
New York |
United States |
458 |
fiction;historical fiction |
6.12 |
4 |
|||
9780195153446 |
0195153448 |
Classical Mythology |
paperback |
Lenardon, Robert J.;Morford, Mark P. O. |
Oxford University Press |
2002 |
New York City |
New York |
United States |
820 |
history;nonfiction |
34.98 |
12 |
|||
9780375801679 |
0375801677 |
The Iron Giant |
ebook |
Hughes, Ted |
Davidson, Andrew |
Knopf Books for Young Readers |
1999 |
New York City |
New York |
United States |
79 |
fiction;children’s fiction |
33.97 |
|||
9780387881355 |
0387881352 |
Electron Backscatter Diffraction in Materials Science |
hardback |
Schwartz, Adam J.;Kumar, Mukul;Adams, Brent L.;Field, David P. |
Springer |
2009 |
New York City |
New York |
United States |
425 |
nonfiction;technology |
230.37 |
9 |
|||
9780393045215 |
0393045218 |
The Mummies of Urumchi |
paperback |
Barber, Elizabeth Wayland |
W.W. Norton & Company |
1999 |
New York City |
New York |
United States |
240 |
history;nonfiction |
21.6 |
1 |
|||
9780393634563 |
0393634566 |
The Odyssey |
ebook |
Homer |
Wilson, Emily |
W.W. Norton & Company |
2017 |
New York City |
New York |
United States |
656 |
fiction;classics |
13.99 |
|||
9780446310789 |
0446310786 |
To Kill a Mockingbird |
paperback |
Harper Lee |
Grand Central Publishing |
1988 |
New York City |
New York |
United States |
281 |
fiction;historical fiction |
21.64 |
16 |
|||
9780451162076 |
0451162072 |
Pet Sematary |
paperback |
King, Stephen |
Signet |
1984 |
New York City |
New York |
United States |
374 |
horror;fiction |
93.22 |
1 |
|||
9780500026557 |
0500026556 |
Hokusai’s Fuji |
paperback |
Wada, Kyoko |
Katsushika, Hokusai |
Thames & Hudson |
2024 |
London |
United Kingdom |
416 |
art;nonfiction |
24.47 |
11 |
|||
9780500291221 |
0500291225 |
Great Discoveries in Medicine |
paperback |
Bynum, William;Bynum, Helen |
Thames & Hudson |
2023 |
London |
United Kingdom |
352 |
history;nonfiction |
12.05 |
18 |
||||
9780553212150 |
055321215X |
Pride and Prejudice |
paperback |
Austen, Jane |
Bantam Classics |
1983 |
New York City |
New York |
United States |
295 |
fiction;historical fiction |
17.99 |
15 |
|||
9780575104419 |
0575104414 |
Dune |
ebook |
Herbert, Frank |
Hachette Book Group |
2010 |
New York City |
New York |
United States |
624 |
fiction;science fiction |
5.49 |
||||
9780671461492 |
0671461494 |
The Hitchhiker’s Guide to the Galaxy |
paperback |
Adams, Douglas |
1982 |
New York City |
New York |
United States |
215 |
fiction;science fiction |
91.47 |
9 |
||||
9780679425601 |
0679425608 |
Under the Black Flag: The Romance and the Reality of Life Among the Pirates |
hardback |
Cordingly, David |
Random House |
1996 |
New York City |
New York |
United States |
296 |
history;nonfiction |
34.73 |
13 |
|||
9780740748479 |
0740748475 |
The Complete Calvin and Hobbes |
hardback |
Watterson, Bill |
Watterson, Bill |
Andrews McMeel Publishing |
2005 |
Kansas City |
Missouri |
United States |
1451 |
comics;fiction |
128.71 |
6 |
||
9781098108274 |
1098108272 |
Fundamentals of Data Engineering |
ebook |
Reis, Joe;Housley, Matt |
O’Reilly Media |
2022 |
Sevastopol |
California |
United States |
450 |
nonfiction;technology;children’s fiction |
47.99 |
||||
9781489962287 |
148996228X |
Interpretation of Electron Diffraction Patterns |
paperback |
Keown, Samuel Robert;Andrews, Kenneth William;Dyson, David John |
Springer |
1967 |
New York City |
New York |
United States |
199 |
nonfiction;technology |
47.17 |
15 |
|||
9781859840665 |
1859840663 |
The Motorcycle Diaries: A Journey Around South America |
paperback |
Guevara, Ernesto |
Wright, Ann |
Verso |
1996 |
London |
United Kingdom |
160 |
biography;nonfiction |
14.52 |
4 |
|||
9783319398778 |
Physical Principles of Electron Microscopy: An Introduction to TEM, SEM, and AEM |
ebook |
Egerton, R.F. |
Springer |
2016 |
London |
United Kingdom |
196 |
nonfiction;technology |
19.5 |
||||||
9798691153570 |
Business Secrets of The Pharoahs |
paperback |
Crorigan, Mark |
British London Publishing |
2020 |
London |
United Kingdom |
260 |
business;nonfiction |
11.99 |
8 |
Lesson 9.2 schema
define
book sub entity,
owns isbn-13,
owns isbn-10,
owns title,
owns format,
owns page-count,
owns genre,
owns price,
owns stock,
plays contribution:work,
plays publishing:published;
author sub entity,
owns name,
plays contribution:contributor;
editor sub entity,
owns name,
plays contribution:contributor;
illustrator sub entity,
owns name,
plays contribution:contributor;
other-contributor sub entity,
owns name,
plays contribution:contributor;
publisher sub entity,
owns name,
plays publishing:publisher;
city sub entity,
owns name,
plays publishing:location;
state sub entity,
owns name,
plays publishing:location;
country sub entity,
owns name,
plays publishing:location;
contribution sub relation,
relates contributor,
relates work;
publishing sub relation,
relates publisher,
relates published,
relates location,
owns year;
isbn-13 sub attribute, value string;
isbn-10 sub attribute, value string;
title sub attribute, value string;
format sub attribute, value string;
page-count sub attribute, value long;
genre sub attribute, value string;
price sub attribute, value double;
stock sub attribute, value long;
name sub attribute, value string;
year sub attribute, value long;
Currently, when recording the location in which a book was published, we must record the city, state, and country.
match
$book isa book, has isbn-13 "9780575104419";
$publisher isa publisher, has name "Hachette Book Group";
$city isa city, has name "New York City";
$state isa state, has name "New York";
$country isa country, has name "United States";
insert
(
publisher: $publisher,
published: $book,
location: $city,
location: $state,
location: $country
) isa publishing, has year 2010;
However, there is a functional dependency of state on city, as knowing the city (by identity, not just by name) in which a book is published is sufficient to determine the state in which it was published. Similarly, there is a functional dependency of country on state, and thus country on city, as functional dependencies are transitive. Simply recording the city should be sufficient, as the state and country should then be inferrable. In fact, we saw this inference performed explicitly in Lesson 3.3.
This highlights a problem with the data model. The entity types state
and country
have functional dependencies on the entity type city
. Meanwhile, city
has a functional dependency on book
via the publishing
relation type, in which they both play roles. Thus, state
and country
both have transitive function dependencies on book
. However, state
and country
also have direct functional dependencies on book
via publishing
. This means that state
and country
have duplicate dependencies on book
. When dependencies are duplicated, this creates redundancy in the data model, which can lead to data inconsistencies.
Eliminating redundancies
In order to prevent these inconsistencies, we must refactor the model in such a way that ensures there are no duplicate dependencies. In a relational database, this process is known as lossless join decomposition. To do this, we must first explicitly record the dependencies of country
on state
and of state
on city
, which are currently not captured by the data model. Then, we must remove the dependencies of country
and state
on book
. The end result will be:
-
city
depends onbook
. -
state
depends oncity
. -
country
depends onstate
.
city
, state
, and country
will then each have only a single functional dependency on book
, preventing data redundancies and inconsistencies. To be able to record the new dependencies, we’ll introduce a new locating
relation type.
define
locating sub relation,
relates located,
relates location;
city plays locating:located;
state plays locating:location,
plays locating:located;
country plays locating:location;
Next, we’ll record the locations of cities using the new type.
match
$city isa city, has name "New York City";
$state isa state, has name "New York";
$country isa country, has name "United States";
insert
(located: $city, location: $state) isa locating;
(located: $state, location: $country) isa locating;
Now, whenever we record the location a book is published, we only need to store the city.
match
$book isa book, has isbn-13 "9780575104419";
$publisher isa publisher, has name "Hachette Book Group";
$city isa city, has name "New York City";
insert
(publisher: $publisher, published: $book, location: $city) isa publishing,
has year 2010;
If we wish to know the state or the country it was published, we can easily query this information via the location
relations.