TypeDB Blog


Accelerating drug discovery with applied knowledge engineering and TypeDB

Dr. James Whiteside


Knowledge discovery involves a tremendous amount of heterogeneous data, which is difficult to integrate due to its complex nature and rich semantics. Understanding the relationships buried in the data is a key goal of many fields, but modeling, ingesting, and connecting the data is never straightforward. A knowledge model’s power to discover new facts grows with the amount and quality of the data it is given. The greatest insights come from connecting indirectly related information, which is too complex to do manually, and that is also where the greatest difficulty in processing the data lies.

With TypeDB, we can make use of strong typing to streamline data ingestion and automate the creation of semantic connections. In this blog post, we’ll look at how TypeDB can be used to model complex biological relationships and accelerate the drug discovery process as part of the open-source TypeDB Bio project. We’ll explore simplifying the ingestion of disparate data with subtyping and rule inference, identifying high-level relationships by abstracting over direct connections with deep inference chains, and then see how this enables us to find candidate drug targets for a sample disease.

Ingesting data

One of the biggest challenges when ingesting data from multiple sources is recognizing when entries in two different sources refer to the same thing, especially if those two sources use different identifiers. Let’s take a look at an example from the UniProt database of proteins:

Entry: Q9NPB9
Entry name: ACKR4_HUMAN
Status: reviewed
Protein names: Atypical chemokine receptor 4 (C-C chemokine receptor type 11) (C-C CKR-11) (CC-CKR-11) (CCR-11) (CC chemokine receptor-like 1) (CCRL1) (CCX CKR)
Gene names: ACKR4 CCBP2 CCR11 CCRL1 VSHK1
Organism: Homo sapiens (Human)
Length: 350
Ensembl transcript: ENST00000249887
Cross-reference (GeneID): 51554
Cross-reference (Reactome): R-HSA-380108
Cross-reference (DisGeNET): 51554
Cross-reference (CTD): 51554
Gene names (primary): ACKR4
Gene names (synonym): CCBP2 CCR11 CCRL1 VSHK1
Cross-reference (IntAct): Q9NPB9

As we can see, the protein Q9NPB9 has eight different names and a whole host of IDs. Some of these IDs reference the protein directly, while others can still be used to identify the protein but actually belong to related entities, like the gene ACKR4 that encodes it. We also see that some IDs belonging to other databases like Reactome and DisGeNET have been recorded to enable cross-compatibility. This is a big problem for data ingestion. In a relational database, we would be asking ourselves which ID to store, and how to clean our data on ingestion so that the same ID is used everywhere. We could just store all of them, but then we’d need to think about how we handle foreign keys, or how to query against the database using an ID of an unknown kind.

TypeDB’s type system handles all of this elegantly. We can create a top-level id attribute type and give it all the subtypes we need:

The top-level ID attribute and some of its subtypes.
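
As a rough sketch of what such a hierarchy could look like in a TypeDB 2.x schema, consider the definitions below. Only id, protein-id, uniprot-id, protein and gene appear in this post; gene-id, ensembl-gene-id and entrez-gene-id are hypothetical names chosen for illustration, and the real TypeDB Bio schema may be organized differently. The supertypes are marked abstract on the assumption that TypeDB 2.x only allows abstract attribute types to be subtyped.

```typeql
define

# Abstract root: every identifier in the database is an id.
id sub attribute, abstract, value string;

# Abstract intermediate types group identifiers by what they identify.
protein-id sub id, abstract;
gene-id sub id, abstract;        # hypothetical name

# Concrete leaf types, one per source database.
uniprot-id sub protein-id;
ensembl-gene-id sub gene-id;     # hypothetical name
entrez-gene-id sub gene-id;      # hypothetical name

# Entities own the concrete identifier types.
protein sub entity, owns uniprot-id;
gene sub entity, owns ensembl-gene-id, owns entrez-gene-id;
```

Adding identifiers from a new source database then only means adding one more leaf type, without touching any existing data or queries that match on the supertypes.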

Because TypeDB manages references to role players in relations (whether they’re entities, relations or attributes), we don’t have to worry about maintaining referential integrity like we would with a relational database. And thanks to inheritance, a uniprot-id is also a protein-id and thus an id too. This means that, if we insert a new protein into our database with the UniProt ID Q9NPB9:

insert $p isa protein, has uniprot-id "Q9NPB9";

then all of the following queries will return the protein we just inserted:

match $p isa protein, has uniprot-id "Q9NPB9";
match $p isa protein, has protein-id "Q9NPB9";
match $p isa protein, has id "Q9NPB9";

This means we can be as general or specific as we like when searching for proteins, or any other entities for that matter. We don’t even have to know what we’re dealing with to be able to find the matching entity:

match $x has id "Q9NPB9";

Let’s look at another example from the TissueNet database of protein-protein interactions:

ENSG00000162385 ENSG00000169045 Protein Protein PPI
ENSG00000010017 ENSG00000121022 Protein Protein PPI
ENSG00000117724 ENSG00000197321 Protein Protein PPI
ENSG00000150991 ENSG00000170638 Protein Protein PPI
ENSG00000072110 ENSG00000134982 Protein Protein PPI
ENSG00000134470 ENSG00000142867 Protein Protein PPI
ENSG00000113580 ENSG00000116560 Protein Protein PPI
ENSG00000132155 ENSG00000174775 Protein Protein PPI

Even though this is a list of protein-protein interactions, it doesn’t actually list the IDs of the proteins. The IDs starting with ENSG are the Ensembl IDs of the genes which encode the proteins. It wouldn’t be too difficult to sort this out. We can just write a script that will read out the mappings from gene ID to protein ID based on the gene-protein encodings, transform our dataset into two lists of protein IDs, and then insert the interaction relation between each pair of proteins. But that’s not how I want to spend my morning.

With TypeDB, we can directly insert a relation between each pair of genes, and then have the inference engine automatically map it onto proteins for us. We’ll start by inserting a relation between the genes:

match
  $g1 isa gene, has id "ENSG00000162385";
  $g2 isa gene, has id "ENSG00000169045";
insert
  (encoding-gene:$g1, encoding-gene:$g2) isa encoded-protein-interaction;

An encoded protein interaction between two genes.

Next we define a rule that describes the mapping for the inference engine:

define
rule inferred-protein-interaction:
  when {
    # Two genes that have been recorded as an interacting pair.
    $g1 isa gene;
    $g2 isa gene;
    (encoding-gene:$g1, encoding-gene:$g2) isa encoded-protein-interaction;
    # The proteins that those two genes encode.
    $p1 isa protein;
    $p2 isa protein;
    (encoding-gene:$g1, encoded-protein:$p1) isa protein-encoding;
    (encoding-gene:$g2, encoded-protein:$p2) isa protein-encoding;
  } then {
    (interacting-protein:$p1, interacting-protein:$p2) isa protein-protein-interaction;
  };
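
For the rule above to be accepted, the relation types and roles it mentions have to exist in the schema. The following is a minimal sketch of the definitions it would need, using only the type and role names that appear in the rule. The actual TypeDB Bio schema is likely to define more roles, attributes and supertypes than shown here.

```typeql
define

# Asserted directly from the source data: an interaction recorded between genes.
encoded-protein-interaction sub relation,
  relates encoding-gene;

# Links a gene to the protein it encodes.
protein-encoding sub relation,
  relates encoding-gene,
  relates encoded-protein;

# Never inserted by hand here: instances are produced by the rule.
protein-protein-interaction sub relation,
  relates interacting-protein;

gene sub entity,
  plays encoded-protein-interaction:encoding-gene,
  plays protein-encoding:encoding-gene;

protein sub entity,
  plays protein-encoding:encoded-protein,
  plays protein-protein-interaction:interacting-protein;
```

Roles are scoped to their relation type, so encoding-gene can be declared separately on both encoded-protein-interaction and protein-encoding.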

And now we can query the protein-protein interactions directly:

match
  $p1 isa protein, has id $id1;
  $p2 isa protein, has id $id2;
  ($p1, $p2) isa protein-protein-interaction;

An inferred protein-protein interaction between two proteins.

By using techniques like this, we can save ourselves a lot of time as we don’t have to transform the data, and we can be sure that there are no mistakes in the data loaded because we’ve recorded it in the database exactly as we found it at the source. The only mistake we can make is in writing the rule, but that couldn’t be easier to fix: because rules are resolved at query-time, they can be undefined and redefined at any time, even if the data they describe has already been inserted. All the queries will automatically show the new results when re-run, without having to modify them. No more having to reload all your data because of a mistake in the transformation pipeline.
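
As a sketch of that workflow in TypeDB 2.x, fixing a rule amounts to two schema queries, each run in a schema write transaction. The rule body below is simply the one from earlier in this post, standing in for whatever the corrected version would be.

```typeql
# Query 1: remove the faulty rule by its label. No data is touched.
undefine rule inferred-protein-interaction;

# Query 2: define the corrected rule under the same label.
define
rule inferred-protein-interaction:
  when {
    $g1 isa gene;
    $g2 isa gene;
    (encoding-gene:$g1, encoding-gene:$g2) isa encoded-protein-interaction;
    $p1 isa protein;
    $p2 isa protein;
    (encoding-gene:$g1, encoded-protein:$p1) isa protein-encoding;
    (encoding-gene:$g2, encoded-protein:$p2) isa protein-encoding;
  } then {
    (interacting-protein:$p1, interacting-protein:$p2) isa protein-protein-interaction;
  };
```

The inserted encoded-protein-interaction relations stay exactly as they were loaded. Only the inferred protein-protein-interaction results change the next time a query is run with inference enabled.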

Exposing buried connections

Now that the data has been ingested, we can start putting our model to work. Let’s say we want to get a list of proteins that are enhanced in the heart muscle. Conceptually it’s a simple query, just a chain of entities and relations, but it becomes lengthy if we don’t take advantage of TypeDB’s inference engine:

match
  $t isa tissue, has name "heart muscle";
  ($t, $c) isa tissue-composition;
  $c isa cell;
  ($c, $g) isa cell-expression;
  $g isa gene;
  ($g, $tr) isa transcription;
  $tr isa transcript;
  ($tr, $p) isa translation;
  $p isa protein, has uniprot-id $id;

An example result of proteins that are enhanced in the heart muscle tissue.

Another frustrating issue is that we can’t tell what the query actually does at face value. We can tell by reading through it that it retrieves a protein ID based on a tissue name, but we have to think about it to work out exactly what the results mean.

When we start using the inference engine, queries like this become a breeze. Let’s first define some rules to infer protein-encoding and tissue-expression relations:

define

# A gene encodes a protein if it is transcribed into a transcript
# that is in turn translated into that protein.
rule inferred-protein-encoding:
  when {
    (transcribed-gene:$g, synthesised-transcript:$t) isa transcription;
    (translated-transcript:$t, synthesised-protein:$p) isa translation;
  } then {
    (encoding-gene:$g, encoded-protein:$p) isa protein-encoding;
  };

# A tissue expresses a gene if one of the cells composing it expresses that gene.
rule inferred-tissue-expression:
  when {
    (composed-tissue:$t, composing-cell:$c) isa tissue-composition;
    (expressing-cell:$c, expressed-gene:$g) isa cell-expression;
  } then {
    (expressing-tissue:$t, expressed-gene:$g) isa tissue-expression;
  };

With these two rules defined, we can now directly query those two inferred relations:

match
  $t isa tissue, has name "heart muscle";
  ($t, $g) isa tissue-expression;
  $g isa gene;
  ($g, $p) isa protein-encoding;
  $p isa protein, has uniprot-id $id;

An example result of proteins that are enhanced in the heart muscle tissue.

We can use these new inferred relations in any queries we write. In fact, we already used the protein-encoding relation earlier on in the section on ingesting data. Let’s go one step further, and use our two new relations within another rule:

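define

# Chains the two inferred relations above: a protein is enhanced in a tissue
# if the gene that encodes it is expressed in that tissue.
rule inferred-tissue-enhancement:
  when {
    (encoded-protein:$p, encoding-gene:$g) isa protein-encoding;
    (expressed-gene:$g, expressing-tissue:$t) isa tissue-expression;
  } then {
    (expressed-protein:$p, enhanced-tissue:$t) isa tissue-enhancement;
  };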

Now we can directly query the relation we’re interested in:

match
  $t isa tissue, has name "heart muscle";
  ($t, $p) isa tissue-enhancement;
  $p isa protein, has uniprot-id $id;

An example result of proteins that are enhanced in the heart muscle tissue.

Not only is our query far shorter and more easily readable, but we can instantly understand what it’s asking for, as we’ve expressed the actual relation we’re looking for rather than the steps needed to find it. That’s because TypeQL patterns are declarative, and this feature is only enhanced when used in combination with inference. With these features combined, we can easily surface the high-level connections buried in our data, and let TypeDB figure out how.

Identifying potential drug targets

Finally, we’ve got all the tools we need to generate potential new knowledge and answer some real questions, like what drugs might be potential treatments for specific diseases. Let’s pick a specific disease to look into, for instance neoplasms, a broad disease category that covers cancers, and phrase our business question:

What drugs might be potential treatments for neoplasms?

We can then construct our business question as a query:

match
  $di isa disease, has name "Neoplasms";
  $dr isa drug, has id $dr-id;
  ($di, $dr) isa potential-disease-treatment;

Drugs that might be potential treatments for neoplasms.

We’ve hit seven results and, as we can see from the green outlines, those results were generated by inference. Let’s look a bit deeper into one of them:

match
  $di isa disease, has name "Neoplasms";
  $dr isa drug, has id "chembl:CHEMBL2108738", has name $dr-n;
  $pdt ($di, $dr) isa potential-disease-treatment;

Nivolumab, a drug that might be a potential treatment for neoplasms.

We see that this drug is Nivolumab, which is a common treatment for many types of cancer. It’s unsurprising that we got this result back as indeed Nivolumab might be used to treat neoplasms, but how did TypeDB arrive at this conclusion? We can investigate by using TypeDB Studio’s explanations feature, which will show us how the inference took place. If we explain the potential disease treatment relation, we see that it exists because both neoplasms and Nivolumab are associated with a particular protein:

Nivolumab, a drug that might be a potential treatment for neoplasms.

But here we can see that both the disease-protein association and drug-protein association are inferred as well. If we go a level deeper, then we’ll get more information about the inference route:

Nivolumab, a drug that might be a potential treatment for neoplasms.

Now we can see that both neoplasms and Nivolumab have interactions with particular proteins, and those two proteins are associated with the protein we found earlier. If we continue to explain the relations repeatedly, we’ll eventually get to the root cause of the result:

Nivolumab, a drug that might be a potential treatment for neoplasms.

Now we can see the non-inferred blue lines that connect the disease to the drug via a chain of ten relations. If we follow them along, it looks like:

  • The drug, Nivolumab, has an interaction with a gene.
  • That gene is used to synthesize a transcript.
  • That transcript is used to synthesize a protein.
  • That protein participates in a particular pathway.
  • A second protein also participates in that pathway.
  • That second protein participates in a second pathway.
  • A third protein also participates in that second pathway.
  • That third protein is synthesized from a transcript.
  • That transcript is synthesized from a gene.
  • That gene has an interaction with the disease, neoplasms.

If we wanted to query that potential disease treatment directly, the query would be over twenty lines long, and this would be the case with most other databases. But with TypeDB, we can abstract all of that complexity away with inference, and ask only the question we’re interested in:

match
  $di isa disease, has name "Neoplasms";
  $dr isa drug, has id $dr-id;
  ($di, $dr) isa potential-disease-treatment;

The real beauty of TypeDB’s declarative queries is even more subtle. We got six other results for this query for neoplasm treatments, and not all of them were inferred in the same way as Nivolumab was. That’s because, with the rules defined in the schema, there are multiple ways that a potential disease treatment relation can be inferred, but when we run this query, we get all of the results. That’s what it really means for a query language to be declarative: we specify what we want, but not how to work it out. TypeDB does that for us, and we can always ask it to explain how it got there.
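
As an illustration of one such route, the shared-protein step we saw when explaining the Nivolumab result could be produced by a rule along the following lines. This is a sketch rather than the rule actually used in TypeDB Bio: the relation labels disease-protein-association and drug-protein-association are taken from the explanation walk-through above, and all of the role names are hypothetical.

```typeql
define

# A drug is a candidate treatment for a disease if both are
# associated with the same protein. Role names are hypothetical.
rule potential-treatment-via-shared-protein:
  when {
    (associated-disease:$di, associated-protein:$p) isa disease-protein-association;
    (associated-drug:$dr, associated-protein:$p) isa drug-protein-association;
  } then {
    (treated-disease:$di, candidate-drug:$dr) isa potential-disease-treatment;
  };
```

Any number of additional rules can conclude the same potential-disease-treatment relation from different evidence, and the query does not need to change to pick up their results.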

Concluding remarks

We’ve seen how TypeDB can streamline data ingestion, learned how it can expose deeply buried connections in our data, and then explored a practical example of using inference to identify potential drug targets.

Given the large amount of data involved, knowledge discovery is most often performed using machine learning methods. TypeDB’s strong typing provides machine learning tools with richly labelled and structured data, enabling them to get the most meaning out of it. Inference also provides deductive reasoning, which complements the inductive capabilities of typical machine learning tools, enabling more accurate conclusions than can be delivered by machine learning alone.

You can find out more about building machine learning models over TypeDB in our previous blog post on link prediction. To learn more about the TypeDB Bio project and knowledge engineering as a whole, sign up to our upcoming drug discovery webinar. Alternatively, to start working with data straight away, check out the TypeDB Bio project made available under our Open Source Initiative.
