TypeDB helps Origin Sciences build a cancer research platform that finds what conventional pipelines overlook

A new platform, built by Origin Sciences, shows how TypeDB enabled the development of a pathway-level methylation analysis system for colorectal cancer research, processing a 22 GB clinical dataset end-to-end in under an hour, and recovering oncogenic signals that conventional pipelines tend to overlook.
Origin Sciences is a leading genomics company developing early detection tools for colorectal cancer (CRC). Their OriCol™ device collects rectal mucus, and their research programme has produced peer-reviewed findings, including the first published demonstration of a CRC-associated methylation signal from that sample type, published in Nature Communications. EpiGraph is the internal platform they built to make that kind of analysis tractable at scale.
Genome-scale methylation analysis sits at the intersection of two hard problems. The first is numerical: measurements across millions of CpG sites and hundreds of samples form a matrix that can’t be casually loaded into memory. The second is biological: raw genomic coordinates are meaningless until they are mapped to genes, promoters, CpG islands, pathways, and functional annotations, most of which is relationship structure rather than numbers. Most pipelines solve one at the cost of the other, which is the problem EpiGraph was built to avoid.
The dataset: 864 samples, ~4 million genomic sites, 22 GB of raw data drawn from three independent clinical studies.
Architecture
EpiGraph uses two storage layers, each chosen for its particular specialism.
TypeDB holds the biological structure: genes, pathways, CpG-to-region mappings, functional annotations, and sample metadata. Parquet files, queried via DuckDB, hold the dense beta-value matrix. A thin Python layer bridges the two, pulling structure from the graph, slicing signal from columnar storage, and writing derived per-pathway features back into the graph as first-class entities.
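The bridging pattern can be sketched in a few lines. This is only an illustration under stated assumptions: plain dictionaries stand in for the TypeDB graph and the Parquet/DuckDB matrix, every identifier is hypothetical, and the aggregation rule (mean of per-gene mean beta values) is an assumption rather than EpiGraph's documented method.

```python
from statistics import mean

# Hypothetical stand-in for the graph layer: pathway -> genes -> CpG site IDs.
PATHWAY_GENES = {"pathway_A": ["GENE1", "GENE2"]}
GENE_CPGS = {"GENE1": ["cg01", "cg02"], "GENE2": ["cg03"]}

# Hypothetical stand-in for the columnar layer: sample -> CpG -> beta value.
BETAS = {
    "sample_1": {"cg01": 0.2, "cg02": 0.4, "cg03": 0.9},
    "sample_2": {"cg01": 0.6, "cg02": 0.8, "cg03": 0.1},
}


def pathway_feature(sample, pathway):
    """Mean of gene-level mean beta values for one sample and one pathway."""
    gene_means = []
    for gene in PATHWAY_GENES[pathway]:          # structure comes from the graph
        sample_betas = BETAS[sample]             # signal comes from the matrix
        values = [sample_betas[c] for c in GENE_CPGS[gene] if c in sample_betas]
        if values:
            gene_means.append(mean(values))
    return mean(gene_means) if gene_means else None


# Derived features, keyed by (sample, pathway); in the real system these
# would be written back into the graph as entities.
features = {(s, p): pathway_feature(s, p) for s in BETAS for p in PATHWAY_GENES}
```

The point of the split is that the lookup side never touches beta values and the numerical side never needs to know what a pathway is.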
The numerical work (gene aggregation and pathway over-representation) runs against the Parquet/DuckDB layer and completes in 139s and 101s respectively on a single workstation. The biological queries (pathway membership, CpG annotation, directional metadata) run against TypeDB and stay expressed in biological terms rather than hand-coded SQL joins.
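The article does not say which statistical test EpiGraph uses for pathway over-representation. A common choice is the one-sided hypergeometric test, sketched below with toy numbers; the function name and parameters are illustrative only.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): probability of seeing at least k pathway genes among n
    significant genes, when K of the N background genes are in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible terms drop out of the sum on their own.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy example: 10 background genes, 4 in the pathway, 3 significant,
# 2 of which fall in the pathway.
p_value = hypergeom_sf(k=2, N=10, K=4, n=3)
```

Exact integer arithmetic keeps the sketch dependency-free; a production pipeline would more likely use a library routine and correct the resulting p-values for multiple testing.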
Two TypeDB features specifically shaped the design. The function construct makes transitive biological queries natural to express: a four-line TypeQL function replaces a recursive CTE. The @values and @card constraints turn the schema into a specification of valid biology rather than a container for whatever happens to be inserted.
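For a sense of what the function construct replaces, here is the relational version of a transitive query: a recursive CTE, run against an in-memory SQLite database. The table, column, and pathway names are hypothetical, and a parent-child pathway hierarchy is assumed as the transitive relation; the article does not show EpiGraph's actual schema or its TypeQL function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pathway_parent (child TEXT, parent TEXT);
    INSERT INTO pathway_parent VALUES
        ('signaling_by_FGFR2', 'signaling_by_FGFR'),
        ('signaling_by_FGFR',  'signaling_by_RTKs'),
        ('signaling_by_RTKs',  'signal_transduction');
""")

# Walk the hierarchy upward: seed with direct parents, then repeatedly
# join back onto the edge table until no new ancestors appear.
ANCESTORS_SQL = """
    WITH RECURSIVE ancestors(name) AS (
        SELECT parent FROM pathway_parent WHERE child = ?
        UNION
        SELECT pp.parent
        FROM pathway_parent AS pp
        JOIN ancestors AS a ON pp.child = a.name
    )
    SELECT name FROM ancestors
"""


def ancestors(pathway):
    """Return all transitive parents of a pathway (order not guaranteed)."""
    return [row[0] for row in conn.execute(ANCESTORS_SQL, (pathway,))]
```

The query works, but the recursion plumbing says nothing about biology, which is the contrast the article is drawing.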
The TypeDB AI assistant compressed schema iteration substantially; the first working version of the full software stack (schema, loaders, analysis layer, reports, tests, CI) came together in a single engineering day.
What it found
Running against the full cohort:
- 4,850 genes reached FDR significance in the CRC-vs-control comparison across 49,371 gene-level features
- 1,152 further distinguished CRC from pre-cancerous polyps
- 11 Reactome pathways reached significance, clustering around PI3K/AKT, RAF/MAPK, and FGFR2 — the same growth-factor axes that recurrent driver mutations target in CRC, recovered from methylation data alone with no mutation-panel input
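The counts above depend on a false discovery rate correction. The article does not name the procedure; the Benjamini-Hochberg adjustment is the usual default and is sketched here as an assumption, with made-up p-values.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Illustrative p-values only, not results from the cohort.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [i for i, value in enumerate(q) if value < 0.05]
```

A feature counts as FDR-significant when its q-value falls below the chosen threshold, 0.05 in this sketch.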
The FGFR2 result illustrates why the architecture matters. None of the seven genes involved would have been significant individually. In TypeDB, the relationship between gene and pathway is a first-class property of the graph. So, evaluated as a group, under a graph-native pathway-membership query, they produced a statistically significant and biologically coherent result that aligns with independent mutation-level knowledge of CRC driver biology. A gene-by-gene pipeline wouldn’t have surfaced it.
A direction-aware re-analysis added further signal. Using curated CpG-direction metadata stored directly in the graph, EpiGraph produced separate matrices for hyper- and hypo-methylated genes. 90% of hyper-methylated genes (FDR q < 0.05) carried cohort-level CRC signal, against 3.5% of hypo-methylated genes. This is the canonical CRC-CIMP pattern, surfaced from a single pre-specified graph property rather than derived from the cohort data.
The same per-sample pathway feature vector also functions as a cohort QC layer. Samples whose pathway-level signature is inconsistent with their recorded group can be flagged for provenance and labelling review. Because every feature is defined in biological terms, anomalies arrive with an immediate biological handle, which makes them faster to triage than anonymous statistical outliers.
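One simple way to implement such a check is a nearest-centroid comparison: flag any sample whose pathway feature vector sits closer to another group's centroid than to its own. This is a sketch, not EpiGraph's documented QC rule; the function names and the toy vectors are hypothetical.

```python
from math import dist
from statistics import mean


def group_centroids(vectors, labels):
    """Per-group mean of the pathway feature vectors."""
    groups = {}
    for vector, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vector)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in groups.items()
    }


def flag_inconsistent(vectors, labels):
    """Return (index, recorded_group, nearest_group) for mismatched samples."""
    centroids = group_centroids(vectors, labels)
    flagged = []
    for i, (vector, label) in enumerate(zip(vectors, labels)):
        nearest = min(centroids, key=lambda g: dist(vector, centroids[g]))
        if nearest != label:
            flagged.append((i, label, nearest))
    return flagged


# Toy data: two pathway features per sample; sample 3 is recorded as CRC
# but looks like a control.
vectors = [[0.8, 0.7], [0.9, 0.8], [0.85, 0.75], [0.2, 0.1],
           [0.1, 0.2], [0.2, 0.1], [0.15, 0.15]]
labels = ["CRC", "CRC", "CRC", "CRC", "control", "control", "control"]
flagged = flag_inconsistent(vectors, labels)
```

Each sample contributes to its own group centroid here, which is a simplification; a leave-one-out centroid would be a stricter variant. A flag is a prompt for review, not evidence that the label is wrong.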
On choosing TypeDB
“I evaluated property-graph alternatives before committing to TypeDB. The difference was constraint enforcement. In a property graph, controlled vocabularies and cardinality rules live in your application code — you find out they’re broken late. In TypeDB, they live in the schema. For a biological domain with real structure, that’s not a developer-experience nicety. It’s a correctness feature.”
— Andrew Page, Origin Sciences
The constraint enforcement point generalises beyond biology. Any domain where the data has real structure – controlled vocabularies, typed relationships, cardinality rules that mean something – benefits from encoding that structure at the schema layer rather than the application layer. EpiGraph is a concrete example of what that looks like in practice.
EpiGraph is open source under the MIT licence: github.com/originsciences/epigraph. The full technical whitepaper is available from Origin Sciences on request.



