Improving LLM understanding of TypeQL with Text2TypeQL

We are excited to release text2typeql, a new dataset designed to accelerate LLM mastery of TypeQL.
Our goal is to make TypeQL generation faster, more accurate, and seamlessly integrated into the English-to-TypeQL agentic workflows that are becoming standard in the industry.
While frontier models already handle TypeQL’s high-level semantics reasonably well, achieving production-grade reliability often requires iterative prompting or latency-expensive reasoning chains. Ecosystems like SQL and Cypher have benefited from years of public repositories and benchmark datasets to streamline this process.
With text2typeql, we are releasing nearly 14,000 validated TypeQL queries paired with natural language questions, available on GitHub under the Apache 2.0 license. This open-source resource empowers developers to fine-tune local, private, and low-latency models, ensuring that TypeQL generation is as robust and efficient as the database itself.
Text2TypeQL
text2typeql is an open-source dataset of 13,939 natural-language questions paired with validated TypeQL 3.0 queries, drawn from two source datasets and spanning fifteen diverse domains. Each domain has its own TypeQL schema, against which its queries were validated. The domains include:
- Social networks (Twitter, Bluesky) – users, tweets, hashtags, follows
- Streaming platforms (Twitch, Neoflix) – streamers, subscriptions, ratings
- Film industry (Movies) – actors, directors, producers, reviews, roles
- Recommendations – users, movies, genres, ratings, actors
- Corporate graphs (Companies) – organizations, subsidiaries, articles, cities
- Fiction networks (Game of Thrones) – characters, houses, battles, interactions
- Q&A (StackOverflow, Buzzoverflow) – users, questions, comments, votes
- Financial crime (FinCEN) – filings, countries, originators, beneficiaries
- Business reviews (GrandStack) – businesses, users, reviews, categories
- Infrastructure (Network) – devices, interfaces, connections, configurations
- Supply chain (Northwind) – products, suppliers, customers, orders, categories
- Investigations (OffshoreLeaks) – officers, entities, intermediaries, addresses
The dataset is built from two source collections. Synthetic-1 contains 4,733 converted queries; synthetic-2 contains 9,206. In total, 104 queries (0.7%) were skipped because they rely on features, such as vectors, casting, and date-manipulation functions, that are not yet available in TypeQL.
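The figures above are easy to verify; a quick sanity check of the counts:

```python
# Sanity-check the dataset counts quoted above.
synthetic_1 = 4_733   # queries converted from the first source collection
synthetic_2 = 9_206   # queries converted from the second source collection
skipped = 104         # queries using features not yet available in TypeQL

converted = synthetic_1 + synthetic_2          # total released queries
skip_rate = skipped / (converted + skipped)    # fraction of attempts skipped

print(converted)                  # 13939
print(round(skip_rate * 100, 1))  # 0.7
```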
Each converted entry includes an English question, the original Cypher query, and the validated TypeQL query. The dataset ships with fifteen TypeQL schemas, one modelling each domain. Because TypeDB is schema-first, every query can be validated for correctness by compiling it against its domain's schema!
Here’s an example of a converted English prompt, with its Cypher and TypeQL implementations:
English:
List the top 3 categories based on the number of products they contain.
Cypher:
MATCH
(p:Product)-[:PART_OF]->(c:Category)
WITH c.categoryName AS category, COUNT(p) AS productCount
RETURN category, productCount
ORDER BY productCount DESC
LIMIT 3
TypeQL:
match
$c isa category;
$p isa product;
part_of (product: $p, category: $c);
reduce $product_count = count($p) groupby $c;
sort $product_count desc;
limit 3;
fetch { "category": $c.category_name, "product_count": $product_count };
Where It Comes From
This dataset is derived from Neo4j Labs’ text2cypher benchmark, which contains several synthetically generated datasets of English/Cypher pairs across demo databases with schemas.
The first dataset, which we call synthetic-1, corresponds to their opus dataset, generated by Claude Opus. The second dataset, synthetic-2, corresponds to their gpt4o dataset, generated by GPT-4o.
We took the same questions and schemas, converted everything to TypeQL 3.0, and preserved the original Cypher alongside each query for direct comparison. Full credit to Neo4j Labs for creating and releasing the original datasets.
We ❤️ open source!
How This Helps
Fine-tuning. The dataset provides supervised training data for fine-tuning smaller, faster models (Llama, Mistral, Phi, and similar) on TypeQL generation. The fifteen-domain coverage ensures models encounter diverse schema patterns — from simple entity lookups to complex multi-hop financial crime investigations. This enables local, low-latency, cost-effective text-to-TypeQL without relying on frontier model APIs.
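As a concrete illustration, each question/query pair can be reshaped into a chat-style supervised training record. This is a minimal sketch: the field names (question, schema, typeql) and the system-prompt wording are assumptions, not the dataset's actual layout.

```python
import json

def to_chat_example(question: str, schema: str, typeql: str) -> dict:
    """Format one question/query pair as a chat-style fine-tuning record."""
    return {
        "messages": [
            {"role": "system",
             "content": "Translate the user's question into TypeQL 3.0 "
                        "for the following schema:\n" + schema},
            {"role": "user", "content": question},
            {"role": "assistant", "content": typeql},
        ]
    }

# Hypothetical record; the dataset's actual field names may differ.
pair = {
    "question": "List the top 3 categories based on the number of "
                "products they contain.",
    "schema": "define\n  entity product ...;\n  entity category ...;",
    "typeql": "match\n$c isa category;\n...",
}

record = to_chat_example(pair["question"], pair["schema"], pair["typeql"])
line = json.dumps(record)  # one JSONL line per training example
```

Emitting one such JSON object per line yields a JSONL file in the shape most fine-tuning toolchains accept.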
Few-shot prompting and RAG. The thousands of validated examples serve as a rich retrieval corpus for retrieval-augmented generation or few-shot in-context learning. Given a user’s natural-language question, a system can retrieve similar questions from the dataset and include their TypeQL as examples in the prompt.
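The retrieval step can be sketched in a few lines. Here plain string similarity from the standard library stands in for the embedding search a production RAG system would use, and the corpus entries are toy stand-ins for real dataset records:

```python
import difflib

def retrieve(user_question, dataset, k=3):
    """Return the k pairs whose English questions best match the input.

    SequenceMatcher ratio is a simple stand-in for embedding similarity.
    """
    return sorted(
        dataset,
        key=lambda pair: difflib.SequenceMatcher(
            None, user_question.lower(), pair["question"].lower()
        ).ratio(),
        reverse=True,
    )[:k]

def build_prompt(user_question, examples):
    """Assemble a few-shot prompt from retrieved question/TypeQL pairs."""
    shots = "\n\n".join(
        f"Question: {ex['question']}\nTypeQL:\n{ex['typeql']}"
        for ex in examples
    )
    return f"{shots}\n\nQuestion: {user_question}\nTypeQL:"

# Toy corpus; real entries would come from the dataset files.
corpus = [
    {"question": "List the top 3 categories by product count.",
     "typeql": "match $c isa category; ..."},
    {"question": "Which users follow the most accounts?",
     "typeql": "match $u isa user; ..."},
]

examples = retrieve("Show the 3 biggest product categories", corpus, k=1)
prompt = build_prompt("Show the 3 biggest product categories", examples)
```

The resulting prompt ends with the user's question and a bare "TypeQL:" cue, so the model completes it with a query in the style of the retrieved examples.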
Learning resource. Side-by-side Cypher and TypeQL for the same English question makes the dataset a practical reference for engineers learning TypeQL: it shows how familiar Cypher patterns map to TypeQL constructs such as explicit roles, reduce for aggregation, and let for computed values.
Feature coverage. The fifteen domains collectively exercise nearly the full breadth of TypeQL query syntax, from simple entity lookups to recursive stream functions for transitive closure. The expanded synthetic-2 domains add patterns underrepresented in the original seven — financial transaction chains (FinCEN), supply chain hierarchies (Northwind), network topologies (Network), and investigation graphs (OffshoreLeaks). Advanced features like fetch subqueries, chained aggregation, custom functions, and datetime arithmetic appear throughout. Models trained on this data will encounter the patterns they need for real-world use.
Get Involved
The dataset is available on GitHub. We welcome contributions: additional domains, alternative TypeQL formulations for existing queries, or conversions of the remaining queries as TypeQL gains new features.
As always, feel free to join our Discord or contact us!
Coming Soon
Stay tuned for our post on how we made this dataset. We’ll cover our agentic pipeline and detail how TypeQL query generation had far fewer semantic errors than Cypher due to validation against strict schemas.
