Lessons in Claude & TypeQL - cost, context, correctness

How we created the text2typeql dataset, automated TypeQL query generation, and learned from our agentic workflow experiences.

Joshua Send


We recently announced our text2typeql dataset. It contains 14,000 pairs of English questions and TypeQL queries generated to answer them.

Creating this dataset for TypeQL yielded many lessons in using an agentic system to automate and orchestrate work as reliably and cost-effectively as possible.

Let’s dive into how we automated the creation of this dataset using Claude Code, and share what we learned along the way.

Creating text2typeql

The source dataset is Neo4j’s text2cypher. It comprises synthetically generated datasets, each spanning multiple domains. Each domain has natural language queries and a sort of soft database “schema”. In their repository, they also provide a converted Cypher query for each natural language question.

Prerequisite: schema conversion

The schemas provided with each domain in the source dataset are simple JSON descriptions of the data. Critically, Neo4j itself is ‘schema optional’—any data can be added at any time. When Neo4j ‘schemas’ do exist, they are often simple constructs like uniqueness constraints and indexes, rather than tools designed to build and validate expressive, advanced data models.

To convert this to TypeQL, we needed to formalize these loose descriptions. Conceptually, this meant mapping node labels to entity types, relationship types to relation types (with explicit roles), and properties to attributes.
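A minimal sketch of this mapping in Python (the function and naming conventions here are illustrative only; the real conversion was LLM-driven and human-reviewed):

```python
def json_schema_to_typeql(node_props: dict, relationships: list) -> str:
    """Turn a loose Neo4j-style JSON schema into a TypeQL `define` block.

    node_props:    {"Person": [{"property": "name", "type": "STRING"}, ...]}
    relationships: [{"start": "Person", "type": "WORKS_AT", "end": "Company"}]
    """
    lines = ["define"]
    # Node labels become entity types; properties become owned attributes.
    for label, props in node_props.items():
        owns = ",\n".join(f"    owns {p['property']}" for p in props)
        lines.append(f"entity {label.lower()},\n{owns};")
    # Relationship types become relations. Note that good role names
    # (e.g. parent vs child) generally need human or LLM judgement.
    for rel in relationships:
        lines.append(
            f"relation {rel['type'].lower()},\n"
            f"    relates {rel['start'].lower()},\n"
            f"    relates {rel['end'].lower()};"
        )
    return "\n".join(lines)


print(json_schema_to_typeql(
    {"Person": [{"property": "name", "type": "STRING"}]},
    [{"start": "Person", "type": "WORKS_AT", "end": "Company"}],
))
```

A purely mechanical translation like this breaks down quickly (for instance, a self-referential `HAS_PARENT` edge needs distinct `parent`/`child` roles), which is exactly why an LLM plus human review did the real work.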

We used Claude to kickstart this process, feeding it the source JSON to generate initial TypeQL schemas. It generally did a decent job guessing the required structure simply from the general language semantics of the domain’s labels.

However, TypeQL’s richer type system allowed us—and sometimes required us—to extend the schemas beyond the Neo4j originals. During our quick manual review, we refined the generated schemas to capture semantics that Cypher often leaves implicit in property values or query-time conventions. This included:

  • Adding explicit entity subtypes where the JSON was flat.
  • Defining role distinctions for clearer relation semantics.
  • Tightening cardinality and key constraints to match the intended logic of the domain.

This human-in-the-loop review ensured the final schemas were clean and more “typedb-y”!

Here’s an example Neo4j schema snippet from the companies dataset:

"node_props": {
    "Person": [
      {
        "property": "name",
        "type": "STRING",
        "values": [ // truncated // ],
        "distinct_count": 7987
      },
       {
        "property": "id",
        "type": "STRING",
        "values": [ // truncated // ],
        "distinct_count": 8064
      },
      {
        "property": "summary",
        "type": "STRING",
        "values": [ // truncated //],
        "distinct_count": 6401
      }
  },
  ...
},
"relationships": [
    {
      "start": "Person",
      "type": "HAS_PARENT",
      "end": "Person"
    },
    {
      "start": "Person",
      "type": "HAS_CHILD",
      "end": "Person"
    },
...
]

This produced the following TypeQL snippet:

define
entity person,
    owns name,
    owns person_id @key,
    owns summary,
    plays parent_of:parent,
    plays parent_of:child;

relation parent_of,
    relates parent,
    relates child;

We have found, and continue to find elsewhere, that data models and schemas are still best built with humans in the loop: it’s a way to encode hard-won domain knowledge. You know the nuances of your problem better than any LLM – help it out by tweaking the schema or giving it a precise description of what you want!

And, because TypeDB uses a strict schema, you just have to encode your domain knowledge once, and then your system will have those patterns and requirements enforced forever.

Attempt 1: script + API

Schemas completed, the first approach to automating query conversion was to have Claude write a script that took a row of data, built a prompt, fired it off to an LLM provider’s API, and then parsed the output.

To take advantage of TypeDB’s schema and close the loop, the script would take the output query and validate it against a running TypeDB instance loaded with the schema. Any errors were fed back into the prompt and re-submitted, up to 3 times. Voilà! A self-correcting loop!
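The loop can be sketched like this (`call_llm` and `validate` are hypothetical stand-ins for the provider API call and the TypeDB validation script):

```python
MAX_RETRIES = 3

def convert_with_feedback(question, cypher, schema, call_llm, validate):
    """Ask an LLM for a TypeQL conversion; feed validation errors back on failure.

    `call_llm(prompt) -> str` returns a candidate TypeQL query.
    `validate(typeql) -> str | None` runs the query against a live TypeDB
    with the schema loaded, returning an error message or None if valid.
    Both are injected so the loop itself stays provider-agnostic.
    """
    prompt = (
        f"Schema:\n{schema}\n\nQuestion: {question}\n"
        f"Cypher: {cypher}\n\nConvert this to TypeQL."
    )
    for _ in range(MAX_RETRIES):
        typeql = call_llm(prompt)
        error = validate(typeql)
        if error is None:
            return typeql
        # Self-correcting step: append the server error and retry.
        prompt += f"\n\nYour previous attempt failed:\n{typeql}\nError: {error}"
    return None  # give up; this row lands in the failed pile
```

Each retry grows the prompt with the failed attempt and the server’s error – which is also why the token bill grows so quickly.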

But… this gets really expensive really fast. And we had thousands of queries to convert, with thousands of API calls. No thanks!

Attempt 2: sampling MCP

The obvious savings lie in leveraging the much more cost-effective Claude Code subscription that was already driving the work. So the next attempt was to reformulate the script as an MCP server, and use sampling to ask Claude Code to do the actual work of query conversion.

Of course, Anthropic is very unfriendly here and has not implemented sampling in Claude Code. I think this is actually kind of terrible – their pricing already has rate limits at multiple levels, presumably to ensure they have a reasonable cost basis. That’s fine – but let me burn my tokens however works for me, please!

And anyway, we’ll just find ways to work around this restriction…

Attempt 3: simple Claude Code

To me this was the most interesting attempt, and caught me out – right when I thought I was done!

Instead of handing the work off to another API or via sampling, I asked Claude to do the conversion itself. In fact, it was already eagerly doing so whenever the API-based program from the previous attempt failed to convert a query.

This ran successfully and cost-effectively for hours, occasionally hitting the 5-hour rate limits. Converting queries in batches made the process even faster.

But the story changed when we looked into the quality of the generated TypeQL – it was horrible! A spot check showed that many of the generated queries were duplicates, or didn’t really answer the question.

Here’s an example from the movies dataset:

English:

Find all movies where Nancy Meyers was involved either by
acting, directing, producing, or writing.

TypeQL:

match
  $p isa person, has name "Nancy Meyers";
  $m has title $title;
fetch { "movie": $title };

It’s a “correct” query that type-checks in TypeDB – it finds a person called Nancy Meyers and some movie – but it’s completely missing the relations: nothing connects $p to $m through acting, directing, producing, or writing.

However, retrying the same query in a clean context succeeded. This suggests that as the context fills up with queries, conversions, and failed attempts, output quality deteriorates. In general, longer context means less accurate instruction following and a higher chance of confusion.

Attempt 4: subagents to the rescue

Luckily there’s a great way to manage context: subagents. A small context lends itself to much cleaner, reproducible behaviours.

Here’s a snippet from the subagent I defined to handle the conversion:

convert-query-runner.md
Use this agent when you need to convert a single Cypher query from the Neo4j dataset to TypeQL format with full validation.
...
# Conversion Steps
Step 1: Check if Already Processed
...
Step 2: Get the Query
...
Step 3: Load Schema
...Reads the schema file for the current domain...
Step 4: Convert to TypeQL
...
Step 5: Validate Against TypeDB
... Uses predefined python script to submit to a running TypeDB server...
Step 6: Semantic Review
...
Step 7: Write Result
...

This approach incorporates all the key learnings:

  1. Subagents use a clean context that reduces output variance
  2. Every generated query is executed against a live TypeDB instance to verify parsing and type-checking
  • This step catches syntax errors, incorrect role names, missing attributes, and type mismatches that are syntactically plausible but semantically invalid against the loaded schema. These errors help the subagent fix the query autonomously.
  3. Every generated query is verified by the LLM to ensure it actually answers the English question – not just that it is valid TypeQL.
    • This caught wrong sort directions, incorrect aggregation targets, and cases where optional-match semantics required try {} blocks rather than mandatory patterns.

Ultimately, this approach was highly successful at converting the bulk of the queries, and was amenable to parallelization across datasets, with per-query conversion speed as the only trade-off.
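Since each domain’s conversions are independent, the fan-out is straightforward. A sketch (where the hypothetical `convert_domain` stands in for launching one subagent-driven run over a dataset):

```python
from concurrent.futures import ThreadPoolExecutor

def convert_all(domains, convert_domain, max_workers=4):
    """Run one conversion pipeline per domain concurrently.

    `convert_domain(name)` is assumed to wrap a full subagent-driven
    conversion run for that domain and return its result summary.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with `domains`
        return dict(zip(domains, pool.map(convert_domain, domains)))
```

Because each worker just waits on an external agent process, threads (rather than processes) are sufficient here.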

Tidying up

The conversion process leaves behind a set of failed.csv query lines. This is where a human can step in: providing tips for writing TypeQL queries the LLM didn’t know, pointing to documentation pages, and extending the schema with new structures.

Lastly, one final sanity check – submitting each query for another semantic review against the English question, and validating it once more against TypeDB.

And that’s it!

Lessons in Claude

So here are some practical takeaways for managing Claude and building agentic workflows:

  1. Manage your context – subagents are excellent for this
  2. Give the agent validation loops. TypeDB allows validating your queries against your schema for both syntactic and some level of semantic correctness (we’ll talk more about this in a follow-up blog post!)
  3. Have the agent review its own work

That’s all for now. Stay tuned for a follow-up post assessing Cypher vs TypeQL as targets for AI-generated queries 😉
