
TypeDB Fundamentals

Creating complex DB workflows with pipelines



This article is part of our TypeDB 3.0 preview series. Sign up to our newsletter to stay up-to-date with future updates and webinars on the topic!

A query pipeline is a sequence of query clauses. Each clause describes an operation that takes an input stream of data, and produces an output stream of data. In this article we describe how complex data pipelines can be crafted from a small set of basic query operations.

At a glance: pipeline syntax essentials

In TypeDB, query inputs and outputs are streams of so-called concept maps: these are maps associating variable names to concepts (i.e., objects, values, or types). TypeDB’s query pipelines are composed of the following basic query clauses as building blocks to operate on such streams.

  • A “match” clause is a query clause of the form match P, where P is a TypeQL pattern. A match clause takes a concept map stream as its input, and produces an output stream by augmenting (in zero or more ways) each map in its input with the matched results for additional variables in the pattern P.
  • The “insert” clause insert S (for a sequence of statements S) takes in a stream of maps. It executes each insert statement in S with the given variable mappings. To produce an output stream, each map in the stream is augmented with the newly inserted concepts bound to the given variables.
  • The “delete” clause delete S takes in a map stream, and executes each delete statement in S with the given variable mapping. To produce the output stream, we remove the deleted concepts from each map in the stream.
  • The (new!) “put” clause put S should be thought of as “try matching all of S; if results are found, behave like a match, and otherwise like an insert”.
  • We provide various “stream modifier” clauses, such as filter $x1, $x2, ...; (which filters the maps in a stream down to a specific set of variables), or sort $x; (which re-sorts the entire stream based on the value that $x is mapped to), etc.
  • And, finally, we introduce a new assert clause which can be used both to impose custom constraints at query runtime and to control the flow of query pipelines.
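
To make these building blocks concrete, here is a minimal two-clause pipeline, sketched against a hypothetical schema in which a customer type owns name and status attributes. The match clause produces one map per matching customer, and the insert clause then executes once per map:

match $c isa customer, has name "Ana";
insert $c has status "active";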

We remark that the above concerns data queries; schema queries (using the usual clauses define and undefine) are separate, and remain almost unchanged in their function.

  • The beginning of a query pipeline may contain with clauses (which we collectively refer to as the preamble of the query) used to define additional query-level functions, as discussed in our function fundamentals.
  • The end of a query pipeline may comprise a reduce clause which reduces the incoming stream into a single return value (or tuple thereof), or a fetch clause which formats the stream into its final format.

Example pipeline

Let’s see how query pipelines work in practice with a slightly more complex example. Feel free to read through this example in full already — in the following sections we will go through each individual part of the pipeline in more detail.

with fun available_cars($model: car_model) -> car[] :
  match $c isa car, has car_model $model, has status "available";
  return list($c);
with fun requests_by_priority($model: car_model, $limit: int) -> request[] :
  match $request (car_model: $model) isa request, has priority $priority;
  sort $priority desc; # sort by priority high to low
  limit $limit;
  return list($request);
match
  $model isa car_model;
match
  $cars = available_cars($model);
  $requests = requests_by_priority($model, length($cars));
match
  $request = $requests[$number], links (customer: $customer);
  $car = $cars[$number];
filter $request, $car, $customer;
put
  $assign (car: $car) isa car_assignment;
assert count($assign) == 1;
assert has_payment_method($customer) == true;
insert
  $assign links (customer: $customer) @replace;
  $request has status "processed" @replace;
match
  $left_over_request isa request;
  not { $left_over_request has status "processed" };
reduce count($left_over_request); 

The with preamble

We’ve already seen how to use with clauses to define query-level functions. In the above pipeline we have two such clauses back-to-back, each defining a function. Note that these functions use stream modifier clauses (like sort and limit) in their bodies! We will discuss this shortly.

Extending results with match

Let’s first settle some terminology once and for all:

  • A concept map M is a mapping of the form ($x1 -> c1, $x2 -> c2, ...) where the $x's are variables and the c's are concepts (a concept can be either a data instance, some other computed value, or a type).
  • A map stream S is an (ordered) set of concept maps { M1, M2, ... }.

Note that we use the word set as our streams will usually have no duplicates. However, when deleting concepts with a delete clause, a stream may come to contain duplicates (as we may forego de-duplication for performance reasons).

A match clause is of the form match P, where P is called a pattern (a familiar term in TypeQL land). A match clause operates on streams as follows. It takes an input stream S, and for each map M in S it assigns variables in P the data given in M. It then matches possible results R for the remaining unassigned variables, e.g. ($y1 -> r1, $y2 -> r2, ...). For each such result R, it combines M and R into a single map, and adds it to the output stream. Note that there may be zero results R! In this case, nothing gets added to the output stream.

Match clause #1 in our pipeline

Let’s get one technical aside out of the way: at the beginning of the pipeline, the incoming stream is set to contain only the empty concept map M = () — this is not the empty stream, as it does contain a map!

Now, the very first match clause extends the unit map M with all possible results R for its “new” (i.e. all of its) variables. These extended maps are of the form ($model -> "some car model attribute"). The output stream of this first clause could look something like:

{ 
  ( $model -> "Ford Fiesta" ),
  ( $model -> "Audi A3" ),
  ( $model -> "Rolls-Royce Phantom" ) 
}

This output stream becomes the input stream to the second match clause.

Match clause #2

The second match clause is of the form:

match
  $cars = available_cars($model);
  $requests = requests_by_priority($model, length($cars));

It extends each map in its input stream with two further variables: $cars and $requests. Inspecting the query, both of these variables are assigned the single return values of functions that return lists. The output may look something like this:

{ 
  ( $model -> "Ford Fiesta", $cars -> [<car7>, <car2>, <car4>],
    $requests -> [<req1>, <req3>] ),
  ( $model -> "Audi A3", $cars -> [<car4>], $requests -> [<req2>] ) 
}

where we use <obj> to indicate objects (i.e. entities or relations) in our database.

Note how the third map (for the Rolls-Royce Phantom car model) from the input stream was dropped, because no results matching the new variables were found: in other words, either no available cars or no requests were found in the database for the $model instance "Rolls-Royce Phantom".

Match clause #3

In the third match, we query:

match
  $request = $requests[$number], links (customer: $customer);
  $car = $cars[$number];

This extends the maps in our stream with four further variables: $number (an integer index into the lists), $car, $request, and $customer. The end state of this stage of the query could look something like the following:

{ 
  ( $model -> "Ford Fiesta", $cars -> [<car7>, <car2>, <car4>], 
    $requests -> [<req1>, <req3>], $number -> 0, $car -> <car7>, 
    $request -> <req1>, $customer -> <cust113> ),
  ( $model -> "Ford Fiesta", $cars -> [<car7>, <car2>, <car4>], 
    $requests -> [<req1>, <req3>], $number -> 0, $car -> <car2>, 
    $request -> <req1>, $customer -> <cust284> ),
  ( $model -> "Audi A3", $cars -> [<car4>], $requests -> [<req2>], 
    $number -> 0, $car -> <car4>, $request -> <req2>, $customer -> <cust8> ) 
}

Effectively, this last match clause “unwinds” the two lists in parallel into their individual items, while still keeping track of the original lists; these will be dropped by the subsequent filter, as we discuss shortly.

About match clause chaining

If we have two match clauses match P; and match Q; for patterns P and Q, there is, in some sense, no difference between writing match P; match Q;, or match P; Q;, or match Q; P;, or match Q; match P;: indeed, in the declarative semantics of TypeDB, each option results in the same set of maps, comprising all solutions to the constraints of both P and Q.

Nonetheless, in the above pipeline example, we have split our patterns across multiple match clauses. Why? There are two reasons:

  1. It is often intuitive to think of your concept map being built gradually in steps. This ties in with our later discussion of using concept API calls. (On the other hand, for performance, it may be beneficial to inline all constraints.)
  2. Since a match clause extends each map in its input stream individually, chaining match clauses has a “grouping” effect, i.e. later matches are grouped by earlier matches. In other words, while the set of results is unaffected by splitting up matches, the order of results may be.

Note that in performance-critical situations, we will potentially provide an option that disables the “grouping” effect described in (2.): this may yield speed-ups in some situations.
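
To illustrate the chaining discussion above (using hypothetical person and name types), the following two pipelines produce the same set of maps, though possibly in a different order:

match $p isa person; match $p has name $n;  # chained: name results grouped per person
match $p isa person; $p has name $n;        # inlined: one combined pattern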

Stream modifier clauses

TypeDB 3.0 will ship with four modifier clauses: filter, sort, limit, and offset.

Filtering streams

The filter $x1, $x2, ...; clause filters the maps in a stream down to the given set of variables $x1, $x2, .... By default, filtering also deduplicates the maps in the resulting stream (we will provide an option to disable this deduplication and allow duplicates, as this may yield speed-ups in some situations). For example, for the stream

{
  ($car -> <car1>, $model -> "Fiat 500"),
  ($car -> <car9>, $model -> "Fiat 500"),
  ($car -> <car9>, $model -> "Seat Ibiza")
}

the clause filter $model; would yield the output stream

{
  ($model -> "Fiat 500"),
  ($model -> "Seat Ibiza")
}

Note that we can also filter on optional variables (i.e. those that need not be present in every map of a stream — see our optionality fundamentals): for example, filter $x, $y will turn the map ($x -> <x>, $z -> <z>) into ($x -> <x>) (here, $y is the optional variable).

Sorting streams

The sort $x (asc | desc); clause will sort its input stream (in ascending or descending order) based on the data that $x is mapped to. For example, given the input stream

{
  ($car -> <car1>, $model -> "Fiat 500"),
  ($car -> <car9>, $model -> "Fiat 500"),
  ($car -> <car5>, $model -> "Seat Ibiza")
}

Then the clause sort $model desc; would yield the output stream

{
  ($car -> <car5>, $model -> "Seat Ibiza"),
  ($car -> <car1>, $model -> "Fiat 500"),
  ($car -> <car9>, $model -> "Fiat 500")
}

Note that we can also sort on optional variables $x, in which case maps with missing data for $x will be put at the end of the stream (independent of whether the sort is ascending or descending).

Limiting streams

The limit NUM; clause limits the length of a stream to NUM elements: it truncates its input stream after the NUMth element.

In a query pipeline, NUM must be an expression not containing variables from the pipeline. In functions, however, it may include variables that are supplied as arguments to the function.

Complementing the limit clause, the offset NUM; clause offsets a stream by NUM elements: it ignores the first NUM elements of its input stream, and outputs the rest of the stream.
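
For example, here is a small sketch of paging through sorted results, assuming a car type that owns a car_model attribute:

match $c isa car, has car_model $m;
sort $m asc;
offset 10;  # skip the first 10 maps of the sorted stream
limit 5;    # then keep at most the next 5 maps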

Modifiers in functions

As illustrated by our example pipeline, all modifiers can be used in functions as well, but only in between the match and return clauses of a function.

Next, let us see how we can change the data in our database from within a query pipeline, using the dedicated clauses insert, delete, and put.

Inserting data with insert

An insert clause is of the form insert S, where S is a sequence of “insert statements”. These statements perform two different operations: creating new data and new dependencies between data.

Inserting new data

For each map M in the input stream, we first assign data from M to the corresponding variables in S—all remaining variables $y are the “data to be created”. Each such “to-be-created” variable $y must appear in a unique typing statement of one of the following forms.

  • Object creation: a statement $y isa T where T resolves to an entity or relation type. In this case, the insert clause will then create a new object <obj> of type T. We then extend the map M with the mapping $y -> <obj>.
  • Attribute creation: a statement of the form $y EXPR isa T, where T resolves to an attribute type. In this case, the insert clause will then create a new attribute <attr> of type T with underlying value EXPR. We then extend the map M with the mapping $y -> <attr>.

The final case of inserting new data is copying a pre-assigned value into an attribute. Namely, a statement $y isa T, where a mapping $y -> EXPR is already present in the input concept map M, will create a new attribute <attr> of type T with underlying value EXPR. Copying a value into an attribute does not extend the map M (i.e. no new variables are created).
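
To see the three insertion cases side by side, here is a sketch assuming hypothetical person, name, and level types (with level an integer-valued attribute type):

match
  $v = 6 * 7;            # pre-assigns the value 42 to $v
insert
  $p isa person;         # object creation: extends the map with $p -> <new person>
  $n "Naomi" isa name;   # attribute creation: extends the map with $n -> <new name>
  $v isa level;          # value copy: creates a level attribute with value 42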

Inserting new data dependencies

Besides creating new objects and copying existing values, an insert clause may contain the following statements that insert dependencies between data.

  • Attribute ownership: a statement of the form $obj has ATT or $obj has T EXPR will insert ownership of either an existing typed attribute expression ATT, or a to-be-created instance of the (list) attribute type T with (list) value EXPR. Note that in the second case the type T needs to be supplied: for example, we cannot write $p has "John", as "John" is a value expression of type string; we must write $p has name "John" instead.
  • Relation roles: a statement of the form $obj links (EXPR) or $obj links (T: EXPR) will insert the role player(s) EXPR into the relation object $obj. In the second variation, a role type T is supplied. Note that a type needs to be supplied whenever EXPR could play multiple roles in $obj.

Both of the above statements can also have an @replace annotation, as we explained in our constraint fundamentals.
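
For example, assuming a person type owning name attributes and a marriage relation type with a spouse role (and with $m bound to a marriage object by an earlier match clause), a sketch of both dependency statements reads:

insert
  $p isa person, has name "John";  # create a person with a new name attribute
  $m links (spouse: $p);           # add $p as a spouse role player of $m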

The fineprint

An insert clause will result in an error if any schema constraints are violated.

The usual syntactic shortenings apply as well. E.g. $obj isa T; $obj links (A); $obj links (B); can be shortened to $obj isa T, links (A, B); (and, even more concisely, to $obj (A, B) isa T; for compatibility with pre-3.0 syntax).

Deleting data with delete

A delete clause is of the form delete S, where S is a sequence of “delete statements”. For each map M in the input stream, we again assign data from M to the corresponding variables in S—no other variables may appear in the clause.

With all our variables assigned, we then perform deletion of the resulting data-assigned statements. There are three cases of such statements:

  • x isa T deletes x from the type T, unless the object x appears as a role player in some “deletion-blocking” relation object whose type R has not been marked with @cascade and is not covered by a query-level cascade annotation (see next section).
  • x has y (or x has T y) deletes y as an owned attribute for the object x.
  • x links y (or x links (T: y)) deletes y as a roleplayer from the relation object x.

The delete clause will result in an error if any schema constraints are violated by performing the above deletions.
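
As a sketch of the latter two statement forms, assuming person and email types as well as an employment relation with an employee role:

match
  $p isa person, has email $e;
  $emp isa employment, links (employee: $p);
delete
  $p has $e;                  # remove the ownership of the email attribute
  $emp links (employee: $p);  # remove $p as a role player of $emp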

Schema- and Query-level @cascade

At the schema level, we allow deleting role players of relation objects whose type R has been marked with @cascade; but only once the role player cardinality drops below what has been specified in the schema do we delete the relation object itself.

At the query level, the deletion-blocking behavior of relation objects can be modified by supplying the annotation @cascade(R1, R2, ...), where R1, R2, ... is a list of relation types: in this case, any deletion-blocking relation object of one of the listed types will be deleted together with x. This also works for nested relations: if x is a role player in an R1-typed relation y, which is itself a role player in an R2-typed relation z, then the deletion of x will trigger the deletion of both y and z.

We emphasize again that the query- and schema-level behavior of cascading deletes differs in an important point: for schema-level @cascade, relations get deleted once they have insufficient role player cardinality; for query-level @cascade, relations get deleted if any of their role players gets deleted in the delete query.
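
As a sketch of the query-level annotation (the placement shown here is illustrative, and marriage is a hypothetical relation type):

match $p isa person, has name "Bob";
delete $p isa person @cascade(marriage);  # deletion-blocking marriage relations of $p are deleted too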

Inserting data conditionally with put

The put clause is newly introduced in TypeDB 3.0. Its purpose is to “insert data only if it doesn’t exist yet”. The clause is of the form put S, for a sequence of statements S, and can be explained very easily as follows.

  1. For each map M in the input stream, we first run match S which matches all of S (as a single pattern).
  2. If any results are returned, we extend the map M accordingly with those results.
  3. Otherwise, if no results are returned, we run insert S, and extend the map M accordingly.

Continuing our pipeline (put in action)

Let’s continue our pipeline example from where we left off (just after the first three match clauses). First, after filtering our last stream of maps, we arrive at the stream:

{ 
  ( $car -> <car7>, $request -> <req1>, $customer -> <cust113> ),
  ( $car -> <car2>, $request -> <req1>, $customer -> <cust284> ),
  ( $car -> <car4>, $request -> <req2>, $customer -> <cust8> ) 
}

Starting with the first map, the subsequent put clause performs the following.

  1. Run match $assign (car: <car7>) isa car_assignment; where we assigned the object <car7> from the first map in the stream to the variable $car.
  2. If the match is non-empty, extend the map from the stream with the results for the new variable $assign.
  3. Otherwise, create a new car_assignment object that links (car: <car7>), and add that new object to the map as the new binding of the variable $assign.

The same logic will be applied to the remaining maps in the input stream.

Note that our usage of put here ensures that, if a car had already been assigned, then instead of creating a duplicate car_assignment we overwrite the existing one in the next step.

The next step after the put clause (ignoring the assert clauses for a moment) is to run an insert clause, which replaces the customer role player in our car assignment with the appropriate new customer for the given car, and sets the request’s status to "processed". The final output stream would look like this (note that each map also carries the $assign object added by the put clause):

{
  ( $car -> <car7>, $request -> <req1>, $customer -> <cust113>, $assign -> <asgn1> ),
  ( $car -> <car2>, $request -> <req1>, $customer -> <cust284>, $assign -> <asgn2> ),
  ( $car -> <car4>, $request -> <req2>, $customer -> <cust8>, $assign -> <asgn3> )
}

Control flow with assert

We use assert clauses to control the data flowing through our pipeline. If its stated condition is not satisfied, an assert clause will throw an error, causing the pipeline to fail before the next step.

An assert clause comprises a single “condition” statement. This may be either:

  1. A per-stream statement, which uses a reduction function (such as count, sum, median, …) on one of the stream’s variables, and checks an appropriate comparator statement for the obtained value (e.g. count($x) >= 3).
  2. A per-map statement, which also uses comparator statements but checks these on each map in the input stream. Note that in this case our expressions may include user-defined functions of appropriate type.

In our earlier pipeline example, we used two assert clauses:

assert count($assign) == 1;  # per-stream condition
assert has_payment_method($customer) == true;  # per-map condition

We wanted to ensure both of these conditions hold before performing the subsequent insert. Indeed:

  1. If there are two or more assignments to the same car, then something is wrong in our data and we should abort the pipeline.
  2. Similarly (somewhat more on the business-logic side), we want to double-check that each $customer really has a valid payment method. We do so by calling the function has_payment_method, whose definition we omitted in the example.

So if an assert throws an error, how can we know what happened? This brings us to the question of debugging query pipelines.

Debugging query pipelines

Each clause (except with) can be annotated with the annotation @debug. This will print the input and output streams of the clause, and supply other useful information to the developer.
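
As a sketch of how this might look (the exact placement of the annotation is illustrative):

match @debug  # prints this clause's input and output streams
  $c isa car, has status "available";
filter $c;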

Fetch clause with subqueries and functions

Finally, let’s briefly discuss the fetch clause, which is a special “stream formatting” clause that can go at the end of query pipelines. In essence, the clause operates in two steps as follows:

  1. It takes an input stream of maps, and filters this stream by exactly those variables which appear in the body of the fetch clause.
  2. We then format the data of each map in the filtered stream as a JSON object, i.e. nested key-(list-of-)value pairs. Importantly, the (list-of-)value may be the result of another query! This may be either an attribute projection (which simply lists the attributes of a specific type owned by a given object from the input map), a subquery (which may use any of the map’s data as arguments), or a function call.

A subquery looks (and acts) similar to the body of a function: it comprises a match clause, followed by stream modifiers, and terminates with either another fetch clause or a reduce clause that reduces the subquery stream into a single (not tupled!) value. In particular, a subquery is not an entire query pipeline.

Let’s see how this plays out in practice.

A fetch subquery

Consider the following query (which also runs in TypeDB 3.0):

match  # find married employees and their colleagues
  ($employee, company: $company) isa employment;
  ($employee, $spouse) isa marriage; 
fetch
  $employee as "married-employee" : hobby;  # attribute projection
  "good-friends" : {                        # start of the subquery
    match 
      ($employee, $friend) isa friendship;
      ($spouse, $friend) isa friendship;
    fetch $friend as "friend" : name, age;
  }

After the match clause, the input stream to the fetch clause may look like this:

{
  ($employee -> <p1>, $spouse -> <p5>, $company -> <c13>),
  ($employee -> <p29>, $spouse -> <p2>, $company -> <c4>),
  ($employee -> <p29>, $spouse -> <p2>, $company -> <c13>)
}

The filtering stage

First things first, the fetch clause filters this stream down to the variables appearing in its body, and obtains the stream:

{
  ($employee -> <p1>, $spouse -> <p5>),
  ($employee -> <p29>, $spouse -> <p2>)
}

The formatting stage

Next, we format this stream as a JSON stream. Since our input stream has two maps, our output stream will have two JSONs, which will look something like the following:

# first JSON
{
  "married-employee" : { 
      "hobby" : [ <hobby3> , <hobby5> ],
      "type" : <person>
  },
  "good-friends" : [
    {
      "friend" : { "name": [ <name7> ], "age": [ <age18> ], "type": <person> }
    },
    {
      "friend" : { "name": [ <name8> ], "age": [ <age30> ], "type": <person> }
    }
  ]
}
# second JSON
{
  "married-employee" :  { 
      "hobby" : [ <hobby3> , <hobby5> ],
      "type" : <android_robot>
  },
  "good-friends" : [
      # no friends :-(
  ]
}

Here we write <concept> for the JSON representation of a given concept (e.g., <hobby3> represents some hobby attribute, and <person> represents the person type itself).

Using functions in fetch (the theory)

In 3.0, we also add the ability to call functions from fetch, in place of subqueries. The call structure is as follows:

fetch
  "my_key" : my_fun($arg1, $arg2, ...) as "key_1", "key_2", ...;

Here, the key labels "key_x" provide the JSON keys for the outputs of the function: there are as many such keys as there are elements in the output tuples of my_fun (whether the function is single-return, T1, T2, ..., or stream-return, {T1, T2, ...}).

Only the following types of functions can be called from fetch:

  • Stream-return functions with return type {T1, T2, ...} where each of the T’s is either an attribute or value type.
  • Single-return functions with return type T1, T2, ... where each of the T’s is either an attribute or value type.

We’ll discuss how these cases are formatted in JSON below.

Using functions in fetch (the practice)

As a practical example, let’s rewrite our earlier fetch clause using functions and compare the respective JSON output.

with fun hobbies($employee: person) -> hobby[] : ...
with fun friend($employee: person) -> {name, age} : ...
match  ...  # as before
fetch
  "married-employee" : hobbies($employee) as "hobby";
  "good-friends" : friend($employee) as "name", "age";

(Note that we do not provide the implementations of the functions here, for brevity.) As before, running the above will yield two JSONs, but these now look as follows:

# first JSON
{
  "married-employee" : [ 
    "hobby" : "Guitar playing", 
    "hobby" : "Bird watching" 
  ],
  "good-friends" : [
    {
      "name": "Joseph",
      "age": 24
    },
    {
      "name": "Jenny",
      "age": 42
    }
  ]
}
# second JSON
{
  "married-employee" : [ 
    "hobby" : "Vacuum cleaning", 
    "hobby" : "Recharging" 
  ],
  "good-friends" : [
      # still no friends :-(
  ]
}

Function and struct JSON conventions

Note that the output is much simpler than that of a subquery, which carries around more metadata. For functions, we only collect their returned values! More precisely, here are the conventions for calling functions from fetch:

  1. For stream-return functions, the stream of results is re-formatted as a list of results [ { "key_1": <result>, "key_2": ... }, { "key_1": ... }, ... ], where <result> can be either a JSON-formatted attribute object <att>, or a single value <val>, or a list of either (this addresses the four cases of output types: Att, Att[], Val, and Val[]).
  2. For single-return functions, the result is formatted as a single JSON object { "key_1": <result>, "key_2": <result>, ... }, where each <result> is formatted as in the previous case.

Importantly, the case of values <val> above includes both primitive and struct values: struct values are formatted as we would expect, with their named fields becoming the keys for the JSON output.
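
As a sketch, assume a hypothetical struct value type dimensions with fields length and width, and a hypothetical single-return function car_dimensions($c: car) -> dimensions. Fetching it could then look as follows:

fetch
  "size" : car_dimensions($car) as "dimensions";
# possible output: { "size": { "dimensions": { "length": 4.1, "width": 1.8 } } }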

Moreover, in both (1.) and (2.) above, if there is only one "key" and its <result> is a primitive value (e.g. a boolean, string, or integer; but not a list!), then we allow omitting the "key" in our fetch clause completely. If the user chooses to do so, then we simply return a single list [ val, val, ... ] in case (1.) and a single value val in case (2.) above.
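
For example, with a hypothetical stream-return function ages($p: person) -> { age }, omitting the key in the fetch clause yields a plain list of values:

fetch
  "friend-ages" : ages($friend);
# possible output: { "friend-ages": [ 24, 42 ] }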

Summary

Pipelines are a powerful tool: they allow users to craft their queries incrementally, and to build complex database workflows. The fact that queries become compositional in this way also makes them easier to reuse and reason about.

In the future, we may add further features to pipelines (e.g. branching pipelines, or output formats other than JSON), but for now our model will stick to the “basics” of linear pipelines as outlined above.

