Officially out now: The TypeDB 3.0 Roadmap >>

Formatting data

During bulk loading, data is often sourced from text files, most often in CSV or JSON format. Before it can be loaded, it must be formatted into a TypeQL query. This is best done by using scripts that build the queries from the data files. Let’s consider the following JSON object, which describes a book to be loaded into a bookstore database.

{
    "ISBN-13": "9780008627843",
    "ISBN-10": "0008627843",
    "Title": "The Hobbit",
    "Format": "ebook",
    "Authors": ["J.R.R. Tolkien"],
    "Editors": [],
    "Illustrators": ["J.R.R. Tolkien"],
    "Other contributors": [],
    "Publisher": "Harper Collins",
    "Page count": 310,
    "Genres": ["fiction", "fantasy"],
    "Price": 16.99
}

We might format it into an Insert query using the following Python function.

def build_book_query(entry: dict) -> str:
    query_parts = ["insert"]

    query_parts += [
        f"""$book isa {entry["Format"]};""",
        f"""$book has isbn-13 "{entry["ISBN-13"]}";""",
        f"""$book has title "{entry["Title"]}";""",
        f"""$book has page-count {entry["Page count"]};""",
        f"""$book has price {entry["Price"]};""",
    ]

    if "ISBN-10" in entry:
        query_parts += [f"""$book has isbn-10 "{entry["ISBN-10"]}";"""]

    query_parts += [f"""$book has genre "{genre}";""" for genre in entry["Genres"]]

    contributors: dict[str, int] = dict()

    for contributor in entry["Authors"] + entry["Editors"] + entry["Illustrators"] + entry["Other contributors"]:
        if contributor not in contributors:
            contributors[contributor] = len(contributors) + 1

    query_parts += [f"""$contributor_{index} isa contributor;""" for contributor, index in contributors.items()]
    query_parts += [f"""$contributor_{index} has name "{contributor}";""" for contributor, index in contributors.items()]
    query_parts += [f"""(work: $book, author: $contributor_{contributors[author]}) isa authoring;""" for author in entry["Authors"]]
    query_parts += [f"""(work: $book, editor: $contributor_{contributors[editor]}) isa authoring;""" for editor in entry["Editors"]]
    query_parts += [f"""(work: $book, illustrator: $contributor_{contributors[illustrator]}) isa illustrating;""" for illustrator in entry["Illustrators"]]
    query_parts += [f"""(work: $book, contributor: $contributor_{contributors[contributor]}) isa contribution;""" for contributor in entry["Other contributors"]]

    query_parts += [
        f"""$publisher isa publisher;""",
        f"""$publisher has name "{entry["Publisher"]}";""",
        f"""(published: $book, publisher: $publisher) isa publishing;""",
    ]

    return "\n".join(query_parts)

When called with the above JSON object, the function will return the following TypeQL query string.

insert
$book isa ebook;
$book has isbn-13 "9780008627843";
$book has title "The Hobbit";
$book has page-count 310;
$book has price 16.99;
$book has isbn-10 "0008627843";
$book has genre "fiction";
$book has genre "fantasy";
$contributor_1 isa contributor;
$contributor_1 has name "J.R.R. Tolkien";
(work: $book, author: $contributor_1) isa authoring;
(work: $book, illustrator: $contributor_1) isa illustrating;
$publisher isa publisher;
$publisher has name "Harper Collins";
(published: $book, publisher: $publisher) isa publishing;

We have employed a few strategies to help generate the query:

  • Rather than using composite statements, we only use simple statements. This makes queries easier to construct and debug.

  • Because a book might have multiple contributors, we automatically generate variable bindings for them by taking the string "$contributor_" and appending a number. This is a useful way of building a query that includes multiple similar variables.

  • We build the query up as a list of statements and then join them into a string at the end. This helps prevent errors from occurring.