Getting started

Everything runs in Docker

rete is developed and built entirely inside a container — nothing runs on the host. The dev container (.devcontainer/) carries the Rust 1.92 toolchain, rustfmt, clippy, Python, the wasm32-unknown-unknown target, and wasm-pack.

Open the folder in a dev container (VS Code: Reopen in Container), or run the same image directly:

docker build -t rete-dev -f .devcontainer/Dockerfile .
docker run --rm -it -v "${PWD}:/work" -w /work rete-dev bash
# then, inside:
cargo build --release -p rete-cli

The compiled CLI is at target/release/rete. The examples below assume it is on your PATH (or substitute cargo run -p rete-cli --).

Building a `.rete` file

rete build packs your triples into one immutable file — dictionary, permutation indexes, and a community pyramid — that you can drop on any URL.

rete build accepts N-Triples (.nt), N-Quads (.nq), and Turtle (.ttl), detected by extension. Multiple inputs are merged under one shared dictionary, and a lone - reads standard input. Use --format to override detection when the extension is missing or wrong.

rete build data.nt -o data.rete                  # single file
rete build part1.nt part2.nt -o merged.rete      # merge several inputs
curl -s https://host/data.nt | rete build - -o data.rete   # from a pipe
rete build dump.unknown --format nt -o out.rete  # force a format

N-Quads inputs build a dataset: one shared dictionary, a default-graph index, and one index per named graph.
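Detection by extension with an explicit override can be pictured as a small lookup. The sketch below is illustrative Python, not rete's implementation: `detect_format` is a hypothetical helper, and while `nt` and `ttl` appear as --format values in this guide, `nq` as a format name and the case-insensitive matching are assumptions.

```python
from pathlib import Path

# Extension -> format name. "nt" and "ttl" appear as --format values in this
# guide; "nq" is an assumed name for N-Quads.
EXTENSION_FORMATS = {".nt": "nt", ".nq": "nq", ".ttl": "ttl"}


def detect_format(path, forced=None):
    """Return the input format for `path`, preferring an explicit override."""
    if forced is not None:
        return forced
    suffix = Path(path).suffix.lower()  # assumption: matching ignores case
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"cannot detect format of {path!r}; pass one explicitly")
    return EXTENSION_FORMATS[suffix]


print(detect_format("data.nt"))
print(detect_format("dump.unknown", forced="nt"))
```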

A full ontology: RDF/XML → an optimized `.rete`

Most OBO/W3C ontologies ship as RDF/XML (*.owl), which rete build does not read — convert it to N-Triples first with rapper (from raptor2-utils; it streams, so it handles gigabyte files):

# 1. RDF/XML (OWL) -> N-Triples
rapper -i rdfxml -o ntriples chebi.owl > chebi.nt        # 812 MB owl -> 8.83 M triples

# 2. assemble, with a self-describing Dataset Card (title/license/source/examples)
rete build chebi.nt -o chebi.rete --card \
  --title "ChEBI (full)" --license "CC BY 4.0" --source "https://www.ebi.ac.uk/chebi/"
#   -> 8.83 M triples, 3.15 M terms, 6 pyramid levels, 120 MB

The build is parallel and allocation-frugal by design (the CLI enables the parallel feature): the dictionary dedups terms with a HashSet and sorts once, and the six permutation indexes (SPO/POS/OSP/SOP/PSO/OPS) are built concurrently with parallel sorts. This is what lets it scale to millions of unique terms (definitions, synonyms, SMILES/InChI strings) without the build collapsing into allocation churn — and the output is byte-identical to a serial build, so the speedup is free. Turtle-native sources (e.g. a 239 MB .ttl) skip rapper and feed rete build --format ttl directly.
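The shape of that build can be sketched in a few lines of Python. This is a toy model under stated assumptions, not rete's Rust code: the function names are hypothetical, ids are simply positions in the sorted term list, and everything runs serially. It shows why the result does not depend on the order in which the work is done: the dictionary is a set sorted once, and each permutation is a sorted list, so a prefix lookup is two binary searches.

```python
from bisect import bisect_left, bisect_right

PERMUTATIONS = ("spo", "pos", "osp", "sop", "pso", "ops")
INF = float("inf")  # sorts after every integer id


def build_dictionary(triples):
    """Deduplicate terms with a set, then sort once to assign stable ids."""
    terms = set()
    for triple in triples:
        terms.update(triple)
    return {term: idx for idx, term in enumerate(sorted(terms))}


def build_indexes(triples, dictionary):
    """Encode triples as id tuples and sort one copy per permutation."""
    encoded = {tuple(dictionary[t] for t in triple) for triple in triples}
    indexes = {}
    for name in PERMUTATIONS:
        order = ["spo".index(c) for c in name]  # e.g. "pos" -> [1, 2, 0]
        indexes[name] = sorted(tuple(t[i] for i in order) for t in encoded)
    return indexes


def lookup_prefix(index, prefix):
    """Return the contiguous run of rows whose leading ids equal `prefix`."""
    lo = bisect_left(index, prefix)
    hi = bisect_right(index, prefix + (INF,) * (3 - len(prefix)))
    return index[lo:hi]


triples = [
    ("<http://ex/a>", "<http://ex/knows>", "<http://ex/b>"),
    ("<http://ex/b>", "<http://ex/knows>", "<http://ex/c>"),
    ("<http://ex/a>", "<http://ex/name>", '"A"'),
]
dictionary = build_dictionary(triples)
indexes = build_indexes(triples, dictionary)
knows = dictionary["<http://ex/knows>"]
print(lookup_prefix(indexes["pos"], (knows,)))  # rows are (p, o, s) ids
```

Feeding the same triples in a different order yields identical dictionaries and indexes, which is the property that lets a parallel build match a serial one.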

The result is a single range-queryable file: drop it on any HTTP host and query it lazily (see Deploying & querying over a URL), or register it in the playground as a remote dataset. For a columnar/SQL view alongside it, generate lossless entity tables from the same N-Triples (see Lossless entity tables).

Querying locally

# Triple pattern — any of subject/predicate/object may be omitted (a wildcard):
rete query data.rete --predicate '<http://ex/knows>'
rete query data.rete --object   '<http://ex/Alice>'

# Basic Graph Pattern (multi-pattern join); ?x is a variable, ' . ' separates patterns:
rete bgp data.rete "?x <http://ex/knows> ?y . ?y <http://ex/knows> ?z"

# Full SPARQL (SELECT / ASK / CONSTRUCT / DESCRIBE):
rete sparql data.rete "PREFIX e: <http://ex/> SELECT ?x ?z WHERE { ?x e:knows ?y . ?y e:knows ?z }"

# Standard SPARQL Results JSON, for piping into other tools:
rete sparql data.rete "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10" --json

See the SPARQL support page for the full feature list.

Validating shapes

Use rete shacl when you need semantic data-quality checks, not just syntax or file integrity. Shapes are Turtle files; the command exits non-zero if the graph does not conform.

rete shacl data.rete --shapes shapes.ttl
rete shacl data.rete --shapes shapes.ttl --format json

See SHACL validation for the supported SHACL Core surface.

Inspecting a file

rete info   data.rete          # raw header
rete stats  data.rete          # size, counts, top predicates, planner stats, entity shapes
rete verify data.rete          # check the blake3 content hash (detect corruption)
rete graphs data.rete          # list named-graph IRIs
rete search data.rete gluc     # prefix-search labels (autocomplete; no literal scan)
rete export data.rete          # dump back to N-Quads (lossless)

Two coarse-graph views, plus exact per-predicate totals, answer questions without reading the triple index:

rete summary data.rete   # structural: Louvain community quotient graph
rete schema  data.rete   # semantic: relations between rdf:type classes
rete predicates data.rete  # exact per-predicate totals, from the summary alone

rete stats also prints two index-free profiles read from the pyramid: the planner stats (per predicate: distinct subjects/objects, multiplicities, and functional / inverse-functional hints — the cardinality the cost-based join planner uses) and the entity shapes (the most common characteristic sets — which predicate-combinations subjects carry, e.g. {type, name, age} ×N).
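To make the two profiles concrete, here is a small Python sketch that computes them from raw triples. It is only an illustration of the definitions: rete reads these numbers from the pyramid rather than recomputing them, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def characteristic_sets(triples):
    """Count how many subjects carry each exact combination of predicates."""
    predicates_by_subject = defaultdict(set)
    for s, p, _ in triples:
        predicates_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in predicates_by_subject.values())


def predicate_stats(triples):
    """Per predicate: triple count, distinct subjects/objects, functional hints."""
    subjects, objects, counts = defaultdict(set), defaultdict(set), Counter()
    for s, p, o in set(triples):  # ignore duplicate triples
        subjects[p].add(s)
        objects[p].add(o)
        counts[p] += 1
    return {
        p: {
            "triples": counts[p],
            "distinct_subjects": len(subjects[p]),
            "distinct_objects": len(objects[p]),
            # functional: no subject has more than one object for p
            "functional": counts[p] == len(subjects[p]),
            # inverse-functional: no object has more than one subject for p
            "inverse_functional": counts[p] == len(objects[p]),
        }
        for p in counts
    }


people = [
    ("a", "type", "Person"), ("a", "name", "Ann"), ("a", "age", "30"),
    ("b", "type", "Person"), ("b", "name", "Bob"), ("b", "age", "31"),
    ("c", "type", "Person"), ("c", "name", "Cy"),
]
for shape, n in characteristic_sets(people).most_common():
    print(sorted(shape), "x", n)
print(predicate_stats(people)["type"])
```

In the toy data, type is functional (one class per subject) but not inverse-functional (three subjects share one class), which is the kind of hint a join planner can use to estimate cardinality.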

Label search (autocomplete)

rete search data.rete <prefix> resolves a case-insensitive label prefix without scanning the literals. At build time rete extracts the display labels (rdfs:label, skos:prefLabel/altLabel, foaf:name, dc(terms):title, schema:name) of the most-connected subjects into a bounded, label-sorted block in the pyramid-meta; a prefix query is then a binary search over that block:

rete search data.rete "alan"          # label<TAB><iri> rows, case-insensitive
rete search data.rete "alan" --limit 5 --json   # [{"label":…,"subject":…}]
rete search data.rete                 # empty prefix → the first --limit labels

This is the fast path for autocomplete: a binary search plus a short walk, versus a FILTER(STRSTARTS(LCASE(?l), …)) scan over every label triple (measured ~22× faster at 6k labeled subjects; the gap widens with size — the scan is linear in the label count, the index is O(log n + matches)). The block is bounded (top 8,192 labeled subjects by degree), so on a very large graph it covers the prominent entities; an exhaustive match still needs the FILTER scan. Files built before this feature have no label index — rebuild to add it (the block is additive and backward-compatible, so old readers ignore it).
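The lookup itself is easy to picture. The following Python sketch models the block as a sorted list and uses the standard bisect module; it is a minimal illustration, not the on-disk layout. The function names are hypothetical, and folding case with str.casefold is an assumption about how case-insensitivity is achieved.

```python
from bisect import bisect_left


def build_label_block(labels, limit=8192):
    """Keep the `limit` highest-degree (label, iri, degree) rows, sorted by folded label."""
    top = sorted(labels, key=lambda row: row[2], reverse=True)[:limit]
    return sorted((label.casefold(), label, iri) for label, iri, _ in top)


def search_prefix(block, prefix, limit=10):
    """Binary-search the first candidate, then walk forward while the prefix holds."""
    key = prefix.casefold()
    i = bisect_left(block, (key,))  # first row whose folded label is >= key
    hits = []
    while i < len(block) and len(hits) < limit and block[i][0].startswith(key):
        hits.append(block[i][1:])  # (label, iri)
        i += 1
    return hits


labels = [
    ("Alan Turing", "<http://ex/turing>", 50),
    ("Alan Kay", "<http://ex/kay>", 30),
    ("alanine", "<http://ex/alanine>", 10),
    ("Glucose", "<http://ex/glucose>", 5),
]
block = build_label_block(labels)
for label, iri in search_prefix(block, "ALAN"):
    print(f"{label}\t{iri}")
```

An empty prefix matches every row, so the walk simply returns the first `limit` labels. Building the block with a small `limit` shows the bounded behaviour: low-degree subjects fall out of the block and can only be found by a scan.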

Full-text search (word / CONTAINS)

Label prefix search finds an entity by the start of its label. To find entities by a word anywhere in any of their literals, build with --text-index and query with rete search --contains:

rete build data.nt -o data.rete --text-index   # opt-in TEXT_INDEX section

rete search data.rete --contains glucose            # subjects whose literals say "glucose"
rete search data.rete --contains glucose phosphate  # AND — both words (any literal)
rete search data.rete --contains-prefix einst       # word starting with "einst…"
rete search data.rete --contains glucose --json     # [{"subject":…}]

Matching is whole-word and case-insensitive (the same tokenizer builds and queries: Unicode-alphanumeric runs, lowercased, length ≥ 2). The index maps each word to its sorted subject ids as its own range-readable section (§6.3 of the SPEC): the token table is read once, then each queried word faults only its posting list — so a --contains over a remote multi-GB file fetches kilobytes, not the whole index. It is opt-in because it is sizable; a build without --text-index is byte-identical to one built before the feature, and rete search --contains on such a file reports that there is no text index. rete stats shows the section's size when present.
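A toy version of the tokenizer and index, in Python, may help fix the semantics. It follows the rules stated above (Unicode-alphanumeric runs, lowercased, length ≥ 2, one sorted posting list per word), but the regular expression, the function names, and the in-memory dict are assumptions for illustration; the real index is a range-readable file section.

```python
import re

TOKEN = re.compile(r"[^\W_]+")  # runs of Unicode letters and digits


def tokenize(text):
    """Split into lowercased alphanumeric runs, dropping one-character tokens."""
    lowered = (match.lower() for match in TOKEN.findall(text))
    return [token for token in lowered if len(token) >= 2]


def build_text_index(literals):
    """Map each word to the sorted ids of subjects whose literals contain it."""
    postings = {}
    for subject_id, text in literals:
        for word in tokenize(text):
            postings.setdefault(word, set()).add(subject_id)
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


def contains_all(index, words):
    """AND query: subjects having every word in some literal (any literal)."""
    tokens = [t for w in words for t in tokenize(w)]  # same tokenizer as the build
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


def contains_prefix(index, prefix):
    """Subjects having any word that starts with `prefix`."""
    prefix = prefix.lower()
    hits = set()
    for word, ids in index.items():
        if word.startswith(prefix):
            hits.update(ids)
    return sorted(hits)


literals = [
    (1, "D-Glucose 6-phosphate"),
    (2, "glucose"),
    (3, "Albert Einstein, physicist"),
]
index = build_text_index(literals)
print(contains_all(index, ["glucose"]))
print(contains_all(index, ["glucose", "phosphate"]))
print(contains_prefix(index, "einst"))
```

Because each word owns its own posting list, a query only needs the lists for the words it names, which is what makes the remote case cheap.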

Deploying & querying over a URL

A .rete file is immutable and self-describing, so any static host that honors HTTP Range requests works — S3, GCS, GitHub, a CDN, or the bundled dev server:

# Local range-capable server for testing:
python3 scripts/range_server.py 8000 .
rete query-url http://127.0.0.1:8000/data.rete --object '<http://ex/Dave>'

# Real https hosts (rustls; no http-only limitation):
rete query-url   https://my-bucket.s3.amazonaws.com/data.rete --predicate '<http://ex/knows>'
rete summary-url https://raw.githubusercontent.com/me/repo/main/data.rete

query-url resolves bound terms from the dictionary, then fetches only the selected permutation payload (the best of the six) for the triple pattern. summary-url reads just the header, dictionary, and summary — the index is never downloaded. The host must return 206 Partial Content to a Range request; a host that ignores Range (returns 200) is rejected with a clear error rather than silently returning wrong bytes.
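The 206-versus-200 rule is worth seeing in code, because a client that skips it would read the start of the file as if it were the requested slice. The Python sketch below uses only the standard library; `check_range_response` and `fetch_range` are hypothetical names, not rete's API, and the check assumes the requested range lies inside the file.

```python
import re
import urllib.request

CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def check_range_response(status, content_range, start, end):
    """Validate a reply to `Range: bytes=start-end` (both ends inclusive)."""
    if status == 200:
        raise ValueError("host ignored the Range header (200 with the full body)")
    if status != 206:
        raise ValueError(f"unexpected HTTP status {status}")
    match = CONTENT_RANGE.fullmatch(content_range or "")
    if not match:
        raise ValueError(f"missing or malformed Content-Range: {content_range!r}")
    got_start, got_end = int(match.group(1)), int(match.group(2))
    if (got_start, got_end) != (start, end):
        raise ValueError(f"asked for {start}-{end}, got {got_start}-{got_end}")
    return got_end - got_start + 1  # number of bytes the body must contain


def fetch_range(url, start, end):
    """Fetch bytes [start, end] of `url`, refusing anything but a matching 206."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(request) as response:
        expected = check_range_response(
            response.status, response.headers.get("Content-Range"), start, end
        )
        body = response.read()
    if len(body) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(body)}")
    return body
```

Against the local server started above, `fetch_range("http://127.0.0.1:8000/data.rete", 0, 63)` would return the first 64 bytes of the file; a host that answers 200 raises instead of returning the wrong bytes.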

Generating synthetic test data

scripts/synth_graph.py generates a realistic scholarly knowledge graph (papers, authors, venues, institutions, grants, fields) with the statistics real graphs have: power-law citations via preferential attachment, field communities, Zipfian venue popularity, log-normal team sizes, typed literals, and per-year temporal structure. Two orthogonal knobs control it, size (--papers) and noise (--noise):

# 10k papers (~315k triples), clean:
uv run python scripts/synth_graph.py --papers 10000 -o clean.nt

# Same size, 20% deliberate mess (cross-field rewires, temporal violations,
# missing attributes, mangled literals) — for robustness/quality testing:
uv run python scripts/synth_graph.py --papers 10000 --noise 0.2 --seed 7 -o messy.nt

# N-Quads with one named graph per publication year:
uv run python scripts/synth_graph.py --papers 5000 --quads -o by-year.nq

rete build clean.nt -o clean.rete

Identical arguments + seed reproduce the byte-identical graph; different seeds give natural variability at the same size/noise point. A per-kind breakdown of every noise event goes to stderr, so a test knows exactly what mess it got. (scripts/gen_graph.py is the older, simpler social-graph generator used by scripts/bench.sh.)
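Two of those properties, seeded reproducibility and power-law citations via preferential attachment, fit in a short Python sketch. This is not the code in scripts/synth_graph.py; the function names, the `refs_per_paper` parameter, and the example IRIs are all hypothetical, and the model is deliberately minimal.

```python
import random


def citation_graph(papers, refs_per_paper=3, seed=0):
    """Return (citing, cited) pairs; already-cited papers attract more citations."""
    rng = random.Random(seed)  # private generator: same seed, same graph
    edges = []
    # `urn` holds one entry per paper plus one per citation received, so a
    # uniform draw picks a paper with probability proportional to 1 + in-degree.
    urn = []
    for paper in range(papers):
        targets = set()
        want = min(refs_per_paper, paper)  # cannot cite more papers than exist
        while len(targets) < want:
            targets.add(rng.choice(urn))
        for cited in sorted(targets):
            edges.append((paper, cited))
            urn.append(cited)
        urn.append(paper)
    return edges


def to_ntriples(edges):
    """Serialize edges as N-Triples lines (the IRIs are made up for the example)."""
    return "".join(
        f"<http://ex/paper/{a}> <http://ex/cites> <http://ex/paper/{b}> .\n"
        for a, b in edges
    )


first = to_ntriples(citation_graph(200, seed=7))
second = to_ntriples(citation_graph(200, seed=7))
print(first == second)  # same arguments and seed give identical output
```

Because every random draw goes through one `random.Random(seed)` instance and targets are emitted in sorted order, nothing in the output depends on set iteration order or global state.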

Scaling to ~1 GB

The generator and rete build scale linearly, so a big stress-test graph is just a bigger --papers. Output is roughly 85 bytes/triple as N-Triples and 31 triples/paper, so ~1 GB is about 400k papers / 12.5M triples:

uv run python scripts/synth_graph.py --papers 400000 --seed 1 -o big.nt  # ~1.1 GB, ~22 s
rete build big.nt -o big.rete                                            # ~56 s -> ~100 MB
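The sizing arithmetic behind those numbers is simple enough to script when you need a different target. A small Python sketch, using only the two approximate rates quoted above (the helper names are hypothetical):

```python
BYTES_PER_TRIPLE = 85   # approximate N-Triples bytes per triple (from the text)
TRIPLES_PER_PAPER = 31  # approximate triples per paper (from the text)


def papers_for_size(target_bytes):
    """Estimate the --papers value that yields about `target_bytes` of N-Triples."""
    return round(target_bytes / (BYTES_PER_TRIPLE * TRIPLES_PER_PAPER))


def estimate(papers):
    """Return (triples, bytes) expected for a given paper count."""
    triples = papers * TRIPLES_PER_PAPER
    return triples, triples * BYTES_PER_TRIPLE


print(papers_for_size(10**9))  # papers needed for roughly 1 GB
print(estimate(400_000))       # (triples, bytes) for the run above
```

With these rates, 1 GB works out to about 380k papers, and the 400k-paper run to about 12.4M triples and 1.05 GB, in line with the measured ~1.1 GB. Both rates are rough averages, so treat the result as a starting point.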

Measured on the dev container (12.5M triples, 2.0M terms, ~30k communities): the build takes ~56 s and the .rete is ~100 MB (zstd). What matters at this size is how little a query reads: a selective pattern answers in under a second, rete predicates reads ~20 MB of summary rather than the 80 MB index, and a lazy query open (rete cost big.rete "<query>") touches ~7 MB in ~50 range requests instead of the whole file. That is the range-query promise, at 1 GB.

The playground's scholar / scholar-noisy demo datasets are built with this generator (250 papers, seed 42, noise 0 and 0.25 — the exact commands are in the scripts/build_playground.py docstring).

Lossless entity tables (the best of both worlds)

scripts/rdf_to_entity_tables.py is the lossless counterpart to the columnar property tables described in the next section: it keeps the readable one-table-per-type shape without dropping anything. Each class table has the frequent properties as named LIST columns (occupation, citizenship, date of birth…) plus three things that make it complete: a types column (all P31 values, so a multi-typed entity lives in exactly one table, never duplicated), an extra MAP(predicate → objects) column that catches every other property (rare ones, all the multilingual labels), and an _untyped residual table for subjects with no type. Object values are stored as N-Triples term tokens (<iri>, "lit", "lit"@en), so IRIs, literals and language tags round-trip. Explode types + every column + extra across all tables and you get back exactly the triples; --verify checks that (reconstructed == input). It can emit Parquet, a DuckDB file, and a SQLite file (list/map columns as JSON text) in one run:

uv run python scripts/rdf_to_entity_tables.py --parts 1 --limit 12000000 --props 24 \
  --min-entities 50 -o data/ent --duckdb data/ent.duckdb --sqlite data/ent.sqlite --verify

--props only changes how many properties get their own column vs. land in extra — it never affects losslessness. The _manifest.parquet records each class's column → predicate map so reconstruction is mechanical, and N-Triples is the interchange hub (rete export ↔ rete build ↔ these tables).
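Why the column budget cannot affect losslessness is easiest to see in a toy round trip. The Python sketch below folds triples into one row per subject (types, named columns, extra map) and explodes them back. It is a simplified model, not the script's code: the function names are hypothetical, rows live in plain dicts rather than Parquet, and assigning each row to a class table is left out.

```python
from collections import defaultdict


def new_row():
    return {"types": [], "columns": defaultdict(list), "extra": defaultdict(list)}


def to_entity_rows(triples, type_predicate, column_predicates):
    """Fold triples into one row per subject: types, named columns, extra map."""
    rows = defaultdict(new_row)
    for s, p, o in triples:
        row = rows[s]
        if p == type_predicate:
            row["types"].append(o)
        elif p in column_predicates:
            row["columns"][p].append(o)  # frequent property: its own column
        else:
            row["extra"][p].append(o)    # everything else lands in the map
    return rows


def explode(rows, type_predicate):
    """Invert to_entity_rows: emit every triple the rows encode."""
    for s, row in rows.items():
        for o in row["types"]:
            yield (s, type_predicate, o)
        for part in (row["columns"], row["extra"]):
            for p, objects in part.items():
                for o in objects:
                    yield (s, p, o)


def verify(triples, type_predicate, column_predicates):
    """True when the exploded rows equal the input triples."""
    rows = to_entity_rows(triples, type_predicate, column_predicates)
    return sorted(explode(rows, type_predicate)) == sorted(triples)


TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
triples = [
    ("<http://ex/s1>", TYPE, "<http://ex/Human>"),
    ("<http://ex/s1>", TYPE, "<http://ex/Writer>"),
    ("<http://ex/s1>", "<http://ex/name>", '"Ann"@en'),
    ("<http://ex/s1>", "<http://ex/name>", '"Anne"@fr'),
    ("<http://ex/s2>", "<http://ex/born>", '"1900"'),  # no type: _untyped
]
print(verify(triples, TYPE, {"<http://ex/name>"}))  # name gets its own column
print(verify(triples, TYPE, set()))                 # everything lands in extra
```

Every triple goes to exactly one of the three places, so moving a predicate between a named column and extra changes the layout but never the set of triples that comes back.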

It works on any RDF, not just the Wikidata Parquet source: pass --nt <file> to read N-Triples directly (objects, language tags and datatypes round-trip verbatim) and --type-predicate <iri> to group by something other than Wikidata's P31 — e.g. rdf:type for OBO ontologies. This is how the chebi-full companions are built from the same chebi.nt as the .rete:

uv run python scripts/rdf_to_entity_tables.py --nt chebi.nt \
  --type-predicate "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" \
  --props 24 --min-entities 50 -o data/chebi-tables \
  --duckdb data/chebi.duckdb --sqlite data/chebi.sqlite --verify

Companion: columnar property tables (Parquet, split by type)

To compare the .rete graph against a columnar layout, scripts/rdf_to_property_tables.py denormalizes the same Wikidata triples into one Parquet table per entity type (the classic RDF "property table"): rows are entities, grouped by wdt:P31 (instance-of); columns are that class's most common structured properties, multi-valued as LIST(VARCHAR); an English label column is added and the labelling/description predicates are excluded so the columns are the real properties. It runs entirely in DuckDB from the source Parquet:

pip install --break-system-packages duckdb
uv run python scripts/rdf_to_property_tables.py --parts 10 --limit 120000000 -o data/wd-tables
# -> data/wd-tables/Q5.parquet (human), Q16521.parquet (taxon), … + _manifest.parquet

Each class table is independently queryable (SELECT … FROM 'Q5.parquet'), the _manifest.parquet maps class IRI → label/entity-count/file and each column id back to its predicate, and a single DuckDB over the set is just views: CREATE VIEW human AS SELECT * FROM 'data/wd-tables/Q5.parquet'. Match the .rete slice by passing the same --parts/--limit. This is a property-table companion for benchmarking storage/query tech against the graph format — not a lossless graph encoding (sparse properties become NULLs, heterogeneous classes get wide; that's the point of the comparison).

Alternative approach: a virtual knowledge graph over the companions

The companions above are materialized the rete way: triples → a .rete (the graph) plus tabular exports you query with SQL. The playground's Explore tab already queries those exports lazily over httpfs — DuckDB-WASM / SQLite-WASM fetch only the Parquet row-groups a query touches, the same range-read transport the .rete uses — so you can compare the same class across the rete engine and the columnar engines side by side.

A different school of thought skips materialization entirely. A Virtual Knowledge Graph (VKG / OBDA) — e.g. Ontop over DuckDB — keeps the Parquet as the source of truth and answers SPARQL by rewriting it to SQL at query time through declarative R2RML mappings; no RDF file is ever built (tech note). The two are complementary; the trade is materialized-vs-virtual:

| | rete (this project) | Virtual KG (Ontop + DuckDB) |
| --- | --- | --- |
| RDF | materialized into a graph-native `.rete` | virtual (never materialized) |
| Source of truth | the `.rete` file | the Parquet |
| SPARQL | answered directly over `.rete` (range reads) | rewritten to SQL over Parquet via mappings |
| Parquet's role | a tabular companion/export | the data |
| Trade-off | a build step; self-contained & graph-native (pyramid, communities, reachability, SHACL, coherence, provenance) | a mapping + a SPARQL→SQL engine at query time; always fresh, no ingestion |

Both lean on the same lazy transport — Parquet's footer + row groups are range-friendly the way the .rete header + tiles are — so the honest comparison isn't "lazy vs eager" but a graph-native materialized file vs a virtual SPARQL view over a columnar source.

rete's companions are already VKG-ready: the _manifest.parquet beside each one records the column → predicate map (and class IRI → table), which is the information an R2RML mapping needs, generated for free. Because the companions are range-readable Hugging Face (HF) objects, a VKG can ATTACH them in place. A single Ontop/DuckDB endpoint over every Parquet in the HF bucket, driven by the manifests, is a "knowledge graph over all of HF" with no download and no .rete: the federated, virtual mirror of rete's materialized file. (Materializing the same thing is also mechanical: the entity tables are lossless, so reconstruct each and merge, or build a .rete per dataset and rete federate across them.)

Benchmark candidate (TODO): rete's range-read SPARQL on a .rete vs an Ontop-over-DuckDB VKG on the equivalent Parquet — same queries, same HTTP-range hosting — measuring bytes fetched, latency, and which graph operations (pyramid / reachability / SHACL / coherence) the VKG can't answer cheaply.

A real-world graph: a Wikidata biology slice

For a real-world dataset, scripts/fetch_wikidata_bio.py pulls a life-sciences slice from the Wikidata Query Service (WDQS): genes, the proteins they encode, the diseases they associate with, drugs that treat those diseases, and a disease subclass hierarchy. The result is one connected graph with every entity labelled in English. The script runs a handful of bounded CONSTRUCT queries (each well under the WDQS timeout) and merges them as N-Triples.

uv run python scripts/fetch_wikidata_bio.py --limit 4000 -o data/wikidata-bio.nt
rete build data/wikidata-bio.nt -o bio.rete
rete stats bio.rete        # ~40k triples, ~27k terms, hundreds of communities

A --limit 4000 run is roughly 40,000 triples (≈2,800 genes, ≈4,000 proteins, ≈3,600 diseases) — the community pyramid finds hundreds of organism/disease clusters, and it exercises every surface: typed-class queries, label joins, the disease hierarchy via wdt:P279, and HTTP range queries over a real graph. --taxon Q83310 fetches mouse instead of human; --limit trades size against WDQS time. Output lands in data/ (git-ignored, like all fetched datasets — the script is tracked, the bytes are regenerated on demand). Be a good WDQS citizen: it is rate-limited, so fetch a slice, not a firehose.

Real Wikidata at gigabyte scale (Parquet)

The Query Service is for slices, not bulk. For a genuinely large, real linked-data graph, scripts/wikidata_parquet_to_nt.py reads the full Wikidata "truthy" dump from the piebro/wikidata-extraction Parquet conversion on Hugging Face (~80 partitions; subject/predicate/object/language columns) with DuckDB and writes N-Triples. httpfs streams the remote files, so a bounded slice needs no full download.

pip install --break-system-packages duckdb
uv run python scripts/wikidata_parquet_to_nt.py --limit 12000000 -o data/wd.nt  # ~1 GB
rete build data/wd.nt -o wd.rete

The source Parquet drops literal datatypes, so the converter recovers them (--datatypes, default auto): it resolves each property's datatype from a local cache or one WDQS wikibase:propertyType query (~13.5k properties, cached for reuse) and re-types each literal — dates xsd:dateTime, quantities xsd:decimal, coordinates geo:wktLiteral; strings stay plain, monolingual text keeps its language tag, entity values are IRIs. If that map is unavailable (e.g. WDQS rate-limited), it falls back to an offline heuristic that types the unambiguous values — ISO timestamps and WKT geometries — leaving numbers plain (a bare number is indistinguishable from a numeric external-id without the map). --datatypes heuristic forces the offline path; none emits plain literals. Once typed, the engine's DATATYPE(?o) / LANG(?o) filters can select by datatype.
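The offline heuristic can be sketched as a pair of pattern checks. The Python below is an assumption-laden illustration rather than the converter's code: the regular expressions, the restriction to WKT points, and the `type_literal` name are all hypothetical. It does follow the rule stated above: type only the unambiguous values and leave bare numbers plain.

```python
import re

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
GEO_WKT = "http://www.opengis.net/ont/geosparql#wktLiteral"

# Assumed shapes: a UTC timestamp and a two-coordinate WKT point.
ISO_TIMESTAMP = re.compile(r"[+-]?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")
WKT_POINT = re.compile(r"Point\(\s*-?\d+(\.\d+)?\s+-?\d+(\.\d+)?\s*\)", re.IGNORECASE)


def escape(value):
    """Escape the characters N-Triples requires inside a quoted literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def type_literal(value, language=None):
    """Return an N-Triples literal token, typing only unambiguous values."""
    quoted = f'"{escape(value)}"'
    if language:
        return f"{quoted}@{language}"        # monolingual text keeps its tag
    if ISO_TIMESTAMP.fullmatch(value):
        return f"{quoted}^^<{XSD_DATETIME}>"
    if WKT_POINT.fullmatch(value):
        return f"{quoted}^^<{GEO_WKT}>"
    return quoted                            # numbers and strings stay plain


for raw in ["1830-10-04T00:00:00Z", "Point(5.47 49.50)", "42"]:
    print(type_literal(raw))
print(type_literal("Bonjour", language="fr"))
```

The bare `42` stays plain on purpose: without the property-datatype map there is no way to tell a quantity from a numeric external identifier.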

Measured on the dev container, the full --limit 12000000 (~1 GB) run converts in ~24 s, streaming (1.25 GB of N-Triples, datatypes recovered), and builds in ~52 s to a 110 MB .rete with 5 pyramid levels and ~115k communities. Typed literals stay intact ("1830-10-04T00:00:00Z"^^xsd:dateTime, "Point(5.47 49.50)"^^geo:wktLiteral) and a selective lookup answers in under a second. The slice is a real cross-section of all of Wikidata (people, places, works, taxa, …); for a curated biology-only graph use the WDQS fetcher above. --parts N draws from N whole partitions; --local-dir reads partitions you have already downloaded.

Testing

cargo test            # full suite (unit, round-trip, robustness, ranged, HTTP)
cargo clippy --workspace --all-targets -- -D warnings
bash scripts/smoke.sh # end-to-end acceptance test of every CLI subcommand

CI runs all of this — plus the feature matrix and the wasm build — in containers, so nothing ever builds on the host.