Benchmark
Synthetic social graph, 139,093 triples (20k people in ~200 communities,
~5 knows edges each + age/name literals). Generated by
scripts/gen_graph.py, measured by scripts/bench.sh, run in the dev container
(release build, Rust 1.92). Query timings use the compiled binary directly
(target/release/rete), so they exclude cargo overhead.
Size
| Artifact | Bytes | vs raw |
|---|---|---|
| raw N-Triples | 8,384,101 | 1.0× |
gzip -9 of the N-Triples | 564,590 | 14.8× |
.rete (zstd + pyramid summary) | 1,302,792 | 6.4× |
.rete is ~2.3× the size of gzip while being queryable in place and over HTTP
ranges (gzip answers no query without a full download + scan). As of format v0.4
it stores all six permutation indexes (SPO/POS/OSP/SOP/PSO/OPS) — each triple
stored 6× — plus the dictionary and the small pyramid summary; that is what roughly
doubled the file versus the earlier three-permutation v0.3 (715,959 bytes), and the
payoff is that any triple pattern — not just the three canonical leads — resolves
to a contiguous scan with its bound components leading, which also unlocks
sort-merge joins. Per-community tiles remain out of the default file because they
duplicate every triple; exact single-pattern range routing fetches one selected
permutation payload instead of the whole index container, while physical
community-tile directories are the next storage step.
Build
| Step | Time |
|---|---|
rete build (parse → dict → 6 indexes → Louvain pyramid → zstd) | 320 ms |
Result: 3 pyramid levels, 137,512 quads, 40,063 terms. (Caching the dictionary section metadata cut build time ~3.4× — see finding 1.)
Build memory (big graphs)
The build's peak RAM is dominated by the raw string statements (every term an
owned String, heavily duplicated), which are redundant with the dictionary once
interned. Stream-parsing the file (the whole text is never resident) and dropping
the string statements right after encoding them to id-triples — before the
pyramid + index phases — cuts the peak. On a 3 M-triple / 189 MB N-Triples
build (synthetic, scripts-style entities with type/label/edges):
| OLD | optimized | |
|---|---|---|
process peak RSS (VmHWM) | 1371 MiB | 836 MiB (−39%) |
Output is byte-identical. Reproduce the phase-by-phase heap profile with
cargo run --release -p rete-bench -- --build-mem <file.nt> — it snapshots the
live heap (exact, via a counting allocator) after parse, dictionary, encode,
the drop, pyramid, index, and write.
Build time (big graphs)
The build is dominated by the Louvain community pyramid — ~91% of wall-clock
on the 3 M-triple build. Its local-moving inner loop now uses a reusable dense
scratch (community → edge-weight + touched-list reset) instead of a per-node
HashMap, with unchanged accumulation order and candidate scan, so the partition
and the file content hash are byte-identical. Set RETE_BUILD_TIMING=1 on
rete build for the per-phase breakdown.
| phase (3 M triples) | OLD | optimized |
|---|---|---|
build_dendrogram (Louvain) | 118.2 s | 44.1 s (2.7×) |
whole rete build (wall-clock) | ~141 s | 67 s (~2.1×) |
A parallel Louvain would not be byte-identical (the partition depends on the
sequential move order), so the pyramid stays single-threaded; the six index
permutations and the dictionary/permutation sorts already use all cores (rayon).
Query result serialization (the WASM hot path)
A query's result is returned to the browser as a JSON string. Writing that
envelope directly into a String (rete_core::results_envelope_json) instead
of building a serde_json::Value tree and stringifying it dominates the cost on a
large result. On a 500k-row SELECT (27 MB JSON), profiled with cargo run --release -p rete-bench -- --query-mem:
| serializer | peak heap | time |
|---|---|---|
Value tree + to_string | 670 MiB (25× the payload) | 408 ms |
direct to String | 50 MiB (1.9×) | 41 ms |
~13× less peak heap, ~10× faster, byte-identical JSON (a unit test pins the
escaping to serde_json). For reference the query itself (eval_query) was
290 MiB / 344 ms — i.e. the old serializer cost more than the query. On the
bounded WASM heap the 670 MiB peak was an OOM risk for big results.
Query latency (end-to-end: open + decompress + evaluate)
5-run averages of the compiled binary in the dev container (includes process startup each run; format v0.4, six permutations).
| Query | Time |
|---|---|
triple pattern (?s knows p100) | 33 ms |
2-hop BGP join (p0 knows ?y . ?y knows ?z) | 39 ms |
property path (p0 knows+ ?y, reaches whole graph) | 59 ms |
| GROUP BY COUNT (degree of every node) | 107 ms |
| per-predicate totals (summary only, index not read) | 24 ms |
Every dictionary lookup and triple-block decode is bounds-checked
(slice.get(..)? instead of raw indexing) so the reader cannot panic on
malformed input — a few percent on the resolution-heavy hot path, and worth it
(see finding 5). Resolution-light queries (triple pattern, summary) are
unaffected.
HTTP range
A point query over HTTP is bounded, PMTiles-style. The full SPARQL URL path still
opens the broad index section, but query-url and rete cost now have an exact
single-pattern route: header + dictionary + one selected permutation
payload. The progressive path is cheaper still for summary-safe queries:
fetching just the coarse community graph (summary-url) reads only the header +
dictionary + summary — e.g. 2.2 KB of a 15.6 KB file (14%) in 3 requests, the
index never fetched — the "overview first, drill down later" promise.
Scaling
At 347,884 triples (2.5× the table above) everything scales ~linearly, with no pathological blowup: build 890 ms, triple query 54 ms, property path 103 ms, GROUP BY 261 ms, predicate totals 26 ms (summary stays cheap regardless of size), file 3.31 MB (same ~6.4× vs raw / ~2.3× vs gzip ratios as the table above).
The pyramid: cost vs benefit
A .rete carries two pyramidal structures with very different economics. This
measures both, against --no-pyramid, on a typed subClassOf ontology fixture
(dev/gen_ontology_demo.py, scaled by entity count).
Bytes — file overhead, the growing super-edge summary, and what each read tier
fetches over HTTP (q ±pyr = a node-selective query with / without the pyramid):
| triples | file +pyr | file −pyr | pyr % | pyramid-meta | super-edges | card-url | summary-url | q +pyr | q −pyr |
|---|---|---|---|---|---|---|---|---|---|
| 8,595 | 99 KB | 90 KB | 9.6% | 8.7 KB | 1,364 | 32.7 KB | 17 KB | 43,823 | 43,823 |
| 34,354 | 282 KB | 264 KB | 7.2% | 19 KB | 3,455 | 32.8 KB | 43 KB | 68,295 | 68,295 |
| 137,279 | 990 KB | 946 KB | 4.6% | 43.6 KB | 8,580 | 33.0 KB | 116 KB | 111,054 | 111,054 |
Time (release build; median) — what the pyramid costs to build (Louvain
community detection + schema rollup, vs --no-pyramid) and how fast each read is:
| triples | build +pyr | build −pyr | pyramid cost | card read | summary read | selective query |
|---|---|---|---|---|---|---|
| 8,595 | 70 ms | 51 ms | +19 ms | 19 ms | 23 ms | 28 ms |
| 34,354 | 222 ms | 109 ms | +113 ms | 19 ms | 24 ms | 30 ms |
| 137,279 | 1,352 ms | 353 ms | +999 ms | 22 ms | 30 ms | 41 ms |
What the numbers say:
- The community super-edge summary scales with the graph — in bytes (super-edges
1.4k → 8.6k;
summary-url17 KB → 116 KB) and in time (the Louvain build cost is super-linear: +19 ms → +999 ms, ~4× the--no-pyramidbuild at 137k triples). - It gives node-selective lazy queries nothing —
q +pyrandq −pyrare byte-for-byte identical (43,823 == 43,823, …). For pure selective serving the pyramid is overhead;--no-pyramidis smaller and builds far faster. - The schema pyramid is the bounded one — its size and read cost track the
ontology, not the graph, so the leveled type/relation legend stays cheap at any
scale. The flat-cost champion is the card (
card-url≈ 33 KB and ≈ 20 ms read at every size — it skips the dictionary and super-edges entirely).
Rule of thumb: keep the pyramid for index-free overview / exploration; build
with --no-pyramid when you only serve selective queries at scale (smaller file,
~4× faster build, identical query latency). See the
Semantic zoom guide for the schema-pyramid side of this. Repro:
dev/bench_pyramid_value.sh (bytes) and dev/bench_pyramid_time.sh (time).
Coherence checking: tier costs
How much of a remote .rete each coherence check actually fetches, as the graph
grows. A synthetic medical ontology — a fixed T-Box with a planted unsatisfiable
class (:Relapsed ⊑ :Healthy and ⊑ :Sick, which are owl:disjointWith) plus a
few instance-level clashes — at increasing instance counts, measured through a
CountingReader over the file image. Repro:
cargo run --release --example coherence_bench [n …].
| instances | file | Tier-0 schema (index-free) | Tier-1 selective slice | Tier-2 full graph |
|---|---|---|---|---|
| 1,000 | 33 KB | 986 B · 2 req · 0.02 ms | 26 KB · 36 req · 7 ms | 33 KB · 36 req · 10 ms |
| 10,000 | 283 KB | 996 B · 2 req · 0.02 ms | 32 KB · 37 req · 54 ms | 171 KB · 38 req · 186 ms |
| 100,000 | 9.4 MB | 8.1 KB · 2 req · 0.09 ms | 346 KB · 53 req · 1.1 s | 1.8 MB · 48 req · 2.9 s |
| 500,000 | 48.8 MB | 8.1 KB · 2 req · 0.10 ms | 1.4 MB · 90 req · 7.5 s | 8.5 MB · 54 req · 15.9 s |
The three tiers find different defects, by design: Tier-0 finds the schema-level unsatisfiable class (1 point) from the ontology alone; Tier-1/Tier-2 also find the instance-level disjoint clashes (3 points). So Tier-0 is a fast "is the ontology coherent?" gate, and Tier-1/2 answer "is the data coherent?".
The headline: Tier-0 reads ~1–8 KB in 2 range requests at any graph size —
8.1 KB of a 48.8 MB file (0.017%), sub-0.1 ms. The check-an-ontology's-coherence-from-
a-few-KB-of-a-multi-GB-file claim holds. Tier-1 (selective) is the practical
instance-coherence check: it faults only the rdf:type + T-Box tiles — 346 KB of
9.4 MB at 100k, 1.4 MB of 48.8 MB at 500k — while Tier-2 reads the whole graph.
Three fixes this benchmark drove
detect_inconsistencieswas O(types²). The disjoint-class check looped over every pair ofrdf:typetriples in the whole graph (~4.3 s at 10k instances). Grouping each individual's class set first and checking only the pairs within one individual makes it ~linear: Tier-1 4320 ms → 54 ms, Tier-2 4607 ms → 186 ms at 10k. (The functional-property check had the same shape and got the same fix.)- Tier-0 was reading the whole dictionary.
check_schemawent throughSummaryView, which fetches the dictionary to label predicates — but coherence needs none of it (the schema pyramid carries its own class-string table). A dictionary-freeread_schema_coherence_rangedcut Tier-0 from 20.9 KB → ~2 KB at 10k. - Tier-0 then still scaled with the community summary. The schema pyramid is
bounded by the ontology, but it shared the pyramid-meta section with the community
super-edge summary, which grows with the entity count (every labelled node is a
vertex) — so reading the whole pyramid-meta cost 6.7 MB at 100k, 36 MB at 500k.
The fix: the writer now records the trailing schema block's byte length in the
header (the 4 previously-reserved bytes), so a reader fetches only that block.
Tier-0 dropped to a flat 8.1 KB — a ~4400× cut at 500k. Backward-compatible: a
file without the header field (
schema_meta_len == 0) falls back to the whole-section read, and a typeless file's header byte stays 0, so it remains byte-identical.
Comparison vs Oxigraph (real OpenCitations network)
Dataset: the real OpenCitations citation network for the AlphaFold paper (Nature 2021), enriched with disciplines, authors, venues and citation counts (539,246 triples loaded into both engines)
This pits rete (a queryable file) against Oxigraph (a full in-memory triplestore with a mature SPARQL planner). Honest summary: rete opens much faster because its indexes are prebuilt in the file, wins several scan/aggregate shapes, still loses where Oxigraph can stream, stop early, or apply a mature planner, and wins decisively on multi-source reachability.
This section is generated from
docs/benchmark-opencitations.json with
scripts/render_benchmark_doc.py. Latest run:
2026-06-15. Oxigraph 0.5, in-memory store
(no RocksDB). Machine: 32 logical cores.
Load / open (one-time)
| Engine | Step | Time | Resident heap after load |
|---|---|---|---|
| rete | Rete::open - indexes already built in the file | 19.9 ms | 13.35 MiB |
| Oxigraph | bulk-load N-Triples + build in-memory indexes | 2437 ms | 144.98 MiB |
Process peak RSS after both loads (VmHWM): 207 MiB.
rete's "load" just maps a file whose dictionary + permutation indexes already exist on disk; Oxigraph parses the triples and builds its indexes on every startup. This is the format's core promise: publish once, open instantly, query in place.
SPARQL operator coverage (eager, lazy range-read, and Oxigraph)
24 queries spanning supported forms and operators, on three
engines: rete eager (Rete::open, whole file in memory), rete lazy
(Rete::open_ranged_lazy — a fresh cold open + HTTP-range-style reads per
query, the path a browser/remote client runs), and Oxigraph (in-memory). Row
counts are a cross-engine correctness check across the language surface, not
just a speed race. Median ±sd of 5 warm runs. lazy fetched is
the bytes that one query pulled from the file image — the rest is never touched.
| Operator / form | rete eager | rete lazy | lazy fetched | Oxigraph | eager vs oxi | rows | ok |
|---|---|---|---|---|---|---|---|
| SELECT count (aggregate) | 3.78 ±0.85 ms | 4.65 ±0.62 ms | 77.4 KB | 9.14 ±0.62 ms | 2.4x | 1 | yes |
| SELECT DISTINCT | 4.79 ±0.52 ms | 5.44 ±0.32 ms | 51.2 KB | 10.6 ±1.9 ms | 2.2x | 6 | yes |
| ASK | 0.06 ±0.20 ms | 0.67 ±0.03 ms | 51.2 KB | 0.01 ±0.04 ms | 0.2x | 1 | yes |
| CONSTRUCT | 0.01 ±0.02 ms | 1.84 ±1.40 ms | 70.1 KB | 0.05 ±0.51 ms | 3.7x | 9 | yes |
| DESCRIBE (impl-defined) | 0.01 ±0.00 ms | 0.64 ±0.02 ms | 70.1 KB | 0.03 ±0.05 ms | 2.8x | 11 | yes |
| VALUES (inline data) | 5.22 ±0.45 ms | 6.63 ±1.29 ms | 177.3 KB | 11.0 ±4.4 ms | 2.1x | 10962 | yes |
| UNION | 7.03 ±1.27 ms | 7.97 ±1.39 ms | 177.3 KB | 12.9 ±1.7 ms | 1.8x | 10993 | yes |
| OPTIONAL (left join) | 0.30 ±0.03 ms | 2.28 ±0.59 ms | 125.6 KB | 0.41 ±0.16 ms | 1.4x | 200 | yes |
| MINUS | 2.64 ±0.98 ms | 6.34 ±1.46 ms | 200.5 KB | 3.29 ±2.00 ms | 1.2x | 2728 | yes |
| FILTER NOT EXISTS | 2.15 ±0.20 ms | 6.69 ±1.18 ms | 200.5 KB | 10.9 ±4.3 ms | 5.1x | 2728 | yes |
| 3-way join + LIMIT | 0.18 ±0.04 ms | 1.75 ±0.16 ms | 143.1 KB | 0.19 ±0.12 ms | 1.0x | 50 | yes |
| FILTER REGEX (case-insens.) | 6.99 ±0.69 ms | 8.24 ±0.88 ms | 79.5 KB | 0.90 ±0.24 ms | 0.1x | 200 | yes |
| FILTER arith + logical | 0.27 ±0.06 ms | 1.88 ±0.25 ms | 204.5 KB | 0.75 ±0.33 ms | 2.7x | 200 | yes |
| BIND + SUBSTR + CONCAT | 0.36 ±0.02 ms | 2.56 ±0.61 ms | 151.7 KB | 0.33 ±0.15 ms | 0.9x | 200 | yes |
| path sequence a/b | 0.20 ±0.02 ms | 1.75 ±0.14 ms | 162.5 KB | 0.25 ±0.09 ms | 1.2x | 200 | yes |
| path inverse ^p (count) | 3.32 ±0.35 ms | 3.95 ±0.51 ms | 77.4 KB | 9.71 ±0.92 ms | 2.9x | 1 | yes |
| path + transitive (count) | 6.94 ±0.18 ms | 8.74 ±0.60 ms | 89.5 KB | 14.0 ±1.6 ms | 2.0x | 1 | yes |
| path * zero-or-more (count) | 6.55 ±0.27 ms | 7.03 ±0.16 ms | 89.5 KB | 10.4 ±0.8 ms | 1.6x | 1 | yes |
| GROUP BY + ORDER BY | 3.81 ±0.60 ms | 3.53 ±0.07 ms | 51.2 KB | 10.7 ±1.8 ms | 2.8x | 6 | yes |
| GROUP BY + HAVING | 6.12 ±2.18 ms | 6.94 ±2.14 ms | 51.2 KB | 10.3 ±0.4 ms | 1.7x | 5 | yes |
| AVG per group | 27.8 ±7.5 ms | 31.9 ±5.5 ms | 95.0 KB | 43.9 ±2.3 ms | 1.6x | 6 | yes |
| MIN/MAX/SUM | 9.14 ±2.02 ms | 5.51 ±0.91 ms | 78.5 KB | 10.7 ±0.6 ms | 1.2x | 1 | yes |
| COUNT(DISTINCT) | 2.48 ±0.08 ms | 2.74 ±0.09 ms | 57.4 KB | 7.73 ±0.29 ms | 3.1x | 1 | yes |
| ORDER BY + LIMIT + OFFSET | 4.05 ±0.06 ms | 5.02 ±0.15 ms | 125.0 KB | 22.7 ±1.0 ms | 5.6x | 10 | yes |
24 / 24 identical row counts across SELECT/ASK/CONSTRUCT/DESCRIBE, algebra operators, filters/functions, property paths, and aggregates.
Reading the times honestly:
- rete now wins or ties the large majority of shapes: aggregates,
GROUP BY, DISTINCT, path closures, sorted pagination (top-k), and
large UNION/VALUES/MINUS results. The engine evaluates as a lazy
pipeline over integer slot rows, switches to index-nested-loop
probes under small LIMIT/ASK demand, and resolves terms only at
projection. A hybrid BGP join also switches to per-row index
probes mid-join once the running intermediate is small (so a
star/snowflake's trailing
?x rdf:type Cchecks cost a few lookups, not a full-extent scan), and never seeds a join from a class enumeration — together cutting the LUBM snowflakes Q4 ~14x and Q7 ~12x. The index stores all six permutations (format v0.4), so any join key can be streamed pre-sorted for a sort-merge join; at these sizes the hash/probe joins already win, so the merge path is neutral here — it pays off at much larger scale, for ~2x the index payload. - Oxigraph keeps an edge on REGEX scans whose pattern needs the real regex engine (rete's regex-lite has no literal prefilter; literal patterns use a substring fast path) and, by fractions of a millisecond, on ASK and the tightest LIMIT joins — floor effects, not the orders-of-magnitude gaps from before the engine rework.
- The lazy range-read engine — what a browser or remote client runs
over HTTP — adds only a ~1 ms cold-open tax over eager while fetching
single-digit-percent of the file per query (here ~50–205 KB of a 2.8 MB
file), and still beats in-memory Oxigraph on most shapes. Querying a
.retewhere it sits (S3 / HTTP, no server) is not a latency compromise.
Batch transitive reachability - coauthor+ from 300 seeds
"From each seed author, who is reachable through co-authorship?" rete
exposes this as a dedicated primitive (rete reach, batch_reach_*); on
Oxigraph it is a coauthor+ property path evaluated per seed.
| Engine / mode | Time | vs rete-serial |
|---|---|---|
rete - batch_reach_serial (1 core) | 641 ±101 ms | 1.0x |
rete - batch_reach_parallel (32 cores) | 39.0 ±12.4 ms | 16.4x |
Oxigraph - coauthor+ property path, per seed | 3026 ±48 ms | 0.2x |
rete serial and parallel both reached 1,636,200 nodes; Oxigraph touched 1,636,200 result cells. The dedicated parallel primitive is a different abstraction level from a general SPARQL property path, so read this as: use the right tool for multi-source reach.
Reproduce
# In the dev container (Docker). The OpenCitations + synthetic-enrichment data
# comes from scripts/fetch_opencitations.py + scripts/enrich.py (-> enriched-all.nt).
# Sanitize malformed compound-DOI IRIs so both engines load identical data:
grep -vE "<[^>]* [^>]*>" data/opencitations/enriched-all.nt \
> data/opencitations/enriched-clean.nt
./target/release/rete build data/opencitations/enriched-clean.nt \
-o data/opencitations/enriched-clean.rete
cargo build --release -p rete-bench
./target/release/rete-bench --json data/opencitations/enriched-clean.rete \
data/opencitations/enriched-clean.nt 300 > /tmp/bench.json
# preserve the curated dataset note + run date from the existing JSON:
uv run python scripts/merge_bench_metadata.py /tmp/bench.json \
docs/benchmark-opencitations.json --date $(date +%F)
uv run python scripts/render_benchmark_doc.py docs/benchmark-opencitations.json \
--input docs/BENCHMARK.md --output docs/BENCHMARK.md
cargo run -p docgen
The rete-bench crate pulls in Oxigraph only for this comparison; its
in-memory store needs no RocksDB/clang, so default-features = false keeps
the build light.
LUBM-style benchmark (standard 14 queries)
The LUBM data model and its 14
standard queries, on identical pre-materialized data in both engines.
LUBM(1): 93,139 generated +
26,007 RDFS/OWL-RL-materialized =
118,680 triples (.rete: 555,581 bytes).
Read the caveats: the generator is a faithful reimplementation of
UBA's documented cardinalities, not the official Java tool, so counts are
not comparable to published LUBM figures; inference is rete's RDFS/OWL-RL
subset, applied up front to the data both engines load — so
restriction-defined classes (Q12's Chair) are empty on both sides by
construction. The correctness anchor is cross-engine row parity:
14 / 14 queries agree.
| Engine | Load | Resident heap |
|---|---|---|
rete Rete::open | 6.2 ms | 3.66 MiB |
| Oxigraph bulk-load | 180 ms | 37.50 MiB |
| Query | rete | Oxigraph | rete vs oxi | peak heap MiB (rete / oxi) | rows | ok |
|---|---|---|---|---|---|---|
| Q1 grad students of a course | 0.30 ±0.03 ms | 0.83 ±0.25 ms | 2.8x | 0.00 / 0.00 | 3 | yes |
| Q2 grad student / univ / dept triangle | 3.96 ±0.18 ms | 5.36 ±0.55 ms | 1.4x | 1.57 / 0.01 | 2036 | yes |
| Q3 publications of a professor | 0.77 ±0.09 ms | 2.42 ±0.30 ms | 3.1x | 0.01 / 0.00 | 9 | yes |
| Q4 professors of a department (+3 props) | 0.24 ±0.02 ms | 0.49 ±0.03 ms | 2.1x | 0.09 / 0.01 | 26 | yes |
| Q5 persons member of a department | 1.39 ±0.10 ms | 6.49 ±0.52 ms | 4.7x | 0.34 / 0.00 | 544 | yes |
| Q6 all students | 2.47 ±0.39 ms | 1.71 ±0.42 ms | 0.7x | 4.39 / 0.00 | 7232 | yes |
| Q7 students of a professor's courses | 0.94 ±0.07 ms | 0.08 ±0.02 ms | 0.1x | 0.04 / 0.01 | 49 | yes |
| Q8 students of a university's departments | 11.7 ±1.5 ms | 18.8 ±0.7 ms | 1.6x | 5.05 / 0.01 | 7232 | yes |
| Q9 student / advisor / course triangle | 6.31 ±0.33 ms | 12.1 ±1.1 ms | 1.9x | 1.53 / 0.01 | 115 | yes |
| Q10 students of a graduate course | 0.75 ±0.08 ms | 2.91 ±0.46 ms | 3.9x | 0.00 / 0.00 | 0 | yes |
| Q11 research groups of a university | 0.21 ±0.02 ms | 0.15 ±0.02 ms | 0.7x | 0.14 / 0.00 | 230 | yes |
| Q12 chairs of a university's departments | 0.01 ±0.00 ms | 0.02 ±0.00 ms | 2.4x | 0.00 / 0.01 | 0 | yes |
| Q13 alumni of a university | 2.31 ±0.16 ms | 2.47 ±0.42 ms | 1.1x | 1.62 / 0.00 | 2655 | yes |
| Q14 all undergraduate students | 2.37 ±0.13 ms | 1.75 ±0.54 ms | 0.7x | 4.39 / 0.00 | 7232 | yes |
Reproduce: cargo run --release -p rete-bench -- --json --lubm 1 > docs/benchmark-lubm.json then re-render this doc.
Parallelism
A prototype data-parallel query evaluator (rete-core feature parallel,
module parallel.rs, behind rayon — off by default, never compiled into
wasm) splits decomposable work across communities/seeds and harmonizes the
partials. Every parallel workload ships with a serial reference that produces a
bit-identical result; the benchmark asserts serial == parallel (RESULTS MATCH) before trusting any timing. Numbers below are from
cargo run --release -p rete-core --example parallel_bench --features parallel
in the dev container, best-of-N after a warm-up.
Machine: 32 cores (available_parallelism = 32, rayon = 32 threads).
137,499 triples (20k people, 184 communities)
| Workload | Serial | Parallel | Speedup | Match |
|---|---|---|---|---|
| A — predicate count (per-community + sum) | 2.68 ms | 1.71 ms | 1.57× | ✓ |
| B — per-subject out-degree (per-community + BTreeMap merge) | 2.77 ms | 2.75 ms | 1.01× | ✓ |
| C — batch reachability, 512 seeds (embarrassingly parallel) | 2,648.9 ms | 186.8 ms | 14.18× | ✓ |
343,844 triples (50k people, 346 communities) — 2.5× scale
| Workload | Serial | Parallel | Speedup | Match |
|---|---|---|---|---|
| A — predicate count | 7.08 ms | 1.44 ms | 4.92× | ✓ |
| B — per-subject out-degree | 7.94 ms | 4.87 ms | 1.63× | ✓ |
| C — batch reachability, 512 seeds | 9,057.8 ms | 584.6 ms | 15.50× | ✓ |
Honest analysis
- Batch reachability is the clear win (14–15×). N independent single-source
BFS traversals over a shared read-only adjacency map are embarrassingly
parallel — one rayon task per seed, no shared mutable state, no merge step.
This is the realistic use case: multi-source impact analysis ("what does each
of these 512 nodes transitively reach?"). It does not hit linear 32× because
the per-seed reach sets are huge (~20k–50k nodes each) and the work per seed
varies, so memory bandwidth and load imbalance cap the gain — but a 14–15×
wall-clock cut on a 32-core box is real and reproducible. This workload is
shipped as a command:
rete reach --parallel(see the CLI reference). - Per-subject out-degree barely moves (1.01× at 137k, 1.63× at 344k). The
per-community map-build parallelizes, but the harmonize — merging hundreds of
BTreeMaps into one — is serial in thereduceand dominates when the per-tile work is tiny. It only starts to pay (1.63×) once each tile carries enough triples to amortize the merge. This is textbook Amdahl: the serial merge fraction caps the speedup. - Predicate count is the "cheap scan" caveat, reported plainly. At 137k the
whole scan is ~2 ms, so rayon's fork/join overhead is on the same order as the
work and the speedup is noise-dominated (1.57×). It looks better than it
should because the comparison isn't perfectly apples-to-apples: the serial
reference walks the engine's index (
match_ids, which materializes aVecof every match), while the parallel side filters pre-tiled in-memory triples. Lesson: a sub-millisecond-per-core workload is not worth parallelizing — the tiling + dispatch overhead eats the gain, and any apparent win is fragile. - This is CPU-side only. The HTTP/range query path is I/O-bound (latency is
in the network round-trips, not the CPU), so threading the evaluator does not
help there. And wasm has no threads at all — which is exactly why
parallelis an opt-in native feature and the wasm crate builds with--no-default-features(verified: rayon is never compiled for wasm).
Conclusion on "split across communities + harmonize": it helps when (a) the per-partition work is substantial relative to fork/join + merge overhead, and (b) the harmonize step is cheap or itself parallel. Batch reachability nails both; per-community aggregations only pay off at scale once the serial merge is amortized; cheap single scans should stay serial.