Benchmark

Synthetic social graph, 139,093 triples (20k people in ~200 communities, ~5 knows edges each + age/name literals). Generated by scripts/gen_graph.py, measured by scripts/bench.sh, run in the dev container (release build, Rust 1.92). Query timings use the compiled binary directly (target/release/rete), so they exclude cargo overhead.

Size

Artifact	Bytes	vs raw
raw N-Triples	8,384,101	1.0×
`gzip -9` of the N-Triples	564,590	14.8×
`.rete` (zstd + pyramid summary)	1,302,792	6.4×

.rete is ~2.3× the size of gzip while being queryable in place and over HTTP ranges (gzip answers no query without a full download + scan). As of format v0.4 it stores all six permutation indexes (SPO/POS/OSP/SOP/PSO/OPS) — each triple stored 6× — plus the dictionary and the small pyramid summary; that is what roughly doubled the file versus the earlier three-permutation v0.3 (715,959 bytes), and the payoff is that any triple pattern — not just the three canonical leads — resolves to a contiguous scan with its bound components leading, which also unlocks sort-merge joins. Per-community tiles remain out of the default file because they duplicate every triple; exact single-pattern range routing fetches one selected permutation payload instead of the whole index container, while physical community-tile directories are the next storage step.

Build

Step	Time
`rete build` (parse → dict → 6 indexes → Louvain pyramid → zstd)	320 ms

Result: 3 pyramid levels, 137,512 quads, 40,063 terms. (Caching the dictionary section metadata cut build time ~3.4× — see finding 1.)

Build memory (big graphs)

The build's peak RAM is dominated by the raw string statements (every term an owned String, heavily duplicated), which are redundant with the dictionary once interned. Stream-parsing the file (the whole text is never resident) and dropping the string statements right after encoding them to id-triples — before the pyramid + index phases — cuts the peak. On a 3 M-triple / 189 MB N-Triples build (synthetic, scripts-style entities with type/label/edges):

	OLD	optimized
process peak RSS (`VmHWM`)	1371 MiB	836 MiB (−39%)

Output is byte-identical. Reproduce the phase-by-phase heap profile with cargo run --release -p rete-bench -- --build-mem <file.nt> — it snapshots the live heap (exact, via a counting allocator) after parse, dictionary, encode, the drop, pyramid, index, and write.

Build time (big graphs)

The build is dominated by the Louvain community pyramid — ~91% of wall-clock on the 3 M-triple build. Its local-moving inner loop now uses a reusable dense scratch (community → edge-weight + touched-list reset) instead of a per-node HashMap, with unchanged accumulation order and candidate scan, so the partition and the file content hash are byte-identical. Set RETE_BUILD_TIMING=1 on rete build for the per-phase breakdown.

phase (3 M triples)	OLD	optimized
`build_dendrogram` (Louvain)	118.2 s	44.1 s (2.7×)
whole `rete build` (wall-clock)	~141 s	67 s (~2.1×)

A parallel Louvain would not be byte-identical (the partition depends on the sequential move order), so the pyramid stays single-threaded; the six index permutations and the dictionary/permutation sorts already use all cores (rayon).

Query result serialization (the WASM hot path)

A query's result is returned to the browser as a JSON string. Writing that envelope directly into a String (rete_core::results_envelope_json) instead of building a serde_json::Value tree and stringifying it dominates the cost on a large result. On a 500k-row SELECT (27 MB JSON), profiled with cargo run --release -p rete-bench -- --query-mem:

serializer	peak heap	time
`Value` tree + `to_string`	670 MiB (25× the payload)	408 ms
direct to `String`	50 MiB (1.9×)	41 ms

~13× less peak heap, ~10× faster, byte-identical JSON (a unit test pins the escaping to serde_json). For reference the query itself (eval_query) was 290 MiB / 344 ms — i.e. the old serializer cost more than the query. On the bounded WASM heap the 670 MiB peak was an OOM risk for big results.

Query latency (end-to-end: open + decompress + evaluate)

5-run averages of the compiled binary in the dev container (includes process startup each run; format v0.4, six permutations).

Query	Time
triple pattern (`?s knows p100`)	33 ms
2-hop BGP join (`p0 knows ?y . ?y knows ?z`)	39 ms
property path (`p0 knows+ ?y`, reaches whole graph)	59 ms
GROUP BY COUNT (degree of every node)	107 ms
per-predicate totals (summary only, index not read)	24 ms

Every dictionary lookup and triple-block decode is bounds-checked (slice.get(..)? instead of raw indexing) so the reader cannot panic on malformed input — a few percent on the resolution-heavy hot path, and worth it (see finding 5). Resolution-light queries (triple pattern, summary) are unaffected.

HTTP range

A point query over HTTP is bounded, PMTiles-style. The full SPARQL URL path still opens the broad index section, but query-url and rete cost now have an exact single-pattern route: header + dictionary + one selected permutation payload. The progressive path is cheaper still for summary-safe queries: fetching just the coarse community graph (summary-url) reads only the header + dictionary + summary — e.g. 2.2 KB of a 15.6 KB file (14%) in 3 requests, the index never fetched — the "overview first, drill down later" promise.

Scaling

At 347,884 triples (2.5× the table above) everything scales ~linearly, with no pathological blowup: build 890 ms, triple query 54 ms, property path 103 ms, GROUP BY 261 ms, predicate totals 26 ms (summary stays cheap regardless of size), file 3.31 MB (same ~6.4× vs raw / ~2.3× vs gzip ratios as the table above).

The pyramid: cost vs benefit

A .rete carries two pyramidal structures with very different economics. This measures both, against --no-pyramid, on a typed subClassOf ontology fixture (dev/gen_ontology_demo.py, scaled by entity count).

Bytes — file overhead, the growing super-edge summary, and what each read tier fetches over HTTP (q ±pyr = a node-selective query with / without the pyramid):

triples	file +pyr	file −pyr	pyr %	pyramid-meta	super-edges	card-url	summary-url	q +pyr	q −pyr
8,595	99 KB	90 KB	9.6%	8.7 KB	1,364	32.7 KB	17 KB	43,823	43,823
34,354	282 KB	264 KB	7.2%	19 KB	3,455	32.8 KB	43 KB	68,295	68,295
137,279	990 KB	946 KB	4.6%	43.6 KB	8,580	33.0 KB	116 KB	111,054	111,054

Time (release build; median) — what the pyramid costs to build (Louvain community detection + schema rollup, vs --no-pyramid) and how fast each read is:

triples	build +pyr	build −pyr	pyramid cost	card read	summary read	selective query
8,595	70 ms	51 ms	+19 ms	19 ms	23 ms	28 ms
34,354	222 ms	109 ms	+113 ms	19 ms	24 ms	30 ms
137,279	1,352 ms	353 ms	+999 ms	22 ms	30 ms	41 ms

What the numbers say:

The community super-edge summary scales with the graph — in bytes (super-edges 1.4k → 8.6k; summary-url 17 KB → 116 KB) and in time (the Louvain build cost is super-linear: +19 ms → +999 ms, ~4× the --no-pyramid build at 137k triples).
It gives node-selective lazy queries nothing — q +pyr and q −pyr are byte-for-byte identical (43,823 == 43,823, …). For pure selective serving the pyramid is overhead; --no-pyramid is smaller and builds far faster.
The schema pyramid is the bounded one — its size and read cost track the ontology, not the graph, so the leveled type/relation legend stays cheap at any scale. The flat-cost champion is the card (card-url ≈ 33 KB and ≈ 20 ms read at every size — it skips the dictionary and super-edges entirely).

Rule of thumb: keep the pyramid for index-free overview / exploration; build with --no-pyramid when you only serve selective queries at scale (smaller file, ~4× faster build, identical query latency). See the Semantic zoom guide for the schema-pyramid side of this. Repro: dev/bench_pyramid_value.sh (bytes) and dev/bench_pyramid_time.sh (time).

Coherence checking: tier costs

How much of a remote .rete each coherence check actually fetches, as the graph grows. A synthetic medical ontology — a fixed T-Box with a planted unsatisfiable class (:Relapsed ⊑ :Healthy and ⊑ :Sick, which are owl:disjointWith) plus a few instance-level clashes — at increasing instance counts, measured through a CountingReader over the file image. Repro: cargo run --release --example coherence_bench [n …].

instances	file	Tier-0 schema (index-free)	Tier-1 selective slice	Tier-2 full graph
1,000	33 KB	986 B · 2 req · 0.02 ms	26 KB · 36 req · 7 ms	33 KB · 36 req · 10 ms
10,000	283 KB	996 B · 2 req · 0.02 ms	32 KB · 37 req · 54 ms	171 KB · 38 req · 186 ms
100,000	9.4 MB	8.1 KB · 2 req · 0.09 ms	346 KB · 53 req · 1.1 s	1.8 MB · 48 req · 2.9 s
500,000	48.8 MB	8.1 KB · 2 req · 0.10 ms	1.4 MB · 90 req · 7.5 s	8.5 MB · 54 req · 15.9 s

The three tiers find different defects, by design: Tier-0 finds the schema-level unsatisfiable class (1 point) from the ontology alone; Tier-1/Tier-2 also find the instance-level disjoint clashes (3 points). So Tier-0 is a fast "is the ontology coherent?" gate, and Tier-1/2 answer "is the data coherent?".

The headline: Tier-0 reads ~1–8 KB in 2 range requests at any graph size — 8.1 KB of a 48.8 MB file (0.017%), sub-0.1 ms. The check-an-ontology's-coherence-from- a-few-KB-of-a-multi-GB-file claim holds. Tier-1 (selective) is the practical instance-coherence check: it faults only the rdf:type + T-Box tiles — 346 KB of 9.4 MB at 100k, 1.4 MB of 48.8 MB at 500k — while Tier-2 reads the whole graph.

Three fixes this benchmark drove

detect_inconsistencies was O(types²). The disjoint-class check looped over every pair of rdf:type triples in the whole graph (~4.3 s at 10k instances). Grouping each individual's class set first and checking only the pairs within one individual makes it ~linear: Tier-1 4320 ms → 54 ms, Tier-2 4607 ms → 186 ms at 10k. (The functional-property check had the same shape and got the same fix.)
Tier-0 was reading the whole dictionary. check_schema went through SummaryView, which fetches the dictionary to label predicates — but coherence needs none of it (the schema pyramid carries its own class-string table). A dictionary-free read_schema_coherence_ranged cut Tier-0 from 20.9 KB → ~2 KB at 10k.
Tier-0 then still scaled with the community summary. The schema pyramid is bounded by the ontology, but it shared the pyramid-meta section with the community super-edge summary, which grows with the entity count (every labelled node is a vertex) — so reading the whole pyramid-meta cost 6.7 MB at 100k, 36 MB at 500k. The fix: the writer now records the trailing schema block's byte length in the header (the 4 previously-reserved bytes), so a reader fetches only that block. Tier-0 dropped to a flat 8.1 KB — a ~4400× cut at 500k. Backward-compatible: a file without the header field (schema_meta_len == 0) falls back to the whole-section read, and a typeless file's header byte stays 0, so it remains byte-identical.

Comparison vs Oxigraph (real OpenCitations network)

Dataset: the real OpenCitations citation network for the AlphaFold paper (Nature 2021), enriched with disciplines, authors, venues and citation counts (539,246 triples loaded into both engines)

This pits rete (a queryable file) against Oxigraph (a full in-memory triplestore with a mature SPARQL planner). Honest summary: rete opens much faster because its indexes are prebuilt in the file, wins several scan/aggregate shapes, still loses where Oxigraph can stream, stop early, or apply a mature planner, and wins decisively on multi-source reachability.

This section is generated from docs/benchmark-opencitations.json with scripts/render_benchmark_doc.py. Latest run: 2026-06-15. Oxigraph 0.5, in-memory store (no RocksDB). Machine: 32 logical cores.

Load / open (one-time)

Engine	Step	Time	Resident heap after load
rete	`Rete::open` - indexes already built in the file	19.9 ms	13.35 MiB
Oxigraph	bulk-load N-Triples + build in-memory indexes	2437 ms	144.98 MiB

Process peak RSS after both loads (VmHWM): 207 MiB.

rete's "load" just maps a file whose dictionary + permutation indexes already exist on disk; Oxigraph parses the triples and builds its indexes on every startup. This is the format's core promise: publish once, open instantly, query in place.

SPARQL operator coverage (eager, lazy range-read, and Oxigraph)

24 queries spanning supported forms and operators, on three engines: rete eager (Rete::open, whole file in memory), rete lazy (Rete::open_ranged_lazy — a fresh cold open + HTTP-range-style reads per query, the path a browser/remote client runs), and Oxigraph (in-memory). Row counts are a cross-engine correctness check across the language surface, not just a speed race. Median ±sd of 5 warm runs. lazy fetched is the bytes that one query pulled from the file image — the rest is never touched.

Operator / form	rete eager	rete lazy	lazy fetched	Oxigraph	eager vs oxi	rows	ok
SELECT count (aggregate)	3.78 ±0.85 ms	4.65 ±0.62 ms	77.4 KB	9.14 ±0.62 ms	2.4x	1	yes
SELECT DISTINCT	4.79 ±0.52 ms	5.44 ±0.32 ms	51.2 KB	10.6 ±1.9 ms	2.2x	6	yes
ASK	0.06 ±0.20 ms	0.67 ±0.03 ms	51.2 KB	0.01 ±0.04 ms	0.2x	1	yes
CONSTRUCT	0.01 ±0.02 ms	1.84 ±1.40 ms	70.1 KB	0.05 ±0.51 ms	3.7x	9	yes
DESCRIBE (impl-defined)	0.01 ±0.00 ms	0.64 ±0.02 ms	70.1 KB	0.03 ±0.05 ms	2.8x	11	yes
VALUES (inline data)	5.22 ±0.45 ms	6.63 ±1.29 ms	177.3 KB	11.0 ±4.4 ms	2.1x	10962	yes
UNION	7.03 ±1.27 ms	7.97 ±1.39 ms	177.3 KB	12.9 ±1.7 ms	1.8x	10993	yes
OPTIONAL (left join)	0.30 ±0.03 ms	2.28 ±0.59 ms	125.6 KB	0.41 ±0.16 ms	1.4x	200	yes
MINUS	2.64 ±0.98 ms	6.34 ±1.46 ms	200.5 KB	3.29 ±2.00 ms	1.2x	2728	yes
FILTER NOT EXISTS	2.15 ±0.20 ms	6.69 ±1.18 ms	200.5 KB	10.9 ±4.3 ms	5.1x	2728	yes
3-way join + LIMIT	0.18 ±0.04 ms	1.75 ±0.16 ms	143.1 KB	0.19 ±0.12 ms	1.0x	50	yes
FILTER REGEX (case-insens.)	6.99 ±0.69 ms	8.24 ±0.88 ms	79.5 KB	0.90 ±0.24 ms	0.1x	200	yes
FILTER arith + logical	0.27 ±0.06 ms	1.88 ±0.25 ms	204.5 KB	0.75 ±0.33 ms	2.7x	200	yes
BIND + SUBSTR + CONCAT	0.36 ±0.02 ms	2.56 ±0.61 ms	151.7 KB	0.33 ±0.15 ms	0.9x	200	yes
path sequence a/b	0.20 ±0.02 ms	1.75 ±0.14 ms	162.5 KB	0.25 ±0.09 ms	1.2x	200	yes
path inverse ^p (count)	3.32 ±0.35 ms	3.95 ±0.51 ms	77.4 KB	9.71 ±0.92 ms	2.9x	1	yes
path + transitive (count)	6.94 ±0.18 ms	8.74 ±0.60 ms	89.5 KB	14.0 ±1.6 ms	2.0x	1	yes
path * zero-or-more (count)	6.55 ±0.27 ms	7.03 ±0.16 ms	89.5 KB	10.4 ±0.8 ms	1.6x	1	yes
GROUP BY + ORDER BY	3.81 ±0.60 ms	3.53 ±0.07 ms	51.2 KB	10.7 ±1.8 ms	2.8x	6	yes
GROUP BY + HAVING	6.12 ±2.18 ms	6.94 ±2.14 ms	51.2 KB	10.3 ±0.4 ms	1.7x	5	yes
AVG per group	27.8 ±7.5 ms	31.9 ±5.5 ms	95.0 KB	43.9 ±2.3 ms	1.6x	6	yes
MIN/MAX/SUM	9.14 ±2.02 ms	5.51 ±0.91 ms	78.5 KB	10.7 ±0.6 ms	1.2x	1	yes
COUNT(DISTINCT)	2.48 ±0.08 ms	2.74 ±0.09 ms	57.4 KB	7.73 ±0.29 ms	3.1x	1	yes
ORDER BY + LIMIT + OFFSET	4.05 ±0.06 ms	5.02 ±0.15 ms	125.0 KB	22.7 ±1.0 ms	5.6x	10	yes

24 / 24 identical row counts across SELECT/ASK/CONSTRUCT/DESCRIBE, algebra operators, filters/functions, property paths, and aggregates.

Reading the times honestly:

rete now wins or ties the large majority of shapes: aggregates, GROUP BY, DISTINCT, path closures, sorted pagination (top-k), and large UNION/VALUES/MINUS results. The engine evaluates as a lazy pipeline over integer slot rows, switches to index-nested-loop probes under small LIMIT/ASK demand, and resolves terms only at projection. A hybrid BGP join also switches to per-row index probes mid-join once the running intermediate is small (so a star/snowflake's trailing ?x rdf:type C checks cost a few lookups, not a full-extent scan), and never seeds a join from a class enumeration — together cutting the LUBM snowflakes Q4 ~14x and Q7 ~12x. The index stores all six permutations (format v0.4), so any join key can be streamed pre-sorted for a sort-merge join; at these sizes the hash/probe joins already win, so the merge path is neutral here — it pays off at much larger scale, for ~2x the index payload.
Oxigraph keeps an edge on REGEX scans whose pattern needs the real regex engine (rete's regex-lite has no literal prefilter; literal patterns use a substring fast path) and, by fractions of a millisecond, on ASK and the tightest LIMIT joins — floor effects, not the orders-of-magnitude gaps from before the engine rework.
The lazy range-read engine — what a browser or remote client runs over HTTP — adds only a ~1 ms cold-open tax over eager while fetching single-digit-percent of the file per query (here ~50–205 KB of a 2.8 MB file), and still beats in-memory Oxigraph on most shapes. Querying a .rete where it sits (S3 / HTTP, no server) is not a latency compromise.

Batch transitive reachability - `coauthor+` from 300 seeds

"From each seed author, who is reachable through co-authorship?" rete exposes this as a dedicated primitive (rete reach, batch_reach_*); on Oxigraph it is a coauthor+ property path evaluated per seed.

Engine / mode	Time	vs rete-serial
rete - `batch_reach_serial` (1 core)	641 ±101 ms	1.0x
rete - `batch_reach_parallel` (32 cores)	39.0 ±12.4 ms	16.4x
Oxigraph - `coauthor+` property path, per seed	3026 ±48 ms	0.2x

rete serial and parallel both reached 1,636,200 nodes; Oxigraph touched 1,636,200 result cells. The dedicated parallel primitive is a different abstraction level from a general SPARQL property path, so read this as: use the right tool for multi-source reach.

Reproduce

# In the dev container (Docker). The OpenCitations + synthetic-enrichment data
# comes from scripts/fetch_opencitations.py + scripts/enrich.py (-> enriched-all.nt).
# Sanitize malformed compound-DOI IRIs so both engines load identical data:
grep -vE "<[^>]* [^>]*>" data/opencitations/enriched-all.nt \
  > data/opencitations/enriched-clean.nt
./target/release/rete build data/opencitations/enriched-clean.nt \
  -o data/opencitations/enriched-clean.rete

cargo build --release -p rete-bench
./target/release/rete-bench --json data/opencitations/enriched-clean.rete \
  data/opencitations/enriched-clean.nt 300 > /tmp/bench.json
# preserve the curated dataset note + run date from the existing JSON:
uv run python scripts/merge_bench_metadata.py /tmp/bench.json \
  docs/benchmark-opencitations.json --date $(date +%F)
uv run python scripts/render_benchmark_doc.py docs/benchmark-opencitations.json \
  --input docs/BENCHMARK.md --output docs/BENCHMARK.md
cargo run -p docgen

The rete-bench crate pulls in Oxigraph only for this comparison; its in-memory store needs no RocksDB/clang, so default-features = false keeps the build light.

LUBM-style benchmark (standard 14 queries)

The LUBM data model and its 14 standard queries, on identical pre-materialized data in both engines. LUBM(1): 93,139 generated + 26,007 RDFS/OWL-RL-materialized = 118,680 triples (.rete: 555,581 bytes).

Read the caveats: the generator is a faithful reimplementation of UBA's documented cardinalities, not the official Java tool, so counts are not comparable to published LUBM figures; inference is rete's RDFS/OWL-RL subset, applied up front to the data both engines load — so restriction-defined classes (Q12's Chair) are empty on both sides by construction. The correctness anchor is cross-engine row parity: 14 / 14 queries agree.

Engine	Load	Resident heap
rete `Rete::open`	6.2 ms	3.66 MiB
Oxigraph bulk-load	180 ms	37.50 MiB

Query	rete	Oxigraph	rete vs oxi	peak heap MiB (rete / oxi)	rows	ok
Q1 grad students of a course	0.30 ±0.03 ms	0.83 ±0.25 ms	2.8x	0.00 / 0.00	3	yes
Q2 grad student / univ / dept triangle	3.96 ±0.18 ms	5.36 ±0.55 ms	1.4x	1.57 / 0.01	2036	yes
Q3 publications of a professor	0.77 ±0.09 ms	2.42 ±0.30 ms	3.1x	0.01 / 0.00	9	yes
Q4 professors of a department (+3 props)	0.24 ±0.02 ms	0.49 ±0.03 ms	2.1x	0.09 / 0.01	26	yes
Q5 persons member of a department	1.39 ±0.10 ms	6.49 ±0.52 ms	4.7x	0.34 / 0.00	544	yes
Q6 all students	2.47 ±0.39 ms	1.71 ±0.42 ms	0.7x	4.39 / 0.00	7232	yes
Q7 students of a professor's courses	0.94 ±0.07 ms	0.08 ±0.02 ms	0.1x	0.04 / 0.01	49	yes
Q8 students of a university's departments	11.7 ±1.5 ms	18.8 ±0.7 ms	1.6x	5.05 / 0.01	7232	yes
Q9 student / advisor / course triangle	6.31 ±0.33 ms	12.1 ±1.1 ms	1.9x	1.53 / 0.01	115	yes
Q10 students of a graduate course	0.75 ±0.08 ms	2.91 ±0.46 ms	3.9x	0.00 / 0.00	0	yes
Q11 research groups of a university	0.21 ±0.02 ms	0.15 ±0.02 ms	0.7x	0.14 / 0.00	230	yes
Q12 chairs of a university's departments	0.01 ±0.00 ms	0.02 ±0.00 ms	2.4x	0.00 / 0.01	0	yes
Q13 alumni of a university	2.31 ±0.16 ms	2.47 ±0.42 ms	1.1x	1.62 / 0.00	2655	yes
Q14 all undergraduate students	2.37 ±0.13 ms	1.75 ±0.54 ms	0.7x	4.39 / 0.00	7232	yes

Reproduce: cargo run --release -p rete-bench -- --json --lubm 1 > docs/benchmark-lubm.json then re-render this doc.

Parallelism

A prototype data-parallel query evaluator (rete-core feature parallel, module parallel.rs, behind rayon — off by default, never compiled into wasm) splits decomposable work across communities/seeds and harmonizes the partials. Every parallel workload ships with a serial reference that produces a bit-identical result; the benchmark asserts serial == parallel (RESULTS MATCH) before trusting any timing. Numbers below are from cargo run --release -p rete-core --example parallel_bench --features parallel in the dev container, best-of-N after a warm-up.

Machine: 32 cores (available_parallelism = 32, rayon = 32 threads).

137,499 triples (20k people, 184 communities)

Workload	Serial	Parallel	Speedup	Match
A — predicate count (per-community + sum)	2.68 ms	1.71 ms	1.57×	✓
B — per-subject out-degree (per-community + BTreeMap merge)	2.77 ms	2.75 ms	1.01×	✓
C — batch reachability, 512 seeds (embarrassingly parallel)	2,648.9 ms	186.8 ms	14.18×	✓

343,844 triples (50k people, 346 communities) — 2.5× scale

Workload	Serial	Parallel	Speedup	Match
A — predicate count	7.08 ms	1.44 ms	4.92×	✓
B — per-subject out-degree	7.94 ms	4.87 ms	1.63×	✓
C — batch reachability, 512 seeds	9,057.8 ms	584.6 ms	15.50×	✓

Honest analysis

Batch reachability is the clear win (14–15×). N independent single-source BFS traversals over a shared read-only adjacency map are embarrassingly parallel — one rayon task per seed, no shared mutable state, no merge step. This is the realistic use case: multi-source impact analysis ("what does each of these 512 nodes transitively reach?"). It does not hit linear 32× because the per-seed reach sets are huge (~20k–50k nodes each) and the work per seed varies, so memory bandwidth and load imbalance cap the gain — but a 14–15× wall-clock cut on a 32-core box is real and reproducible. This workload is shipped as a command: rete reach --parallel (see the CLI reference).
Per-subject out-degree barely moves (1.01× at 137k, 1.63× at 344k). The per-community map-build parallelizes, but the harmonize — merging hundreds of BTreeMaps into one — is serial in the reduce and dominates when the per-tile work is tiny. It only starts to pay (1.63×) once each tile carries enough triples to amortize the merge. This is textbook Amdahl: the serial merge fraction caps the speedup.
Predicate count is the "cheap scan" caveat, reported plainly. At 137k the whole scan is ~2 ms, so rayon's fork/join overhead is on the same order as the work and the speedup is noise-dominated (1.57×). It looks better than it should because the comparison isn't perfectly apples-to-apples: the serial reference walks the engine's index (match_ids, which materializes a Vec of every match), while the parallel side filters pre-tiled in-memory triples. Lesson: a sub-millisecond-per-core workload is not worth parallelizing — the tiling + dispatch overhead eats the gain, and any apparent win is fragile.
This is CPU-side only. The HTTP/range query path is I/O-bound (latency is in the network round-trips, not the CPU), so threading the evaluator does not help there. And wasm has no threads at all — which is exactly why parallel is an opt-in native feature and the wasm crate builds with --no-default-features (verified: rayon is never compiled for wasm).

Conclusion on "split across communities + harmonize": it helps when (a) the per-partition work is substantial relative to fork/join + merge overhead, and (b) the harmonize step is cheap or itself parallel. Batch reachability nails both; per-community aggregations only pay off at scale once the serial merge is amortized; cheap single scans should stay serial.