Benchmark

Synthetic social graph, 139,093 triples (20k people in ~200 communities, ~5 knows edges each + age/name literals). Generated by scripts/gen_graph.py, measured by scripts/bench.sh, run in the dev container (release build, Rust 1.92). Query timings use the compiled binary directly (target/release/rete), so they exclude cargo overhead.

Size

| Artifact | Bytes | vs raw |
|---|---|---|
| raw N-Triples | 8,384,101 | 1.0× |
| gzip -9 of the N-Triples | 564,590 | 14.8× |
| .rete (zstd + pyramid summary) | 1,302,792 | 6.4× |

.rete is ~2.3× the size of gzip while being queryable in place and over HTTP ranges (gzip answers no query without a full download + scan). As of format v0.4 it stores all six permutation indexes (SPO/POS/OSP/SOP/PSO/OPS) — each triple stored 6× — plus the dictionary and the small pyramid summary; that is what roughly doubled the file versus the earlier three-permutation v0.3 (715,959 bytes), and the payoff is that any triple pattern — not just the three canonical leads — resolves to a contiguous scan with its bound components leading, which also unlocks sort-merge joins. Per-community tiles remain out of the default file because they duplicate every triple; exact single-pattern range routing fetches one selected permutation payload instead of the whole index container, while physical community-tile directories are the next storage step.
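A minimal sketch of how a pattern could be mapped to one of the six orderings, so that bound components lead and a requested join key comes next. This is illustrative only and not rete's planner; `Field`, `Perm` and `choose_perm` are hypothetical names.

```rust
/// One position of a triple pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Field { S, P, O }

/// The six orderings listed in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Perm { Spo, Pos, Osp, Sop, Pso, Ops }

/// Pick the ordering in which the bound components come first, followed by
/// the requested sort key (if it is a free component), then whatever is left.
/// `bound` is indexed as [subject, predicate, object].
pub fn choose_perm(bound: [bool; 3], sort_by: Option<Field>) -> Perm {
    let fields = [Field::S, Field::P, Field::O];
    let mut order: Vec<Field> = Vec::with_capacity(3);
    // 1. Bound components lead, so matches form one contiguous range.
    for (i, f) in fields.iter().enumerate() {
        if bound[i] {
            order.push(*f);
        }
    }
    // 2. The join key follows, so results stream out sorted by it.
    if let Some(key) = sort_by {
        if !order.contains(&key) {
            order.push(key);
        }
    }
    // 3. Remaining free components fill the tail.
    for f in fields.iter() {
        if !order.contains(f) {
            order.push(*f);
        }
    }
    match (order[0], order[1]) {
        (Field::S, Field::P) => Perm::Spo,
        (Field::S, Field::O) => Perm::Sop,
        (Field::P, Field::S) => Perm::Pso,
        (Field::P, Field::O) => Perm::Pos,
        (Field::O, Field::S) => Perm::Osp,
        (Field::O, Field::P) => Perm::Ops,
        _ => unreachable!("order holds three distinct fields"),
    }
}
```

For example, a pattern with only the subject bound and the object as join key selects SOP, an ordering outside the canonical SPO/POS/OSP set.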

Build

| Step | Time |
|---|---|
| rete build (parse → dict → 6 indexes → Louvain pyramid → zstd) | 320 ms |

Result: 3 pyramid levels, 137,512 quads, 40,063 terms. (Caching the dictionary section metadata cut build time ~3.4× — see finding 1.)

Build memory (big graphs)

The build's peak RAM is dominated by the raw string statements (every term an owned String, heavily duplicated), which are redundant with the dictionary once interned. Stream-parsing the file (the whole text is never resident) and dropping the string statements right after encoding them to id-triples — before the pyramid + index phases — cuts the peak. On a 3 M-triple / 189 MB N-Triples build (synthetic, scripts-style entities with type/label/edges):

| | OLD | optimized |
|---|---|---|
| process peak RSS (VmHWM) | 1371 MiB | 836 MiB (−39%) |

Output is byte-identical. Reproduce the phase-by-phase heap profile with cargo run --release -p rete-bench -- --build-mem <file.nt> — it snapshots the live heap (exact, via a counting allocator) after parse, dictionary, encode, the drop, pyramid, index, and write.
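The encode-then-drop step can be sketched as follows. This is a simplified illustration under assumed types, not the actual build code: `Dictionary` and `encode_and_drop` are hypothetical names, and the real build streams statements rather than holding them in one Vec.

```rust
use std::collections::HashMap;

/// Maps each distinct term string to a dense u32 id.
#[derive(Default)]
pub struct Dictionary {
    ids: HashMap<String, u32>,
    terms: Vec<String>,
}

impl Dictionary {
    /// Return the id for `term`, assigning the next free id on first sight.
    pub fn intern(&mut self, term: &str) -> u32 {
        if let Some(&id) = self.ids.get(term) {
            return id;
        }
        let id = self.terms.len() as u32;
        self.terms.push(term.to_owned());
        self.ids.insert(term.to_owned(), id);
        id
    }

    /// Number of distinct terms seen so far.
    pub fn len(&self) -> usize {
        self.terms.len()
    }
}

/// Encode owned string statements to id-triples. The input Vec is consumed,
/// so every statement's strings are freed before later phases allocate.
pub fn encode_and_drop(
    statements: Vec<[String; 3]>,
    dict: &mut Dictionary,
) -> Vec<[u32; 3]> {
    let mut out = Vec::with_capacity(statements.len());
    for [s, p, o] in statements {
        out.push([dict.intern(&s), dict.intern(&p), dict.intern(&o)]);
        // s, p and o go out of scope here and release their heap buffers.
    }
    out
}
```

After the call, only the dictionary (one copy of each term) and the 12-byte id-triples remain live, which is the state the pyramid and index phases need.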

Build time (big graphs)

The build is dominated by the Louvain community pyramid — ~91% of wall-clock on the 3 M-triple build. Its local-moving inner loop now uses a reusable dense scratch (community → edge-weight + touched-list reset) instead of a per-node HashMap, with unchanged accumulation order and candidate scan, so the partition and the file content hash are byte-identical. Set RETE_BUILD_TIMING=1 on rete build for the per-phase breakdown.

| phase (3 M triples) | OLD | optimized |
|---|---|---|
| build_dendrogram (Louvain) | 118.2 s | 44.1 s (2.7×) |
| whole rete build (wall-clock) | ~141 s | 67 s (~2.1×) |

A parallel Louvain would not be byte-identical (the partition depends on the sequential move order), so the pyramid stays single-threaded; the six index permutations and the dictionary/permutation sorts already use all cores (rayon).
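The dense scratch with a touched-list reset mentioned above can be sketched like this. It is an assumption-laden illustration, not the code in rete: `DenseScratch` and `accumulate` are hypothetical names, and the real loop also computes modularity gains.

```rust
/// Reusable accumulator mapping community id -> summed edge weight.
pub struct DenseScratch {
    weight: Vec<f64>,    // indexed by community id
    seen: Vec<bool>,     // true if the id was written since the last reset
    touched: Vec<usize>, // written ids, in first-touch order
}

impl DenseScratch {
    pub fn new(num_communities: usize) -> Self {
        DenseScratch {
            weight: vec![0.0; num_communities],
            seen: vec![false; num_communities],
            touched: Vec::new(),
        }
    }

    /// Add `w` to `community`, recording it on first touch.
    pub fn add(&mut self, community: usize, w: f64) {
        if !self.seen[community] {
            self.seen[community] = true;
            self.touched.push(community);
        }
        self.weight[community] += w;
    }

    /// Touched communities with their weights, in first-touch order.
    pub fn entries<'a>(&'a self) -> impl Iterator<Item = (usize, f64)> + 'a {
        self.touched.iter().map(move |&c| (c, self.weight[c]))
    }

    /// Clear only what was written: O(touched), not O(num_communities).
    pub fn reset(&mut self) {
        for &c in &self.touched {
            self.weight[c] = 0.0;
            self.seen[c] = false;
        }
        self.touched.clear();
    }
}

/// Sum the edge weight from one node to each neighbouring community.
/// `neighbors` holds (neighbor node, edge weight) pairs and
/// `community_of[n]` is the current community of node n.
pub fn accumulate(
    scratch: &mut DenseScratch,
    neighbors: &[(usize, f64)],
    community_of: &[usize],
) {
    scratch.reset();
    for &(n, w) in neighbors {
        scratch.add(community_of[n], w);
    }
}
```

One scratch is allocated once and reused for every node, so the inner loop performs no hashing and no per-node allocation.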

Query result serialization (the WASM hot path)

A query's result is returned to the browser as a JSON string. On a large result the serializer dominates the cost, so the envelope is now written directly into a String (rete_core::results_envelope_json) instead of being built as a serde_json::Value tree and then stringified. On a 500k-row SELECT (27 MB JSON), profiled with cargo run --release -p rete-bench -- --query-mem:

| serializer | peak heap | time |
|---|---|---|
| Value tree + to_string | 670 MiB (25× the payload) | 408 ms |
| direct to String | 50 MiB (1.9×) | 41 ms |

~13× less peak heap, ~10× faster, byte-identical JSON (a unit test pins the escaping to serde_json). For reference the query itself (eval_query) was 290 MiB / 344 ms — i.e. the old serializer cost more than the query. On the bounded WASM heap the 670 MiB peak was an OOM risk for big results.
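A minimal sketch of the direct-to-String technique, using only the standard library. The envelope shape (`vars` / `rows`) and the function names are assumptions for illustration; this is not the body of rete_core::results_envelope_json, and it is not claimed to match serde_json byte for byte.

```rust
use std::fmt::Write;

/// Append `s` to `out` as a JSON string literal.
pub fn push_json_str(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => {
                // Remaining control characters use the \u00XX form.
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
}

/// Serialize rows as {"vars":[...],"rows":[[...],...]} into a single
/// String, with no intermediate value tree.
pub fn rows_to_json(vars: &[&str], rows: &[Vec<&str>]) -> String {
    let mut out = String::new();
    out.push_str("{\"vars\":[");
    for (i, v) in vars.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        push_json_str(&mut out, v);
    }
    out.push_str("],\"rows\":[");
    for (i, row) in rows.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        out.push('[');
        for (j, cell) in row.iter().enumerate() {
            if j > 0 {
                out.push(',');
            }
            push_json_str(&mut out, cell);
        }
        out.push(']');
    }
    out.push_str("]}");
    out
}
```

The only allocation that grows with the result is the output String itself, which is why peak heap stays close to the payload size.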

Query latency (end-to-end: open + decompress + evaluate)

5-run averages of the compiled binary in the dev container (includes process startup each run; format v0.4, six permutations).

| Query | Time |
|---|---|
| triple pattern (?s knows p100) | 33 ms |
| 2-hop BGP join (p0 knows ?y . ?y knows ?z) | 39 ms |
| property path (p0 knows+ ?y, reaches whole graph) | 59 ms |
| GROUP BY COUNT (degree of every node) | 107 ms |
| per-predicate totals (summary only, index not read) | 24 ms |

Every dictionary lookup and triple-block decode is bounds-checked (slice.get(..)? instead of raw indexing) so the reader cannot panic on malformed input — a few percent on the resolution-heavy hot path, and worth it (see finding 5). Resolution-light queries (triple pattern, summary) are unaffected.
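The slice.get(..)? pattern looks roughly like this. The fixed-width little-endian layout below is an assumption made for the example, not the actual .rete block encoding, and both function names are hypothetical.

```rust
/// Read a little-endian u32 at `offset`, or None if the buffer is too short.
pub fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?; // guard against offset overflow
    let bytes = buf.get(offset..end)?; // None instead of a panic
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Decode the `index`-th (s, p, o) id-triple, assuming each triple is
/// stored as three consecutive u32 values (12 bytes).
pub fn read_triple(buf: &[u8], index: usize) -> Option<[u32; 3]> {
    let base = index.checked_mul(12)?;
    Some([
        read_u32_le(buf, base)?,
        read_u32_le(buf, base.checked_add(4)?)?,
        read_u32_le(buf, base.checked_add(8)?)?,
    ])
}
```

A truncated or corrupt buffer yields None, which the caller can turn into an error; raw indexing (`buf[offset]`) would abort the process instead.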

HTTP range

A point query over HTTP is bounded, PMTiles-style. The full SPARQL URL path still opens the broad index section, but query-url and rete cost now have an exact single-pattern route: header + dictionary + one selected permutation payload. The progressive path is cheaper still for summary-safe queries: fetching just the coarse community graph (summary-url) reads only the header + dictionary + summary — e.g. 2.2 KB of a 15.6 KB file (14%) in 3 requests, the index never fetched — the "overview first, drill down later" promise.
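As a rough illustration of the arithmetic behind a bounded read, the sketch below turns section spans into HTTP Range header values and computes the fetched share of the file. `Section` is a hypothetical stand-in for a section-table entry, and the offsets used in the usage note are made up to total 2.2 KB; they are not the real layout.

```rust
/// A byte span inside the file (hypothetical section-table entry).
#[derive(Debug, Clone, Copy)]
pub struct Section {
    pub offset: u64,
    pub len: u64,
}

/// Format an HTTP `Range` header value for one section. The end position
/// is inclusive, as in "bytes=0-63" for the first 64 bytes.
pub fn range_header(s: Section) -> Option<String> {
    if s.len == 0 {
        return None; // an empty span needs no request
    }
    let last = s.offset.checked_add(s.len - 1)?;
    Some(format!("bytes={}-{}", s.offset, last))
}

/// Share of the file covered by `sections`, as a percentage.
pub fn percent_fetched(sections: &[Section], file_len: u64) -> f64 {
    if file_len == 0 {
        return 0.0;
    }
    let fetched: u64 = sections.iter().map(|s| s.len).sum();
    100.0 * fetched as f64 / file_len as f64
}
```

With three illustrative spans of 64, 1,336 and 800 bytes in a 15,600-byte file, `percent_fetched` returns about 14.1, the same proportion as the summary-url example above.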

Scaling

At 347,884 triples (2.5× the table above) everything scales ~linearly, with no pathological blowup: build 890 ms, triple query 54 ms, property path 103 ms, GROUP BY 261 ms, predicate totals 26 ms (summary stays cheap regardless of size), file 3.31 MB (same ~6.4× vs raw / ~2.3× vs gzip ratios as the table above).

The pyramid: cost vs benefit

A .rete carries two pyramidal structures with very different economics. This measures both, against --no-pyramid, on a typed subClassOf ontology fixture (dev/gen_ontology_demo.py, scaled by entity count).

Bytes — file overhead, the growing super-edge summary, and what each read tier fetches over HTTP (q ±pyr = a node-selective query with / without the pyramid):

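| triples | file +pyr | file −pyr | pyr % | pyramid-meta | super-edges | card-url | summary-url | q +pyr (bytes) | q −pyr (bytes) |
|---|---|---|---|---|---|---|---|---|---|
| 8,595 | 99 KB | 90 KB | 9.6% | 8.7 KB | 1,364 | 32.7 KB | 17 KB | 43,823 | 43,823 |
| 34,354 | 282 KB | 264 KB | 7.2% | 19 KB | 3,455 | 32.8 KB | 43 KB | 68,295 | 68,295 |
| 137,279 | 990 KB | 946 KB | 4.6% | 43.6 KB | 8,580 | 33.0 KB | 116 KB | 111,054 | 111,054 |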

Time (release build; median) — what the pyramid costs to build (Louvain community detection + schema rollup, vs --no-pyramid) and how fast each read is:

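| triples | build +pyr | build −pyr | pyramid cost | card read | summary read | selective query |
|---|---|---|---|---|---|---|
| 8,595 | 70 ms | 51 ms | +19 ms | 19 ms | 23 ms | 28 ms |
| 34,354 | 222 ms | 109 ms | +113 ms | 19 ms | 24 ms | 30 ms |
| 137,279 | 1,352 ms | 353 ms | +999 ms | 22 ms | 30 ms | 41 ms |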

What the numbers say:

  • The community super-edge summary scales with the graph — in bytes (super-edges 1.4k → 8.6k; summary-url 17 KB → 116 KB) and in time (the Louvain build cost is super-linear: +19 ms → +999 ms, ~4× the --no-pyramid build at 137k triples).
  • It gives node-selective lazy queries nothing: q +pyr and q −pyr are byte-for-byte identical (43,823 == 43,823, …). For pure selective serving the pyramid is overhead; --no-pyramid is smaller and builds far faster.
  • The schema pyramid is the bounded one — its size and read cost track the ontology, not the graph, so the leveled type/relation legend stays cheap at any scale. The flat-cost champion is the card (card-url ≈ 33 KB and ≈ 20 ms read at every size — it skips the dictionary and super-edges entirely).

Rule of thumb: keep the pyramid for index-free overview / exploration; build with --no-pyramid when you only serve selective queries at scale (smaller file, ~4× faster build, identical query latency). See the Semantic zoom guide for the schema-pyramid side of this. Repro: dev/bench_pyramid_value.sh (bytes) and dev/bench_pyramid_time.sh (time).

Coherence checking: tier costs

How much of a remote .rete each coherence check actually fetches, as the graph grows. A synthetic medical ontology — a fixed T-Box with a planted unsatisfiable class (:Relapsed ⊑ :Healthy and ⊑ :Sick, which are owl:disjointWith) plus a few instance-level clashes — at increasing instance counts, measured through a CountingReader over the file image. Repro: cargo run --release --example coherence_bench [n …].

| instances | file | Tier-0 schema (index-free) | Tier-1 selective slice | Tier-2 full graph |
|---|---|---|---|---|
| 1,000 | 33 KB | 986 B · 2 req · 0.02 ms | 26 KB · 36 req · 7 ms | 33 KB · 36 req · 10 ms |
| 10,000 | 283 KB | 996 B · 2 req · 0.02 ms | 32 KB · 37 req · 54 ms | 171 KB · 38 req · 186 ms |
| 100,000 | 9.4 MB | 8.1 KB · 2 req · 0.09 ms | 346 KB · 53 req · 1.1 s | 1.8 MB · 48 req · 2.9 s |
| 500,000 | 48.8 MB | 8.1 KB · 2 req · 0.10 ms | 1.4 MB · 90 req · 7.5 s | 8.5 MB · 54 req · 15.9 s |

The three tiers find different defects, by design: Tier-0 finds the schema-level unsatisfiable class (1 point) from the ontology alone; Tier-1/Tier-2 also find the instance-level disjoint clashes (3 points). So Tier-0 is a fast "is the ontology coherent?" gate, and Tier-1/2 answer "is the data coherent?".

The headline: Tier-0 reads ~1–8 KB in 2 range requests at any graph size: 8.1 KB of a 48.8 MB file (0.017%), in at most 0.10 ms. The "check an ontology's coherence from a few KB of a multi-GB file" claim holds up at every size measured here (the largest file is 48.8 MB). Tier-1 (selective) is the practical instance-coherence check: it faults only the rdf:type + T-Box tiles (346 KB of 9.4 MB at 100k, 1.4 MB of 48.8 MB at 500k), while Tier-2 reads the whole graph.

Three fixes this benchmark drove

  • detect_inconsistencies was O(types²). The disjoint-class check looped over every pair of rdf:type triples in the whole graph (~4.3 s at 10k instances). Grouping each individual's class set first and checking only the pairs within one individual makes it ~linear: Tier-1 4320 ms → 54 ms, Tier-2 4607 ms → 186 ms at 10k. (The functional-property check had the same shape and got the same fix.)
  • Tier-0 was reading the whole dictionary. check_schema went through SummaryView, which fetches the dictionary to label predicates — but coherence needs none of it (the schema pyramid carries its own class-string table). A dictionary-free read_schema_coherence_ranged cut Tier-0 from 20.9 KB → ~2 KB at 10k.
  • Tier-0 then still scaled with the community summary. The schema pyramid is bounded by the ontology, but it shared the pyramid-meta section with the community super-edge summary, which grows with the entity count (every labelled node is a vertex) — so reading the whole pyramid-meta cost 6.7 MB at 100k, 36 MB at 500k. The fix: the writer now records the trailing schema block's byte length in the header (the 4 previously-reserved bytes), so a reader fetches only that block. Tier-0 dropped to a flat 8.1 KB — a ~4400× cut at 500k. Backward-compatible: a file without the header field (schema_meta_len == 0) falls back to the whole-section read, and a typeless file's header byte stays 0, so it remains byte-identical.
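The first fix above (group each individual's classes, then check pairs only within one individual) can be sketched as follows. This is an illustration over plain integer ids, not the body of detect_inconsistencies; the function name and the input shapes are assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Find (individual, class_a, class_b) clashes where one individual is
/// typed with two classes that are declared disjoint.
/// `types` holds (individual, class) pairs from rdf:type triples;
/// `disjoint` holds class pairs stored with the smaller id first.
pub fn disjoint_clashes(
    types: &[(u32, u32)],
    disjoint: &HashSet<(u32, u32)>,
) -> Vec<(u32, u32, u32)> {
    // Pass 1: group each individual's class set (linear in the input).
    let mut by_individual: HashMap<u32, Vec<u32>> = HashMap::new();
    for &(individual, class) in types {
        by_individual.entry(individual).or_default().push(class);
    }
    // Pass 2: compare pairs only inside one individual's class list.
    let mut clashes = Vec::new();
    for (&individual, classes) in &by_individual {
        for (i, &a) in classes.iter().enumerate() {
            for &b in &classes[i + 1..] {
                let key = if a <= b { (a, b) } else { (b, a) };
                if disjoint.contains(&key) {
                    clashes.push((individual, key.0, key.1));
                }
            }
        }
    }
    // HashMap iteration order is arbitrary, so sort for stable output.
    clashes.sort_unstable();
    clashes
}
```

The pair loop is still quadratic, but only in the number of classes per individual, which is typically small; the cost no longer grows with the square of all rdf:type triples in the graph.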

Comparison vs Oxigraph (real OpenCitations network)

Dataset: the real OpenCitations citation network for the AlphaFold paper (Nature 2021), enriched with disciplines, authors, venues and citation counts (539,246 triples loaded into both engines).

This pits rete (a queryable file) against Oxigraph (a full in-memory triplestore with a mature SPARQL planner). Honest summary: rete opens much faster because its indexes are prebuilt in the file, wins several scan/aggregate shapes, still loses where Oxigraph can stream, stop early, or apply a mature planner, and wins decisively on multi-source reachability.

This section is generated from docs/benchmark-opencitations.json with scripts/render_benchmark_doc.py. Latest run: 2026-06-15. Oxigraph 0.5, in-memory store (no RocksDB). Machine: 32 logical cores.

Load / open (one-time)

| Engine | Step | Time | Resident heap after load |
|---|---|---|---|
| rete | Rete::open - indexes already built in the file | 19.9 ms | 13.35 MiB |
| Oxigraph | bulk-load N-Triples + build in-memory indexes | 2437 ms | 144.98 MiB |

Process peak RSS after both loads (VmHWM): 207 MiB.

rete's "load" just maps a file whose dictionary + permutation indexes already exist on disk; Oxigraph parses the triples and builds its indexes on every startup. This is the format's core promise: publish once, open instantly, query in place.

SPARQL operator coverage (eager, lazy range-read, and Oxigraph)

24 queries spanning supported forms and operators, on three engines: rete eager (Rete::open, whole file in memory), rete lazy (Rete::open_ranged_lazy — a fresh cold open + HTTP-range-style reads per query, the path a browser/remote client runs), and Oxigraph (in-memory). Row counts are a cross-engine correctness check across the language surface, not just a speed race. Median ±sd of 5 warm runs. lazy fetched is the bytes that one query pulled from the file image — the rest is never touched.

| Operator / form | rete eager | rete lazy | lazy fetched | Oxigraph | eager vs oxi | rows | ok |
|---|---|---|---|---|---|---|---|
| SELECT count (aggregate) | 3.78 ±0.85 ms | 4.65 ±0.62 ms | 77.4 KB | 9.14 ±0.62 ms | 2.4x | 1 | yes |
| SELECT DISTINCT | 4.79 ±0.52 ms | 5.44 ±0.32 ms | 51.2 KB | 10.6 ±1.9 ms | 2.2x | 6 | yes |
| ASK | 0.06 ±0.20 ms | 0.67 ±0.03 ms | 51.2 KB | 0.01 ±0.04 ms | 0.2x | 1 | yes |
| CONSTRUCT | 0.01 ±0.02 ms | 1.84 ±1.40 ms | 70.1 KB | 0.05 ±0.51 ms | 3.7x | 9 | yes |
| DESCRIBE (impl-defined) | 0.01 ±0.00 ms | 0.64 ±0.02 ms | 70.1 KB | 0.03 ±0.05 ms | 2.8x | 11 | yes |
| VALUES (inline data) | 5.22 ±0.45 ms | 6.63 ±1.29 ms | 177.3 KB | 11.0 ±4.4 ms | 2.1x | 10962 | yes |
| UNION | 7.03 ±1.27 ms | 7.97 ±1.39 ms | 177.3 KB | 12.9 ±1.7 ms | 1.8x | 10993 | yes |
| OPTIONAL (left join) | 0.30 ±0.03 ms | 2.28 ±0.59 ms | 125.6 KB | 0.41 ±0.16 ms | 1.4x | 200 | yes |
| MINUS | 2.64 ±0.98 ms | 6.34 ±1.46 ms | 200.5 KB | 3.29 ±2.00 ms | 1.2x | 2728 | yes |
| FILTER NOT EXISTS | 2.15 ±0.20 ms | 6.69 ±1.18 ms | 200.5 KB | 10.9 ±4.3 ms | 5.1x | 2728 | yes |
| 3-way join + LIMIT | 0.18 ±0.04 ms | 1.75 ±0.16 ms | 143.1 KB | 0.19 ±0.12 ms | 1.0x | 50 | yes |
| FILTER REGEX (case-insens.) | 6.99 ±0.69 ms | 8.24 ±0.88 ms | 79.5 KB | 0.90 ±0.24 ms | 0.1x | 200 | yes |
| FILTER arith + logical | 0.27 ±0.06 ms | 1.88 ±0.25 ms | 204.5 KB | 0.75 ±0.33 ms | 2.7x | 200 | yes |
| BIND + SUBSTR + CONCAT | 0.36 ±0.02 ms | 2.56 ±0.61 ms | 151.7 KB | 0.33 ±0.15 ms | 0.9x | 200 | yes |
| path sequence a/b | 0.20 ±0.02 ms | 1.75 ±0.14 ms | 162.5 KB | 0.25 ±0.09 ms | 1.2x | 200 | yes |
| path inverse ^p (count) | 3.32 ±0.35 ms | 3.95 ±0.51 ms | 77.4 KB | 9.71 ±0.92 ms | 2.9x | 1 | yes |
| path + transitive (count) | 6.94 ±0.18 ms | 8.74 ±0.60 ms | 89.5 KB | 14.0 ±1.6 ms | 2.0x | 1 | yes |
| path * zero-or-more (count) | 6.55 ±0.27 ms | 7.03 ±0.16 ms | 89.5 KB | 10.4 ±0.8 ms | 1.6x | 1 | yes |
| GROUP BY + ORDER BY | 3.81 ±0.60 ms | 3.53 ±0.07 ms | 51.2 KB | 10.7 ±1.8 ms | 2.8x | 6 | yes |
| GROUP BY + HAVING | 6.12 ±2.18 ms | 6.94 ±2.14 ms | 51.2 KB | 10.3 ±0.4 ms | 1.7x | 5 | yes |
| AVG per group | 27.8 ±7.5 ms | 31.9 ±5.5 ms | 95.0 KB | 43.9 ±2.3 ms | 1.6x | 6 | yes |
| MIN/MAX/SUM | 9.14 ±2.02 ms | 5.51 ±0.91 ms | 78.5 KB | 10.7 ±0.6 ms | 1.2x | 1 | yes |
| COUNT(DISTINCT) | 2.48 ±0.08 ms | 2.74 ±0.09 ms | 57.4 KB | 7.73 ±0.29 ms | 3.1x | 1 | yes |
| ORDER BY + LIMIT + OFFSET | 4.05 ±0.06 ms | 5.02 ±0.15 ms | 125.0 KB | 22.7 ±1.0 ms | 5.6x | 10 | yes |

24 / 24 identical row counts across SELECT/ASK/CONSTRUCT/DESCRIBE, algebra operators, filters/functions, property paths, and aggregates.

Reading the times honestly:

  • rete now wins or ties the large majority of shapes: aggregates, GROUP BY, DISTINCT, path closures, sorted pagination (top-k), and large UNION/VALUES/MINUS results. The engine evaluates as a lazy pipeline over integer slot rows, switches to index-nested-loop probes under small LIMIT/ASK demand, and resolves terms only at projection. A hybrid BGP join also switches to per-row index probes mid-join once the running intermediate is small (so a star/snowflake's trailing ?x rdf:type C checks cost a few lookups, not a full-extent scan), and never seeds a join from a class enumeration — together cutting the LUBM snowflakes Q4 ~14x and Q7 ~12x. The index stores all six permutations (format v0.4), so any join key can be streamed pre-sorted for a sort-merge join; at these sizes the hash/probe joins already win, so the merge path is neutral here — it pays off at much larger scale, for ~2x the index payload.
  • Oxigraph keeps an edge on REGEX scans whose pattern needs the real regex engine (rete's regex-lite has no literal prefilter; literal patterns use a substring fast path) and, by fractions of a millisecond, on ASK and the tightest LIMIT joins — floor effects, not the orders-of-magnitude gaps from before the engine rework.
  • The lazy range-read engine — what a browser or remote client runs over HTTP — adds only a ~1 ms cold-open tax over eager while fetching single-digit-percent of the file per query (here ~50–205 KB of a 2.8 MB file), and still beats in-memory Oxigraph on most shapes. Querying a .rete where it sits (S3 / HTTP, no server) is not a latency compromise.
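For readers unfamiliar with the sort-merge join that the six permutations make possible, here is a textbook sketch over two inputs already sorted by the join key. It is the generic algorithm, not rete's join code; `merge_join` and the (key, value) row shape are assumptions.

```rust
/// Join two inputs that are already sorted by key, emitting
/// (key, left value, right value) for every matching pair.
/// Each input is read once, front to back.
pub fn merge_join(
    left: &[(u32, u32)],
    right: &[(u32, u32)],
) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        let (lk, rk) = (left[i].0, right[j].0);
        if lk < rk {
            i += 1; // left key is behind, advance left
        } else if lk > rk {
            j += 1; // right key is behind, advance right
        } else {
            // Find the run of equal keys on each side, then emit the
            // cross product of the two runs.
            let i_end = i + left[i..].iter().take_while(|t| t.0 == lk).count();
            let j_end = j + right[j..].iter().take_while(|t| t.0 == lk).count();
            for l in &left[i..i_end] {
                for r in &right[j..j_end] {
                    out.push((lk, l.1, r.1));
                }
            }
            i = i_end;
            j = j_end;
        }
    }
    out
}
```

No hash table is built and memory use is bounded by the output, which is why this shape tends to pay off on large inputs rather than at the sizes measured here.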

Batch transitive reachability - coauthor+ from 300 seeds

"From each seed author, who is reachable through co-authorship?" rete exposes this as a dedicated primitive (rete reach, batch_reach_*); on Oxigraph it is a coauthor+ property path evaluated per seed.

| Engine / mode | Time | vs rete-serial |
|---|---|---|
| rete - batch_reach_serial (1 core) | 641 ±101 ms | 1.0x |
| rete - batch_reach_parallel (32 cores) | 39.0 ±12.4 ms | 16.4x |
| Oxigraph - coauthor+ property path, per seed | 3026 ±48 ms | 0.2x |

rete serial and parallel both reached 1,636,200 nodes; Oxigraph touched 1,636,200 result cells. The dedicated parallel primitive is a different abstraction level from a general SPARQL property path, so read this as: use the right tool for multi-source reach.
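The shape of a multi-source reach can be sketched with the standard library alone: one BFS per seed over a shared read-only adjacency list, with seeds split across scoped threads. This is not rete's batch_reach_* code; it swaps rayon's work-stealing for fixed chunks on std::thread::scope, and all names below are hypothetical.

```rust
use std::collections::VecDeque;
use std::thread;

/// Number of nodes reachable from `seed` through one or more edges.
pub fn reach_count(adj: &[Vec<usize>], seed: usize) -> usize {
    let mut seen = vec![false; adj.len()];
    let mut queue = VecDeque::new();
    let mut count = 0;
    // The seed is not marked seen up front: it only counts if some
    // path of length >= 1 leads back to it.
    queue.push_back(seed);
    while let Some(u) = queue.pop_front() {
        for &v in &adj[u] {
            if !seen[v] {
                seen[v] = true;
                count += 1;
                queue.push_back(v);
            }
        }
    }
    count
}

/// Serial reference: one BFS after another.
pub fn reach_serial(adj: &[Vec<usize>], seeds: &[usize]) -> Vec<usize> {
    seeds.iter().map(|&s| reach_count(adj, s)).collect()
}

/// Parallel version: seeds are split into contiguous chunks, one thread
/// per chunk. Threads share `adj` read-only and write disjoint output
/// slots, so no locking or merge step is needed.
pub fn reach_parallel(adj: &[Vec<usize>], seeds: &[usize], threads: usize) -> Vec<usize> {
    let t = threads.max(1);
    let chunk = ((seeds.len() + t - 1) / t).max(1);
    let mut out = vec![0; seeds.len()];
    thread::scope(|scope| {
        for (seed_chunk, out_chunk) in seeds.chunks(chunk).zip(out.chunks_mut(chunk)) {
            scope.spawn(move || {
                for (slot, &seed) in out_chunk.iter_mut().zip(seed_chunk) {
                    *slot = reach_count(adj, seed);
                }
            });
        }
    });
    out
}
```

Fixed chunks are simpler than work-stealing but more exposed to load imbalance when per-seed work varies, so a real implementation would likely prefer a dynamic scheduler.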

Reproduce

# In the dev container (Docker). The OpenCitations + synthetic-enrichment data
# comes from scripts/fetch_opencitations.py + scripts/enrich.py (-> enriched-all.nt).
# Sanitize malformed compound-DOI IRIs so both engines load identical data:
grep -vE "<[^>]* [^>]*>" data/opencitations/enriched-all.nt \
  > data/opencitations/enriched-clean.nt
./target/release/rete build data/opencitations/enriched-clean.nt \
  -o data/opencitations/enriched-clean.rete

cargo build --release -p rete-bench
./target/release/rete-bench --json data/opencitations/enriched-clean.rete \
  data/opencitations/enriched-clean.nt 300 > /tmp/bench.json
# preserve the curated dataset note + run date from the existing JSON:
uv run python scripts/merge_bench_metadata.py /tmp/bench.json \
  docs/benchmark-opencitations.json --date $(date +%F)
uv run python scripts/render_benchmark_doc.py docs/benchmark-opencitations.json \
  --input docs/BENCHMARK.md --output docs/BENCHMARK.md
cargo run -p docgen

The rete-bench crate pulls in Oxigraph only for this comparison; its in-memory store needs no RocksDB/clang, so default-features = false keeps the build light.

LUBM-style benchmark (standard 14 queries)

The LUBM data model and its 14 standard queries, on identical pre-materialized data in both engines. LUBM(1): 93,139 generated + 26,007 RDFS/OWL-RL-materialized = 118,680 triples (.rete: 555,581 bytes).

Read the caveats: the generator is a faithful reimplementation of UBA's documented cardinalities, not the official Java tool, so counts are not comparable to published LUBM figures; inference is rete's RDFS/OWL-RL subset, applied up front to the data both engines load — so restriction-defined classes (Q12's Chair) are empty on both sides by construction. The correctness anchor is cross-engine row parity: 14 / 14 queries agree.

| Engine | Load | Resident heap |
|---|---|---|
| rete Rete::open | 6.2 ms | 3.66 MiB |
| Oxigraph bulk-load | 180 ms | 37.50 MiB |

| Query | rete | Oxigraph | rete vs oxi | peak heap MiB (rete / oxi) | rows | ok |
|---|---|---|---|---|---|---|
| Q1 grad students of a course | 0.30 ±0.03 ms | 0.83 ±0.25 ms | 2.8x | 0.00 / 0.00 | 3 | yes |
| Q2 grad student / univ / dept triangle | 3.96 ±0.18 ms | 5.36 ±0.55 ms | 1.4x | 1.57 / 0.01 | 2036 | yes |
| Q3 publications of a professor | 0.77 ±0.09 ms | 2.42 ±0.30 ms | 3.1x | 0.01 / 0.00 | 9 | yes |
| Q4 professors of a department (+3 props) | 0.24 ±0.02 ms | 0.49 ±0.03 ms | 2.1x | 0.09 / 0.01 | 26 | yes |
| Q5 persons member of a department | 1.39 ±0.10 ms | 6.49 ±0.52 ms | 4.7x | 0.34 / 0.00 | 544 | yes |
| Q6 all students | 2.47 ±0.39 ms | 1.71 ±0.42 ms | 0.7x | 4.39 / 0.00 | 7232 | yes |
| Q7 students of a professor's courses | 0.94 ±0.07 ms | 0.08 ±0.02 ms | 0.1x | 0.04 / 0.01 | 49 | yes |
| Q8 students of a university's departments | 11.7 ±1.5 ms | 18.8 ±0.7 ms | 1.6x | 5.05 / 0.01 | 7232 | yes |
| Q9 student / advisor / course triangle | 6.31 ±0.33 ms | 12.1 ±1.1 ms | 1.9x | 1.53 / 0.01 | 115 | yes |
| Q10 students of a graduate course | 0.75 ±0.08 ms | 2.91 ±0.46 ms | 3.9x | 0.00 / 0.00 | 0 | yes |
| Q11 research groups of a university | 0.21 ±0.02 ms | 0.15 ±0.02 ms | 0.7x | 0.14 / 0.00 | 230 | yes |
| Q12 chairs of a university's departments | 0.01 ±0.00 ms | 0.02 ±0.00 ms | 2.4x | 0.00 / 0.01 | 0 | yes |
| Q13 alumni of a university | 2.31 ±0.16 ms | 2.47 ±0.42 ms | 1.1x | 1.62 / 0.00 | 2655 | yes |
| Q14 all undergraduate students | 2.37 ±0.13 ms | 1.75 ±0.54 ms | 0.7x | 4.39 / 0.00 | 7232 | yes |

Reproduce: cargo run --release -p rete-bench -- --json --lubm 1 > docs/benchmark-lubm.json then re-render this doc.

Parallelism

A prototype data-parallel query evaluator (rete-core feature parallel, module parallel.rs, behind rayon, off by default, never compiled into wasm) splits decomposable work across communities/seeds and harmonizes the partials. Every parallel workload ships with a serial reference that produces a bit-identical result; the benchmark asserts serial == parallel (RESULTS MATCH) before trusting any timing. Numbers below are from cargo run --release -p rete-core --example parallel_bench --features parallel in the dev container, best-of-N after a warm-up.

Machine: 32 cores (available_parallelism = 32, rayon = 32 threads).

137,499 triples (20k people, 184 communities)

| Workload | Serial | Parallel | Speedup | Match |
|---|---|---|---|---|
| A: predicate count (per-community + sum) | 2.68 ms | 1.71 ms | 1.57× | yes |
| B: per-subject out-degree (per-community + BTreeMap merge) | 2.77 ms | 2.75 ms | 1.01× | yes |
| C: batch reachability, 512 seeds (embarrassingly parallel) | 2,648.9 ms | 186.8 ms | 14.18× | yes |

343,844 triples (50k people, 346 communities) — 2.5× scale

| Workload | Serial | Parallel | Speedup | Match |
|---|---|---|---|---|
| A: predicate count | 7.08 ms | 1.44 ms | 4.92× | yes |
| B: per-subject out-degree | 7.94 ms | 4.87 ms | 1.63× | yes |
| C: batch reachability, 512 seeds | 9,057.8 ms | 584.6 ms | 15.50× | yes |

Honest analysis

  • Batch reachability is the clear win (14–15×). N independent single-source BFS traversals over a shared read-only adjacency map are embarrassingly parallel — one rayon task per seed, no shared mutable state, no merge step. This is the realistic use case: multi-source impact analysis ("what does each of these 512 nodes transitively reach?"). It does not hit linear 32× because the per-seed reach sets are huge (~20k–50k nodes each) and the work per seed varies, so memory bandwidth and load imbalance cap the gain — but a 14–15× wall-clock cut on a 32-core box is real and reproducible. This workload is shipped as a command: rete reach --parallel (see the CLI reference).
  • Per-subject out-degree barely moves (1.01× at 137k, 1.63× at 344k). The per-community map-build parallelizes, but the harmonize — merging hundreds of BTreeMaps into one — is serial in the reduce and dominates when the per-tile work is tiny. It only starts to pay (1.63×) once each tile carries enough triples to amortize the merge. This is textbook Amdahl: the serial merge fraction caps the speedup.
  • Predicate count is the "cheap scan" caveat, reported plainly. At 137k the whole serial scan takes under 3 ms (2.68 ms), so rayon's fork/join overhead is on the same order as the work and the speedup is noise-dominated (1.57×). It looks better than it should because the comparison isn't perfectly apples-to-apples: the serial reference walks the engine's index (match_ids, which materializes a Vec of every match), while the parallel side filters pre-tiled in-memory triples. Lesson: a sub-millisecond-per-core workload is not worth parallelizing; the tiling + dispatch overhead eats the gain, and any apparent win is fragile.
  • This is CPU-side only. The HTTP/range query path is I/O-bound (latency is in the network round-trips, not the CPU), so threading the evaluator does not help there. And wasm has no threads at all — which is exactly why parallel is an opt-in native feature and the wasm crate builds with --no-default-features (verified: rayon is never compiled for wasm).

Conclusion on "split across communities + harmonize": it helps when (a) the per-partition work is substantial relative to fork/join + merge overhead, and (b) the harmonize step is cheap or itself parallel. Batch reachability nails both; per-community aggregations only pay off at scale once the serial merge is amortized; cheap single scans should stay serial.
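To put a number on the Amdahl argument, the sketch below computes the speedup for a given parallel fraction and inverts it to recover the fraction implied by a measured speedup. It is a simplified model: it treats memory-bandwidth limits and load imbalance as if they were serial work, so the implied fractions are only a rough reading. Function names are hypothetical.

```rust
/// Amdahl's law: speedup on `n` cores when fraction `p` of the work
/// runs in parallel and the rest stays serial.
pub fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

/// Inverse of the formula above: the parallel fraction implied by an
/// observed `speedup` on `n` cores (requires n > 1).
/// From 1/S = 1 - p * (1 - 1/n) it follows p = (1 - 1/S) / (1 - 1/n).
pub fn implied_parallel_fraction(speedup: f64, n: f64) -> f64 {
    (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)
}
```

Under this model, the 1.63× out-degree result on 32 cores corresponds to a parallel fraction of about 0.40, while the 15.50× reachability result corresponds to about 0.97, which matches the qualitative reading above: the serial merge dominates workload B, and workload C is almost entirely parallel.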