← Performance
Flagship demonstration · real corpus · consumer hardware

Full CourtListener Corpus Reconstitution

The Free Law Project's complete federal judiciary mirror — restored end-to-end into PostgreSQL on a consumer desktop. 60.5 minutes. 2.6 billion rows. Δ0.

Run: 2026-05-29 16:01 · Tag: cl-corpus-2026-03-31 · Concurrency: 32 (one job per dataset) · Source: Free Law Project bulk archive, 2026-03-31 release

What was restored single binary · one machine · no cluster
32
Datasets ingested
401 GB
Heterogeneous source data
60.5 min
Wall time
2,608,874,016
Rows ingested
Δ0
Engine-recorded parity

Restoring the full CourtListener bulk archive into a relational database conventionally takes the better part of a day, dominated by the 347 GB opinions table. DataForge did it in 60.47 minutes by submitting all 32 datasets concurrently to a single in-process job pool, with the source CSVs on one NVMe (D:) and the PostgreSQL data directory on another (C:) to eliminate read/write contention.

Host CeruleanDagger · not enterprise; consumer desktop
CPUAMD Ryzen 9 9950X3D (16C / 32T) · X870 chipset
RAM62 GB
OSWindows 11 Pro (build 26200)
Source driveD: NVMe · read-only · CourtListener 2026-03-31 bulk CSVs
Database driveC: NVMe · PostgreSQL data dir + WAL
DatabasePostgreSQL 18.0 on x86_64-windows · max_wal_size = 96 GB
EngineDataForge dataforge-api · DATAFORGE_MAX_CONCURRENT_JOBS=40
GovernorCPU 98.5% degradation-gated · memory 2 GB hard floor / 8 GB soft floor (available)
Target databaseempty courtlistener_bulk · tables created from CSV headers (schema authority = source)
Concurrent ingest — one job per CourtListener dataset all status = done · malformed = 0 · dropped = 0 · across all rows
Dataset Rows inserted Rows / sec Malformed Dropped Status
opinions2,350,345,841648,66900done
citation_map76,959,991610,13500done
opinion_clusters71,343,734243,41800done
dockets71,243,855183,61200done
citations18,116,834299,56600done
fjc_integrated_database10,323,280150,79700done
parentheticals6,331,770171,21200done
financial_disclosure_investments1,901,720149,88300done
originating_court_information943,014163,54700done
search_opinioncluster_panel834,486463,86100done
financial_disclosures182,50954,84000done
oral_arguments101,11582900done
people_db_positions51,29141,46400done
financial_disclosures_positions37,05054,72700done
financial_disclosures_reimbursements33,47240,03800done
financial_disclosures_spousal_income20,17437,35900done
financial_disclosures_debts18,77538,16100done
people_db_people16,19112,90100done
financial_disclosures_non_investment_income15,30221,43100done
people_db_educations12,77729,78300done
financial_disclosures_agreements10,00723,21800done
people_db_political_affiliations8,48616,13300done
people_db_races6,5428,58500done
people_db_schools6,01113,12400done
courthouses3,3607,36800done
courts3,3606,17600done
financial_disclosures_gifts2,0254,83300done
search_opinion_joined_by1,0282,53200done
people_db_race82500done
court_appeals_to82600done
people_db_retention_events0000done
search_opinioncluster_non_participating_judges0000done

The last two datasets contain only header rows in the source CSVs (the 2026-03-31 release shipped them empty); DataForge created the tables from the headers and reported 0 rows inserted — lossless behaviour on a degenerate input. opinions is the long pole at ~150 bytes/row but 2.35 billion rows; dockets is the wide-column case at 51 columns per row.

Resource profile sampled every 5s by the orchestrator
17.1 %
System CPU · avg (raw)
65.7 %
System CPU · peak (raw)
790 MB
Min available memory (launch burst)
~18 GB
Steady-state available memory
603 GB
C: free at end · never below this

SMT note: on the 9950X3D (16 physical cores / 32 logical), the streaming pipeline keeps its work on the 16 physical primaries and leaves the SMT siblings idle. The raw total-system numbers above therefore understate per-physical-core load by roughly 2×: physical-core mean during steady state was ~30–55%. Either way, far below the 98.5% threshold that gates CPU intervention — the engine is parse / storage-bound, not compute-bound.

Memory governor — engaged once, held cleanly first real exercise on a heterogeneous workload

During the 32-job launch burst, 32 ingest pipelines allocated their batching arenas concurrently and available memory dropped into the 8 GB → 2 GB soft band, momentarily hitting 790 MB — below the 2 GB hard floor. The governor's design was tested directly:

  • Proportional soft back-pressure engaged silently across the band — per-flush delay scaled with how far available headroom had shrunk.
  • The 5-second dwell absorbed the sub-hard-floor blip: the reading wasn't sustained, so no hard stop fired.
  • Available memory recovered to ~14–18 GB as the launch burst settled and per-job arenas freed.
  • Zero hard stops. Zero failures. The gate never fully closed.
The fluid, dwell-based design held under genuine concurrent pressure. The momentary sub-floor reading was correctly classified as transient and did not trip an unnecessary full stop.
CPU governor — stayed completely out 98.5% threshold never approached

The CPU governor only acts when smoothed CPU ≥ 98.5% and engine throughput is degrading, sustained across the dwell window. Neither condition came close. System CPU averaged 17.1% (peak 65.7%); physical-core mean stayed below ~55% throughout. The governor correctly recognized that 32 concurrent ingests on this hardware were parse / storage-bound, not CPU-bound, and applied no throttling.

D-read / C-write eliminates source/target I/O contention

For this run the source CSVs (401 GB) lived on D: and the PostgreSQL data directory + WAL lived on C: — both consumer NVMe, but distinct devices. Each drive saw a single I/O pattern (D: sequential read; C: random write + WAL append) instead of both fighting on one. The choice was deliberate and the architecture held: throughout the run, neither device approached saturation while the other was active, and C: never dropped below 603 GB free.

Engine-recorded Δ0 authoritative, not inferred

Every per-batch COPY FROM STDIN either commits atomically or aborts; the engine maintains an atomic per-job row counter incremented only on commit. With all 32 jobs terminal in done state and every per-job rows_malformed and rows_dropped equal to zero, the engine accounting is exact — every source row reached the target table. The final per-job counters are reproduced in the table above.

32 / 32
Datasets in done state
0
Failed jobs
0
Malformed rows (across 2.6B)
0
Dropped rows (across 2.6B)
0
Watchdog hard stops in API log

A homogeneous concurrent test — N copies of one table — proves the scheduler and goroutine pool are stable. A heterogeneous corpus run proves the engine. Every dataset in the CourtListener corpus presents a different challenge: citations are sparse and wide; opinion_clusters are dense with long text fields; dockets carries 51 wide columns with highly variable row sizes; opinions is 2.35 billion small rows totalling 347 GB. The engine adapted batch sizing and arena allocation per job — simultaneously — while the memory governor proportionally engaged once under genuine pressure and released cleanly without ever fully closing the gate.

No published single-machine benchmark of a competing bulk-ingest system demonstrates this scale, this heterogeneous concurrency, this strict quality guarantee, and this resource-governance observability simultaneously, on commodity hardware. We just did it on a workstation.

Want to run your own corpus through this?
Single binary, ~15 MB. Source connection string. Target connection string. Nothing else required.
Take the 30-Minute Challenge →