Full CourtListener Corpus Reconstitution
The Free Law Project's complete federal judiciary mirror — restored end-to-end into PostgreSQL on a consumer desktop. 60.5 minutes. 2.6 billion rows. Δ0.
Restoring the full CourtListener bulk archive into a relational database conventionally takes the better part of a day, dominated by the 347 GB opinions table. DataForge did it in 60.47 minutes by submitting all 32 datasets concurrently to a single in-process job pool, with the source CSVs on one NVMe (D:) and the PostgreSQL data directory on another (C:) to eliminate read/write contention.
- max_wal_size = 96 GB
- dataforge-api · DATAFORGE_MAX_CONCURRENT_JOBS=40
- courtlistener_bulk · tables created from CSV headers (schema authority = source)
- status = done · malformed = 0 · dropped = 0 · across all rows
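As an illustration of the submission pattern described above, here is a minimal sketch of handing every dataset to one bounded in-process pool at once. It is written in Python for brevity and is not the engine's code: `run_ingest` and `submit_all` are hypothetical names, and the bound of 40 simply mirrors the `DATAFORGE_MAX_CONCURRENT_JOBS=40` setting from this run.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT_JOBS = 40  # mirrors DATAFORGE_MAX_CONCURRENT_JOBS=40 in this run


def run_ingest(dataset: str) -> dict:
    """Hypothetical stand-in for one ingest job; a real job would stream a CSV."""
    return {"dataset": dataset, "status": "done", "malformed": 0, "dropped": 0}


def submit_all(datasets: list[str], max_jobs: int = MAX_CONCURRENT_JOBS) -> dict[str, dict]:
    """Submit every dataset up front; the pool bounds how many run at once."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = {pool.submit(run_ingest, name): name for name in datasets}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    report = submit_all(["opinions", "citation_map", "dockets"])
    print(sorted(report))
```

With 32 datasets and a bound of 40, every job starts immediately, which is why the launch burst discussed later hits memory all at once.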
| Dataset | Rows inserted | Rows / sec | Malformed | Dropped | Status |
|---|---|---|---|---|---|
| opinions | 2,350,345,841 | 648,669 | 0 | 0 | done |
| citation_map | 76,959,991 | 610,135 | 0 | 0 | done |
| opinion_clusters | 71,343,734 | 243,418 | 0 | 0 | done |
| dockets | 71,243,855 | 183,612 | 0 | 0 | done |
| citations | 18,116,834 | 299,566 | 0 | 0 | done |
| fjc_integrated_database | 10,323,280 | 150,797 | 0 | 0 | done |
| parentheticals | 6,331,770 | 171,212 | 0 | 0 | done |
| financial_disclosure_investments | 1,901,720 | 149,883 | 0 | 0 | done |
| originating_court_information | 943,014 | 163,547 | 0 | 0 | done |
| search_opinioncluster_panel | 834,486 | 463,861 | 0 | 0 | done |
| financial_disclosures | 182,509 | 54,840 | 0 | 0 | done |
| oral_arguments | 101,115 | 829 | 0 | 0 | done |
| people_db_positions | 51,291 | 41,464 | 0 | 0 | done |
| financial_disclosures_positions | 37,050 | 54,727 | 0 | 0 | done |
| financial_disclosures_reimbursements | 33,472 | 40,038 | 0 | 0 | done |
| financial_disclosures_spousal_income | 20,174 | 37,359 | 0 | 0 | done |
| financial_disclosures_debts | 18,775 | 38,161 | 0 | 0 | done |
| people_db_people | 16,191 | 12,901 | 0 | 0 | done |
| financial_disclosures_non_investment_income | 15,302 | 21,431 | 0 | 0 | done |
| people_db_educations | 12,777 | 29,783 | 0 | 0 | done |
| financial_disclosures_agreements | 10,007 | 23,218 | 0 | 0 | done |
| people_db_political_affiliations | 8,486 | 16,133 | 0 | 0 | done |
| people_db_races | 6,542 | 8,585 | 0 | 0 | done |
| people_db_schools | 6,011 | 13,124 | 0 | 0 | done |
| courthouses | 3,360 | 7,368 | 0 | 0 | done |
| courts | 3,360 | 6,176 | 0 | 0 | done |
| financial_disclosures_gifts | 2,025 | 4,833 | 0 | 0 | done |
| search_opinion_joined_by | 1,028 | 2,532 | 0 | 0 | done |
| people_db_race | 8 | 25 | 0 | 0 | done |
| court_appeals_to | 8 | 26 | 0 | 0 | done |
| people_db_retention_events | 0 | 0 | 0 | 0 | done |
| search_opinioncluster_non_participating_judges | 0 | 0 | 0 | 0 | done |
The last two datasets contain only header rows in the source CSVs (the 2026-03-31 release shipped them empty); DataForge created the tables from the headers and reported 0 rows inserted, which is the lossless outcome for a degenerate input. opinions is the long pole: only ~150 bytes per row, but 2.35 billion rows. dockets is the wide-column case at 51 columns per row.
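To make the header-only case concrete, the sketch below derives a table definition from a CSV header and counts the data rows that follow. It is an assumption-laden illustration: `create_table_sql` is a hypothetical helper, every column is typed `TEXT` because the source does not say which types the engine assigns, and the column names in the usage example are invented.

```python
import csv
import io


def create_table_sql(table: str, csv_text: str) -> tuple[str, int]:
    """Build a CREATE TABLE statement from the CSV header and count data rows.

    Assumption: every column is typed TEXT; the real engine may choose
    other types.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Double any embedded quotes so each name is a valid quoted identifier.
    columns = ", ".join('"' + name.replace('"', '""') + '" TEXT' for name in header)
    data_rows = sum(1 for _ in reader)
    return f'CREATE TABLE "{table}" ({columns});', data_rows


if __name__ == "__main__":
    # Header-only input: the table is still defined, and zero rows follow.
    sql, rows = create_table_sql("people_db_retention_events", "id,date_created\n")
    print(sql, rows)
```

A header-only file still yields a complete table definition and a row count of zero, which matches the two empty datasets in the table above.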
SMT note: on the 9950X3D (16 physical cores / 32 logical), the streaming pipeline keeps its work on the 16 physical primaries and leaves the SMT siblings idle. The raw total-system CPU figures therefore understate per-physical-core load by roughly 2×: the physical-core mean during steady state was ~30–55%. Either way, load stayed far below the 98.5% threshold that gates CPU intervention; the engine is parse / storage-bound, not compute-bound.
During the 32-job launch burst, 32 ingest pipelines allocated their batching arenas concurrently and available memory dropped into the 8 GB → 2 GB soft band, momentarily hitting 790 MB — below the 2 GB hard floor. The governor's design was tested directly:
- Proportional soft back-pressure engaged silently across the band — per-flush delay scaled with how far available headroom had shrunk.
- The 5-second dwell absorbed the sub-hard-floor blip: the reading wasn't sustained, so no hard stop fired.
- Available memory recovered to ~14–18 GB as the launch burst settled and per-job arenas freed.
- Zero hard stops. Zero failures. The gate never fully closed.
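The behaviour in the list above can be sketched as a small governor with a proportional soft band and a dwell-gated hard stop. This is a sketch under stated assumptions, not the engine's implementation: the class and method names are hypothetical, the linear delay curve and the 0.5-second maximum delay are invented for illustration, and only the 8 GB, 2 GB, and 5-second figures come from the run.

```python
from dataclasses import dataclass
from typing import Optional

GIB = 1024 ** 3


@dataclass
class MemoryGovernor:
    soft_ceiling: float = 8 * GIB   # back-pressure starts below this
    hard_floor: float = 2 * GIB     # a hard stop needs a sustained reading below this
    dwell_seconds: float = 5.0      # how long the low reading must persist
    max_delay: float = 0.5          # assumed maximum per-flush delay, in seconds
    _below_since: Optional[float] = None

    def flush_delay(self, available: float) -> float:
        """Per-flush delay that grows linearly as headroom shrinks through the band."""
        if available >= self.soft_ceiling:
            return 0.0
        band = self.soft_ceiling - self.hard_floor
        pressure = min(1.0, (self.soft_ceiling - available) / band)
        return pressure * self.max_delay

    def hard_stop(self, available: float, now: float) -> bool:
        """True only when available memory has stayed below the floor for the dwell."""
        if available >= self.hard_floor:
            self._below_since = None
            return False
        if self._below_since is None:
            self._below_since = now
        return now - self._below_since >= self.dwell_seconds


if __name__ == "__main__":
    gov = MemoryGovernor()
    blip = 790 * 1024 ** 2                    # the momentary 790 MB reading
    print(gov.flush_delay(blip))              # delay is at its maximum
    print(gov.hard_stop(blip, now=0.0))       # not sustained yet, so no stop
    print(gov.hard_stop(14 * GIB, now=1.0))   # recovery resets the dwell timer
```

Separating the delay from the stop decision is what lets a brief dip below the floor slow flushes without closing the gate.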
The CPU governor only acts when smoothed CPU ≥ 98.5% and engine throughput is degrading, sustained across the dwell window. Neither condition came close. System CPU averaged 17.1% (peak 65.7%); physical-core mean stayed below ~55% throughout. The governor correctly recognized that 32 concurrent ingests on this hardware were parse / storage-bound, not CPU-bound, and applied no throttling.
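The two-condition gate described above can be sketched as follows. The sketch is hypothetical throughout: `CpuGovernor` is an invented name, an exponential moving average stands in for whatever smoothing the engine uses, "degrading" is approximated as a rate lower than the previous sample, and the 5-second dwell is an assumption. Only the 98.5% threshold is taken from the run.

```python
from typing import Optional


class CpuGovernor:
    """Throttle only when smoothed CPU is high AND throughput is falling, sustained."""

    def __init__(self, threshold: float = 98.5, dwell_seconds: float = 5.0,
                 alpha: float = 0.25) -> None:
        self.threshold = threshold          # percent, the gate stated for this run
        self.dwell_seconds = dwell_seconds  # assumed dwell length
        self.alpha = alpha                  # assumed smoothing factor
        self.smoothed: Optional[float] = None
        self.last_rate: Optional[float] = None
        self.since: Optional[float] = None

    def should_throttle(self, cpu_percent: float, rows_per_sec: float, now: float) -> bool:
        # Exponential moving average of the CPU reading.
        if self.smoothed is None:
            self.smoothed = cpu_percent
        else:
            self.smoothed = self.alpha * cpu_percent + (1 - self.alpha) * self.smoothed
        # Throughput counts as degrading when it is lower than the previous sample.
        degrading = self.last_rate is not None and rows_per_sec < self.last_rate
        self.last_rate = rows_per_sec
        if self.smoothed >= self.threshold and degrading:
            if self.since is None:
                self.since = now
            return now - self.since >= self.dwell_seconds
        self.since = None
        return False


if __name__ == "__main__":
    gov = CpuGovernor()
    # Readings like this run's peak (65.7%) and average (17.1%) never trip the gate.
    print(gov.should_throttle(65.7, 600_000, now=0.0))
    print(gov.should_throttle(17.1, 590_000, now=10.0))
```

Requiring both conditions keeps the governor from throttling a job that is merely busy, or one whose throughput dips for reasons unrelated to CPU.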
For this run the source CSVs (401 GB) lived on D: and the PostgreSQL data directory + WAL lived on C: — both consumer NVMe, but distinct devices. Each drive saw a single I/O pattern (D: sequential read; C: random write + WAL append) instead of both fighting on one. The choice was deliberate and the architecture held: throughout the run, neither device approached saturation while the other was active, and C: never dropped below 603 GB free.
Every per-batch COPY FROM STDIN either commits atomically or aborts; the engine maintains an atomic per-job row counter incremented only on commit. With all 32 jobs terminal in done state and every per-job rows_malformed and rows_dropped equal to zero, the engine accounting is exact — every source row reached the target table. The final per-job counters are reproduced in the table above.
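The accounting rule can be illustrated with a counter that advances only after a batch commit returns cleanly. This is a minimal sketch, not the engine's code: `JobCounter`, `ingest`, and the `copy_batch` callable are hypothetical, and `copy_batch` stands in for one transactional COPY per batch rather than calling any real database driver.

```python
import threading
from typing import Callable, Iterable, List


class JobCounter:
    """Row counter that advances only after a batch commit succeeds."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.rows_inserted = 0

    def add(self, n: int) -> None:
        with self._lock:
            self.rows_inserted += n


def ingest(batches: Iterable[List[str]],
           copy_batch: Callable[[List[str]], None],
           counter: JobCounter) -> str:
    """Run each batch through copy_batch; count rows only when it returns cleanly."""
    for batch in batches:
        try:
            copy_batch(batch)      # hypothetical: one transactional COPY per batch
        except Exception:
            return "failed"        # an aborted batch adds nothing to the counter
        counter.add(len(batch))
    return "done"


if __name__ == "__main__":
    counter = JobCounter()
    status = ingest([["row1", "row2"], ["row3"]], lambda batch: None, counter)
    print(status, counter.rows_inserted)
```

Because the counter moves only after a commit, a job that ends in the done state has a count that reflects committed rows and nothing else.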
A homogeneous concurrent test (N copies of one table) proves the scheduler and goroutine pool are stable. A heterogeneous corpus run proves the engine. Every dataset in the CourtListener corpus presents a different challenge: citations are sparse and wide; opinion_clusters are dense with long text fields; dockets carries 51 wide columns with highly variable row sizes; opinions is 2.35 billion small rows totalling 347 GB. The engine adapted batch sizing and arena allocation per job, simultaneously, while the memory governor proportionally engaged once under genuine pressure and released cleanly without ever fully closing the gate.
No published single-machine benchmark of a competing bulk-ingest system demonstrates this scale, this heterogeneous concurrency, this strict quality guarantee, and this resource-governance observability simultaneously, on commodity hardware. We just did it on a workstation.