
DataForge Benchmark Runbook

Step-by-step instructions for running the DataForge benchmark harness against your own PostgreSQL instance. Covers prerequisites, installation, configuration, execution, and result interpretation.

Engine: dataforge-0.4.x · Harness: bench + bench-report · Requires PostgreSQL 15+

01 · Prerequisites

  • Operating System: Linux x86_64 (Ubuntu 22.04+ or equivalent)
  • PostgreSQL: 15.0 or higher (18.0 recommended)
  • CPU: 16+ cores recommended for multi-worker runs (AMD Ryzen 9 7950X class or equivalent)
  • RAM: 32 GB minimum; 64 GB recommended for 20-worker saturation runs
  • Disk: NVMe SSD strongly recommended. Each worker requires its own copy of the source file to avoid I/O contention; a 20-worker run requires ~26 GB of staged data.
  • Network: local PostgreSQL socket preferred for baseline runs; networked databases supported for cloud characterization
Pre-flight check: The harness performs a system inventory before staging any data. Review the reported CPU, RAM, disk, and PostgreSQL version before proceeding. Ensure your available disk space exceeds the total staged data size for your chosen concurrency level.
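The same numbers can be checked by hand before staging anything. A minimal sketch, assuming GNU coreutils and the full-suite footprint of ~26 GB quoted above:

```shell
# Independent pre-flight check (the harness runs its own inventory;
# this just confirms the basics before you stage ~26 GB of data)
REQUIRED_GB=26
AVAIL_GB=$(df --output=avail -BG . | tail -1 | tr -dc '0-9')
echo "cores=$(nproc)  avail=${AVAIL_GB}G  required=${REQUIRED_GB}G"
if [ "$AVAIL_GB" -ge "$REQUIRED_GB" ]; then echo "disk: OK"; else echo "disk: INSUFFICIENT"; fi
```

Lower the `REQUIRED_GB` figure accordingly if you only plan to run the 1- or 10-worker presets.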
02 · Installation

The benchmark kit ships as a tarball. Extract it and set execute permissions on the binaries.

Verify checksums (SHA-256)
sha256sum dataforge-bench-20260326-2cbe1b1.tar.gz
# expected: 1b0a3d0b13257eab17b495fa3a2000a66d07b28ad2ceb9a6a44784f3510b438f

sha256sum citation-map-2025-10-31.tar
# expected: 248a8863f2bcdbb13468831db4210532b695af731bd3fcc6e643757874c62ae3
Extract and install
tar -xzf dataforge-bench-20260326-2cbe1b1.tar.gz
cd dataforge-bench-20260326-2cbe1b1/
chmod +x bench bench-report
./bench --version

The kit contains:

  • bench — ingestion benchmark binary
  • bench-report — report generator (reads run output JSON, produces HTML)
  • bench.yaml — pre-built configuration template
  • A link to this runbook

The DataForge API server starts automatically when bench runs. No separate service startup is required.

Dataset: The benchmark uses the CourtListener opinions corpus as its adversarial test dataset (75,814,101 rows, 42 columns, mixed types including long-text fields). Download instructions are included in the kit README. The dataset is publicly available from CourtListener.
03 · Configuration

Edit bench.yaml to point at your PostgreSQL instance. The remaining defaults are tuned for the standard benchmark suite.

bench.yaml
database:
  host: localhost
  port: 5432
  name: dataforge_bench
  user: bench_user
  password: ""         # set via BENCH_DB_PASSWORD env var

source:
  file: ./data/courtlistener_opinions_2024.csv
  format: csv

benchmark:
  concurrency_presets: [1, 10, 20]   # workers per run
  staging_strategy: per_worker_copy   # each worker gets its own file copy
  output_dir: ./runs/
  tag: ""                              # optional label for this run session

Staging strategy: per_worker_copy gives each worker its own copy of the source file, eliminating read I/O contention between workers. This is the correct strategy for characterizing ingestion throughput. It requires additional disk space (N workers × source file size).
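The staging footprint is straightforward arithmetic. A sketch using an illustrative per-copy size of ~1.3 GB (in practice, read the real size with `stat -c%s` on your source file):

```shell
# Staging footprint = workers × source file size (per_worker_copy strategy).
# SRC_BYTES below is an assumed size; use the actual file size, e.g.:
#   stat -c%s ./data/courtlistener_opinions_2024.csv
WORKERS=20
SRC_BYTES=$((1300 * 1024 * 1024))              # assume ~1.3 GB per copy
NEED_BYTES=$((WORKERS * SRC_BYTES))
echo "staged data for $WORKERS workers: $((NEED_BYTES / 1024 / 1024 / 1024)) GB"
```

With the exact file size substituted in, this lines up with the ~26 GB full-suite figure quoted in the prerequisites.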

Concurrency presets:

  • [1] — Baseline characterization. Isolates the engine's single-worker throughput. Use this as your reference point.
  • [1, 10] — Baseline + concurrency validation. Shows how the engine scales under parallel load without saturating your storage.
  • [1, 10, 20] — Full suite including saturation test. The 20-worker run will approach or reach your storage write ceiling. Disk utilization is expected to be high.
Create benchmark database
psql -U postgres -c "CREATE USER bench_user WITH PASSWORD 'your_password';"
psql -U postgres -c "CREATE DATABASE dataforge_bench OWNER bench_user;"

Making bench_user the database owner matters on PostgreSQL 15+: since the public schema no longer grants CREATE to all users by default, a plain GRANT ALL PRIVILEGES ON DATABASE is not enough for the harness to create tables.
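Before moving on, it is worth confirming the benchmark role can actually connect. A sketch using the same host/user/database names as bench.yaml and the BENCH_DB_PASSWORD variable the harness expects:

```shell
# Quick connection smoke test (connection parameters match bench.yaml)
CONN="host=localhost port=5432 dbname=dataforge_bench user=bench_user"
if command -v psql >/dev/null 2>&1; then
  PGPASSWORD="$BENCH_DB_PASSWORD" psql "$CONN" -Atc "SELECT 1;" \
    || echo "connection failed: check credentials, pg_hba.conf, and that the server is up"
else
  echo "psql not found: install a PostgreSQL client first"
fi
```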
04 · Running the Benchmark

Run the full benchmark suite using the configured presets:

Run benchmark suite
export BENCH_DB_PASSWORD=your_password
./bench run --config bench.yaml --tag "my-run-$(date +%Y%m%d)"

To run a single concurrency level (e.g., baseline only):

Single preset
./bench run --config bench.yaml --concurrency 1

Each run produces a JSON result file in ./runs/ named c{N}.json (e.g., c1.json, c10.json). To generate the HTML report:

Generate report
./bench-report --runs ./runs/ --out ./bench_report.html --tag "my-run"
Expected run times on reference hardware (AMD Ryzen 9 7950X, NVMe, PostgreSQL 18 local):
1 worker: ~30s · 10 workers: ~140s · 20 workers: ~290s. Your times will vary with hardware, PostgreSQL configuration, and storage class.
05 · Interpreting Results

Rows/sec — The primary throughput metric. For single-worker runs, this is the engine's clean ingestion rate into your PostgreSQL configuration. For multi-worker runs, the aggregate figure reflects total throughput across all concurrent jobs.
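The metric is simply total rows over wall-clock seconds, so it can be recomputed from any run's totals. A sketch using the single-worker reference figures (the harness's reported value differs slightly because it uses unrounded wall time):

```shell
# rows/sec = total rows / wall-clock seconds (awk handles the floating point)
TOTAL_ROWS=75814101
WALL_SECONDS=30.1
awk -v r="$TOTAL_ROWS" -v t="$WALL_SECONDS" 'BEGIN { printf "%.0f rows/sec\n", r / t }'
```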

Disk utilization % — Values below 50% indicate disk is not yet the bottleneck. As you add workers, watch this metric. When utilization approaches 90–100%, your storage device is the limiting factor — not the engine. Throughput will plateau or decline slightly, but integrity should remain unaffected.

Write queue depth — Average I/O queue depth across the run. Values above 8–10 indicate the storage subsystem is absorbing more writes than it can service immediately. This is expected behavior at high concurrency and is not a sign of data loss risk.
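Both disk metrics can be watched live from a second terminal while a run is in flight. A sketch using iostat (from the sysstat package; `%util` and `aqu-sz` are the columns that correspond to the two metrics above):

```shell
# Extended device stats, three one-second samples; watch %util and aqu-sz
if command -v iostat >/dev/null 2>&1; then
  iostat -x 1 3
else
  echo "iostat not found: install the sysstat package"
fi
```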

CPU utilization — DataForge is primarily I/O bound, not CPU bound. High CPU% (90%+) at low disk utilization suggests a CPU-constrained database configuration (e.g., insufficient shared_buffers, synchronous_commit overhead). Review PostgreSQL configuration if you observe this pattern.

Delta — Row-count difference between source and destination after each run. This must be zero. Any non-zero delta indicates a configuration or connectivity issue. The engine does not silently drop rows.
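The delta can also be verified independently of the harness. A minimal sketch on a toy CSV; the destination table name `opinions` is an assumption, so substitute your actual table and source file:

```shell
# Source rows = line count minus the header; destination rows come from the DB
printf 'id,text\n1,a\n2,b\n' > /tmp/delta_demo.csv    # stand-in for the real CSV
SRC_ROWS=$(($(wc -l < /tmp/delta_demo.csv) - 1))
DST_ROWS=2    # in practice: psql -At -c "SELECT count(*) FROM opinions;"
echo "delta: $((SRC_ROWS - DST_ROWS))"                # must print: delta: 0
```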

Stage 4 validation — Each run ends with a four-point check: row count, schema conformance, checksum, and job completion status. All must pass. If any validation fails, the run result is flagged and the report highlights the failure.

Warmed vs. cold conditions: The first run into a fresh PostgreSQL instance will typically be slower due to write-ahead log initialization and buffer pool warming. Subsequent runs against the same instance will often show higher throughput as WAL efficiency improves. The published reference results reflect warmed conditions. Both are valid measurements — just label which condition applies.
06 · Expected Results Reference

Reference results at AMD Ryzen 9 7950X / 63.2 GB RAM / NVMe SSD / PostgreSQL 18.0 local socket. Your results will vary based on hardware class, storage type, and PostgreSQL configuration.

Concurrency    Total Rows       Wall Time   Agg. Rows/sec   CPU Avg                Disk Util                 Integrity
1 worker       75,814,101       30.1s       2,516,818       ~12%                   ~22%                      PASS
10 workers     758,141,010      139.7s      5,426,774       39.3%                  41.5%                     PASS
20 workers     1,516,282,020    291.4s      5,202,719       ~80% avg, 98.3% peak   54.5%, 1,954 MB/s peak    PASS
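As a sanity check on the shape of your own results, the scaling behavior can be quantified directly from the reference aggregate figures:

```shell
# Scaling ratios derived from the reference table's Agg. Rows/sec column
awk 'BEGIN {
  base = 2516818; w10 = 5426774; w20 = 5202719
  printf "10-worker speedup over baseline: %.2fx\n", w10 / base
  printf "20-worker vs 10-worker change: %+.1f%%\n", (w20 - w10) / w10 * 100
}'
```

The slight decline from 10 to 20 workers is the saturation plateau described above, consistent with the 98.3% peak disk utilization in that row.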
Variance guidance: Single-worker throughput within ±15% of reference is expected across comparable NVMe hardware. Significant deviation (more than 30% below reference) typically indicates PostgreSQL configuration overhead (synchronous_commit, max_wal_size) or storage class differences. Cloud database targets (Cloud SQL, RDS, Azure Database) will show lower throughput due to network transport and managed service overhead — this is expected and not a DataForge limitation.
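If your numbers come in well below reference, the settings named above are the first things to read back. A sketch that degrades gracefully when the server is unreachable:

```shell
# Read back the settings that most often explain sub-reference throughput
for s in synchronous_commit max_wal_size shared_buffers; do
  printf '%s = ' "$s"
  psql -At -U bench_user -d dataforge_bench -c "SHOW $s;" 2>/dev/null \
    || echo "(unavailable: is the server up?)"
done
```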
07 · Contact

For questions about benchmark configuration, result interpretation, or to discuss your findings:

[email protected]

Include your bench-report HTML or the ./runs/ directory contents when describing unexpected results. The JSON run files contain the full execution context needed to diagnose configuration-specific behavior.
