Runbook

DataForge Benchmark Runbook

Step-by-step instructions for running the DataForge benchmark harness against your own PostgreSQL instance. Covers prerequisites, installation, configuration, execution, and result interpretation.

Engine: dataforge-0.4.x · Harness: bench + bench-report · Requires PostgreSQL 15+

// Contents

Prerequisites
Installation
Configuration
Running the Benchmark
Interpreting Results
Expected Results Reference
Contact

Prerequisites

Operating System

Native Linux x86_64 (Ubuntu 22.04+) or a dedicated cloud Linux VM. WSL is not supported for full benchmark runs — its virtual disk and memory caps cause SIGBUS / bus errors and filesystem corruption under sustained ingestion load. Use a bare-metal Linux box or a cloud VM (e.g. a GCP/AWS instance) with the disk sized below.

PostgreSQL

15.0 or higher (18.0 recommended)

CPU

16+ cores recommended for multi-worker runs. AMD Ryzen 9 7950X class or equivalent.

RAM

32 GB minimum. 64 GB recommended for 20-worker saturation runs.

Disk

NVMe SSD strongly recommended. Each worker requires its own copy of the source file to avoid I/O contention. 20-worker run requires ~26 GB of staged data.

Network

Local PostgreSQL socket preferred for baseline runs. Network DB supported for cloud characterization.

Pre-flight check: The harness performs a system inventory before staging any data. Review the reported CPU, RAM, disk, and PostgreSQL version before proceeding. Ensure your available disk space exceeds the total staged data size for your chosen concurrency level.

Installation

The benchmark kit ships as a tarball. Extract it and set execute permissions on the binaries.

Verify checksums (SHA-256)

sha256sum dataforge-bench-20260407-86a7745.tar.gz
# expected: 623f9bdc85c97d24700b569306718c4737cf4a155b13bfe0061d18e89e3264b3

sha256sum citation-map-2025-10-31.tar
# expected: 248a8863f2bcdbb13468831db4210532b695af731bd3fcc6e643757874c62ae3

Extract and install

tar -xzf dataforge-bench-20260407-86a7745.tar.gz
cd dataforge-bench-20260407-86a7745/
chmod +x bench bench-report
./bench --version

The kit contains:

bench — ingestion benchmark binary
bench-report — report generator (reads run output JSON, produces HTML)
bench.yaml — pre-built configuration template
This runbook URL

The DataForge API server starts automatically when bench runs. No separate service startup is required.

Dataset: The benchmark uses the CourtListener opinions corpus as its adversarial test dataset (75,814,101 rows, 42 columns, mixed types including long-text fields). Download instructions are included in the kit README. The dataset is publicly available from CourtListener.

Configuration

Edit bench.yaml to point at your PostgreSQL instance. The remaining defaults are tuned for the standard benchmark suite.

bench.yaml

database:
  host: localhost
  port: 5432
  name: dataforge_bench
  user: bench_user
  password: ""         # set via BENCH_DB_PASSWORD env var

source:
  file: ./data/courtlistener_opinions_2024.csv
  format: csv

benchmark:
  concurrency_presets: [1, 10, 20]   # workers per run
  staging_strategy: per_worker_copy   # each worker gets its own file copy
  output_dir: ./runs/
  tag: ""                              # optional label for this run session

Staging strategy: per_worker_copy gives each worker its own copy of the source file, eliminating read I/O contention between workers. This is the correct strategy for characterizing ingestion throughput. It requires additional disk space (N workers × source file size).

Concurrency presets:

[1] — Baseline characterization. Isolates the engine's single-worker throughput. Use this as your reference point.
[1, 10] — Baseline + concurrency validation. Shows how the engine scales under parallel load without saturating your storage.
[1, 10, 20] — Full suite including saturation test. The 20-worker run will approach or reach your storage write ceiling. Disk utilization is expected to be high.

Create benchmark database

psql -U postgres -c "CREATE DATABASE dataforge_bench;"
psql -U postgres -c "CREATE USER bench_user WITH PASSWORD 'your_password';"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE dataforge_bench TO bench_user;"

Running the Benchmark

Run the full benchmark suite using the configured presets:

Run benchmark suite

export BENCH_DB_PASSWORD=your_password
./bench run --config bench.yaml --tag "my-run-$(date +%Y%m%d)"

To run a single concurrency level (e.g., baseline only):

Single preset

./bench run --config bench.yaml --concurrency 1

Each run produces a JSON result file in ./runs/ named c{N}.json (e.g., c1.json, c10.json). To generate the HTML report:

Generate report

./bench-report --runs ./runs/ --out ./bench_report.html --tag "my-run"

Expected run times at reference hardware (AMD Ryzen 9 7950X, NVMe, PostgreSQL 18 local):
1-worker: ~30s · 10-worker: ~140s · 20-worker: ~290s. Your times will vary based on hardware, PostgreSQL configuration, and storage class.

Interpreting Results

Rows/sec — The primary throughput metric. For single-worker runs, this is the engine's clean ingestion rate into your PostgreSQL configuration. For multi-worker runs, the aggregate figure reflects total throughput across all concurrent jobs.

Disk utilization % — Values below 50% indicate disk is not yet the bottleneck. As you add workers, watch this metric. When utilization approaches 90–100%, your storage device is the limiting factor — not the engine. Throughput will plateau or decline slightly, but integrity should remain unaffected.

Write queue depth — Average I/O queue depth across the run. Values above 8–10 indicate the storage subsystem is absorbing more writes than it can service immediately. This is expected behavior at high concurrency and is not a sign of data loss risk.

CPU utilization — DataForge is primarily I/O bound, not CPU bound. High CPU% (90%+) at low disk utilization suggests a CPU-constrained database configuration (e.g., insufficient shared_buffers, synchronous_commit overhead). Review PostgreSQL configuration if you observe this pattern.

Delta — Row-count difference between source and destination after each run. This must be zero. Any non-zero delta indicates a configuration or connectivity issue. The engine does not silently drop rows.

Stage 4 validation — Each run ends with a four-point check: row count, schema conformance, checksum, and job completion status. All must pass. If any validation fails, the run result is flagged and the report highlights the failure.

Warmed vs. cold conditions: The first run into a fresh PostgreSQL instance will typically be slower due to write-ahead log initialization and buffer pool warming. Subsequent runs against the same instance will often show higher throughput as WAL efficiency improves. The published reference results reflect warmed conditions. Both are valid measurements — just label which condition applies.

Expected Results Reference

Reference results at AMD Ryzen 9 7950X / 63.2 GB RAM / NVMe SSD / PostgreSQL 18.0 local socket. Your results will vary based on hardware class, storage type, and PostgreSQL configuration.

Concurrency	Total Rows	Wall Time	Agg. Rows/sec	CPU Avg	Disk Util	Integrity
1 worker	75,814,101	30.1s	2,516,818	~12%	~22%	PASS
10 workers	758,141,010	139.7s	5,426,774	39.3%	41.5%	PASS
20 workers	1,516,282,020	291.4s	5,202,719	~80% avg 98.3% peak	54.5% 1,954 MB/s peak	PASS

Variance guidance: Single-worker throughput within ±15% of reference is expected across comparable NVMe hardware. Significant deviation (more than 30% below reference) typically indicates PostgreSQL configuration overhead (synchronous_commit, max_wal_size) or storage class differences. Cloud database targets (Cloud SQL, RDS, Azure Database) will show lower throughput due to network transport and managed service overhead — this is expected and not a DataForge limitation.

Contact

For questions about benchmark configuration, result interpretation, or to discuss your findings:

[email protected]

Include your bench-report HTML or the ./runs/ directory contents when describing unexpected results. The JSON run files contain the full execution context needed to diagnose configuration-specific behavior.

View Published Evidence Back to Kit Download

Ready to
talk throughput?

Pilot discussions, investor conversations, enterprise architecture review, or technical deep-dives.

Send an inquiry Read the research