Performance claims

Performance Observations & Execution Characteristics

The DataForge engine has been validated across local and cloud execution environments using real-world datasets and production-representative configurations. All results reflect sustained throughput under controlled conditions and are reproducible with comparable inputs.

Performance Summary

Validated results across execution environments and target systems.

8,151,597 rows / sec · enterprise hardware peak
Intel Xeon Gold 6326 · Pure Storage FlashArray X90R4
30 concurrent workers · 2.274B rows · zero failures · zero delta · Linux · PostgreSQL 18

3,032,564,040 rows · single machine · 40 workers · Δ0
Maximum validated scale · same hardware
Zero failed jobs · zero dropped rows · zero malformed · deterministic parity confirmed

2,516,818 rows / sec · 86.0 MB/sec throughput
Consumer NVMe · local PostgreSQL
75.8M rows · 30.1s elapsed · single worker · warmed system conditions

883,017 rows / sec
Cloud SQL (Enterprise Plus · 8 vCPU / 64 GB)
Executed via Cloud Run Jobs

~300,000 rows / sec
Local SQL Server ingestion
C# native implementation · SqlBulkCopy

249,353 rows / sec
Local SQL Server ingestion
Go implementation

~8× throughput advantage
PostgreSQL vs SQL Server
Under comparable conditions and identical source data

413,325 rows / sec · no staging
SQL Server → PostgreSQL · direct DB-to-DB
75.8M rows · 3m03s elapsed · lossless · zero intermediary

241,009 rows / sec · no staging
PostgreSQL → SQL Server · direct DB-to-DB
75.8M rows · 5m14s elapsed · lossless · zero intermediary

Benchmark Record

Two hardware classes. Same engine. Same pipeline. Same zero-delta guarantee.

2,274,000,000 rows inserted · 30 workers
8,151,597 rows / sec · peak
Δ0 net row delta
0 failed jobs
2,840,000 rows / sec · single worker (c1)
6,530,000 rows / sec · 10-worker peak (fresh drive)
5,488,766 rows / sec · 20-worker · storage-saturated
Δ0 Stage 3 parity · all runs
  • Same binary, same pipeline, same zero-delta guarantee on consumer and enterprise hardware
  • ~1.5× the prior 7950X ladder at the same concurrency on the same dataset
  • c10 fresh-drive 6.53M rows/sec; sustained back-to-back runs settle to a ~5.2M rows/sec storage-bound floor as the consumer NVMe's SLC write cache exhausts — a drive characteristic, not an engine limit
  • CPU never approached saturation on real workloads — engine is parse/storage-bound, not compute-bound
2,840,000 rows/sec c=1 · single worker · 75.8M rows
6,530,000 rows/sec c=10 · fresh-drive · 758M rows
5,488,766 rows/sec c=20 · storage-saturated · 1.516B rows
0 failed · 0 malformed · Stage 3 Δ0 throughout

Auto-generated HTML reports from the bench harness — environment fingerprint, per-phase system metrics, PostgreSQL deltas, and Stage 3 reconciliation are included in each.

All reports were produced by running bench against a live dataforge-api instance — the same harness any prospect downloads from the Benchmark Kit. No staging, no post-processing.

// 40-worker run — enterprise hardware — maximum validated scale
3,032,564,040
total rows ingested · single machine · 40 concurrent workers
  • Peak throughput: 8,151,597 rows/sec
  • Hardware: Intel Xeon Gold 6326 · Pure Storage X90R4
  • Failed jobs: 0
  • Dropped rows: 0
  • Row-count delta: 0
On the Pure Storage X90R4, the bottleneck migrated entirely off storage and onto CPU. The engine reached 8.15M rows/sec sustained: 3.03 billion rows, one machine, zero failures, exact parity. The software introduced no ceiling of its own. This is what the same binary does on enterprise storage; the rest of this page is what it does on a consumer desktop.

The Free Law Project's CourtListener bulk archive is a large public corpus of U.S. court records: 32 structurally heterogeneous datasets totaling 401 GB. Conventional restoration tooling typically takes the better part of a day, especially on the 347 GB opinions table. DataForge restored the entire corpus into an empty PostgreSQL 18 database in a single concurrent pass on a consumer desktop, with the source on one NVMe (D:) and the database on another (C:) to eliminate read/write contention. Drive separation held; the engine did the rest.

// CourtListener full corpus — all 32 datasets — concurrent reconstitution — 2026-05-29
60.5 min
wall time · 401 GB heterogeneous source · 32 distinct schemas · single binary · one consumer desktop
  • Datasets ingested: 32 / 32, including the 347 GB opinions table
  • Total rows: 2,608,874,016 (engine-recorded)
  • Aggregate throughput: 719,109 rows/sec sustained for 60 min
  • Hardware: AMD Ryzen 9 9950X3D (16C/32T) · 62 GB RAM · consumer NVMe · X870
  • Schema range: 3 cols → 51 cols (dockets) · ~150 B/row (opinions) → multi-KB/row
  • Failed jobs: 0
  • Malformed rows: 0
  • Dropped rows: 0
  • Row-count delta: 0 (lossless)
  • Governor hard stops: 0
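
The wall time and aggregate throughput in the summary above can be cross-checked against each other with simple arithmetic. The snippet below is only a sanity check on the published figures; the function name is illustrative and not part of DataForge.

```python
# Figures copied from the run summary above.
TOTAL_ROWS = 2_608_874_016
ROWS_PER_SEC = 719_109


def elapsed_minutes(total_rows: int, rows_per_sec: float) -> float:
    """Wall time, in minutes, implied by a row count and an average rate."""
    return total_rows / rows_per_sec / 60


minutes = elapsed_minutes(TOTAL_ROWS, ROWS_PER_SEC)
print(f"{minutes:.1f} min")  # 60.5 min
```

The implied duration matches the reported 60.5 min wall time to one decimal place.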

A homogeneous concurrent test proves the scheduler and goroutine pool are stable. A heterogeneous corpus run proves the engine. Every dataset in the CourtListener corpus presents a different challenge: citation tables are wide and sparse; opinion clusters are dense with long text fields; opinions itself is 2.35 billion small rows totaling 347 GB; dockets carry 51 wide columns with highly variable row sizes. The engine adapted batch sizing and arena allocation per job — simultaneously — while the memory governor proportionally engaged once under genuine pressure (available memory dipped to 790 MB during the 32-job launch burst) and released cleanly without ever fully closing the gate. Zero failures across 2.6 billion rows.
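
To make "proportionally engaged" concrete, here is a minimal sketch of a proportional admission gate. It is not the DataForge governor: the function name and both thresholds (512 MB and 2048 MB) are assumptions chosen only to illustrate the idea of narrowing a gate under memory pressure rather than slamming it shut.

```python
def governor_gate(available_mb: float,
                  low_mb: float = 512.0,     # hypothetical hard-stop threshold
                  high_mb: float = 2048.0    # hypothetical fully-open threshold
                  ) -> float:
    """Fraction of new work admitted: 1.0 is fully open, 0.0 is a hard stop."""
    if available_mb >= high_mb:
        return 1.0
    if available_mb <= low_mb:
        return 0.0
    # Between the thresholds the gate narrows linearly with available memory.
    return (available_mb - low_mb) / (high_mb - low_mb)


# 790 MB is the low-water mark reported for the 32-job launch burst.
gate = governor_gate(790.0)
print(f"gate open fraction at 790 MB: {gate:.2f}")
```

Under these assumed thresholds, the 790 MB low-water mark yields a partially closed gate, which is consistent with a run that records zero governor hard stops.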

Not a stress test. A real dataset. A real workload. A real result — on commodity hardware.
// Primary source

The full methodology, hardware configuration, NUMA anomaly data, WAL determinism analysis, and per-concurrency throughput curves for the enterprise benchmark runs are documented in the published whitepaper: Pressure Curves and Bottleneck Migration: Hardware Limits Revealed by the DataForge™


Direct. No staging. No orchestration layer.

DataForge streams directly from a source database into a destination database — no intermediate files, no ETL extract step, no message queue, no coordination service. The same Parse → Filter → Accumulate → Write pipeline that ingests flat files applies unchanged to live database sources across flavors.
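
The shape of a Parse → Filter → Accumulate → Write pipeline can be sketched with chained generators, so rows flow through without being materialized as a whole. This is a toy illustration, not the engine's code: every function name, the comma delimiter, and the list-backed sink are assumptions.

```python
from typing import Callable, Iterable, Iterator, List

Row = List[str]


def parse(lines: Iterable[str], delimiter: str = ",") -> Iterator[Row]:
    """Split each raw line into fields."""
    for line in lines:
        yield line.rstrip("\n").split(delimiter)


def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows the predicate accepts."""
    return (row for row in rows if predicate(row))


def accumulate(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Group rows into fixed-size batches; the final batch may be short."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write(batches: Iterable[List[Row]], sink: Callable[[List[Row]], None]) -> int:
    """Hand each batch to the sink and return the number of rows written."""
    written = 0
    for batch in batches:
        sink(batch)
        written += len(batch)
    return written


source = ["1,a\n", "2,b\n", ",c\n", "4,d\n", "5,e\n"]
batches_out: List[List[Row]] = []
rows_written = write(
    accumulate(filter_rows(parse(source), lambda r: r[0] != ""), batch_size=2),
    batches_out.append,
)
print(rows_written, len(batches_out))  # 4 2
```

Because each stage only pulls from the one before it, swapping the flat-file source for a database cursor changes the first stage and nothing else, which is the property the paragraph above describes.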

SQL Server → PostgreSQL: 75,814,101 rows transferred · 3m 03s elapsed · 413,325 rows / sec · lossless parity
PostgreSQL → SQL Server: 75,814,101 rows transferred · 5m 14s elapsed · 241,009 rows / sec · lossless parity

  • No extract-to-file step
  • No intermediate staging table or bucket
  • No orchestration service or job scheduler
  • No message queue or change-data-capture layer
  • No manual schema mapping — columns streamed by position

Both runs produced a confirmed lossless result: rows scanned equals rows written, with zero malformed, zero dropped, zero skipped. The engine's internal parity counter closes in both directions independently.

75,814,101 = 75,814,101 Scanned = Written · MS SQL → PostgreSQL
75,814,101 = 75,814,101 Scanned = Written · PostgreSQL → MS SQL
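
The parity condition stated above (scanned equals written, with zero malformed, dropped, and skipped rows) can be expressed as a small check. The class and field names below are hypothetical and do not reflect the engine's internal counter; only the row counts come from the runs reported here.

```python
from dataclasses import dataclass


@dataclass
class ParityCounter:
    """Hypothetical tally of what happened to every row in one transfer."""
    scanned: int = 0
    written: int = 0
    malformed: int = 0
    dropped: int = 0
    skipped: int = 0

    def delta(self) -> int:
        """Net row delta between source and destination."""
        return self.scanned - self.written

    def is_lossless(self) -> bool:
        """True only when every scanned row was written and nothing was lost."""
        return (self.delta() == 0 and self.malformed == 0
                and self.dropped == 0 and self.skipped == 0)


# One counter per direction, using the reported row counts.
mssql_to_pg = ParityCounter(scanned=75_814_101, written=75_814_101)
pg_to_mssql = ParityCounter(scanned=75_814_101, written=75_814_101)
print(mssql_to_pg.is_lossless(), pg_to_mssql.is_lossless())  # True True
```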

The 1.7× difference between directions is protocol-driven, not engine-driven. PostgreSQL's COPY FROM STDIN issues a single streaming protocol call per batch. SQL Server's TDS bulk insert protocol carries per-row encoding and driver-layer overhead. The DataForge execution path is identical in both directions.

The asymmetry is a destination system characteristic. The engine is a constant.
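
For readers unfamiliar with the COPY path, the sketch below shows one way to send a whole batch to PostgreSQL in a single COPY FROM STDIN call. It assumes a psycopg2 cursor (its `copy_expert` method) and a trusted table name; it is an illustration of the protocol pattern, not DataForge's writer, and the helper names are made up.

```python
import csv
import io
from typing import Iterable, Sequence


def batch_to_csv(rows: Iterable[Sequence[object]]) -> io.StringIO:
    """Serialize a batch of rows into an in-memory CSV buffer."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)  # rewind so the consumer reads from the start
    return buf


def copy_batch(cursor, table: str, rows: Iterable[Sequence[object]]) -> None:
    """Stream one batch with a single COPY call.

    `cursor` is assumed to be a psycopg2 cursor. `table` is interpolated
    directly into the statement, so it must come from trusted configuration.
    """
    cursor.copy_expert(
        f"COPY {table} FROM STDIN WITH (FORMAT csv)",
        batch_to_csv(rows),
    )


print(batch_to_csv([(1, "a"), (2, "b,c")]).getvalue())
```

The point of the pattern is that the per-batch cost is one statement and one stream, regardless of how many rows the batch holds.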

Key Observations

01

The System Is Not Compute-Bound

Across all environments, high throughput is sustained with CPU utilization remaining low relative to output and disk I/O pressure remaining minimal under streaming conditions.

The limiting factor is not processing capacity, but data movement and destination system behavior.
02

Execution Model Determines Realized Throughput

Cloud results demonstrate a significant divergence between execution modes against identical infrastructure:

883K rows/sec · Cloud Run Jobs · sustained CPU allocation
~15K rows/sec · Cloud Run Services · CPU throttled after the HTTP 202 response is returned
Throughput is governed by execution model constraints, not infrastructure alone.
03

Destination System Characteristics Define Upper Bounds

The engine's performance ceiling is determined by the characteristics of the target system, not the ingestion pipeline itself.

  • PostgreSQL supports 2.5M+ rows/sec under warmed, optimized conditions via COPY FROM STDIN — single streaming protocol call per batch
  • SQL Server throughput is constrained by TDS bulk copy protocol overhead and UTF-16 encoding at the driver layer
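
One part of the UTF-16 cost is easy to see in isolation: mostly-ASCII text takes twice as many bytes in UTF-16 as in UTF-8. The snippet below measures only encoded size for a sample string; it does not measure any driver, and the sample value is made up.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16LE byte count) for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


utf8_bytes, utf16_bytes = encoded_sizes("docket-2026-05-29")
print(utf8_bytes, utf16_bytes)  # 17 34
```

For text outside the ASCII range the ratio differs, so the doubling shown here is an upper-end illustration for typical Latin-script data rather than a general rule.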
04

Implementation Layer Impacts Throughput Efficiency

Observed differences between implementations on SQL Server under equivalent conditions:

~300K rows/sec C# · SqlBulkCopy · native protocol integration
249K rows/sec Go · go-mssqldb · driver and batching overhead
The engine remains constant. The execution layer determines how efficiently its performance is expressed.
05

Local vs Cloud Performance Characteristics

Local execution provides minimal network overhead, direct I/O access, and no container lifecycle constraints — producing the highest achievable throughput ceiling.

Cloud execution introduces network transfer overhead and service model constraints. Despite these differences, properly configured cloud execution remains within the same order of magnitude as local performance.

The 59× gap between Cloud Run Services and Jobs at constant hardware is attributable entirely to CPU allocation policy — not schema, data, or network.
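
The multipliers quoted on this page follow directly from the published rates. The check below recomputes them; the only assumption is treating the approximate figures (~15K and ~300K rows/sec) as exact values.

```python
def speedup(faster_rows_per_sec: float, slower_rows_per_sec: float) -> float:
    """Throughput ratio between two measured rates."""
    return faster_rows_per_sec / slower_rows_per_sec


jobs_vs_services = speedup(883_017, 15_000)         # Cloud Run Jobs vs Services
postgres_vs_sqlserver = speedup(2_516_818, 300_000)  # local PostgreSQL vs SQL Server
direction_asymmetry = speedup(413_325, 241_009)      # SQL Server→PG vs PG→SQL Server

print(round(jobs_vs_services), round(postgres_vs_sqlserver), round(direction_asymmetry, 1))
# 59 8 1.7
```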
06

Concurrency and Scaling Behavior

Performance improves with concurrency: overhead is amortized across workloads, throughput increases without proportional infrastructure expansion, and per-unit cost decreases as utilization rises.

  • Each Cloud Run Job handles one file at full throughput
  • Throughput climbs with concurrency until it reaches the storage write ceiling: 2.84M rows/sec at c=1, 6.53M at c=10, and 5.49M at c=20 once consumer storage saturates
  • Infrastructure footprint remains stable as concurrency increases
Scaling is achieved through concurrency, not infrastructure growth.
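
A deliberately simplified way to picture this behavior is a throughput model that grows with worker count until a storage ceiling caps it. The model below is an assumption for illustration only: it uses the reported single-worker rate and treats the fresh-drive 10-worker peak as a stand-in for the ceiling. The measured per-concurrency curves are in the whitepaper and are not reproduced by this formula.

```python
def modeled_throughput(workers: int,
                       per_worker_rate: float,
                       storage_ceiling: float) -> float:
    """Idealized rate: linear in workers until storage becomes the limit."""
    return min(workers * per_worker_rate, storage_ceiling)


SINGLE_WORKER = 2_840_000    # reported c=1 rate, rows/sec
ASSUMED_CEILING = 6_530_000  # reported c=10 fresh-drive peak, used as a proxy

for workers in (1, 2, 3, 10):
    rate = modeled_throughput(workers, SINGLE_WORKER, ASSUMED_CEILING)
    print(f"c={workers}: {rate:,.0f} rows/sec")
```

In this model, adding workers past the point where the ceiling binds changes nothing, which is why faster storage rather than more compute moves the limit.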
07

Heterogeneous Concurrency Holds at Full-Corpus Scale

Running 31 structurally different tables concurrently is categorically distinct from running one table 31 times. Schema complexity, row size distribution, and per-dataset memory pressure all differ simultaneously. The engine holds:

  • ~37 GB of heterogeneous source data ingested in under 10 minutes
  • Schema range 3 to 203 columns — same execution path for all
  • Zero failed jobs, zero dropped rows, exact parity confirmed across all 31 tables
  • 32nd table (>300 GB) excluded — exceeds available local storage capacity for validation, not an engine constraint
Concurrency stability is not schema-specific. The engine does not need to be tuned per dataset.
08

DB-to-DB Transfer Eliminates the Staging Layer Entirely

Conventional database migration paths require an extract step — writing to file, staging bucket, or intermediate table — before ingestion can begin. DataForge removes this step. The source database cursor is streamed directly into the destination write path using the same internal pipeline.

  • 413K rows/sec sustained — SQL Server to PostgreSQL — 75.8M rows, lossless
  • 241K rows/sec sustained — PostgreSQL to SQL Server — 75.8M rows, lossless
  • Cross-flavor: SQL Server and PostgreSQL are treated as interchangeable source and destination
No staging. No orchestration. No intermediary. Source cursor to destination write — one pipeline, one pass.
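
The cursor-to-write pattern can be demonstrated end to end with two in-memory SQLite databases standing in for the source and destination. SQLite is used here only so the sketch runs anywhere; the function name, table, and batch size are illustrative, and the real engine's SQL Server and PostgreSQL paths are not shown.

```python
import sqlite3
from typing import Tuple


def transfer(src: sqlite3.Connection, dst: sqlite3.Connection,
             table: str, batch_size: int = 1000) -> Tuple[int, int]:
    """Stream rows from a source cursor into the destination, by position.

    Returns (rows scanned, rows written). No file or staging table is used.
    """
    reader = src.execute(f"SELECT * FROM {table}")
    width = len(reader.description)  # column count; columns map by position
    insert = f"INSERT INTO {table} VALUES ({', '.join('?' * width)})"
    scanned = written = 0
    while True:
        batch = reader.fetchmany(batch_size)
        if not batch:
            break
        scanned += len(batch)
        dst.executemany(insert, batch)
        written += len(batch)
    dst.commit()
    return scanned, written


src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
src.executemany("INSERT INTO events VALUES (?, ?)",
                [(i, f"row-{i}") for i in range(2500)])
src.commit()

result = transfer(src, dst, "events")
print(result)  # (2500, 2500)
```

Only one batch is held in memory at a time, so the memory footprint is set by the batch size rather than by the table size.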

A different approach to throughput.

The distinction is architectural, not incremental.

Traditional systems

Increase throughput by adding infrastructure — more nodes, more services, more coordination layers. Each addition introduces latency, failure surface, and operational cost. The infrastructure is the answer to every throughput question.

vs
DataForge

Increases throughput by increasing concurrency within a stable execution boundary — amortizing overhead, not multiplying it. The infrastructure is fixed; the throughput scales with utilization.

Operational Implications

Prefer sustained compute allocation

Use execution models that allow full CPU access for the duration of the job. Jobs over Services. Bare-metal over throttled containers. The mode is as important as the hardware.

Minimize unnecessary data movement

Network overhead is real and measurable. Colocate the execution environment with the target system where possible. In same-region deployments, throughput was the same over public IP and VPC private IP.

Align target system selection with requirements

PostgreSQL and SQL Server carry different throughput ceilings under the same workload. System selection is a performance decision. The 8× gap exists at the protocol layer before DataForge is in the picture.

Treat runtime as an optimization layer

Language and driver selection affect how efficiently the engine's output is expressed. Native protocol integration outperforms abstraction layers at scale. The engine is constant; the delivery mechanism is a variable.

All results:

  • Derived from real-world datasets (CourtListener public corpus)
  • Executed under observable, repeatable conditions
  • Reproducible with comparable configurations and hardware class

DataForge consistently delivers high-throughput data movement across environments, with performance bounded primarily by external system constraints rather than internal processing limits.

Ready to talk throughput?

Pilot discussions, investor conversations, enterprise architecture review, or technical deep-dives.