Performance Observations & Execution Characteristics
The DataForge engine has been validated across local and cloud execution environments using real-world datasets and production-representative configurations. All results reflect sustained throughput under controlled conditions and are reproducible with comparable inputs.
Performance Summary
Validated results across execution environments and target systems.
Benchmark Record
Two hardware classes. Same engine. Same pipeline. Same zero-delta guarantee.
- Same binary, same pipeline, same zero-delta guarantee on consumer and enterprise hardware
- ~1.5× the prior 7950X ladder at the same concurrency on the same dataset
- At c=10 on a fresh drive: 6.53M rows/sec. Sustained back-to-back runs settle to a ~5.2M rows/sec storage-bound floor as the consumer NVMe's SLC write cache exhausts, which is a drive characteristic, not an engine limit
- CPU never approached saturation on real workloads — engine is parse/storage-bound, not compute-bound
Auto-generated HTML reports from the bench harness. Each report includes the environment fingerprint, per-phase system metrics, PostgreSQL deltas, and Stage 3 reconciliation.
- citation_map · c=10 · 9950X3D · degradation-gated CPU governor
- dockets · c=1 · single wide-column stream (51 cols, ~150 B/row)
- dockets · c=5 · five concurrent wide-column streams
All reports were produced by running bench against a live dataforge-api instance — the same harness any prospect downloads from the Benchmark Kit. No staging, no post-processing.
The Free Law Project's CourtListener bulk archive is the federal judiciary's complete digital record — 32 structurally heterogeneous datasets totaling 401 GB. Conventional restoration tooling typically takes the better part of a day, especially on the 347 GB opinions table. DataForge restored the entire corpus into an empty PostgreSQL 18 database in a single concurrent pass on a consumer desktop, with the source on one NVMe (D:) and the database on another (C:) to eliminate read/write contention. Drive separation held; the engine did the rest.
A homogeneous concurrent test proves the scheduler and goroutine pool are stable. A heterogeneous corpus run proves the engine. Every dataset in the CourtListener corpus presents a different challenge: citation tables are wide and sparse; opinion clusters are dense with long text fields; opinions itself is 2.35 billion small rows totaling 347 GB; dockets carry 51 wide columns with highly variable row sizes. The engine adapted batch sizing and arena allocation per job — simultaneously — while the memory governor proportionally engaged once under genuine pressure (available memory dipped to 790 MB during the 32-job launch burst) and released cleanly without ever fully closing the gate. Zero failures across 2.6 billion rows.
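As a rough illustration of what "proportionally engaged" could mean, the sketch below models a memory gate that narrows as available memory falls and reopens as it recovers. This is not DataForge's implementation: the function names and both thresholds (`low_mb`, `high_mb`) are hypothetical, and only the 790 MB reading comes from the run described above.

```python
def gate_opening(available_mb: float, low_mb: float = 512.0, high_mb: float = 2048.0) -> float:
    """Fraction of new work admitted: 1.0 is fully open, 0.0 is fully closed.

    low_mb and high_mb are hypothetical thresholds chosen for illustration.
    """
    if available_mb >= high_mb:
        return 1.0
    if available_mb <= low_mb:
        return 0.0
    # Linear (proportional) response between the two thresholds.
    return (available_mb - low_mb) / (high_mb - low_mb)


def admitted_batch_rows(base_rows: int, available_mb: float) -> int:
    """Scale a job's batch size by the current gate opening."""
    return int(base_rows * gate_opening(available_mb))


if __name__ == "__main__":
    # 790 MB is the low point reported during the 32-job launch burst.
    for reading in (4096.0, 790.0, 4096.0):
        print(reading, round(gate_opening(reading), 3), admitted_batch_rows(50_000, reading))
```

Under these assumed thresholds the gate is partially closed at 790 MB and fully open again once memory recovers, which matches the "engaged once, never fully closed" behavior described above.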
The full methodology, hardware configuration, NUMA anomaly data, WAL determinism analysis, and per-concurrency throughput curves for the enterprise benchmark runs are documented in the published whitepaper: Pressure Curves and Bottleneck Migration: Hardware Limits Revealed by the DataForge™
Direct. No staging. No orchestration layer.
DataForge streams directly from a source database into a destination database — no intermediate files, no ETL extract step, no message queue, no coordination service. The same Parse → Filter → Accumulate → Write pipeline that ingests flat files applies unchanged to live database sources across flavors.
- No extract-to-file step
- No intermediate staging table or bucket
- No orchestration service or job scheduler
- No message queue or change-data-capture layer
- No manual schema mapping — columns streamed by position
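The four pipeline stages can be pictured as chained generators, where each row flows from source to sink without touching an intermediate file. The sketch below is a minimal model under that assumption, not the engine's code: the stage functions, the in-memory source, and the list used as a sink are all hypothetical stand-ins, and columns are carried by position only.

```python
import csv
import io
from typing import Callable, Iterable, Iterator, List, Tuple


def parse(lines: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Parse: turn raw delimited lines into positional tuples."""
    for fields in csv.reader(lines):
        yield tuple(fields)


def keep(rows: Iterable[Tuple[str, ...]], predicate: Callable[[Tuple[str, ...]], bool]) -> Iterator[Tuple[str, ...]]:
    """Filter: pass through only rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def accumulate(rows: Iterable[Tuple[str, ...]], batch_size: int) -> Iterator[List[Tuple[str, ...]]]:
    """Accumulate: group rows into fixed-size batches, flushing the remainder."""
    batch: List[Tuple[str, ...]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write(batches: Iterable[List[Tuple[str, ...]]], sink: list) -> int:
    """Write: hand each batch to the destination and count rows written."""
    written = 0
    for batch in batches:
        sink.append(batch)  # stand-in for a bulk write to the destination
        written += len(batch)
    return written


source = io.StringIO("1,alpha\n2,beta\n3,gamma\n")
sink: list = []
rows_written = write(accumulate(keep(parse(source), lambda r: len(r) == 2), batch_size=2), sink)
print(rows_written, sink)
```

Because every stage is lazy, swapping the in-memory source for a database cursor would leave the remaining stages unchanged, which is the property the list above describes.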
Both runs produced a confirmed lossless result: rows scanned equals rows written, with zero malformed, zero dropped, zero skipped. The engine's internal parity counter closes in both directions independently.
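The lossless condition stated above can be written down as a simple check. The sketch below is an assumed shape for such a counter, not the engine's internal structure: the class and field names are hypothetical, and 75,800,000 is the rounded row count quoted for the transfers rather than an exact figure.

```python
from dataclasses import dataclass


@dataclass
class ParityCounter:
    """Hypothetical per-run tally used to decide whether a transfer was lossless."""
    scanned: int = 0
    written: int = 0
    malformed: int = 0
    dropped: int = 0
    skipped: int = 0

    def is_lossless(self) -> bool:
        # Lossless means nothing was rejected and every scanned row was written.
        no_losses = self.malformed == 0 and self.dropped == 0 and self.skipped == 0
        return no_losses and self.scanned == self.written


# Each direction is checked on its own counter; one passing does not imply the other.
sqlserver_to_postgres = ParityCounter(scanned=75_800_000, written=75_800_000)
postgres_to_sqlserver = ParityCounter(scanned=75_800_000, written=75_800_000)
print(sqlserver_to_postgres.is_lossless(), postgres_to_sqlserver.is_lossless())
```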
The 1.7× difference between directions is protocol-driven, not engine-driven. PostgreSQL's COPY FROM STDIN issues a single streaming protocol call per batch. SQL Server's TDS bulk insert protocol carries per-row encoding and driver-layer overhead. The DataForge execution path is identical in both directions.
Key Observations
The System Is Not Compute-Bound
Across all environments, high throughput is sustained with CPU utilization remaining low relative to output and disk I/O pressure remaining minimal under streaming conditions.
Execution Model Determines Realized Throughput
Cloud results demonstrate a significant divergence between execution modes against identical infrastructure:
Destination System Characteristics Define Upper Bounds
The engine's performance ceiling is determined by the characteristics of the target system, not the ingestion pipeline itself.
- PostgreSQL supports 2.5M+ rows/sec under warmed, optimized conditions via COPY FROM STDIN — single streaming protocol call per batch
- SQL Server throughput is constrained by TDS bulk copy protocol overhead and UTF-16 encoding at the driver layer
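One part of the encoding cost is easy to see in isolation: ASCII-heavy text takes twice as many bytes in UTF-16 as in UTF-8. The sketch below shows only that size difference. The sample string is a hypothetical header line, and the comparison does not model the rest of the TDS protocol or any driver behavior.

```python
# Hypothetical sample text; real rows will vary in content and length.
sample = "docket_number,case_name,date_filed"

utf8_bytes = len(sample.encode("utf-8"))
utf16_bytes = len(sample.encode("utf-16-le"))  # little-endian, no byte-order mark

print(f"UTF-8:  {utf8_bytes} bytes")
print(f"UTF-16: {utf16_bytes} bytes")
print(f"ratio:  {utf16_bytes / utf8_bytes:.1f}x")
```

For text outside the ASCII range the ratio is smaller, so this is an upper-end illustration for character data rather than a general rule.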
Implementation Layer Impacts Throughput Efficiency
Observed differences between implementations on SQL Server under equivalent conditions:
Local vs Cloud Performance Characteristics
Local execution provides minimal network overhead, direct I/O access, and no container lifecycle constraints — producing the highest achievable throughput ceiling.
Cloud execution introduces network transfer overhead and service model constraints. Despite these differences, properly configured cloud execution remains within the same order of magnitude as local performance.
Concurrency and Scaling Behavior
Performance improves with concurrency: overhead is amortized across workloads, throughput increases without proportional infrastructure expansion, and per-unit cost decreases as utilization rises.
- Each Cloud Run Job handles one file at full throughput
- The concurrency ladder scales linearly against the storage write ceiling
- Infrastructure footprint remains stable as concurrency increases
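The scaling behavior described above can be approximated with a two-line model: aggregate throughput grows linearly with the number of streams until it reaches the storage write ceiling, and a fixed startup overhead is shared across however many jobs run. The sketch below is a simplified model with hypothetical inputs, not measured DataForge data.

```python
def aggregate_rate(streams: int, per_stream_rate: float, write_ceiling: float) -> float:
    """Linear scaling with concurrency, capped by the storage write ceiling."""
    return min(streams * per_stream_rate, write_ceiling)


def overhead_per_job(fixed_overhead_s: float, jobs: int) -> float:
    """Fixed overhead amortized evenly across concurrent jobs."""
    return fixed_overhead_s / jobs


# Hypothetical inputs: 1.0M rows/sec per stream, 5.0M rows/sec ceiling, 30 s fixed overhead.
for c in (1, 2, 5, 10):
    rate = aggregate_rate(c, per_stream_rate=1.0, write_ceiling=5.0)
    print(f"c={c:<2} rate={rate:.1f}M rows/sec  overhead/job={overhead_per_job(30.0, c):.1f}s")
```

In this model, adding streams past the ceiling no longer raises throughput, but it still lowers the overhead attributed to each job.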
Heterogeneous Concurrency Holds at Full-Corpus Scale
Running 31 structurally different tables concurrently is categorically distinct from running one table 31 times. Schema complexity, row size distribution, and per-dataset memory pressure all differ simultaneously. The engine holds:
- ~37 GB of heterogeneous source data ingested in under 10 minutes
- Schema range 3 to 203 columns — same execution path for all
- Zero failed jobs, zero dropped rows, exact parity confirmed across all 31 tables
- 32nd table (>300 GB) excluded — exceeds available local storage capacity for validation, not an engine constraint
DB-to-DB Transfer Eliminates the Staging Layer Entirely
Conventional database migration paths require an extract step — writing to file, staging bucket, or intermediate table — before ingestion can begin. DataForge removes this step. The source database cursor is streamed directly into the destination write path using the same internal pipeline.
- 413K rows/sec sustained — SQL Server to PostgreSQL — 75.8M rows, lossless
- 241K rows/sec sustained — PostgreSQL to SQL Server — 75.8M rows, lossless
- Cross-flavor: SQL Server and PostgreSQL are treated as interchangeable source and destination
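The directional gap and the implied run times follow directly from the figures above. The short calculation below uses the rounded values as quoted (75.8M rows, 413K and 241K rows/sec), so its results are approximate.

```python
ROWS = 75_800_000                 # rounded row count quoted for both transfers
SQLSERVER_TO_POSTGRES = 413_000   # rows/sec, sustained
POSTGRES_TO_SQLSERVER = 241_000   # rows/sec, sustained

ratio = SQLSERVER_TO_POSTGRES / POSTGRES_TO_SQLSERVER
seconds_to_postgres = ROWS / SQLSERVER_TO_POSTGRES
seconds_to_sqlserver = ROWS / POSTGRES_TO_SQLSERVER

print(f"direction ratio:          {ratio:.2f}x")
print(f"SQL Server -> PostgreSQL: {seconds_to_postgres / 60:.1f} min")
print(f"PostgreSQL -> SQL Server: {seconds_to_sqlserver / 60:.1f} min")
```

The ratio comes out at roughly 1.7, with the two transfers taking on the order of three and five minutes respectively.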
A different approach to throughput.
The distinction is architectural, not incremental.
The conventional approach increases throughput by adding infrastructure: more nodes, more services, more coordination layers. Each addition introduces latency, failure surface, and operational cost. Infrastructure becomes the answer to every throughput question.
DataForge increases throughput by increasing concurrency within a stable execution boundary, amortizing overhead rather than multiplying it. The infrastructure is fixed; the throughput scales with utilization.
Operational Implications
Use execution models that allow full CPU access for the duration of the job. Jobs over Services. Bare-metal over throttled containers. The mode is as important as the hardware.
Network overhead is real and measurable. Colocate the execution environment with the target system where possible. In same-region deployments, no throughput difference was observed between public IP and VPC private IP.
PostgreSQL and SQL Server carry different throughput ceilings under the same workload. System selection is a performance decision. The 8× gap exists at the protocol layer before DataForge is in the picture.
Language and driver selection affect how efficiently the engine's output is expressed. Native protocol integration outperforms abstraction layers at scale. The engine is constant; the delivery mechanism is a variable.
All results:
- Derived from real-world datasets (CourtListener public corpus)
- Executed under observable, repeatable conditions
- Reproducible with comparable configurations and hardware class
DataForge consistently delivers high-throughput data movement across environments, with performance bounded primarily by external system constraints rather than internal processing limits.