VOOZH about

URL: https://dzone.com/articles/fast-parallel-data-loading-ml-research

⇱ Fast Parallel Data Loading for ML Research


Related

  1. DZone
  2. Data Engineering
  3. Data
  4. Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research

Removing the Experimental Bottleneck: Fast Parallel Data Loading for ML Research

Transformed 5-hour data loads into 1-2 minutes using Oracle's APPEND+PARALLEL+NOLOGGING, enabling researchers to go from 1-2 experiments/day to 2-3/hour.

By Mar. 30, 26 · Analysis
Likes
Comment
Save
1.7K Views

Join the DZone community and get the full member experience.

Join For Free

The Problem

Traditional INSERT for benchmark data loading:

  • Takes 5+ hours for 4M rows
  • Sequential execution
  • Normal logging and buffering
  • 94% of experiment time wasted on data reload

The Solution: Three Techniques Combined

1. APPEND HINT

Tells Oracle: Skip normal buffering, write directly to disk.

Impact: ~10-20x speedup

2. PARALLEL EXECUTION

Tells Oracle: Use all CPU cores instead of sequential.

Impact: ~5-10x speedup

3. NOLOGGING MODE

Tells Oracle: No need to log test data changes.

Impact: ~3-5x speedup

Multiplication effect: 10-20x × 5-10x × 3-5x = 150-300x

Step 1: Create TPC-H Tables

SQL
CREATE TABLE LINEITEM (
 L_ORDERKEY NUMBER,
 L_PARTKEY NUMBER,
 L_SUPPKEY NUMBER,
 L_LINENUMBER NUMBER,
 L_QUANTITY NUMBER,
 L_EXTENDEDPRICE NUMBER,
 L_DISCOUNT NUMBER,
 L_TAX NUMBER,
 L_RETURNFLAG VARCHAR2(1),
 L_LINESTATUS VARCHAR2(1),
 L_SHIPDATE DATE,
 L_COMMITDATE DATE,
 L_RECEIPTDATE DATE,
 L_SHIPINSTRUCT VARCHAR2(25),
 L_SHIPMODE VARCHAR2(10),
 L_COMMENT VARCHAR2(44)
);

CREATE TABLE ORDERS (
 O_ORDERKEY NUMBER,
 O_CUSTKEY NUMBER,
 O_ORDERSTATUS VARCHAR2(1),
 O_TOTALPRICE NUMBER,
 O_ORDERDATE DATE,
 O_ORDERPRIORITY VARCHAR2(15),
 O_CLERK VARCHAR2(15),
 O_SHIPPRIORITY NUMBER,
 O_COMMENT VARCHAR2(79)
);


Step 2: Generate and Load TPC-H Data (Python)

Python
from oracle_26ai_setup_load.oracle_connector import get_connection

conn = get_connection()
c = conn.cursor()

# Populate ORDERS (700K rows)
print("Populating ORDERS (700K)...")
for i in range(1, 700001):
 cust = ((i - 1) % 150000) + 1
 c.execute(f"""
 INSERT INTO ORDERS 
 VALUES ({i}, {cust}, 'O', {i*100}, SYSDATE - {i % 365}, 
 'PRIO{i % 5}', 'CLERK{i % 100}', {i % 2}, 'comment')
 """)
 if i % 500 == 0: conn.commit()
conn.commit()

# Populate LINEITEM (4M rows)
print("Populating LINEITEM (4M)...")
li = 1
for o in range(1, 700001):
 lines = (o % 7) + 1
 for l in range(1, lines + 1):
 part = ((li - 1) % 20000) + 1
 supp = ((li - 1) % 10000) + 1
 c.execute(f"""
 INSERT INTO LINEITEM 
 VALUES ({o}, {part}, {supp}, {l}, {(li % 50) + 1}, 
 {(li*10) % 100000}, {(li % 10) * 0.1}, 
 {(li % 10) * 0.05}, 'R', 'F', 
 SYSDATE - {li % 365}, SYSDATE - {li % 300}, 
 SYSDATE - {li % 200}, 'INST{li % 5}', 
 'MODE{li % 5}', 'comment')
 """)
 li += 1
 if o % 50 == 0: conn.commit()
conn.commit()

# Verify
c.execute("SELECT COUNT(*) FROM LINEITEM")
print(f"LINEITEM rows: {c.fetchone()[0]:,}")  # Should show ~4,000,000


The Optimization: Before vs. After

Before (traditional):

SQL
-- Standard insert into 
-- a backup/test table
INSERT INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 5 hours 
to copy 4M rows


After (optimized):

SQL
-- Setup (one-time)
ALTER TABLE LINEITEM_TEST 
NOLOGGING;

-- Fast parallel copy
INSERT /*+ APPEND PARALLEL(8) */ 
INTO LINEITEM_TEST
SELECT * FROM LINEITEM;
COMMIT;

Result: 1-2 minutes 
to copy 4M rows


150-300x speedup achieved.

Timing Comparison (Per Experiment Cycle)

Phase Traditional Optimized
Run optimization code 10 min 10 min
Reset data (reload 4M) 5 hours 1-2 min
Run queries 10 min 10 min
TOTAL PER CYCLE ~5.3 hours ~21-22 minutes


Speedup: 15x faster per experiment.

What's the Bottleneck?

Here are some traditional INSERT bottlenecks:

  1. Cache management: Normal insert uses buffering (slow) → APPEND removes buffering
  2. Sequential processing: One-at-a-time rows → PARALLEL spreads across 8 cores
  3. Redo logging: Every change gets logged → NOLOGGING skips logging for ephemeral test data

Key insight: None of these bottlenecks matters for research data (you reload it anyway). Remove all three = multiplicative speedup.

Infrastructure bottlenecks determine what research is feasible. Remove them, and rigorous research becomes practical.

Why the new setup works:

Technique Problem It Solves Impact
APPEND Cache inefficiency during bulk writes ~10-20x speedup
PARALLEL Sequential processing bottleneck ~5-10x speedup
NOLOGGING Logging overhead on bulk operations ~3-5x speedup
All three combined All bottlenecks removed ~150-300x speedup


Before and After: What Changes for Researchers

Old Paradigm (Data Loading Constrained) New Paradigm (Fast Loading Enabled)
1-2 experiments per working day 2-3 experiments per working hour
100-experiment study = 50-100 hours of compute 100-experiment study = 1-2 hours of compute time
7-14 days of continuous compute time Same working day
Limits research scope to what's feasible in weeks Enables comprehensive research that was previously impractical


Translation: The difference between "I can run a few experiments" and "I can run rigorous, statistical validation studies."

Real-World Use Cases

ML/AI Database Optimization

Using reinforcement learning to optimize database configuration? Fast loading means I can run comprehensive parameter tuning instead of guessing.

Learned Index Structures

Testing neural network-based indexes? I need to run dozens of variants. Fast loading makes this feasible.

Automated Query Planning

Developing a learned query optimizer? I need hundreds of experiments. Fast loading is essential.

Benchmarking Database Improvements

Validating a new indexing strategy across multiple workloads? Fast loading lets me test rigorously instead of settling for "spot checks."

Why This Matters More For Your Research

Infrastructure optimization deserves recognition as a legitimate research contribution.

This isn't a micro-optimization. This isn't shaving seconds off a 30-second process. This is removing a constraint that determines what research is feasible.

When your bottleneck is data loading, you optimize around it:

  • You run fewer experiments
  • You compromise on statistical rigor
  • You avoid comprehensive validation

Remove the bottleneck, and suddenly rigorous research becomes practical.

So, what's next?

  1. Try it on your own data. Test with your actual TPC-H setup.
  2. Measure your improvement. Document your before/after times.
  3. Scale your experiments. Run the comprehensive studies that were previously impractical.
  4. Share results. If you're publishing research, this infrastructure improvement is worth documenting.

For the Skeptics: Why This Works

"Why haven't I seen this before?"

These techniques exist in Oracle's documentation but are scattered across different guides. The key insight is understanding that they compose multiplicatively—none creates a bottleneck for the others when all three are applied together.

"Is this risky?"

In production? Yes, you'd want logging and transaction safety. In research? No — your test data is ephemeral. You'll reload it anyway.

"What about other databases?"

  • PostgreSQL: COPY command is similarly fast; parallel loading is possible.
  • MySQL: Similar techniques available (but vary by storage engine).
  • SQL Server: Bulk insert with similar tricks.
  • Snowflake/BigQuery: Already optimized for bulk loads.

Conclusion

Data loading shouldn't be your research bottleneck. It doesn't have to be.

Using 25-year-old database capabilities in the right combination, I can transform experimental workflows from "heavily constrained" to "limited only by ambition."

For researchers doing ML/AI work on database optimization, this infrastructure fix is the difference between surface-level experiments and rigorous validation studies.

Try it. Measure it. Share it.

Resources

Location intelligence Data (computing)

Opinions expressed by DZone contributors are their own.

Related

  • Cutting Data Pipeline Costs and Data Freshness Issues With Netflix Maestro and Apache Iceberg: A Practical Tutorial
  • From ETL to Lakeflow: Shifting to a Declarative Data Paradigm
  • Stop Loading Everything into Redshift: A Spectrum + Iceberg Pattern for Hybrid Analytics
  • Operationalizing Enterprise AI at Scale: Architecture, Governance, and Adoption

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

Let's be friends: