URL: https://thenewstack.io/duckdb-in-process-python-analytics-for-not-quite-big-data/

DuckDB: In-Process Python Analytics for Not-Quite-Big Data

An in-process analytics database, DuckDB can work with surprisingly large data sets without having to maintain a distributed multiserver system. Best of all? You can analyze data directly from your Python app.
May 31st, 2024 4:30am by Joab Jackson
PITTSBURGH — You don’t always need a cluster to analyze even a very large data set. There is a lot you can pack into a single server running the open source DuckDB in-process analytical database system.

This was one takeaway from a number of presentations comparing the performance of analytics tools at PyCon, the Python programmers’ conference held last week in Pittsburgh. There, speakers compared systems and asked, for instance, whether Dask was faster at analytics than Apache Spark.

But if you can avoid setting up a distributed system altogether, you can avoid a lot of headaches around upkeep.

As explained in a presentation given by Kevin Kho and Han Wang, you can get a lot of mileage from a single machine, if it is optimized correctly. And this is the mission of DuckDB.

In 2021, H2O.ai tested DuckDB in a set of benchmarks comparing the processing speed of various database-like tools popular in open source data science.

The testers ran five queries across 10 million rows and nine columns (about 0.5GB). Duck completed the task in a mere two seconds. That was surprising for a database running on a single computer. Even more surprising, it chewed through 100 million rows (5GB) in 14 seconds.

These numbers were impressive. Then, in 2023, the DuckDB folks went back, tweaked the configuration settings and upgraded the hardware, and got the 5GB workload down to two seconds and the 0.5GB workload to less than a second.

It even tackled the 50GB workload — normally reserved for distributed systems such as Spark — in 24 seconds.

“This is a mind-blowing number. The improvements are amazing,” said Wang, who is the tech lead of Lyft Machine Learning Platform, in the presentation.

DuckDB’s benchmark of Big Data systems, 2023.

The takeaway? A surprising number of self-styled “big data” projects don’t need Spark or some other distributed solution: They can fit nicely onto a single server, Wang noted. Taking this approach eliminates the considerable overhead of managing a distributed system, and keeps all the data and code on the local machine.

Introducing DuckDB

There’s a lot happening with DuckDB, an analytical, relational in-process SQL database system created in 2018. Two things immediately set it apart from other data platforms.

1: It combines SQL with Python, giving developers/analysts an expressive query language that executes against data in the application process itself.

2: It is meant to run only on a single machine. This is a feature, not a bug, as it eliminates all the complexity of running a data platform across a distributed system.

“As soon as a problem gets a little bit too big for Pandas, you have to throw a giant distributed system at it. It’s like cracking a nut with a sledgehammer. It’s not ergonomic,” said Alex Monahan in another PyCon presentation. Monahan is a forward-deployed software engineer for MotherDuck, which offers a serverless analytics service based on DuckDB.

The two creators of DuckDB, Hannes Mühleisen (CEO) and Mark Raasveldt (CTO), have founded DuckDB Labs, which provides commercial support for the database system. DuckDB was designed to offer fast, easy-to-deploy analysis of mid-sized data.

They took considerable inspiration from SQLite, the little database that could, and consider DuckDB to be the SQLite of columns, rather than rows.

With its Python-esque interface, DuckDB was also built specifically for the data science community, whose work is to analyze, model and visualize data. Data scientists tend not to use databases, relying instead on CSV files and other unstructured or semi-structured data sources. DuckDB lets them embed data operations directly in their code.

The MIT-licensed open source software is written in C++ and is made to go fast, taking advantage of all the server’s cores and cache hierarchies. And whereas SQLite is a row-based database engine that processes one row at a time, DuckDB processes a whole vector of 2,048 rows at a time.
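To make the row-versus-vector distinction concrete, here is a toy sketch in pure Python. It is not DuckDB's engine, which is written in C++, and the function names are hypothetical; only the 2,048-row vector size comes from the article. It sums the same column one row at a time and then one batch at a time, so the per-step bookkeeping is paid once per batch rather than once per row.

```python
VECTOR_SIZE = 2048  # vector size cited in the article


def sum_row_at_a_time(column):
    """Touch one value per step, the way a row-oriented engine would."""
    total = 0
    steps = 0
    for value in column:
        total += value
        steps += 1
    return total, steps


def sum_vector_at_a_time(column, vector_size=VECTOR_SIZE):
    """Hand a whole batch of values to one operation per step."""
    total = 0
    steps = 0
    for start in range(0, len(column), vector_size):
        batch = column[start:start + vector_size]
        total += sum(batch)  # one call processes up to 2,048 values
        steps += 1
    return total, steps


column = list(range(10_000))
row_total, row_steps = sum_row_at_a_time(column)
vec_total, vec_steps = sum_vector_at_a_time(column)
print(row_total == vec_total, row_steps, vec_steps)
```

Both functions return the same total; the difference is the number of steps the outer loop takes to get there.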

It installs as a single binary through the Python package installer, pip. It is also available precompiled for multiple platforms, so it can be downloaded and run from the command line or through the client libraries. There’s even a version that runs in a browser via WebAssembly.

It is an in-process application that can write to disk, which means it is not limited by a server’s RAM. It can use the whole hard drive, opening the path to working with data sets that are terabytes in size. Unlike a client-server database, it does not rely on a third-party transport mechanism to ship the data from the server to the client. Instead, just like SQLite, the application can pull the data as part of a Python call, through in-process communication within the same memory space.

“You read it right where it sits,” Monahan said.

You can write data frames natively to the database in a number of different ways, including user-defined functions, a full relational API, the Ibis library for writing data frame code that runs across multiple back-end data sources, and a PySpark-style API that needs only a different import statement.

How DuckDB and Python Work Together

In addition to the command line, DuckDB comes with clients for 15 languages. Python is the most popular, but there are also clients for Node.js, JDBC and ODBC. It can read CSV and JSON files, as well as Apache Iceberg files. DuckDB can natively read Pandas and Polars data frames and Arrow tables, without copying the data into another format. Unlike most SQL-only database systems, it keeps the data in its original form as it is ingested.

“So this could fit into a lot of workflows,” Monahan said.

It can also read files over the Internet, including those from GitHub (via HTTP), Amazon S3, Azure Blob Storage and Google Cloud Storage. It can output TensorFlow and PyTorch tensors.

DuckDB uses a SQL variant that is very Python-esque, one that can ingest data frames natively.

Monahan produced a sample “Hello World” app to illustrate:

It produces the output:

[(42,)]

The database uses PostgreSQL’s SQL dialect as its base, though some modifications were made, both to simplify the language and to extend its capabilities.

The ways DuckDB extends and simplifies SQL (Alex Monahan presentation at PyCon)

Is Big Data Dead?

In summary, DuckDB is a fast database with a revolutionary intent: making single-computer analytics possible for even very large data sets. It questions the need for Big Data-based solutions.

In a widely circulated 2023 MotherDuck blog post, provocatively entitled “Big Data Is Dead,” Jordan Tigani noted that “most applications do not need to process massive amounts of data.”

“The amount of data processed for analytics workloads is almost certainly smaller than you think,” he wrote. So it makes sense to look at simple, single-computer analytics software before jumping into a more expensive data warehouse or distributed analytics system.

Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...