WTF? OLAP vs OLTP : TIL

Or - DuckDB is my favourite software of 2025

I really like SQLite. If you can fit your service on a single host, you should consider it for your database and for your cache. It's a very fast database. I recently discovered another embedded database - DuckDB. Where SQLite is an OLTP, DuckDB is an OLAP. OnLine Transaction Processing / OnLine Analytical Processing. Terribly named concepts - names only a committee could create.

OLTP databases (like SQLite, PostgreSQL) excel at:
- Updates, inserts, deletes
- Finding a single row by ID in milliseconds
- Handling thousands of concurrent transactions: "Find user with email='test@example.com' and update their password"

OLAP databases (like DuckDB) excel at:
- Scanning millions of rows
- Grouping, summarising, aggregating data
- Complex analytics queries: "What were my top 10 products by revenue last quarter?"

OLAP databases are terrible at single-row lookups. OLTP databases are slow at scanning entire tables.

Let’s make it concrete by performing a count of Events from PostgreSQL and from DuckDB.

rails> Event.count
=> 13655304
rails> Benchmark.realtime { Event.count }
=> 15.080846183001995

# We'll set up DuckDB with a copy of the Event table I prepared earlier:
rails> db = DuckDB::Database.open("storage/example.duckdb")
rails> conn = db.connect
rails> conn.execute("SELECT count(*) FROM events;").to_a
=> [[15215749]]
rails> Benchmark.realtime { conn.execute("SELECT count(*) FROM events;").to_a }
=> 0.001221388578414917
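
As an aside, getting that copy into DuckDB in the first place is straightforward. Here's a minimal sketch of one way to do it using DuckDB's postgres extension - the connection string and schema names below are placeholders for illustration, not the exact setup behind this post:

require "duckdb"

db   = DuckDB::Database.open("storage/example.duckdb")
conn = db.connect

# Load the postgres extension and attach the source database.
# The connection string is a placeholder - adjust for your own setup.
conn.query("INSTALL postgres")
conn.query("LOAD postgres")
conn.query("ATTACH 'dbname=myapp_production host=localhost' AS pg (TYPE postgres)")

# Copy the whole table into DuckDB's native columnar storage.
conn.query("CREATE TABLE events AS SELECT * FROM pg.public.events")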

OK - so, that count was pretty fast. But, as we know, counting the number of rows in a PostgreSQL table has always been a bit painful. (And yes, DuckDB also has more data in this example!) Let's do a proper query: first ActiveRecord and Postgres, then raw SQL in DuckDB. These Events are HTTP requests in a WAF I'm building - rows of mostly text data. Let's count how many events occurred per day, for the last 9 days:

rails> Event.where('timestamp >= ? AND timestamp < ?', 9.days.ago, 1.second.ago)
rails>  .group("DATE_TRUNC('day', timestamp)")
rails>  .select("DATE_TRUNC('day', timestamp) as day, COUNT(*) as count")
rails>  .to_h { |result| [result.day, result.count] }
=>
{2025-12-01 00:00:00 UTC => 1011162,
 2025-12-04 00:00:00 UTC => 807626,
 2025-11-27 00:00:00 UTC => 1779464,
 2025-11-28 00:00:00 UTC => 1721218,
 2025-11-25 00:00:00 UTC => 861706,
 2025-12-03 00:00:00 UTC => 1751837,
 2025-11-30 00:00:00 UTC => 916592,
 2025-11-29 00:00:00 UTC => 1691162,
 2025-11-26 00:00:00 UTC => 1752491,
 2025-12-02 00:00:00 UTC => 1357359}
rails> Benchmark.realtime { Event.where('timestamp >= ? AND timestamp < ?', 9.days.ago, 1.second.ago)
rails>  .group("DATE_TRUNC('day', timestamp)")
rails>  .select("DATE_TRUNC('day', timestamp) as day, COUNT(*) as count")
rails>  .to_h { |result| [result.day, result.count] }}
=> 33.83886259049177

rails> conn.query("SELECT DATE_TRUNC('day', timestamp) AS day, COUNT(*) AS count FROM events WHERE timestamp >= ? AND timestamp < ? GROUP by day order by day", 9.days.ago, 1.second.ago).to_a.to_h
=>
{Tue, 25 Nov 2025 => 867168,
 Wed, 26 Nov 2025 => 1752491,
 Thu, 27 Nov 2025 => 1779464,
 Fri, 28 Nov 2025 => 1721218,
 Sat, 29 Nov 2025 => 1691162,
 Sun, 30 Nov 2025 => 916584,
 Mon, 01 Dec 2025 => 1011162,
 Tue, 02 Dec 2025 => 1357359,
 Wed, 03 Dec 2025 => 1751827,
 Thu, 04 Dec 2025 => 793468}
rails> Benchmark.realtime { conn.query("SELECT DATE_TRUNC('day', timestamp) as day, COUNT(*) as count FROM events WHERE timestamp >= ? AND timestamp < ? GROUP by day order by day", 9.days.ago, 1.second.ago).to_a.to_h }
=> 0.10836964845657349

DuckDB did that query over 300x faster than PostgreSQL (33.8s vs 0.11s). Those kinds of numbers made my eyes pop right out - how can that kind of performance improvement be possible? That GROUP BY count in DuckDB runs without any indexes, just in case you were wondering. Well - perhaps that performance comes at some kind of outrageous cost in storage?

rails> ActiveRecord::Base.connection.execute( "SELECT pg_size_pretty(pg_relation_size('events')) as size" ).first["size"]
=> "24 GB"

rails> conn.query("PRAGMA database_size;").to_a
=> [["example", "1.3 GiB", ...]]

No, definitely not larger - considerably smaller. 24 GB in Postgres vs 1.3 GiB in DuckDB - roughly 18x smaller!

These kinds of numbers are fantastically good. I hope I've inspired you to take a look at DuckDB. It has a bunch of other fascinating tricks up its sleeves which I'd like to get around to blogging about.

One last comparison to show it’s not all sunshine and rainbows:

rails> Benchmark.realtime { Event.find(34577316) }
=> 0.0020098499953746796
rails> Benchmark.realtime { conn.query("SELECT * FROM events WHERE id = 34577316") }
=> 0.005289033055305481

DuckDB is slower here (and, in general, much slower for this kind of query).

So, what makes it so fast at analytical queries? The key to this performance is its column-oriented arrangement of data. OLTPs store their data on disk with the rows arranged in order:

Row-oriented:
{[1][Alice][alice@example.com], [2][Bob][bob@example.com], [3][Carol][carol@example.com]}

Column-oriented:
{[1][2][3], [Alice][Bob][Carol], [alice@example.com][bob@example.com][carol@example.com]}

With the data organised like this, you can quickly see why the OLTP arrangement makes it easy to grab a row of data at a time - the entire row is packed together in a contiguous block. Likewise, adding a row is easy - just append it to the end of the file. Conversely, grabbing a single row from the columnar arrangement means plucking values from the middle of contiguous blocks of other data - much slower!
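
Here's a tiny in-memory sketch of the same idea in Ruby, with plain arrays standing in for disk blocks, just to make the two access patterns visible:

# Row-oriented: each record packed together, like an OLTP page.
rows = [
  { id: 1, name: "Alice", email: "alice@example.com" },
  { id: 2, name: "Bob",   email: "bob@example.com" },
  { id: 3, name: "Carol", email: "carol@example.com" },
]

# Column-oriented: each column packed together, like an OLAP segment.
columns = {
  id:    [1, 2, 3],
  name:  ["Alice", "Bob", "Carol"],
  email: ["alice@example.com", "bob@example.com", "carol@example.com"],
}

# Fetching one whole record: one lookup for rows, a gather across every column for columnar.
rows[1]                                   # => { id: 2, name: "Bob", ... }
columns.transform_values { |col| col[1] } # => { id: 2, name: "Bob", ... }

# Scanning one column: touches every record for rows, one contiguous array for columnar.
rows.map { |r| r[:name] }                 # reads through all three records
columns[:name]                            # already laid out together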

Another consequence of the columnar arrangement is that like data is clustered with like data, which in turn allows compression of those contiguous blocks of similar values. The integer id section can use a compression scheme optimised for packing integers together in the most efficient manner, while the string sections for name and email can use something like zstandard - a very fast compression algorithm. Zstandard is extra effective when used like this, because all the data in a column is likely to be similar, which is exactly what compressors like it are built to exploit. This ability to tune the compression algorithm to the data in an entire column explains the excellent compression ratio seen above.
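
A rough sketch of why that works, with made-up data shaped a bit like an HTTP events table (Zlib from Ruby's standard library stands in for zstandard here):

require "zlib"

# Made-up columns, roughly the shape of an HTTP events table.
ids     = (1..100_000).to_a
methods = 100_000.times.map { ["GET", "GET", "GET", "POST", "PUT"].sample }

# Once the integer ids sit together, they collapse to a list of deltas...
deltas = ids.each_cons(2).map { |a, b| b - a }
p deltas.uniq                    # => [1] - trivially run-length encodable

# ...and a low-cardinality string column collapses to a dictionary plus small codes.
dictionary = methods.uniq        # e.g. ["GET", "POST", "PUT"]
codes      = methods.map { |m| dictionary.index(m) }
p codes.take(8)                  # e.g. [0, 0, 1, 0, 2, 0, 0, 1]

# Even a general-purpose compressor does very well on a column of similar strings.
puts Zlib::Deflate.deflate(methods.join("\n")).bytesize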

DuckDB (and formats like Parquet) organises data into row groups. Within each row group, various statistics are stored, such as row counts and min/max values for each column. When querying for specific values, like particular dates, columnar databases can skip entire row groups that don’t satisfy the filter condition. This optimisation is called predicate pushdown - a common term in analytical databases.

Predicate pushdown means applying filtering conditions (predicates) as early as possible in the query process, excluding irrelevant row groups from further processing right at the scan stage. DuckDB automatically creates zonemaps (min-max indexes) for columns, enabling this skipping behaviour. This works efficiently with Parquet files and DuckDB’s native storage format.
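
Here's a hand-rolled sketch of that skipping behaviour (made-up row groups, nothing like DuckDB's actual internals):

require "date"

# Each row group carries min/max statistics for the timestamp column - a tiny zonemap.
RowGroup = Struct.new(:min_ts, :max_ts, :rows)

row_groups = [
  RowGroup.new(Date.new(2025, 11, 1),  Date.new(2025, 11, 10), []),  # row data omitted
  RowGroup.new(Date.new(2025, 11, 11), Date.new(2025, 11, 24), []),
  RowGroup.new(Date.new(2025, 11, 25), Date.new(2025, 12, 4),  []),
]

# WHERE timestamp >= '2025-11-25': the predicate is applied at the scan,
# so row groups whose max timestamp falls below the cutoff are never read at all.
cutoff  = Date.new(2025, 11, 25)
scanned = row_groups.select { |group| group.max_ts >= cutoff }

puts "scanning #{scanned.size} of #{row_groups.size} row groups"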

The high compression ratio means moving smaller amounts of data through networks, filesystems, CPU caches and memory, while skippable row groups improve things further; together they work very well to enable analytics-type queries. That’s actual synergy right there - not the buzzword kind.

This post is long enough for now. In the next post, I’ll discuss how you can use some of these OLAP technologies in your Rails app: DuckDB, Parquet and the very new DuckLake!