DuckDB and File-based Storage
I was playing around with some ideas from work. One of the use cases is to archive some Forex data for at least 10 years into the future - I believe this will later change to forever.
My current solution is pretty straightforward. I query the data from the DB every day and write it to a csv file - thinking that this is a good and future-proof solution. After all, this is nothing but a text file, right? Text files have been around since the dawn of computing and are unlikely to ever change.
Which means I don't have to worry about deprecated technology at all. I am pretty worried about deprecated technology because of how fast tech moves. I once faced an issue with moving data out of a Sybase DB, and I was 'lucky' that in the end the task didn't fall on my plate.
My solution of saving data in csv form is indeed simple. Zipped, the whole data is only 2.5 GB per month; in 10 years it will accumulate to 300 GB. 300 GB can maybe be considered 'small', but knowing I have not compressed the data to its fullest extent, the issue keeps nagging at the back of my mind, especially when I read about Time-Series DBs.
The data I work with is the very definition of data that should be stored inside a Time-Series DB. Some initial research shows that the data can be compressed to less than 20% of my initial size. Unfortunately it is unlikely I can use such a DB at work, since it requires a little something known as 'money'. And so, with the issue at the back of my head, I decided to read more and more about this topic.
I finally encountered something known as Parquet.
Parquet
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval - I do not yet fully understand what this means.
But what are its features? I need to know the features before I can even bother to say its definition out loud.
Parquet files allow us to store time-series data like mine in a highly compressed file. This is, in essence, a Time-Series DB on a budget.
I quickly imagined a sqlite file when I heard about this! Although it may be better to imagine it as a csv file. Each file has columns, and Parquet is designed to store this kind of columnar data, not free text.
I have found two ways to wrap the writing and reading of Parquet files so far: with Avro and with Protobuf. Parquet itself doesn't demand that we use either of them, but it just happens that Avro and Protobuf are good existing technologies that complement Parquet's columnar rule: they make sure the Parquet file has strict columns and a clear schema that can be understood by anyone. I can just give someone the JSON schema of an Avro or Protobuf contract, and they will understand what to expect in my parquet file.
Some things we have to remember:
- Parquet files are immutable, meaning we can't append to them like csv; data manipulation must be done by reading the old file and writing a new one (see the sketch after this list).
- The Parquet format is promised to be backward-compatible; at the time of writing there are already Parquet v1 and v2, let's hope the promise stays true.
- Every Parquet reader should be able to read every file, no matter the version.
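Jumping ahead a little to the tool described below, here is a minimal sketch of what that read-then-rewrite workflow could look like in DuckDB SQL. The file names are made up for illustration, and the csv columns are assumed to line up with the Parquet schema:

-- Parquet can't be appended to, so adding a day means reading the
-- old archive plus the new rows and writing a brand new file.
COPY (
    SELECT * FROM 'output/forex_archive.parquet'
    UNION ALL
    SELECT * FROM 'output/forex_today.csv'
) TO 'output/forex_archive_new.parquet' (FORMAT PARQUET);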
Okay, we have the file, now what?
I mentioned that this is a budget Time-Series DB, but so far we have only talked about the storage itself.
To fully understand the beauty of this, we must talk about the duck.
DuckDB
DuckDB is an open-source, column-oriented Relational Database Management System - I have a better idea of what this means, but not really.
DuckDB is an interesting tool. It is able to read columnar data from many places, like local file(s) or remote object storage. Its installation is a single portable file, a single .exe (to whom it may concern: sorry, I'm on Windows). Open the file and DuckDB is ready to use.
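For example, reading straight from object storage goes through DuckDB's httpfs extension. A small sketch, where the bucket and file path are placeholders of my own and the credentials are assumed to be configured already:

INSTALL httpfs;
LOAD httpfs;
-- Query a Parquet file sitting in an S3 bucket without downloading it first
SELECT count(*) FROM 's3://my-archive/forex/2024-01-01.parquet';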
DuckDB can read any popular columnar file format, be it Parquet or even csv.
Let's have an example with this csv data:
currencyPair,bid,ask,dateEpochMicros
USDIDR,894530.39550781248000,1075689.0869140624000,1775970369322000
USDIDR,547915.100097656256000,1084149.719238281248000,1775970370322000
USDIDR,501558.990478515632000,10224578.61328124992000,1775970371322000
USDIDR,268244.277954101568000,1042509.64355468752000,1775970372322000
USDIDR,70381.7520141601568000,1057338.68408203120000,1775970373322000
USDIDR,434182.73925781248000,1016457.824707031248000,1775970374322000
.....
<for 1 million rows>
This file is located at output/forex.csv.
With DuckDB, I am able to query this file with the simple, well-known query of
select * from 'output/forex.csv' limit 10;
It will spit out the data from inside that file:
memory D select * from 'output/forex.csv' limit 10;
┌──────────────┬────────────────────┬─────────────────────┬──────────────────┐
│ currencyPair │        bid         │         ask         │ dateEpochMicros  │
│   varchar    │       double       │       double        │      int64       │
├──────────────┼────────────────────┼─────────────────────┼──────────────────┤
│ USDIDR       │  894530.3955078125 │ -1075689.0869140625 │ 1775970369322000 │
│ USDIDR       │  547915.1000976562 │ -284149.71923828125 │ 1775970370322000 │
│ USDIDR       │  501558.9904785156 │   -1224578.61328125 │ 1775970371322000 │
│ USDIDR       │ 268244.27795410156 │  -942509.6435546875 │ 1775970372322000 │
│ USDIDR       │  70381.75201416016 │  -757338.6840820312 │ 1775970373322000 │
│ USDIDR       │  434182.7392578125 │ -416457.82470703125 │ 1775970374322000 │
│ USDIDR       │ 264118.59130859375 │  -270017.1203613281 │ 1775970375322000 │
│ USDIDR       │ 156302.18505859375 │  -164953.6895751953 │ 1775970376322000 │
│ USDIDR       │    902779.78515625 │   -855730.224609375 │ 1775970377322000 │
│ USDIDR       │ 1580336.5478515625 │  -782791.6259765625 │ 1775970378322000 │
└──────────────┴────────────────────┴─────────────────────┴──────────────────┘
10 rows 4 columns
That is amazing.
That is not all: I have multiple csv files, one for each day. DuckDB can handle that too, with a query like this:
SELECT * FROM 'output/*.csv' LIMIT 10;
This will automatically query the data from every csv in that folder! I no longer have to worry about how to query 300 GB worth of csv to find what I need.
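DuckDB can also write Parquet, so I suspect the same glob trick is the easiest way to turn that pile of csv into a compressed archive. A rough sketch, with output paths that are only examples:

-- Read every daily csv and write a single compressed Parquet file
COPY (SELECT * FROM 'output/*.csv')
TO 'output/forex_archive.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

-- Querying the result works exactly like querying the csv did
SELECT count(*) FROM 'output/forex_archive.parquet';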
And that is still not all. DuckDB can also perform some of the signature functions of a time-series DB, such as averaging the data over every 5 minutes, like so:
SELECT
    currencyPair,
    -- Truncate time to 5-minute intervals;
    -- make_timestamp turns the epoch-microseconds column into a TIMESTAMP
    time_bucket(INTERVAL '5 minutes', make_timestamp(dateEpochMicros)) AS five_min_window,
    AVG(bid) AS avg_bid,
    AVG(ask) AS avg_ask
FROM 'output/*.csv'
GROUP BY 1, 2
ORDER BY five_min_window ASC;
I won't talk about time_bucket here. But know that we can do it.
Epilogue
DuckDB and Parquet are a match made in heaven. Together they give us a way to store time-series data efficiently and then analyze it easily. I am still thinking about whether I can use this combination for more serious work, like real-time queries.
Yes, it leaves some open issues around distribution and backup, but for a cold archive this is definitely a solution I can hold on to. Even if I decide not to use Parquet files at all, I at least know there is a tool that can help me query all that csv data.