“Big Data” these days usually means “time series.” Time series data has only two relevant attributes: like Strega Nona’s pasta, it tends to grow without bound, while simultaneously becoming less useful as time passes.
I can understand why developers, faced with an overflowing pot of pasta (data), are quick to look beyond their SQL database. But as someone who’s been down that path with DynamoDB and Redshift, let me tell you: Postgres had what you were looking for all along.
Sharding the data
Before we begin, let me say that I owe an immense debt to this Engine Yard post for being the only comprehensive source for most of this information.
Suppose you’re keeping tabs on the current weather conditions. You start with a simple table:
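The original schema isn’t reproduced in this excerpt; a minimal version might look like the following (the column names are illustrative — all that matters is having a timestamp to shard on):

```sql
-- A hypothetical starting schema. The exact columns don't matter,
-- only that there's a timestamp column to partition on later.
CREATE TABLE public.weather (
    station     text        NOT NULL,
    temperature numeric,
    conditions  text,
    created_at  timestamptz NOT NULL DEFAULT now()
);
```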
But soon the table grows so big that simply inserting rows and updating the relevant indices slows everything down.
You’d much rather have multiple tables, weather_2015_01, weather_2015_02, and so on, each containing only the events for that particular month. Should you pre-create all the tables a decade out? Should you write custom application code to route weather events to the appropriate table? What about querying over a time range? More application code?
By judiciously using Postgres’ support for child tables (using INHERITS) and writing a function to be triggered on insert, we can automate all of the above!
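The post’s original function isn’t reproduced in this excerpt; a sketch of the idea, assuming the weather table above has a timestamptz column named created_at, might look like this (the function name is arbitrary, and this simple version isn’t race-safe for concurrent first inserts into a new month):

```sql
-- Route each insert into a per-month child table, creating the
-- child on demand the first time a month is seen.
CREATE OR REPLACE FUNCTION weather_insert_trigger()
RETURNS trigger AS $$
DECLARE
    shard_name  text := 'weather_' || to_char(NEW.created_at, 'YYYY_MM');
    month_start date := date_trunc('month', NEW.created_at)::date;
BEGIN
    -- Trick #1: consult the catalog to see if this month's shard exists.
    IF NOT EXISTS (
        SELECT 1
        FROM pg_catalog.pg_class c
        JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relname = shard_name AND n.nspname = 'public'
    ) THEN
        -- Trick #2: create the shard as a child of the parent table,
        -- with a CHECK constraint bounding it to one month.
        EXECUTE format(
            'CREATE TABLE public.%I (
                 CHECK (created_at >= %L AND created_at < %L)
             ) INHERITS (public.weather)',
            shard_name, month_start, month_start + interval '1 month');
    END IF;

    -- Insert the row into the shard instead of the parent.
    EXECUTE format('INSERT INTO public.%I SELECT ($1).*', shard_name)
    USING NEW;

    RETURN NULL;  -- suppress the insert into the parent table itself
END;
$$ LANGUAGE plpgsql;
```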
Running this function on every insertion is straightforward:
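Assuming the routing function above is named weather_insert_trigger (the trigger name itself is arbitrary):

```sql
-- Fire the routing function before each row is inserted into the
-- parent table; the function redirects the row to the right shard.
CREATE TRIGGER weather_insert_shard
    BEFORE INSERT ON public.weather
    FOR EACH ROW
    EXECUTE PROCEDURE weather_insert_trigger();
```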
There are two tricks in the above: checking pg_catalog.pg_class to see whether the correct table shard already exists, and CREATE TABLE ... INHERITS (public.weather). The first lets us create shards on demand as data arrives; the second, together with a per-shard CHECK constraint on the timestamp range, lets the planner use constraint exclusion so that a date-range query against the parent automatically touches only the weather_2015_01 child table.
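For instance, assuming the timestamptz column is named created_at, a range-restricted query like the following scans just one shard:

```sql
-- With constraint_exclusion = partition (the default), the planner
-- proves from each shard's CHECK constraint that only weather_2015_01
-- can contain matching rows, and skips all the other children.
SELECT *
FROM public.weather
WHERE created_at >= '2015-01-01'
  AND created_at <  '2015-02-01';
```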
Reaping old shards
Clearing out old data can now be as simple as DROP TABLE weather_2014_01. But we’ve just seen how much simpler life gets when the insertion process is scripted, so let’s do the same for deletion.
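The maintenance function itself isn’t reproduced in this excerpt; one possible sketch, assuming shard names follow the parent_YYYY_MM convention used by the insert trigger, is:

```sql
-- Drop every child shard of `parent` whose month falls entirely
-- before the cutoff. Shard names are assumed to end in _YYYY_MM.
CREATE OR REPLACE FUNCTION partition_maintenance(parent text, cutoff timestamptz)
RETURNS void AS $$
DECLARE
    shard       record;
    shard_month date;
BEGIN
    FOR shard IN
        SELECT c.relname
        FROM pg_catalog.pg_class c
        JOIN pg_catalog.pg_inherits i ON i.inhrelid = c.oid
        WHERE i.inhparent = ('public.' || parent)::regclass
    LOOP
        -- Recover the shard's month from its name, e.g. weather_2014_01.
        shard_month := to_date(right(shard.relname, 7), 'YYYY_MM');
        IF shard_month < date_trunc('month', cutoff) THEN
            EXECUTE format('DROP TABLE public.%I', shard.relname);
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Dropping a whole shard is effectively instant, whereas a DELETE over the same rows in one giant table would churn through the heap and indices and leave bloat behind.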
To clear out old shards, simply execute SELECT partition_maintenance('weather', CURRENT_DATE - INTERVAL '1 year') to get rid of anything more than a year old.
If you’re in the business of collecting time series data, you’ve probably found that keeping it in a separate store (DynamoDB, MongoDB, CSVs in S3, etc.) is a huge pain when you want to join it with your other data sets. Keeping everything in one database like Postgres is a major win for ad-hoc queries, and with sharding in place it needn’t bog down your Postgres instance with unneeded, old data.