In the IoT world, general-purpose databases can’t cut it
By Linda Musthaler
We live in an age of instrumentation, where everything that can be measured is being measured so that it can be analyzed and acted upon, preferably in real time or near real time. This instrumentation and measurement is happening both in the physical world and in the virtual world of IT.
In the IT world, events are being measured to determine when to autoscale a system’s virtual
infrastructure. For example, a company might want to correlate a number of things taking place at
once — visitors to a website, product lookups, purchase transactions, etc. — to determine when to
burst the cloud capacity for a short time to accommodate more sales or other kinds of activity.
Much of this data is time-series data, where it’s important to stamp the precise time when an event
occurs, or a metric is measured. The data can then be observed and analyzed over time to
understand what changes are taking place within the system.
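The core idea can be sketched in a few lines of code. This is a minimal illustration, not any particular product's API; the names (`Point`, `cpu_load`) are invented for the example.

```python
import time
from collections import namedtuple

# A time-series data point: a metric name, a precise timestamp,
# and the measured value. Names here are illustrative only.
Point = namedtuple("Point", ["metric", "timestamp", "value"])

def record(metric, value, series):
    """Stamp the metric with the current time and append it to the series."""
    series.append(Point(metric, time.time(), value))

series = []
record("cpu_load", 0.42, series)
record("cpu_load", 0.57, series)

# Points arrive in timestamp order, so changes in the system
# can be observed and analyzed over time.
assert series[0].timestamp <= series[1].timestamp
```

A time-series database is, at heart, a store optimized for exactly this shape of data: append-heavy writes of timestamped values.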
Time-series databases can grow quite large, depending on how many events or metrics they are
collecting and storing. Consider the case of autonomous vehicles, which are collecting and
evaluating an enormous number of data points every second to determine how the vehicle should
operate. A general-purpose database, such as Cassandra or MySQL, isn’t well suited for time-series data. A database that is purpose-built for time-series data must have the following capabilities, which general-purpose databases lack.
The database needs to be able to ingest data in almost real time. Some applications – like the
one for the autonomous vehicle – could conceivably produce millions or hundreds of
millions of data points per second, and the database must handle the ingest.
You have to be able to query the database in real time if you want to use it to monitor and control things, and the queries have to run continuously. With a general-purpose database, queries are batched, not streaming.
Compression of data is important and is relatively straightforward if the database is specifically designed for time-series data.
You have to be able to evict data as fast as you ingest it. Time-series data is often only
needed for a specific period, such as a week or month, and then it can be discarded. Normal
databases aren’t constructed to remove data so quickly.
And finally, you have to be able to “downsample” by removing some but not all data. Say you are taking in data points every millisecond. You need that data at high resolution for about a week. After that, you can discard much of the data but keep some at a resolution of one data point per second. In time-series data, high resolution is very important at first, and lower-resolution data is often fine for the longer term.
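The last two capabilities, eviction and downsampling, can be sketched as follows. This is a toy illustration of the retention logic described above, not how any specific time-series database implements it; the function names and the retention window are made up for the example.

```python
# Illustrative sketch of two retention policies for time-series data:
# evicting points older than a retention window, and downsampling
# millisecond-resolution points to one point per second.
# Points are (timestamp_seconds, value) pairs.

def evict(points, retention_seconds, now):
    """Keep only points newer than the retention window."""
    cutoff = now - retention_seconds
    return [(ts, v) for ts, v in points if ts >= cutoff]

def downsample_to_seconds(points):
    """Keep one point per whole second: the mean of values in that second."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(int(ts), []).append(v)
    return [(sec, sum(vs) / len(vs)) for sec, vs in sorted(buckets.items())]

# Two seconds of millisecond-resolution data.
points = [(100.0 + i / 1000.0, float(i)) for i in range(2000)]

coarse = downsample_to_seconds(points)
# 2,000 millisecond points collapse to 2 one-second points.
assert len(coarse) == 2

recent = evict(points, retention_seconds=1.0, now=102.0)
# Only the most recent second of data survives eviction.
assert len(recent) == 1000
```

A purpose-built time-series database performs both operations continuously and cheaply, because its storage layout groups data by time; a general-purpose database would have to do the same work with expensive bulk deletes and full-table scans.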