Influxdb 2017
Influxdb 2017
Influxdb 2017
Advanced Databases
Winter Semester 2017-2018
Authors:
Syeda Noor Zehra Naqvi
Supervisor:
(000455274)
Dr. Esteban Zimányi
Sofia Yfantidou
(000456361)
2 INFLUXDB 8
2.1 General Information & Architecture . . . . . . . . . . . . . . . 8
2.1.1 Key Concepts . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Sharding . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Storage Engine . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Customers & Use Cases . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 DevOps Monitoring: The IBM Case . . . . . . . . . . . 13
2.2.2 IoT Monitoring: The Spiio Case . . . . . . . . . . . . . 13
2.2.3 Real-Time Analytics: The eBay Case . . . . . . . . . . 14
2.3 Pros & Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Pros . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 When not to use InfluxDB . . . . . . . . . . . . . . . . 17
2.4 Popularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 HANDS-ON WORK 18
3.1 Dataset Presentation . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 InfluxDB Tutorial . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Database Setup . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Schema Design . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Data Import . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Basic Queries . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Benchmarking SQL Server vs InfluxDB . . . . . . . . . . . . . 28
3.3.1 Query Properties . . . . . . . . . . . . . . . . . . . . . 28
1
3.3.2 Hardware Specifications . . . . . . . . . . . . . . . . . 29
3.3.3 Benchmarking Queries . . . . . . . . . . . . . . . . . . 30
3.3.4 Benchmarking Query Results . . . . . . . . . . . . . . 32
3.4 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 InfluxDB vs. Cassandra . . . . . . . . . . . . . . . . . 36
3.4.2 InfluxDB vs. Elasticsearch . . . . . . . . . . . . . . . . 38
3.4.3 InfluxDB vs. OpenTSDB . . . . . . . . . . . . . . . . . 41
2
1 TIME SERIES & TIME SERIES DBs
1.1 Time Series
1.1.1 Definition
“Time Series is an ordered sequence of values of a variable (e.g.temperature)
at equally spaced time intervals (e.g. hourly).” Thus it is a sequence of
discrete-time data. For instance timestamped data, such as log files and IoT
devices’ measurements can be considered time series. The measurements that
constitute a time series are ordered on a timeline, which reveals information
about underlying patterns. Ordering matters, because there is a dependency
between time and measurements and changing the order could change the
meaning of the data [4]. Example time series would be the hourly measure-
ments of temperature at a specific weather station, daily measurements of
the closing price of a specific stock, etc.
1.1.2 Uses
Time series are used in various context, the most common of them being1 :
3
1.2 Time Series Databases
1.2.1 Definition
A Time Series Database (TSDB) is a database type which is optimized for
time series or time-stamped data. It is built specifically for handling metrics,
events or measurements that are time-stamped. A TSDB is optimized for
measuring change over time. A TSDB allows its users to create, enumerate,
update, destroy and organize various time series in a more efficient manner.
The key difference with time series data from regular data is that mostly you
ask questions about it over time. Nowadays, the majority of the companies
are generating a insanely large stream of metrics and events (time series data)
and hence the need of a TSDBs is unavoidable.
1.2.2 Properties
The main properties distinguishing time series data from the regular data
workloads are summarization, data life cycle management, and large range
scans of many records. The overview of some of the required properties of a
TSDB is as follows:
4
operations during peak loads because they are usually designed to stay
available even under the most demanding conditions. Time series data
is usually being recorded every second or even less than that, so write
operations need to be fast.
1.2.3 Popularity
It is obvious that TSDBs are to handle time series data, but their popular-
ity seems to have increased with the emergence of Internet of things (IoT).
IoT is a network of physical devices/objects with connectivity which enables
them to exchange and collect data. Such technologies are generating large
amount of data which is usually time-stamped, so with the increase in pop-
ularity of IoT, TSDBs popularity increased even more, because they can be
used to efficiently store sensors and devices’ data in this domain. Some other
common uses of TSDBs are DevOps monitoring and real time data analy-
sis. Nowadays, many large companies like Facebook, eBay etc. are using
2
Quartz Media. More details here: https://qz.com/344466/
connected-cars-will-send-25-gigabytes-of-data-to-the-cloud-every-hour/.
5
TSDBs instead of relational databases especially for data monitoring pur-
poses. From the graph in Figure 1a it can be seen that the popularity of
TSDBs is increasing rapidly in the past couple of years. From 2015 to 2016
the popularity of TSDBs in increased by 26.7% which is twice as much as
Graph database management systems which is 2nd in the list. The popular-
ity of TSDBs continues to increase till now. In Figure 1b we visualize the
increase in popularity of IoT starting from 2015 as well, to give an idea of
the simultaneous grow of the two fields.
6
IoT devices or time series data points in a continuous flow and perform
real-time analysis.
Some use cases for TSDBs includes monitoring software systems like virtual
machines, different services or applications, monitoring physical systems for
example some equipment or machines, connected devices, the environment,
home management systems, human bodies etc. Another use of TSDBs is
in financial trading systems for classic securities or in crypto currencies (bit
coins etc). TSDBs can also be used as eventing applications for tracking
user/customer interaction data and in business intelligence tools for tracking
key metrics and the general health of the business.
7
data logging and graphing tool for time series data. It’s supported program-
ming languages are C, C#, Java, JavaScript, Lua, Perl, PHP, Python, Ruby.
Graphite is ranked third in the list which is also an open source data logging
and graphing tool which is implemented in Python and supports JavaScript
and Python programming languages. Both RRDtool and Graphite deals with
numeric values.
2 INFLUXDB
2.1 General Information & Architecture
InfluxDB is an open-source schemaless time series database with optional
closed-sourced components developed by InfluxData. It is written in Go pro-
gramming language and it is optimized to handle time series data as defined
in Section 1.1.1. It porvides an SQL-like query language. The open-source
8
version, namely the TICK Stack (See Image 3), provides a full time series
database platform with various services including the InfluxDB core and can
run on cloud and on premises on a single node. The closed-source versions,
namely InfluxEnterprise (IE) and InfluxCloud (IC), offer extra functionali-
ties, such as high availability, scalability, backup and restore, and run either
on premises (IE) or on cloud (IC). More details on the suitability of each
version for specific use cases will be given in Section 2.3.
9
name=passengers
time minors adults location driver
2015-08-18T00:00:00Z 1 2 1 doe
2015-08-18T00:00:00Z 2 2 1 jones
2015-08-18T00:06:00Z 1 1 1 doe
2015-08-18T00:06:00Z 0 1 1 jones
2015-08-18T05:54:00Z 0 2 2 doe
2015-08-18T06:30:00Z 2 2 2 doe
2015-08-18T06:06:00Z 3 1 2 jones
2015-08-18T06:30:00Z 0 4 2 jones
actual data and are always associated with a timestamp. Each field key-field
value pair is a field set. Our dataset has 8 field sets:
The location and driver columns are called tags. Again. each tag consists
of a tag key and a tag value. Each tag key-tag value pair constitutes a tag
set. The following 8 tag sets are included in the dataset:
The difference between tags and fields is that tags are indexed, which
means than queries on tags are faster compared to queries on simple fields,
which are not indexed. Note that the primary key consists of the timestamp
10
Series Number Retention Policy Measurement Tag set
Series 1 autogen passengers location=1, driver=doe
Series 2 autogen passengers location=1, driver=jones
Series 3 autogen passengers location=2, driver=doe
Series 4 autogen passengers location=2, driver=jones
2.1.2 Sharding
Sharding is the horizontal partitioning of data in a database. Each partition
is called shard. InfluxDB stores data in shard groups, which are organized
by retention policy and store data with timestamps that fall within a specific
time interval. The length of the aforementioned time interval depends on the
duration of the retention policy (RP). The default shard group durations are
1 hour for RP less than 2 days, 1 day for RP between 2 days and 6 months
and 7 days for RP greater than 6 months. The duration of the shard group
is important for efficient drop operations, where data is dropped per shard,
not per data point. For instance if a RP has a duration of 10 hours, it makes
no sense to divide the data in 5-hour intervals. However, short shard group
durations for large RP can harm compression and speed. Recommendation
for appropriate sharding and schema design can be found here.
11
2.1.3 Storage Engine
InfluxDB currently uses its in-house built data structure, the Time Struc-
tured Merge Tree (TSM Tree). More details on this storage format will
be given shortly. However, InfluxDB has utilized various storage formats
over different versions. Initially it used LevelDB (a database based on Log
Structured Merge Trees (LSM)), which optimizes write throughput and of-
fers built-in compression. However, LevelDB does not provide hot backup
functionality, which means you need to close the database to safely copy it.
For this reason InfluxDB utilized LevelDB variants, such as RocksDB and
HyperLevelDB, which also use LSM Trees. There is an inherent problem
with LSM Trees though, deletion is an expensive operation and a time series
DB requires deletions on a large scale due to automatic data retention (See
Section 2.1.1). That’s why InfluxDB switched to an alternative data struc-
ture, the mmap B+Tree. It used BoltDB as the underlying storage engine,
which may perform slightly worse in write operations, but offers increased
stability and reliability. However, they realized that when the database be-
came larger, writes would start spiking Input Output Operations per Second
(IOPS). Subsequently, the InfluxDB team decided to build their own storage
format, the TSM Tree. The TSM Tree is similar to a LSM Tree in a sense
that it uses write ahead log, index files that are read only, and it occasionally
performs compactions to combine index files. However, it does not suffer by
the deletion problem and offers better compression rates (45x improvement
in disk space usage) compared to a B+ Tree4 .
12
2.2.1 DevOps Monitoring: The IBM Case
“The IBM® Trusteer® products help detect and prevent the full range
of attack vectors responsible for the majority of online, mobile and cross-
channel fraud.” In order to provide full protection against online fraud at
all times, the Trusteer platform needs to maintain high availability. For this
purpose its team uses DevOps monitoring techniques powered by InfluxDB,
Telegraf5 (another product of the InfluxData ecosystem) and Grafana6 (open-
source software for time series analytics). They use Telgraf for collecting
data, InfluxDB for storing them and Grafana for analysis and visualization.
They collect data on infrastructure and application performance, in order to
monitor their cloud system, which contains hundreds of virtual servers.
13
the impact of factors that influence plant performance.
14
own infrastructure or cloud services. In other words, InfluxDB offers
various solutions to match every potential business need.
3. Holistic Solution: InfluxDB is designed to work perfectly along with
the rest of the InfluxData ecosystem, namely Kapacitor, Telegraf and
Chronograf. In this sense it is so much more than a simple
database. It is part of a holistic solution that offers accumulation,
analysis and visualization, all in one package.
4. Various Input Plugins: InfluxDB does not limit itself to one or two
input methods, like other TSDBs, but it offers various input plugins
free of charge. Apart from the HTTP API, it offers a UDP plugin,
Graphite plugin, which allows input in the Graphite line protocol
format, CollectD plugin, which allows input in collectd native format,
OpenTSDB plugin, which allows Telnet and HTTP OpenTSDB
protocol. This means that InfluxDB can act as a drop-in replacement
for an OpenTSDB system.
5. Grafana Support: Grafana is the go-to software for time series an-
alytics, with well over 100,000 active installations. Grafana has in-
troduced a plugin for InfluxDB as a data source for their analytics
dashboards.
6. Extensive Programming Languages Support: InfluxDB offers
support for various programming languages, including, but not limited
to: .Net, Java, Perl, PHP, Python, R, Ruby, Scala and more.
7. SQL-like Query Language: InfluxDB comes with an SQL-like query
language, InfluxQL, which means it does not have a steep learning curve
and is easier to write and understand by non-tech people as well com-
pared for instance to OpenTSDB which does not provide such query
language. This is extremely important when it comes to the world of
businesses, where data plays a major role and is handled by people
with different backgrounds.
8. Continuous Query Support: “Continuous Queries (CQ) are In-
fluxQL queries that run automatically and periodically on realtime
data and store query results in a specified measurement.”10 CQs enable
10
InfluxData — Documentation — Continuous Queries. 2017. Retrieved from https:
//docs.influxdata.com/influxdb/v1.2/query_language/continuous_queries/.
15
downsampling (roll-up) of commonly-queried, high granularity data to
a lower granularity. Queries on data with lower granularity require
fewer resources and are faster than queries with higher granularity.
10. Auto-Expiration: Time series data may become less relevant or even
useless depending on the application as time goes by. InfluxDB with the
use of Retention Policies as discussed in Section 2.1.1 enables automatic
expiration of stale data.
11. Unlimited fields: The new storage engine of InfluxDB, TSM Tree,
as discussed in Section 2.1.3, is columnar format, which means that the
number of fields does not affect querying performance in a negative way.
As a result the number of fields in measurement (See Section 2.1.1) do
not have any limitations as well.
2.3.2 Cons
1. Scalability as a Close-Source Feature: When InfluxDB announced
that clustering would not be included in the open-source version, it re-
ceived an outcry from the community. Currently, InfluxDB high avail-
ability and scalability features are close-source. However, InfluxData
provides an open-source replication solution for high availability, while
many users use sharding (See Section 2.1.2) as a work-around for the
missing clustering functionality in the open-source version. This for in-
stance may make InfluxDB look unattractive for start-ups with limited
budget. However, InfluxData provides various plans starting from $149
16
a month (or $249 a month for the cloud version), a cost not prohibit-
ing even for smaller companies that want to invest in a good scalable
solution.
2. As influxDB mostly works with frequent data, you can only group time
by 1 week at maximum. If there is a requirement to group by more
than a week e.g by a month, it can not be done using influxDB.
2.4 Popularity
InfluxDB is currently the most popular TSDB according to DB-Engines
Rankings as seen in Figure 4 with a 3.38% increase in popularity since Oc-
tober 2016. The rest of the TSDBs have an increase lower that 1% or even
a decrease in popularity. The ranking is based on various factors including:
number of related returned results in search engines, amount of interest in
the system, amount of discussion in technical forums e.g. Stack Overflow,
number of job offers, number of profiles in professional networks and social
network presence.
17
Figure 4: Ranking of the top-10 TSDBs (Source: https://db-engines.com/).
2.5 Comparisons
Figure 5 compare some characteristics of most used technologies/databases
for dealing with time series data. The detailed comparison of the metrics
will be done in the later Section 3.4. It can be seen that InfluxDB is released
after the other competitive technologies and still among the top list. The
SQL-like query language helps it make easier to use and adapt by people
who are use to working with relational databases like MySQL.
3 HANDS-ON WORK
3.1 Dataset Presentation
The timestamped dataset used is provided by the NYC Taxi and Limousine
Commission (TLC) and was collected by technology providers authorized un-
der the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP)11 .
It includes records of all the yellow cab rides from 2009 to 2017. For bench-
marking purposes we utilized records from January 2016 to May 2016 (more
11
NYC Taxi and Limousine Commission - Trip Record Data. 2017. Retrieved from
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
18
Figure 5: System Properties Comparison (Source: https://db-engines.com/).
19
than 30,000,000 records) and we removed specific attributes that were not
relevant to the Time Series concepts. Finally, each record includes the fol-
lowing information:
• VendorID: A code indicating the TPEP provider that provided the
record.
1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
• tpep pickup datetime: The date and time when the meter was en-
gaged.
• Passenger count: The number of passengers in the vehicle.
• Trip distance: The elapsed trip distance in miles reported by the
taximeter.
• RateCodeID: The final rate code in effect at the end of the trip.
1= Standard rate, 2=JFK, 3=Newark, 4=Nassau or Westchester, 5=Ne-
gotiated fare, 6=Group ride
• Payment type: A numeric code signifying how the passenger paid
for the trip.
1= Credit card, 2= Cash, 3= No charge, 4= Dispute, 5= Unknown,
6= Voided trip
• Fare amount: The time-and-distance fare calculated by the meter.
• Extra: Miscellaneous extras and surcharges. Currently, this only in-
cludes the $0.50 and $1 rush hour and overnight charges.
• MTA tax: $0.50 MTA tax that is automatically triggered based on
the metered rate in use.
• Improvement surcharge: $0.30 improvement surcharge assessed trips
at the flag drop.
• Tip amount: Tip amount – This field is automatically populated for
credit card tips. Cash tips are not included.
• Tolls amount: Total amount of all tolls paid in trip.
• Total amount: The total amount charged to passengers. Does not
include cash tips.
20
3.2 InfluxDB Tutorial
3.2.1 Database Setup
Setting up InfluxDB on Windows is a relatively easy process. After download-
ing the official Windows Binaries, we immediately ran the server (influxd.exe)
and then the CLI (influx.exe). InfluxDB is an out-of-the-box platform. We
also downloaded Chronograf, an open-source monitoring and visualization
UI for InfluxDB. Locally it can be found on port 8888 of localhost by default
after running the chronograf.exe file. Chronograf enables better visualiza-
tion of the results, especially the SELECT statements, as it creates graphs,
tables and CSV files for query results. However, due to these extra features
it is slower than a typical UI, such as the visualization tool of SQL Server
Management Studio.
• Tags are indexed and fields are not, so store data in tags if they are
commonly-queried meta data or used in GROUP BY clauses. Store
data in fields if they are used with aggregation functions.
• Avoid having too many series. In other words, avoid tags that are too
varied like IDs.
• Avoid putting more than one piece of information in one tag. For
instance convert location = north − N Y to two tags, location = N Y
and region = north.
• Adjust your Shard duration to your Retention Policy (See Section 2.1.2).
For longer Retention Policies, increasing the shard group duration can
improve compression and write speed. A recommendation is to adjust
12
InfluxData — Documentation — Schema Design. 2017. Retrieved from https://
docs.influxdata.com/influxdb/v1.3/concepts/schema_and_data_layout/.
21
your Shard duration so that is is two times your longest typical query’s
time range.
• Keep in mind that InfluxDB is not designed to support joins, thus your
schema should support all potential queries without the need for a join.
22
To this end we tried various open-source converters (csv2influx,csv2influxdb,
csv-to-influxdb, etc.), that convert CSV files to an InfluxDB acceptable for-
mat. Most of them suffer from limited functionality and unresolved bugs. Fi-
nally, we utilized csv-to-influxdb, which allows specifying timestamp-column,
tag-columns, measurement and batch size at conversion time. Batch size is
important since InfluxDB allows up to 5000 rows to be imported in one
batch. Note that if the timestamp column name is different than time, then
InfluxDB automatically creates a timestamp column called name, containing
the import time. Since import time is not needed in our case, we renamed
our pick-up time column to time.
However, for using the CSV file with the aforementioned converter, we
needed to modify it accordingly. First and foremost, we needed to split each
CSV file into 3 files containing only the columns needed for the specified
measurement (fare, distance, passengers as seen in Section 3.2.2). For each
subfile We needed to remove useless columns, format dates so that the fol-
low a specific format (2016-01-15 00:00:00), format floating point numbers so
they all follow the same format (13.45), and change the delimiter to comma.
Handling files with millions of rows in Microsoft Excel is not possible (limit
of 1,048,576 rows), so we wrote our own Java code that performs this for-
matting.
A conversion and import command using csv-to-influxdb in command line
looks like this:
1 .\ csv - to - i n f l u x d b _ w i n d o w s _ a m d 6 4 . exe -d cabs -m fare
2 -t VendorID , RatecodeID , payment_type - ts time
3 yellow_tripdata_fare_2016 -05. csv
The command above is of a specific format, csv-to-influxdb [options] csv-file-
path, where the [options] used refer to:
• –server, -s Server address (default http://localhost:8086)
• –database, -d Database name (default test)
• –measurement, -m Measurement name (default data)
• –tag-columns, -t Comma-separated list of columns to use as tags
• –timestamp-column, -ts Header name of the column to use as the times-
tamp (default timestamp)
• –batch-size, -b Batch insert size (default 5000)
23
3.2.4 Basic Queries
InfluxQL is an SQL-like query language provided by InfluxDB, which pro-
vides statements for data and schema exploration, database manage-
ment, continuous queries, mathematical and aggregation function
and authentication and authorization.
For Data Exploration, InfluxQL supports basic SELECT statement, as
well as clauses, such as WHERE, GROUP BY, INTO, ORDER BY, LIMIT
and OFFSET. Moreover, InfluxQL provides subqueries functionality as an
alternative to SQL’s HAVING clause. Examples of such statements will
be given in Section 3.3.3. Moreover, it specifically supports SLIMIT and
SOFFSET for limiting and offsetting point the number of series returned
respectively.
For instance, the query below uses an InfluxQL function and a time in-
terval in the GROUP BY clause to calculate the average total fare for each
1week interval in the query’s time range. SLIMIT 1 requests a single series
associated with the fare measurement.
1 SELECT MEAN ( total_amount )
2 FROM fare
3 WHERE time >= ' 2016 -01 -01 00:00:00 ' AND
4 time <= ' 2016 -01 -31 00:00:00 '
5 GROUP BY * , time (1 w )
6 SLIMIT 1
InfluxQL also provides a TIMEZONE clause (tz()) which returns the
UTC offset for the specified timezone. For instance appending tz(’America /
Chicago’) in the previous query would return a time column that would like
like 2016-01-01T19:00:00-05:00.
The most important Schema Exploration statements and their results
can be seen in Figure 6. Regarding Database Management statements
like DROP/CREATE database, DROP series, DROP measurement, DROP
shard and DROP/ALTER/CREATE retention policy are provided and are
self explanatory. However, when authorization is enabled these commands
are only available to admin users. Moreover the database can be backed
up using the command .\influxd backup -database [database name] -retention
[retention policy] -since [start date for backup] [destination folder] and re-
stored using first the command influxd restore -metadir [path to InfluxDB
metadata folder] [backup path] for restoring the metadata and the command
influxd restore -database [database name] -datadir [path to InfluxDB data
24
Figure 6: Basic Schema Exploration statements.
25
• Downsample Data. Use CQs to automatically downsample high pre-
cision data to a lower precision and remove the high precision data from
the database if not needed.
26
Figure 7: Identifying seasonality patterns for the sum of total fares in Chrono-
graf. The dots represent the approximate seasonal data points. Each month
contains 4 dots with similar pattern.
27
seasonality pattern S. The method must be used along with a GROUP BY
time() clause.
The N predicted values occur at the same interval as the GROUP BY
time() interval. If your GROUP BY time() interval is 1w and N is 10 you’ll
receive ten predicted values that are each 1 week apart. S is the seasonal
pattern parameter and delimits the length of a seasonal pattern according to
the GROUP BY time() interval. If your GROUP BY time() interval is 1w
and S is 4, then the seasonal pattern occurs every 4 weeks (1 month), that
is, every four data points. If seasonality is not needed S can be set to 0 or 1.
1 SELECT HOLT_WINTERS [ _WITH - FIT ]( < function >( < field_key >) , <N > , <S >)
2 [ INTO_clause ]
3 FROM_clause
4 [ WHERE_clause ]
5 GROUP_BY_clause
6 [ ORDER_BY_clause ]
7 [ LIMIT_clause ]
8 [ OFFSET_clause ]
9 [ SLIMIT_clause ]
10 [ SOFFSET_clause ]
An example of a Holt Winters query can be seen below. The query
predicts N = 10 weekly future values for the sum of total fares after 1st of
June. Note that the predicted values will be in weekly time intervals because
of the group by query. The seasonality pattern is set to 4 because we noticed
that there is a slight pattern on the sum of total fares on a monthly basis
(See Figure 7). The results can be seen in Figure 8.
1 SELECT H O L T _ W I N T E R S _ W I T H _ F I T ( SUM ( total_amount ) ,10 ,4)
2 FROM fare
3 WHERE time > ' 2016 -01 -01 ' AND time < ' 2016 -06 -01 '
4 GROUP BY time (1 w )
• Query Interval
28
• Aggregation
• Object Identity
• Dimension
• Condition Type
29
• System Type: x64-based PC
30
Query 4 (Range, Aggregation, Known): What was the total fare
calculated by the meters per day in January 2016 for Creative
Mobile Technologies?
1 SELECT SUM ( fare_amount )
2 FROM fare
3 WHERE time >= ' 2016 -01 -01 ' AND time < ' 2016 -02 -01 ' AND VendorID = '1 '
4 GROUP BY time (1 d )
31
Query 9 (Point, No Aggregation, Unknown): We noticed that there
is an abnormally high fare value for timestamp 2016-03-10 22:59:51.
Which TPEP provider charged a client with an abnormal fare and
what is this fare?
1 SELECT VendorID , total_amount FROM fare
2 WHERE time = ' 2016 -03 -10 22:59:51 '
32
Query Interval Aggregation Object Identity SQL Time InfluxDB Time
1 Range Yes Unknown 786 70
2 Unbounded Yes Unknown 3104.8 1537
3 Point Yes Unknown 621 33.2
4 Range Yes Known 2271.8 160.13
5 Unbounded Yes Known 2960 672.939
6 Point Yes Known 651.5 33.9
7 Range No Unknown 3079.1 1402.475
8 Unbounded No Unknown 2759.7 1584.96
9 Point No Unknown 4 2.5
10 Range No Known 663 628.14
11 Unbounded No Known 642.9 36
12 Point No Known 3.2 3
Table 3: Execution time in milliseconds (ms) for different query types in SQL
Server and InfluxDB.
33
InfluxDB does not have an official plugin that handles insertions from
CSV, thus an external tool was used as mentioned in Section 3.2.3.
This tool converts CSV data to the appropriate format and then in-
serts them to InfluxDB. However, SQL Server is designed to work with
CSV data. Thus, the write throughputs cannot be comparable, as in
InfluxDB we had to convert and then insert data, while in SQL Server
we could import data immediately. That’s why we decided to exclude
write throughput from this report.
34
query performance increases by adding extra nodes as will be seen in
Section 3.4, we expect InfluxDB’s query performance to improve even
more in multi-node environments.
35
3.4 Benchmarking
This section will compare performance of 3 commonly used technologies for
time-series data: Cassandra, OpenTSDB and Elasticsearch with InfluxDB.
See Figure 5 to see the comparison of some basic features of these technolo-
gies. Please note that all the content and statistics used in this
section are from the official benchmarking by influxDB [3]. The
following paragraphs introduce the metrics and data set against which this
benchmarking is done.
Metrics: The benchmarking is done against the most commonly evalu-
ated characteristics for working with time-series data which are:
1. Data ingest performance - measured in values per second.
2. On-disk storage requirements - measured in Gigabytes.
3. Mean query response time - measured in milliseconds.
The Data Set: The dataset used for these benchmarks, models a com-
mon DevOps monitoring and metrics use case, where a number of servers
are periodically reporting system and application metrics at a regular time
interval. Overview of the data can be seen in Table 4.
Parameters Cassandra OpenTSDB Elasticsearch
Number of Servers 1000 1000 100
Values measured 100 100 100
per Server
Measurement 10s 10s 10s
Interval
Dataset dura- 24h 4h 24h, 48h, 72h, 96h
tion(s)
Total values in 864,000,000 per day 144,000,000 864,000,000 per day
dataset
36
1. Write Performance: To test write performance, batch of 24-hour
dataset with 4 worker threads was concurrently loaded (to match the
number of cores on the server). The average throughput of Cassandra
was found to be 90,333 values per second .The same dataset loaded
into InuxDB at a rate of 476,460 values per second (See Figure 12).
37
Figure 13: On-disk Compression: InfluxDB Vs. Cassandra
asked to return results for a set of queries, which forced all of the query
processing to happen on the server side. Figure 14 shows that Cassan-
dra was only able to deliver performance comparable to InfluxDB in
the scenario where the application handled the query load. However,
InfluxDB outperformed Cassandra by being x20 times faster than it
when processing happened on the server side. And if we increase time
period to 12 hours or if we increase number of time series, that is query
over multiple servers, Cassandra is even slower and in such scenarios
even if all the query load is handled by client side, Cassandra is way
behind than InfluxDB.
Conclusion: InfluxDB is upto 168x faster than Cassandra in query
performance (server side aggregations) [5].
38
Figure 14: Query Performance: InfluxDB Vs. Cassandra
39
benchmarking is done both against recommended configuration for time
series data and default configuration of Elasticsearch. For 24 hour
data-set the amount of space utilized by InfluxDB was 127 MB, how-
ever Elastic search required 2.1 GB with default settings and 502 MB
with the recommended configurations. So approximately space needed
for InfluxDB was 1.54 bytes per value and 6.09 bytes per value for
Elasticsearch. Figure 16 shows this comparison in form of bar chart.
40
Figure 17: Query Performance: InfluxDB Vs. Elasticsearch
41
Figure 18: Write Throughput: InfluxDB Vs. OpenTSDB
42
uating on-disk compression.
43
References
[1] DB-Engines. Db-engines ranking, 2017.
[4] Joshi, P., Massaron, L., and Hearty, J. Python: Real World
Machine Learning. Packt Publishing, 2017.
[5] Persen, T., and Winslow, R. Influx db vs. cassandra for time-series
data, metrics & management. Tech. rep., September 2016.
[6] Persen, T., and Winslow, R. Influx db vs. elasticsearch for time-
series data, metrics & management. Tech. rep., September 2016.
[7] Persen, T., and Winslow, R. Influx db vs. opentsdb for time-series
data, metrics & management. Tech. rep., November 2016.
44