Using the ELK Stack for Data Analysis

ELK is a popular abbreviation for the Elasticsearch, Logstash, and Kibana stack. It is an end-to-end stack that handles everything from data aggregation to data visualization. On a recent project, I needed a database with a schema-less data model that supported aggregate queries and fast searching. I narrowed my options to two choices – Elasticsearch and Solr (both based on Apache Lucene) – and decided to go with Elasticsearch because of the full stack around it and its AWS support.

In this post, we’ll walk through running all three components of ELK to analyze a data set. The data used here is a set of orders, each with multiple attributes.

Elasticsearch

Elasticsearch provides a REST API over multiple indexes that can be searched and queried. An index is created automatically the first time you post a JSON document to an index scheme (there is an example after the Docker command below). The index scheme is composed of three parts:

  • Index name
  • Index type
  • Document ID

If your data has text fields, Elasticsearch will automatically build a search index. I used Docker to run a single Elasticsearch node:

docker run -i --rm \
-v "$PWD/esdata":/usr/share/elasticsearch/data \
-p 9200:9200 --name el-001 elasticsearch

This starts a Docker container, mounts a directory on the host for data persistence, and forwards a port so that a client can work with the API.
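
With the node running, a document can be posted using the three-part scheme described above. This is only an illustrative sketch – the index name (‘orders’), type (‘order’), document ID, and field values are all hypothetical:

curl -XPUT -d '{
  "recordType": "NOX-GM",
  "productName": "widget",
  "productCode": "WGT-01",
  "price": 12.5,
  "qty": 2
}' http://localhost:9200/orders/order/1

Elasticsearch responds with the index, type, and ID under which the document was stored, creating the ‘orders’ index on the fly if it does not exist yet.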

Mappings

Before you can run aggregate queries, you need to define the schema for the index. This means Elasticsearch needs to know the data type (string, integer, double) of the attributes in the schema. Elasticsearch does try to guess attribute types, but you get more predictable results with an explicit mapping. This is a snippet from a mapping configuration that defines a schema:

{
  "mappings": {
    "eliza": {
      "properties": {
        "recordType": {
          "type": "string"
        },
        "dateTime": {
          "type": "date",
          "format": "epoch_second"
        },
        "price": {
          "type": "double"
        },
        "qty": {
          "type": "integer"
        }
      }
    }
  }
}

Analyzed Fields

String attributes are analyzed for full-text search. This has unwanted side effects when you want to use these fields for aggregation. For example, if the “recordType” attribute has a value of ‘NOX-GM’, you may expect an aggregate query for the count of records on this attribute to have a single count for ‘NOX-GM’, but the actual result will have one count for ‘NOX’ and another for ‘GM’.

If an attribute is not needed for full-text search, setting its “index” property to “not_analyzed” is sufficient:

"recordType": {
  "type": "string",
  "index": "not_analyzed"
}

If an attribute is needed for both full-text search and aggregation, a ‘raw’ sub-field can be added and configured to be not analyzed:

"productCode": {
  "type": "string",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}

The ‘productCode’ attribute is analyzed for full text search, while the ‘productCode.raw’ attribute is used for aggregate queries.
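
To make this concrete, here is a sketch of a terms aggregation that groups records by the raw sub-field. The aggregation name (‘by_product’) is arbitrary, and the index name follows the monthly logstash pattern used later in this post:

curl -XPOST -d '{
  "size": 0,
  "aggs": {
    "by_product": {
      "terms": { "field": "productCode.raw" }
    }
  }
}' http://localhost:9200/logstash-2016-04/_search?pretty

Setting “size” to 0 suppresses the individual hits, so the response contains only the aggregation buckets – one per distinct product code.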

Add or Update Mapping

The mapping can be added to an Elasticsearch node with a PUT request to the selected index (for example, ‘logstash-2016-04’):

curl -X PUT -T config/mapping.json \
http://localhost:9200/logstash-$(date "+%Y-%m")

A mapping can be updated to add new attributes to the index without affecting the previously indexed attributes. However, it is not possible to change the data type of a previously indexed attribute; if you wish to change the type, the index must be dropped and recreated. You can refer to the mapping manual to write your own mapping.
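
For example, a new attribute could be added to the existing ‘eliza’ type with a request along these lines (the ‘discount’ field is hypothetical):

curl -XPUT -d '{
  "properties": {
    "discount": {
      "type": "double"
    }
  }
}' http://localhost:9200/logstash-$(date "+%Y-%m")/_mapping/eliza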

Logstash

Logstash is an event collection and forwarding pipeline. A number of input, filter, and output plugins enable the easy transformation of events. Logstash needs a configuration file that, at minimum, specifies an input and output plugin. This is a configuration file with a CSV filter:

input {
  file {
    path => "/data/input.csv"
  }
}

filter {
  csv {
    columns => [
      "recordType",
      "dateTime", 
      "productName", 
      "productCode", 
      "price", 
      "qty"
    ]
  }
}

output {
  elasticsearch {
    hosts => ["${ESHOST}"]
    index => "logstash-%{+YYYY-MM}"
  }
}

The configuration is fairly self explanatory, but here are a few notes on it:

  • The configuration has its own syntax. It looks like a cross between half a dozen languages, which makes it quite confusing. It helps to have the reference manual handy when authoring a configuration file.
  • I used a CSV filter to name the columns.
  • Environment variables are enclosed in ${…} (like in a shell script), and they get substituted at run time.
  • The date format syntax and the dynamic index naming (one index per month, in this case) are provided by the Elasticsearch output plugin.
  • The file input plugin tails the file and reads it line by line.
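
Since the syntax takes some getting used to, it helps to validate the configuration before starting the pipeline. On the Logstash 2.x releases current at the time of writing, this can be done with the --configtest flag, reusing the same configuration mount:

docker run -i --rm -v "$PWD/config":/config logstash \
logstash -f /config/logstash.conf --configtest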

Logstash is run as a Docker container as well:

docker run -i --rm \
-v "$PWD/config":/config -v "$PWD/data":/data \
--link el-001:esl -e ESHOST=el-001 logstash \
logstash -f /config/logstash.conf

This starts Logstash and connects it to the Elasticsearch container started earlier. At this point, if you append a CSV line to the input file, Logstash will read it and send it to Elasticsearch.
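
For example, you could append a line like the one below; the values are illustrative and follow the column order defined in the CSV filter, with dateTime given in epoch seconds:

echo 'NOX-GM,1459468800,Widget,WGT-01,12.5,2' >> data/input.csv

Shortly afterwards, the record should show up in the monthly logstash index.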

Templates

Templates are applied when indexes are created. In order for the ‘raw’ attributes to be generated when data is added, a template matching the index is needed. Logstash ships with an eponymous template that matches indexes named “logstash-*”, so if your index follows that pattern, the raw attributes are generated automatically. If your index has a different name, you will need to create a template for it. The easiest way to do this is to get the logstash template, update it with your index name, and add it back as a new template.

Get the logstash template:

curl -XGET localhost:9200/_template/logstash?pretty

This is a snippet of the template; to reuse it, replace ‘logstash’ (the outer key and the “logstash-*” index pattern) with your own index moniker, and keep the rest of the template, including the mappings section that generates the ‘raw’ sub-fields, unchanged:

{
  "logstash" : {
    "order" : 0,
    "template" : "logstash-*",
    "settings" : {
      "index" : {
        "refresh_interval" : "5s"
      }
    },
    ...
  }
}

Add the new template. Note that the request body is the inner template object (everything inside the outer “logstash” key), and that the template name in the URL (‘custom_template’ here) is up to you:

curl -XPUT -d @custom_template.json \
localhost:9200/_template/custom_template
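
To confirm that the template took effect, you can fetch it back and, once new data has been indexed, inspect the index mapping to check that the ‘raw’ sub-fields are present (the index name below follows the monthly pattern used earlier in this post):

curl -XGET localhost:9200/_template/custom_template?pretty
curl -XGET localhost:9200/logstash-$(date "+%Y-%m")/_mapping?pretty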

Kibana

Kibana is a visualization framework ideal for exploratory data analysis. Kibana connects to an Elasticsearch node and has access to all of the indexes on that node. You can select one or more indexes, and the attributes in those indexes become available for queries and graphs. It really is almost that easy – the main prerequisite is the schema mappings described earlier, which Kibana needs in order to do anything meaningful with the attributes.

Using Docker to run a Kibana instance:

docker run -i --rm \
--link el-001:esk -p 5601:5601 \
-e ELASTICSEARCH_URL=http://el-001:9200 kibana

Discovery

Open your browser to http://localhost:5601/ and the Kibana dashboard will open on the “Settings” page, where you need to configure at least one index. Change the index pattern if it does not match the default (logstash-*). Next, choose a time field; the dropdown lists the date-time attributes from your schema along with a special @timestamp attribute, which records the time when the record was added to Elasticsearch.

Once you have chosen your indexes, you can go to the ‘Discover’ tab. If you are using an old data set (one where all records are older than 15 minutes), you may see either a ‘No results found’ banner or less data than you expected. To fix this, adjust the time range from the picker at the top right to something appropriate for your data set (for example, the last year).

The ‘Discover’ tab presents the count of records spread over the selected time range by day. You can adjust this to be more granular (hour, minute, second) or less granular (week, month, year).

[Screenshot: the Discover tab]

You can use the search bar to search for terms and the page will update with the results and the new graph.

Visualize

The ‘Visualize’ tab opens with the possible visualizations. Most visualizations start with the simplest metric – the total count of all documents.

[Screenshot: a new bar chart visualization showing the total document count]


For a bar chart, you can add an x-axis bucket with a terms aggregation on a ‘raw’ attribute, which breaks the count down by each term.

[Screenshot: bar chart with an x-axis terms bucket, one bar per term]


You can modify the y-axis to change the aggregation to an average or a sum, and Kibana will show the attributes to which that aggregation can apply – for example, average price per term.

[Screenshot: bar chart with the y-axis aggregation changed to an average]


You can also create a stacked bar chart by using common rank order metrics over a suitable attribute such as price or product quantity.

[Screenshot: stacked bar chart]


The collapsible panel at the bottom of the visualization editor shows the Elasticsearch request and response for the visualization – this is very handy when developing queries and your own visualizations.

[Screenshot: the request and response panel below the visualization]

Once you are happy with a visualization, you can save it and load it again later.

Dashboard

You can create a dashboard from the saved visualizations and save the dashboard itself. You can set it to auto-refresh, which updates the visuals as new data is collected by Elasticsearch.

Conclusion

The ELK stack provides a simple way to load and analyze data sets. It is not meant to be a full-fledged statistical analysis tool; it is better suited to business intelligence use cases.

I found that writes to Elasticsearch are quite slow, while reads are very fast. This is expected, because each write has to update the indexes on the attributes and run analysis on the analyzed fields. Logstash may drop data if there is “back-pressure” – that is, if the destination (like Elasticsearch) cannot keep up with the data velocity. I tried to understand how Elasticsearch behaves with respect to the CAP theorem, but could not conclude whether it is AP or CP. In a production setting, I would recommend the following:

  • Do not use Elasticsearch as the authoritative data source, as it may drop data in the case of network partitions.
  • Use Logstash as a transformation pipeline, not as a queue. It is better to have it read data from a message queue like RabbitMQ, Kafka, or Redis.
  • Elasticsearch is a great choice if your schema is dynamic. If your schema is static, use Elasticsearch only if you need full-text search; if all you want to do is run aggregate queries on a static schema, Cassandra is a better choice.

The company behind the ELK stack is Elastic. They have a wealth of documentation and videos that will help you get up and running quickly. The official Docker repositories are also a great help, and everything ran very smoothly for me. Elastic offers a hosted subscription so you don’t have to set anything up on premises, and they also have products for access control and monitoring. Amazon’s support for Elasticsearch makes this an even more attractive architectural option.

Overall, I had a great experience working with the ELK stack and will definitely consider it in my next projects.

Sayantam Dey

Senior Director Engineering

Sayantam Dey is the Senior Director of Engineering at 3Pillar Global, working out of our office in Noida, India. He has been with 3Pillar for ten years, delivering enterprise products and building frameworks for accelerated software development and testing in various technologies. His current areas of interest are data analytics, messaging systems and cloud services. He has authored the ‘Spring Integration AWS’ open source project and contributes to other open source projects such as SocialAuth and SocialAuth Android.
