As the size of data in our world continues to grow, we will increasingly have to deal with data that doesn’t fit into memory (a constraining factor for those using R/Python). While one solution is to buy more RAM (which is now easier than ever with cloud services like Amazon Web Services and Azure), this is often not feasible. In these cases, we need a storage solution into which we can throw our data and from which we can easily query it. That query may be for traditional analysis in R/Python, or it may feed a Shiny application.
Let’s dive in!
Why Elasticsearch, Logstash, and Kibana?
Below are some solutions that I’ve used/explored:
- SQLite (easy to set up and query with SQL commands, but some queries can take a while)
- Unix Commands (if your data is in numerous CSV files, sometimes using grep from a Linux/Mac terminal can effectively query your data)
- Apache Drill (this is a new tool that shows a lot of promise for Linux users. I haven’t used it yet, but Apache Drill seems like a great concept: schema-free SQL that can query Hadoop, NoSQL, and cloud storage. The use case I’m exploring is using it to quickly query multiple JSON files sitting in AWS S3 storage)
- Elasticsearch: Elasticsearch (ES) is the core of the ELK Stack (Elasticsearch, Logstash, and Kibana). While not as easy to set up as SQLite, Elasticsearch offers blazing-fast search for your queries. You can set it up on your laptop or a single AWS node to query GBs of data, or you can set it up on hundreds of servers to query petabytes of data.
This blog is meant to walk an analyst through setting up an ELK stack in AWS. I recently set up and extensively used an ELK stack in AWS in order to query 20M+ social media records and serve them up in a Kibana dashboard. While AWS does offer the Amazon Elasticsearch Service, that service uses an older version of Elasticsearch, and I preferred to build the stack from the ground up so that I could customize the configuration to my needs. This tutorial is the result of hours on Google stitching together numerous blogs on the subject of ELK stacks. Note that it assumes a basic knowledge of AWS and the Linux terminal (for Linux basics, see Paul Gellerman’s DSCOE blog here).
Set up AWS Node
DO NOT use a preconfigured AWS AMI (i.e. do not use the popular R/RStudio AMI that Louis Aslett maintains) because the RStudio web server connection will impact the usability of the Kibana dashboard. Instead, choose a basic Ubuntu Server AMI as seen below:
After choosing this AMI, continue through the following steps (a scripted AWS CLI alternative is sketched after the list):
- Choose your desired instance size (in my experience, ES requires a minimum of 2 GB of RAM)
- Leave the instance details at their default settings
- Increase EBS storage if desired (this is essentially the size of your hard drive…you can get up to 30GB on the free tier)
- Adjust Security Group by adding HTTP and HTTPS and restricting source IP as desired
- Make sure you create and select a key pair (see below) in order to later connect to your node through SSH protocol
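If you prefer to script the launch rather than click through the console, the same setup can be sketched with the AWS CLI; the AMI ID, key pair name, and security group ID below are placeholders that you would replace with your own:
## Launch a 2 GB (t2.small) Ubuntu instance with a 30 GB root volume
aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type t2.small --key-name my-key-pair --security-group-ids sg-xxxxxxxx --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":30}}]'
## Open HTTP to the security group (repeat with --port 443 for HTTPS)
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 80 --cidr 0.0.0.0/0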
As a side note, to save money, make sure you power off your nodes when not in use. As you can see from this blog, an ELK stack isn’t trivial to set up, so once it’s running you don’t necessarily want to terminate it; if you simply stop the node, you will only pay for storage space and not for compute time.
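Stopping and starting a node can also be done from the AWS CLI (the instance ID below is a placeholder):
## Stop the node when you're done working; you keep paying only for EBS storage
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
## Start it back up later
aws ec2 start-instances --instance-ids i-0123456789abcdef0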
Java
Now that you have an AWS instance up and running, let’s connect and start configuring your bare-bones Ubuntu node. You can connect over SSH using PuTTY (Windows 7), the Bash shell (Windows 10), or a standard terminal on Linux/Mac. For those who are new to SSH, use the command below to connect with a key pair:
ssh -i path-to-key/key ubuntu@your-aws-public-dns
Now that you have connected to your Ubuntu node, check for and install updates with the following lines of code (note: you will have to copy and paste each line separately):
sudo apt-get update
sudo apt-get upgrade
Now let’s install Java with the following lines (once again, if you’re using copy/paste, remember that you will have to copy/paste each line separately):
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
sudo apt-get -y install oracle-java8-installer
java -version
Your last line should produce a result that looks something like this:
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
Elasticsearch
Now we’ll install Elasticsearch with the following lines of code, which download the package directly from elastic.co and install it:
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a etc/apt/sources.list.d/elasticsearch-2.x.list
sudo apt-get update && sudo apt-get install elasticsearch
sudo update-rc.d elasticsearch defaults 95 10 ##This starts ES on boot up
sudo /etc/init.d/elasticsearch start
curl 'http://localhost:9200'
Now we need to configure ES to restrict access. To do this, open the configuration file in a command line text editor:
sudo nano /etc/elasticsearch/elasticsearch.yml
and change the network host to localhost:
network.host: localhost
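After saving the file, restart Elasticsearch so the change takes effect:
sudo /etc/init.d/elasticsearch restart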
If you’ve installed everything correctly, the curl command above should return something similar to:
{
  "name" : "Stonewall",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "Lz_jcK-IQXuVrbBrcHOA7w",
  "version" : {
    "number" : "2.4.3",
    "build_hash" : "d38a34e7b75af4e17ead16f156feffa432b22be3",
    "build_timestamp" : "2016-12-07T16:28:56Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}
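As a quick sanity check, you can index a throwaway document and search for it (the index name test_index below is just a placeholder):
## Index a test document
curl -XPUT 'http://localhost:9200/test_index/doc/1' -d '{"message": "hello elk"}'
## Search for it
curl 'http://localhost:9200/test_index/_search?q=message:hello&pretty'
## Remove the test index when you're done
curl -XDELETE 'http://localhost:9200/test_index'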
Logstash
Logstash is “an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it” and sends it to your ES engine. While not required in order to use ES, it provides a convenient way to load data (especially if you’re unsure of how to create an Elasticsearch mapping). Install it with the following lines:
echo "deb https://packages.elastic.co/logstash/2.3/debian stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update && sudo apt-get install logstash
sudo update-rc.d logstash defaults 95 10 ##This starts logstash on startup
sudo /etc/init.d/logstash start
Kibana
Kibana is an open source dashboard that makes it easy to visualize the data held in your ES stack. Install Kibana with the code below:
echo "deb http://packages.elastic.co/kibana/4.5/debian stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update && sudo apt-get install kibana
sudo update-rc.d kibana defaults 95 10
Open the Kibana configuration file for editing:
sudo nano /opt/kibana/config/kibana.yml
In the configuration file, arrow down to the line that specifies server.host and replace the IP address (the default is “0.0.0.0”) with:
server.host: "localhost"
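Save the file, then start Kibana:
sudo service kibana start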
NGINX
Now we’ll install Nginx to set up a reverse proxy. Because Kibana is listening only on localhost, Nginx lets you reach the dashboard from outside the node while protecting it with basic authentication.
sudo apt-get install nginx apache2-utils
We will use htpasswd to create an admin user (in the following code, change “kibanaadmin” to a username of your choice):
sudo htpasswd -c /etc/nginx/htpasswd.users kibanaadmin
Enter a password at the prompt.
Now we will set up the Nginx default server block by creating the following file:
sudo nano /etc/nginx/sites-enabled/nsm
and then pasting in the code below, making sure to replace the server_name with your own AWS public DNS:
server {
    listen 80;
    server_name ec2-54-166-151-27.compute-1.amazonaws.com;

    auth_basic "Restricted Access";
    auth_basic_user_file /etc/nginx/htpasswd.users;

    location / {
        proxy_pass http://localhost:5601;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}
Now run the code below to restart Nginx:
sudo /etc/init.d/nginx restart
At this point, reboot your AWS node and then put your AWS Public DNS in a browser. You should be prompted for the username and password that you created above. After entering them, you should get a screen that looks like:
If you run into any problems, below are several commands that I found helpful:
##Restart Nginx
sudo service nginx restart
##Restart ES
sudo /etc/init.d/elasticsearch restart
##Restart Kibana
sudo service kibana restart
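Two additional checks can help narrow down problems: testing the Nginx configuration for syntax errors and confirming that Elasticsearch is healthy and reachable on port 9200:
## Test the Nginx configuration
sudo nginx -t
## Check Elasticsearch cluster health
curl 'http://localhost:9200/_cluster/health?pretty'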
Now, we need to load some data into ES. The easiest way is to load JSON data from Python; the elastic package for R can also assist in loading data. For my purposes, I chose to use Logstash to load Twitter data from large CSV files. I created a new file in my home directory called myTweet.conf and put the following mapping in this configuration file (note that I’m loading data that I scraped from the Twitter API and enriched with City, Country, and Language fields).
input {
  file {
    path => "/home/ubuntu/Dropbox/load/TwitterData2016-11-07.csv"
    type => "twitter"
    start_position => "beginning"
  }
}
filter {
  csv {
    columns => [ "text",
                 "favorited",
                 "favoriteCount",
                 "replyToSN",
                 "created",
                 "truncated",
                 "replyToSID",
                 "id",
                 "replyToUID",
                 "statusSource",
                 "screenName",
                 "retweetCount",
                 "isRetweet",
                 "retweeted",
                 "longitude",
                 "latitude",
                 "language",
                 "city",
                 "country" ]
    separator => ","
    remove_field => ["replyToSN","truncated","replyToSID","replyToUID"]
  }
  date {
    match => ["created", "YYYY-MM-dd HH:mm:ss"]
    timezone => "America/New_York"
    target => "created"
  }
  mutate {
    convert => {
      "text" => "string"
      "favoriteCount" => "integer"
      "id" => "integer"
      "statusSource" => "string"
      "screenName" => "string"
      "retweetCount" => "integer"
      "isRetweet" => "boolean"
      "retweeted" => "boolean"
      "longitude" => "float"
      "latitude" => "float"
      "language" => "string"
      "city" => "string"
      "country" => "string"
    }
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => "127.0.0.1:9200"
    index => "tweets_20161215"
  }
}
Note that by using “tweets_&lt;date&gt;” as the index name, I can use a wildcard (tweets_*) in Kibana and later delete individual indices to let data age off the stack.
To load a file using this mapping, run the following command in the terminal (run it once for each CSV file, updating the path in myTweet.conf accordingly):
/opt/logstash/bin/logstash -f ~/myTweet.conf
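Once a load finishes, a quick count and sample search will confirm the records made it into the index (the index name matches the one in the output section of myTweet.conf; adjust the date suffix and query field to your own data):
## Count documents in the index
curl 'http://localhost:9200/tweets_20161215/_count?pretty'
## Pull back a single matching tweet across all tweet indices
curl 'http://localhost:9200/tweets_*/_search?q=language:en&size=1&pretty'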
At this point, I’ve loaded over 20 million records (tweets) into my ES stack, and the Kibana dashboard below has been very responsive (I usually wait about 5 seconds for queries, compared to 10 minutes with SQLite).
Finally, here are a couple of commands that are helpful:
## List ES indices (with number of records and size on disk)
curl 'localhost:9200/_cat/indices?v'
## Delete an Index
curl -XDELETE 'http://localhost:9200/tweets_20161226/'