
Real Time Analysis of Log Data Using Data Streaming

A project report submitted in partial fulfilment of the requirements
for the Summer Internship Project
at
Toppr

By:
Prakhar Dev Gupta (2014-IPG-062)

ABV INDIAN INSTITUTE OF INFORMATION


TECHNOLOGY AND MANAGEMENT
GWALIOR-474010
2018

DECLARATION

I, Prakhar Dev Gupta, hereby declare that this report of the Summer
Internship-2018, titled "Real Time Analysis of Log Data Using Data Streaming", was
prepared solely by me after the completion of 60 days of internship study and work
at Toppr Education Private Limited during the period 14th May 2018 to 14th July 2018.
I also confirm that the report has been prepared only for my academic requirements
and for no other purpose. It may not be used in the interest of rival parties of the
corporation.

Prakhar Dev Gupta


Former Software Engineering Intern,
Toppr Technologies Pvt. Ltd.
Hyderabad, Telangana

TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
1. Company's Profile
   i. About Toppr
   ii. Vision and Mission
   iii. Awards
2. Problem Statement
   i. Objective of RTA of log data
   ii. Design constraints
3. Literature Review
   i. What is streaming data?
   ii. Comparison between batch processing and stream processing
   iii. Challenges in working with streaming data
4. Amazon Web Services Used
   i. Amazon EC2
   ii. Kinesis Agent
   iii. Kinesis Stream
   iv. Kinesis Analytics Stream
   v. AWS Lambda function
   vi. CloudWatch and SES
5. Setup and Pricing Details
   i. Fake Apache Log
   ii. In-depth working details
   iii. Pricing and other details
RESULTS
CONCLUSION
REFERENCES

ABSTRACT
This project is a product-oriented work on real-time collection and analysis of log
data from online traffic. The project has immense potential to capture information
in various fields, such as current server traffic, network busy times, or malicious
IP addresses trying to consume resources through DDoS attacks. In real time, a
security operations center (SOC) could detect an attack in a matter of minutes.
Using RTA, the corporation can have a 360-degree view of its customers' interests
and create policies that best suit their needs. A real-time recommendation engine
can also be built using RTA concepts. "Real-time data streaming is the process by
which big volumes of data are processed quickly such that a firm extracting the info
from that data can react to changing conditions in real time." Additionally, large
volumes of data are processed via streaming, which helps organizations react to
possible threats and fraudulent activity quickly.
Several applications where this project will be used include
 E-Commerce
 Risk Management
 Pricing and Analytics
 Network Monitoring
 Fraud Detection

In this project, Amazon Web Services (AWS) has been employed as the third-party
service vendor, primarily for its ease of use, effectiveness, and reliability.

ACKNOWLEDGEMENT

I would like to take this opportunity to thank Mr. Hemanth Goteti and Mr. Zeeshan
Hayath, the co-founders of Toppr, Mr. Vivek Sharma, the Senior Product Manager,
and Mr. Akhilesh Bussa, Backend Developer and my Project Mentor. He always
motivated me to push my boundaries and extract my maximum potential. His
guidance and encouragement helped me enjoy my work and enhance my
understanding.
I would also like to thank all my fellow workers for always being willing to guide
me. Without their supportive nature, this project would have remained a distant
dream.

Prakhar Dev Gupta


2014-IPG-062

COMPANY’S PROFILE

About Toppr
Toppr is a product of Toppr Technologies Private Limited. It was co-founded by
Zeeshan Hayath and Hemanth Goteti, alumni of IIT Bombay. It is a learning app for
students of classes 5 to 12 and for students preparing for entrance and scholarship
exams. As of December 2017, Toppr had a user base of 2.5 million. The content on
the app is available in English and Hindi.

Vision and Mission

Toppr holds the extraordinary vision to "Make Learning Personalised". It aims at
revolutionizing the education system. Over the past 5 years, Toppr has acquired
nationwide recognition. It has acquired the Jodhpur-based EasyPrep, an online
learning platform, and 'Manch', a knowledge delivery platform. Toppr's mission is
to make quality education easily available to everyone across the nation.

Awards:
• 2017 - Awarded the Best Educational Website by India Digital Awards [IAMAI].
• 2016-2017 - Awarded the Best Educational Website (Emerging) by AWS Mobility.
• 2015 - Recognized as one of the Top 10 Hottest Start-ups by CB Insights.

PROBLEM STATEMENT

Objective
Given the enormous amount of data depicting different information, the task is to
capture it in real time and analyse it in order to detect various parameters,
particularly malicious IPs flooding the server, busy servers, loyal customers, and
the error rates and types in every given time window.
The system should also be able to take suitable automated actions, such as
sending an e-mail to the tech lead in case of high error rates, informing the network
security team in case of a possible attack attempt, automatically redirecting server
traffic for load balancing, and triggering recommendation retrieval from
Elasticsearch for the users.

Design Constraints
• The system should be a near-real-time system.
• The data ingestion rate is enormous; hence the data cannot be stored in
conventional databases.
• The solution should be cost-effective and feasible.
• Since data is produced continuously into the log files, an agent is needed
that is triggered every time an entry is made into these files.
• The automated actions should also be quick enough.
• The streaming data must be fully secure to prevent unethical theft of data.

LITERATURE REVIEW

What is Streaming Data?


Streaming Data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (on the
order of kilobytes). Streaming data includes a wide variety of data such as log files generated
by customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors, or
geospatial services, and telemetry from connected devices or instrumentation in
data centers.
This data needs to be processed sequentially and incrementally on a record-by-
record basis or over sliding time windows, and used for a wide variety of analytics
including correlations, aggregations, filtering, and sampling. Information derived
from such analysis gives companies visibility into many aspects of their business and
customer activity, such as service usage (for metering/billing), server activity,
website clicks, and geo-location of devices, people, and physical goods, and enables
them to respond promptly to emerging situations.
changes in public sentiment on their brands and products by continuously analysing
social media streams, and respond in a timely fashion as the necessity arises.
Streaming data processing is beneficial in most scenarios where new, dynamic data is
generated on a continual basis. It applies to most of the industry segments and big
data use cases. Companies generally begin with simple applications such as collecting
system logs and rudimentary processing like rolling min-max computations. Then,
these applications evolve to more sophisticated near-real-time processing.
Applications may process data streams to produce simple reports, and perform
simple actions in response, such as emitting alarms when key measures exceed
certain thresholds. These data streams can be used to extract deeper insight from
the data.

Comparison between Batch and Stream Processing

Data Scope:   Batch processing queries or processes all or most of the data in the
              dataset; stream processing queries or processes data within a rolling
              time window, or only the most recent data record.
Data Size:    Batch processing handles large batches of data; stream processing
              handles individual records or micro-batches consisting of a few records.
Performance:  Batch processing has latencies from minutes to hours; stream
              processing has latencies in the order of seconds or milliseconds.
Analyses:     Batch processing supports complex analytics; stream processing
              supports simple response functions, aggregates, and rolling metrics.

Table 1: A brief comparison of stream processing with batch processing

Challenges in Working with Streaming Data


Streaming data processing requires two layers: a storage layer and a processing
layer. The storage layer needs to support record ordering and strong consistency to
enable fast, inexpensive, and replayable reads and writes of large streams of data.
The processing layer is responsible for consuming data from the storage layer,
running computations on that data, and then notifying the storage layer to delete
data that is no longer needed. You also have to plan for scalability, data durability,
and fault tolerance in both the storage and processing layers. As a result, many
platforms have emerged that provide the infrastructure needed to build streaming
data applications, including Amazon Kinesis Streams, Amazon Kinesis Firehose,
Apache Kafka, Apache Flume, Apache Spark Streaming, and Apache Storm.

AMAZON WEB SERVICES USED

Amazon Web Services (AWS) is a secure cloud services platform, offering compute
power, database storage, content delivery and other functionality to help businesses
scale and grow. Millions of customers are currently leveraging AWS cloud products
and solutions to build sophisticated applications with increased flexibility, scalability
and reliability. Some of the services used in my project are described below:

Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure,
resizable compute capacity in the cloud. It is designed to make web-scale cloud
computing easier for developers. The EC2 instance used in this project is Amazon
Linux AMI 2018.03.0 (HVM), SSD Volume Type. This Amazon Linux AMI is an EBS-
backed, AWS-supported image. The default image includes AWS command line tools,
Python, Ruby, Perl, and Java. The repositories include Docker, PHP, MySQL,
PostgreSQL, and other packages.

Kinesis Agent
The Amazon Kinesis Agent is a stand-alone Java software application that offers an
easy way to collect and ingest data into Amazon Kinesis services, including Amazon
Kinesis Streams and Amazon Kinesis Firehose. The log files generated on the EC2
instance are read by the Kinesis Agent when its service is turned on. The Kinesis
Agent has the following features:
• Monitors file patterns and sends new data records to delivery streams.
• Handles file checkpointing and retries from the checkpoint in case of failure.
• Delivers data in a reliable and simple manner.
• Allows stream troubleshooting.

Kinesis Stream
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data
streaming service. KDS can continuously capture gigabytes of data per second from
hundreds of thousands of sources such as website clickstreams, database event
streams, financial transactions, social media feeds, IT logs, and location-tracking
events. The data collected is available in milliseconds to enable real-time analytics
use cases such as real-time dashboards, real-time anomaly detection, dynamic
pricing, and more.

Figure 1 Kinesis Data Stream high-level architecture diagram

Kinesis Analytics Stream


An application is the primary resource in Amazon Kinesis Data Analytics that you can
create in your account. You can create and manage applications using the AWS
Management Console or the Amazon Kinesis Data Analytics API, which provides
operations to manage applications. Each application consists of the following:
• Input: the streaming source for the application.
• Application code: a series of SQL statements that process input and produce
output.
• Output: in the application code, query results go to in-application streams;
we can create one or more in-application streams to hold intermediate results.

Figure 2 Kinesis Analytics Stream working diagram

AWS Lambda Function


AWS Lambda is a serverless compute service that runs our code in response to
events and automatically manages the underlying compute resources for us. We can
use AWS Lambda to extend other AWS services with custom logic. Some key features
of the Lambda function are:
• Integrated security model
• Pay-per-use
• Built-in fault tolerance
• Automatic scaling
• Easy attachment of custom logic
• Automatic triggering

Figure 3 Using Lambda function with Kinesis

CloudWatch and SES
Amazon CloudWatch is a monitoring and management service built for developers,
system operators, site reliability engineers (SREs), and IT managers. CloudWatch is
natively integrated with more than 70 AWS services such as Amazon EC2, Amazon
DynamoDB, Amazon S3, Amazon ECS, AWS Lambda, Amazon API Gateway, etc. It
therefore provides rich and deep insights into AWS resources. Apart from various
built-in metrics, CloudWatch also allows custom definition of metrics.
SES is the abbreviation of Simple Email Service. It is a cloud-based email sending
service designed to help digital marketers and application developers send
marketing, notification, and transactional emails. In our case, it sends the alert
email to the person in charge when the traffic metric crosses a particular threshold.

Figure 4 Basic Setup of Lambda with Kinesis and CloudWatch
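
As an illustration, a custom metric can be published with a few lines of boto3; this
is a hedged sketch, and the namespace, metric name, dimension, and values below are
assumptions for illustration, not the project's actual configuration.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

# Publish one data point of a custom metric for a suspicious source IP.
cloudwatch.put_metric_data(
    Namespace="RTA/LogAnalysis",  # illustrative namespace
    MetricData=[{
        "MetricName": "SuspiciousRequestCount",
        "Dimensions": [{"Name": "SourceIP", "Value": "203.0.113.7"}],
        "Value": 14,
        "Unit": "Count",
    }],
)

An alarm can then be defined on this metric in the CloudWatch console to drive the
e-mail alerts described above.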

SETUP AND PRICING DETAILS

The setup requires some log data, which has to be written into a log file that is
later read by the Kinesis Agent for further analysis. Log data analysis is widely
done on Apache log data, which has a predefined structure, discussed below:

Fake Apache Log


The Apache HTTP Server provides very comprehensive and flexible logging
capabilities. In order to generate the log, we created a Python script that generates
log entries following the Apache access log format:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

The details of each entry are as under:

• 127.0.0.1 (%h): IP address of the client (remote host) which made the request
to the server.
• - (%l): A hyphen indicates that the requested piece of information is unavailable.
• frank (%u): The username of the requester.
• [10/Oct/2000:13:55:36 -0700] (%t): The date, time, and timezone of the request.
• "GET /apache_pb.gif HTTP/1.0" (\"%r\"): The request line, giving the method
(GET), the requested resource, and the protocol involved.
• 200 (%>s): The HTTP status code returned to the client.
• 2326 (%b): The size (in bytes) of the object returned to the client.
The Kinesis stream needs to read the schema of the data sent by the Kinesis
Agent; therefore, we send the data by splitting the string using regular expressions
and then putting the subparts as key-value pairs into a dictionary, as sketched below.
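
A minimal parsing sketch, assuming Python's standard re module; the field names
are our own illustrative choice, not necessarily the project's exact schema:

import re

# Split one Apache access log line into key-value pairs.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

print(parse_log_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
))
# {'host': '127.0.0.1', 'ident': '-', 'user': 'frank', ..., 'status': '200', 'size': '2326'}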
When its service is turned on, the Kinesis Agent acts as a producer for the
Kinesis stream, feeding data continuously into the stream for the analysis stage.
For this purpose, the agent is pointed at a file which, in turn, is populated by the
Python script written to generate fake log data.
A screenshot of the Python script's output is shown below:

Figure 5 Fake Apache log generation using Python
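
A minimal sketch of such a generator is given below; the resource list, status-code
mix, and output path are illustrative assumptions, not the exact script used in the
project.

import random
import time
from datetime import datetime, timezone

# Illustrative value pools; the real script's pools may differ.
RESOURCES = ["/index.html", "/apache_pb.gif", "/login", "/api/items"]
STATUSES = [200, 200, 200, 304, 404, 500]  # weighted towards successes

def fake_log_line():
    ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
    ts = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S +0000")
    request = "GET {} HTTP/1.0".format(random.choice(RESOURCES))
    return '{} - - [{}] "{}" {} {}'.format(
        ip, ts, request, random.choice(STATUSES), random.randint(200, 5000))

if __name__ == "__main__":
    # Append entries continuously so the Kinesis Agent picks them up.
    with open("/tmp/fake_access.log", "a") as log_file:  # assumed path
        while True:
            log_file.write(fake_log_line() + "\n")
            log_file.flush()
            time.sleep(0.1)  # throttle the write rate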

This data is ready to be fed into our Kinesis stream via the agent service. Since
each record is in a dictionary format, KDS can read the schema itself. The agent
holds the Kinesis endpoint value to which the data is to be emitted.

In-depth Working Detail


• Create a Kinesis Stream: This can be done using the AWS console, the AWS
CLI, or an API call.
o Using the console: Create a new stream and set the number of shards as
per the requirement. The stream takes about a minute to become active.
o Using the AWS CLI: First we install the AWS CLI using pip. Then we
use the access key and secret key to configure the CLI. Here is the
command used to create a new stream:

aws kinesis create-stream --stream-name <string> --shard-count <int>
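
Once created, the stream's status can be verified with a command like the following
(a routine check, assuming the same stream name):

aws kinesis describe-stream --stream-name <string>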

o Using the API: We need boto3 installed on the machine; it is a Python
package which allows access to various AWS services from scripts. A minimal
boto3 sketch follows Figure 6.

Figure 6 Creating a Kinesis Stream using API
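
A hedged sketch of the API approach; the stream name, shard count, and region below
are assumptions for illustration:

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

# Create the stream; shard count is sized to the expected throughput.
kinesis.create_stream(StreamName="rta-log-stream", ShardCount=1)

# Creation is asynchronous: block until the stream becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="rta-log-stream")
print("Stream is active")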

• Install the Kinesis Agent on the Linux EC2 instance:
o First we update yum to the latest version and then install the AWS
Kinesis Agent.
o Configure the agent.json file to set the endpoint to our target stream,
then start the agent service. A hedged example configuration follows
Figure 7.
o We allow some data to flow into the Kinesis stream; this is essential
for schema discovery.

Figure 7 Agent file configuration
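
As an illustration, a minimal agent.json might look like the sketch below; the
region endpoint, file path, and stream name are assumptions, not the project's
actual values.

{
  "kinesis.endpoint": "kinesis.us-east-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/tmp/fake_access.log*",
      "kinesisStream": "rta-log-stream",
      "partitionKeyOption": "RANDOM"
    }
  ]
}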

• Creating the Kinesis Analytics Stream: Now our KDS has the raw data in its
shards, and this data flows in continuously. Therefore, in order to run our
sliding window queries on the data, we create an Analytics stream, which
outputs SQL-like tables in a stream format.
o In this project, it was required to find the number of requests sent by
each IP address within a window of 5 minutes.
o This is the scenario where we check for the possibility of a malicious
botnet sending fake requests, thus trying to slow down the service.
o The Analytics stream has built-in support for an output and an error
stream. We send the output over this output stream, using the following query:

CREATE OR REPLACE STREAM "REQ_COUNT" (IP VARCHAR(20), COUNT_VAL INTEGER);

CREATE OR REPLACE PUMP "VAL_PUMP" AS
  INSERT INTO "REQ_COUNT"
    SELECT STREAM IP, COUNT(IP) AS COUNT_VAL
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY IP,
      STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);

Equation 1: Sliding window query to show the request count for each IP in every 5-minute window

o The output table is also a stream, holding IP addresses and their
corresponding numbers of requests.
o Since this data has to be held in another KDS, we create another
stream, and the output of the Analytics stream is fed into this new KDS.
o AWS allows us to easily set the source and destination of the data using
the endpoint ARN values.
o The role of the pump created in this stream is to tie the in-application
streams together; it facilitates selecting data from one stream and
putting it into another.

• Creating a new Lambda Function: A Lambda function is a function which can
be triggered to perform some specified task, using an existing service as the
triggering agent. A hedged sketch of such a consumer follows Figure 8.
o The project utilised the Lambda function to trigger the SES service and
to perform CloudWatch metric operations.
o The Lambda function takes batches of 15 input records from the Kinesis
stream shards.
o Within these batches of 15, if any count value exceeds the threshold
of 12 (i.e. more than 12 requests made by an IP in a window of 5
minutes), the function calls the SES service.
o The SES service can send the CloudWatch-captured details to the
Network Administrator, along with the IP address which crossed the
threshold.
o The Network Administrator can then take the necessary action.
o The steps are the same for any other functionality, except that it
would require a separate Kinesis stream to trigger a Lambda function
with different logic.

Figure 8 Lambda can be used with Kinesis for various further steps
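
A minimal sketch of such a Lambda consumer; the record field names (IP, COUNT_VAL)
follow the analytics query above, while the e-mail addresses and region are
illustrative assumptions:

import base64
import json
import boto3

ses = boto3.client("ses", region_name="us-east-1")  # assumed region
THRESHOLD = 12  # maximum requests allowed per IP in one 5-minute window

def lambda_handler(event, context):
    # Kinesis delivers each record's payload base64-encoded.
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("COUNT_VAL", 0) > THRESHOLD:
            ses.send_email(
                Source="alerts@example.com",  # assumed sender address
                Destination={"ToAddresses": ["netadmin@example.com"]},
                Message={
                    "Subject": {"Data": "Possible HTTP flood detected"},
                    "Body": {"Text": {"Data": "IP {} made {} requests in 5 minutes.".format(
                        payload["IP"], payload["COUNT_VAL"])}},
                },
            )
    return {"records_checked": len(event["Records"])}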

Pricing Details
• Kinesis Pricing:
o No upfront cost.
o Shard capacity: 1 MB/s input and 2 MB/s output.
o Shard-hour cost: $0.015.
o PUT payload unit: counted in chunks of 25 KB (a 35 KB record counts as
2 PUT units, a 5 KB record as 1 PUT unit).
o PUT payload units are charged at a per-million rate ($0.014 per 1M
units). A rough worked example follows this list.
o Retention period: the default is 24 hours and can be extended up to 7 days.
Extended data retention up to 7 days costs an additional $0.02 per shard-hour.
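
As a rough worked example (the throughput of 100 records per second, each under
25 KB, is an illustrative assumption):

# Rough monthly estimate for one shard at an assumed 100 records/s (<25 KB each).
SHARD_HOUR_COST = 0.015       # USD per shard-hour
PUT_UNIT_COST = 0.014 / 1e6   # USD per PUT payload unit

hours_per_month = 24 * 30
records_per_month = 100 * 3600 * hours_per_month    # 259,200,000 PUT units

shard_cost = 1 * hours_per_month * SHARD_HOUR_COST  # = $10.80
put_cost = records_per_month * PUT_UNIT_COST        # = $3.63
print("Monthly Kinesis cost = ${:.2f}".format(shard_cost + put_cost))  # about $14.43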

• AWS Kinesis Analytics Pricing:
o $0.11 per KPU-hour. 1 KPU provides stream processing capacity
equivalent to 4 GB of memory and 1 vCPU.

• AWS SES Limits:
o Requires verification of domain ownership, which usually takes 24 hours.
o Maximum number of recipients per message: 50.

• AWS Lambda Pricing Details:
o First 1M requests per month are free; $0.20 per 1M requests thereafter.
o First 400K GB-seconds per month are free; $0.00001667 for every
GB-second used thereafter.

RESULTS
The completed project underwent rigorous testing and quality analysis. The QA
team used different DevTools to generate various test cases. The project passed
the tests on the staging server, passed the sanity tests, and generated accurate
results as per expectations.
To test the system against DoS attacks using HTTP flooding, the testing team used
the tool OpenSTA. The system detected the attack and alerted the Network
Administrator.
Secondly, the system was tested on finding the number of client errors, using the
HTTP status code as the criterion. The stream was reconfigured to count the
number of status codes of type 4xx. If there were more than 20 such logs in a
window of 5 minutes, the tech team was notified, with complete information on
where the fault had occurred. This brought down the system maintenance time as
well.
The knowledge has been transferred to the full-time employees of the corporation.
The project has been highly appreciated and hence deployed. The corporation is
expanding its domain to the analysis of other real-time activities as well, with the
same implementation knowledge.

CONCLUSION
The knowledge acquired through continuous learning at the corporation definitely
helped me implement the required tasks easily. The experience gained led me to
curiously investigate the workings of different software, aiming to find the most
optimal solution. The skills demonstrated throughout were commended by the
team and the project head, as the results were produced on time. As the project
was independent of other tasks in the company, it helped me grow in this field.
The main objectives of the internship task helped me learn many new technologies
and also gave me the necessary exposure to real-world problems. It taught me to
perform not just output-oriented work, but work that is also efficient in terms of
time, cost, and accuracy.
As far as real-time analytics is concerned, it is one of the most important
capabilities for any corporation, as it allows them to react quickly and without
delay. They can seize opportunities or prevent problems before they happen. Real-time
analytics puts power directly into the hands of business corporations, which is
where it should be for the greatest business benefit. This field therefore has huge
potential for expansion and can provide a much-needed edge to a firm over its
competitors.

REFERENCES

• AWS Documentation: https://docs.aws.amazon.com/index.html#lang/en_us
• Perform sliding window queries on streaming data: https://sqlstream.com/platform/kinesis/
• Apache log documentation: https://httpd.apache.org/docs/2.4/logs.html
• Code and method references: https://github.com/awslabs/
• HTTP status codes: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
• About streaming data: https://aws.amazon.com/streaming-data/
• Producer and consumer scripts: https://www.arundhaj.com/blog/getting-started-kinesis-python.html
• Understanding DoS attacks: https://www.digitalattackmap.com/understanding-ddos/
