Report Colloquium RTA
By:
Prakhar Dev Gupta (2014-IPG-062)
DECLARATION
I, Prakhar Dev Gupta, hereby declare that this report of the Summer Internship 2018, titled “Real Time Analysis of Log Data Using Data Streaming”, was prepared solely by me after the completion of 60 days of internship study and work at Toppr Education Private Limited during the period 14th May 2018 to 14th July 2018. I also confirm that the report has been prepared only for my academic requirements and for no other purpose. It may not be used in the interest of rival parties of the corporation.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
1. Company’s Profile
i. About Toppr
ii. Vision and Mission
iii. Awards
2. Problem Statement
i. Objective of RTA of log data
ii. Design constraints
3. Literature Review
i. What is streaming data?
ii. Comparison between batch processing and stream processing
iii. Challenges in working with streaming data
4. Amazon Web Services used
i. Amazon EC2
ii. Kinesis Agent
iii. Kinesis Stream
iv. Kinesis Analytics Stream
v. AWS Lambda function
vi. CloudWatch and SES
5. Setup and Pricing details
i. Fake Apache Log
ii. In-depth working details
iii. Pricing and other details
RESULTS
CONCLUSION
REFERENCES
ABSTRACT
This project is a product-based work on real-time collection and analysis of online traffic data through server logs. The project has immense potential to capture information in various areas, such as current server traffic, network busy times, or malicious IP addresses trying to consume resources through DDoS attacks. In real time, a security operations center (SOC) could detect such an attack in a matter of minutes. Using RTA, the corporation can have a 360-degree view of its customers’ interests and create policies that best suit their needs. A real-time recommendation engine can also be built using RTA concepts. “Real-time data streaming is the process by which big volumes of data are processed quickly such that a firm extracting the info from that data can react to changing conditions in real time.” Additionally, processing large volumes of data via streaming helps organizations react quickly to possible threats and fraudulent activity.
Several applications where this project can be used include:
E-Commerce
Risk Management
Pricing and Analytics
Network Monitoring
Fraud Detection
Here in this project, Amazon Web Services (AWS) has been employed as the third-party service vendor, primarily for its ease of use, effectiveness, and reliability.
ACKNOWLEDGEMENT
I would like to take this opportunity to thank Mr. Hemanth Goteti and Mr. Zeeshan Hayath, the co-founders of Toppr, Mr. Vivek Sharma, the Senior Product Manager, and Mr. Akhilesh Bussa, Backend Developer and my Project Mentor. He always motivated me to push my boundaries and extract my maximum potential. His guidance and encouragement helped me enjoy my work and enhance my understanding.
I would also like to thank all my fellow workers for always being willing to guide me. Without their supportive nature, completing this project would have remained a distant dream.
COMPANY’S PROFILE
About Toppr
Toppr is a product of Toppr Technologies Private Limited. It was co-founded by Zeeshan Hayath and Hemanth Goteti, alumni of IIT Bombay. It is a learning app for students studying in classes 5 to 12 and for students preparing for entrance and scholarship exams. As of December 2017, Toppr had a user base of 2.5 million. The content on the app is available in English and Hindi.
Awards:
2017 - Awarded the Best Educational Website by India Digital Awards [IAMAI].
PROBLEM STATEMENT
Objective
Given the enormous amount of data depicting different information, the task is to capture it in real time and analyse it in order to detect various parameters: in particular, malicious IPs flooding the server, busy servers, loyal customers, and the error rates and types in every given time window.
The system should also be able to take suitable automated actions, such as sending an e-mail to the tech lead in case of high error rates, informing the network security team of a possible attack attempt, automatically redirecting server traffic for load balancing, and triggering the retrieval of recommendations from Elasticsearch for the users.
Design Constraints
The system should be a near real-time system.
The data ingestion rate is enormous, hence the raw data cannot be stored in databases.
The solution should be cost-effective and feasible.
Since data is produced continuously into the log files, an agent is needed that is triggered every time an entry is made to these files.
The automated actions should also be quick enough.
The streaming data must be fully secure to prevent unethical stealing of data.
LITERATURE REVIEW
What is streaming data?
Streaming data is data that is generated continuously by thousands of data sources, which typically send records simultaneously and in small sizes (on the order of kilobytes). Web-server log files, clickstreams, social media feeds, and telemetry from connected devices are common examples, and such data must be processed sequentially and incrementally as it arrives.
Comparison between Batch and Stream Processing
Batch processing runs over a bounded dataset that has already been collected and stored, so its results arrive with a latency of minutes to hours; it suits deep, retrospective analytics over complete datasets. Stream processing operates on individual records, or on small rolling windows of records, as they arrive, producing results within seconds or milliseconds; it suits continuous monitoring and immediate response, which is the requirement of this project.
Challenges in working with streaming data
A streaming pipeline needs a storage layer that supports fast, inexpensive, and replayable writes of ordered records, and a processing layer that consumes those records and acts on them. Both layers must remain scalable, fault tolerant, and durable as the ingestion rate fluctuates, which is why a managed platform such as AWS was chosen for this project.
AMAZON WEB SERVICES USED
Amazon Web Services (AWS) is a secure cloud services platform, offering compute
power, database storage, content delivery and other functionality to help businesses
scale and grow. Millions of corporations are currently leveraging AWS cloud products
and solutions to build sophisticated applications with increased flexibility, scalability
and reliability. Some of the services used in my project are described below:
Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure,
resizable compute capacity in the cloud. It is designed to make web-scale cloud
computing easier for developers. The EC2 instance used in this project is Amazon
Linux AMI 2018.03.0 (HVM), SSD Volume Type. This Amazon Linux AMI is an EBS-backed, AWS-supported image. The default image includes AWS command line tools,
Python, Ruby, Perl, and Java. The repositories include Docker, PHP, MySQL,
PostgreSQL, and other packages.
Kinesis Agent
The Amazon Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and ingest data into Amazon Kinesis services, including Amazon Kinesis Streams and Amazon Kinesis Firehose. The log files generated on the EC2 instance are read by the Kinesis Agent when the service is turned on; a sample agent configuration is sketched after the list below. The Kinesis Agent has the following features:
Monitors file patterns and sends new data records to delivery streams.
Handles file checkpointing and, in case of failure, retries from the last checkpoint.
Delivers data in a reliable and simple manner.
Allows troubleshooting of the stream.
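A minimal configuration for the agent is placed in /etc/aws-kinesis/agent.json; the monitored file path and stream name below are illustrative assumptions, not the project's actual values:

    {
      "cloudwatch.emitMetrics": true,
      "flows": [
        {
          "filePattern": "/tmp/apache.log*",
          "kinesisStream": "apache-log-stream",
          "partitionKeyOption": "RANDOM"
        }
      ]
    }

The agent is then started as a service with "sudo service aws-kinesis-agent start".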
Kinesis Stream
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data
streaming service. KDS can continuously capture gigabytes of data per second from
hundreds of thousands of sources such as website clickstreams, database event
streams, financial transactions, social media feeds, IT logs, and location-tracking
events. The data collected is available in milliseconds to enable real-time analytics
use cases such as real-time dashboards, real-time anomaly detection, dynamic
pricing, and more.
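A stream can be created either from the AWS console or programmatically. A minimal sketch using boto3 (the stream name and region are illustrative assumptions):

    import boto3

    # Hypothetical stream name and region, for illustration only.
    kinesis = boto3.client("kinesis", region_name="us-east-1")
    kinesis.create_stream(StreamName="apache-log-stream", ShardCount=1)

    # Block until the stream is ACTIVE and ready to accept records.
    kinesis.get_waiter("stream_exists").wait(StreamName="apache-log-stream")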
Figure 2 Kinesis Analytics Stream working diagram
CloudWatch and SES
Amazon CloudWatch is a monitoring and management service built for developers,
system operators, site reliability engineers (SRE), and IT managers. CloudWatch is
natively integrated with more than 70 AWS services such as Amazon EC2, Amazon
DynamoDB, Amazon S3, Amazon ECS, AWS Lambda, Amazon API Gateway, etc. It
therefore provides rich and deep insights into AWS resources. Apart from the various built-in metrics, CloudWatch also allows the custom definition of metrics.
SES is the abbreviation of Simple Email Service. It is a cloud-based email sending service designed to help digital marketers and application developers send marketing, notification, and transactional emails. In our case, it sends an alert email to the person in charge when the traffic metric crosses a particular threshold.
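As a sketch of how a custom metric could be published from this pipeline (the namespace and metric names are illustrative assumptions, not the project's actual values):

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def publish_request_count(ip, count):
        # Publish the per-IP request count as a custom metric; a CloudWatch
        # alarm on this metric can then notify the person in charge.
        cloudwatch.put_metric_data(
            Namespace="LogAnalysis",  # illustrative namespace
            MetricData=[{
                "MetricName": "RequestsPerIP",
                "Dimensions": [{"Name": "ClientIP", "Value": ip}],
                "Value": float(count),
                "Unit": "Count",
            }],
        )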
SETUP AND PRICING DETAILS
The setup has to have some log data, which is written into a log file that is later read by the Kinesis Agent for further analysis. Log data analysis is widely done on Apache log data, which has a predefined structure, discussed below:
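A typical entry in Apache's Common Log Format records the client IP, identity, user, timestamp, request line, HTTP status code, and response size, for example:

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326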
Figure 5 Fake Apache log generation using Python
This data is ready to be fed into our Kinesis Stream via the Agent service. Since each record is emitted in a dictionary (JSON-like) format, the analytics application can discover the schema itself. The agent holds the Kinesis endpoint value to which the data is to be emitted.
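The generation script itself is shown in Figure 5; a minimal sketch of such a generator in Python (the log path and field values are illustrative) could look like this:

    import random
    import time
    from datetime import datetime

    METHODS = ["GET", "POST", "PUT", "DELETE"]
    STATUS_CODES = [200, 301, 400, 404, 500]
    RESOURCES = ["/home", "/login", "/cart", "/search", "/api/items"]

    def fake_log_line():
        # Build one entry in Apache Common Log Format.
        ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        ts = datetime.now().strftime("%d/%b/%Y:%H:%M:%S +0000")
        request = '"%s %s HTTP/1.1"' % (random.choice(METHODS), random.choice(RESOURCES))
        return '%s - - [%s] %s %d %d' % (ip, ts, request,
                                         random.choice(STATUS_CODES),
                                         random.randint(200, 5000))

    if __name__ == "__main__":
        # Append one fake entry per second to the file watched by the Kinesis Agent.
        while True:
            with open("/tmp/apache.log", "a") as f:
                f.write(fake_log_line() + "\n")
            time.sleep(1)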
o Using API: We need to have boto3 installed on the machine. It is a Python package that allows access to various AWS services from scripts.
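For instance, a sketch of pushing a single log line into the stream with boto3 (the stream name is the same illustrative one used above):

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def send_log_line(line):
        # The partition key decides the destination shard; keying on the
        # client IP keeps each IP's records ordered within one shard.
        kinesis.put_record(
            StreamName="apache-log-stream",  # illustrative stream name
            Data=(line + "\n").encode("utf-8"),
            PartitionKey=line.split()[0],
        )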
Creating Kinesis Analytics Stream: Now our KDS has the raw data in its shards, and this data flows in continuously. Therefore, in order to run our sliding-window queries on the data, we create an Analytics stream, which outputs SQL-like tables in a stream format.
o In this project, it was required to find the number of requests sent by each IP address in every 5-minute window.
o This is the scenario where we check for the possibility of a malicious botnet sending fake requests, thus trying to slow down the service.
o The Analytics stream has in-built support for an output and an error stream. We send the output over this output stream, using the following query:
FROM "SOURCE_SQL_STREAM_001"
GROUP BY "IP",
Equation 1 Sliding window query to show request count for each IP in every 5 minutes window
o The output table is also a stream, which has the IP addresses and their corresponding numbers of requests.
o Since this data has to be held in another KDS, we create another stream, and the output of the Analytics stream is fed to our new KDS.
o AWS allows us to easily set the source and destination of the data using the endpoint ARN values.
o The role of the pump created in this application is to tie the in-application streams together: it continuously selects data from one stream and inserts it into another.
Creating a new Lambda Function: A Lambda function is a piece of code that can be triggered to perform some specified task, using an existing service as the triggering agent.
o The project utilised the Lambda function to trigger the SES service and to perform CloudWatch metric operations.
o The Lambda function takes batches of 15 input records from the Kinesis stream shards.
o Within these batches of 15, if any count value exceeds the threshold of 12 (i.e. more than 12 requests made by an IP in a window frame of 5 minutes), the function calls the SES service; a minimal handler is sketched after this list.
o The SES service can send the CloudWatch-captured details to the Network Administrator along with the IP address that crossed the threshold.
o The Network Administrator can then take the necessary action.
o The steps are the same for any other function, except that it would require a separate Kinesis stream to trigger a Lambda function with different logic.
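A minimal sketch of such a handler (the email addresses and the record field names from the Analytics output are illustrative assumptions):

    import base64
    import json
    import boto3

    ses = boto3.client("ses")
    THRESHOLD = 12  # more than 12 requests from one IP per window triggers an alert

    def lambda_handler(event, context):
        # Kinesis delivers each record base64-encoded inside the event payload.
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            row = json.loads(payload)  # e.g. {"IP": "10.0.0.1", "REQUEST_COUNT": 17}
            if row["REQUEST_COUNT"] > THRESHOLD:
                ses.send_email(
                    Source="alerts@example.com",  # illustrative; must be SES-verified
                    Destination={"ToAddresses": ["netadmin@example.com"]},
                    Message={
                        "Subject": {"Data": "Possible flooding from " + row["IP"]},
                        "Body": {"Text": {"Data": row["IP"] + " sent "
                                 + str(row["REQUEST_COUNT"]) + " requests in one window."}},
                    },
                )
        return {"processed": len(event["Records"])}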
Figure 8 Lambda can be used with Kinesis for various further steps
Pricing Details
Kinesis Pricing:
o No upfront cost
o Shard: 1MB/s input and 2MB/s output.
o Shard hour cost: $0.015
o PUT payload unit: counted in chunks of 25 KB (a 35 KB record counts as 2 PUT units, a 5 KB record as 1 PUT unit).
o PUT Payload Unit is charged with a per million PUT Payload Units rate.
($0.014 per 1M records)
o Retention period: Default is 24 hours. Can extend up to 7 days.
Extended data retention up to 7 days has $0.02 per shard hour cost.
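As an illustrative estimate (assuming one shard ingesting about 100 records per second, each record under 25 KB, over a 30-day month):
o Shard hours: 24 hours × 30 days = 720 shard hours × $0.015 = $10.80
o PUT payload units: 100 records/s × 86,400 s × 30 days = 259.2 million units × $0.014 per million ≈ $3.63
o Estimated Kinesis total: about $14.43 for the month, excluding EC2, Kinesis Analytics, and Lambda charges.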
RESULTS
The project I completed underwent rigorous testing and quality analysis. The QA team used different DevTools to generate various test cases. The project passed the tests on the staging server, passed the sanity tests, and generated accurate results as per expectations.
To test the system against DoS attacks using HTTP flooding, the testing team used a tool named OpenSTA. The system detected the attack and alerted the Network Administrator about it.
Secondly, the system was tested on finding the number of client errors, using the HTTP status code as the criterion. The stream was reconfigured to count the number of status codes of type 4xx. If there were more than 20 such logs in a window of 5 minutes, the tech team was notified, with complete information on where the fault had occurred. This brought down the system maintenance time as well.
The knowledge has been transferred to the full-time employees of the corporation. The project has been highly appreciated and has been deployed. The corporation is extending the approach to the analysis of other real-time activities as well, using the same implementation knowledge.
CONCLUSION
The knowledge acquired through continuous learning at the corporation definitely helped me implement the required tasks with ease. The experience gained encouraged me to investigate the workings of different software, aiming to find the most optimal solution. The skills demonstrated throughout were commended by the team and the project head, as the results were produced on time. As the project was independent of the other tasks in the company, it helped me grow in this field. The main objectives of the internship task helped me learn many new technologies and also gave me the necessary exposure to real-world problems. It taught me not just to perform output-oriented work, but also to perform work that is efficient in terms of time, cost, and accuracy.
As far as real-time analytics is concerned, it is one of the most important capabilities for any corporation, as it allows them to react quickly and without delay. They can seize opportunities or prevent problems before they happen. Real-time analytics puts the power directly into the hands of business corporations, which is where it should be for the greatest business benefit. This area therefore has huge potential for expansion and can provide a much-needed edge to a firm over its competitors.
REFERENCES