BDA - Lecture 3


BIG DATA

CS-585
Unit-1: Lecture 3
Contents
• Structure of Big Data
• Big Data Processes
• Big Data Framework
• Big Data Platform and Applications Framework
• Examples of Big Data Platform in Practice
• Big Data Platform Manifesto
• Big Data Technologies
• Big Data Tools
• Big Data Analytics and Techniques
• Big Data Use Case
Big Data Processes / Life Cycle

Stages: Acquisition → Extraction → Integration → Analysis → Interpretation → Decision

● Problem: The sale of chewing gum is going down
● Acquisition
– Sales by customer, region and time
– Surveys of users
– Social networks
● Extraction
– Data loading from receipts
– Automatic reading of questionnaires
– Data extraction from Twitter
● Integration
– Based on user types
● Analysis
– Chewing gum bought by people older than 25
– Chewing gum preferred by people younger than a certain age
● Interpretation
– Moms believe: chewing gum = bad teeth
– Boys and girls believe that chewing gum is for babies
● Decisions
– We make chewing gum without sugar
– We ask dentists to advertise our chewing gum as a refresher
– We make commercials targeted at boys and girls
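The integration and analysis stages of the chewing gum example can be illustrated with a small sketch. The records, the `segment` and `buy_rate_by_segment` helpers, and all values below are hypothetical and exist only to show the idea of grouping customers by user type and comparing purchase rates; they are not data from the lecture.

```python
from collections import Counter

# Hypothetical integrated records: (age, bought_gum). Illustrative values only.
purchases = [
    (12, False), (15, False), (17, True), (22, True),
    (27, True), (31, True), (36, True), (44, False),
]

AGE_THRESHOLD = 25  # the age boundary mentioned in the analysis stage


def segment(age):
    """Integration step: assign each customer to a user type."""
    return "older than 25" if age > AGE_THRESHOLD else "25 or younger"


def buy_rate_by_segment(records):
    """Analysis step: share of customers in each segment who bought gum."""
    totals, buyers = Counter(), Counter()
    for age, bought in records:
        seg = segment(age)
        totals[seg] += 1
        buyers[seg] += bought  # True counts as 1, False as 0
    return {seg: buyers[seg] / totals[seg] for seg in totals}


rates = buy_rate_by_segment(purchases)
for seg, rate in rates.items():
    print(f"{seg}: {rate:.0%} bought chewing gum")
```

A gap between the two rates is the kind of signal that the interpretation stage then has to explain, for example through survey answers.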
Big Data Framework
Big Data Platform Manifesto
Big Data Technologies
• Hadoop/HDFS (2007)
– A framework based on the principles of MapReduce and Bigtable. It follows the distributed computing model: data is distributed, managed and stored across different systems known as nodes (HDFS: Hadoop Distributed File System). First used by Yahoo to support the storage of structured, semi-structured and unstructured data.
– Designed to parallelize data processing across computing nodes to speed up computation and hide latency. (Doug Cutting)
• MapReduce (2007)
– Designed to process large amounts of data in batch mode. It follows the distributed computing model: each task is mapped to many systems for processing, in a way that handles recovery from failure and load balancing. The model was developed by Google.
– The reduce operation aggregates the results. Mainly designed to work with HDFS, but it now also supports other data stores such as Cassandra and HBase.
• Bigtable
– Bigtable addressed the data storage problem. It is a distributed, highly scalable storage system for managing vast quantities of structured data.
– It behaves like a multidimensional sorted map. Captured data is stored on different nodes across the system. This differs from traditional databases, where data is organized in rows and columns.
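The map and reduce steps described for MapReduce above can be sketched on a single machine with a word count, the customary introductory example. This is a minimal illustration of the programming model only, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this sketch, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block) would be mapped on a separate node.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data tools", "big data platform", "data"])
print(counts)
```

Because each map call only sees its own document and each reduce call only sees one key, a framework can rerun a failed piece on another node without repeating the whole job.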
Big Data Technologies
• Cassandra - 2008 - A key-value NoSQL database with a column-family data representation and asynchronous masterless replication.
• HBase - 2008 - A key-value NoSQL database with a column-family data representation and master-slave replication. It uses HDFS as its underlying storage.
• ZooKeeper - 2008 - A distributed coordination service for distributed applications. It is based on Zab, a variant of the Paxos algorithm.
• Pig - 2009 - A scripting interface over MapReduce for developers who prefer scripting to native Java MapReduce programming.
• Hive - 2009 - A SQL interface over MapReduce for developers and analysts who prefer SQL to native Java MapReduce programming.
• Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for finding meaningful patterns in HDFS datasets.
• Sqoop - 2010 - A tool to import data from an RDBMS or data warehouse into HDFS/HBase and export it back.
• YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage cluster resources such as memory and CPU.
• Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into HDFS.
• Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message semantics.
• Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It provides libraries for machine learning, a SQL interface and near-real-time stream processing.
• Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
• SolrCloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It uses the Lucene library for data indexing.
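The idea behind the partitioned topics mentioned for Kafka can be sketched as key-based routing: messages with the same key always land in the same partition, so partitions can be spread over many machines while per-key order is kept. The sketch below is an assumption-laden illustration and does not use Kafka's client library or its actual hash function; `choose_partition` and `route` are hypothetical helpers, and CRC32 is used only because it is a stable hash available in the Python standard library.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical messages keyed by user id.
messages = [("user-1", "login"), ("user-2", "login"),
            ("user-1", "click"), ("user-1", "logout")]
partitions = route(messages, num_partitions=3)
for index, content in enumerate(partitions):
    print(index, content)
```

Adding partitions raises the number of consumers that can read in parallel, which is where the scalability in the description comes from.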
Big Data Tools
• Databases
– MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
• MapReduce
– Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• Storage
– S3, HDFS
• Servers
– EC2, Google App Engine, Elastic Beanstalk, Heroku
• Processing
– R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, TinkerPop
Big Data Tools and Technologies (Graphical View)
Apache Hadoop Eco System update
How Uber handled/solved their data problem
Big Data Success Stories
Healthcare
• Epidemic early warning
• ICU and remote monitoring
• USD 150,000 reduction in the cost of unnecessary neonatal surgery

Transportation
• Fleet Risk Advisors helps truck operators by building stronger and faster risk prediction models: 80% reduction in serious accidents, 20% reduction in minor accidents, 30% reduction in driver retention rates.

Life Insurance
• In Japan, claim processing has been made faster: 22% fewer mistakenly unpaid claims, 90% accuracy in coding medical terms, 20% reduction in assessment workforce.

IT Company
• IBM's big data business grew over 150% in 2014. IBM joined Apple and Twitter in strategic partnerships.
Thank You
Wish you a prosperous career with Big Data
