Big Data Lecture


Big Data ecosystem

M. Fanilo Andrianasolo
Data Analytics Tech Lead & Product Manager

2013 - 2016: Big Data Engineer
2016 - 2017: Data Analytics Evangelist
2017 - 2019: Data Science Tech Lead
2019 - current: Data Analytics Product Manager

Data is at the center of all IT activities

Data Hardware Innovations


Explosion of data
3Vs of Big Data

Volume
Velocity
Variety

Other Vs sometimes added: Variability, Value, Vulnerability, Vwhatever...

Problem

Daily rate in 2014: 600 TB

                1 TB         600 TB
Cost            ~$35         $21,000
Time to read    ~3 hours     75 days
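
The figures on this slide follow from simple multiplication, which the short sketch below reproduces. The constant names are illustrative, and the `read_days_parallel` helper assumes perfectly even parallel reads across machines, which is an idealization and not a claim from the slides.

```python
# Per-terabyte figures quoted on the slide.
COST_PER_TB_USD = 35
READ_HOURS_PER_TB = 3
DAILY_VOLUME_TB = 600

# Scaling the per-terabyte figures up to the daily volume.
daily_cost_usd = DAILY_VOLUME_TB * COST_PER_TB_USD
read_days = DAILY_VOLUME_TB * READ_HOURS_PER_TB / 24


def read_days_parallel(machines: int) -> float:
    """Idealized read time if the data is spread evenly over `machines` (assumption)."""
    return read_days / machines


print(daily_cost_usd)          # 21000
print(read_days)               # 75.0
print(read_days_parallel(75))  # 1.0
```

Under this idealized assumption, one day of data can only be read within a day if the work is split over many machines, which motivates the scaling discussion that follows.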

How should we store and query such data?


Scaling?

Vertical scaling

Pros:
- Lower power consumption and cooling costs
- Less challenging to implement
- Lower licensing costs
- (Sometimes) less network hardware

Cons:
- PRICE
- Hardware failure causes bigger outages
- Vendor lock-in
- Limited upgradeability

Horizontal scaling

Pros:
- Much cheaper
- Easier fault-tolerance
- “Easier” upgrade by adding new machines

Cons:
- Bigger energy footprint
- Higher utility cost (electricity, cooling)
- More networking equipment

Scaling is hard
Big Data ecosystem
Apache Hadoop

Open-source software for reliable, scalable, distributed computing


Apache Hadoop ecosystem

More than 30 open source projects for managing and analyzing Big Data


Hadoop distributions
Hadoop distributions vs Cloud providers
Hadoop ecosystem use cases

Web indexing from web crawlers

Playlist generation from every listen

Log analysis

Product recommendation from purchases


A data platform canvas

Acquisition Transport Storage Processing Servicing

Security
Orchestration
Acquisition

Import: RDBMS -> Hadoop FS
Export: Hadoop FS -> RDBMS
Transport

.avsc

{
  "type" : "record",
  "namespace" : "test",
  "name" : "Employee",
  "fields" : [
    { "name" : "Name" , "type" : "string" },
    { "name" : "Age" , "type" : "int" }
  ]
}

.java

// The class is generated from the schema above, so it is named Employee
Employee e1 = new Employee();
e1.setName("omar");
e1.setAge(21);
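
To show why a schema makes transport compact, here is a minimal hand-written sketch of how a record matching the Employee schema could be turned into bytes. It follows the binary encoding rules of the Avro specification for `int` (zigzag, variable-length) and `string` (length prefix plus UTF-8 bytes), but it is not the Avro library: the function names are illustrative, and only the two types used on the slide are handled.

```python
import json

# Schema from the slide, parsed as plain JSON.
SCHEMA = json.loads("""
{
  "type": "record",
  "namespace": "test",
  "name": "Employee",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Age", "type": "int"}
  ]
}
""")


def encode_long(n: int) -> bytes:
    """Zigzag then variable-length encode an integer (7 bits per byte)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Length-prefixed UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Only the two primitive types used by the slide's schema are supported here.
ENCODERS = {"string": encode_string, "int": encode_long}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order; field names are not written."""
    return b"".join(ENCODERS[f["type"]](record[f["name"]]) for f in schema["fields"])


payload = encode_record(SCHEMA, {"Name": "omar", "Age": 21})
print(payload)  # b'\x08omar*'
```

The record takes 6 bytes because the field names and types live in the schema and not in the payload, so the reader needs the same schema to decode it.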
Hadoop Distributed File System

NameNode

DataNode1 DataNode2 DataNode3

block 1 block 2 block 1

block 2 block 1 block 1

block 1 block 1 block 2
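
The diagram shows blocks replicated across DataNodes while the NameNode keeps track of where each copy lives. The toy sketch below illustrates that idea with a simple round-robin placement. This is an assumption made for illustration: real HDFS uses a rack-aware placement policy, and the function names here are hypothetical.

```python
def place_blocks(num_blocks, datanodes, replication):
    """Return NameNode-style metadata: block id -> list of DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor cannot exceed the number of DataNodes")
    placement = {}
    for block in range(num_blocks):
        # Round-robin: consecutive replicas go to consecutive, distinct DataNodes.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live DataNode."""
    return {
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    }


nodes = ["DataNode1", "DataNode2", "DataNode3"]
placement = place_blocks(num_blocks=4, datanodes=nodes, replication=2)

print(placement[0])                                              # ['DataNode1', 'DataNode2']
print(readable_blocks(placement, failed={"DataNode2"}))          # {0, 1, 2, 3}
print(readable_blocks(placement, failed={"DataNode1", "DataNode2"}))  # {1, 2}
```

With two replicas per block, losing one DataNode loses no data, while losing two can. This is why the replication factor is a key durability setting.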


HBase

Key U:cookie U:is_auth U:has_t P:Product1 P:Product2 P:Product3

1960:Fanilo c13e 1 3

2001:Fanilo c13e 1

1990:Omar d45 1
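
The table above can be read as a sparse, sorted map: each row key points to column families (`U`, `P`), which hold only the cells that were actually written. The sketch below models that with nested dictionaries. The `SparseTable` class is hypothetical and is not the HBase client API. Only the cookie cells are stored, because the slide does not make clear which columns the remaining values belong to.

```python
class SparseTable:
    """Toy wide-column table: row key -> family -> qualifier -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, family, qualifier, value):
        self._rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        # Cells that were never written simply do not exist; there is no stored NULL.
        return self._rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start, stop):
        """Yield rows with start <= key < stop, in sorted key order."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, self._rows[key]


table = SparseTable()
table.put("1960:Fanilo", "U", "cookie", "c13e")
table.put("2001:Fanilo", "U", "cookie", "c13e")
table.put("1990:Omar", "U", "cookie", "d45")

print(table.get("1990:Omar", "U", "cookie"))         # d45
print(table.get("1990:Omar", "U", "is_auth"))        # None (not stored in this sketch)
print([key for key, _ in table.scan("1960", "2000")])  # ['1960:Fanilo', '1990:Omar']
```

Because rows are kept sorted by key, a range scan over a key prefix is cheap, which is why row key design matters so much in this kind of store.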
YARN – Yet Another Resource Negotiator

YARN hides the resource management details from the user to facilitate the management of parallel applications.
Batch processing - Map Reduce

Data locality: moving computation is cheaper than moving data
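
The three phases of a MapReduce job can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, using word count on made-up input lines. On a real cluster, map tasks are scheduled on the nodes that already hold the input blocks, and only the intermediate key/value pairs travel over the network during the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Emit one (word, 1) pair per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data is big", "data moves slowly"])
print(counts["big"], counts["data"], counts["is"])  # 2 2 1
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the framework is free to run them in parallel on different machines.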


Batch processing

HiveQL -> MapReduce
Metastore

music_sales.csv

1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
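
HiveQL lets you describe a batch job as a SQL-like query, which is then compiled into MapReduce jobs, with table definitions kept in the Metastore. The sketch below shows the declarative style on the rows of music_sales.csv. It uses Python's built-in `sqlite3` as a local stand-in and is not Hive. The column names (`id`, `title`, `price`, `quantity`) are assumptions, since the slide does not label the columns.

```python
import sqlite3

# Rows from music_sales.csv; column meanings are assumed for this sketch.
rows = [
    (1, "Let it go", 4.99, 5),
    (2, "Snow", 7.99, 1),
    (3, "Lion King", 0.99, 1),
    (4, "SISE", 1.99, 2),
    (5, "Lyon is great", 2.99, 3),
]

conn = sqlite3.connect(":memory:")
# The table definition plays the role that the Metastore plays for Hive tables.
conn.execute(
    "CREATE TABLE music_sales (id INTEGER, title TEXT, price REAL, quantity INTEGER)"
)
conn.executemany("INSERT INTO music_sales VALUES (?, ?, ?, ?)", rows)

# Declarative query: state what is wanted and let the engine plan the execution.
query = """
    SELECT title, price * quantity AS revenue
    FROM music_sales
    ORDER BY revenue DESC
    LIMIT 1
"""
top_title, top_revenue = conn.execute(query).fetchone()
print(top_title)  # Let it go
```

The same query text expresses the intent regardless of data size. What changes between a local database and a cluster engine is how the query gets executed.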
Batch processing

Pig Latin -> MapReduce

recordings = LOAD '$file' USING PigStorage(',') AS (id, price, artist, title, duration, year);
-- LIMIT is a reserved word in Pig, so the alias uses a different name
first_rows = LIMIT recordings $size;
DUMP first_rows;

music_sales.csv

1, « Let it go », 4.99€, 5
2, « Snow », 7.99€, 1
3, « Lion King », 0.99€, 1
4, « SISE », 1.99€, 2
5, « Lyon is great », 2.99€, 3
Batch processing
Realtime processing
Visualizing
Security

Kerberos
Orchestration
Overview

Acquisition Transport Storage Processing Servicing

Security
Orchestration
Architecture design
Multiple architectures
Lambda architecture
Kappa architecture
CONCLUSION
THANKS

@andfanilo


[email protected]
