Hadoop V.01
When dealing with large datasets, organizations face difficulties in creating, manipulating, and managing Big Data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze massive datasets.
A primary goal of looking at big data is to discover repeatable business patterns. Unstructured data, most of it located in text files, accounts for at least 80% of an organization's data. If left unmanaged, the sheer volume of unstructured data that is generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit.
Volume: Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
Variety: Big data extends beyond structured data to include unstructured data of all varieties: text, audio, video, click streams, log files and more.
Velocity: Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
Main considerations: synchronizing retention and disposition policies across jurisdictions, and moving data across countries. Customers need help navigating these frameworks and the changes to them.
Compression
Challenge: Compression is normally applied instead of deduplication, yet it will compress duplicated data regardless. Opportunity: there is a need for an automated way of doing both: de-duplicating first, then compressing (a sketch follows below).
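As a rough illustration of the de-duplicate-then-compress pipeline, the sketch below splits a file into fixed-size chunks, drops duplicate chunks by SHA-256 digest, and GZIP-compresses only the unique chunks. This is one hypothetical design, not a production deduplicator; the class name DedupThenCompress and the 4 KB chunk size are assumptions.

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: de-duplicate fixed-size chunks by content hash,
// then compress only the surviving unique chunks.
public class DedupThenCompress {
  public static void main(String[] args) throws Exception {
    byte[] data = Files.readAllBytes(Paths.get(args[0]));
    int chunkSize = 4096;  // assumed chunk size
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    Map<String, byte[]> unique = new LinkedHashMap<>();
    for (int off = 0; off < data.length; off += chunkSize) {
      byte[] chunk = Arrays.copyOfRange(data, off, Math.min(off + chunkSize, data.length));
      // First occurrence of each digest wins; later duplicates are dropped.
      unique.putIfAbsent(Base64.getEncoder().encodeToString(md.digest(chunk)), chunk);
    }
    // Compress only the unique chunks, so compression no longer wastes
    // effort on data that deduplication would have removed.
    try (OutputStream out = new GZIPOutputStream(
             Files.newOutputStream(Paths.get(args[0] + ".dedup.gz")))) {
      for (byte[] chunk : unique.values()) out.write(chunk);
    }
  }
}
```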
About Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. As a solution for Big Data, it deals with the complexities of high volume, velocity, and variety of data, and it enables applications to work with thousands of nodes and petabytes of data. It is:
Reliable: The software is fault-tolerant; it expects and handles hardware and software failures.
Scalable: Designed for massive scale of processors, memory, and locally attached storage.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
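To make the "simple programming model" concrete, here is the canonical MapReduce word-count job written against the org.apache.hadoop.mapreduce API. This is a sketch assuming a Hadoop 2.x-style Job.getInstance; input and output paths come from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }
  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles the hard parts named above: it splits the input across nodes, restarts failed tasks, and moves computation to where the data lives.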
Financial Drivers
Growing cost of data systems as a percentage of IT spend
Cost advantage of commodity hardware + open source
Enables departmental-level big data strategies
Trend
The OLD WAY
Operational systems keep only current records, short history
Analytics systems keep only conformed/cleaned/digested data
Unstructured data locked away in operational silos
HBase: Column-oriented, non-relational, schema-less, distributed database modeled after Google's BigTable. Promises random, real-time read/write access to Big Data (a minimal client sketch follows this list).
Hive: Data warehouse system that provides a SQL interface. Data structure can be projected ad hoc onto unstructured underlying data.
Pig: A platform for manipulating and analyzing large data sets. High-level language for analysts.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
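For a flavor of HBase's random, real-time read/write access, here is a minimal sketch using the HBase 1.x Java client API. The table name "metrics", column family "d", and qualifier "clicks" are made-up examples, not part of the original material.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("metrics"))) {  // hypothetical table
      // Write one cell: row "row1", column family "d", qualifier "clicks".
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes(42L));
      table.put(put);
      // Random read of the same cell, straight back from the region server.
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      long clicks = Bytes.toLong(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks")));
      System.out.println("clicks = " + clicks);
    }
  }
}
```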
Hadoop Developer
Core contributor since Hadoop's infancy
Project Lead for Hadoop Distributed File System
Facebook (Hadoop, Hive, Scribe)
Yahoo! (Hadoop in Yahoo Search)
Veritas (San Point Direct, Veritas File System)
IBM Transarc (Andrew File System)
UW Computer Science Alumni (Condor Project)
Staging layer: The most common use of Hadoop in enterprise environments is as "Hadoop ETL": preprocessing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (see the map-only sketch after this list).
Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
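As one illustration of the staging-layer pattern above, a map-only MapReduce job (zero reducers) can filter and normalize raw log lines before the result is bulk-loaded into a warehouse. This is a sketch; the "ERROR" keyword filter and the LogFilter class name are stand-ins for real cleansing rules.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogFilter {
  public static class FilterMapper extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    public void map(Object key, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Keep only well-formed error records; trimming is a trivial "transform" step.
      String s = line.toString().trim();
      if (!s.isEmpty() && s.contains("ERROR")) {
        ctx.write(NullWritable.get(), new Text(s));
      }
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log filter");
    job.setJarByClass(LogFilter.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0);  // map-only: mapper output is written directly to HDFS
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because there is no shuffle or reduce phase, a job like this scales linearly with input size, which is why the ETL staging pattern is such a common first Hadoop deployment.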
Karmasphere released the results of a survey of 102 Hadoop developers regarding adoption, use, and future plans.
Hadoop at LinkedIn
LinkedIn leverages Hadoop to transform raw data into rich features, using knowledge aggregated from LinkedIn's 125-million-member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as social media feeds, may contain valuable, real-time information on LinkedIn members' opinions, activities, and mood states.
Hadoop at Foursquare
Foursquare was having trouble handling the huge amount of data it generates. Its business development managers, venue specialists, and upper management eggheads needed access to the data in order to inform some important decisions.
To enable easy access to data, foursquare engineering decided to use Apache Hadoop and Apache Hive in combination with a custom data server (built in Ruby), all running in Amazon EC2. The data server is built using Rails, MongoDB, Redis, and Resque, and communicates with Hive using the Ruby Thrift client.
Hadoop at Orbitz
Orbitz needed an infrastructure that provides: long-term storage of large data sets; open access for developers and business analysts; ad-hoc querying of data.
HDFS Architecture
[Diagram: HDFS architecture. A client issues metadata operations to the NameNode and performs block reads and writes directly against the DataNodes; blocks are replicated across DataNodes placed in Rack 1 and Rack 2.]
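The same client/NameNode/DataNode split shows up in the file system API: the client asks the NameNode where blocks live, then streams data to or from DataNodes directly. Below is a minimal sketch of writing and reading a file through org.apache.hadoop.fs.FileSystem; the path /user/demo/hello.txt is hypothetical, and the cluster address is assumed to come from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // metadata ops go to the NameNode
    Path p = new Path("/user/demo/hello.txt");  // hypothetical path
    // Write: blocks are streamed to DataNodes, which replicate them per policy.
    try (FSDataOutputStream out = fs.create(p, true)) {
      out.writeUTF("hello HDFS");
    }
    // Read: the client fetches block locations, then reads from DataNodes directly.
    try (FSDataInputStream in = fs.open(p)) {
      System.out.println(in.readUTF());
    }
  }
}
```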