48 Infochimps - How To Do A Big Data Project
on technology first instead of business objectives.
All of these factors put every Big Data project at risk. In a recent survey of IT professionals it was found that
nearly …% of Big Data projects don't get completed. The same metric for IT projects in general is only …%.
While how you manage your Big Data project will vary depending on your specific use case and company
profile, there are key steps to successfully implement a Big Data project:
1. Defining your business use case, with clearly defined objectives driving
business value.
At Infochimps, we have seen Big Data business objectives ranging from creating and
deploying a SaaS application product to increasing revenue by providing the sales team
prioritized leads leveraging customer service data.
[Figure: business value gain versus time to implement, with impact labels (LOW IMPACT, MOD. IMPACT), for Product/Service Quality, Revenue Growth, Market Differentiation and Executive Financial Dashboard.]
In order to explore your organization's expectations of Big Data, we recommend answering these
questions first:
These questions build a good foundation for your project. The more specific and connected to your business goals
your answers are, the more likely your project will succeed. These are the types of discussion points we ask our
clients when first learning about their Big Data projects. Here are more specific questions that elaborate on each
foundational point.
Determine the company's high-level objectives and how Big Data can support these objectives.
Understand the company's perception of Big Data and any relevant historical context.
Identify the problem area, such as marketing, customer care or business development, and the motivations
behind the project.
Describe the problem and obstacles in non-technical terms.
Inventory any solutions and tools currently used to address the business problem.
Weigh the advantages and disadvantages of the current solutions.
Navigate the process for initiating new projects and implementing solutions.
Identify the stakeholders that will benefit from the Big Data project.
Interview individual stakeholders to determine their project goals and concerns.
Document specific business objectives decided upon by key decision makers.
Assign priorities to the business objectives.
Architect a business use case.
[Figure: Cloud for Big Data (TM), with labels Support, Clickstreams & Downloads, User Generated Content, Customer Behavior and Data Science.]
Identify the project sponsor to remove obstacles, find the budget, provide organizational support and
champion the cause.
Establish the project manager and the team. Define the roles and responsibilities of each team member.
Understand the team's availability and resource constraints for the project.
At Infochimps, we typically see project teams that include these roles: Executive Sponsor,
Technical Sponsor, Project Manager, Architect, Data Engineer, Data Scientist/Analyst, Lead
Test, Lead QA, Lead Developer, IT Lead.
Example Business Use Case Checklist
Item
XYZ Bank
Executive Sponsor:
SVP IT
Company Objective:
Yes (3 months)
Budget (Pilot/Production):
Project Lead:
Senior Architect
Use Case:
Success Criteria:
At Infochimps, we like to see an objective like "Leveraging data from our CRM, Customer
Support and Finance applications and using cloud architecture, create an application to score
customers on a ranking scale, based on the likelihood we'll lose their business. The app will
be used by the Account Services team, who will employ a special customer service strategy to
increase retention, thereby reducing churn."
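As a minimal sketch, the checklist above can be kept as a small structured record so that blank items are easy to spot before the project starts. The class name `UseCaseChecklist`, the helper `missing_items` and the one-line use case summary are illustrative assumptions, not part of the Infochimps template; the field names simply mirror the checklist items.

```python
from dataclasses import dataclass, fields


@dataclass
class UseCaseChecklist:
    """One record per project; fields mirror the checklist items."""
    executive_sponsor: str = ""
    company_objective: str = ""
    budget: str = ""
    project_lead: str = ""
    use_case: str = ""
    success_criteria: str = ""


def missing_items(checklist):
    """Return the names of checklist items that are still blank."""
    return [f.name for f in fields(checklist)
            if not getattr(checklist, f.name).strip()]


# Values taken from the XYZ Bank example where they are given;
# the use case text is a hypothetical one-line summary.
xyz_bank = UseCaseChecklist(
    executive_sponsor="SVP IT",
    project_lead="Senior Architect",
    use_case="Score customers by likelihood of churn",
)

print(missing_items(xyz_bank))
```

Any item the function reports is a question to take back to the stakeholders before the use case is considered defined.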
Set Specific Objective Success Criteria
When determining success criteria, it's important to pick criteria that are measurable, such as a specific key
performance metric.
The following are tasks and considerations you can use to ensure you have properly captured the success criteria:
As precisely as possible, document the success criteria for this project.
Make sure each identified business objective has a measurable criterion that will determine if that objective has
been met successfully.
Share and gain approval of your business success criteria among key stakeholders.
Determine proper scope, specifically what is included and what is not included.
Develop a rough budget.
Set a timeline and successful milestones at … months, … months and a year.
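One way to keep success criteria measurable is to record each one with a metric, a baseline and a target, so that "met" becomes a simple comparison. The sketch below is illustrative only: the class `SuccessCriterion`, its fields and the churn figures are hypothetical and are not taken from any Infochimps template.

```python
from dataclasses import dataclass


@dataclass
class SuccessCriterion:
    """A business objective tied to one measurable metric."""
    objective: str
    metric: str
    baseline: float
    target: float
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """Compare an observed value against the agreed target."""
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


# Hypothetical numbers, for illustration only.
churn = SuccessCriterion(
    objective="Reduce customer churn",
    metric="monthly churn rate (%)",
    baseline=5.0,
    target=4.0,
    higher_is_better=False,
)

print(churn.is_met(3.8))  # observed value at the first milestone
```

Recording the direction of improvement alongside the target avoids ambiguity when a metric such as churn should go down rather than up.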
At Infochimps, we use a Gantt chart template as a starting point for Big Data
implementations:
It is essential to avoid common pitfalls like scope creep, unclear goals, passive or nonexistent project
management, poor communication, starting too small, etc.
Why do Big Data projects fail 30% more often than other IT projects?
By far, the biggest reason why these projects failed was inaccurate scope. Requirements expanded out of
proportion, or there wasn't a firm set of project objectives. Without firm success criteria, projects
continued to expand without realigned timelines, failed to demonstrate positive return on investment or, in the
worst case, failed to meet objectives and provide business value.
Another key point that came up was lack of cooperation between departments. By nature, Big Data is not just
going to help one stakeholder. It's often something that improves performance for a lot of different parties,
such as your IT team, your application developers, your data scientists and analysts, and people across the
organization from line of business managers to executives.
For more on this report, see http://www.infochimps.com/resources/white-papers/cios-big-data
Current Technical Environment
It's important to understand what tools are used and the architecture they are used in as it sits today.
When examining your data sources, ask:
There are many ways to describe data, but most descriptions focus on the quantity and quality of the data. Listed
below are some key characteristics to address when describing data:
Volume of data. For most analytical techniques, there are tradeoffs associated with data size. Large datasets
can produce more accurate models, but they can also increase processing time.
At Infochimps, we utilize the following template to gather and compile data sources,
documenting each of the dimensions of a data source.
Name | Description | Example
Data Source | Description of the data source | POS System
Listener Type | HTTP Req./Stream, Syslog, Batch Upload, etc. | HTTP Request
Data Type | JSON, XML, CSV/TSV/Delimited, Fixed Width, SQL, Binary | JSON
# of Channels | Either 1 or many; Gnip is 1 source with possibly 100s of diff. URLs to listen to |
Data Velocity | Avg. Events / Sec* | 125
 | Max Sustained Events / Sec (10 min) | 500
 | 12-Month Growth Factor of Data | 5x
 | | 0.4KB
Time to Action | Real-time actions (<150ms), near real-time actions (seconds), or non-time critical actions (minutes to hours) |
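The velocity figures in the template above lend themselves to a quick sizing calculation. The sketch below is a rough estimate under stated assumptions: it treats 125 as average events per second, 500 as the maximum sustained rate, 5x as the 12-month growth factor and 0.4KB as an average event size (the label for that last value is not shown in the template, so this reading is a guess). The class and method names are hypothetical.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400
KB_PER_GB = 1_000_000  # decimal units, for a rough estimate


@dataclass
class DataSource:
    name: str
    avg_events_per_sec: float
    max_sustained_events_per_sec: float
    growth_factor_12_months: float
    avg_event_size_kb: float  # assumption: 0.4KB is an average event size

    def avg_kb_per_sec(self) -> float:
        """Average ingest rate in KB per second."""
        return self.avg_events_per_sec * self.avg_event_size_kb

    def peak_kb_per_sec(self) -> float:
        """Ingest rate at the maximum sustained event rate."""
        return self.max_sustained_events_per_sec * self.avg_event_size_kb

    def gb_per_day(self, after_growth: bool = False) -> float:
        """Daily volume today, or after 12 months of growth."""
        kb = self.avg_kb_per_sec() * SECONDS_PER_DAY
        if after_growth:
            kb *= self.growth_factor_12_months
        return kb / KB_PER_GB


pos = DataSource("POS System", 125, 500, 5, 0.4)
print(round(pos.gb_per_day(), 2), round(pos.gb_per_day(after_growth=True), 2))
```

Under these assumptions the example source averages about 50 KB per second, or roughly 4.32 GB per day, growing to about 21.6 GB per day after a year. Real sizing would also need to account for replication, indexing and retention.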
At Infochimps, we recommend tracking how you plan to use and consume the data
generated by your data sources. Your data sources will eventually feed other processes in
your Big Data environment (e.g. directly to a customer application, to an RDBMS powering a BI
tool, a NoSQL data store, Hadoop, data archive, etc.).
Name
Example
End User
Data Scientist, Data Engineer, Data Analyst, Business
Analyst, Statistician, BI Specialist, LOB User,
Field Manager, Executive, etc.
Data Analyst
Tableau
Analysis Activities
What is the end user doing as part of their analysis?
Assessment
Evaluate your options with a Total Business Value Assessment. This means that you perform at least a
3-year total cost of ownership analysis, but you also include things like time to business value, ease of use,
scalability, standards-based and enterprise readiness.
However, before you get started on evaluating your solution options, it is important to know your buying
team. Buying teams generally consist of stakeholders from multiple organizational levels and sometimes
multiple divisions outside of IT. At a minimum, there should be an executive sponsor, project champion or
project team lead, technical decision maker and an economic decision maker.
Your entire buying team needs to be involved in evaluating the options. Options start with who you're
relying on to implement your project, such as: doing it yourself with internal resources and/or leveraging
software vendors, working with system integrators, deploying with cloud services providers, or using
emerging boutique Big Data consulting firms. We recommend looking at each implementation option and
weighing them against your specific business priorities. But don't forget to include Time to Business Value,
Ease of Use, Scalability, Standards-based and Enterprise Readiness. As you evaluate solutions, document how
each solution performs on these and other important dimensions.
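Weighing options against business priorities can be made explicit with a simple weighted score across the dimensions named above. This is only one possible approach, sketched under assumptions: the 1-to-5 scoring scale, the weights, the option names and the scores are all hypothetical, and the buying team would set its own.

```python
# Dimensions taken from the assessment described above.
DIMENSIONS = [
    "total_cost_of_ownership",
    "time_to_business_value",
    "ease_of_use",
    "scalability",
    "standards_based",
    "enterprise_readiness",
]


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores (higher is better)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def rank_options(options, weights):
    """Return option names ordered from best to worst weighted score."""
    return sorted(options,
                  key=lambda name: weighted_score(options[name], weights),
                  reverse=True)


# Hypothetical weights (business priorities) and 1-5 scores.
weights = dict(zip(DIMENSIONS, [3, 3, 1, 2, 1, 2]))
options = {
    "option_a": dict(zip(DIMENSIONS, [2, 1, 2, 3, 4, 3])),
    "option_b": dict(zip(DIMENSIONS, [4, 5, 4, 4, 3, 3])),
}

for name in rank_options(options, weights):
    print(name, round(weighted_score(options[name], weights), 2))
```

Keeping the weights separate from the scores lets the buying team debate priorities once and then apply them consistently to every implementation option.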
Project Overview
Accountable Executive
Objective
Expected Outcome
Approach
Phase 1 : Pilot
Phase 2 : Production
Department
Impact
Success Measures
Pilot Success
Production Success
Estimated Cost
Category
Description
Initial Cost
3yr TCO
Personnel
Software
Hardware
Training
Consulting
Time to Value
Execution
Closure
Major Activities
Timing
Deliverables
Name
Description
Project Plan
Infrastructure
Functional Spec
Implementation
Spec
Timing
Technical Requirements
Solution Description
Data Inputs
Name
Listener Type
HTTP Req./Stream,
Syslog, Batch upload
Example
HTTP Request
Data Type
JSON, XML, CSV/TSV/
Delimited, Fixed Width,
SQL, Binary
JSON
# of Channels
1 or many; Gnip is 1
source with possibly 100s
of diff. URLs to listen to
125
Max Sustained
Events / Sec
(10 min)
500
12-Month Growth
Factor of Data
How much volume is
expected to grow in next
year?
5x
0.4KB
Streaming
Aggregation?
By Minute
Non-Trivial
Decorators
Calls to external APIs/
data stores, complex
algorithms/logic, &c.
Sentiment Analyzer
Feed 1
Feed 2
Feed 3
Feed 4
Batch Jobs
Name
Description
What does this job do?
Frequency
How often does it run?
Input(s)
Where is data from?
Output(s)
Where does data write to?
# of Records
Avg. # of records in
each run
Example
Job 1
...
Job N
Nightly
S3
MySQL
1B
Requires Persistent
Data?
Do we have to store any
data on the Hadoop
cluster itself to perform
the job?
Type
Hadoop
Development
Method
What method should be
used to perform Hadoop
jobs?
No
Wukong
Hive
Java M/R
Pig
ElasticSearch
HBase
MySQL
HDFS
S3/Glacier
Purpose
Support a
customer-facing
web app
Support an
internal BI tool
Support a Tableau
installation
Allow ad hoc
queries in Hive
Archival/Disaster
Recovery
50
200
10
200
200
Max. Sustained
Events / Sec
(10 min)
200
500
50
500
500
Retention Policy
12-month
Growth Rate
Query Profile
What kinds of queries/
usage/volume should be
supported?
3x
3x
3x
3x
3x
~20 key/value
lookups / sec.
Many short table
scans to create time
series
Ad hoc queries
running over
large amounts of
historical data
Occasional batch
jobs from Hadoop
Private Cloud
Hybrid Cloud
Additional Tools
Any Additional Tools?
Public Cloud
Environment
Databases
Hadoop
Production Nodes
4 (ElasticSearch)
1 Elastic Cluster
Staging Nodes
2 (ElasticSearch)
n/a
SLA Requirements
Example
System Uptime
99.99%
<200ms
Data Retention
99.99%
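An uptime percentage in an SLA is easier to reason about when converted into a downtime budget. The short sketch below does that arithmetic; the function name is hypothetical, and the year is taken as 365 days for simplicity.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted in a period at a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


print(round(allowed_downtime_minutes(99.99), 2))
```

For the 99.99% example value, this works out to about 52.6 minutes of downtime per year.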
Solution Diagram
Production
Staging
Request a Free Demo
See Infochimps Cloud for Big Data.
Infochimps Cloud is a suite of enterprise-ready cloud services that make it simpler, faster and far less
complicated to develop and deploy Big Data applications in public, virtual private and private clouds.
Our cloud services provide a comprehensive analytics platform, including data streaming, storage,
queries and administration. With Infochimps, you focus on the analytics that drive your business
insights, not building and managing a complex infrastructure.
Contact Us
Infochimps, Inc.
… W. …th St., Suite …
Austin, TX …
www.infochimps.com
[email protected]
Twitter: @infochimps
Infochimps (TM), Inc.