How to Do a Big Data Project

Big Data is sweeping the business world and while it can mean different things to different people, one thing
always rings true: Data-driven decisions and applications create immense value by utilizing data sources to
discover, present, and operationalize important business insights.

At Infochimps, we want to share our experiences after guiding many enterprises through successful Big Data
projects. This project guideline should empower you to tackle the discussion and decide on build versus buy
when it comes to achieving your defined business objectives across various technical environments.

While there is broad industry consensus on the value of Big Data, there is no standardized approach for how to
begin and complete a project. The many tools, vendors, and trends in the marketplace, multiplied by different
use cases and potential projects, can lead to decision paralysis. In addition, some companies mistakenly focus
on technology first instead of business objectives.

All of these factors put every Big Data project at risk. In a recent survey of IT professionals it was found that
nearly …% of Big Data projects don't get completed. The same metric for IT projects in general is only …%.
While how you manage your Big Data project will vary depending on your specific use case and company
profile, there are key steps to successfully implement a Big Data project:

1. Defining your business use case - with clearly defined objectives driving business value.

2. Planning your project - a well managed plan and scope will lead to success.

3. Defining your technical requirements - detailed requirements will ensure you build what you need to reach
your objectives.

4. Creating a Total Business Value Assessment - a holistic solution comparison will take the politics (and
emotion) out of the choices.

1. Define Your Business Use Case


As enterprises explore Big Data, the business drivers vary widely from revenue growth to market differentiation.
We've seen companies realize the most significant benefits from Big Data projects when they start with an
inventory of business challenges and goals and quickly narrow them down to those expected to provide the
highest return.

At Infochimps, we have seen Big Data business objectives ranging from creating and
deploying a SaaS application product to increasing revenue by providing the sales team
prioritized leads leveraging customer service data.

Typical Use Cases

[Figure: typical use cases plotted by business value gain (low, moderate, or high impact) against time to
implement, grouped under Product/Service Quality, Revenue Growth, Cost & Risk Mitigation, and Market
Differentiation. Use cases shown: Enhanced Analytics for Supply Chain, Customer Risk Algorithm, Customer
Insights Engine, New Customer Facing Software Product, Executive Financial Dashboard, New Product Delivery
Channel.]

In order to explore your organization's expectations of Big Data, we recommend answering these
questions first:

What is the project goal?


What direction is the business headed?
What are the obstacles to getting there?
Who are the key stakeholders and what are their roles?
What is the first Big Data use case determined by key stakeholders?

These questions build a good foundation for your project. The more specific and connected to your business goals
your answers are, the more likely your project will succeed. These are the types of discussion points we ask our
clients when first learning about their Big Data projects. Here are more specific questions that elaborate on each
foundational point.

What Direction is the Business Headed?

Determine the company's high-level objectives and how Big Data can support these objectives.
Understand the company's perception of Big Data and any relevant historical context.
Identify the problem area, such as marketing, customer care, or business development, and the motivations
behind the project.
Describe the problem and obstacles in non-technical terms.
Inventory any solutions and tools currently used to address the business problem.
Weigh the advantages and disadvantages of the current solutions.
Navigate the process for initiating new projects and implementing solutions.

Identify Stakeholders and Business Use Case

Identify the stakeholders that will benefit from the Big Data project.
Interview individual stakeholders to determine their project goals and concerns.
Document specific business objectives decided upon by key decision makers.
Assign priorities to the business objectives.
Architect a business use case.

At Infochimps, we create these types of business use cases:

[Diagram: Cloud for Big Data. Labels: Support Clickstreams & Downloads; User Generated Content; Customer
Behavior; Data Science; BI Queries & Dashboards.]

"Our company intends to generate insights on how customers interact with their customer service portal
and improve their services for happier customers and more streamlined business."

Faster path to ROI with both tech and services
Ability to prove the value of Big Data internally
Scalability to more data sources and use cases

Determine the Project Team

Identify the project sponsor to remove obstacles, find the budget, provide organizational support, and
champion the cause.
Establish the project manager and the team. Define the roles and responsibilities of each team member.
Understand the team's availability and resource constraints for the project.

At Infochimps, we typically see project teams that include these roles: Executive Sponsor,
Technical Sponsor, Project Manager, Architect, Data Engineer, Data Scientist/Analyst, Lead
Test, Lead QA, Lead Developer, IT Lead.

Example Business Use Case Checklist

Item | XYZ Bank
Executive Sponsor | SVP IT
Company Objective | Enhance internal audit & reporting capabilities
Pilot (Yes/No - how long) | Yes (3 months)
Budget (Pilot/Production) | $100,000 (Pilot) - $500,000 (Production)
Project Lead | Senior Architect
Use Case | Information security, account security, ID theft monitoring and prevention
Success Criteria | Faster reporting, more robust reporting, improved operational efficiency

2. Plan Your Project


This is where things get specific. As a result of your research and meetings, you most likely have a nebulous
objective like reducing customer churn. This section intends to construct a concrete and specific objective agreed
upon by the project sponsors and stakeholders.
Specify expected goals in measurable business terms.
Identify all business questions as precisely as possible.
Determine any other quantifiable business requirements.
Define what a successful Big Data implementation would look like.
The goal may now be clear, but how will you know once you've achieved it? It's important to define what business
success for your Big Data project looks like before proceeding further.

At Infochimps, we like to see an objective like "Leveraging data from our CRM, Customer
Support and Finance applications and using cloud architecture, create an application to score
customers on a ranking scale, based on the likelihood we'll lose their business. The app will
be used by the Account Services team, who will employ a special customer service strategy to
increase retention, thereby reducing churn."

Set Specific Objective Success Criteria
When determining success criteria, it's important to pick criteria that are measurable, such as a specific key
performance metric.
The following are tasks and considerations you can use to ensure you have properly captured the success criteria:

As precisely as possible, document the success criteria for this project.

Make sure each identified business objective has a measurable criterion that will determine if that objective has
been met successfully.

Share and gain approval of your business success criteria among key stakeholders.
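
The three checks above can be captured in a small record per objective. The following is a minimal sketch in Python, not a prescribed format: the `Objective` class, its field names, and the sample values are all hypothetical and only illustrate how an objective without a metric, a target, or stakeholder approval can be flagged early.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Objective:
    name: str
    metric: Optional[str] = None        # key performance metric used to judge success
    target: Optional[float] = None      # value of the metric that counts as success
    approved_by: Tuple[str, ...] = ()   # stakeholders who have signed off


def gaps(objective: Objective) -> list:
    """Return what is still missing before the objective is measurable and agreed."""
    problems = []
    if not objective.metric:
        problems.append("no metric")
    if objective.target is None:
        problems.append("no target value")
    if not objective.approved_by:
        problems.append("no stakeholder approval")
    return problems


# Hypothetical objectives, for illustration only.
vague = Objective(name="Reduce customer churn")
specific = Objective(
    name="Reduce customer churn",
    metric="monthly churn rate (%)",
    target=2.0,
    approved_by=("Executive Sponsor", "Project Manager"),
)

for objective in (vague, specific):
    print(objective.name, "->", gaps(objective) or "ready")
```

An objective that returns an empty list has a metric, a target, and sign-off, which is the condition the checklist above asks for.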

Tie the Project Plan to Success

Determine proper scope, specifically what is included and what is not included.
Develop a rough budget.
Set a timeline and successful milestones at … months, … months, and a year.

At Infochimps, we use a Gantt chart template as a starting point for Big Data
implementations:

It is essential to avoid common pitfalls like scope creep, unclear goals, passive or nonexistent project
management, poor communication, starting too small, etc.

Why do Big Data projects fail 30% more often than other IT projects?
By far, the biggest reason why these projects failed was inaccurate scope. Requirements expanded out of
proportion or there wasn't a firm set of project objectives. Without firm success criteria, projects
continue to expand without realigned timelines, fail to demonstrate positive return on investment, or, in the
worst case, fail to meet objectives and provide business value.
Another key point that came up was lack of cooperation between departments. By nature, Big Data is not just
going to help one stakeholder. It's often something that improves performance for a lot of different parties,
such as your IT team, your application developers, your data scientists and analysts, and people across the
organization from line of business managers to executives.
For more on this report, see http://www.infochimps.com/resources/white-papers/cios-big-data

3. Define Your Technical Requirements


The technical requirements phase involves taking a closer look at the data available for your Big Data project. This
step will enable you to determine the quality of your data and describe the results of these steps in the project
documentation.

Current Technical Environment
It's important to understand what tools are used and the architecture they are used in as it sits today.

Inventory all tools used today.

Sketch the current architecture.

Identify Data Sources


Consider what data sources you'll need to take advantage of. Most data that's relevant to a production application's
use is going to come from a live stream or feed.

Existing data sources. This includes a wide variety of data, such as transactional data, survey data, Web logs,
etc. Consider whether your existing data sources are enough to meet your needs.

Purchased data sources. Does your organization use supplemental data, such as demographics? If not,
consider whether something like a Gnip or Datasift social media and news stream would complement your
current data to create additional project value.

Additional data sources. If the above sources don't meet your needs, you may need to conduct surveys or
begin additional tracking to supplement your existing data stores.

When examining your data sources, ask:

Which attributes from the database(s) seem most promising?


Which attributes seem irrelevant and can be excluded?
Is there enough data to draw generalizable conclusions or make accurate predictions?
Are there too many attributes for your analytics method of choice?
Are you merging various data sources? If so, are there areas that might pose a problem when merging?
Have you considered how missing values are handled in each of your data sources?
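
The last two questions can be checked mechanically before any merge. Below is a minimal sketch in plain Python. The record sets (`crm_rows`, `support_rows`), the `customer_id` key, and the sample values are hypothetical; a real project would read these from its own sources.

```python
def missing_rate(rows, field):
    """Fraction of rows where `field` is absent, None, or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(field) in (None, ""))
    return missing / len(rows)


def unmatched_keys(left, right, key):
    """Keys present in `left` but not in `right`; these rows drop out of an inner join."""
    right_keys = {r[key] for r in right if r.get(key) not in (None, "")}
    left_keys = {r[key] for r in left if r.get(key) not in (None, "")}
    return sorted(left_keys - right_keys)


# Hypothetical sample data, for illustration only.
crm_rows = [
    {"customer_id": "c1", "region": "TX"},
    {"customer_id": "c2", "region": ""},
    {"customer_id": "c3", "region": None},
]
support_rows = [
    {"customer_id": "c1", "tickets": 4},
    {"customer_id": "c3", "tickets": 1},
]

print(missing_rate(crm_rows, "region"))
print(unmatched_keys(crm_rows, support_rows, "customer_id"))
```

A high missing rate or a long list of unmatched keys is a signal to revisit the data source before building on the merged result.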

There are many ways to describe data, but most descriptions focus on the quantity and quality of the data. Listed
below are some key characteristics to address when describing data:

Volume of data. For most analytical techniques, there are tradeoffs associated with data size. Large datasets
can produce more accurate models, but they can also increase processing time.

Velocity of data. There are also tradeoffs associated with whether the data is at rest or in motion (static or
real-time). Velocity translates into how fast the data is created within any given period of time.

Variety of data. Data can take a variety of formats, such as numeric, categorical (string), or Boolean (true/false).
Paying attention to value type can prevent problems during later analytics. Frequently, values in the database are
representations of characteristics such as gender or product type. For example, one data set may use M and F to
represent male and female, while another may use the numeric values … and … instead. Note any conflicting
schemes in the data.

Time to action. Data can be used to take immediate action as well as be stored for future non-time critical
analysis. It's important to identify which data will most likely be used for real-time actions (<150ms), near
real-time actions (seconds), or non-time critical actions (minutes to hours).
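
Conflicting coding schemes, as described under variety of data, are usually reconciled by mapping each source to one shared set of labels. A minimal sketch in Python follows. The two mappings are assumptions made for illustration (in particular, the numeric codes 1 and 2 are not taken from any real data set), and `normalize` and `unknown_codes` are hypothetical helper names.

```python
# Assumed schemes: source A codes gender as "M"/"F", source B as the numbers 1/2.
SOURCE_A_CODES = {"M": "male", "F": "female"}
SOURCE_B_CODES = {1: "male", 2: "female"}


def normalize(values, codes):
    """Map source-specific codes to shared labels; unknown codes become None."""
    return [codes.get(v) for v in values]


def unknown_codes(values, codes):
    """Distinct values the mapping does not cover, worth noting as conflicts."""
    return sorted({v for v in values if v not in codes})


# Hypothetical sample values, for illustration only.
source_a = ["M", "F", "F"]
source_b = [2, 1, 9]  # 9 is not part of the assumed scheme

merged = normalize(source_a, SOURCE_A_CODES) + normalize(source_b, SOURCE_B_CODES)
print(merged)
print(unknown_codes(source_b, SOURCE_B_CODES))
```

Listing the uncovered codes, rather than silently dropping them, keeps the conflicts visible in the project documentation.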

At Infochimps, we utilize the following template to gather and compile data sources,
documenting each of the dimensions of a data source.

Name | Description | Example
Data Source | Description of the data source | POS System
Listener Type | HTTP Req./Stream, Syslog, Batch Upload, etc. | HTTP Request
Data Fields (Attributes) | | timestamp, item, purchase price, item ID, inventory ID, purchase ID, customer ID, cashier ID
Data Type | JSON, XML, CSV/TSV/Delimited, Fixed Width, SQL, Binary | JSON
# of Channels | Either 1 or many; Gnip is 1 source with possibly 100s of diff. URLs to listen to |
Data Velocity: Avg. Events / Sec* | | 125
Data Velocity: Max Sustained Events / Sec (10 min) | | 500
Data Velocity: 12-Month Growth Factor* | How much is volume expected to grow in next 12-mos. | 5x
Data Velocity: Avg Event Size | | 0.4KB
Time to Action | Real-time actions (<150ms), near real-time actions (seconds), or non-time critical actions (minutes to hours) | Non Time Critical, by Minute
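
The velocity rows of the template translate directly into a rough daily volume estimate. The sketch below is a back-of-the-envelope calculation in Python using the example column (125 events per second on average, 500 at peak, 0.4KB per event, 5x growth). The function name and the decimal unit convention (1 GB = 1,000,000 KB) are assumptions made for illustration.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(events_per_sec, event_size_kb):
    """Daily data volume in GB at a constant event rate, taking 1 GB = 1,000,000 KB."""
    return events_per_sec * event_size_kb * SECONDS_PER_DAY / 1_000_000


avg_now = daily_volume_gb(125, 0.4)     # average rate from the example column
peak_now = daily_volume_gb(500, 0.4)    # upper bound if the 10-minute peak lasted all day
avg_in_12_months = avg_now * 5          # 12-month growth factor from the example column

print(f"average now: {avg_now:.2f} GB/day")
print(f"peak bound:  {peak_now:.2f} GB/day")
print(f"in 12 months: {avg_in_12_months:.2f} GB/day")
```

For the example source this comes to a little over 4 GB per day today, which is the kind of figure that feeds into the storage and retention decisions later in the requirements.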

Identify How You'll Work with the Data

Consider what interfaces and tools are necessary for your company to work with your data sources. Infochimps
provides the ability to create custom applications and analytics using native APIs and tools as part of Hadoop,
databases, and stream processing, as well as abstracted and unified interfaces for improved user experience.
Infochimps also provides users with the ability to produce tables, charts, and other visualization elements using
BI tools such as Business Objects, Microstrategy, Cognos, Tableau, Datameer, or other similar tools. Such visual
analyses can help to address the Big Data project goals defined during the business understanding phase. Other
times, it is more appropriate to utilize statistical tools (R, SAS, SPSS, Matlab, etc.) and packaged applications
(CRM, POS, ERP, etc.).

Who needs to work with the data?


What are their skills and techniques?
Will training be required?
What tools do you currently have in your enterprise that you'd like to take advantage of?
Do those tools have Big Data connectors or proven interface methods?
What new tools could help with your data mining, analysis, visualization, reporting, etc.?
How and where will the data be stored?
What are the reporting and visualization tools necessary to achieve success in your end users' eyes?

At Infochimps, we recommend tracking how you plan to use and consume the data
generated by your data sources. Your data sources will eventually feed other processes in
your Big Data environment (e.g. directly to a customer application, to an RDBMS powering a BI
tool, a NoSQL data store, Hadoop, data archive, etc.).

Name | Description | Example
End User | Data Scientist, Data Engineer, Data Analyst, Business Analyst, Statistician, BI Specialist, LOB User, Field Manager, Executive, etc. | Data Analyst
End User Tool | Hive/Hue, SQL Server / MySQL / Oracle, Tableau, Cognos, Microstrategy, SAP Business Objects, SAS, SPSS, R, Excel, custom application, custom dashboard, etc. | Tableau
Analysis Activities | What is the end user doing as part of their analysis? | Exploratory queries, simple data mashups and calculations, creating visual reports with Tableau's report builder

4. Create a Total Business Value Assessment

Evaluate your options with a Total Business Value Assessment. This means that you perform at least a 3-year
total cost of ownership analysis, but you also include things like time-to-business value, ease-of-use,
scalability, standards-based, and enterprise readiness. However, before you get started on evaluating your
solution options, it is important to know your buying team. Buying teams generally consist of stakeholders
from multiple organizational levels and sometimes multiple divisions outside of IT. At a minimum, there
should be an executive sponsor, project champion or project team lead, technical decision maker, and an
economic decision maker.

Your entire buying team needs to be involved in evaluating the options. Options start with who you're
relying on to implement your project, such as: doing it yourself with internal resources and/or leveraging
software vendors, working with system integrators, deploying with cloud services providers, or using
emerging boutique Big Data consulting firms. We recommend looking at each implementation option and
weighing them against your specific business priorities. But don't forget to include Time to Business Value,
Ease of Use, Scalability, Standards-based, and Enterprise Readiness. As you evaluate solutions, document how
each solution performs on these and other important dimensions.
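
One common way to document how each option performs is a weighted scoring matrix. The sketch below, in Python, is an illustration under stated assumptions rather than a method from this guideline: the weights, the 1-to-5 scores, and the two option names are hypothetical and should be replaced with values agreed by your buying team.

```python
# Hypothetical weights (summing to 1.0) for the five dimensions named above.
WEIGHTS = {
    "time_to_business_value": 0.30,
    "ease_of_use": 0.20,
    "scalability": 0.20,
    "standards_based": 0.10,
    "enterprise_readiness": 0.20,
}

# Hypothetical 1-5 scores for two implementation options.
OPTIONS = {
    "internal resources": {
        "time_to_business_value": 2, "ease_of_use": 3, "scalability": 4,
        "standards_based": 4, "enterprise_readiness": 5,
    },
    "system integrator": {
        "time_to_business_value": 5, "ease_of_use": 4, "scalability": 4,
        "standards_based": 3, "enterprise_readiness": 3,
    },
}


def weighted_score(scores, weights=WEIGHTS):
    """Sum of score times weight over every weighted dimension."""
    return sum(weights[dim] * scores[dim] for dim in weights)


ranked = sorted(OPTIONS, key=lambda name: weighted_score(OPTIONS[name]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(OPTIONS[name]):.2f}")
```

Agreeing on the weights before scoring any option is what keeps the comparison from being steered toward a favorite.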

Time to Business Value. As projects require significant upfront investment, it is important to
understand how long before the solution will start generating value. As many of these projects can
overextend vendor agreed timelines, it is recommended to request information on similar scope projects that
have already been completed for other clients to better determine whether the vendor delivers on set
milestones. If your new Big Data application generates $…M per month and you launch … months ahead of
schedule, that's worth $…M. What's your time to business value?
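
The arithmetic behind that question is simple enough to sketch. The Python example below uses hypothetical figures only; the function names and dollar amounts are assumptions for illustration and do not come from any survey or customer.

```python
import math


def early_launch_value(monthly_value, months_ahead):
    """Value gained by going live `months_ahead` months earlier than planned."""
    return monthly_value * months_ahead


def months_to_break_even(upfront_cost, monthly_value):
    """Whole months of operation before cumulative value covers the upfront cost."""
    return math.ceil(upfront_cost / monthly_value)


# Hypothetical figures in dollars, for illustration only.
print(early_launch_value(monthly_value=250_000, months_ahead=4))
print(months_to_break_even(upfront_cost=600_000, monthly_value=250_000))
```

Running the same two calculations for each candidate solution gives a comparable time-to-value figure alongside its cost.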

Ease of Use. While considering new Big Data solutions, it is imperative to consider how the solution will
affect your need to augment internal resources for implementation and ongoing maintenance. Specifically,
new resources, such as data scientists, may be necessary to handle a project internally. In addition,
corporate IT resources will be needed to maintain the project once implemented. Ease of use also translates
into ease of integration within your internal technical infrastructure. This can have a big impact on
time-to-business value.

Scalability. Ideal solutions will expand with evolving business needs. While one use case may be the initial
driver, the ideal solution will support future use cases. Also, consider any costs associated with scaling. (Some
solutions offer initial capacity at a low cost, but the cost quickly increases as expansion occurs.)

At Infochimps, we've seen customers endure a time to business value of 18-24 months with
legacy solutions, whereas time-to-value can be as short as 30 days with cloud-based
deployments.

Standards-Based. In today's world, popular technologies come and go. Consider whether your
proposed solutions use open standards-based, best-in-class technologies to drive innovation and revenue
for your business needs. Additionally, explore any risks associated with being able to tailor your solution and
your ability to act quickly when you need to do so.

Enterprise Readiness. An important aspect of selection is whether the solution can operate within an
enterprise setting. Ideal solutions have high availability and disaster recovery in place. Additionally, they will
support your business compliance and security needs, and meet your current company-defined SLAs. To better
understand how solutions operate in an enterprise environment, contact the solution provider's enterprise
customer references.

3-Year Total Cost of Ownership. You need to add up your costs over at least three years to appreciate some
of the recurring costs which may not be completely obvious the first year. Costs include allocated data
center infrastructure (floor, racks, power, connectivity), hardware (compute, storage, and networking), software
(Big Data stack, OS, Admin SW, security SW, analytics SW), personnel costs (systems management, NOC,
development, consulting).
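
Adding initial and recurring costs over three years can be sketched in a few lines of Python. The category names loosely follow the cost list above, but every figure below is hypothetical, as is the `three_year_tco` helper.

```python
def three_year_tco(initial, annual, years=3):
    """Initial cost plus `years` of recurring cost, per category and in total."""
    categories = sorted(set(initial) | set(annual))
    per_category = {
        c: initial.get(c, 0) + years * annual.get(c, 0) for c in categories
    }
    return per_category, sum(per_category.values())


# Hypothetical figures in dollars, for illustration only.
initial = {"hardware": 120_000, "software": 40_000, "consulting": 30_000}
annual = {"data center": 25_000, "software": 20_000, "personnel": 150_000}

per_category, total = three_year_tco(initial, annual)
first_year = sum(initial.values()) + sum(annual.values())

print(per_category)
print("3-year total:", total)
print("first year only:", first_year)
```

In this made-up case the first year accounts for about half of the three-year total, and recurring personnel cost is the largest single item, which is exactly the kind of cost that a first-year view understates.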

Research shows that nearly …% of Big Data projects fail. At Infochimps, we are experts in Big Data
projects, implementing numerous enterprise projects that deliver insights within … days. This Big Data
guideline is based on a culmination of our successes; we hope you find value in it. If you would like to hear more
about the Infochimps Cloud, please request a demo. If you would like to discuss your project, please request a
complimentary consultation.

Project Overview

Accountable Executive:
Department:
Objective:
Impact:
Expected Outcome:

Approach
Phase 1: Pilot
Phase 2: Production

Success Measures
Pilot Success:
Production Success:

Estimated Cost
Category | Description | Initial Cost | 3yr TCO
Personnel | | |
Software | | |
Hardware | | |
Training | | |
Consulting | | |

Time to Value:

Activity & Timing
Phase | Major Activities | Timing
Initiation & Planning | |
Execution | |
Closure | |

Deliverables
Name | Description | Timing
Project Plan | |
Infrastructure | |
Functional Spec | |
Implementation Spec | |

Dependencies, Assumptions & Constraints

Technical Requirements

Solution Description:

Data Inputs
Name | Guidance | Example | Feed 1 | Feed 2 | Feed 3 | Feed 4
Listener Type | HTTP Req./Stream, Syslog, Batch upload | HTTP Request | | | |
Data Type | JSON, XML, CSV/TSV/Delimited, Fixed Width, SQL, Binary | JSON | | | |
# of Channels | 1 or many; Gnip is 1 source with possibly 100s of diff. URLs to listen to | | | | |
Avg. Events / Sec | | 125 | | | |
Max Sustained Events / Sec (10 min) | | 500 | | | |
12-Month Growth Factor of Data | How much volume is expected to grow in next year? | 5x | | | |
Avg. Event Size | | 0.4KB | | | |
Streaming Aggregation? | | By Minute | | | |
Non-Trivial Decorators | Calls to external APIs/data stores, complex algorithms/logic, &c. | Sentiment Analyzer | | | |

Batch Jobs
Name | Guidance | Example | Job 1 | ... | Job N
Description | What does this job do? | Take last week's raw data and bin by hour | | |
Frequency | How often does it run? | Nightly | | |
Input(s) | Where is data from? | S3 | | |
Output(s) | Where does data write to? | MySQL | | |
# of Records | Avg. # of records in each run | 1B | | |
Requires Persistent Data? | Do we have to store any data on the Hadoop cluster itself to perform the job? | No | | |
Type: Hadoop Development Method | What method should be used to perform Hadoop jobs? | Wukong, Hive, Java M/R, Pig | | |
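
The example job in the table, binning raw events by hour, is easy to prototype before choosing a Hadoop development method. The sketch below shows only the binning logic in plain Python; it is not a Wukong, Hive, or Pig job, and the `bin_by_hour` name and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def bin_by_hour(timestamps):
    """Count events per hour; each key is the ISO timestamp truncated to the hour."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        counts[hour.isoformat()] += 1
    return dict(sorted(counts.items()))


# Hypothetical raw event timestamps, for illustration only.
raw_events = [
    "2013-05-06T09:15:00",
    "2013-05-06T09:48:30",
    "2013-05-06T10:02:11",
]

print(bin_by_hour(raw_events))
```

Prototyping on a small sample like this helps confirm the expected output shape before the job is written against the full weekly input.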

Data Stores Examples

Type | ElasticSearch | HBase | MySQL | HDFS | S3/Glacier
Purpose | Support a customer-facing web app | Support an internal BI tool | Support a Tableau installation | Allow ad hoc queries in Hive | Archival/Disaster Recovery
Avg. Events / Sec | 50 | 200 | 10 | 200 | 200
Max. Sustained Events / Sec (10 min) | 200 | 500 | 50 | 500 | 500
Retention Policy | Keep all events from last 30 days | Keep all events from last 90 days | Keep all events from last 7 days | Keep all events from last 6 months | Keep all events from last 12 months. Last 3 months in Glacier
12-month Growth Rate | 3x | 3x | 3x | 3x | 3x
Query Profile (What kinds of queries/usage/volume should be supported?) | ~10 simple searches / sec. Fewer complex/faceted queries | ~20 key/value lookups / sec. Many short table scans to create time series | ~5 joins and time series queries / sec to support Tableau | Ad hoc queries running over large amounts of historical data | Occasional batch jobs from Hadoop

Additional Tools
Any Additional Tools?

Big Data Systems & Deployment Environment

Type: Public Cloud, Virtual Private Cloud, Private Cloud, Hybrid Cloud

Environment | ETL + Stream Processing | Databases | Hadoop
Production Nodes | | 4 (ElasticSearch) | 1 Elastic Cluster
Staging Nodes | | 2 (ElasticSearch) | n/a

SLA Requirements

Requirement | Example
System Uptime | 99.99%
Query Latency Response Time | <200ms
Data Retention | 99.99%
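
An uptime percentage is easier to discuss once it is converted into allowed downtime. The short Python sketch below does that conversion; the function name is hypothetical, and the 365-day year and 30-day month are simplifying assumptions.

```python
def allowed_downtime_minutes(uptime_percent, days=365):
    """Minutes of downtime permitted over `days` days at the given uptime percentage."""
    return (1 - uptime_percent / 100) * days * 24 * 60


per_year = allowed_downtime_minutes(99.99)
per_month = allowed_downtime_minutes(99.99, days=30)

print(f"{per_year:.2f} minutes per year")
print(f"{per_month:.2f} minutes per 30-day month")
```

At 99.99% the budget is under an hour per year, which is worth confirming against what each solution provider actually commits to.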

Solution Diagram

Production

Staging

Request a Free Demo
See Infochimps Cloud for Big Data. Infochimps Cloud is a suite of enterprise-ready cloud services that
make it simpler, faster, and far less complicated to develop and deploy Big Data applications in public,
virtual private and private clouds.
Our cloud services provide a comprehensive analytics platform, including data streaming, storage,
queries and administration. With Infochimps, you focus on the analytics that drive your business
insights, not building and managing a complex infrastructure.

Contact Us
Infochimps, Inc.
… W. …th St., Suite …
Austin, TX …

www.infochimps.com
Twitter: @infochimps
© Infochimps, Inc.
