CHAPTER 10

Bringing It All Together
THE PATH TO BIG DATA
More examples are readily available to prove that data can deliver value well beyond one's expectations. The key issues are the analysis performed and the goal sought. The previous examples only scratch the surface of what Big Data means to the masses. The essential point here is to understand the intrinsic value of Big Data analytics and extrapolate the value as it can be applied to other circumstances.
HANDS-ON BIG DATA
The analysis of Big Data involves multiple distinct phases, each of which introduces challenges. These phases include acquisition, extraction, aggregation, modeling, and interpretation. However, most people focus just on the modeling (analysis) phase.
Although that phase is crucial, it is of little use without the other phases of the data analysis process, which can create problems like false outcomes and uninterpretable results. The analysis is only as good as the data provided. The problem stems from the fact that there are poorly understood complexities in the context of multitenanted data clusters, especially when several analyses are being run concurrently.
Many significant challenges extend beyond and underneath the modeling phase. For example, Big Data has to be managed for context, which may include spurious information and can be heterogeneous in nature; this is further complicated by the lack of an upfront model. This means that data provenance must be accounted for and that methods must be created to handle uncertainty and error.
Perhaps the problems can be attributed to ignorance or, at the very least, a lack of consideration for primary topics that define the Big Data process yet are often afterthoughts. This means that questions and analytical processes must be planned and thought out in the context of the data provided. One has to determine what is wanted from the data and then ask the appropriate questions to get that information.

Accomplishing that will require smarter systems as well as better support for those making the queries, perhaps by empowering those users with natural language tools (rather than complex mathematical algorithms) to query the data. The key issue is the level of achievable artificial intelligence and how much that can be relied on.
Currently, Big Data does not arise from a vacuum (except, of course, when studying deep space). Basically, data are recorded from a data-generating source. Gathering data is akin to sensing and observing the world around us, from the heart rate of a hospital patient to the contents of an air sample to the number of Web page queries to scientific experiments that can easily produce petabytes of data.
However, much of the data collected is of little interest and can be filtered and compressed by many orders of magnitude, which creates a bigger challenge: the definition of filters that do not discard useful information. For example, suppose one data sensor reading differs substantially from the rest. Can that be attributed to a faulty sensor, or are the data real and worth inclusion?
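A minimal sketch of such a check, assuming readings arrive as a plain list of numbers, might compare each value against the rest of the sample and flag the ones that deviate sharply; the names and the threshold below are illustrative, not prescriptive:

```python
import statistics

def flag_suspect_readings(readings, z_threshold=3.0):
    """Flag readings that differ substantially from the rest of the sample.

    Each value is compared against the mean and standard deviation of the
    *other* readings, so a single extreme value cannot hide itself by
    inflating the statistics it is judged against.
    """
    flagged = []
    for i, value in enumerate(readings):
        others = readings[:i] + readings[i + 1:]
        mean = statistics.mean(others)
        stdev = statistics.stdev(others)
        if stdev > 0 and abs(value - mean) / stdev > z_threshold:
            flagged.append((i, value))
    return flagged

# One reading stands far apart from its neighbors; it is flagged for
# review, not discarded: it may be a faulty sensor or a real event.
print(flag_suspect_readings([21.1, 20.9, 21.0, 21.2, 58.4, 21.1]))
```

A flagged reading is only a candidate for review; deciding whether it reflects a faulty sensor or a genuine event still requires the contextual questions raised next.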
Further complicating the filtering process is how the sensors gather data. Are they based on time, transactions, or other variables? Are the sensors affected by environment or other activities? Are the sensors tied to spatial and temporal events such as traffic movement or rainfall?
Before the data are filtered, these considerations and others must be addressed. That may require new techniques and methodologies to process the raw data intelligently and deliver a data set in manageable chunks without throwing away the needle in the haystack. Further filtering complications come with real-time processing, in which the data are in motion and streaming on the fly, and one does not have the luxury of being able to store the data first and process them later for reduction.
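One possible shape for such on-the-fly reduction, sketched here with running statistics in place of a store-then-process pass (the window size and threshold are assumptions), is a generator that emits periodic summaries while preserving any reading that deviates sharply:

```python
def stream_reduce(readings, window=100, z_threshold=4.0):
    """Reduce a stream on the fly, without storing it first.

    Running totals stand in for a full store-then-process pass: every
    `window` values a summary is emitted, while any value that deviates
    sharply from the running mean is kept, so the needle in the haystack
    is not thrown away with the hay.
    """
    count, total, total_sq = 0, 0.0, 0.0
    for value in readings:
        count += 1
        total += value
        total_sq += value * value
        mean = total / count
        variance = max(total_sq / count - mean * mean, 0.0)
        std = variance ** 0.5
        if count > 10 and std > 0 and abs(value - mean) / std > z_threshold:
            yield ("keep", value)        # a possible needle
        if count % window == 0:
            yield ("summary", mean)      # periodic reduction

# The generator never holds the full stream in memory.
items = list(stream_reduce(x * 0.5 for x in range(1000)))
print(len(items))   # 10 periodic summaries; a smooth ramp flags nothing
```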
Another challenge comes in the form of automatically generating the right metadata to describe what data are recorded and how they are recorded and measured. For example, in scientific experiments, considerable detail on specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with observational data.
When implemented properly, automated metadata acquisition systems can minimize the need for manual processing, greatly reducing the human burden of recording metadata. Those who are gathering data also have to be concerned with the data provenance. Recording information about the data at their time of creation becomes important as the data move through the data analysis process. Accurate provenance can prevent processing errors from rendering the subsequent analysis useless. With suitable provenance, the subsequent processing steps can be quickly identified. Proving the accuracy of the data is accomplished by generating suitable metadata that also carry the provenance of the data through the data analysis process.
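The following sketch shows one hypothetical way to carry such metadata and provenance along with the data: each record notes its source and capture time, and every processing step appends its name and parameters to the trail (the field names and the conversion step are invented for illustration):

```python
import datetime

def with_provenance(value, source):
    """Wrap a raw value with metadata recorded at its time of creation."""
    return {
        "value": value,
        "source": source,
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "steps": [],   # each processing stage appends its name and parameters
    }

def apply_step(record, step_name, func, **params):
    """Run one processing step and note it, so later errors can be traced."""
    record["value"] = func(record["value"], **params)
    record["steps"].append({"step": step_name, "params": params})
    return record

record = with_provenance(1017, source="ward-3/monitor-12")
record = apply_step(record, "unit_conversion",
                    lambda v, factor: v * factor, factor=0.1)
print(record["steps"])   # the provenance trail travels with the data
```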
Another step in the process consists of extracting and cleaning the data. The information collected will frequently not be in a format ready for analysis. For example, consider electronic health records in a medical facility that consist of transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated anomalous data), and image data such as scans. Data in this form cannot be effectively analyzed. What is needed is an information extraction process that draws out the required information from the underlying sources and expresses it in a structured form suitable for analysis.
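As a toy illustration of such an extraction step, the sketch below pulls two structured fields out of a transcribed sentence using simple patterns; real clinical extraction is far more involved, and the patterns shown are assumptions:

```python
import re

def extract_vitals(dictation):
    """Pull a few structured fields out of free-text dictation."""
    record = {}
    bp = re.search(r"blood pressure (?:of )?(\d{2,3})/(\d{2,3})", dictation, re.I)
    if bp:
        record["systolic"] = int(bp.group(1))
        record["diastolic"] = int(bp.group(2))
    hr = re.search(r"heart rate (?:of )?(\d{2,3})", dictation, re.I)
    if hr:
        record["heart_rate"] = int(hr.group(1))
    return record

text = "Patient presents with a heart rate of 88 and blood pressure of 132/84."
print(extract_vitals(text))   # structured fields ready for analysis
```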
Accomplishing that correctly is an ongoing technical challenge, especially when the data include images (and, in the future, video). Such extraction is highly application dependent; the information in an MRI, for instance, is very different from what you would draw out of a surveillance photo. The ubiquity of surveillance cameras and the popularity of GPS-enabled mobile phones, cameras, and other portable devices mean that rich and high-fidelity location and trajectory (i.e., movement in space) data can also be extracted.
Another issue is the honesty of the data. For the most part, data are expected to be accurate, if not truthful. However, in some cases, those who are reporting the data may choose to hide or falsify information. For example, patients may choose to hide risky behavior, or potential borrowers filling out loan applications may inflate income or hide expenses. The list of ways in which data could be misinterpreted or misreported is endless. The act of cleaning data before analysis should include well-recognized constraints on valid data or well-understood error models, which may be lacking in Big Data platforms.
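A sketch of such constraint checking might look like the following, using invented field names and deliberately simple rules; the point is that violations are surfaced for review rather than silently dropped:

```python
def validate_application(application):
    """Check a submitted record against simple validity constraints.

    Violations are returned for review rather than silently dropped,
    so cleaning decisions remain visible to the analyst.
    """
    problems = []
    income = application.get("monthly_income", 0)
    expenses = application.get("monthly_expenses", 0)
    if income <= 0:
        problems.append("income missing or non-positive")
    if expenses < 0:
        problems.append("expenses cannot be negative")
    if income > 0 and expenses == 0:
        problems.append("zero expenses reported: possible under-reporting")
    return problems

print(validate_application({"monthly_income": 9500, "monthly_expenses": 0}))
```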
Moving data through the process requires concentration on integration, aggregation, and representation of the data, all of which are process-oriented steps that address the heterogeneity of the flood of data. Here the challenge is to record the data and then place them into some type of repository.
Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are machine readable and then computer resolvable. It may take a significant amount of work to achieve automated error-free difference resolution.
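One common way to make such differences machine readable is an explicit field mapping that normalizes records from different systems into a single shape; the systems and field names below are hypothetical:

```python
# A machine-readable statement of how two systems' field names differ,
# so records can be resolved into one shape without manual intervention.
FIELD_MAP = {
    "system_a": {"pat_id": "patient_id", "hr_bpm": "heart_rate"},
    "system_b": {"PatientID": "patient_id", "HeartRate": "heart_rate"},
}

def normalize(record, system):
    """Rename a record's fields into the common vocabulary."""
    mapping = FIELD_MAP[system]
    return {mapping.get(key, key): value for key, value in record.items()}

print(normalize({"pat_id": 42, "hr_bpm": 77}, "system_a"))
print(normalize({"PatientID": 42, "HeartRate": 77}, "system_b"))
```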
The data preparation challenge even extends to analysis that uses only a single data set. Here there is still the issue of suitable database design, further complicated by the many alternative ways in which to store the information. Particular database designs may have certain advantages over others for analytical purposes. A case in point is the variety in the structure of bioinformatics databases, in which information on substantially similar entities, such as genes, is inherently different but is represented with the same data elements.
Examples like these clearly indicate that database design is an artistic endeavor that has to be carefully executed in the enterprise context by professionals. When creating effective database designs, professionals such as data scientists must have the tools to assist them in the design process, and more important, they must develop techniques so that databases can be used effectively in the absence of intelligent database design.
As the data move through the process, the next step is querying the data and then modeling them for analysis. Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis. Big Data is often noisy, dynamic, heterogeneous, interrelated, and untrustworthy, a very different informational source from the small data sets used for traditional statistical analysis.
Even so, noisy Big Data can be more valuable than tiny samples because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. In addition, interconnected Big Data creates large heterogeneous information networks with which information redundancy can be explored to compensate for missing data, cross-check conflicting cases, and validate trustworthy relationships. Interconnected Big Data resources can disclose inherent clusters and uncover hidden relationships and models.
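The sketch below shows that redundancy idea in miniature: records about the same entity from several sources are merged, gaps are filled from whichever source has a value, and disagreements are flagged for cross-checking (the fields and values are invented):

```python
def reconcile(records):
    """Merge records describing the same entity from several sources.

    Redundancy fills gaps (any source that has a value supplies it),
    and disagreements are collected so they can be cross-checked.
    """
    merged, conflicts = {}, {}
    for record in records:
        for key, value in record.items():
            if value is None:
                continue
            if key in merged and merged[key] != value:
                conflicts.setdefault(key, {merged[key]}).add(value)
            else:
                merged.setdefault(key, value)
    return merged, conflicts

sources = [
    {"patient_id": 42, "blood_type": "A+", "weight_kg": None},
    {"patient_id": 42, "blood_type": "A+", "weight_kg": 81},
    {"patient_id": 42, "blood_type": "O-", "weight_kg": 81},
]
print(reconcile(sources))   # merged record plus the conflicting blood types
```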
Mining the data therefore requires integrated, cleaned, trustworthy, and efficiently accessible data, backed by declarative query and mining interfaces that feature scalable mining algorithms. All of this relies on Big Data computing environments that are able to handle the load. Furthermore, data mining can be used concurrently to improve the quality and trustworthiness of the data, expose the semantics behind the data, and provide intelligent querying functions.
Virulent examples of introduced data errors can be readily found in the health care industry. As noted previously, it is not uncommon for real-world medical records to have errors. Further complicating the situation is the fact that medical records are heterogeneous and are usually distributed in multiple systems. The result is a complex analytics environment that lacks any type of standard nomenclature to define its respective elements.
The value of Big Data analysis can be realized only if it can be applied robustly under those challenging conditions. However, the knowledge developed from that data can be used to correct errors and remove ambiguity. An example of the use of that corrective analysis is when a physician writes DVT as the diagnosis for a patient. This abbreviation is commonly used for both deep vein thrombosis and diverticulitis, two very different medical conditions. A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant.
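A toy version of that disambiguation might score each candidate expansion by how many of its associated terms appear in the patient's record; the knowledge base entries here are illustrative only, not clinical guidance:

```python
# Each expansion of "DVT" is linked to terms that tend to co-occur with it.
KNOWLEDGE_BASE = {
    "deep vein thrombosis": {"leg swelling", "leg pain", "warfarin", "heparin"},
    "diverticulitis": {"abdominal pain", "fever", "antibiotics"},
}

def expand_abbreviation(context_terms):
    """Choose the expansion whose associated terms best match the record."""
    scores = {
        meaning: len(associated & context_terms)
        for meaning, associated in KNOWLEDGE_BASE.items()
    }
    return max(scores, key=scores.get)

record_terms = {"leg swelling", "heparin"}
print(expand_abbreviation(record_terms))   # deep vein thrombosis
```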
It is easy to see how Big Data can enable the next generation of interactive data analysis, which, by using automation, can deliver and show the results. Today's analysts need to present results in powerful visualizations that assist interpretation and support user collaboration.
These visualizations should be based on interactive sources that allow the users to click and redefine the presented elements, creating a constructive environment where theories can be played out and other hidden elements can be brought forward. Ideally, the interface will allow visualizations to be affected by what-if scenarios or filtered by other related information, such as date ranges, geographical locations, or statistical queries.
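In data terms, a what-if filter of this kind reduces to re-selecting the underlying records and letting the visualization redraw from the subset, as in this small sketch with invented event data:

```python
import datetime

def filter_by_date(events, start, end):
    """Re-select the underlying records when the user redefines a date range."""
    return [event for event in events if start <= event["date"] <= end]

events = [
    {"date": datetime.date(2012, 1, 5), "visits": 120},
    {"date": datetime.date(2012, 2, 9), "visits": 340},
    {"date": datetime.date(2012, 3, 2), "visits": 95},
]
subset = filter_by_date(events, datetime.date(2012, 1, 1), datetime.date(2012, 2, 28))
print(sum(event["visits"] for event in subset))   # the total the chart would redraw
```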
Furthermore, with a few clicks the user should be able to go deeper into each piece of data and understand its provenance, which is key to understanding the data. Users need to be able not only to see the results but also to understand why they are seeing those results.
Raw provenance, particularly regarding the phases in the analytics process, is likely to be too technical for many users to grasp completely. One alternative is to enable the users to play with the steps in the analysis: make small changes to the process, for example, or modify values for some parameters. The users can then view the results of these incremental changes. By these means, the users can develop an intuitive feeling for the analysis and also verify that it performs as expected in corner cases, those that occur outside normal circumstances. Accomplishing this requires the system to provide convenient facilities for the user to specify analyses.
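A minimal sketch of such a facility is an analysis wrapped in a function of its parameters, so a user can nudge one value and immediately compare the outcomes; the analysis shown is a stand-in, not a real workload:

```python
def run_analysis(values, threshold):
    """A stand-in analysis whose outcome depends on a single parameter."""
    return [v for v in values if v >= threshold]

values = [3, 7, 12, 18, 25, 31]

# Let the user nudge the parameter and watch how the result shifts,
# building intuition about the analysis and exposing corner cases.
for threshold in (10, 15, 20):
    print(threshold, run_analysis(values, threshold))
```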
BIG DATA PRIVACY