The Data Analytics Handbook V.4
HANDBOOK
BIG DATA EDITION
ABOUT
THE AUTHORS
BRIAN LIOU, Content
Brian graduated from Cal with simultaneous degrees
in Business Administration at the Haas School of
Business and Statistics with an emphasis in Computer
Science. He previously worked in investment
TRISTAN TAO, Content
Tristan holds dual degrees in Computer Science and
Statistics from UC Berkeley ('14). He first began
working as a quantitative technical data analyst at
Starmine (Thomson Reuters). From there he worked
as a software engineer at Splunk. He has experience
working with various Machine Learning models, NLP,
Hadoop/Hive, Storm, R, Python and Java.
DECLAN SHENER, Content
Declan is in his third year at UC Berkeley, where he
studies Computer Science and Economics. He currently
interns with the San Francisco 49ers as a Data Analyst on
their Business Operations team.
HANDBOOK
DESIGN
ELIZABETH LIN, Design
Have you ever wondered what the deal was behind all the hype of big
data? Well, so did we. In 2014, data science hit peak popularity, and as
TOP 5 TAKEAWAYS
THE BIG DATA EDITION
1. The terms Data Scientist and Data Engineer are not synonymous,
but they are not mutually exclusive either.
There exists a hybrid position that requires both a software engineering
background as well as mastery of statistical analysis, but that is not the norm.
Instead, expertise in one of the two with a good understanding of the other
is important. As a data engineer, you will be delivering the data that a data
scientist uses, and vice versa. Knowing how your data is being used and
understanding your data are crucial to success in either of these positions.
4. Big Data can be used for more than solving business problems.
Don't think that Big Data can only be used to help a business make financial
decisions or target consumers better. Big Data will also be used to solve
industrial issues such as energy and food shortages. The reality is that there is
no industry that Big Data will not touch eventually.
TABLE OF CONTENTS
MICHAEL JORDAN
DISTINGUISHED EECS PROFESSOR
CHUL LEE
MYFITNESSPAL
JOHN AKRED
SILICON VALLEY DATA SCIENCE
MATT MCMANUS
DATAMEER
JOHN SCHUSTER
PLATFORA
TOM DAVENPORT
BABSON COLLEGE
MICHAEL JORDAN
DISTINGUISHED EECS PROFESSOR AT UC BERKELEY
MICHAEL is the Pehong Chen Distinguished Professor in the Department of
Electrical Engineering and Computer Science and the Department of Statistics
at the University of California, Berkeley. His research in recent years has
focused on Bayesian nonparametric analysis, probabilistic graphical models,
spectral methods, kernel machines and applications to problems in signal
processing, statistical genetics, computational biology, information retrieval
and natural language processing.
Where do you think the biggest talent shortfall is within the Big
Data industry?
You have to know something about statistics and you have to know something
about computer science. Out in industry, they find it hard to find such people.
been designed to include both computation and statistics. It's not the ideal
situation; universities need to change too, to blend the education of both.
Right now you have to take about half a dozen statistics classes and about half
a dozen computation courses of some kind on top of prerequisite courses for
those majors. That would be enough to get you a good job, where you could do
a good job and have a meaningful career. One way to do that might be to get a
master's: get a bachelor's degree in either computer science or statistics
and a master's in the other.
They're teaching you the toolbox without the understanding of the toolbox. I think
supplementing coursework by working with real-world datasets is productive.
If you've got a solid set of fundamentals, working on projects is a great way to
improve your skills.
CHUL LEE
HEAD OF DATA ENGINEERING AT MYFITNESSPAL
CHUL is currently the Head of Data Engineering and Science at
MyFitnessPal. He brings over twelve years of experience in data science,
specifically in designing, architecting, implementing, and measuring
large-scale data processing and business-intelligence applications.
At MyFitnessPal, Data Scientist roles and Data Engineering roles are well
defined since the emphasis for each role is somewhat different. However,
the overall trend in Silicon Valley is that everyone is part of the engineering
organization. Thus, as a data scientist, you need to have a basic understanding of
computer science, specifically data structures and algorithms. This is especially
true when you have to deal with large-scale data, since the computational aspect
becomes critical. You can always come up with a fancy algorithm,
but when you actually try to apply it in practice, making sure that your
algorithm scales is very important. You need to have a basic understanding
of the system or algorithm aspect of the problem. Even though you may not
end up implementing it yourself, you still need to be able to communicate
with other engineers. Similarly, when you're a data engineer, you need to have
a basic understanding of statistics, so that when you're talking to a data scientist
you can understand what they are getting across to you. This is for the sake of
communication and of working with a team of people with different
backgrounds, as opposed to a skillset that is absolutely necessary to get a data
science or data engineering job at MFP.
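To make the point about scaling concrete, here is a minimal sketch, not anything MyFitnessPal actually uses. It compares two ways of checking a list of records for duplicates. The function names are hypothetical and were chosen only for this illustration.

```python
def has_duplicate_quadratic(values):
    """Compare every pair of items: roughly n * (n - 1) / 2 comparisons."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicate_linear(values):
    """Remember items already seen in a set: one pass, about n lookups."""
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False


if __name__ == "__main__":
    records = list(range(2000)) + [7]  # hypothetical data with one repeat
    print(has_duplicate_quadratic(records), has_duplicate_linear(records))
```

Both functions give the same answer, but on a million records with no repeats the pairwise version needs on the order of 5 x 10^11 comparisons, while the set-based version needs about a million lookups. That gap is the kind of thing a data scientist should be able to discuss with engineers.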
over time.
JOHN AKRED
FOUNDER & CTO, SILICON VALLEY DATA SCIENCE
With over 15 years in advanced analytical applications and architecture, John
is dedicated to helping organizations become more data-driven. He combines
deep expertise in analytics and data science with business acumen and dynamic
engineering leadership.
across warehouses and retail stores. This involves heavy engineering, but also
data science since people put stuff in their shopping carts and inventory is
never where you think it is. As a result, you need to understand the probabilistic
aspect of the problem, such as the time to fulfill an order based on the
location of its destination.
So with us you might work on that for 3 months, but the next project might
be optimizing interventions for a healthcare company [what are the best
interventions to get a diabetic to adhere to the treatment regimen and regularly
test their blood sugar level, etc.]. These are very different problem spaces. Of course
they might leverage similar technology, but if you're curious about exploring
different problems (and want to be at the forefront of that exploration), you
get to see that variety with us. On the other hand, other people might instead
want to get on LinkedIn's famous "People You May Know" team. Certain
aspects of that problem were solved a long time ago, and I imagine that the
approach has remained relatively consistent. So of course you'll be working
on the long tail of small incremental improvements. Some people love getting
very, very deep into a problem like that.
We provide premium services to companies, and that means we have to lead
the market in terms of the technologies we work with and how we implement
them. For example, we currently work a lot with Spark. We've got Spark in
production for a major US retailer over the holidays [Fall/Winter of 2014],
and that's a pretty novel thing. Since we command a premium in the
marketplace due to our team and capabilities, learning these new technologies and
capabilities is existential for us. We dedicate a lot of time to training our team
and exploring that space.
In a large or more mature product organization, you might be 1 of the 20
people working on the Hadoop cluster (where you'll learn a lot), whereas with
us you'll be 1 of the 4 people who bring Hadoop to a new use case that has
never been done before. Of course, people learn differently and some people might
do better in the first scenario. The fun thing is that we get to interact with the
people doing similar things on those teams at product companies a lot, and we
have a lot of mutual respect for each other. It's just different strokes for different
folks, as they say.
MATT MCMANUS
VP OF ENGINEERING AT DATAMEER
MATT has been building enterprise software products for over 10 years with
deep experience in architecture, software engineering and team management
roles. He is currently the VP of Engineering at Datameer, an end-to-end big
data analytics application for Hadoop.
JOHN SCHUSTER
VP OF ENGINEERING AT PLATFORA
JOHN is currently the Vice President of Engineering at Platfora. He leads a
team of engineers in building a Hadoop-native Big Data Analytics platform for
organizations to use to analyze their data seamlessly.
as their development environment. We use Stash and Git for source control.
We have a whole bunch of internally written scripts and different pieces of
automation that build and test the product in a continuous integration sense.
TOM DAVENPORT
PROFESSOR AT BABSON COLLEGE AND AUTHOR
OF BIG DATA @ WORK
TOM is a world-renowned thought leader on business analytics and big data.
He is the President's Distinguished Professor of Information Technology and
Management at Babson College, and the co-founder and Director of Research
at the International Institute for Analytics.
It seems like in your book, you are very optimistic about Big
Data and the advancements it will bring to the world. There
is an interesting point that Michael Jordan, a professor at UC
Berkeley, brought up. He thinks that Big Data will return a lot
of false positives in the future. He likens it to having a billion
monkeys typing at once: one of them will write Shakespeare.
How serious is this concern for the future?
It's certainly true that if you are only looking at measures of statistical
significance, working with Big Data is going to generate a lot of false positives,
because by definition a certain percentage are going to be significant. Some
of those are going to make sense and some of them aren't. If all you're using
is Machine Learning to generate statistically significant relationships among
variables, that statement is true. I think it's one of the reasons why you need
to use a little judgement about whether or not that finding makes sense and
whether there is anything we can do with it. I think given the vast amount of data, we have
to use Machine Learning, but I generally advocate that it's still accompanied by
some human analyst who can help make sense of the outcomes.
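The "a certain percentage are going to be significant" effect can be sketched with a small simulation. This is an illustrative example only. It assumes pure-noise data, a 5% significance level, and a Fisher z-approximation for testing a correlation. The function names and sample sizes are hypothetical choices, not something from the interview.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_false_positives(n_features=1000, n_samples=100, z_crit=1.96, seed=0):
    """Count noise features that look 'significantly' correlated with a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_features):
        # Each feature is independent random noise, so any "finding" is spurious.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(feature, target)
        # Fisher z-transform: approximately standard normal when there is no true correlation.
        z = math.atanh(r) * math.sqrt(n_samples - 3)
        if abs(z) > z_crit:  # two-sided test at roughly the 5% level
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 pure-noise features pass the 5% significance test")
```

With a 5% threshold, roughly 5% of the noise features should pass the test even though none of them has any real relationship with the target. The more variables you screen, the more such spurious "findings" you collect, which is why a human analyst still needs to check whether a result makes sense.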
technical story, but chances are good it's going to be a non-technical audience.
If this is the case, you have to use language that resonates with that specific
audience. If you're in business, then it's mostly going to be a language of ROI,
conversion, lift, and things that people are familiar with in that context. You
have to spend a lot of time thinking of a clever way of communicating your idea.
Use all the tools of storytelling, like metaphors and analogies, and provide as
many examples as possible. Unfortunately, I don't think there is a lot of content
out there about how to communicate effectively about analytics. I heard this
week about an organization, a large academic medical center, whose Chief
Analytics Officer hired former journalists to do the communication. So you
could imagine some division of labor in which people who are really good
at communicating, like journalists, take some of the load off the analysts
themselves. Let's face it, it's hard to find people who are both talented
analysts and talented communicators.
LEARN INSIGHT IN DATA
http://www.teamleada.com/