Data Science Interview Question
As I mentioned in my first post, I have just finished an extensive tech job search, which
featured eight on-sites, along with countless phone screens and informal chats. I was
interviewing for a combination of data science and software engineering (machine learning)
positions, and I got a pretty good sense of what those interviews are like. In this post, I give an
overview of what you should expect in a data science interview, and some suggestions for how to
prepare.
An interview is not a pop quiz. You should know what to expect going in, and you can take the
time to prepare for it. During the interview phase of the process, your recruiter is on your side
and can usually tell you what types of interviews you'll have. Even if the recruiter is reluctant to
share that, common practices in the industry are a good guide to what you're likely to see.
In this post, I'll go over the types of data science interviews I've encountered, and offer
my advice on how to prepare for them. Data science roles generally fall into two broad areas of
focus: statistics and machine learning. I only applied to the latter category, so that's the type of
position discussed in this post. My experience is also limited to tech companies, so I can't offer
guidance for data science in finance, biotech, etc.
Here are the types of interviews (or parts of interviews) I've come across.
Always:
Your background
Often:
Culture fit
Dataset analysis
Stats
You will encounter a similar set of interviews for a machine learning software engineering
position, though more of the questions will fall in the coding category.
Coding
This is the same type of interview you'd have for any software engineering position, though the
expectations may be less stringent. There are lots of websites and books that will tell you how to
prepare. Practice your coding skills if they're rusty. Don't forget to practice coding away from
the computer (e.g. on paper), which is surely a skill that's rusty. Review the data structures you
may never have used outside of school: binary search trees, linked lists, heaps. Be comfortable
with recursion. Know how to reason about algorithm running times. You can generally use any
real language you want in an interview (Matlab doesn't count, unfortunately); Python's
succinct syntax makes it a great language for coding interviews.
Prep tips:
If you get nervous in interviews, try doing some practice problems under time
pressure.
If you don't have much software engineering experience, see if you can get a
friend to look over your practice code and provide feedback.
Make sure you understand exactly what problem you're trying to solve. Ask
the interviewer questions if anything is unclear or underspecified.
Make sure you explain your plan to the interviewer before you start writing
any code, so that they can help you avoid spending time going down less-than-ideal paths.
Mention what invalid inputs you'd want to check for (e.g. input variable type
check). Don't bother writing the code to do so unless the interviewer asks. In
all my interviews, nobody has ever asked.
Before declaring that your code is finished, think about variable initialization,
end conditions, and boundary cases (e.g. empty inputs). If it seems helpful,
run through an example. You'll score points by catching your bugs yourself,
rather than having the interviewer point them out.
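To make these tips concrete, here is the kind of self-contained exercise they apply to. The problem choice (reversing a linked list) is my own illustration, not a question from any particular company; note how the empty-list and single-node boundary cases fall out of the same loop with no special-casing.

```python
class Node:
    """Minimal singly linked list node for whiteboard practice."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse_list(head):
    """Reverse a singly linked list iteratively.

    Boundary cases worth stating out loud in an interview:
    an empty list (head is None) and a single-node list both
    work with no extra code.
    """
    prev = None
    while head is not None:
        # Tuple assignment: evaluate the right side first, then rebind.
        head.next, prev, head = prev, head, head.next
    return prev

def to_pylist(head):
    """Helper for checking your work when practicing at a keyboard."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```

Walking through an example like `1 -> 2 -> 3` out loud, and catching your own off-by-one or dropped-pointer bugs, is exactly the habit the tips above are about.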
Applied machine learning
All the applied machine learning interviews I've had focused on supervised learning. The
interviewer will present you with a prediction problem, and ask you to explain how you would
set up an algorithm to make that prediction. The problem selected is often relevant to the
company you're interviewing at (e.g. figuring out which product to recommend to a user, which
users are going to stop using the site, which ad to display, etc.), but can also be a toy example
(e.g. recommending board games to a friend). This type of interview doesn't depend on much
background knowledge, other than having a general understanding of machine learning
concepts (see below). However, it definitely helps to prepare by brainstorming the types of
problems a particular company might ask you to solve. Even if you miss the mark, the
brainstorming session will help with the culture fit interview (also see below).
When answering this type of question, I've found it helpful to start by laying out the setup of the
problem. What are the inputs? What are the labels you're trying to predict? What machine
learning algorithms could you run on the data? Sometimes the setup will be obvious from the
question, but sometimes you'll need to figure out how to define the problem. In the latter case,
you'll generally have a discussion with the interviewer about some plausible definitions (e.g.,
what does it mean for a user to stop using the site?).
The main component of your answer will be feature engineering. There is nothing magical about
brainstorming features. Think about what might be predictive of the variable you are trying to
predict, and what information you would actually have available. I've found it helpful to give
context around what I'm trying to capture, and to what extent the features I'm proposing reflect
that information.
For the sake of concreteness, here's an example. Suppose Amazon is trying to figure out what
books to recommend to you. (Note: I did not interview at Amazon, and have no idea what they
actually ask in their interviews.) To predict what books you're likely to buy, Amazon can look for
books that are similar to your past Amazon purchases. But maybe some purchases were mistakes,
and you vowed to never buy a book like that again. Well, Amazon knows how you've interacted
with your Kindle books. If there's a book you started but never finished, it might be a positive
signal for general areas you're interested in, but a negative signal for the particular author. Or
maybe some categories of books deserve different treatment. For example, if a year ago you were
buying books targeted at one-year-olds, Amazon could deduce that nowadays you're looking for
books for two-year-olds. It's easy to see how you can spend a while exploring the space between
what you'd like to know and what you can actually find out.
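In code, the brainstorming above boils down to turning raw history into a feature dictionary. Here is a minimal sketch of that for the hypothetical book example; every field name (`purchases`, `pct_read`, `target_age`, etc.) is an invented stand-in for data a retailer might plausibly have, not anything Amazon actually exposes.

```python
def book_features(user_history, candidate_book):
    """Sketch of features for a hypothetical book-recommendation task.

    Both arguments are plain dicts; all field names are hypothetical.
    """
    past = user_history["purchases"]  # list of dicts describing past buys
    same_author = sum(b["author"] == candidate_book["author"] for b in past)
    same_category = sum(b["category"] == candidate_book["category"] for b in past)
    # Reading behavior: a started-but-abandoned book is a positive signal
    # for the general area but a negative one for the particular author.
    abandoned = [b for b in past if 0 < b.get("pct_read", 1.0) < 0.5]
    abandoned_same_author = sum(
        b["author"] == candidate_book["author"] for b in abandoned
    )
    return {
        "n_same_author": same_author,
        "n_same_category": same_category,
        "n_abandoned_same_author": abandoned_same_author,
        # Captures the "one-year-olds last year, two-year-olds now" idea.
        "target_age_gap": candidate_book.get("target_age", 0)
                          - user_history.get("last_target_age", 0),
    }
```

The point of the exercise is the mapping from "what I'd like to know" to concrete, computable columns, not any particular choice of features.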
Your background
You should be prepared to give a high-level summary of your career, as well as to do a deep-dive
into a project you've worked on. The project doesn't have to be directly related to the position
you're interviewing for (though it can't hurt), but it needs to be the kind of work you can have an
in-depth technical discussion about.
To prepare:
Practice explaining your project to a friend in order to make sure you are
telling a coherent story. Keep in mind that you'll probably be talking to
someone who's smart but doesn't have expertise in your particular field.
Be prepared to answer questions as to why you chose the approach that you
did, and about your individual contribution to the project.
Culture fit
Here are some culture fit questions your interviewers are likely to be interested in. These
questions might come up as part of other interviews, and will likely be asked indirectly. It helps
to keep what the interviewer is looking for in the back of your mind.
Will you work well with other people? I know it's a cliché, but most work
is collaborative, and companies are trying to assess this as best they can.
Avoid bad-mouthing former colleagues, and show appreciation for their
contributions to your projects.
Are you willing to get your hands dirty? If there's annoying work that
needs to be done (e.g. cleaning up messy data), will you take care of it?
You may also get broad questions about what kinds of work you enjoy and what motivates you.
It's useful to have an answer ready, but there may not be a "right answer" the interviewer is
looking for.
Machine learning theory
This type of interview will test your understanding of basic machine learning concepts, generally
with a focus on supervised learning. You should understand:
Why you want to split data into training and test sets
The idea that models that aren't powerful enough can't capture the right
generalizations about the data, and ways to address this (e.g. different model
or projection into a higher-dimensional space)
The idea that models that are too powerful suffer from overfitting, and ways
to address this (e.g. regularization)
You don't need to know a lot of machine learning algorithms, but you definitely need to
understand logistic regression, which seems to be what most companies are using. I also had
some in-depth discussions of SVMs, but that may just be because I brought them up.
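The concepts above fit in a few dozen lines. Here is a from-scratch sketch of logistic regression with a train/test split and an L2 penalty; a real team would reach for a library like scikit-learn, but writing it out makes the regularization knob and the reason for the held-out set concrete. The toy dataset is my own invention.

```python
import math
import random

def train_logistic(xs, ys, l2=0.1, lr=0.1, epochs=200):
    """Tiny SGD logistic regression with L2 regularization.

    xs: list of feature lists; ys: list of 0/1 labels. The l2 term
    shrinks the weights each update, which is one standard answer to
    "what do you do about an overly powerful model that overfits".
    """
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - y
            w = [wi - lr * (err * xi + l2 * wi) for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Toy 1-D problem: label is 1 when the feature is positive.
random.seed(0)
data = [([x], 1 if x > 0 else 0)
        for x in (random.uniform(-2, 2) for _ in range(100))]
# The train/test split is what tells you whether the model generalizes
# rather than memorizes.
train, test = data[:80], data[80:]
w, b = train_logistic([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(w, b, x) == y for x, y in test) / len(test)
```

If you crank the model's capacity up (say, high-degree polynomial features) and drop `l2` to zero, training accuracy rises while test accuracy falls: that gap is overfitting, and the held-out set is how you see it.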
Dataset analysis
In this type of interview, you will be given a data set, and asked to write a script to pull out
features for some prediction task. You may be asked to then plug the features into a machine
learning algorithm. This interview essentially adds an implementation component to the applied
machine learning interview (see above). Of course, your features may now be inspired by what
you see in the data. Do the distributions for each feature you're considering differ between the
labels you're trying to predict?
I found these interviews hardest to prepare for, because the recruiter often wouldn't tell me what
format the data would be in, and what exactly I'd need to do with it. (For example, do I need to
review Python's csv import module? Should I look over the syntax for training a model in
scikit-learn?) I also had one recruiter tell me I'd be analyzing "big data," which was a bit intimidating
(am I going to be working with distributed databases or something?) until I discovered at the
interview that the "big data" set had all of 11,000 examples. I encourage you to push for as much
info as possible about what you'll actually be doing.
If you plan to use Python, working through the scikit-learn tutorial is a good way to prepare.
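As a warm-up for the distribution question above, here is a sketch using only the standard library's csv module: group a numeric feature by label and compare the group means. The column names (`churned`, `weekly_visits`) are hypothetical stand-ins for whatever the real file provides.

```python
import csv
import io
import statistics

def mean_by_label(csv_text, feature, label):
    """Mean of a numeric feature within each label group.

    A large gap between the group means is a quick hint that the
    feature is worth feeding to a model.
    """
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[label], []).append(float(row[feature]))
    return {lbl: statistics.mean(vals) for lbl, vals in groups.items()}

# Tiny made-up dataset: do churned users visit less often?
sample = """churned,weekly_visits
0,9.0
0,7.0
1,2.0
1,1.0
"""
means = mean_by_label(sample, "weekly_visits", "churned")
```

In a real interview you would read the file with `open(...)` instead of `io.StringIO`, but the shape of the script is the same.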
Stats
I have a decent intuitive understanding of statistics, but very little formal knowledge. Most of the
time, this sufficed, though I'm sure knowing more wouldn't have hurt. You should understand
how to set up an A/B test, including random sampling, confounding variables, summary statistics
(e.g. mean), and measuring statistical significance.
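The significance-testing piece, for the common case of comparing two conversion rates, can be sketched with nothing but the standard library. This is the usual pooled two-proportion z-test with a normal approximation; the numbers in the usage below are invented.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two conversion
    rates: the standard back-of-the-envelope A/B test readout."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Variant A converted 100/1000, variant B converted 150/1000.
z, p = two_proportion_z(100, 1000, 150, 1000)
```

With these made-up numbers the p-value comes out well under 0.05, so the difference would be called statistically significant; with identical rates the p-value is 1.0.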
Preparation Checklist & Resources
Here is a summary list of tips for preparing for data science interviews, along with a few helpful
resources.
1. Coding (usually whiteboard)
Get comfortable with basic algorithms, data structures, and figuring out
algorithm complexity.
2. Applied machine learning
Think about the machine learning problems that are relevant for each
company you're interviewing at. Use these problems as practice
questions.
3. Your background
Practice giving a high-level summary of your career, and explaining a
project you've worked on in depth.
4. Culture fit
Think about the problems each company is trying to solve, and how
you and the team you'd be part of could make a difference.
5. Machine learning theory
Review basic supervised learning concepts: train/test splits, underfitting
and overfitting, and logistic regression.
6. Dataset analysis
Get comfortable with a set of technical tools for working with data.
7. Stats
Resources: a sample size calculator, which you can use to get some intuition
about sample sizes required based on the sensitivity (i.e.
minimal detectable effect) and statistical significance you're
looking for
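If you want to see what such a calculator does under the hood, the standard normal-approximation formula fits in a few lines. This is a rough sketch (real calculators differ in small details like pooled variance); the baseline rate and effect in the usage are invented.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, mde, alpha=0.05, power=0.8):
    """Rough per-arm sample size for an A/B test on a conversion rate.

    p_baseline: current rate; mde: minimal detectable effect (absolute
    difference you want to be able to detect).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_variant = p_baseline + mde
    var = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_power) ** 2 * var / mde ** 2)

# Detecting a lift from 10% to 12% needs a few thousand users per arm.
n = samples_per_arm(0.10, 0.02)
```

The useful intuition: halving the minimal detectable effect roughly quadruples the required sample size, which is why tiny expected lifts need huge experiments.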
I have just finished a more extensive tech job search than anyone should really do. It
featured eight on-sites, along with countless phone screens and informal chats. There were a few
reasons why I ended up doing things this way: (a) I quit my job when my husband and I moved
from Boston to San Francisco a few months ago, so I had the time; (b) I wasn't sure what I was
looking for: big company vs. small, data scientist vs. software engineer on a machine learning
system, etc.; (c) I wasn't sure how well it would all go.
This way of doing a job search turned out to be an awesome learning experience. In this series
of posts, I've tried to jot down some thoughts on what makes for a good interview process, both
for the company and for the candidate. I was interviewing for a combination of data science and
software engineering positions, but many observations should be more broadly applicable.
Doing this well has an obvious benefit when the candidate is qualified: they'll be more likely to
take the offer. But it also has some less obvious benefits that apply to all candidates:
The candidate will be more likely to refer friends to your company. I heard
about a candidate who was rejected but went on to recommend two friends
who ended up joining the company.
The candidate will be more positive when discussing your company with their
friends. It's a small world.
Even if you don't want to hire the candidate right now, you might want to hire
them in a year.
There is intrinsic merit in being nice to people as they're going through what
is often a stressful experience.
Feel good doing it: Make sure the interviewers have a positive interview
experience.
As someone on the other side of the fence, this one is harder for me to reason about. But here are
some thoughts on why this is important:
If the interviewer is grumpy, the candidate will be less likely to think well of
the company (see above). One of the companies I interviewed at requires
interviewers to submit detailed written feedback, which resulted in them
dedicating much of their attention to typing up my whiteboard code during
the interview. More than one interviewer expressed their frustration with the
process. Even if they were pretty happy with their job most of the time, it
certainly didn't come across that way.
In the next post, I'll take a look at some job postings. Do you have thoughts on
other goals companies should strive for? Please comment!
Get that job at Google
I've been meaning to write up some tips on interviewing at Google for a good long
time now. I keep putting it off, though, because it's going to make you mad.
Probably. For some statistical definition of "you", it's very likely to upset you.
Why? Because... well, here, I wrote a little ditty about it:
Hey man, I don't know that stuff
Stevey's talking aboooooout
If my boss thinks it's important
I'm gonna get fiiiiiiiiiired
Oooh yeah baaaby baaaay-beeeeee....
I didn't realize this was such a typical reaction back when I first started writing
about interviewing, way back at other companies. Boy-o-howdy did I find out in a
hurry.
See, it goes like this:
Me: blah blah blah, I like asking question X in interviews, blah blah blah...
You: Question X? Oh man, I haven't heard about X since college! I've never needed
it for my job! He asks that in interviews? But that means someone out there thinks
it's important to know, and, and... I don't know it! If they detect my ignorance, I won't get
those positions.
These tips are actually generic; there's nothing specific to Google vs. any other
software company. I could have been writing these tips about my first software job
20 years ago. That implies that these tips are also timeless, at least for the span of
our careers.
These tips obviously won't get you a job on their own. My hope is that by following
them you will perform your very best during the interviews.
Oh, and um, why Google?
Oho! Why Google, you ask? Well let's just have that dialog right up front, shall we?
You: Should I work at Google? Is it all they say it is, and more? Will I be serenely
happy there? Should I apply immediately?
Me: Yes.
You: To which ques... wait, what do you mean by "Yes?" I didn't even say who I am!
Me: Dude, the answer is Yes. (You may be a woman, but I'm still calling you Dude.)
You: But... but... I am paralyzed by inertia! And I feel a certain comfort level at my
current company, or at least I have become relatively inured to the discomfort. I
know people here and nobody at Google! I would have to learn Google's build
system and technology and stuff! I have no credibility, no reputation there; I would
have to start over virtually from scratch! I waited too long, there's no upside! I'm
afraaaaaaid!
Me: DUDE. The answer is Yes already, OK? It's an invariant. Everyone else who
came to Google was in the exact same position as you are, modulo a handful of
famous people with beards that put Gandalf's to shame, but they're a very tiny
minority. Everyone who applied had the same reasons for not applying as you do.
And everyone here says: "GOSH, I SURE AM HAPPY I CAME HERE!" So just apply
already. But prep first.
You: But what if I get a mistrial? I might be smart and qualified, but for some
random reason I may do poorly in the interviews and not get an offer! That would be
a huge blow to my ego! I would rather pass up the opportunity altogether than have
a chance of failure!
Me: Yeah, that's at least partly true. Heck, I kinda didn't make it in on my first
attempt, but I begged like a street dog until they gave me a second round of
interviews. I caught them in a weak moment. And the second time around, I
prepared, and did much better.
The thing is, Google has a well-known false negative rate, which means we
sometimes turn away qualified people, because that's considered better than
sometimes hiring unqualified people. This is actually an industry-wide thing, but the
dial gets turned differently at different companies. At Google the false-negative rate
is pretty high. I don't know what it is, but I do know a lot of smart, qualified people
who've not made it through our interviews. It's a bummer.
But the really important takeaway is this: if you don't get an offer, you may still be
qualified to work here. So it needn't be a blow to your ego at all!
As far as anyone I know can tell, false negatives are completely random, and are
unrelated to your skills or qualifications. They can happen from a variety of factors,
including but not limited to:
1. you're having an off day
2. one or more of your interviewers is having an off day
3. there were communication issues invisible to you and/or one or more of the
interviewers
4. you got unlucky and got an Interview Anti-Loop
Oh no, not the Interview Anti-Loop!
Yes, I'm afraid you have to worry about this.
What is it, you ask? Well, back when I was at Amazon, we did (and they undoubtedly
still do) a LOT of soul-searching about this exact problem. We eventually concluded
that every single employee E at Amazon has at least one "Interview Anti-Loop": a
set of other employees S who would not hire E. The root cause is important for you
to understand when you're going into interviews, so I'll tell you a little about what
I've found over the years.
First, you can't tell interviewers what's important. Not at any company. Not unless
they're specifically asking you for advice. You have a very narrow window of perhaps
one year after an engineer graduates from college to inculcate them in the art of
interviewing, after which the window closes and they believe they are a "good
interviewer" and they don't need to change their questions, their question styles,
their interviewing style, or their feedback style, ever again.
It's a problem. But I've had my hand bitten enough times that I just don't try
anymore.
Second problem: every "experienced" interviewer has a set of pet subjects and
possibly specific questions that he or she feels is an accurate gauge of a candidate's
abilities. The question sets for any two interviewers can be widely different and
even entirely non-overlapping.
A classic example found everywhere is: Interviewer A always asks about C++ trivia,
filesystems, network protocols and discrete math. Interviewer B always asks about
Java trivia, design patterns, unit testing, web frameworks, and software project
management. For any given candidate with both A and B on the interview loop, A
and B are likely to give very different votes. A and B would probably not even hire
each other, given a chance, but they both happened to go through interviewer C,
who asked them both about data structures, unix utilities, and processes versus
threads, and A and B both happened to squeak by.
That's almost always what happens when you get an offer from a tech company. You
just happened to squeak by. Because of the inherently flawed nature of the
interviewing process, it's highly likely that someone on the loop will be unimpressed
with you, even if you are Alan Turing. Especially if you're Alan Turing, in fact, since it
means you obviously don't know C++.
The bottom line is, if you go to an interview at any software company, you should
plan for the contingency that you might get genuinely unlucky, and wind up with
one or more people from your Interview Anti-Loop on your interview loop. If this
happens, you will struggle, then be told that you were not a fit at this time, and then
you will feel bad. Just as long as you don't feel meta-bad, everything is OK. You
should feel good that you feel bad after this happens, because hey, it means you're
human.
And then you should wait 6-12 months and re-apply. That's pretty much the best
solution we (or anyone else I know of) could come up with for the false-negative
problem. We wipe the slate clean and start over again. There are lots of people here
who got in on their second or third attempt, and they're kicking butt.
You can too.
OK, I feel better about potentially not getting hired
Good! So let's get on to those tips, then.
If you've been following along very closely, you'll have realized that I'm interviewer
D. Meaning that my personal set of pet questions and topics is just my own, and it's
no better or worse than anyone else's. So I can't tell you what it is, no matter how
much I'd like to, because I'll offend interviewers A through X who have slightly
different working sets.
Instead, I want to prep you for some general topics that I believe are shared by the
majority of tech interviewers at Google-like companies. Roughly speaking, this
means the company builds a lot of their own software and does a lot of distributed
computing. There are other tech-company footprints, the opposite end of the
spectrum being companies that outsource everything to consultants and try to use
as much third-party software as possible. My tips will be useful only to the extent
that the company resembles Google.
So you might as well make it Google, eh?
First, let's talk about non-technical prep.
The Warm-Up
Nobody goes into a boxing match cold. Lesson: you should bring your boxing gloves
to the interview. No, wait, sorry, I mean: warm up beforehand!
How do you warm up? Basically there is short-term and long-term warming up, and
you should do both.
Long-term warming up means: study and practice for a week or two before the
interview. You want your mind to be in the general "mode" of problem solving on
whiteboards. If you can do it on a whiteboard, every other medium (laptop, shared
network document, whatever) is a cakewalk. So plan for the whiteboard.
Short-term warming up means: get lots of rest the night before, and then do
intense, fast-paced warm-ups the morning of the interview.
The two best long-term warm-ups I know of are:
1) Study a data-structures and algorithms book. Why? Because it is the most
likely to help you beef up on problem identification. Many interviewers are happy
when you understand the broad class of question they're asking without
explanation. For instance, if they ask you about coloring U.S. states in different
colors, you get major bonus points if you recognize it as a graph-coloring problem,
even if you don't actually remember exactly how graph-coloring works.
And if you do remember how it works, then you can probably whip through the
answer pretty quickly. So your best bet, interview-prep wise, is to practice the art of
recognizing that certain problem classes are best solved with certain algorithms and
data structures.
My absolute favorite for this kind of interview preparation is Steven Skiena's The
Algorithm Design Manual. More than any other book it helped me understand just
how astonishingly commonplace (and important) graph problems are; they should
be part of every working programmer's toolkit. The book also covers basic data
structures and sorting algorithms, which is a nice bonus. But the gold mine is the
second half of the book, which is a sort of encyclopedia of 1-pagers on zillions of
useful problems and various ways to solve them, without too much detail. Almost
every 1-pager has a simple picture, making it easy to remember. This is a great way
to learn how to identify hundreds of problem types.
Other interviewers I know recommend Introduction to Algorithms. It's a true classic
and an invaluable resource, but it will probably take you more than 2 weeks to get
through it. But if you want to come into your interviews prepped, then consider
deferring your application until you've made your way through that book.
2) Have a friend interview you. The friend should ask you a random interview
question, and you should go write it on the board. You should keep going until it is
complete, no matter how tired or lazy you feel. Do this as much as you can possibly
tolerate.
I didn't do these two types of preparation before my first Google interview, and I
was absolutely shocked at how bad at whiteboard coding I had become since I had
last interviewed seven years prior. It's hard! And I also had forgotten a bunch of
algorithms and data structures that I used to know, or at least had heard of.
Going through these exercises for a week prepped me mightily for my second round
of Google interviews, and I did way, way better. It made all the difference.
As for short-term preparation, all you can really do is make sure you are as alert and
warmed up as possible. Don't go in cold. Solve a few problems and read through
your study books. Drink some coffee: it actually helps you think faster, believe it or
not. Make sure you spend at least an hour practicing immediately before you walk
into the interview. Treat it like a sports game or a music recital, or heck, an exam: if
you go in warmed up you'll give your best performance.
Mental Prep
So! You're a hotshot programmer with a long list of accomplishments. Time to forget
about all that and focus on interview survival.
You should go in humble, open-minded, and focused.
If you come across as arrogant, then people will question whether they want to work
with you. The best way to appear arrogant is to question the validity of the
interviewer's question; it really ticks them off, as I pointed out earlier on.
Remember how I said you can't tell an interviewer how to interview? Well, that's
especially true if you're a candidate.
So don't ask: "gosh, are algorithms really all that important? do you ever need to do
that kind of thing in real life? I've never had to do that kind of stuff." You'll just get
rejected, so don't say that kind of thing. Treat every question as legitimate, even if
you are frustrated that you don't know the answer.
Feel free to ask for help or hints if you're stuck. Some interviewers take points off for
that, but occasionally it will get you past some hurdle and give you a good
performance on what would have otherwise been a horrible stony half-hour silence.
Don't say "choo choo choo" when you're "thinking".
Don't try to change the subject and answer a different question. Don't try to divert
the interviewer from asking you a question by telling war stories. Don't try to bluff
your interviewer. You should focus on each problem they're giving you and make
your best effort to answer it fully.
Some interviewers will not ask you to write code, but they will expect you to start
writing code on the whiteboard at some point during your answer. They will give you
hints but won't necessarily come right out and say: "I want you to write some code
on the board now." If in doubt, you should ask them if they would like to see code.
Interviewers have vastly different expectations about code. I personally don't care
about syntax (unless you write something that could obviously never work in any
programming language, at which point I will dive in and verify that you are not, in
fact, a circus clown and that it was an honest mistake). But some interviewers are
really picky about syntax, and some will even silently mark you down for missing a
semicolon or a curly brace, without telling you. I think of these interviewers as...
well, it's a technical term that rhymes with "bass soles", but they think of
themselves as brilliant technical evaluators, and there's no way to tell them
otherwise.
So ask. Ask if they care about syntax, and if they do, try to get it right. Look over
your code carefully from different angles and distances. Pretend it's someone else's
code and you're tasked with finding bugs in it. You'd be amazed at what you can
miss when you're standing 2 feet from a whiteboard with an interviewer staring at
your shoulder blades.
It's OK (and highly encouraged) to ask a few clarifying questions, and occasionally
verify with the interviewer that you're on the track they want you to be on. Some
interviewers will mark you down if you just jump up and start coding, even if you
get the code right. They'll say you didn't think carefully first, and you're one of those
"let's not do any design" type cowboys. So even if you think you know the answer to
the problem, ask some questions and talk about the approach you'll take a little
before diving in.
On the flip side, don't take too long before actually solving the problem, or some
interviewers will give you a delay-of-game penalty. Try to move (and write) quickly,
since often interviewers want to get through more than one question during the
interview, and if you solve the first one too slowly then they'll be out of time. They'll
mark you down because they couldn't get a full picture of your skills. The benefit of
the doubt is rarely given in interviewing.
One last non-technical tip: bring your own whiteboard dry-erase markers. They sell
pencil-thin ones at office supply stores, whereas most companies (including Google)
tend to stock the fat kind. The thin ones turn your whiteboard from a 480i standard-definition tube into a 58-inch 1080p HD plasma screen. You need all the help you
can get, and free whiteboard space is a real blessing.
You should also practice whiteboard space-management skills, such as not starting
on the right and coding down into the lower-right corner in Teeny Unreadable Font.
Your interviewer will not be impressed. Amusingly, although it always irks me when
people do this, I did it during my interviews, too. Just be aware of it!
Oh, and don't let the marker dry out while you're standing there waving it. I'm tellin'
ya: you want minimal distractions during the interview, and that one is surprisingly
common.
OK, that should be good for non-tech tips. On to X, for some value of X! Don't stab
me!
Tech Prep Tips
The best tip is: go get a computer science degree. The more computer science you
have, the better. You don't have to have a CS degree, but it helps. It doesn't have to
be an advanced degree, but that helps too.
However, you're probably thinking of applying to Google a little sooner than 2 to 8
years from now, so here are some shorter-term tips for you.
Algorithm Complexity: you need to know Big-O. It's a must. If you struggle with
basic big-O complexity analysis, then you are almost guaranteed not to get hired.
It's, like, one chapter in the beginning of one theory of computation book, so just go
read it. You can do it.
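To see what "knowing Big-O" buys you, here's a quick Python sketch (my choice of language, not anyone's requirement): binary search instrumented to count its comparisons, so you can watch the O(log n) bound hold.

```python
def binary_search_steps(sorted_xs, target):
    """Binary search that also counts comparisons, to make the
    O(log n) claim concrete. Returns (index or -1, steps)."""
    lo, hi, steps = 0, len(sorted_xs) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_xs[mid] == target:
            return mid, steps
        if sorted_xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps
```

On a 1024-element list it never takes more than 11 probes, versus up to 1024 for a linear scan. That kind of comparison is exactly what complexity questions are probing for.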
Sorting: know how to sort. Don't do bubble-sort. You should know the details of at
least one n*log(n) sorting algorithm, preferably two (say, quicksort and merge sort).
Merge sort can be highly useful in situations where quicksort is impractical, so take
a look at it.
For God's sake, don't try sorting a linked list during the interview.
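For reference, merge sort is short enough to write from memory on a whiteboard. Here's a sketch in Python: stable, guaranteed O(n log n) with no quicksort-style worst case, at the price of O(n) scratch space.

```python
def merge_sort(xs):
    """Classic top-down merge sort; returns a new sorted list."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    # Merge the two sorted halves.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps the sort stable
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out
```

Being able to explain why the merge step is where the linear work happens, and why there are log n levels of splitting, matters as much as the code itself.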
Hashtables: hashtables are arguably the single most important data structure
known to mankind. You absolutely have to know how they work. Again, it's like one
chapter in one data structures book, so just go read about them. You should be able
to implement one using only arrays in your favorite language, in about the space of
one interview.
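Here's one way that exercise can come out in Python, a separate-chaining table built from plain lists (Python's closest stand-in for raw arrays). A real answer would also resize when the load factor grows; that's left out to keep this interview-sized.

```python
class ArrayHashTable:
    """Hash table with separate chaining, using only lists."""

    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() spreads keys; modulo maps them onto our buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value  # overwrite existing key
                return
        bucket.append([key, value])

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Be ready to say why lookups are O(1) on average but O(n) if everything collides, and how resizing keeps the chains short.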
Trees: you should know about trees. I'm tellin' ya: this is basic stuff, and it's
embarrassing to bring it up, but some of you out there don't know basic tree
construction, traversal and manipulation algorithms. You should be familiar with
binary trees, n-ary trees, and trie-trees at the very very least. Trees are probably the
best source of practice problems for your long-term warmup exercises.
You should be familiar with at least one flavor of balanced binary tree, whether it's a
red/black tree, a splay tree or an AVL tree. You should actually know how it's
implemented.
You should know about tree traversal algorithms: BFS and DFS, and know the
difference between inorder, postorder and preorder.
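As a quick refresher, the three depth-first orders differ only in where a node's own value is emitted (a minimal sketch):

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def inorder(n):
    return inorder(n.left) + [n.val] + inorder(n.right) if n else []

def preorder(n):
    return [n.val] + preorder(n.left) + preorder(n.right) if n else []

def postorder(n):
    return postorder(n.left) + postorder(n.right) + [n.val] if n else []

# A tree with root 2, left child 1, right child 3:
root = Node(2, Node(1), Node(3))
print(inorder(root))    # [1, 2, 3]
print(preorder(root))   # [2, 1, 3]
print(postorder(root))  # [1, 3, 2]
```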
You might not use trees much day-to-day, but if so, it's because you're avoiding tree
problems. You won't need to do that anymore once you know how they work. Study
up!
Graphs
Graphs are, like, really really important. More than you think. Even if you already
think they're important, it's probably more than you think.
There are three basic ways to represent a graph in memory (objects and pointers,
matrix, and adjacency list), and you should familiarize yourself with each
representation and its pros and cons.
You should know the basic graph traversal algorithms: breadth-first search and
depth-first search. You should know their computational complexity, their tradeoffs,
and how to implement them in real code.
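A minimal sketch of both traversals over an adjacency-list graph (the graph itself is an invented example):

```python
from collections import deque

graph = {            # adjacency-list representation
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D'],
    'D': [],
}

def bfs(g, start):
    """Visit nodes level by level using a queue: O(V + E)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in g[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

def dfs(g, start, seen=None):
    """Visit nodes depth-first via recursion: also O(V + E)."""
    seen = seen if seen is not None else set()
    seen.add(start)
    order = [start]
    for nbr in g[start]:
        if nbr not in seen:
            order += dfs(g, nbr, seen)
    return order

print(bfs(graph, 'A'))   # ['A', 'B', 'C', 'D']
print(dfs(graph, 'A'))   # ['A', 'B', 'D', 'C']
```

The tradeoff in one line: BFS finds shortest paths in unweighted graphs but holds a whole frontier in memory; DFS uses memory proportional to the depth but gives no shortest-path guarantee.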
You should try to study up on fancier algorithms, such as Dijkstra and A*, if you get
a chance. They're really great for just about anything, from game programming to
distributed computing to you name it. You should know them.
Whenever someone gives you a problem, think graphs. They are the most
fundamental and flexible way of representing any kind of a relationship, so it's
about a 50-50 shot that any interesting design problem has a graph involved in it.
Make absolutely sure you can't think of a way to solve it using graphs before moving
on to other solution types. This tip is important!
Other data structures
You should study up on as many other data structures and algorithms as you can fit
in that big noggin of yours. You should especially know about the most famous
classes of NP-complete problems, such as traveling salesman and the knapsack
problem, and be able to recognize them when an interviewer asks you them in
disguise.
You should find out what NP-complete means.
Basically, hit that data structures book hard, and try to retain as much of it as you
can, and you can't go wrong.
Math
Some interviewers ask basic discrete math questions. This is more prevalent at
Google than at other places I've been, and I consider it a Good Thing, even though
I'm not particularly good at discrete math. We're surrounded by counting problems,
probability problems, and other Discrete Math 101 situations, and those innumerate
among us blithely hack around them without knowing what we're doing.
Don't get mad if the interviewer asks math questions. Do your best. Your best will
be a heck of a lot better if you spend some time before the interview refreshing
your memory on (or teaching yourself) the essentials of combinatorics and
probability. You should be familiar with n-choose-k problems and their ilk; the more
the better.
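For example, Python's standard library can sanity-check your n-choose-k arithmetic (math.comb is available from Python 3.8 on):

```python
from math import comb

# C(n, k) = n! / (k! * (n - k)!) counts the ways to choose k of n items, order ignored.
print(comb(5, 2))    # 10: the set {1..5} has 10 unordered pairs
print(comb(52, 5))   # 2598960: the number of distinct 5-card poker hands
```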
I know, I know, you're short on time. But this tip can really help make the difference
between a "we're not sure" and a "let's hire her". And it's actually not all that bad
discrete math doesn't use much of the high-school math you studied and forgot. It
starts back with elementary-school math and builds up from there, so you can
probably pick up what you need for interviews in a couple of days of intense study.
Sadly, I don't have a good recommendation for a Discrete Math book, so if you do, please share it.
Here are the top 50 objective-type sample Data Science interview questions, with the answers
given just below each of them. These sample questions were framed by experts from Intellipaat,
who provide Data Science training, to give you an idea of the type of questions that may be
asked in an interview. We have taken care to give correct answers to all the questions. Do
comment your thoughts. Happy job hunting!
Common challenges for recommender systems include:
Scalability
Data sparsity
Synonyms
Grey sheep
Shilling attacks
Diversity and the Long Tail
Diversity
Recommender Persistence
Privacy
User Demographics
Robustness
Serendipity
Trust
Labeling
Python is a friendly programming language that plays well with everyone and runs on
everything, so it is hardly surprising that it offers quite a few libraries that deal with data
efficiently and is therefore widely used in data science. Python's adoption for data science is
relatively recent, but now that it has firmly established itself as an important language for the
field, Python programming is not going anywhere. Python is mostly used for data analysis
when you need to integrate the results into web apps, or when you need to add
mathematical or statistical code for production.
In our previous posts 100 Data Science Interview Questions and Answers (General) and 100
Data Science in R Interview Questions and Answers, we listed all the questions that can be asked
in data science job interviews. This article in the series, lists questions which are related to
Python programming and will probably be asked in data science interviews.
The questions below are based on the course that is taught at DeZyre Data Science in Python.
This is not a guarantee that these questions will be asked in Data Science Interviews. The
purpose of these questions is to make the reader aware of the kind of knowledge that an applicant
for a Data Scientist position needs to possess.
Data Science Interview Questions in Python are generally scenario based or problem based
questions where candidates are provided with a data set and asked to do data munging, data
exploration, data visualization, modelling, machine learning, etc. Most of the data science
interview questions are subjective and the answers to these questions vary, based on the given
data problem. The main aim of the interviewer is to see how you code, what are the
visualizations you can draw from the data, the conclusions you can make from the data set, etc.
1) How can you build a simple logistic regression model in Python?
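One common way to answer is with scikit-learn; the feature values and labels below are an invented toy example, not from the original article:

```python
from sklearn.linear_model import LogisticRegression

# Toy data: the label is 1 exactly when the single feature exceeds 2.5
X = [[0.5], [1.0], [2.0], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[1.5], [3.2]]))    # class labels for two new points
print(model.predict_proba([[3.2]]))     # class probabilities for one point
```

In an interview you would also be expected to mention train/test splitting and an evaluation metric, which this sketch omits.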
2) How can you train and interpret a linear regression model in SciKit learn?
3) Name a few libraries in Python used for Data Analysis and Scientific computations.
NumPy, SciPy, Pandas, SciKit, Matplotlib, Seaborn
4) Which library would you prefer for plotting in Python language: Seaborn or
Matplotlib?
Matplotlib is the standard Python plotting library, but it needs a lot of fine-tuning to make
the plots look polished. Seaborn helps data scientists create statistically meaningful and
aesthetically appealing plots. The answer to this question varies based on the requirements for
plotting data.
5) What is the main difference between a Pandas series and a single-column
DataFrame in Python?
6) Write code to sort a DataFrame in Python in descending order.
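One possible answer uses pandas' sort_values; the column names below are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [3, 1, 2]})

# ascending=False gives descending order; a new sorted frame is returned
df_sorted = df.sort_values(by='score', ascending=False)
print(df_sorted['score'].tolist())   # [3, 2, 1]
```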
7) How can you handle duplicate values in a dataset for a variable in Python?
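One approach is pandas' duplicated/drop_duplicates, sketched here on invented data:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

print(s.duplicated().tolist())        # True marks repeats of values seen earlier
print(s.drop_duplicates().tolist())   # [1, 2, 3]
```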
8) Which Random Forest parameters can be tuned to enhance the predictive power of
the model?
9) Which method in pandas.tools.plotting is used to create scatter plot matrix?
scatter_matrix
10) How can you check if a data set or time series is Random?
To check whether a dataset is random or not use the lag plot. If the lag plot for the given
dataset does not show any structure then it is random.
11) Can we create a DataFrame with multiple data types in Python? If yes, how can
you do it?
12) Is it possible to plot histogram in Pandas without calling Matplotlib? If yes, then
write the code to plot the histogram?
13) What are the possible ways to load an array from a text data file in Python? How
can the efficiency of the code to load data file be improved?
numpy.loadtxt()
14) Which is the standard data missing marker used in Pandas?
NaN
15) Why you should use NumPy arrays instead of nested Python lists?
16) What is the preferred method to check for an empty array in NumPy?
17) List down some evaluation metrics for regression problems.
18) Which Python library would you prefer to use for Data Munging?
Pandas
19) Write the code to sort an array in NumPy by the nth column?
This can be achieved with the argsort() function. If X is a 2-D array and you would like to
sort it by the values in the nth column, the code is: X[X[:, n].argsort()]
20) How are NumPy and SciPy related?
21) Which python library is built on top of matplotlib and Pandas to ease data plotting?
Seaborn
22) Which plot will you use to assess the uncertainty of a statistic?
A bootstrap plot.
23) What are some features of Pandas that you like or dislike?
24) Which scientific libraries in SciPy have you worked with in your project?
25) What is pylab?
A package that combines NumPy, SciPy and Matplotlib into a single namespace.
26) Which python library is used for Machine Learning?
SciKit-Learn
Learn Data Science in Python to become an Enterprise Data Scientist
27) How can you copy an object in Python?
In general you can use copy.copy() for a shallow copy and copy.deepcopy() for a deep copy.
However, not all objects are copied the same way: dictionaries also have their own copy()
method, and sequences in Python can be copied by slicing.
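To illustrate the shallow-versus-deep distinction (a minimal sketch):

```python
import copy

original = {'nums': [1, 2, 3]}
shallow = copy.copy(original)    # a new dict, but it shares the inner list
deep = copy.deepcopy(original)   # a fully independent copy

original['nums'].append(4)
print(shallow['nums'])   # [1, 2, 3, 4] - the shared inner list changed
print(deep['nums'])      # [1, 2, 3]    - the deep copy did not
```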
28) What is the difference between tuples and lists in Python?
Tuples can be used as keys for dictionaries i.e. they can be hashed. Lists are mutable whereas
tuples are immutable - they cannot be changed. Tuples should be used when the order of
elements in a sequence matters. For example, set of actions that need to be executed in sequence,
geographic locations or list of points on a specific route.
29) What is PEP8?
PEP8 consists of coding guidelines for Python language so that programmers can write readable
code making it easy to use for any other person, later on.
30) Is all the memory freed when Python exits?
No it is not, because the objects that are referenced from global namespaces of Python modules
are not always de-allocated when Python exits.
31) What does __init__.py do?
__init__.py is an (often empty) .py file that marks a directory as a package so that its modules
can be imported; it also provides an easy way to organize the files. If there is a module
maindir/subdir/module.py, placing __init__.py in each of those directories lets the module be
imported with the command:
import maindir.subdir.module
32) What is the different between range () and xrange () functions in Python?
range () returns a list whereas xrange () returns an object that acts like an iterator for generating
numbers on demand.
33) How can you randomize the items of a list in place in Python?
random.shuffle(lst) can be used for randomizing the items of a list in place in Python.
34) What is a pass in Python?
Pass in Python signifies a no operation statement indicating that nothing is to be done.
35) If you are given the first and last names of employees, which data type in Python will
you use to store them?
You can use a list where each element holds a first and last name, or use a dictionary.
36) What happens when you execute the statement mango=banana in Python?
A NameError will occur when this statement is executed, because the name banana is not defined.
37) Write a sorting algorithm for a numerical dataset in Python.
38) Optimize the below Python code:
word = 'word'
print word.__len__ ()
Answer: print len(word)
39) What is monkey patching in Python?
Monkey patching is a technique that lets the programmer modify or extend other code at
runtime. Monkey patching comes in handy in testing, but it is not good practice to use it in a
production environment, as debugging the code can become difficult.
40) Which tool in Python will you use to find bugs if any?
Pylint and PyChecker. Pylint verifies whether a module satisfies the coding standards.
PyChecker is a static analysis tool that helps find bugs in the source code.
41) How are arguments passed in Python- by reference or by value?
The answer to this question is neither of these because passing semantics in Python are
completely different. In all cases, Python passes arguments by value where all values are
references to objects.
42) You are given a list of N numbers. Create a single list comprehension in Python to
create a new list that contains only those values which have even numbers from elements of
the list at even indices. For instance, if list[4] has an even value then it has to be included in the
new output list because it has an even index, but if list[5] has an even value it should not be
included, because 5 is not an even index.
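One possible comprehension (the input list is invented for illustration):

```python
nums = [10, 3, 7, 8, 6, 9, 12, 5]

# keep x only when both its index and its value are even
evens_at_even_indices = [x for i, x in enumerate(nums) if i % 2 == 0 and x % 2 == 0]
print(evens_at_even_indices)   # [10, 6, 12]
```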
word = 'aeioubcdfg'
print word[:3] + word[3:]
The output for the above code will be: 'aeioubcdfg'.
When two slices cover adjacent index ranges and the + operator is applied, the
strings are simply concatenated back into the original.
48) Can a lambda contain statements?
No, its syntax is restricted to a single expression. Lambdas are used for creating small function
objects that are returned at runtime.
This list of questions for Python interview questions and answers is not an exhaustive one and
will continue to be a work in progress. Let us know in comments below if we missed out on any
important question that needs to be up here.
Python Developer interview questions
This Python Developer interview profile brings together a snapshot of what to look for in
candidates with a balanced sample of suitable interview questions.
Introduction
In some respects even the most technical role demands qualities common to strong candidates for
all positions: the willingness to learn; qualified skills; passion for the job.
Even college performance, while it helps you to assess formal education, doesn't give a complete
picture. This is not to underplay the importance of a solid background in computer science. Some
things to look for:
Understanding of basic algorithmic concepts
Can they discuss basic algorithms: how would they search, sort, or traverse data?
Can they show a wider understanding of databases?
Do they have an approach to modelling?
Do they stay up to date with the latest developments? If so, how? Probe for their favourite
technical books. Who are they following on Twitter, which blogs do they turn to?
Are they active on GitHub? Do they contribute to any open source software projects, or take part
in hackathons? In short, how strong is their intellectual interest in their chosen field, and how is
this demonstrated? Ask about side projects (like game development). Committed, inquisitive
candidates will stand out.
Implement the linux whereis command that locates the binary, source, and
manual page files for a command.
class C:
    dangerous = 2

c1 = C()
c2 = C()
print c1.dangerous   # 2: the class attribute is found through the instance

c1.dangerous = 3     # creates an instance attribute that shadows the class attribute
print c1.dangerous   # 3
print c2.dangerous   # 2

del c1.dangerous     # deletes only the instance attribute
print c1.dangerous   # 2: the class attribute is visible again

C.dangerous = 3      # rebinding the class attribute is seen by all instances
print c2.dangerous   # 3
Here are the top 30 objective-type sample Python interview questions, with the answers given
just below each of them. These sample questions were framed by experts from Intellipaat, who
provide Python training, to give you an idea of the type of questions that may be asked in an
interview. We have taken care to give correct answers to all the questions. Do comment your
thoughts. Happy job hunting!
>>> y = 'true,false,none'
>>> y.split(',')
Result: ['true', 'false', 'none']
What is the use of generators in Python?
Generators are primarily used to return a sequence of items one at a time.
They are used for iteration in Python and for computing large result sets lazily: the
generator function pauses at each yield until the next value is requested.
One of the best uses of generators in Python is implementing callback-style
operations with less effort: they replace callbacks with iteration.
With the generator approach, the programmer is saved from writing a
separate callback function and passing it to the work function, since the caller
can simply apply a for loop around the generator.
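A minimal sketch of the pause-and-resume behaviour described above:

```python
def countdown(n):
    """Yield n, n-1, ..., 1; execution pauses at each yield."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))    # 3: runs until the first yield, then pauses
print(list(gen))    # [2, 1]: resumes once per requested value
```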
13. How to create a multidimensional list in Python?
As the name suggests, a multidimensional list is a list holding another list, and so on for
as many levels as needed. It can be done easily by creating a single-dimensional list and
filling each element with a newly created list.
14. What is lambda?
lambda is a powerful construct used in conjunction with other functions like
filter(), map(), and reduce(). The major use of the lambda construct is
to create anonymous functions at runtime, which can be used right where they
are created. Such functions are sometimes known as throw-away functions in
Python. The general syntax is lambda argument_list: expression.
For instance:
>>> intellipaat = lambda i, n: i + n
>>> intellipaat(2, 2)
4
Using filter():
>>> intellipaat = [1, 6, 11, 21, 29, 18, 24]
>>> print filter(lambda x: x % 3 == 0, intellipaat)
[6, 21, 18, 24]
15. Define pass in Python.
The pass statement in Python is a null operation and a placeholder: nothing
happens when it executes. It is mostly used in places where the syntax
requires a statement but the code hasn't been written yet. The syntax is simply pass.
16. How to perform Unit Testing in Python?
Referred to as PyUnit, the Python unit-testing framework unittest supports
automated testing, aggregating tests into collections, shutdown code for tests,
and independence of the tests from the reporting framework. The unittest module
uses the TestCase class for holding and preparing test routines and
cleaning them up after execution.
17. Define Python tools for finding bugs and performing static analysis?
PyChecker is an excellent bug-finding tool for Python, which performs static
analysis much as lint does for C/C++ and Java. It also notifies the programmer about the
complexity and style of the code. In addition, there is another tool, PyLint, for
checking coding standards, including code-line length, variable names,
and whether declared interfaces are fully implemented.
18. How to convert a string into list?
Using the function list(string). For instance:
>>> list('intellipaat')
['i', 'n', 't', 'e', 'l', 'l', 'i', 'p', 'a', 'a', 't']
In Python, strings behave like lists in various ways. For example, you can access
individual characters of a string:
>>> y = 'intellipaat'
>>> y[2]
't'
19. What operating systems does Python support?
Linux, Windows, Mac OS X, IRIX, Compaq, Solaris
20. Name the Java implementation of Python?
Jython
21. Define docstring in Python.
A string literal occurring as the first statement in any module, class, function,
or method is referred to as a docstring in Python. This kind
of string becomes the __doc__ special attribute of the object and provides an
easy way to document a particular code segment. Most modules contain
docstrings, and thus the functions and classes within the module
do as well.
22. Name the optional clauses used in a try-except statement in Python?
While Python exception handling is a bit different from Java's, Python
provides a try-except clause in which the programmer
receives a detailed error message without the program terminating.
Sometimes, along with reporting the problem, the except block can offer a
way to recover from the error.
The language also provides try-except-finally and try-except-else blocks.
Python is an interpreted language: a Python program runs directly from the source code.
The interpreter converts the programmer's source code into an intermediate form (bytecode),
which is then translated into machine instructions as the program executes.
5) How memory is managed in Python?
Python memory is managed by the Python private heap space. All Python objects and data
structures are located in a private heap. The programmer does not have access to this
private heap; the interpreter takes care of it.
The allocation of heap space for Python objects is done by the Python memory
manager, and the core API gives the programmer access to some tools for writing code.
Python also has an inbuilt garbage collector, which recycles all the unused memory
and makes it available to the heap space.
6) What are the tools that help to find bugs or perform static analysis?
PyChecker is a static analysis tool that detects bugs in Python source code and warns about
the style and complexity of the code. Pylint is another tool that verifies whether a module
meets the coding standard.
7) What are Python decorators?
A Python decorator is a callable that wraps another function to alter or extend its behaviour without modifying the function itself; it is applied with the @decorator syntax.
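A minimal example of the idea (the decorator name and behaviour here are invented for illustration):

```python
import functools

def shout(func):
    """Decorator that upper-cases whatever the wrapped function returns."""
    @functools.wraps(func)          # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return 'hello, ' + name

print(greet('ada'))   # HELLO, ADA
```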
8) What is the difference between list and tuple?
The difference between a list and a tuple is that a list is mutable while a tuple is not.
Because tuples are immutable, they can be hashed, e.g. for use as dictionary keys.
9) How are arguments passed: by value or by reference?
Everything in Python is an object, and all variables hold references to objects. Rebinding a
parameter name inside a function does not affect the caller's reference; however, you can
change the object itself if it is mutable.
The built-in data types in Python include:
List
Sets
Dictionaries
Strings
Tuples
Numbers
Python sequences can be indexed with positive and negative numbers. For positive indices, 0 is
the first index, 1 is the second index, and so forth. For negative indices, (-1) is the last index
and (-2) is the second-to-last index, and so forth.
23) How you can convert a number to a string?
In order to convert a number into a string, use the inbuilt function str(). If you want an octal
or hexadecimal representation, use the inbuilt function oct() or hex().
24) What is the difference between Xrange and range?
xrange returns an xrange object while range returns a list; xrange uses the same amount of
memory no matter what the range size is.
25) What is module and package in Python?
In Python, a module is the way to structure a program. Each Python program file is a module,
which can import other modules to use their objects and attributes.
A folder of Python programs is a package of modules. A package can contain modules or
subfolders.
Q1. Explain what regularization is and why it
is useful.
Answer by Matthew Mayo.
Regularization is the process of adding a tuning
parameter to a model to induce smoothness in
order to prevent overfitting. (see also KDnuggets
posts on Overfitting)
This is most often done by adding a constant multiple to an existing weight vector.
This constant is often either the L1 (Lasso) or the L2 (ridge) norm, but it can in actuality be
any norm. The model predictions should then minimize the mean of the loss
function calculated on the regularized training set.
Xavier Amatriain presents a good comparison of L1 and L2 regularization here, for
those interested.
Fig 1: Lp ball: As the value of p decreases, the size of the corresponding Lp space also decreases.
Q2. Which data scientists and startups do you admire most?
Geoff Hinton, Yann LeCun, and Yoshua Bengio - for persevering with Neural Nets
when most of the field had moved on, and starting the current Deep Learning revolution.
Demis Hassabis, for his amazing work on DeepMind, which achieved human or
superhuman performance on Atari games and recently Go.
Jake Porway from DataKind and Rayid Ghani from U. Chicago/DSSG, for enabling
data science contributions to social good.
DJ Patil, First US Chief Data Scientist, for using Data Science to make US
government work better.
Kirk D. Borne for his influence and leadership on social media.
Claudia Perlich for brilliant work on ad ecosystem and serving as a great KDD-2014
chair.
Hilary Mason for great work at Bitly and inspiring others as a Big Data Rock Star.
Usama Fayyad, for showing leadership and setting high goals for KDD and Data
Science, which helped inspire me and many thousands of others to do their best.
Hadley Wickham, for his fantastic work on Data Science and Data Visualization in R,
including dplyr, ggplot2, and Rstudio.
There are too many excellent startups in Data Science area, but I will not list them
here to avoid a conflict of interest.
Here is some of our previous coverage of startups.
Q3. How would you validate a model you created to generate a predictive
model of a quantitative outcome variable using multiple regression?
If the values predicted by the model are far outside of the response variable
range, this would immediately indicate poor estimation or model inaccuracy.
Use the model for prediction by feeding it new data, and use the coefficient of
determination (R squared) as a model validity measure.
Q4. Explain what precision and recall are. How do they relate to the ROC
curve?
Answer by Gregory Piatetsky:
Here is the answer from KDnuggets FAQ: Precision and Recall:
Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases
among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a
better chance of catching many of the 100 positive cases. You record the IDs of your predictions,
and when you get the actual results you sum up how many times you were right or wrong. There
are four ways of being right or wrong:
1. TN / True Negative: case was negative and predicted negative
2. TP / True Positive: case was positive and predicted positive
3. FN / False Negative: case was positive but predicted negative
4. FP / False Positive: case was negative but predicted positive
Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:

                   Predicted Negative   Predicted Positive
Negative Cases     TN: 9,760            FP: 140
Positive Cases     FN: 40               TP: 60
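From these counts, precision and recall follow directly (a sketch of the arithmetic):

```python
TN, FP, FN, TP = 9760, 140, 40, 60

precision = TP / (TP + FP)   # 60 / 200 = 0.30: of the 200 flagged, 30% were truly positive
recall = TP / (TP + FN)      # 60 / 100 = 0.60: of the 100 positives, 60% were caught
print(precision, recall)
```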
Ensure that there is no selection bias in test data used for performance
comparison
Ensure that the test data has sufficient variety in order to be representative of real-life data (helps avoid overfitting)
Ensure that the results are repeatable, with near-identical results across runs
One common way to achieve the above guidelines is through A/B testing, where
both the versions of algorithm are kept running on similar environment for a
considerably long time and real-life input data is randomly split between the two.
This approach is particularly common in Web Analytics.
Q6. What is root cause analysis?
Answer by Gregory Piatetsky:
According to Wikipedia,
Root cause analysis (RCA) is a method of problem solving used for identifying the
root causes of faults or problems. A factor is considered a root cause if removal
thereof from the problem-fault-sequence prevents the final undesirable event from
recurring; whereas a causal factor is one that affects an event's outcome, but is not
a root cause.
Root cause analysis was initially developed to analyze industrial accidents, but is
now widely used in other areas, such as healthcare, project management, or
software testing.
Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.
Essentially, you can find the root cause of a problem and show the relationship of
causes by repeatedly asking the question, "Why?", until you find the root of the
problem. This technique is commonly called "5 Whys", although it can involve
more or fewer than 5 questions.
Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis .
Q7. Are you familiar with price optimization, price elasticity, inventory
management, competitive intelligence? Give examples.
Answer by Gregory Piatetsky:
Those are economics terms that are not frequently asked of Data Scientists but they
are useful to know.
Price optimization is the use of mathematical tools to determine how customers will
respond to different prices for its products and services through different channels.
Big Data and data mining enables use of personalization for price optimization. Now
companies like Amazon can even take optimization further and show different prices
to different visitors, based on their history, although there is a strong debate about
whether this is fair.
Price elasticity in common usage typically refers to price elasticity of demand: a measure of how the quantity demanded of a good responds to a change in its price.
Similarly, Price elasticity of supply is an economics measure that shows how the
quantity supplied of a good or service responds to a change in its price.
Tools like Google Trends, Alexa, Compete, can be used to determine general trends
and analyze your competitors on the web.
8. What is statistical power?
Answer by Gregory Piatetsky:
Wikipedia defines statistical power (or sensitivity) of a binary hypothesis test as the
probability that the test correctly rejects the null hypothesis (H0) when the
alternative hypothesis (H1) is true.
To put in another way, Statistical power is the likelihood that a study will detect an
effect when the effect is present. The higher the statistical power, the less likely you
are to make a Type II error (concluding there is no effect when, in fact, there is).
Here are some tools to calculate statistical power.
9. Explain what resampling methods are and why they are useful. Also
explain their limitations.
Answer by Gregory Piatetsky:
Classical statistical parametric tests compare observed statistics to theoretical
sampling distributions. Resampling is a data-driven, not theory-driven, methodology
based upon repeated sampling within the same sample.
Resampling refers to methods for estimating the precision of sample statistics (e.g.
jackknifing or bootstrapping), for performing significance tests by exchanging labels on
data points (permutation tests), or for validating models on random subsets of the data
(bootstrapping, cross-validation).
Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling
overfitting, experimental design, tall and wide data, understanding the validity of statistics in the
media, and more.
By Gregory Piatetsky, KDnuggets.
The post on KDnuggets 20 Questions to Detect Fake Data Scientists has been very
popular - most viewed post of the month.
However these questions were lacking answers, so KDnuggets Editors got together
and wrote the answers. Here is part 2 of the answers, starting with a "bonus"
question.
Bonus Question: Explain what overfitting is and how you would control for
it.
This question was not part of the original 20, but probably is the most important one
in distinguishing real data scientists from fake ones.
Answer by Gregory Piatetsky.
Overfitting is finding spurious results that are due to chance and cannot be
reproduced by subsequent studies.
We frequently see newspaper reports about studies that overturn previous
findings, like eggs are no longer bad for your health, or saturated fat is not linked to
heart disease. The problem, in our opinion, is that many researchers, especially in
social sciences or medicine, too frequently commit the cardinal sin of data mining:
overfitting the data.
The researchers test too many hypotheses without proper statistical control, until
they happen to find something interesting and report it. Not surprisingly, next time
the effect, which was (at least partly) due to chance, will be much smaller or absent.
Among other safeguards, sound findings require minimal bias due to financial and other
factors (including the popularity of that scientific field).
Unfortunately, too often these rules were violated, producing irreproducible results.
For example, the S&P 500 index was found to be strongly related to the production of butter
in Bangladesh (from 1981 to 1993) (here is the PDF)
See more interesting (and totally spurious) findings which you can discover yourself
using tools such as Google correlate or Spurious correlations by Tyler Vigen.
Several methods can be used to avoid "overfitting" the data:
Randomization Testing (randomize the class variable, try your method on this
data - if it find the same strong results, something is wrong)
Nested cross-validation (do feature selection on one level, then run entire
method in cross-validation on outer level)
Good data science is on the leading edge of scientific understanding of the world,
and it is data scientists' responsibility to avoid overfitting data and to educate the
public and the media on the dangers of bad data analysis.
See also
Tag: Overfitting
Fig 12: There is a flaw in your experimental design (cartoon from here)
Step 4: Determine Experimental Design.
We consider experimental complexity, i.e. varying one factor at a time or multiple
factors at a time, in which case we use a factorial design (2^k design). A design is
also selected based on the type of objective (comparative, screening, response
surface) and the number of factors.
Q13. What is the difference between "long" ("tall") and "wide" format
data?
Answer by Gregory Piatetsky.
In most data mining / data science applications there are many more records (rows)
than features (columns) - such data is sometimes called "tall" (or "long") data.
In some applications like genomics or bioinformatics you may have only a small
number of records (patients), e.g., 100, but perhaps 20,000 observations for each
patient. The standard methods that work for "tall" data will lead to overfitting the
data, so special approaches are needed.
Fig 13. Different approaches for tall data and wide data, from the presentation
Sparse Screening for Exact Data Reduction, by Jieping Ye.
The problem is not just reshaping the data (there are useful R packages for this), but
avoiding false positives by reducing the number of features to find the most relevant
ones.
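In Python the reshaping itself is a one-liner with pandas, analogous to the R reshaping packages mentioned above. A sketch with made-up patient/gene data:

```python
import pandas as pd

# "Long" format: one row per (patient, gene) measurement.
long_df = pd.DataFrame({
    "patient": ["p1", "p1", "p2", "p2"],
    "gene":    ["g1", "g2", "g1", "g2"],
    "value":   [0.5, 1.2, 0.7, 0.9],
})

# Long -> wide: one row per patient, one column per gene.
wide_df = long_df.pivot(index="patient", columns="gene", values="value")

# Wide -> long again with melt.
back = wide_df.reset_index().melt(
    id_vars="patient", var_name="gene", value_name="value")

print(wide_df)
```

Reshaping is mechanical; the statistical work of handling wide data (avoiding false positives) still remains.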
Approaches for feature reduction like Lasso are well covered in Statistical Learning
with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani, and Wainwright.
(a free PDF of the book is available for download)
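As a sketch of why the Lasso helps on wide data, here it is via scikit-learn on synthetic data where only 3 of 500 features actually matter; the sample sizes and penalty strength are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Synthetic "wide" data: 50 samples, 500 features (p >> n), where only
# the first three features drive the response.
X = rng.normal(size=(50, 500))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] \
    + rng.normal(scale=0.1, size=50)

# The L1 penalty shrinks most coefficients to exactly zero,
# performing feature selection as a side effect of fitting.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"Lasso kept {selected.size} of {X.shape[1]} features")
```

An ordinary least-squares fit here would interpolate the 50 samples perfectly and overfit; the sparsity constraint is what makes the wide-data setting tractable.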
Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling
overfitting, experimental design, tall and wide data, understanding the validity of statistics in the
media, and more.
By Gregory Piatetsky, KDnuggets.
Fig 14a: Example of a very misleading bar chart that appeared on Fox
News
Fig 14b: how the same data should be presented objectively, from 5 Ways to
Avoid Being Fooled By Statistics
Often the authors try to hide the inadequacy of their research through canny
storytelling, omitting important details and jumping to enticingly presented false
insights. Thus, a rule of thumb for identifying articles with misleading statistical
inferences is to examine whether the article includes details on the research
methodology followed and any perceived limitations of the methodological choices
made. Look for words such as "sample size", "margin of error", etc.
While there are no perfect answers as to what sample size or margin of error is
appropriate, these attributes must certainly be kept in mind while reading the end
results.
Another common case of erratic reporting is when journalists with poor data
literacy pick up an insight from one or two paragraphs of a published research
paper, while ignoring the rest of the paper, just to make their point. Here is how
you can be smart and avoid being fooled by such articles:
Firstly, a reliable article must not make any unsubstantiated claims. Every assertion
must be backed by a reference to past research; otherwise, it must be clearly labeled
as an "opinion" rather than an assertion. Secondly, just because an article refers to
renowned research papers does not mean that it uses the insights from those papers
appropriately. This can be validated by reading the referenced papers in their
entirety and independently judging their relevance to the article at hand. Lastly,
though the end results might naturally seem like the most interesting part, it is often
a fatal mistake to skip the details about the research methodology (where one can
spot errors, bias, etc.).
Ideally, I wish that all such articles publish their underlying research data as well as
the approach. That way, the articles can achieve genuine trust as everyone is free
to analyze the data and apply the research approach to see the results for
themselves.
Fig 15. Tufte writes: "an unintentional Necker Illusion, as two back planes optically
flip to the front. Some pyramids conceal others; and one variable (stacked depth of
the stupid pyramids) has no label or scale."
Here is a more modern example from exceluser, where it is very hard to read the
column plot because the workers and cranes obscure the columns.
The problem with such decorations is that they force readers to work much harder
than necessary to discover the meaning of the data.
16. How would you screen for outliers and what should you do if you find
one?
Answer by Bhavya Geethika.
Some methods to screen for outliers are z-scores, modified z-scores, box plots,
Grubbs' test, the Tietjen-Moore test, exponential smoothing, the Kimber test for
exponential distributions, and moving-window filter algorithms. Two robust
methods are described in more detail:
Inter Quartile Range
An outlier is a data point that lies more than 1.5 IQRs below the first quartile (Q1) or
above the third quartile (Q3) in a given data set.
Tukey Method
It uses interquartile range to filter very large or very small numbers. It is practically
the same method as above except that it uses the concept of "fences". The two
values of fences are:
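Tukey's fence values are conventionally Q1 - 1.5×IQR and Q3 + 1.5×IQR (the inner fences, beyond which points are "possible" outliers) and Q1 - 3×IQR and Q3 + 3×IQR (the outer fences, beyond which points are "probable" outliers). A minimal sketch with made-up data:

```python
import numpy as np

# Made-up sample with one suspicious value (9.9).
data = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.9])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's fences: inner at 1.5 * IQR, outer at 3 * IQR.
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

possible = data[(data < inner[0]) | (data > inner[1])]
probable = data[(data < outer[0]) | (data > outer[1])]
print("possible outliers:", possible)
print("probable outliers:", probable)
```

What to do with a flagged point depends on context: investigate it first (measurement error vs. genuine extreme value) rather than deleting it automatically.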
Pandora uses the properties of a song or artist (a subset of the 400 attributes
provided by the Music Genome Project) in order to seed a "station" that plays
music with similar properties. User feedback is used to refine the station's
results, deemphasizing certain attributes when a user "dislikes" a particular
song and emphasizing other attributes when a user "likes" a song. This is an
example of a content-based approach.
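The content-based approach can be sketched with cosine similarity over attribute vectors. The songs and attribute scores below are made up, standing in for Music Genome Project attributes:

```python
import numpy as np

# Each song is a vector of attribute scores (hypothetical stand-ins
# for Music Genome Project attributes).
songs = {
    "song_a": np.array([0.9, 0.1, 0.8, 0.2]),
    "song_b": np.array([0.8, 0.2, 0.7, 0.3]),
    "song_c": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine(u, v):
    """Cosine similarity between two attribute vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Recommend the song whose attributes are most similar to the seed.
seed = songs["song_a"]
scores = {name: cosine(seed, vec)
          for name, vec in songs.items() if name != "song_a"}
best = max(scores, key=scores.get)
print(f"most similar to song_a: {best}")
```

User feedback would then adjust per-attribute weights in this similarity, which is the refinement step the answer describes.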
19. Explain what a false positive and a false negative are. Why is it
important to differentiate these from each other?
Answer by Gregory Piatetsky:
In binary classification (or medical testing), a false positive is when an algorithm (or
test) indicates the presence of a condition when in reality it is absent. A false negative
is when an algorithm (or test) indicates the absence of a condition when in reality it is
present.
In statistical hypothesis testing, a false positive is also called a type I error and a false
negative a type II error.
It is obviously very important to distinguish and treat false positives and false
negatives differently because the costs of such errors can be hugely different.
For example, if a test for a serious disease is a false positive (the test says disease, but
the person is healthy), then an extra test will be made to determine the correct
diagnosis. However, if a test is a false negative (the test says healthy, but the person
has the disease), then treatment will be withheld and the person may die as a result.
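Counting the four outcomes is straightforward; a minimal sketch with made-up labels (1 = condition present, 0 = absent):

```python
# Made-up ground truth and test results for eight cases.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in pairs)  # test says present, really absent
fn = sum(a == 1 and p == 0 for a, p in pairs)  # test says absent, really present
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
```

Because the costs differ (an unnecessary follow-up test vs. a missed disease), one typically tunes a classifier's threshold to trade FP against FN rather than just maximizing accuracy.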
20. Which tools do you use for visualization? What do you think of
Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in
a chart (or in a video)?
There are many ways of representing more than 2 dimensions in a chart. A 3rd
dimension can be shown with a 3D scatter plot, which can be rotated. You can also
use color, shading, shape, and size. Animation can be used effectively to show the
time dimension (change over time).
Here is a good example.
Fig 20a: 5-dimensional scatter plot of Iris data, with size: sepal length; color:
sepal width; shape: class; x-column: petal length; y-column: petal width, from here.
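A plot like Fig 20a can be sketched with matplotlib, mapping x, y, marker size, color, and marker shape to five dimensions. The iris-like data below is synthetic; the feature names and value ranges are made up:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 60

# Synthetic iris-like data: four numeric features plus a class label.
petal_len = rng.uniform(1, 7, n)
petal_wid = rng.uniform(0, 3, n)
sepal_len = rng.uniform(4, 8, n)
sepal_wid = rng.uniform(2, 5, n)
species = rng.integers(0, 3, n)

markers = "o^s"  # one marker shape per class (the 5th dimension)
fig, ax = plt.subplots()
for cls in range(3):
    m = species == cls
    sc = ax.scatter(petal_len[m], petal_wid[m],
                    s=sepal_len[m] * 20,   # size encodes sepal length
                    c=sepal_wid[m],        # color encodes sepal width
                    cmap="viridis", vmin=2, vmax=5,
                    marker=markers[cls])
fig.colorbar(sc, ax=ax, label="sepal width")
ax.set_xlabel("petal length")
ax.set_ylabel("petal width")
fig.savefig("five_dims.png")
```

Fixing `vmin`/`vmax` across the three `scatter` calls keeps the color scale consistent between classes, so color comparisons remain meaningful.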
For more than 5 dimensions, one approach is Parallel Coordinates, pioneered by
Alfred Inselberg.
See also