4227 GUI Ebook Data Science Interview Guide
Interview Guide
Lead the Data
Science Revolution
Harvard Business Review referred to data scientist as the “Sexiest Job of
the 21st Century.” Glassdoor placed it #1 on the 25 Best Jobs in America
list. According to IBM, demand for this role will soar 28 percent by 2020.

It’s unwise to ignore the importance of data and our capacity to analyze,
consolidate, and contextualize it. And it should come as no surprise that
companies that are able to leverage massive amounts of data to improve
the way they serve customers, build products, and run their operations
will be positioned to thrive in this economy.

Data scientists are relied upon to fill this need, but there is a serious lack
of qualified candidates worldwide. If you’re moving down the path to
becoming a data scientist, you must be prepared to impress prospective
employers with your knowledge. In addition to explaining why data
science is so important, you’ll need to show that you're technically
proficient with Big Data concepts, frameworks, and applications.

But there is nothing to worry about. We have compiled a list of the most
popular questions you can expect in an interview. So prepare ahead of
time, and crack your Data Science interview on the first go.
2 | www.simplilearn.com
Topics
Covered
♦ Statistics
♦ SQL
♦ Model building
For example, let’s say you want to build a decision tree to decide
whether you should accept or decline a job offer. The decision tree for
this case is as shown:
3.) Split the node into daughter nodes using the best split
4.) Repeat steps two and three until leaf nodes are finalized
5.) Build the forest by repeating steps one to four ‘n’ times to create
‘n’ trees
Overfitting refers to a model that is fitted too closely to a very small
amount of data and ignores the bigger picture. There are three main
methods to avoid overfitting:
1.) Keep the model simple—take fewer variables into account, thereby
removing some of the noise in the training data
164
167.3
170
174.2
178
180
Bivariate
Bivariate data involves two different variables. The analysis of this type
of data deals with causes and relationships, and its aim is to determine
the relationship between the two variables.
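One common way to quantify the relationship between two variables is the Pearson correlation coefficient. The sketch below is illustrative only: the two lists (`hours_studied`, `exam_score`) are hypothetical values that do not come from this guide, and the function name is an assumption.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical bivariate data set: two variables measured on the same five cases
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 70]

r = pearson_correlation(hours_studied, exam_score)
print(f"correlation = {r:.3f}")
```

A value close to +1 or -1 suggests a strong linear relationship; a value near 0 suggests little linear relationship. Correlation on its own does not establish cause.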
Multivariate
2 0 900 $400,000
3 2 1,100 $600,000
4 3 2,100 $1,200,000
Wrapper Methods
This involves:
Forward Selection: We test one feature at a time and keep adding
them until we get a good fit
Backward Selection: We test all the features and start removing them
to see what works better
Recursive Feature Elimination: Recursively looks through all the
different features and how they pair together
Wrapper methods are very labor-intensive, and high-end computers
But for multiples of three, print “Fizz” instead of the number and for
the multiples of five, print “Buzz.” For numbers which are multiples of
both three and five, print “FizzBuzz”
Note that range(51) covers zero to 50. However, the range asked for in
the question is one to 50. Therefore, in the above code, the range
should be written as range(1, 51).
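A minimal sketch of the loop described in the question, using `range(1, 51)` so that it covers one to 50. The helper name `fizzbuzz` is an assumption made for this illustration.

```python
def fizzbuzz(number):
    """Return the FizzBuzz label for a single number."""
    if number % 15 == 0:      # multiple of both three and five
        return "FizzBuzz"
    if number % 3 == 0:       # multiple of three only
        return "Fizz"
    if number % 5 == 0:       # multiple of five only
        return "Buzz"
    return str(number)

# range(1, 51) starts at 1 and stops before 51, so it covers 1 to 50
for number in range(1, 51):
    print(fizzbuzz(number))
```

The check for multiples of both numbers comes first; otherwise 15, 30, and 45 would be caught by the "Fizz" branch.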
If the data set is large, we can simply remove the rows with missing
data values. It is the quickest way; we then use the rest of the data
to predict the values.
For smaller data sets, we can substitute missing values with the mean
of the rest of the data, using a pandas DataFrame in Python. There are
different ways to do so, such as computing the mean with df.mean() and
passing it to df.fillna().
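Both strategies can be sketched in plain Python so that the example runs without third-party packages. The record layout and the helper names (`drop_missing`, `fill_with_mean`) are assumptions for illustration. With pandas, roughly equivalent calls on numeric columns would be `df.dropna()` and `df.fillna(df.mean())`.

```python
from statistics import mean

# Hypothetical records; None marks a missing value
rows = [
    {"age": 20, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 40, "income": 54000},
]

def drop_missing(rows):
    """Large data set: discard every row that has a missing value."""
    return [row for row in rows if None not in row.values()]

def fill_with_mean(rows, column):
    """Small data set: replace missing values with the column mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    column_mean = mean(observed)
    return [
        {**row, column: column_mean if row[column] is None else row[column]}
        for row in rows
    ]

dropped = drop_missing(rows)
filled = fill_with_mean(rows, "age")
print(len(dropped), filled[1]["age"])
```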
10) For the given points, how will you calculate the Euclidean
distance in Python?
plot1 = [1,3]
plot2 = [2,5]
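A minimal sketch of one way to answer this: the Euclidean distance is the square root of the sum of squared coordinate differences. The function name is an assumption. On Python 3.8 and later, `math.dist(plot1, plot2)` computes the same quantity.

```python
import math

plot1 = [1, 3]
plot2 = [2, 5]

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# sqrt((1 - 2)**2 + (3 - 5)**2) = sqrt(5)
distance = euclidean_distance(plot1, plot2)
print(round(distance, 3))  # 2.236
```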
-2 -4 2
-2 1 2
4 2 5
Expanding determinant:
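The expansion itself is not reproduced in the text. As a sketch only, assuming the three rows listed above form a 3×3 matrix, its determinant can be expanded along the first row by cofactors. The function name `det3` is an assumption for this illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return (
        a * (e * i - f * h)    # a times its 2x2 minor
        - b * (d * i - f * g)  # minus b times its 2x2 minor
        + c * (d * h - e * g)  # plus c times its 2x2 minor
    )

matrix = [
    [-2, -4, 2],
    [-2, 1, 2],
    [4, 2, 5],
]

# -2*(1*5 - 2*2) + 4*(-2*5 - 2*4) + 2*(-2*2 - 1*4) = -2 - 72 - 16
print(det3(matrix))  # -90
```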
Monitor
Constant monitoring of all models is needed to determine their
performance accuracy. When you change something, you want to
figure out how your changes are going to affect things. This needs to
be monitored to ensure it's doing what it's supposed to do.
Evaluate
Evaluation metrics of the current model are calculated to determine
whether a new algorithm is needed.
Compare
The new models are compared to each other to determine which
model performs the best.
Rebuild
The best performing model is re-built on the current state of data.
Collaborative filtering
As an example, Last.fm recommends tracks that other users with
similar interests play often. This is also commonly seen on Amazon
after making a purchase; customers may notice the following message
accompanied by product recommendations: “Users who bought this
also bought…”
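The "also bought" idea can be sketched with simple co-purchase counts. The purchase data and the function name `also_bought` are hypothetical; production recommenders use far richer signals than this.

```python
from collections import Counter

# Hypothetical purchase history: user -> set of items bought
purchases = {
    "ana": {"laptop", "mouse", "keyboard"},
    "ben": {"laptop", "mouse"},
    "cat": {"laptop", "monitor"},
    "dan": {"mouse", "keyboard"},
}

def also_bought(item, purchases, top_n=1):
    """Items most often bought by users who also bought `item`."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            # Count every other item in the same user's basket
            counts.update(basket - {item})
    return [other for other, _ in counts.most_common(top_n)]

print(also_bought("laptop", purchases))  # ['mouse']
```

The recommendation depends only on what other users did, not on any property of the items themselves.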
Content-based filtering
As an example: Pandora uses the properties of a song to recommend
music with similar properties. Here, we look at content, instead of
looking at who else is listening to music.
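A content-based recommender can be sketched by comparing item properties directly. The song names and the three numeric properties below are hypothetical placeholders, and cosine similarity is one common choice of measure rather than the method any particular service uses.

```python
import math

# Hypothetical song properties scaled to 0..1: (tempo, acousticness, energy)
songs = {
    "song_a": (0.90, 0.10, 0.80),
    "song_b": (0.85, 0.20, 0.75),
    "song_c": (0.20, 0.90, 0.10),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two property vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(title, songs):
    """Other song whose properties are closest to those of `title`."""
    candidates = [other for other in songs if other != title]
    return max(candidates, key=lambda other: cosine_similarity(songs[title], songs[other]))

print(most_similar("song_a", songs))  # song_b
```

No listener data is involved here; only the content of the items is compared.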
RMSE and MSE are two of the most common measures of accuracy
for a linear regression model.
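As a sketch, MSE is the mean of the squared differences between actual and predicted values, and RMSE is its square root. The sample values and function names below are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean squared error: average of squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical values: only the last prediction is off, by 4
actual = [2, 4, 6, 8]
predicted = [2, 4, 6, 12]

print(mse(actual, predicted))   # 4.0
print(rmse(actual, predicted))  # 2.0
```

RMSE is expressed in the same units as the target variable, which often makes it easier to interpret than MSE.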
Example: height of an adult = abc ft. This cannot be true, as the height
cannot be a string value. In this case, outliers can be removed.
If the outliers have extreme values, they can be removed. For example,
if all the data points are clustered between zero and 10, but one point
lies at 100, then we can remove this point.
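The zero-to-10 example can be sketched as follows. The text does not say how "extreme" should be decided, so the 1.5 × IQR rule used here is an assumption, as are the function name and the sample data.

```python
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Drop points lying more than k * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Points clustered between 0 and 10, plus one extreme point at 100
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
cleaned = remove_outliers_iqr(data)
print(cleaned)
```

Whether removal is appropriate depends on the data: an extreme value may also be a genuine observation worth keeping.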
A time series is stationary when the variance and mean of the series
are constant with time.
In the first graph, the variance is constant with time. Here, X is the time
factor and Y is the variable. The value of Y goes through the same
points all the time; in other words, it is stationary.
In the second graph, the waves get bigger, which means the series is
non-stationary and its variance is changing with time.
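The contrast between the two graphs can be imitated with synthetic data. This is a rough sketch, not a formal stationarity test: both series are generated here for illustration, and comparing the variance of the two halves is an assumed heuristic.

```python
import math
import statistics

def half_variances(series):
    """Variance of the first half and of the second half of a series."""
    mid = len(series) // 2
    return statistics.pvariance(series[:mid]), statistics.pvariance(series[mid:])

# Synthetic wave with constant amplitude (like the first graph)
steady = [math.sin(2 * math.pi * t / 20) for t in range(200)]
# Synthetic wave whose amplitude grows with time (like the second graph)
growing = [(1 + t / 50) * math.sin(2 * math.pi * t / 20) for t in range(200)]

steady_first, steady_second = half_variances(steady)
growing_first, growing_second = half_variances(growing)

print(steady_first, steady_second)    # roughly equal
print(growing_first, growing_second)  # second half clearly larger
```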
Total = 650      Actual P    Actual N
Predicted P      262         15
Predicted N      26          347
True Positive = 262
False Negative = 26
21) Write the equation and calculate the precision and recall
rate.
Total = 650      Actual P    Actual N
Predicted P      262         15
Predicted N      26          347
True Positive = 262
False Negative = 26
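A sketch of the calculation, reading the matrix with predictions in rows and actual values in columns: true positives = 262, false positives = 15, false negatives = 26, true negatives = 347. Precision is TP / (TP + FP) and recall is TP / (TP + FN). The variable names are assumptions.

```python
# Cell counts read from the confusion matrix (rows = predicted, columns = actual)
true_positive = 262
false_positive = 15
false_negative = 26
true_negative = 347

# Precision: share of predicted positives that are actually positive
precision = true_positive / (true_positive + false_positive)
# Recall: share of actual positives that were predicted positive
recall = true_positive / (true_positive + false_negative)

print(f"precision = {precision:.2f}")  # 262 / 277
print(f"recall    = {recall:.2f}")     # 262 / 288
```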
23) Write a basic SQL query that lists all orders with
customer information.
Usually, we have order tables and customer tables that contain the
following columns:
Order Table
OrderId
CustomerId
OrderNumber
TotalAmount
Customer Table
Id
FirstName
LastName
City
Country
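The SQL query is shown below. Order is a reserved word in SQL, so the
table name is quoted; the quoting style varies by database.
SELECT OrderNumber, TotalAmount, FirstName,
LastName, City, Country
FROM "Order"
JOIN Customer
ON "Order".CustomerId = Customer.Id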
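To try the join end to end, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The sample rows are hypothetical. The table name is written as "Order" in double quotes because ORDER is a reserved word; other databases may use a different quoting style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        Id INTEGER PRIMARY KEY,
        FirstName TEXT, LastName TEXT, City TEXT, Country TEXT
    );
    CREATE TABLE "Order" (
        OrderId INTEGER PRIMARY KEY,
        CustomerId INTEGER REFERENCES Customer(Id),
        OrderNumber TEXT,
        TotalAmount REAL
    );
    -- Hypothetical sample rows
    INSERT INTO Customer VALUES (1, 'Ada', 'Lovelace', 'London', 'UK');
    INSERT INTO "Order" VALUES (10, 1, 'A-1001', 250.0);
""")

# Inner join: each order is paired with the customer who placed it
rows = conn.execute("""
    SELECT OrderNumber, TotalAmount, FirstName, LastName, City, Country
    FROM "Order"
    JOIN Customer ON "Order".CustomerId = Customer.Id
""").fetchall()

print(rows)
conn.close()
```

A plain JOIN is an inner join, so orders without a matching customer row would not appear in the result.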
K-means clustering
Linear regression
Decision trees
26) Below are the eight actual values of target variable in the
train file. What is the entropy of the target variable?
[0, 0, 0, 1, 1, 1, 1, 1]
Choose the correct answer.
1. -(5/8 log(5/8) + 3/8 log(3/8))
2. 5/8 log(5/8) + 3/8 log(3/8)
3. 3/8 log(5/8) + 5/8 log(3/8)
4. 5/8 log(3/8) – 3/8 log(5/8)
The target variable takes the value 1 in five of the eight observations.
The formula for calculating the entropy is:
Entropy = -(p/n log(p/n) + (n-p)/n log((n-p)/n))
Putting p=5 and n=8, we get
Entropy = -(5/8 log(5/8) + 3/8 log(3/8)), which is option 1.
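The expression can be checked numerically. The text does not state the base of the logarithm; base 2 (entropy in bits) is assumed in this sketch, and the function name is illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of class labels, using log base 2 (assumed)."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )

target = [0, 0, 0, 1, 1, 1, 1, 1]

# -(5/8 * log2(5/8) + 3/8 * log2(3/8))
print(round(entropy(target), 4))  # 0.9544
```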
1. Logistic Regression
2. Linear Regression
3. K-means clustering
4. Apriori algorithm
The most appropriate algorithm for this case is option 1, logistic regression.
1. K-means clustering
2. Linear regression
3. Association rules
4. Decision trees
Because we are grouping people together by four different similarities,
that number indicates the value of k. Therefore, K-means clustering
(option 1) is the most appropriate algorithm for this study.
1. One-way ANOVA
2. K-means clustering
3. Association rules
4. Student’s t-test
The answer is option 1: One-way ANOVA
48. What are the types of biases that can occur during sampling?
Selection bias
Undercoverage bias
Survivorship bias
Are you prepared enough for your next career in data science?
Try answering this Data Science with R Practice Test and find out.
INDIA
Simplilearn Solutions Pvt Ltd.
# 53/1 C, Manoj Arcade, 24th Main, Harlkunte
2nd Sector, HSR Layout
Bangalore: 560102
Call us at: 1800-212-7688

USA
Simplilearn Americas, Inc.
201 Spear Street, Suite 1100,
San Francisco, CA 94105
United States
Phone No: +1-844-532-7688
www.simplilearn.com