Data Analyst - CX Assignment

Download as pdf or txt
Download as pdf or txt
You are on page 1of 3

GOJEK

Analytics Take-home Test


Introduction
In this test, only SQL, R and/or Python are allowed to be used to process the data. This means
even if you managed to arrive at the right answer using other tools (such as Excel), you will be
awarded 0 points for that question. Please submit the answer in HTML/PDF format, showing the
full code, explanations and other outputs (graphs, tables, numbers). You have 2 days to finish
the assignment (starting from the day you were given the test).

Section 1: SQL (11 points)

Instruction
You will use Google BigQuery for this task. Please follow these steps to access the dataset:
1. Join our Google Group by following this link and click Ask to Join Group
2. After your join request is approved, go to this link to start using BigQuery

The dataset that you will use is bigquery-public-data.new_york_citibike which contains the
following two tables:
1. citibike_stations
2. citibike_trips

Problem 1 (4 points)
Exclude trips with missing start_station_id from the trip table. From the remaining trips, keep
those with start_station_ids that were not present at the station table. What percentage of these
trips end up in end_station_ids which are also not present in the station table?

Problem 2 (7 points)
Filter the trip table to include only trips with starttime from 2018-01-01 onwards. Combine
usertype, birth_year, and gender into 1. Assume every unique combination represents 1 user.
Include users with missing usertype/birth_year/gender.

For every month, classify users into segments based on their trips data that month:
● 0 distinct start_station_name = "inactive"
● 1-10 distinct start_station_name = "casual”
● > 10 distinct start_station_name = "power"
GOJEK

Note that missing month data must be imputed with 0. For example, if user A has info on
months 1 and 3 but not 2, then you need to impute month 2 with 0 and therefore classify user A
on month 2 as “inactive”.

Questions:
a. For each month in 2018, how many users belong to each segment?
b. For each month in 2018, compute the movements of users between segments for the
next month. For example: from January 2018 to February 2018, how many casual users
stayed as casual, became power, or became inactive? Do the same for the other groups
and the other months in 2018

Section 2: Modelling & R/Python (20 points)

Instruction
Use this dataset to solve the problems below.

Problem
Using multiple linear regression, predict the total_cbv. Create 1 model for each service.

Forecast period = 2016-03-30, 2016-03-31 and 2016-04-01


Train period = the rest

List of predictors to use:


1. Day of month
2. Month
3. Day of week
4. Weekend/weekday flag (weekend = Saturday & Sunday)

Pre-processing (do it in this order):


1. Remove GO-TIX
2. Keep only `Cancelled` order_status
3. Ensure the complete combinations (cartesian product) of date and service are present
4. Impute missing values with 0
5. Create is_weekend flag predictor (1 if Saturday/Sunday, 0 if other days)
6. One-hot encode month and day of week predictors
7. Standardize all predictors into z-scores using the mean and standard deviation from
train-period data only
GOJEK
Evaluation metric: MAPE

Validation: 3-fold scheme. Each validation fold has the same length as the forecast period.

Question 1 (4 points)
After all the pre-processing steps, what is the value of all the predictors for service = GO-FOOD,
date = 2016-02-28?

Question 2 (1 point)
Show the first 6 rows of one-hot encoded variables (month and day of the week)

Question 3 (1 points)
Print the first 6 rows of the data after pre-processing for service = GO-KILAT. Sort ascendingly
by date

Question 4 (7 points)
Compute the forecast-period MAPE for each service. Display in ascending order based on the
MAPE

Question 5 (7 points)
Create graphs to show the performance of each validation fold. One graph one service. x =
date, y = total_cbv. Color: black = actual total_cbv, other colors = the fold predictions (there
should be 3 other colors). Only show the validation period. For example, if rows 11, 12 and 13
were used for validations, then do not show the other rows in the graphs. Clearly show the
month and date on the x-axis

You might also like