Data Analyst - CX Assignment

GOJEK
Analytics Take-home Test

Introduction
In this test, only SQL, R and/or Python are allowed to be used to process the data. This means
even if you managed to arrive at the right answer using other tools (such as Excel), you will be
awarded 0 points for that question. Please submit the answer in HTML/PDF format, showing the
full code, explanations and other outputs (graphs, tables, numbers). You have 2 days to finish
the assignment (starting from the day you were given the test).
Section 1: SQL (11 points)
Instruction
You will use Google BigQuery for this task. Please follow these steps to access the dataset:
1. Join our Google Group by following this link and click Ask to Join Group
2. After your join request is approved, go to this link to start using BigQuery
The dataset that you will use is bigquery-public-data.new_york_citibike which contains the
following two tables:
1. citibike_stations
2. citibike_trips
Problem 1 (4 points)
Exclude trips with missing start_station_id from the trip table. From the remaining trips, keep
those with start_station_ids that were not present at the station table. What percentage of these
trips end up in end_station_ids which are also not present in the station table?
Problem 2 (7 points)
Filter the trip table to include only trips with starttime from 2018-01-01 onwards. Combine
usertype, birth_year, and gender into 1. Assume every unique combination represents 1 user.
Include users with missing usertype/birth_year/gender.
For every month, classify users into segments based on their trips data that month:
● 0 distinct start_station_name = "inactive"
● 1-10 distinct start_station_name = "casual”
● > 10 distinct start_station_name = "power"
GOJEK
Note that missing month data must be imputed with 0. For example, if user A has info on
months 1 and 3 but not 2, then you need to impute month 2 with 0 and therefore classify user A
on month 2 as “inactive”.
Questions:
a. For each month in 2018, how many users belong to each segment?
b. For each month in 2018, compute the movements of users between segments for the
next month. For example: from January 2018 to February 2018, how many casual users
stayed as casual, became power, or became inactive? Do the same for the other groups
and the other months in 2018
Section 2: Modelling & R/Python (20 points)
Instruction
Use this dataset to solve the problems below.
Problem
Using multiple linear regression, predict the total_cbv. Create 1 model for each service.
Forecast period = 2016-03-30, 2016-03-31 and 2016-04-01

Train period = the rest
List of predictors to use:

1. Day of month
2. Month
3. Day of week
4. Weekend/weekday flag (weekend = Saturday & Sunday)
Pre-processing (do it in this order):

1. Remove GO-TIX
2. Keep only `Cancelled` order_status
3. Ensure the complete combinations (cartesian product) of date and service are present
4. Impute missing values with 0
5. Create is_weekend flag predictor (1 if Saturday/Sunday, 0 if other days)
6. One-hot encode month and day of week predictors
7. Standardize all predictors into z-scores using the mean and standard deviation from
train-period data only
GOJEK
Evaluation metric: MAPE
Validation: 3-fold scheme. Each validation fold has the same length as the forecast period.
Question 1 (4 points)
After all the pre-processing steps, what is the value of all the predictors for service = GO-FOOD,
date = 2016-02-28?
Question 2 (1 point)
Show the first 6 rows of one-hot encoded variables (month and day of the week)
Print the first 6 rows of the data after pre-processing for service = GO-KILAT. Sort ascendingly
by date
Compute the forecast-period MAPE for each service. Display in ascending order based on the
MAPE
Create graphs to show the performance of each validation fold. One graph one service. x =
date, y = total_cbv. Color: black = actual total_cbv, other colors = the fold predictions (there
should be 3 other colors). Only show the validation period. For example, if rows 11, 12 and 13
were used for validations, then do not show the other rows in the graphs. Clearly show the
month and date on the x-axis

Data Analyst - CX Assignment

Uploaded by

Copyright:

Available Formats

Data Analyst - CX Assignment

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Analyst - CX Assignment

Uploaded by

Copyright:

Available Formats

GOJEK

Analytics Take-home Test

Section 1: SQL (11 points)

Section 2: Modelling & R/Python (20 points)

Forecast period = 2016-03-30, 2016-03-31 and 2016-04-01

List of predictors to use:

Pre-processing (do it in this order):

You might also like