Data Analyst - CX Assignment
Data Analyst - CX Assignment
Data Analyst - CX Assignment
Instruction
You will use Google BigQuery for this task. Please follow these steps to access the dataset:
1. Join our Google Group by following this link and click Ask to Join Group
2. After your join request is approved, go to this link to start using BigQuery
The dataset that you will use is bigquery-public-data.new_york_citibike which contains the
following two tables:
1. citibike_stations
2. citibike_trips
Problem 1 (4 points)
Exclude trips with missing start_station_id from the trip table. From the remaining trips, keep
those with start_station_ids that were not present at the station table. What percentage of these
trips end up in end_station_ids which are also not present in the station table?
Problem 2 (7 points)
Filter the trip table to include only trips with starttime from 2018-01-01 onwards. Combine
usertype, birth_year, and gender into 1. Assume every unique combination represents 1 user.
Include users with missing usertype/birth_year/gender.
For every month, classify users into segments based on their trips data that month:
● 0 distinct start_station_name = "inactive"
● 1-10 distinct start_station_name = "casual”
● > 10 distinct start_station_name = "power"
GOJEK
Note that missing month data must be imputed with 0. For example, if user A has info on
months 1 and 3 but not 2, then you need to impute month 2 with 0 and therefore classify user A
on month 2 as “inactive”.
Questions:
a. For each month in 2018, how many users belong to each segment?
b. For each month in 2018, compute the movements of users between segments for the
next month. For example: from January 2018 to February 2018, how many casual users
stayed as casual, became power, or became inactive? Do the same for the other groups
and the other months in 2018
Instruction
Use this dataset to solve the problems below.
Problem
Using multiple linear regression, predict the total_cbv. Create 1 model for each service.
Validation: 3-fold scheme. Each validation fold has the same length as the forecast period.
Question 1 (4 points)
After all the pre-processing steps, what is the value of all the predictors for service = GO-FOOD,
date = 2016-02-28?
Question 2 (1 point)
Show the first 6 rows of one-hot encoded variables (month and day of the week)
Question 3 (1 points)
Print the first 6 rows of the data after pre-processing for service = GO-KILAT. Sort ascendingly
by date
Question 4 (7 points)
Compute the forecast-period MAPE for each service. Display in ascending order based on the
MAPE
Question 5 (7 points)
Create graphs to show the performance of each validation fold. One graph one service. x =
date, y = total_cbv. Color: black = actual total_cbv, other colors = the fold predictions (there
should be 3 other colors). Only show the validation period. For example, if rows 11, 12 and 13
were used for validations, then do not show the other rows in the graphs. Clearly show the
month and date on the x-axis