Team Alacrity - Amazon ML Challenge 2023

AMAZON ML HACKATHON

TEAM: ALACRITY

TEAM MEMBERS:
1. AVANEESH SHETYE
2. OMKAR CHAUBAL
3. DEEPTI AGRAWAL

PROBLEM STATEMENT:

In this hackathon, the goal is to develop a machine learning model that can predict
the length dimension of a product. Product length is crucial for packaging and storing
products efficiently in the warehouse. Moreover, in many cases, it is an important
attribute that customers use to assess the product size before purchasing. However,
measuring the length of a product manually can be time-consuming and error-prone,
especially for large catalogs with millions of products.

You will have access to the product title, description, bullet points, product type ID,
and product length for 2.2 million products to train and test your submissions. Note
that there is some noise in the data.

Task: You are required to build a machine learning model that can predict product
length from catalog metadata.

MAIN OBJECTIVE: To predict the length of new Amazon Products based on the vast
training dataset provided.

Major Steps Followed:

1. Pre-Processing

a. Cleaning of Noise from Data:

- Removing duplicate rows

- Handling missing and null values

- Converting categorical variables to numerical variables

b. Using text feature extraction and stemming to convert the long text in the
'TITLE' column to labels.
c. Using label encoding to convert categorical labels into numerical values

d. Text feature extraction and classification

2. Exploratory Data Analysis:

- Mean

- Standard Deviation

- Median

- Normal Distribution

- Variance of data and display into bar charts

3. Applying models to predict the length of products according to their
product category:

- Use of Linear Regression

- Use of Random Forest

- Use of Ridge and Lasso Techniques

- Use of Support Vector Machine

- Use of Neural Networks

4. Application of above methods on testing set with unseen data of new products

5. Performance Measures (regression metrics, since product length is a
continuous target):

- Mean Squared Error

- R-squared Score

6. Use of hyperparameter tuning to improve the performance measures of the
above models based on the given feedback.

7. Final creation of a .csv file with predicted lengths for new products,
displayed alongside their product IDs, for the model that gives the best
performance.

EXPLANATION OF CODE:

Libraries Used:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, model_selection, metrics, linear_model, ensemble, tree, svm
import xgboost as xgb
import tensorflow as tf
import torch
import statsmodels.api as sm
import scipy
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

1. Pre-Processing of data:

a. Cleaning of Noise from Data:

The code loads a dataset from a CSV file, removes duplicate rows from it, and
stores the processed dataset in a variable called "processed_1".
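
A minimal sketch of this step, assuming the raw file is named "train.csv" (the
actual file names are not shown in this document):

import pandas as pd

# Load the raw dataset and drop exact duplicate rows.
train = pd.read_csv("train.csv")
processed_1 = train.drop_duplicates().reset_index(drop=True)
print(f"Rows before: {len(train)}, after deduplication: {len(processed_1)}")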

b. Replacing Missing Values with Mean Value


The code is a basic data preprocessing pipeline for a machine learning model.
The missing values in both the training and test datasets are replaced with
the mean values of each column in the training dataset.

However, it is important to ensure that the mean values are calculated only on
the training dataset and not the test dataset. This is because using the mean
values of the test dataset can lead to data leakage, where information from
the test dataset is used to train the model, which can lead to overfitting and
poor generalization performance.
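
A minimal sketch of leakage-free mean imputation, assuming the processed files
are named "processed_train.csv" and "processed_test.csv":

import pandas as pd

train = pd.read_csv("processed_train.csv")
test = pd.read_csv("processed_test.csv")

# Compute column means on the training data only, then reuse them for the
# test data so no information leaks from the test set into training.
numeric_cols = train.select_dtypes(include="number").columns
train_means = train[numeric_cols].mean()

train[numeric_cols] = train[numeric_cols].fillna(train_means)
test[numeric_cols] = test[numeric_cols].fillna(train_means)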

c. Label Encoding of Categorical Variables

The code performs label encoding on the categorical variables in the train
dataframe. Label encoding is a technique used to convert categorical variables
to numerical variables so that they can be used in machine learning models.

However, it is important to note that label encoding can introduce an arbitrary
ordering of the categories, which may not be appropriate for certain types of
categorical variables. In such cases, one-hot encoding or other techniques may be
more appropriate.

Also, it is recommended to fit the LabelEncoder on the training dataset and apply the
same transformation on the test dataset. This ensures that the encoding is consistent
between the training and test datasets.
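
A minimal sketch, assuming a categorical column named "PRODUCT_TYPE_ID" (the
actual column encoded is not shown in this document):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv("processed_train.csv")
test = pd.read_csv("processed_test.csv")

le = LabelEncoder()

# Fit the encoder on the training column only, then apply the same mapping
# to the test column so the encoding stays consistent across both datasets.
# Note: transform() raises an error for labels that were not seen during fit.
train["PRODUCT_TYPE_ID_ENC"] = le.fit_transform(train["PRODUCT_TYPE_ID"])
test["PRODUCT_TYPE_ID_ENC"] = le.transform(test["PRODUCT_TYPE_ID"])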

d. Text Preprocessing

The code performs text classification using the Multinomial Naive Bayes
algorithm. First, the training and testing data are loaded into Pandas
dataframes. Then, the text data in the "TITLE" column of the training data is
converted to a matrix of token counts using the CountVectorizer class from
scikit-learn. Next, a Naive Bayes model is trained on the transformed training
data using the MultinomialNB class from scikit-learn. Finally, the trained
model is used to predict the categories of the text data in the "TITLE" column
of the testing data, and the output is saved to new variables.
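
A minimal sketch of this step; the target column used to define the categories
is an assumption, since it is not named in this document:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = pd.read_csv("processed_train.csv")
test = pd.read_csv("processed_test.csv")

# Build token-count matrices from the titles; the vocabulary is learned on
# the training data only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train["TITLE"].fillna(""))
X_test_counts = vectorizer.transform(test["TITLE"].fillna(""))

# Train a Multinomial Naive Bayes classifier on the token counts.
nb = MultinomialNB()
nb.fit(X_train_counts, train["PRODUCT_TYPE_ID"])  # assumed target column

# Predict categories for the test titles and save them to a new variable.
test_categories = nb.predict(X_test_counts)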

2. Exploratory Data Analysis:

A. Mean:

The code loads the processed training and testing data using the Pandas
read_csv() function and selects the first 20 rows of the training data. It then
creates a bar plot of the mean product length for each product category using
the plot() function from Matplotlib.
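
A minimal sketch of this plot, assuming the grouping column is
"PRODUCT_TYPE_ID":

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("processed_train.csv").head(20)

# Mean product length per product category, shown as a bar chart.
means = train.groupby("PRODUCT_TYPE_ID")["PRODUCT_LENGTH"].mean()
means.plot(kind="bar")
plt.xlabel("Product category")
plt.ylabel("Mean product length")
plt.show()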
B. Median:

This code loads the preprocessed training and testing data into Pandas dataframes.
It then divides the range of the numerical values in the "TITLE" column into 10
bins using the pd.cut() function and groups the training data by bin using the
groupby() method. For each bin, it calculates the median value of the
"PRODUCT_LENGTH" column using the median() method. The resulting data is plotted
as a bar chart using Matplotlib's plt.bar() function, with the x-axis labels rotated
vertically for readability. The resulting plot shows the median product length for each
group of titles.
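
A minimal sketch of this step (it assumes the "TITLE" column has already been
converted to numerical labels, as described in the preprocessing section):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("processed_train.csv")

# Bin the numerical TITLE values into 10 ranges and take the median
# product length within each bin.
bins = pd.cut(train["TITLE"], bins=10)
medians = train.groupby(bins)["PRODUCT_LENGTH"].median()

plt.bar(medians.index.astype(str), medians.values)
plt.xticks(rotation=90)
plt.ylabel("Median product length")
plt.show()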
C. Confidence Intervals:
This code is an example of a statistical analysis on a subset of a training dataset. It performs the
following steps:
1. Loads the processed train data from a CSV file and selects the first 10 rows of the data.
2. Categorizes the numerical values in the TITLE column using the pd.cut function and
labels them as categories 1 through 5.
3. Calculates the mean and standard deviation of PRODUCT_LENGTH for each TITLE
category using the groupby method.
4. Calculates the sample size and standard error of the mean for each TITLE category.
5. Calculates the confidence interval for each TITLE category using the stats.t.ppf function,
which returns the value of the t-distribution for a given probability and degrees of
freedom.
6. Plots the results as a bar chart using the matplotlib.pyplot module. The mean
PRODUCT_LENGTH is shown on the y-axis and TITLE category on the x-axis. The 95%
confidence intervals are shown as error bars with a capsize of 5.
Overall, this code performs a basic statistical analysis of the relationship between TITLE and
PRODUCT_LENGTH, by grouping and summarizing the data by TITLE category and calculating
confidence intervals for the mean PRODUCT_LENGTH in each category. The resulting plot
allows us to visualize any differences in mean PRODUCT_LENGTH across TITLE categories
and assess whether those differences are statistically significant.
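
A minimal sketch of the confidence-interval computation; the column names and
bin labels follow the description above, but the exact details are assumptions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

train = pd.read_csv("processed_train.csv").head(10)
train["TITLE_CAT"] = pd.cut(train["TITLE"], bins=5, labels=[1, 2, 3, 4, 5])

grouped = train.groupby("TITLE_CAT")["PRODUCT_LENGTH"]
mean = grouped.mean()
std = grouped.std()
n = grouped.count()

# Standard error of the mean and the 95% t-interval half-width.
# (With only 10 rows, some categories may be too small for a stable interval.)
sem = std / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = t_crit * sem

plt.bar(mean.index.astype(str), mean.values, yerr=ci.values, capsize=5)
plt.xlabel("TITLE category")
plt.ylabel("Mean PRODUCT_LENGTH")
plt.show()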
3. Models Used:

a. Linear Regression:
This code implements a simple linear regression model that predicts the product
length from the product title, using the processed training and testing data. Here are
the main steps:
1. Load the processed training data from a CSV file using pandas.
2. Split the data into input features (X_train) and target variable (y_train).
3. Train the linear regression model on the training data using the input features and target
variable.
4. Load the processed testing data from a CSV file using pandas.
5. Add a new column to the testing set with the predicted product lengths, calculated using
the trained model on the product titles in the testing set.
6. Print the first 20 rows of the processed test dataset.
7. Calculate the mean squared error and R-squared score of the predictions made by the
model on the training set.
The mean squared error and R-squared score are two commonly used metrics to evaluate the
performance of a regression model. The mean squared error measures the average squared
difference between the actual and predicted values of the target variable, while the R-squared
score measures the proportion of variance in the target variable that is explained by the model.
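
A minimal sketch of this model, assuming the numerical "TITLE" feature and the
file names used above:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

train = pd.read_csv("processed_train.csv")
X_train = train[["TITLE"]]
y_train = train["PRODUCT_LENGTH"]

# Fit the linear regression model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Predict product lengths for the test set and store them in a new column.
test = pd.read_csv("processed_test.csv")
test["PREDICTED_LENGTH"] = model.predict(test[["TITLE"]])
print(test.head(20))

# Evaluate on the training set, as described above.
train_pred = model.predict(X_train)
print("MSE:", mean_squared_error(y_train, train_pred))
print("R^2:", r2_score(y_train, train_pred))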

b. Random Forest:
This code trains a random forest regression model to predict the length of a product based on
its title, and then uses the trained model to make predictions on new data.
The first step is to load and preprocess the training data. The train DataFrame is loaded from a
CSV file, and the input features (X_train) and target variable (y_train) are extracted from it.
Next, a RandomForestRegressor object is created with hyperparameters n_estimators=100 and
max_depth=10, and the model is trained using the training data.
The model is then used to predict the lengths of the products in the training set, and the
predicted lengths are added to the train DataFrame as a new column.
The train DataFrame is saved to a new CSV file with the predicted lengths included.
The next step is to load and preprocess the test data. The test DataFrame is loaded from a CSV
file, and the input features (X_test) are extracted from it.
The trained model is then used to predict the lengths of the products in the test set, and the
predicted lengths are added to the test DataFrame as a new column.
The test DataFrame is saved to a new CSV file with the predicted lengths included.
Finally, the code prints the first 20 rows of the test DataFrame, and calculates and prints the
mean squared error and R-squared score as performance metrics for the model on the test set.
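
A minimal sketch of this step with the stated hyperparameters (the file names
and the feature column are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("processed_train.csv")
X_train = train[["TITLE"]]
y_train = train["PRODUCT_LENGTH"]

# Random forest with the hyperparameters described above.
rf = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)

# Add predictions to the training data and save them to a new CSV file.
train["PREDICTED_LENGTH"] = rf.predict(X_train)
train.to_csv("train_with_predictions.csv", index=False)

# Repeat for the test data.
test = pd.read_csv("processed_test.csv")
test["PREDICTED_LENGTH"] = rf.predict(test[["TITLE"]])
test.to_csv("test_with_predictions.csv", index=False)
print(test.head(20))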
