Movie Recommender System Using K-Means Clustering AND K-Nearest Neighbor
Anand Nayyar
Duy Tan University
Abstract—In the field of Artificial Intelligence, machine learning provides automatic systems that learn and improve from experience without being explicitly programmed. In this research work, a movie recommender system is built using the K-Means Clustering and K-Nearest Neighbor algorithms. The MovieLens dataset is taken from Kaggle, and the system is implemented in the Python programming language. The proposed work introduces various concepts related to machine learning and recommendation systems, and describes the tools and techniques used to build recommender systems. Algorithms such as K-Means Clustering, KNN, Collaborative Filtering, and Content-Based Filtering are described in detail. After studying different types of machine learning algorithms, there is a clear picture of which algorithm to apply in different areas of industry, such as recommender systems and e-commerce. The implementation and working of the proposed system are then illustrated for the movie recommender system. The building blocks of the proposed system, namely Architecture, Process Flow, Pseudo Code, Implementation, and Working of the System, are described in detail. Finally, different values of Root Mean Squared Error (RMSE) are obtained for different numbers of clusters. In the proposed work, as the number of clusters decreases, the value of RMSE also tends to decrease. The best value of RMSE obtained is 1.081648. The results given by the proposed system are better than the existing technique on the basis of RMSE value.

Keywords—Recommender System, K-Means, KNN, Collaborative Filtering, Content-Based Filtering

technique that is used in the development of recommendation systems is clustering. Clustering [20], [21] is a process of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters [11], [12], [13], [14]. K-Means [13], [23], [33] clustering along with K-Nearest Neighbor [18], [24] is implemented on the MovieLens dataset in order to obtain the best-optimized result. In the existing technique the data is scattered, which results in a high number of clusters, while in the proposed technique the data is gathered, which results in a low number of clusters. The process of recommending a movie is optimized in the proposed scheme. The proposed recommender system predicts the user's preference for a movie on the basis of different parameters. The recommender system works on the concept that people have common preferences or choices, and that these users influence each other's opinions. This optimizes the process and yields a lower RMSE.

The work starts with Section I, the Introduction, which covers the basics of recommendation systems. Section II discusses recent related work, with details of the techniques and tools used by different authors. Section III describes the evolution of the proposed recommendation system. Section IV shows the algorithm of the proposed system. Section V shows the implementation of the proposed system. Section VI discusses the working and results of the system with the help of snapshots of the system. Section VII presents the conclusion and future work of the proposed system.
978-1-5386-5933-5/19/$31.00 ©2019 IEEE
similar interests. This system (K-mean Cuckoo) has an MAE of 0.68 [15], [16]. In 2017, authors used a new approach that can solve the sparsity problem to a great extent [38]. In 2018, authors built a recommendation engine that analyzes rating data sets collected from Twitter to recommend movies to a specific user using R [39].

III. EVOLUTION OF PROPOSED MOVIE RECOMMENDATION SYSTEM

This section consists of the architecture and process flow of the proposed system.

A. Architecture

Figure 1 shows the architecture of the proposed system. It consists of three modules: an input module, a processing module, and an output module. Figure 1 gives a clear conceptual idea of the working of the proposed recommendation system. Each module is described in detail below to explain the architecture of the system.

Input Module
In this module, the user is asked to provide details as input, such as userId, age, gender, and pin code. This information is passed to the next module, i.e., the processing module.

Processing Module
In this module, the pandas library first separates the data from the raw files, placing the information about users and movie items into separate data frames. After the data is separated from its raw form, the utility matrix module builds a utility matrix that records which user rated which movie. This helps in figuring out how many times each movie is rated by the users. Then, based on this preprocessing, separate data frames for the training set and the testing set are created so that the performance of the system can be evaluated later. After the utility matrix is obtained, K-means clustering is used to build a separate data frame that shows which movie belongs to which genre. The Within-Cluster Sum of Squares (WCSS) is a measure of the variability of the observations within each cluster. In general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares. In the WCSS module, the right number of clusters is chosen using the WCSS technique. Next, a utility clustered matrix is created to hold the average rating given by each user to each cluster. In the utility clustered module, the utility clustered matrix is used to calculate the similarity between the users. The PCS (Pearson correlation similarity) and normalization module calculates the correlation using the utility clustered matrix. Finally, in the KNN module and similarity module, predictions for movie ratings are calculated using K-Nearest Neighbor with the help of the similarity matrix and the utility clustered matrix.

Output Module
The output module presents the predicted movies that the input user might like, along with the predicted ratings that the input user might give to those movies.

B. Process Flow
Figure 2 shows the process flow diagram of the Movie Recommendation System. The process flow depicts how the system works, how it deals with the raw data, and how it predicts the ratings for the input userId.

Fig. 2: Process Flow Diagram

Step 1: The user gives the userId and information such as
gender, age, and pin code.

Step 2: Using the numpy and pandas libraries, the raw data is preprocessed into separate data frames.

Step 3: The Within-Cluster Sum of Squares method is used to find the right number of clusters so that K-means clustering can be applied to the movies.

Step 4: After applying K-means clustering, a utility clustered matrix is built, which defines the average rating the user gives to each cluster.

Step 5: Using the utility clustered matrix and Pearson correlation, the similarity between the users is calculated.

Step 6: Finally, KNN uses the utility clustered matrix and the similarity to predict the movies for the input user.

The algorithm for the proposed system is as follows:
Step 1: Import the Python libraries: NumPy, Pandas, Matplotlib, sklearn.
Step 2: Read the CSV information as data frames in the user and item variables.
Step 3: Split the data into the training set and test set as data frames in the variables rating and ratingTest.
Step 4: Create a utility matrix named utility, which tells which user rated which movie.
Step 5: Using the WCSS method, choose the right number of clusters so that the K-means clustering technique can be applied to classify the movies according to the number of clusters.
Step 6: Define the utility clustered matrix after applying the K-means clustering algorithm.
Step 7: Apply the Pearson correlation metric on the utility clustered matrix to calculate the similarity matrix between the users.
Step 8: Normalize the values stored in the utility matrix.
Step 9: The Guess() function takes two parameters as input, userID and topN users, and is used by KNN to predict the movie ratings for the topN similar users.
Step 10: The ratings in the ratingTest data frame are used for comparison when the guess function predicts the ratings.
Step 11: RMSE is calculated to evaluate the accuracy of the model.

V. IMPLEMENTATION
The system has been implemented in the Python programming language using the K-Means clustering library and K-Nearest Neighbor. The implementation of the system consists of several sub-sections, which are standard processes to be followed while solving any machine learning [17], [19], [22], [27], [28], [29], [30], [31], [32], [34] problem. These are as follows:
1) Data Collection
2) Data Preparation
3) Model Creation
4) Model Training
5) Model Testing

A. Data Collection
The first step in the process of implementation is the data collection step. In this step, the right dataset is chosen for the computations that follow. For the movie recommendation system, the MovieLens dataset is taken from the Kaggle website. The dataset consists of 100,000 movie ratings on a scale of 1 to 5, given by 943 users to 1682 movies. With this information, further computations are done using the Python programming language.

B. Data Preparation
The second step in the process of implementation is the data preparation step. In this step, data preprocessing is done. It produces the utility matrix, which tells which user rated which movie. This is done by first separating the user data and movie data into separate data frames. Then, using both data frames, the utility matrix is created.

C. Data Creation
The third step in the process of implementation is the data creation step. In this step, the K-Means clustering model is applied. The right number of clusters is chosen using the WCSS method. After choosing the right number of clusters, the movies are divided into clusters by applying the K-Means clustering model. This leads to the creation of the utility clustered matrix.

D. Data Training
The fourth step in the process of implementation is the data training step. In this step, the utility clustered matrix is normalized. Then the similarity between the users is calculated using the Pearson correlation matrix. Then, using KNN [18], the movie ratings are predicted for the top N users.

E. Data Testing
The fifth step in the process of implementation is the data testing step. In this step, the movie ratings are predicted for the test users. This is done to evaluate the model using an evaluation metric.

The working of the proposed system is discussed using the following steps:

Step 1: In this step, the user information is taken as input: the userId and information such as gender, age, and pin code, as shown in Figure 3.

Fig. 3: Utility Matrix

Step 2: Then, using the numpy and pandas libraries, the raw data is preprocessed into separate data frames, as shown in
Figure 4.

Fig. 6: Utility Clustered Matrix

Step 5: Using the utility clustered matrix and Pearson correlation, the similarity between the users is calculated, as shown in Figure 7.

Fig. 8: Output

TABLE I: Results (K-means + KNN)

Number of clusters    Root Mean Squared Error
19                    2.504990
18                    2.375555
17                    2.337194
16                    2.416212
15                    2.256299
14                    2.080751
13                    1.994332
12                    1.928682
11                    1.861167
10                    1.820095
9                     1.625027
8                     1.493939
7                     1.441855
6                     1.439451
5                     1.269583
4                     1.166091
3                     1.141065
2                     1.081648
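
As a rough illustration of how the later steps of the method fit together (utility clustered matrix, Pearson correlation similarity between users, KNN prediction, and RMSE), the sketch below runs them on a small invented data set. It is not the authors' implementation. The toy ratings, the movie-to-cluster labels (assumed to have been produced already by the K-means step), and the helper names build_utility_clustered, user_mean, pearson, predict, and rmse are all hypothetical. The paper does not give its exact prediction formula, so the mean-centred, similarity-weighted average used here is an assumption, and plain Python is used instead of NumPy and pandas only to keep the example dependency-free.

```python
from math import sqrt

# Hypothetical toy data: ratings[user][movie] on the 1-5 scale used by MovieLens.
ratings = {
    "u1": {"m1": 5, "m2": 4, "m3": 3, "m5": 1},
    "u2": {"m1": 4, "m3": 3, "m4": 3, "m6": 2},
    "u3": {"m2": 1, "m4": 3, "m5": 5, "m6": 4},
    "u4": {"m1": 5, "m2": 5, "m5": 2},
}
# Assumed output of the K-means step: one cluster label per movie.
movie_cluster = {"m1": 0, "m2": 0, "m3": 1, "m4": 1, "m5": 2, "m6": 2}
N_CLUSTERS = 3


def build_utility_clustered(ratings, movie_cluster, n_clusters):
    """Average rating each user gave to each movie cluster (None if none rated)."""
    matrix = {}
    for user, rated in ratings.items():
        sums, counts = [0.0] * n_clusters, [0] * n_clusters
        for movie, value in rated.items():
            cluster = movie_cluster[movie]
            sums[cluster] += value
            counts[cluster] += 1
        matrix[user] = [s / c if c else None for s, c in zip(sums, counts)]
    return matrix


def user_mean(row):
    """Mean of the cluster averages that a user actually has."""
    known = [v for v in row if v is not None]
    return sum(known) / len(known)


def pearson(row_a, row_b):
    """Pearson correlation over the clusters both users have rated."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mean_a = sum(a for a, _ in pairs) / len(pairs)
    mean_b = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant row carries no correlation information
    return cov / sqrt(var_a * var_b)


def predict(user, cluster, matrix, top_n=2):
    """Mean-centred, similarity-weighted vote of the top-N most similar users."""
    base = user_mean(matrix[user])
    candidates = [
        (pearson(matrix[user], row), row[cluster] - user_mean(row))
        for other, row in matrix.items()
        if other != user and row[cluster] is not None
    ]
    # Keep the N nearest neighbours that are positively correlated with the user.
    positive = [c for c in candidates if c[0] > 0]
    nearest = sorted(positive, key=lambda c: c[0], reverse=True)[:top_n]
    if not nearest:
        return base  # no usable neighbour: fall back to the user's own mean
    offset = sum(sim * dev for sim, dev in nearest) / sum(sim for sim, _ in nearest)
    return min(5.0, max(1.0, base + offset))  # clip to the rating scale


def rmse(predicted, actual):
    """Root Mean Squared Error between two equally long rating lists."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


matrix = build_utility_clustered(ratings, movie_cluster, N_CLUSTERS)
# Held-out (user, movie, true rating) triples, again hypothetical.
test_set = [("u4", "m3", 4), ("u2", "m5", 1)]
predictions = [predict(u, movie_cluster[m], matrix) for u, m, _ in test_set]
print("RMSE on the toy test set:", rmse(predictions, [r for _, _, r in test_set]))
```

Repeating the last four lines with movie_cluster dictionaries built for different numbers of clusters would give one RMSE value per cluster count, which is the kind of comparison reported in Table I. The numbers from this toy data are not comparable to the paper's results.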
data sets. Sentiment analysis can be used in the future to enhance the efficiency of the movie recommendation system, so that the model can be tuned to accommodate more situations. In the future, individual characteristics hidden in the recommendations of the users may be removed.

A. Comparison with Existing Technology

Tables 2 and 3 compare the results of the proposed system with the existing technique, i.e., cuckoo search, on the basis of RMSE. It is seen from the tables that for the existing technique the RMSE value is 1.23154 for 68 clusters, RMSE
