Movie Recommender System Using K-Means Clustering AND K-Nearest Neighbor
3 authors:
Anand Nayyar
Duy Tan University
All content following this page was uploaded by Arun Solanki on 03 June 2020.
Abstract—In the field of Artificial Intelligence, Machine Learning provides automatic systems that learn and improve from experience without being explicitly programmed. In this research work, a movie recommender system is built using the K-Means Clustering and K-Nearest Neighbor algorithms. The MovieLens dataset is taken from Kaggle. The system is implemented in the Python programming language. The proposed work introduces various concepts related to machine learning and recommendation systems. In this work, various tools and techniques have been used to build recommender systems. Various algorithms such as K-Means Clustering, KNN, Collaborative Filtering, and Content-Based Filtering are described in detail. Further, after studying different types of machine learning algorithms, there is a clear picture of where to apply which algorithm in different areas of industry such as recommender systems, e-commerce, etc. The implementation and working of the proposed movie recommender system are then illustrated. Various building blocks of the proposed system, such as Architecture, Process Flow, Pseudo Code, Implementation, and Working of the System, are described in detail. Finally, different values of Root Mean Squared Error (RMSE) are obtained for different numbers of clusters. In the proposed work, as the number of clusters decreases, the value of RMSE also decreases. The best value of RMSE obtained is 1.081648. The results given by the proposed system are better than the existing technique on the basis of RMSE value.
Keywords—Recommender System, K-Means, KNN, Collaborative Filtering, Content-Based Filtering

... technique that is used in the development of recommendation systems is clustering. Clustering [20], [21] is a process of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters [11], [12], [13], [14]. K-Means Clustering [13], [23], [33] along with K-Nearest Neighbor [18], [24] is implemented on the MovieLens dataset in order to obtain the best-optimized result. In the existing technique the data is scattered, which results in a high number of clusters, while in the proposed technique the data is gathered, which results in a low number of clusters. The process of recommending a movie is optimized in the proposed scheme. The proposed recommender system predicts the user’s preference for a movie on the basis of different parameters. The recommender system works on the concept that people have common preferences or choices. These users influence each other’s opinions. This optimizes the process and yields a lower RMSE.
The work starts with Section I, the Introduction, which covers the basics of recommendation systems. Section II discusses the latest work by recent authors, with details of the techniques and tools used. Section III describes the evolution of the proposed recommendation system. Section IV shows the algorithm of the proposed system. Section V shows the implementation of the proposed system. Section VI discusses the working and results of the system with the help of snapshots of the system. Section VII contains the conclusion and future work of the proposed system.
978-1-5386-5933-5/19/$31.00 ©2019 IEEE
similar interests. This system (K-mean Cuckoo) has 0.68 MAE [15], [16]. In 2017, authors used a new approach that can solve the sparsity problem to a great extent [38]. In 2018, authors built a recommendation engine by analyzing rating data sets collected from Twitter to recommend movies to a specific user using R [39].

III. EVOLUTION OF PROPOSED MOVIE RECOMMENDATION SYSTEM
This section consists of the architecture and process flow of the proposed system.

A. Architecture
Figure 1 shows the architecture of the proposed system. It consists of three modules, namely an input module, a processing module, and an output module. Figure 1 gives a clear conceptual idea of the working of the proposed recommendation system. Each module is described in detail to explain the architecture of the system.

Processing Module
In this module, the pandas library first separates the data from the raw files. It separates the information about the users and the movie items into separate data frames. After the data is separated from its raw form, the utility matrix module builds a utility matrix, which defines which user rated which movie. This helps in figuring out how many times each movie is rated by the users. Then, based on the preprocessed data, separate data frames for the training set and the testing set are created. This is done to evaluate the performance of the system later. After getting the utility matrix, K-means clustering is used to build a separate data frame which shows which movie belongs to which genre. The Within-Cluster Sum of Squares (WCSS) is a measure of the variability of the observations within each cluster. In general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares. In the WCSS module, the right number of clusters is chosen using the Within-Cluster Sum of Squares technique. Next, to calculate the average rating given by each user to each cluster, a utility clustered matrix is created. In the utility clustered module, the utility clustered matrix is used to calculate the similarity between the users. The PCS (Pearson correlation similarity) and normalization module calculates the correlation using the utility clustered matrix. Finally, in the KNN module and similarity module, predictions for movie ratings are calculated using K-Nearest Neighbor with the help of the similarity matrix and the utility clustered matrix.
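As a rough illustration of the processing steps described above, the following Python sketch builds a utility matrix with pandas, computes the WCSS for several candidate cluster counts, and derives a utility clustered matrix. The toy ratings, the genre flags used as clustering features, and all variable names are assumptions made for this example; the paper does not list its code or state exactly which movie features were clustered.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy ratings in a MovieLens-style long format (hypothetical values).
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movie_id": [1, 2, 4, 1, 3, 2, 3, 4],
    "rating":   [5, 4, 1, 4, 2, 5, 1, 2],
})

# Binary genre flags per movie (assumed clustering features).
genres = pd.DataFrame(
    [[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]],
    index=pd.Index([1, 2, 3, 4], name="movie_id"),
    columns=["action", "comedy", "drama"],
)

# Utility matrix: one row per user, one column per movie, NaN where unrated.
utility = ratings.pivot(index="user_id", columns="movie_id", values="rating")

# WCSS (exposed by scikit-learn as inertia_) for each candidate cluster count.
wcss = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(genres).inertia_
    for k in range(1, 4)
}

# Cluster the movies with a chosen k, then average each user's ratings
# over the movies of each cluster to get the utility clustered matrix.
k = 2
movie_cluster = pd.Series(
    KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(genres),
    index=genres.index,
)
utility_clustered = utility.T.groupby(movie_cluster).mean().T
```

Plotting `wcss` against the number of clusters and looking for the point where the curve flattens is the usual way to read off a suitable cluster count from these values.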
Output Module
The output module presents the predicted movies that the input user might like. Along with the movies, the output also shows the predicted ratings that the input user might give to them.
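To make the predicted ratings concrete, here is a minimal sketch of how Pearson similarity and K-Nearest Neighbor could produce a predicted rating for one user. The matrix values, the `predict` helper, and the mean-centred weighted average are assumptions for illustration; the paper does not specify its exact prediction formula.

```python
import numpy as np
import pandas as pd

# Hypothetical utility clustered matrix: rows are users, columns are movie
# clusters, values are each user's average rating for that cluster.
utility_clustered = pd.DataFrame(
    {
        "c0": [4.5, 4.0, 1.5, 5.0],
        "c1": [1.0, 2.0, 5.0, 1.5],
        "c2": [3.0, 3.5, 2.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Pearson correlation between users, computed over the clusters both rated.
similarity = utility_clustered.T.corr(method="pearson")


def predict(user_id, cluster, top_n=2):
    """Predict a user's rating for a cluster from the top_n most similar users."""
    user_mean = utility_clustered.loc[user_id].mean()
    # Candidate neighbours: other users who have a rating for this cluster.
    rated = utility_clustered[cluster].drop(index=user_id).dropna()
    neighbours = similarity.loc[user_id, rated.index].nlargest(top_n)
    weights = neighbours.abs().sum()
    if weights == 0:
        return user_mean
    # Mean-centred, similarity-weighted average of the neighbours' ratings.
    neighbour_means = utility_clustered.loc[neighbours.index].mean(axis=1)
    deviation = (neighbours * (rated[neighbours.index] - neighbour_means)).sum()
    return user_mean + deviation / weights


predicted = predict(user_id=4, cluster="c2")
```

Here user 4 has no rating for cluster `c2`, so the value is estimated from the two users whose rating pattern correlates most strongly with user 4.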
Input Module
In this module, the user is asked to provide details as input. The user provides details such as userId, age, gender, and pin code. This information is then passed to the next module, i.e., the processing module.

B. Process Flow
Figure 2 shows the process flow diagram of the Movie Recommendation System. The process flow depicts how the system works, how it deals with the raw data, and how it predicts the ratings for the input userId.
Fig. 2: Process Flow Diagram
Step 1: The user gives the userId and information such as
264 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence)
gender, age, and pin code.
Step 2: Using the NumPy and pandas libraries, the raw data is preprocessed into separate data frames.
Step 3: The Within-Cluster Sum of Squares method is used to find the right number of clusters so that K-means clustering can be applied to the movies.
Step 4: After applying K-means clustering, a utility clustered matrix is built, which defines the average rating the user gives to each cluster.
Step 5: Using the utility clustered matrix and the Pearson correlation, the similarity between the users is calculated.
Step 6: Finally, KNN uses the utility clustered matrix and the similarity to predict the movies for the input user.

IV. ALGORITHM OF PROPOSED SYSTEM
The algorithm for the proposed system is as follows:
Step 1: Import the Python libraries: NumPy, pandas, Matplotlib, sklearn.
Step 2: Read the CSV information as data frames into the user and item variables.
Step 3: Split the data into the training set and test set as data frames in the variables rating and ratingTest.
Step 4: Create a utility matrix named utility, which tells which user rated which movie.
Step 5: Using the WCSS method, choose the right number of clusters so that the K-means clustering technique can be applied to classify the movies according to the number of clusters.
Step 6: Define the utility clustered matrix after applying the K-means clustering algorithm.
Step 7: Apply the Pearson correlation metric on the utility clustered matrix to calculate the similarity matrix between the users.
Step 8: Normalize the values stored in the utility matrix.
Step 9: The Guess() function takes two parameters as input, userID and topN users, and is used by KNN to predict the movie ratings for the topN similar users.
Step 10: The ratingTest data frame ratings are used for comparison while using the Guess() function to predict the ratings of test users.
Step 11: RMSE is calculated to evaluate the accuracy of the model.

V. IMPLEMENTATION
The system has been implemented in the Python programming language using the K-Means clustering library and K-Nearest Neighbor. The implementation of the system consists of several sub-sections, which are standard processes to be followed while solving any machine learning problem [17], [19], [22], [27], [28], [29], [30], [31], [32], [34]. These are as follows:
1) Data Collection
2) Data Preparation
3) Model Creation
4) Model Training
5) Model Testing

A. Data Collection
The first step in the process of implementation is the data collection step. In this step, the right dataset is chosen for further computations. For the movie recommendation system, the MovieLens dataset is taken from the Kaggle website. The dataset consists of 100,000 movie ratings (1-5) from 943 users on 1682 movies. With this information, further computations are done using the Python programming language.

B. Data Preparation
The second step in the process of implementation is the data preparation step. In this step, data preprocessing is done. It produces the utility matrix, which tells which user rated which movie. This is done by first separating the user data and movie data into separate data frames. Then, using both data frames, the utility matrix is created.

C. Data Creation
The third step in the process of implementation is the data creation step. In this step, the K-Means clustering model is applied. The right number of clusters is chosen using the WCSS method. After choosing the right number of clusters, the movies are divided into clusters by applying the K-Means clustering model. This leads to the creation of the utility clustered matrix.

D. Data Training
The fourth step in the process of implementation is the data training step. In this step, the utility clustered matrix is normalized. Then the similarity between the users is calculated using the Pearson correlation matrix. Then, using KNN [18], the movie ratings are predicted for the top N users.

E. Data Testing
The fifth step in the process of implementation is the data testing step. In this step, the movie ratings are predicted for the test users. This is done to evaluate the model using an evaluation metric.

VI. WORKING AND RESULTS OF PROPOSED SYSTEM
The working of the proposed system is discussed using the following steps:
Step 1: In this step, the user information is taken as input: the userId and information such as gender, age, and pin code, as shown in Figure 3.
Fig. 3: Utility Matrix
Step 2: Then, using the NumPy and pandas libraries, the raw data is preprocessed into separate data frames, as shown in Figure 4.
Fig. 6: Utility Clustered Matrix
Step 5: Using the utility clustered matrix and the Pearson correlation, the similarity between the users is calculated, as shown in Figure 7.
Fig. 8: Output

TABLE I: Results (K-means + KNN)
Number of clusters    Root Mean Squared Error
19                    2.504990
18                    2.375555
17                    2.337194
16                    2.416212
15                    2.256299
14                    2.080751
13                    1.994332
12                    1.928682
11                    1.861167
10                    1.820095
9                     1.625027
8                     1.493939
7                     1.441855
6                     1.439451
5                     1.269583
4                     1.166091
3                     1.141065
2                     1.081648
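The RMSE values reported in Table I follow the standard definition: the square root of the mean of the squared differences between actual and predicted ratings. The sketch below shows that calculation and how the best cluster count can be read from a few rows of Table I. The `rmse` helper and the sample ratings are hypothetical; only the `table_i` values are taken from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical test ratings and model predictions (not taken from the paper).
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]
score = rmse(actual, predicted)

# RMSE per number of clusters, copied from a few rows of Table I.
table_i = {19: 2.504990, 10: 1.820095, 5: 1.269583, 2: 1.081648}
best_k = min(table_i, key=table_i.get)
```

A lower RMSE means the predicted ratings are closer to the ratings held out in the test set, which is why the smallest value in Table I identifies the preferred number of clusters.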
A. Comparison with Existing Technology
Tables II and III compare the results of the proposed system with the existing technique. These tables show a comparison of RMSE with the existing technique, i.e., cuckoo search. It is seen from the tables that for the existing technique the RMSE value is 1.23154 for 68 clusters, while the RMSE value using the proposed technique is 1.233 for 19 clusters and 1.081648 for 2 clusters.

TABLE II: RMSE in Existing Technique
Root Mean Squared Error    No. of Clusters
1.23154                    68

TABLE III: RMSE in Proposed Technique
Root Mean Squared Error    No. of Clusters
1.2333                     19
1.081648                   2

Fig. 9: Comparison Graph with the Existing Technique

Figure 9 compares the RMSE value of the existing technique with the RMSE value of the proposed technique. The X-axis represents the number of clusters and the Y-axis represents the RMSE values. It is seen from the graph that for the existing technique the RMSE value is 1.23154 for 68 clusters, while the RMSE value using the proposed technique is 1.233 for 19 clusters and 1.081648 for 2 clusters.

VII. CONCLUSION
Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention [25]. In this proposed system, a movie recommender system is built using the K-Means Clustering and K-Nearest Neighbor algorithms. The data are taken from the MovieLens data set. The system is implemented in the Python programming language. It is seen that the RMSE value of the proposed technique is better than that of the existing technique. It is also seen that the proposed system achieves the same RMSE value as the existing technique but with fewer clusters. The proposed work can be improved using more data sets. The concept of sentiment analysis can be used in the future to enhance the efficiency of the movie recommendation system, so that the model can be tuned to accommodate more situations. In the future, individual characteristics that are hidden in the recommendations of the users may be removed.

REFERENCES
[1] Goel A., Khandelwal D., Mundhra J., Tiwari R. (2018) Intelligent and Integrated Book Recommendation and Best Price Identifier System Using Machine Learning. In: Bhateja V., Coello Coello C., Satapathy S., Pattnaik P. (eds) Intelligent Engineering Informatics. Advances in Intelligent Systems and Computing, vol 695. Springer, Singapore
[2] Bao J., Zheng Y. (2017) Location-Based Recommendation Systems. In: Shekhar S., Xiong H., Zhou X. (eds) Encyclopedia of GIS. Springer, Cham
[3] Chavarriaga O., Florian-Gaviria B., Solarte O. (2014) A Recommender System for Students Based on Social Knowledge and Assessment Data of Competences. In: Rensing C., de Freitas S., Ley T., Muñoz-Merino P.J. (eds) Open Learning and Teaching in Educational Communities. EC-TEL 2014. Lecture Notes in Computer Science, vol 8719. Springer, Cham
[4] F.O. Isinkaye et al., Recommendation systems: Principles, methods and evaluation, Egyptian Informatics Journal, Volume 16, Issue 3, November 2015, Pages 261-273
[5] H. Drachsler, T. Bogers, R. Vuorikari, K. Verbert, E. Duval, N. Manouselis, G. Beham, S. Lindstaedt, H. Stern, M. Friedrich, et al. Issues and considerations regarding sharable data sets for recommender systems in technology enhanced learning. Procedia Computer Science, 1(2):2849–2858, 2010.
[6] Álvaro Tejeda-Lorente, A quality based recommender system to disseminate information in a university digital library, Information Sciences, Volume 261, 10 March 2014, Pages 52-69
[7] Trang Tran, T.N., Atas, M., Felfernig, A. et al. J Intell Inf Syst (2018) 50: 501. https://doi.org/10.1007/s10844-017-0469-0
[8] Xin Luo, Mengchu Zhou, Yunni Xia, and Qingsheng Zhu, An Efficient Non-Negative Matrix-Factorization-Based Approach to Collaborative Filtering for Recommender Systems, IEEE Transactions on Industrial Informatics (Volume: 10, Issue: 2, May 2014)
[9] Badsha, S., Yi, X., Khalil, I. Data Sci. Eng. (2016) 1: 161. https://doi.org/10.1007/s41019-016-0020-2
[10] Farman Ullah, Ghulam Sarwar, Sung Chang Lee, Yun Kyung Park, Kyeong Deok Moon, Jin Tae Kim, Hybrid Recommender System with Temporal Information, The International Conference on Information Network, 2012, DOI: 10.1109/ICOIN.2012.6164413
[11] Jing Jiang, Jie Lu, Guangquan Zhang, Guodong Long, Scaling-up Item-based Collaborative Filtering Recommendation Algorithm based on hadoop, 2011 IEEE World Congress on Services, 4-9 July 2011, 10.1109/SERVICES.2011.66
[12] Vibhor Kant, Kamal K. Bharadwaj, Enhancing Recommendation Quality of Content-based Filtering through Collaborative Predictions and Fuzzy Similarity Measures, Procedia Engineering, Volume 38, 2012, Pages 939-942.
[13] Jiang Z., Zang W., Liu X. (2016) Research of K-means Clustering Method Based on DNA Genetic Algorithm and P System. In: Zu Q., Hu B. (eds) Human Centered Computing. HCC 2016. Lecture Notes in Computer Science, vol 9567. Springer, Cham
[14] Sanjoy K. Sinha, Nan M. Laird, Garrett M. Fitzmaurice, Multivariate logistic regression with incomplete covariate and auxiliary information, Elsevier, 2010
[15] D.A. Adeniyi, Z. Wei, Y. Yongquan, Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN) classification method, Saudi Computer Society, King Saud University, October 2014
[16] Rahul Kataria, Om Prakash Verma, An effective collaborative movie recommender system with cuckoo search, Egyptian Informatics Journal, 2016, Volume 18, Issue 2, July 2017, Pages 105-112 https://doi.org/10.1016/j.eij.2016.10.002
[17] Czarnowski I., Jędrzejowicz P. (2008) Data Reduction Algorithm for Machine Learning and Data Mining. In: Nguyen N.T., Borzemski L., Grzech A., Ali M. (eds) New Frontiers in Applied Artificial Intelligence. IEA/AIE 2008. Lecture Notes in Computer Science, vol 5027. Springer, Berlin, Heidelberg
[18] Vejmelka M., Hlaváčková-Schindler K. (2007) Mutual Information Estimation in Higher Dimensions: A Speed-Up of a k-Nearest Neighbor Based Estimator. In: Beliczynski B., Dzielinski A., Iwanowski M., Ribeiro B. (eds) Adaptive and Natural Computing Algorithms. ICANNGA 2007. Lecture Notes in Computer Science, vol 4431. Springer, Berlin, Heidelberg
[19] Duarte D., Ståhl N. (2019) Machine Learning: A Concise Overview. In: Said A., Torra V. (eds) Data Science in Practice. Studies in Big Data, vol 46. Springer, Cham
[20] Fan Y., Dong L., Sun X., Wang D., Qin W., Aizeng C. (2018) Research on Auto-Generating Test-Paper Model Based on Spatial-Temporal Clustering Analysis. In: Huang DS., Jo KH., Zhang XL. (eds) Intelligent Computing Theories and Application. ICIC 2018. Lecture Notes in Computer Science, vol 10955. Springer, Cham
[21] Kushwaha N., Pant M. (2019) A Teaching–Learning-Based Particle Swarm Optimization for Data Clustering. In: Tanveer M., Pachori R. (eds) Machine Intelligence and Signal Analysis. Advances in Intelligent Systems and Computing, vol 748. Springer, Singapore
[22] Howley T., Madden M.G., O’Connell ML., Ryder A.G. (2006) The Effect of Principal Component Analysis on Machine Learning Accuracy with High Dimensional Spectral Data. In: Macintosh A., Ellis R., Allen T. (eds) Applications and Innovations in Intelligent Systems XIII. SGAI 2005. Springer, London
[23] Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C 28(1), 100–108 (1979)
[24] J. Laaksonen and E. Oja, Classification with learning k-nearest neighbors, Proceedings of International Conference on Neural Networks (ICNN’96), 3-6 June 1996, 10.1109/ICNN.1996.549118
[25] Anna L. Buczak et al., A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection, IEEE Communications Surveys Tutorials (Volume: 18, Issue: 2, Secondquarter 2016), DOI: 10.1109/COMST.2015.2494502
[26] C. Kilgus et al., Root-Mean-Square Error in Encoded Digital Telemetry, IEEE Transactions on Communications (Volume: 20, Issue: 3, Jun 1972), DOI: 10.1109/TCOM.1972.1091174
[27] Alexandra L’Heureux et al., Machine Learning With Big Data: Challenges and Approaches, IEEE Access (Volume: 5), Page(s): 7776-7797, DOI: 10.1109/ACCESS.2017.2696365
[28] Bernard Marr, What Is The Difference Between Artificial Intelligence And Machine Learning, 6 December 2016, accessed 8 Feb 2018 from: https://www.forbes.com/sites/bernardmarr/2016/12/06/what-is-the-difference-between-artificial-intelligence-and-machine-learning/7e94a51a2742
[29] Nick Mccrea, An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples, Aug 2016, accessed Feb 2018 from: https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer
[30] Miroslav Kubat, An Introduction to Machine Learning, https://doi.org/10.1007/978-3-319-63913-0, Springer International Publishing AG 2017, Print ISBN 978-3-319-63912-3, Online ISBN 978-3-319-63913-0
[31] Priyadharshini, Machine Learning: What it is and Why it Matters, March 2018, accessed April 2018 from: https://www.decypher.com/machine-learning-matters/
[32] Fabien Dubosson et al., A Python Framework for Exhaustive Machine Learning Algorithms and Features Evaluations, 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), 23-25 March 2016, ISSN: 1550-445X, DOI: 10.1109/AINA.2016.160
[33] Jianpeng Qi et al., K*-Means: An Effective and Efficient K-Means Clustering Algorithm, 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), 8-10 Oct. 2016, DOI: 10.1109/BDCloud-SocialCom-SustainCom.2016.46
[34] Anna L. Buczak, A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection, IEEE Communications Surveys Tutorials, Vol. 18, No. 2, Second Quarter 2016.
[35] Nguyen N.T., Rakowski M., Rusin M., Sobecki J., Jain L.C. (2007) Hybrid Filtering Methods Applied in Web-Based Movie Recommendation System. In: Apolloni B., Howlett R.J., Jain L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2007. Lecture Notes in Computer Science, vol 4692. Springer, Berlin, Heidelberg
[36] Ko SK. et al. (2011) A Smart Movie Recommendation System. In: Smith M.J., Salvendy G. (eds) Human Interface and the Management of Information. Interacting with Information. Human Interface 2011. Lecture Notes in Computer Science, vol 6771. Springer, Berlin, Heidelberg
[37] Wei D., Junliang C. (2013) The Bayesian Network and Trust Model Based Movie Recommendation System. In: Du Z. (eds) Intelligence Computation and Evolutionary Computation. Advances in Intelligent Systems and Computing, vol 180. Springer, Berlin, Heidelberg
[38] Mishra N., Chaturvedi S., Mishra V., Srivastava R., Bargah P. (2017) Solving Sparsity Problem in Rating-Based Movie Recommendation System. In: Behera H., Mohapatra D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 556. Springer, Singapore
[39] Das D., Chidananda H.T., Sahoo L. (2018) Personalized Movie Recommendation System Using Twitter Data. In: Pattnaik P., Rautaray S., Das H., Nayak J. (eds) Progress in Computing, Analytics and Networking. Advances in Intelligent Systems and Computing, vol 710. Springer, Singapore
[40] Lops P., de Gemmis M., Semeraro G. (2011) Content-based Recommender Systems: State of the Art and Trends. In: Ricci F., Rokach L., Shapira B., Kantor P. (eds) Recommender Systems Handbook. Springer, Boston, MA
[41] Hatami, M., Pashazadeh, S. (2014) Improving results and performance of collaborative filtering-based recommender systems using cuckoo optimization algorithm, Int J Comput Appl, Volume 88, Pages 46-51
[42] Z. Huang, D. Zeng, H. Chen (2007) A comparison of collaborative-filtering algorithms for e-commerce, IEEE Intell Syst, 22, pp. 68-78
[43] R. Burke (2007) Hybrid web recommender systems, Adapt Web, pp. 377-408, 10.1007/978-3-540-72079-9_12