Movie Recommendation - 01
Stemming and lemmatization in Python NLTK are text normalization techniques for natural language processing.
These techniques are widely used for text preprocessing. The difference between stemming and
lemmatization is that stemming is faster because it cuts words without knowing the context, whereas
lemmatization is slower because it considers the context of words before processing. Stemming is a
method of normalizing words in natural language processing. It is a technique in which a set of words in
a sentence is converted into a sequence of shortened forms to speed up later operations. In this method,
words that have the same meaning but vary according to the context or sentence
are normalized.
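The contrast between the two techniques can be sketched in a few lines of plain Python. This is a toy illustration, not NLTK's implementation: the suffix list, the lookup table and the function names `toy_stem` and `toy_lemmatize` are all assumptions made for the example.

```python
# Toy illustration only: a real project would use NLTK's stemmers and lemmatizers.

# Hypothetical, deliberately tiny rule set for suffix stripping.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix without looking at any context."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("better", "adj"): "good",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up together with its part of speech (its context)."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


stemmed = [toy_stem(w) for w in ("studies", "meeting", "better")]
lemma_noun = toy_lemmatize("meeting", "noun")
lemma_verb = toy_lemmatize("meeting", "verb")
```

The stemmer always cuts "meeting" to "meet", while the lemmatizer returns a different result for the noun and the verb, which is why it needs more information and more time.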
C. Training and Validation
The processed data are then stored in a “.csv” file for further use. The processed dataset is divided into two
parts:
• Training (70% of the dataset)
• Testing (30% of the dataset)
Next comes the training of the models. The classification models are trained and tested to obtain their
accuracy. Once the accuracy has been measured, validation is performed to further improve the efficiency of the
project.
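A 70/30 split of this kind might be sketched as follows. The function name `train_test_split_rows`, the fixed seed and the ten placeholder rows are assumptions for illustration; the paper does not state how the split was implemented.

```python
import random


def train_test_split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the processed rows of the ".csv" file.
rows = list(range(10))
train_rows, test_rows = train_test_split_rows(rows)
```

Shuffling before cutting avoids a split that merely follows the order in which the rows were stored.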
D. Prediction
The algorithms are assessed on accuracy and performance, and the system then tells the user whether a
movie is suggested or not, based on the user’s interest.
E. Result
The final result gives the recommendation of the movies.
IV. ALGORITHMS
The algorithms used in movie recommendation include Count Vectorizer and Cosine Similarity.
A. COUNT VECTORIZER
In order to use textual data for predictive modelling, the text must first be split into individual words or
tokens; this process is called tokenization. These words then need to be encoded as integers or floating-point
values to be used as inputs to machine learning algorithms. This process is called feature extraction (or vectorization).
Scikit-learn’s CountVectorizer is used to convert a collection of text documents to a vector of term/token counts.
It also enables the pre-processing of text data before generating the vector representation. This
functionality makes it a highly flexible feature representation module for text.
e.g.
text = ['Hello my name is james, this is my jupyter notebook']
The text is transformed to a sparse matrix of token counts, with one column per vocabulary term.
Count vectorizer makes it easy for text data to be used directly in machine learning and deep learning models
for tasks such as text classification.
Text vectorization is the process of converting text into a numerical representation. Vectorization is jargon for a
classic approach of converting input data from its raw format (i.e. text) into vectors of real numbers, which is
the format that ML models support. Here we use the Bag of Words
technique to convert text to vectors.
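The Bag of Words idea can be sketched without any library. The code below is a plain-Python approximation, not scikit-learn itself: it lowercases the text, keeps tokens of two or more word characters and sorts the vocabulary alphabetically, which are the default behaviours of CountVectorizer, but it returns ordinary lists rather than a sparse matrix. The function name `bag_of_words` is an assumption.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(documents):
    """Return (vocabulary, count rows) for a list of text documents."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # One column per distinct token, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


text = ["Hello my name is james, this is my jupyter notebook"]
vocabulary, counts = bag_of_words(text)
# vocabulary -> ['hello', 'is', 'james', 'jupyter', 'my', 'name', 'notebook', 'this']
# counts     -> [[1, 2, 1, 1, 2, 1, 1, 1]]
```

The words "is" and "my" each occur twice in the sentence, so their columns hold 2 and every other column holds 1.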
B. COSINE SIMILARITY
The cosine similarity measurement begins by finding the cosine of the angle between two non-zero vectors.
The output is a value ranging from -1 to 1, indicating similarity, where -1 is non-similar,
0 is orthogonal (perpendicular), and 1 represents total similarity. If two vectors are diametrically
opposed, meaning they are oriented in exactly opposite directions, then the similarity measurement is -1.
Cosine similarity is often used in positive space, between the bounds 0 and 1. Cosine
similarity is not concerned with, and does not measure, differences in magnitude (length); it is only a representation of
similarities in orientation. The library contains both procedures and functions to calculate similarity between
sets of data. The function is best used when calculating the similarity between small numbers of sets. The
procedures parallelize the computation and are therefore more appropriate for computing similarities
on larger datasets.
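For reference, the standard definition can be written as below. The symbols A and B (the two vectors), n (their dimension) and θ (the angle between them) are notation introduced here for illustration.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

Dividing by both lengths is what makes the measure independent of magnitude.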
Theoretically, the cosine similarity can be any number between -1 and +1 because of the image of
the cosine function. In this case, however, there will not be any negative movie rating, so
the angle θ will be between 0° and 90°, bounding the cosine similarity between 0 and 1. If the angle
θ = 0°, the cosine similarity is 1; if θ = 90°, the cosine similarity is 0.
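The two boundary cases can be checked with a short sketch. The function name `cosine_similarity` and the sample vectors are assumptions chosen for the example; the vectors contain no negative entries, as with count data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: the angle is 0 degrees.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No shared non-zero component: the angle is 90 degrees.
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
```

Here `same_direction` comes out at approximately 1 (up to floating-point rounding) and `orthogonal` at 0.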
C. Cross Validation
In machine learning, we cannot fit the model on the training data and then assume that it
will work accurately on real data. We must make sure that our model has learned
the correct patterns from the data and is not picking up too much noise. For this purpose, we
use the cross-validation technique.
Cross-validation is a technique in which we train our model using a subset of the
dataset and then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:
• Reserve some portion of the sample dataset.
• Train the model using the rest of the dataset.
• Test the model using the reserved portion of the dataset.
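The three steps above can be sketched as a loop. The example uses k-fold cross-validation, one common variant that repeats the steps so that each part of the data is reserved once; the paper does not say which variant was used. The names `k_fold_indices` and `cross_validate`, and the toy mean-value model, are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))           # reserved portion
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(samples, k, fit, score):
    """Train on the rest, test on the reserved fold, and average the scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx])
        scores.append(score(model, [samples[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Toy model: predict the mean of the training values; score is negative mean error.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean_score = cross_validate(
    samples,
    k=3,
    fit=lambda train: sum(train) / len(train),
    score=lambda model, test: -sum(abs(x - model) for x in test) / len(test),
)
```

Averaging over several reserved portions gives a steadier estimate than a single split, at the cost of training the model k times.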
V. PROPOSED METHODOLOGY
The methodology of the project is designed in six steps:
• Installing the Python and SciPy platform. We need to mount our “.ipynb” file on Google Drive for further
access.
• Loading the dataset. The movie recommendation dataset needs to be imported in “.csv” format.
• Summarizing the dataset. Sorting and cleaning the data is a necessary step for increasing
the efficiency of the project. We can fill in missing data using the “imputer” function.
• Visualizing the dataset. We can visualize our “tmdb_5000_movies.csv” and “tmdb_5000_credits”
datasets through Kaggle.com and then apply preprocessing to them.
• Evaluating some algorithms. After visualizing the dataset comes the training and testing part. We
divide the data in a 7:3 ratio, where 70% of the data is used for training
and 30% for testing. We then choose suitable models and train them to obtain the
accuracy of the prediction. We have used two models: Count Vectorizer and Cosine Similarity. After
obtaining the accuracy of each model and comparing them, we cross-validate the models.
• Making some predictions. Now comes the last stage of the project, i.e., making predictions. Here, the user
can manually provide input and obtain movie recommendations according to his/her interest.
For the content-based recommender system specifically, we attempt to find a new
way to improve the accuracy of the representation of a movie and recommend the top five similar movies to the user
according to the user’s interest. To make the project more user-friendly, we have also designed a frontend.
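The prediction stage, combining token counts with cosine similarity to return the top five similar movies, can be sketched as below. The movie titles and tag strings are made up for illustration and do not come from the TMDB files; the function names `vectorize`, `cosine` and `recommend` are likewise assumptions, and the paper's own implementation may differ.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words counts for one description."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(u, v):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def recommend(title, descriptions, top_n=5):
    """Return the top_n titles most similar to the chosen one."""
    query = vectorize(descriptions[title])
    scored = [
        (cosine(query, vectorize(text)), other)
        for other, text in descriptions.items()
        if other != title
    ]
    # Highest similarity first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]


# Hypothetical catalogue: title -> combined tags.
movies = {
    "Space Quest": "space adventure alien crew",
    "Star Voyage": "space adventure crew rescue",
    "Alien Dawn": "alien invasion horror",
    "Love in Paris": "romance paris comedy",
    "Laugh Track": "comedy standup friends",
    "Deep Ocean": "ocean adventure submarine crew",
    "Courtroom": "legal drama trial",
}
top_five = recommend("Space Quest", movies)
```

With this catalogue the first three suggestions are "Star Voyage", "Deep Ocean" and "Alien Dawn", since they share the most tags with the chosen movie; the remaining two places go to titles with zero similarity, which a real system might prefer to filter out.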
VII. CONCLUSION
The main motivation for this project is to improve the everyday experience of choosing
movies, which are diverse and unique. We have successfully obtained the top five
recommended movies based on the user’s selection. We developed the movie recommendation model
using machine learning algorithms.
Hence, our project “Movie recommendation system” is justified.
www.irjmets.com @International Research Journal of Modernization in Engineering, Technology and Science