Language Detector: Bachelor of Engineering (Sem-VIII)
on
“Language Detector”
Submitted in partial fulfillment of the requirements
of the degree of
Dr. D. R. Ingle
CERTIFICATE
This is to certify that
“Language Detector”
as prescribed by the University of Mumbai, for the award of the degree of Bachelor
of Engineering in Computer Engineering
Sr No Title
1 Introduction
6 References
ACKNOWLEDGEMENT
I take this opportunity to express my deepest gratitude and appreciation to all those who have helped me
directly or indirectly towards the successful completion of this dissertation report.
It is a great pleasure and moment of immense satisfaction for me to express my profound gratitude to my
dissertation Project Guide, Prof. D. R. Ingle whose constant encouragement enabled me to work enthusiastically.
His perpetual motivation, patience and excellent expertise in discussion during progress of the dissertation work
have benefited me to an extent, which is beyond expression. I am highly indebted to him for his invaluable
guidance and ever-ready support in the successful completion of this dissertation in time. Working under his
guidance has been a fruitful and unforgettable experience. Despite his busy schedule, he was always available to
give me advice, support and guidance during the entire period of my project. The completion of this project would
not have been possible without his encouragement, patient guidance and constant support. I express my deepest
sense of gratitude and thanks to Prof. D. R. Ingle for his continuous support and guidance throughout this work.
I am thankful to Prof. D. R. Ingle, Head of Computer Engineering Department, for his guidance,
encouragement and support during my project. I would like to mention here that he was instrumental in making
available all the needed resources throughout my project. I am highly indebted to him for his kind support.
I am also thankful to Dr. Sandhya Jadhav, Principal, for the encouragement and for providing an outstanding
academic environment and adequate facilities. I acknowledge all the staff members of the
Department of Computer Engineering for their valuable guidance; their interest and valuable suggestions
encouraged me.
No words are sufficient to express my gratitude to my beloved Parents for their unwavering
encouragement in all my work. I also thank all my friends for being a constant source of support.
Name : Tushar Wankhede (Roll.No. 75)
Devesh Upadhayay (Roll.No. 70)
Abhishek Singh (Roll.No. 64)
Prasad Shinde (Roll.No. 62)
Introduction:
Natural Language Processing (NLP) is the field concerned with processing human language and
text data. One NLP application is language identification, a technique for determining the
language in which a text document is written. Many real-world applications, such as chatbots and
comment and feedback forums, hold large amounts of unstructured data in several
languages at once. Analyzing this data and extracting the essential information from it can
help boost revenue, yield insights, or improve customer support. Before anyone can analyze
such data, however, the language it is written in must be recognized. Another area of
application is online video conferencing, where speech in one language must be identified so
that it can be translated into another. For all these applications, a language identifier
is extremely important.
About the dataset:
In this project we use the Language Detection dataset available on Kaggle. It is a small
language detection dataset containing text samples in 17 different languages, which we use
to build an NLP model that predicts which of the 17 languages a text is written in.
Languages
1) English
2) Malayalam
3) Hindi
4) Tamil
5) Kannada
6) French
7) Spanish
8) Portuguese
9) Italian
10) Russian
11) Swedish
12) Dutch
13) Arabic
14) Turkish
15) German
16) Danish
17) Greek
Using the text, we have to create a model that can predict the language of a given text.
Language identification is a building block for many artificial intelligence applications
and is of interest to computational linguists. Prediction systems of this kind are widely
used in electronic devices such as mobile phones and laptops for machine translation, and
also in robots. They also help in tracking and identifying multilingual documents.
ALGORITHM USED FOR MODEL CREATION :
We use the Multinomial Naive Bayes algorithm (from the naive_bayes module of sklearn) for
model creation. Multinomial Naive Bayes is a probabilistic learning method that is widely
used in Natural Language Processing (NLP). The algorithm is based on Bayes' theorem and
predicts the tag (class) of a text such as an email or a newspaper article. It calculates the
probability of each tag for a given sample and outputs the tag with the highest probability.
Naive Bayes is a family of algorithms that share one common principle: every feature is
assumed to be independent of every other feature, given the class. In other words, the
presence or absence of one feature is assumed not to affect the presence or absence of any
other feature.
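To make the idea concrete, the following is a minimal pure-Python sketch of Multinomial Naive Bayes scoring. It is not the code used in this project: the function names, the toy sentences and the smoothing value alpha=1.0 are assumptions chosen for illustration. Each class score is the log prior plus the sum of the log word likelihoods, which is where the independence assumption enters.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(texts, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            # words never seen during training are ignored
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# toy data (hypothetical, not taken from the Kaggle dataset)
toy_texts = [
    "the cat is on the table",
    "where is the station",
    "el gato esta en la mesa",
    "donde esta la estacion",
]
toy_labels = ["English", "English", "Spanish", "Spanish"]

log_prior, log_likelihood = train_multinomial_nb(toy_texts, toy_labels)
print(predict("where is the cat", log_prior, log_likelihood))
print(predict("donde esta el gato", log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps an unseen word from forcing a class probability to zero.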
Implementation
So let’s get started. First of all, we will import all the required libraries.
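The import and loading code is not reproduced in this report. A minimal sketch of this step, assuming pandas is used, could look as follows. The CSV file name in the comment is an assumption, and a tiny in-memory CSV with the same two columns (Text and Language) stands in for the real file so that the sketch runs on its own.

```python
import io

import pandas as pd

# In the project the data would be read from the Kaggle CSV file, for example:
#   data = pd.read_csv("Language Detection.csv")   # file name is an assumption
# Here a small in-memory CSV with the same two columns stands in for it.
sample_csv = io.StringIO(
    "Text,Language\n"
    "Nature is everything around us,English\n"
    "La nature est tout ce qui nous entoure,French\n"
    "The weather is nice today,English\n"
)
data = pd.read_csv(sample_csv)

print(data.shape)   # number of rows and columns
print(data.head())  # first few samples
```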
data["Language"].value_counts()
Output :
English       1385
French        1014
Spanish        819
Portugeese     739
Italian        698
Russian        692
Sweedish       676
Malayalam      594
Dutch          546
Arabic         536
Turkish        474
German         470
Tamil          469
Danish         428
Kannada        369
Greek          365
Hindi           63
Name: Language, dtype: int64
Separating Independent and Dependent features
Now we can separate the dependent and independent variables. Here the text data is the
independent variable and the language name is the dependent variable.
X = data["Text"]
y = data["Language"]
Label Encoding
Our output variable, the language name, is a categorical variable. To train the model
we have to convert it into numerical form, so we perform label encoding
on the output variable. For this, we import LabelEncoder from sklearn.
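A minimal sketch of label encoding with LabelEncoder is given below. The short list of language names is a stand-in chosen for illustration; in the project the encoder would be fitted on the full Language column.

```python
from sklearn.preprocessing import LabelEncoder

# stand-in for the Language column (illustrative values only)
languages = ["English", "French", "Hindi", "English", "French"]

le = LabelEncoder()
y_encoded = le.fit_transform(languages)

print(list(le.classes_))                 # classes are stored in sorted order
print(list(y_encoded))                   # integer code for each sample
print(list(le.inverse_transform([2])))   # map a code back to its language name
```

Keeping the fitted encoder is useful later, because inverse_transform turns the numeric predictions of the model back into language names.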
Text Preprocessing
The dataset was created by scraping Wikipedia, so it contains many unwanted
symbols and numbers that will affect the quality of our model. We should therefore apply
text preprocessing techniques.
# creating a list for appending the preprocessed text
data_list = []
# iterating through all the text
for text in X:
    # removing the symbols and numbers
    text
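The preprocessing loop is cut off in this report. One possible way to complete it is sketched below. The function name preprocess and the exact set of characters removed (ASCII punctuation and ASCII digits) are assumptions, not the pattern used in the project. An explicit character list is used instead of \w, because \w in Python's re module does not match the combining vowel signs of scripts such as Hindi or Tamil and would damage those words.

```python
import re
import string

# assumption: remove ASCII punctuation and ASCII digits only
UNWANTED = re.compile("[" + re.escape(string.punctuation + string.digits) + "]")


def preprocess(texts):
    """Return lower-cased texts with symbols and numbers removed."""
    data_list = []
    for text in texts:
        text = UNWANTED.sub(" ", text)             # symbols and numbers -> space
        text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
        data_list.append(text.lower())
    return data_list


print(preprocess(["Hello, World! 123", "L'été 2021"]))
```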
Bag of Words
Not only the output feature but also the input feature must be in
numerical form. So we convert the text into numerical form by creating a Bag of Words model.
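A minimal sketch of building a Bag of Words representation is given below, assuming CountVectorizer from sklearn is used for this step (the report does not name the class). The two short sentences are illustrative. Each row of the result is one text and each column counts one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# illustrative corpus, not taken from the dataset
corpus = ["the cat sat", "the dog sat down"]

cv = CountVectorizer()
bow = cv.fit_transform(corpus)   # sparse matrix of word counts

print(sorted(cv.vocabulary_))    # vocabulary words in column order
print(bow.toarray())             # one row of counts per text
```

The same fitted vectorizer must be reused with transform on any new text, so that the columns keep the same meaning.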
Train Test Split
We have preprocessed our input and output variables. The next step is to create the training
set, for training the model, and the test set, for evaluating the model. For this, we split
the data into a training portion and a test portion.
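A minimal sketch of the split is given below, assuming train_test_split from sklearn. The 80/20 ratio and the random_state value are assumptions, since the report does not state them, and the ten dummy samples only stand in for the real features and labels.

```python
from sklearn.model_selection import train_test_split

# dummy features and labels standing in for the Bag of Words matrix and encoded y
X_demo = [[i] for i in range(10)]
y_demo = [0, 1] * 5

# test_size=0.20 and random_state=42 are assumed values
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print(len(x_train), len(x_test))
```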
Model Creation
We are almost there: the model creation part. We use the Multinomial Naive Bayes algorithm
for model creation, and then train the model on the training set.
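A minimal sketch of training and using the model is given below, assuming MultinomialNB from sklearn.naive_bayes together with CountVectorizer. The four toy sentences are illustrative, and string labels are used directly for brevity instead of label-encoded values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training data (hypothetical, not taken from the Kaggle dataset)
train_texts = [
    "the cat is on the table",
    "where is the station",
    "el gato esta en la mesa",
    "donde esta la estacion",
]
train_labels = ["English", "English", "Spanish", "Spanish"]

cv = CountVectorizer()
x_train = cv.fit_transform(train_texts)

model = MultinomialNB()
model.fit(x_train, train_labels)

# new text must go through the same fitted vectorizer before prediction
x_new = cv.transform(["where is the cat", "donde esta el gato"])
predictions = model.predict(x_new)
print(list(predictions))
```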
Model Evaluation
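The variable ac printed in this section is not defined in the report. A minimal sketch of how it could be computed is given below, assuming accuracy_score and confusion_matrix from sklearn.metrics. The label lists are toy values, so the accuracy printed here is not the accuracy of the project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# toy true labels and predictions (illustrative only)
y_test = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]

ac = accuracy_score(y_test, y_pred)    # fraction of correct predictions
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

print("Accuracy is :", ac)
print(cm)
```

The confusion matrix shows which languages are mistaken for each other, which the accuracy value alone does not reveal.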
print("Accuracy is :",ac)
# Accuracy is : 0.9772727272727273
The output is shown below: