Termpaper
KISEJJERE RASHID
MAKERERE UNIVERSITY
2100711543
21/U/11543/EVE
[email protected]
A. Research Gaps
Below are some of the major research gaps in the field of machine translation.
Limited Training Data: The quality of AI-powered translations is heavily dependent on the amount and quality of the training data used to train the model. Further research is needed to explore methods for obtaining high-quality training data.
Lack of Cultural Sensitivity: AI-powered translation systems can produce translations that are grammatically correct but lack the cultural sensitivity of human translations. This can result in translations that are culturally inappropriate or that do not accurately convey the original message.
Vulnerability to Errors: A machine learning system can only understand what it has been trained on, so in cases where the input is not similar to the training data, it can easily produce undesired results.
B. Contributions of This Paper
a) One of the major aims of this paper is to lay a foundation for further and more detailed research into the translation of large-vocabulary languages like Luganda, by showing the different machine learning techniques that can be used to achieve this.
IV. METHODOLOGY
The problem being investigated in this project is to develop an AI-powered English-to-Luganda translation system. The significance of this problem lies in the growing demand for high-quality and culturally sensitive translations, particularly in commerce and communication between English- and Luganda-speaking communities.
The scope of the project is to develop an AI system that is capable of accurately translating English text into Luganda text, while also preserving the meaning and cultural context of the original text.
To address this problem, the proposed AI approach is to develop a neural machine translation (NMT) model. The NMT model will be trained on an English-Luganda parallel corpus and will use this data to learn the relationship between the two languages. The AI process can be summarized as follows:
Data Collection: Collect a large corpus of parallel text data in English and Luganda.
The AI evaluation framework used in this project is mainly accuracy metrics; accuracy is a measure of how well the model will be able to translate a given text.
In conclusion, the proposed AI approach for this project is to develop a neural machine translation model that can accurately translate English text into Luganda text while preserving the meaning and cultural context of the original text.
A. Dataset Description
The dataset I used was created by Makerere University, and it contains approximately 15k English sentences with their respective Luganda translations. Below are the factors for choosing this dataset:
1) Scarcity of Luganda datasets. Luganda is not a widely spoken language worldwide; it is mainly used in Uganda, so this was the only major dataset I could find.
2) Cost. The dataset is available for free for anyone to use and edit.
3) Accuracy. The accuracy of the dataset is good, making it the best option to use.
4) Size. The dataset is relatively large and diverse enough to train a very good model.
B. Data Preparation and Exploratory Data Analysis
Data preparation refers to the steps taken to turn raw data into improved data that can be used to train a machine learning model. The data preparation process for my model was as follows:
a. Removal of punctuation and unnecessary spaces. This is necessary to prevent the model from training on a large amount of unnecessary data.
b. Converting the words in the dataset to lowercase. Since Python is case-sensitive, a word like “Hello” is different from “hello”; to avoid this dilemma, I converted all text to lowercase.
c. Vectorization of the dataset. Vectorization is the process of converting a given text into numerical indices. This is necessary because the machine learning pipeline can only be trained on numerical data.
d. Removal of null values. All rows containing null data were dropped, because for textual data it is very difficult to estimate the value of a missing entry.
Those were the data preparation processes I used in the model creation process.
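The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code; the helper names and the tiny sample pairs are made up for the example.

```python
import string

def clean(sentence):
    # a. strip punctuation, then collapse unnecessary spaces
    sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    # b. lowercase so "Hello" and "hello" become the same token
    return " ".join(sentence.lower().split())

def build_vocab(sentences):
    # map every distinct word to a numerical index (0 reserved for padding)
    words = sorted({w for s in sentences for w in s.split()})
    return {w: i + 1 for i, w in enumerate(words)}

def vectorize(sentence, vocab):
    # c. convert a cleaned sentence into a list of numerical indices
    return [vocab[w] for w in sentence.split()]

# d. drop pairs with null entries, then clean and vectorize the rest
pairs = [("Hello,  world!", "Mwattu"), ("Good morning", None)]  # illustrative
pairs = [(en, lg) for en, lg in pairs if en is not None and lg is not None]
cleaned = [clean(en) for en, _ in pairs]
vocab = build_vocab(cleaned)
vectors = [vectorize(s, vocab) for s in cleaned]
```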
Exploratory data analysis is the process of performing initial investigations on data to discover anomalies and patterns. It is mainly carried out through visualization of the data. Below are the visualizations and their meanings:
1) Word Cloud
A word cloud is a graphical representation of the words that are used most frequently in the dataset. This is important because it shows which particular words the model will depend on most heavily.
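A word cloud is rendered from per-word frequency counts. A minimal sketch of computing those counts (the mini-corpus is illustrative; a library such as wordcloud would then scale each word by its count):

```python
from collections import Counter

sentences = ["the cat sat", "the dog sat"]  # illustrative mini-corpus
# count how often each word appears across the whole corpus
freqs = Counter(w for s in sentences for w in s.split())
# the most frequent words are the ones drawn largest in the cloud
top = freqs.most_common(2)
```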
2) Correlation Matrix
This is a matrix showing the correlation of the different values to each other. Plotting a 2-D correlation matrix for the entire dataset is almost impossible, but it is possible to plot one for a particular sentence. The matrix below shows the correlation for a given sentence; the model will have to pay a lot of attention to the words that are highly correlated to each other.
A correlation matrix for one of the sentences
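One way to realize such a word-correlation matrix (an assumed method, since the paper does not specify how its matrix was computed) is to correlate word-count vectors across sentences with `numpy.corrcoef`:

```python
import numpy as np

sentences = ["the cat sat", "the cat ran", "a dog sat"]  # illustrative mini-corpus
vocab = sorted({w for s in sentences for w in s.split()})
# rows = sentences, columns = words; each entry counts the word in the sentence
counts = np.array([[s.split().count(w) for w in vocab] for s in sentences], float)
# correlate the word columns: entry (i, j) shows how strongly words i and j co-occur
corr = np.corrcoef(counts, rowvar=False)
```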
C. AI Model Selection
The recurrent neural network model was a simple model that uses RNNs to translate the text. Its accuracy was very poor because the vocabularies of the two languages are very large; plain RNNs of this type are best suited to small vocabularies.
The attention mechanism model performed much better than the RNN model. Attention is a mechanism used in deep neural networks whereby the model can focus on only the important parts of a given text by assigning them larger weights.
The other model I created used transformers. Transformers are also deep learning models, built on top of attention layers, which makes them much more efficient for NLP tasks.
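The weighting idea behind attention can be illustrated with a scaled dot-product attention sketch in NumPy. The shapes and random inputs are illustrative only; the paper does not show the project's actual layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: subtract the max before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # score each query against every key, scale by sqrt(dim),
    # then turn the scores into weights that sum to 1
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    # output is a weighted mix of the values: important tokens
    # (high scores) contribute more
    return weights @ v, weights

q = np.random.randn(2, 4)  # 2 query tokens, dimension 4
k = np.random.randn(3, 4)  # 3 key tokens
v = np.random.randn(3, 4)  # matching values
out, w = attention(q, k, v)
```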
These figures show the maximum sentence lengths for the English and the Luganda sentences respectively.
3) Box Plot
A box plot is a visual representation that can be used to show the major outliers in the dataset. Plotting a box plot for the entire dataset is also almost impossible, but it is possible to plot one for a particular sentence. This shows the possible outliers in that sentence, so that during training the model ends up not paying a lot of attention to those particular words.
Box plot for one of the sentences in the dataset
D. AI Model Accountability
In this context of AI, “accountability” refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate or deploy, following their roles and applicable regulatory frameworks, and that they will demonstrate this through their actions and decision-making processes (for example, by providing documentation on key decisions throughout the AI system lifecycle, or by conducting or allowing auditing where justified).
AI accountability is very important because it is a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I apply accountability to my model by recognizing that it depends mainly on its dataset: the quality of the dataset must be constantly improved and filtered, because even slight modifications in the spelling of words will decrease the model's accuracy.
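The ongoing dataset filtering mentioned above can be sketched as a simple quality pass. This is an illustrative helper, not the project's code; the sample pairs are made up.

```python
def filter_pairs(pairs):
    """Keep only usable (english, luganda) pairs: drop nulls, empty
    strings, and exact duplicates, as a basic ongoing quality filter."""
    seen = set()
    kept = []
    for en, lg in pairs:
        if not en or not lg:  # drop null or empty entries
            continue
        key = (en.strip().lower(), lg.strip().lower())
        if key in seen:  # drop exact duplicates
            continue
        seen.add(key)
        kept.append((en, lg))
    return kept

pairs = [("Good morning", "Wasuze otya"),
         ("Good morning", "Wasuze otya"),  # duplicate row
         ("", None)]                       # null row
clean_pairs = filter_pairs(pairs)
```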
ACKNOWLEDGMENT
Special thanks to Mr. Ggaliwango Marvin for his never-ending support towards my research on this project. I also want to appreciate Mrs. ------ for providing the foundation knowledge needed for this project.
REFERENCES