Research Methods in Machine Learning: A Content Analysis
Jackson Kamiri, Geoffrey Mariga

Abstract: Research methods in machine learning play a pivotal role since the accuracy and reliability of the results are influenced by the research methods used. The main aims of this paper were to explore current research methods in machine learning, emerging themes, and the implications of those themes in machine learning research. To achieve this, the researchers analyzed a total of 100 articles published since 2019 in IEEE journals. This study revealed that machine learning uses quantitative research methods, with experimental research design being the de facto research approach. The study also revealed that researchers nowadays use more than one algorithm to address a problem. Optimal feature selection has also emerged as a key technique that researchers use to optimize the performance of machine learning algorithms. The confusion matrix and its derivatives are still the main ways used to evaluate the performance of algorithms, although researchers are now also considering the processing time an algorithm takes to execute. The Python programming language, together with its libraries, is the most used tool for creating, training, and testing models. The most used algorithms in addressing both classification and prediction problems are Naïve Bayes, Support Vector Machine, Random Forest, Artificial Neural Networks, and Decision Tree. The recurring themes identified in this study are likely to open new frontiers in machine learning.

Keywords: Research methods in machine learning, machine learning algorithms, machine learning techniques.

I. INTRODUCTION
Machine learning can be defined as a field of artificial intelligence that is concerned with the development of algorithms and techniques that allow a computer to learn and gain intelligence from experience [1]. Research methods in machine learning research play a pivotal role since the accuracy and reliability of the results are influenced by the research approach or method used. In machine learning, models learn from historical data, which can be either primary data or secondary data. This creates a wide pool of knowledge from which machines can learn and make decisions based on what they learn [2].

The research articles analyzed emphasized the need to choose an appropriate methodology. Machine learning is classified as scientific research; therefore, according to many scholars, its research methodology incorporates several factors, which include the choice of training data, the choice of data attributes, the choice of algorithm(s) to use in the model, the ratio between training data and testing data, and the performance measures of the algorithms, among other factors. Thus, scholars have demonstrated that the choice of methodology should span the above factors. The objectives of the research also influence the methodology to be used [3].

Scholars in the field of machine learning have demonstrated that a good research methodology should be capable of clearly explaining how the researcher arrived at the results [4]. Therefore, the methodology should be presented in such a way that another researcher can easily follow it and end up with the same results. Some researchers have opted to use a single algorithm, while others model with more than one algorithm and then compare the results of the various algorithms [4].

Research in the field of machine learning is mainly quantitative since it involves modeling data and using statistical approaches and formulations to make sense of the data [1], [2], [3], [4]. Therefore, the methods used in machine learning research must in most cases be consistent with quantitative research. However, this does not mean that all fields of artificial intelligence or domains affiliated with machine learning are confined to quantitative research. Some areas, such as Natural Language Processing, may require a qualitative research approach [5].

When scholars are researching machine learning, they must first scrutinize the problem that they intend to solve and then decide whether it requires a quantitative approach, a qualitative approach, or a hybrid of the two. This is so because machine learning does not exist in a vacuum; it is used to solve problems in society. Thus, the nature of the problem influences the methodology to be used in research [6]. Also, machine learning is a very active area of research; therefore, scholars are continuously trying to solve a majority
of the problems in the various domains using machine learning techniques.

The articles assessed in this paper have shown that research methodology in machine learning can be the key difference between various studies that are solving a similar problem in the same domain [7], [8]. Therefore, the methodology can form the research gap that a researcher wants to address. The researcher will then use a different approach which s/he thinks is better than the approach(es) previously used by other researchers to solve the same problem. Some researchers have also combined methodologies used by other researchers with the sole aim of increasing the accuracy of the results.

Problem statement

This study aims at exploring research methods in machine learning and determining emerging themes in the field of machine learning research. Specifically, the following three research questions will be addressed.
1) What research methods are commonly used in the field of Machine Learning?
2) What other recurring themes are there in the Machine Learning field with regard to research methods?
3) What implications do these recurring themes in research methods have for the Machine Learning discipline?
Scholars in the field of machine learning have contributed widely to the evolution of research methods. Therefore, a cross-examination of the status quo in the field of machine learning will inform scholars about research methods and help them make more informed decisions while selecting the methodology to use in their research.

A survey of Content Analysis

Grossman, Zak, and Zelinsk conducted a quantitative content analysis of mobile applications used by caregivers. In the analysis, 44 applications that identified themselves or were advertised as caregiver aiding tools were assessed. The apps were identified in two major application stores, namely the iTunes App Store and the Google Play Store. To get the apps, the researchers searched the keywords “caregivers”, “caregiving” and “elderly care” [9]. The researchers only included applications that dealt exclusively with informal or family caregivers for elderly people. Therefore, apps developed for professional caregivers, and those developed for people with special chronic conditions such as diabetes, were excluded [9].

The apps were assessed against five functions that are performed by successful caregivers to determine the level to which the apps relieve the burden of caregivers. The five key functions were: (i) information and resources; (ii) practical problem solving, involving behavioral solutions, medication management, safety, and personal health records tracking; (iii) memory aids; (iv) family communication, which includes notifications for checkups, an emergency contact list, and sharing of critical information; and (v) caregiver support, that is, caring for the caregiver. The general assessment of the apps revealed that 50% (22 apps) were designed for caregivers taking care of people with memory loss [9].

The results of the analysis against the five functions were as follows. First, only 34% (15 apps) satisfied the criteria of providing caregivers with relevant information and resources such as searchable databases [9]. Second, 21 applications met the criteria for practical problem-solving. Practical problem-solving in this case was defined as addressing medication management, safety, health record tracking, and behavioral solutions [9]. Third, 15 apps passed the test of facilitating better communication and enhancing coordination among family members [9]. Fourth, 12 applications met the criteria of acting as a memory aid. These apps have tools that enable people with dementia to enhance their cognitive abilities [9]. Fifth, a total of 10 of the 44 apps contained tools that care for the caregiver by providing emotional support, stress management, and social support, among others [9].

The author of [10] conducted a content analysis of research methods in Library Information Science (LIS). The researcher analyzed a total of 1162 articles published between 2001 and 2010 in three major journals of LIS, namely Journal of Documentation (JDoc), Journal of the American Society for Information Science and Technology (JASIS&T), and Library Information Science Research (LIS). However, the researcher deferred the analysis of articles in the JASIS&T journal between 2003 and 2008 and only analyzed articles from 2001-2002 and 2009-2010 [10]. The articles were analyzed quantitatively and qualitatively to address some recurring themes about research method selection and application in the scholarly domain. Non-research articles in this field were excluded from the analysis.

The findings of the research were as follows: the three journals shared four of the top five research methods identified in that study. The theoretical approach led with a cumulative percentage of 65%, content analysis took second position with 57%, the questionnaire took third position with a cumulative percentage of 55.8%, and the experiment occupied fourth position with a cumulative percentage of 53.4% [10]. The researcher discovered that, unlike before, questionnaires and surveys no longer dominate as the leading research methods in Library Information Science [10]. Despite Bibliometrics leading in JASIS&T, it did not feature well in the other journals,
therefore, it did not qualify to be among the top four techniques. The researcher further reports that the use of multiple methods has gained a lot of prevalence.

Azeez and Van der Vyver conducted a comprehensive content analysis of privacy and security issues in e-health cloud-based models. The scholars reviewed 110 articles in this area and identified several models used in their solutions. The articles were sourced from the following sources: ACM digital library, IEEE digital library, IEEE Xplore digital library, Springer, Elsevier, and ScienceDirect [11]. The 110 reviewed articles were arrived at after comparing the models and approaches used by many researchers. The researchers examined the strengths and weaknesses of each method adopted in addressing e-health security challenges [11].

The scholars discovered that for any e-health system to be reliable, it must have strong mechanisms to counter various security threats such as tampering, masquerading, denial-of-service attacks, and loss of privacy [11]. Public Key Encryption was identified as the most widely used security mechanism; however, the researchers report that it needs to be enhanced. Attribute-based encryption (ABE) is another security scheme that is being used. ABE is efficient in enhancing security and privacy, although it is expensive in terms of computational complexity. The high cost of ABE is mainly because the ciphertext is expensive, and thus decryption is expensive [11].

Selective encryption was also widely used in the articles assessed. This involves encrypting only the parts of the data that are considered to be more sensitive [11]. This method is considered to be less costly; therefore, researchers are considering it over ABE. The researchers concluded that research in security and privacy issues in e-health is still wanting since threats are increasing with time.

III. METHOD
One hundred research articles published since 2019 were obtained from IEEE journals. All the research papers used were in the field of machine learning. Journal papers that were not written in English, or at least translated to English, were excluded. Access to these journals was provided by Murang’a University of Technology.

The data collection yielded 100 research articles in the field of machine learning, as summarized in Table 1. These articles were considered to be very relevant since they were not more than two years old; thus they can inform current research in the field of machine learning, which is a very active field. The research articles chosen were solving different problems using machine learning approaches. All 100 articles were analyzed by the researchers to determine the research methods used in machine learning.

In this study, the researchers analyzed each of the sampled papers along several dimensions to describe machine learning research methods. These dimensions are: general research design, sources of data, algorithms used, and approaches applied in data pre-processing, model training, testing, and evaluation. These dimensions were used to inform the findings of this paper. All 100 research papers used in this study had the above key parameters in their research methodology. These are general parameters; therefore, under each one of them, several sub-parameters were identified. For example, under performance metrics, the majority of the articles recorded the performance of the algorithms used under various performance metrics.
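
The performance metrics dimension mentioned above can be made concrete with the confusion matrix and its derivatives, which the abstract names as the main evaluation tools in the surveyed articles. The following is a minimal sketch, assuming a binary classification problem with labels 0 and 1; the function names `confusion_counts` and `derived_metrics` and the sample labels are illustrative assumptions, not taken from any of the analyzed articles.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four cells of a binary confusion matrix.

    Returns (tp, fp, fn, tn) for the label treated as the positive class.
    """
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correctly predicted positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        elif actual == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


def derived_metrics(tp, fp, fn, tn):
    """Compute common metrics derived from the confusion matrix.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    actual_labels = [1, 1, 1, 0, 0, 0, 1, 0]
    predicted_labels = [1, 0, 1, 0, 1, 0, 1, 0]
    counts = confusion_counts(actual_labels, predicted_labels)
    print(counts)
    print(derived_metrics(*counts))
```

Reporting several derived metrics side by side, rather than accuracy alone, is one way to compare algorithms on the same data set, which matches the multi-algorithm comparisons the surveyed articles describe.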
Table 1 shows a summary of the research articles analyzed in this research. The table shows that machine learning techniques are being applied to address problems in diverse fields.

IV. RESULTS
Data Collection

In this stage, two main sources of data were used in the articles analyzed. The two sources are secondary data and primary data. Eighty percent of the articles used purely secondary data sets in the research, as shown in Figure 2. The secondary data sets were obtained from data repositories. The most used data repository was the UCI ML repository. This repository has been used by research works in machine learning that are addressing diverse problems. This shows that the repository has a variety of datasets in different areas. Apart from repositories, some research articles used data obtained from government entities and non-governmental organizations.

Figure 3: Secondary data usage analysis

Primary data was mainly sourced by recording data over a long period. Research that involved primary data took a longer time than research that used secondary data. A combination of both primary and secondary data was used
where there is a need to guarantee the validity of the model or to enhance the model to accommodate real-time data [12]. Also, where secondary data was not available or the available secondary data was not sufficient, the researchers resorted to using either purely primary data or a combination of both primary and secondary data.

None of the articles analyzed in this study used more than one set of primary data. Also, none of the research works used qualitative data collection methods to collect any form of primary data. This indicates that research in machine learning, at least in the articles sampled, is purely quantitative.

Data Pre-processing

All the articles analyzed had this stage as part of their methodology. The majority of the articles noted that this is a very critical stage in machine learning since it can influence the performance of models and consequently the results of the model. This is so because machine learning algorithms learn from the data provided to them and use the acquired knowledge to make decisions.

The analyzed articles revealed that regardless of the algorithms to be used in the research (supervised, unsupervised, or reinforcement learning) or the nature of the research (classification or regression), data pre-processing is a paramount stage that cannot be omitted. Some articles recorded that much of the focus in their research work was on data pre-processing.

Data normalization involved tasks such as formatting the data in a form that is easier for the algorithms to process during training. It also involved data scaling and ensuring that the training data is in the same format as the testing data. It was noted that normalized data promotes the efficiency of machine learning models [17].

Noise reduction involved the elimination of outliers or any data item that is inconsistent with the other parts of the data.

Optimal feature selection involved selecting the critical features that most strongly influence the result, also known as the dependent variable. Articles that applied this technique first modeled the data without extracting optimal features; then, in a second phase, the optimal features were selected and the data modeled again.

In all the articles that used optimal feature extraction, it was noted that the performance of the models, in terms of accuracy and of the confusion matrix, improved after optimal features were selected [13], [14], [15], [16]. This is a new development in the field of machine learning that is attracting a lot of interest among researchers since it improves model performance.

Principal component analysis (PCA) has been used as one of the major techniques for selecting optimal features [13], [14]. It identifies correlation among the various attributes of a dataset, thereby enabling researchers to determine how each attribute influences the results and the relevance of each attribute in that particular data set [14].
