Sahono 2020
Abstract: Personality is a characteristic of each individual that describes their behavior and influences their interactions with other individuals. Every individual has various ways to express their feelings, one of them being social media. On social media, humans can create and share a variety of content about various objects, describe activities, and express their thoughts, opinions, and feelings. This study aims to classify human personalities based on the MBTI method, focusing on the Extrovert and Introvert classes, as seen from their tweets. Humans are able to better understand and improve themselves by recognizing their weaknesses and strengths. The dataset used in this study is a public dataset from Kaggle, consisting of 8676 records posted on Twitter. Various feature combinations have been compared using a Support Vector Machine (SVM) classifier. The deployment of the solution is described. Accuracies up to 84.07% were achieved using the methods detailed in this work.

Keywords: personality classification; Extrovert; Introvert; Myers-Briggs Type Indicator (MBTI); TFIDF; Support Vector Machine.

I. INTRODUCTION

Personality is a combination of the behavior and character of each individual when faced with various situations. In many ways, personality plays a role in one's choices in things such as music tastes, books, films, and others. In addition, the interaction of individuals with others and with the environment is also strongly influenced by personality [1]. Personality is often used in assessments for employee recruitment, health counseling, relationship counseling, and career counseling.

One of the most serious global problems is mental illness. The most common mental illnesses are depression, anxiety, and several other disorders. Worldwide, more than 350 million people suffer from mental disorders, including 16.7% of British citizens [2]. Mental illness is a dangerous condition. Even more worrying, mental disorders are difficult to detect [3]. This is because patients have a tendency to want to be alone.

In assessing a person's personality, analysis can be done based on a combination of traits shown by each individual. The Myers-Briggs Type Indicator (MBTI) is a method of personality profiling using a questionnaire, which aims to highlight psychological differences between individuals' perceptions of the world and their decision-making skills [4]. This method is designed for regular populations, to distinguish between individual differences that occur naturally.

Social media accounts can be called someone's self-reflection. On social media, humans can easily spread personal information, such as posting daily activities and commenting on posts. Today, this is done easily and has become a habit. The text left by the user can be analyzed to obtain information, in this case the user's personality [5].

This study aims to classify human personalities based on the MBTI method, focusing on detecting extroverts and introverts from their posts on social media. Extroverts and introverts were chosen because these traits are the most easily seen from the outside and relate to the dataset used in this study. Humans are able to better understand and improve themselves by recognizing their weaknesses and strengths.

The discussion in this study is organized as follows: Chapter II discusses related research. Chapter III discusses the methods used in the research and the evaluation process. Chapter IV discusses the results of the research and an analysis of the proposed method. The last chapter gives the conclusion and future work.

II. RELATED WORK

A person's personality is usually determined using a questionnaire. There are various psychological systems for categorizing personalities, such as Big Five and MBTI. The MBTI was developed from Carl Jung's personality theory and sorts an individual into one of sixteen categories of personality, depending on their psychological state.

Several studies have been done before regarding the classification of personality. The study titled Personality Classification Based on Twitter Text Using Naive Bayes, KNN and SVM was conducted by Bayu Yudha Pratama and Riyanarto Sarno. The study showed that the personality of users can be successfully predicted from the text written on Twitter. The study used three methods, of which Naive Bayes gave the best result at 60%, but it still failed to improve on the 65% accuracy of previous studies.

The second study is Persona Traits Identification based on the Myers-Briggs Type Indicator (MBTI) - A Text Classification Approach. In this study, MBTI was explored further, and an accuracy of up to 73.9% was observed.

The third study, Detection of Myers-Briggs Type Indicator via Text-Based Computer-Mediated Communication, shows that classification using Naive Bayes has not been able to produce maximum accuracy and requires advanced natural language processing techniques to improve the result.

III. METHODOLOGY

A. Text Mining

Text mining can be broadly defined as a process of extracting information in which a user interacts with a group of documents using analysis tools that are components of data mining, one of which is categorization.

B. Text Preprocessing

Text preprocessing is an initial step that prepares text as data for further processing. A text cannot be processed directly by a search algorithm, so preprocessing is needed to convert text into numeric data [6]. This process consists of several stages of document cleaning, as follows [7]:

a. Data Cleaning
This research discusses several data extraction processes, such as:
1. Case Folding
Case folding aims to make all text lowercase.
2. URL Removal
A URL usually points to video links, images, and other websites that must be extracted first. URLs often make the data ineffective and meaningless, so it is necessary to delete them. Web addresses or URLs appear because many users promote a product on their site so that other users can directly enter the web page.
3. Punctuation Removal
Punctuation removal aims to delete all non-alphabet characters such as symbols, spaces, and others.
4. White Space Removal
This process refers to correcting punctuation. An example of an error at this stage is not using the dot '.' at the end of a sentence. However, this step is not very significant, especially when dealing with social media, which rarely heeds punctuation.
5. Emoticon and Unicode Removal
Deleting emoticons makes the data easier to read and to preprocess. These elements are deleted in the same way as usernames, tags, and Unicode characters, namely by detecting them with regular expressions. Everything detected by the regular expressions is treated the same way: it is replaced with blank text.

b. Feature Extraction
Feature extraction is a process to extract important characteristics or information from data so that it can be used in the classification process. The processes used in feature extraction are as follows [8]:
1. Split Class
The split class process is used in this study as a feature extraction method to optimize the separation between class 'E' and class 'I' so that the recognition results are more accurate. Thus, the split class process will be tested based on the level of recognition accuracy.
2. Counting Mention
The mention feature is used to point to or invite friends to communicate directly by adding the symbol "@" before the intended user name. In this study, the mentions of names are counted on the assumption that 'E' and 'I' are related to interaction with other people. The researcher only utilizes the data linkages or user comments [9].

c. Filtering or Stopword Removal
Stopword removal is a process of eliminating words that are not important in the description by checking the words obtained from parsing the description. The check determines whether or not a word is included in the list of unimportant words (stoplist).

d. Stemming
Stemming is an integral part of Information Retrieval (IR). There are many algorithms for stemming the English language, each with various limitations. The Porter algorithm, for example, requires relatively less time than stemming with other algorithms.

e. Tokenization
Tokenizing is the process of breaking a document into a collection of words. Tokenization can be done by removing punctuation and splitting the text at word boundaries. This stage also removes certain characters such as punctuation and changes all tokens to lowercase.

C. Term Frequency Inverse Document Frequency (TFIDF)

The TFIDF method is the method most commonly used in information retrieval for calculating the weight of each word. This method is also known to be efficient, easy, and accurate [10].

The Term Frequency Inverse Document Frequency (TFIDF) method is one way to assign weights to the terms of a document. TFIDF is a measure of how important a word is in a document or in a group of words. Within a single document, every sentence is considered as a document. The frequency with which a word appears in a given document indicates how important that word is in the document. The document frequency shows how often the word is used across documents, which can be interpreted as how common the word is. A word's weight is greater if it appears frequently in a document and smaller if it appears in many documents [11].

Term frequency is a measure of how often a term appears in a document. The term frequency is calculated using equation (1):

$$tf(i) = \frac{freq_i(d_j)}{\sum_{i=1}^{k} freq_i(d_j)} \qquad (1)$$

where $i$ denotes the i-th term, $freq_i(d_j)$ is the frequency of the i-th term appearing in the j-th document, and the sum in the denominator runs over all $k$ terms.
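As a small worked illustration of Equation (1), assume a hypothetical document $d_j$ with five tokens in total, in which the term "pranks" occurs twice. The document and its counts are invented for illustration and are not taken from the dataset.

```latex
tf(\text{pranks})
  = \frac{freq_{\text{pranks}}(d_j)}{\sum_{i=1}^{k} freq_i(d_j)}
  = \frac{2}{5}
  = 0.4
```

Because the denominator is the total number of term occurrences in the document, the term frequencies of all terms in one document sum to 1.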
Authorized licensed use limited to: San Francisco State Univ. Downloaded on June 24,2021 at 21:39:02 UTC from IEEE Xplore. Restrictions apply.
2020 International Seminar on Application for Technology of Information and Communication (iSemantic)
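The weight of a term in a document is the product of its term frequency and its inverse document frequency:

$$W_{dt} = tf_{dt} \times idf_t$$

where
$W_{dt}$ = weight of document d with respect to term t
$tf_{dt}$ = number of occurrences of term t in document d
$idf_t$ = inverse document frequency, $\log(N / df)$
$N$ = total number of documents
$df$ = number of documents that contain term t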
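The weighting defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `tfidf_weights` and the toy documents are hypothetical, the natural logarithm is assumed because the text does not state the base, and the term frequency is taken as the raw count, following the definition of tf given above.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight is tf * log(N / df), with tf the raw count of the term in
    the document (assumption: natural logarithm, raw counts).
    """
    n_docs = len(documents)
    # df: number of documents that contain the term at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already tokenized documents
docs = [
    ["love", "music", "music"],
    ["love", "books"],
    ["love", "quiet", "evenings"],
]
weights = tfidf_weights(docs)
```

In this toy corpus the term "love" appears in every document, so its weight is zero everywhere, while "music" appears twice in a single document and receives the largest weight. This mirrors the statement that a term is weighted down when it occurs in many documents.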
D. Classification

Classification is the process of finding a model or function that can explain or differentiate concepts or classes of data. The purpose of classification is to estimate the class of an object whose label is unknown. The classification process is divided into two parts, the learning phase and then the testing phase. In the learning phase, part of the data whose class is known is used to form an approximate model. Then, in the testing phase, the model is tested with other data to determine its accuracy. If the accuracy is adequate, the model can be used to predict the class of unknown data. Support Vector Machine (SVM) is one of the best methods that can be used in pattern classification problems. The SVM algorithm is used here to classify text using term index weights as features [13].

Fig. 1. Experimental steps

The second column contains posts from social media; each row contains 50 posts separated by the mark '|||'. The posts contain several components, such as links (images, videos, and websites), hashtags, emoticons, mentions, and sentences.

For this research, we focus on extracting the Extrovert or Introvert label. In the dataset, 1999 records are labeled Extrovert and 6677 records are labeled Introvert.
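The dataset layout described in this section (an MBTI type and a posts field separated by '|||') suggests a simple way to obtain the Extrovert/Introvert label and the individual posts. The sketch below is only an illustration under assumptions: the helper names `split_posts` and `to_ei_label` are hypothetical, the rows are invented, and the label is assumed to be the first letter of the four-letter MBTI type.

```python
from collections import Counter

POST_SEPARATOR = "|||"


def split_posts(posts_field):
    """Split the raw posts field into a list of non-empty posts."""
    return [p.strip() for p in posts_field.split(POST_SEPARATOR) if p.strip()]


def to_ei_label(mbti_type):
    """Map a four-letter MBTI type to 'E' or 'I' using its first letter."""
    label = mbti_type.strip().upper()[0]
    if label not in ("E", "I"):
        raise ValueError(f"unexpected MBTI type: {mbti_type!r}")
    return label


# Hypothetical rows in the (type, posts) layout; not real dataset content
rows = [
    ("INFJ", "enfp and intj moments|||sportscenter not top ten plays|||pranks"),
    ("ENTP", "first example post|||second example post"),
    ("INTP", "good one|||another example post"),
]

labels = [to_ei_label(mbti) for mbti, _ in rows]
class_counts = Counter(labels)          # class distribution of the E/I labels
first_posts = split_posts(rows[0][1])   # individual posts of the first row
```

Counting the labels this way makes the class imbalance visible before training, which matters here because Introvert records outnumber Extrovert records by more than three to one.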
TABLE I. SAMPLE DATASET

No  Type  Posts
1   INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj moments https://www.youtube.com/watch?v=iz7lE1g4XM4 sportscenter not top ten plays https://www.youtube.com/watch?v=uCdfze1etec pranks|||...'

b. Count the number of mentions in the format @[name], to count the number of individuals associated with the user.
c. Count the number of emoticons used; this only serves to describe the composition of the posts. Examples of emoticons: [:), :P, :(, ...]
d. If there is a link, it is ignored or deleted because of the difficulty of extracting meaning from the link. Example: http://www.url_name, http://www.urlname/imagename.jpg

TABLE II. SAMPLE NON-WORD FEATURE EXTRACTION RESULTS

Document  Type  count_emoticons  count_mention
1         I     0                0
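The non-word features listed above (mention count, emoticon count, link removal) can be approximated with regular expressions. The following is a sketch under stated assumptions, not the authors' code: the function name `extract_non_word_features`, the patterns, and the short emoticon list are illustrative choices, since the paper does not give its exact expressions.

```python
import re

# Links stop at whitespace or at '|' so that the '|||' separator is not swallowed
URL_PATTERN = re.compile(r"https?://[^\s|]+")
# Mentions in the format @name
MENTION_PATTERN = re.compile(r"@\w+")
# Small illustrative emoticon set; the paper's full list is not given
EMOTICON_PATTERN = re.compile(r"(?::\)|:\(|:P|:D|;\))")


def extract_non_word_features(text):
    """Remove links, then count mentions and emoticons in the remaining text."""
    without_links = URL_PATTERN.sub(" ", text)
    return {
        "count_mention": len(MENTION_PATTERN.findall(without_links)),
        "count_emoticons": len(EMOTICON_PATTERN.findall(without_links)),
        # Collapse the whitespace left behind by the removed links
        "text": " ".join(without_links.split()),
    }


# Hypothetical post used only to exercise the function
sample = "@alice check this http://www.urlname/imagename.jpg :) thanks @bob :P"
features = extract_non_word_features(sample)
```

Links are removed before counting so that characters inside a URL are never mistaken for an emoticon or a mention.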
No  Type  TF-IDF weight (one column per term)
1   I     0.0                  0.0                   0.0                  0.0                  0.0                 0.0  0.0
2   E     0.0                  0.037053529509286975  0.0                  0.0                  0.0                 0.0  0.0
3   I     0.13289477277047257  0.04818354752171105   0.11610724718000343  0.0                  0.0695374102627311  0.0  0.0
4   I     0.0                  0.0745434612065534    0.0                  0.04738923833432656  0.0                 0.0  0.0
5   E     0.0                  0.0                   0.0                  0.0                  0.0                 0.0  0.0
6   I     0.0                  0.029795764448288273  0.0                  0.0                  0.0                 0.0  0.0
7   I     0.0                  0.03215746805276569   0.0                  0.0                  0.0                 0.0  0.0
8   I     0.0                  0.0                   0.0                  0.0                  0.0                 0.0  0.034854145887464334
9   I     0.0                  0.0                   0.0                  0.0                  0.0                 0.0  0.055327862051235965
10  I     0.0                  0.0                   0.08093263104160807  0.0                  0.0                 0.0  0.0
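The feature selection used in this study keeps the terms with the largest accumulated TF-IDF weight (800 terms in the paper). On a small weight matrix like the one above, that step might look as follows. The function name `select_top_terms`, the term names, and the weights are hypothetical and chosen only to keep the example small.

```python
def select_top_terms(weight_rows, terms, k):
    """Keep the k terms whose TF-IDF weights sum highest over all documents."""
    # Accumulate the weight of each term (column) across all documents (rows)
    totals = [sum(row[j] for row in weight_rows) for j in range(len(terms))]
    ranked = sorted(range(len(terms)), key=lambda j: totals[j], reverse=True)
    keep = sorted(ranked[:k])  # restore the original column order
    reduced = [[row[j] for j in keep] for row in weight_rows]
    return [terms[j] for j in keep], reduced


# Hypothetical terms and TF-IDF weights (rows are documents)
terms = ["music", "the", "party", "quiet"]
weights = [
    [0.13, 0.0, 0.05, 0.00],
    [0.00, 0.0, 0.04, 0.07],
    [0.12, 0.0, 0.00, 0.00],
]
selected_terms, reduced = select_top_terms(weights, terms, k=2)
```

Here "music" and "party" have the largest column sums, so only those two columns remain in the reduced matrix.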
This process produces high-dimensional vectors. More than 8000 terms (features) are produced by TF-IDF, and not all of them have a usable TF-IDF weight. For this reason, feature selection is performed to reduce the amount of processing. The selection is based on the accumulated weight of each term: the 800 terms with the largest accumulated TF-IDF weight are taken.

Evaluation

Based on previous research, one good method for this case, where the classes contain unequal numbers of records, is SVM. For this reason, the classification of posts into Introvert and Extrovert types is done using the SVM logistic regression method. Two SVM experiments were conducted with a maximum of 1000 iterations. The first experiment used the selected tf-idf features only, and the second used the selected tf-idf features plus the non-word count features. The results show that the second experiment has higher performance values for both accuracy and precision. This is based on the attachment of extroverts and introverts to external environmental factors. However, the results did not show a significant improvement; in particular, the recall did not show a significant difference. These values were obtained with 10-fold cross-validation.

TABLE V. RESULTS

Feature                                         Accuracy          Recall I  Recall E  Avg recall
tfidf (800 terms + emot count + mention count)  84.07% +/- 0.61%  95.45%    46.07%    70.76%
tfidf (800 terms)                               83.69% +/- 0.68%  94.71%    46.87%    70.79%

In comparison with previous studies, especially those using SVM with tf-idf, this experiment provided a better performance of 84.07%. This is due to the feature selection and preprocessing conducted in this study.
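The average recall in Table V is the mean of the two per-class recalls; for the first row, (95.45% + 46.07%) / 2 = 70.76%. The sketch below shows how per-class recall, average recall, and accuracy relate on a small imbalanced example. The labels and predictions are hypothetical and are not results from this study.

```python
def per_class_recall(y_true, y_pred, labels=("I", "E")):
    """Recall per class: correctly predicted members / all true members."""
    recalls = {}
    for label in labels:
        predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
        correct = sum(p == label for p in predictions_for_class)
        recalls[label] = correct / len(predictions_for_class)
    return recalls


# Hypothetical imbalanced sample: eight Introvert and two Extrovert records
y_true = ["I"] * 8 + ["E"] * 2
y_pred = ["I"] * 7 + ["E"] + ["E", "I"]

recalls = per_class_recall(y_true, y_pred)
avg_recall = sum(recalls.values()) / len(recalls)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the accuracy is 0.8 while the recall for class E is only 0.5. The same effect appears in Table V, where the majority Introvert class dominates the accuracy and the Extrovert recall stays below 50%.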