Sahono 2020


2020 International Seminar on Application for Technology of Information and Communication (iSemantic)

Extrovert and Introvert Classification based on Myers-Briggs Type Indicator (MBTI) using Support Vector Machine (SVM)
Muhammad Nurfauzi Sahono, Fiqie Ulya Sidiastahta, Guruh Fajar Shidik, Ahmad Zainul Fanani, Muljono, Safira Nuraisha, Erba Lutfina
Faculty of Computer Science, Universitas Dian Nuswantoro, Semarang, Indonesia

Abstract— Personality is a characteristic of each individual that describes their behavior and influences their interactions with other individuals. Every individual has various ways to express their feelings, one of which is social media. On social media, people can create and share a variety of content about various objects, describe activities, and express their thoughts, opinions, and feelings. This study aims to classify human personalities based on the MBTI method, focusing on the Extrovert and Introvert classes, as seen from their tweets. People are able to better understand and improve themselves by recognizing their weaknesses and strengths. The dataset used in this study is a public dataset from Kaggle consisting of 8676 records posted on Twitter. Various feature combinations have been compared using a Support Vector Machine (SVM) classifier. The deployment of the solution has been described. Accuracies up to 84.07% were achieved using the methods detailed in this work.

Keywords—personality classification; Extrovert; Introvert; Myers-Briggs Type Indicator (MBTI); TFIDF; Support Vector Machine.

I. INTRODUCTION

Personality is a combination of the behavior and character of each individual when faced with various situations. In many ways, personality plays a role in one's choices in matters such as music tastes, books, films, and others. In addition, the interaction of individuals with others and with the environment is also strongly influenced by personality [1]. Personality is often used in assessments for employee recruitment, health counseling, relationship counseling, and career counseling.

One of the most serious global problems is mental illness. The most common mental illnesses are depression, anxiety, and several other disorders. Worldwide, more than 350 million people suffer from mental disorders, and 16.7% of British citizens are affected [2]. Mental illness is a dangerous condition. Even more worrying, mental disorders are difficult to detect [3]. This is because patients have a tendency to want to be alone.

In assessing a person's personality, analysis can be done based on a combination of traits shown by each individual. The Myers-Briggs Type Indicator (MBTI) is a method of personality profiling using a questionnaire, which aims to highlight psychological differences in how individuals perceive the world and make decisions [4]. This method is designed for the general population, to distinguish naturally occurring differences between individuals.

Social media accounts can be called someone's self-reflection. On social media, people can easily spread personal information, such as posting daily activities and commenting on posts. Today, this is done easily and has become a habit. The text left by the user can be analyzed to obtain information, in this case the user's personality [5].

This study aims to classify human personalities based on the MBTI method, focusing on detecting extroverts and introverts, as seen from their posts on social media. Extroversion and introversion were chosen because these traits are the most easily seen from the outside and relate to the dataset used in this study. People are able to better understand and improve themselves by recognizing their weaknesses and strengths.

The rest of this paper is organized as follows: Section II discusses related research. Section III discusses the methods used in the research and evaluation process. Section IV discusses the results of the research and the analysis of the proposed method. The paper closes with the conclusion and future work.

II. RELATED WORK

Determining a person's personality is usually done using a questionnaire. There are various psychological systems for categorizing personalities, such as Big Five and MBTI. The MBTI was developed from Carl Jung's personality theory and sorts an individual into one of sixteen categories of personality, depending on their psychological state.

Several studies have been done before regarding the classification of personality. The study titled Personality Classification Based on Twitter Text Using Naive Bayes, KNN and SVM was conducted by Bayu Yudha Pratama and Riyanarto Sarno. The study showed that the personality of users can be successfully predicted from the text written on Twitter. The study used three methods, of which Naive Bayes gave the best result with an accuracy of 60%. However, it still failed to improve on the accuracy of previous studies, which reached 65%.

The second study is Persona Traits Identification based on the Myers-Briggs Type Indicator (MBTI) - A Text Classification Approach. In this study, the MBTI was subjected to further

978-1-7281-9068-6/20/$31.00 ©2020 IEEE


Authorized licensed use limited to: San Francisco State Univ. Downloaded on June 24,2021 at 21:39:02 UTC from IEEE Xplore. Restrictions apply.

exploration, resulting in an observed accuracy of up to 73.9%.

The third study, Detection of Myers-Briggs Type Indicator via Text-Based Computer-Mediated Communication, shows that classification using Naive Bayes has not been able to produce maximum accuracy and requires advanced natural language processing techniques to improve the result.

III. METHODOLOGY

A. Text Mining

Text mining can be broadly defined as a process of extracting information in which a user interacts with a group of documents using analysis tools that are components of data mining, one of which is categorization.

B. Text Preprocessing

Text preprocessing is an initial step that prepares text as data for further processing. Text cannot be processed directly by a search algorithm; therefore, preprocessing is needed to convert text into numeric data [6]. This process consists of several stages of document cleaning, as follows [7]:

a. Data Cleaning
   This research discusses several data extraction processes, such as:
   1. Case Folding
      Case folding aims to make all text lowercase.
   2. URL Removal
      A URL usually points to videos, images, and other websites that must be extracted first. URLs often make the data ineffective and meaningless, so it is necessary to delete them. Web addresses appear because many users promote a product on their site so that other users can directly enter the web page.
   3. Punctuation Removal
      Punctuation removal aims to delete all non-alphabet characters such as symbols, spaces, and others.
   4. White Space Removal
      This process refers to correcting punctuation. An example of an error at this stage is not using the dot '.' at the end of a sentence. However, this step is not very significant, especially when dealing with social media, which rarely heeds punctuation.
   5. Emoticon and Unicode Removal
      Deleting emoticons facilitates data reading and the preprocessing stages. These elements are deleted in the same way as usernames, tags, and Unicode characters, namely by detecting them with regular expressions. Words detected by the regular expressions are all treated the same way, by turning them into blank text.

b. Feature Extraction
   Feature extraction is a process to extract important characteristics or information from data so that it can be used in the classification process. The processes used in feature extraction are as follows [8]:
   1. Split Class
      The split class process is used in this study as a feature extraction method to optimize the separation between class 'E' and class 'I' so that the recognition results are more accurate. Thus, the split class process will be tested based on the level of recognition accuracy.
   2. Counting Mention
      The mention feature is used to point to or invite friends to communicate directly by adding the symbol "@" before the intended user name. In this study, mentions of names are counted on the assumption that 'E' and 'I' differ in how they relate to other people. The researcher only utilizes the data linkages or user comments [9].

c. Filtering or Stopword Removal
   Stopword removal is a process of eliminating words that are not important in the description by checking the words parsed from the description. The check determines whether each word is included in the list of unimportant words (stoplist) or not.

d. Stemming
   Stemming is an integral part of Information Retrieval (IR). There are many algorithms specific to stemming the English language, each with various limitations. The Porter algorithm, for example, requires relatively less time than stemming with other algorithms.

e. Tokenization
   Tokenization is the process of breaking a document into a collection of words. It can be done by removing punctuation and separating the text into words. This stage also removes certain characters such as punctuation and changes all tokens to lowercase.

C. Term Frequency Inverse Document Frequency (TF-IDF)

The TF-IDF method is the method most commonly used in information retrieval for calculating the weight of each word. This method is also known to be efficient, easy, and accurate [10].

The Term Frequency Inverse Document Frequency (TF-IDF) method is one way to assign weights to the terms of a document. TF-IDF measures the extent to which a word is important in a document or in a group of words. Within a single document, every sentence is treated as a document. The frequency with which a word appears in a given document indicates how important that word is in the document. The document frequency shows how often the word is used across documents, that is, how common the word is. The weight of a word is larger if it appears frequently in a document and smaller if it appears in many documents [11].

Term frequency is a measure of how often a term appears in a document. The term frequency is calculated using the equation

$$tf_i(d_j) = \frac{freq_i(d_j)}{\sum_{l=1}^{k} freq_l(d_j)} \quad (1)$$

where $t_i$ is the i-th term, $freq_i(d_j)$ is the frequency of the i-th term appearing in the j-th document, and the denominator sums the frequencies of all $k$ terms in that document.


The inverse document frequency, in turn, is the logarithm of the ratio of the number of all documents to the number of documents that contain the term, written mathematically as:

$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \quad (2)$$

The TF-IDF value is obtained by multiplying the two, as formulated in:

$$(tf\text{-}idf)_{ij} = tf_i(d_j) \cdot idf_i \quad (3)$$

In the TF-IDF algorithm, the weight (W) of each document against a keyword is calculated with the formula [12]:

$$W_{dt} = tf_{dt} \cdot idf_t \quad (4)$$

where
$W_{dt}$ = weight of document d with respect to word t
$tf_{dt}$ = number of occurrences of the searched word in a document
$idf_t$ = inverse document frequency, $\log(N / df)$
$N$ = total number of documents
$df$ = number of documents that contain the search term
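
As a rough illustration of equations (1) to (3), the following Python sketch computes the term frequency, the inverse document frequency, and their product for a small tokenized corpus. The function names, the toy corpus, and the base-10 logarithm are assumptions made for this example; the equations above do not state the base of the logarithm.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """Eq. (1): count of each term divided by the total token count of the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def inverse_document_frequency(corpus):
    """Eq. (2): log of (number of documents / number of documents containing the term)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    # The base-10 logarithm is an assumption of this sketch.
    return {term: math.log10(n_docs / df) for term, df in doc_freq.items()}

def tf_idf(corpus):
    """Eq. (3): per-document weights tf * idf."""
    idf = inverse_document_frequency(corpus)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
        for tokens in corpus
    ]

# Hypothetical toy corpus of already tokenized documents.
corpus = [
    ["good", "one", "good"],
    ["able", "one"],
    ["able", "good", "friend"],
]
weights = tf_idf(corpus)
```

Equation (4) describes tf as a raw count rather than a normalized frequency; replacing `count / total` with `count` in `term_frequency` gives that variant. Base 10 was chosen here because log10(1.089064319) is approximately 0.03705, which agrees with the idf value reported in the paper's sample calculation for the term 'able'.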

D. Classification

Classification is the process of finding a model or function that can explain or differentiate concepts or classes of data. The purpose of classification is to estimate the class of an object whose label is unknown. The classification process is divided into two parts, the learning phase and then the testing phase. In the learning phase, part of the data whose class is known is used to form an approximate model. Then, in the testing phase, the model is tested with other data to determine its accuracy. If the accuracy is adequate, the model can be used to predict the class of unknown data. Support Vector Machine (SVM) is one of the best methods that can be used in pattern classification problems. The SVM algorithm is used here to classify text using term index weights as features [13].

IV. RESULT AND ANALYSIS

The experimental steps are described in Fig 1. The experiment begins with preprocessing of the 8676 text records. Then, feature extraction is performed with the TF-IDF method, followed by classification with SVM as the classifier. This study produces a model for classifying a person's personality as extrovert or introvert.

Fig 1. Experimental steps

Preprocessing Result

This study uses a public dataset from Kaggle titled "Myers Briggs Type Indicator Dataset", which contains 8676 rows. The dataset has 2 columns. The first column holds one of 16 MBTI labels, which are composed of a combination of [14]:

■ Introvert (I) or Extrovert (E)
■ Sensing (S) or Intuition (N)
■ Feeling (F) or Thinking (T)
■ Perceiving (P) or Judging (J)

The second column contains posts from social media, with each row containing 50 posts separated by the mark '|||'. The posts contain several components such as links (images, videos, and websites), hashtags, emoticons, mentions, and sentences.

This research focuses on extracting the Extrovert or Introvert label. In the dataset, 1999 rows are labeled Extrovert and 6677 rows are labeled Introvert.

Fig 2. The distribution of data (Introvert, I: about 77%; Extrovert, E: about 23%)

Sample extraction of E/I from the MBTI type:
ENFP is classified in class E
INFJ is classified in class I
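
A minimal sketch of the label extraction and post splitting described in this section, assuming each row is available as a (type, posts) pair. The function names and the sample row are hypothetical; only the leading E/I letter of the MBTI type and the '|||' separator come from the dataset description.

```python
def extract_ei_label(mbti_type):
    """Return 'E' or 'I' from the first letter of a four-letter MBTI type."""
    label = mbti_type.strip().upper()[:1]
    if label not in ("E", "I"):
        raise ValueError(f"unexpected MBTI type: {mbti_type!r}")
    return label

def split_posts(posts, separator="|||"):
    """Split one row of the posts column into individual posts."""
    return [post.strip() for post in posts.split(separator)]

# Hypothetical sample row in the (type, posts) layout of the dataset.
row_type = "ENFP"
row_posts = "first post|||second post @friend|||third post :)"

label = extract_ei_label(row_type)
posts = split_posts(row_posts)
```

Reducing the sixteen types to their first letter turns the task into a binary classification problem, which is why the class counts are so uneven: the split simply inherits the share of I-types in the dataset.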


TABLE I. SAMPLE DATASET

No  Type  Posts
1   INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||http://41.media.tumblr.com/tumblr_lfouy03PMA1qa1rooo1_500.jpg|||enfp and intj moments https://www.youtube.com/watch?v=iz7lE1g4XM4 sportscenter not top ten plays https://www.youtube.com/watch?v=uCdfze1etec pranks|||...'
2   ENTP  'I'm finding the lack of me in these posts very alarming.|||Sex can be boring if it's in the same position often. For example me and my girlfriend are currently in an environment where we have to creatively use cowgirl and missionary. There isn't enough...|||Giving new meaning to 'Game' theory.|||Hello *ENTP Grin* That's all it takes. Than we converse and they do most of the flirting while I acknowledge their presence and return their words with smooth wordplay and more cheeky grins.|||...'
3   INTP  'Good one https://www.youtube.com/watch?v=fHiGbolFFGw|||Of course, to which I say I know; that's my blessing and my curse.|||Does being absolutely positive that you and your best friend could be an amazing couple count? If so, than yes. Or it's more I could be madly in love in case I reconciled my feelings (which at...|||No, I didn't; thank you for a link!|||...'

Non Word Feature Extraction

Feature extraction is performed to capture additional information about how the posts are composed, without interpreting the meaning of each component in detail. The extraction steps are:

a. Extract the label E or I from the Type column: E for the Extrovert label and I for the Introvert label.
b. Count the number of mentions with the format @[name], to count the number of individuals associated with the user.
c. Count the number of emoticons used; this only serves to describe the composition of the posts. Examples of emoticons: [:), :P, :(, ...]
d. If there is a link, it is ignored or deleted because of the difficulty of extracting meaning from the link. Example: http://www.url_name, http://www.urlname/imagename.jpg

TABLE II. SAMPLE NON-WORD FEATURE EXTRACTION RESULTS

Document  Type  count_emoticons  count_mention
1         I     0                0
2         E     3                0
3         I     3                0
4         I     0                0
5         E     0                2
6         I     0                0
7         I     0                0

Cleansing and Preprocessing

To facilitate the TF-IDF feature extraction, data cleaning is performed:

a. Delete punctuation such as ? / : @ # $ , ' ' % ^ & * ( )
b. Remove excess white space, as in "I    have"
c. Lowercase sentences, such as 'Good one' to 'good one'
d. Apply stopword removal and stemming using an English corpus to reduce the number of terms to be counted
e. Tokenize to take words from a collection of sentences, such as 'good one ...' to ['good', 'one', '...']
f. Compute TF-IDF

After all processes are performed, the TF-IDF value is calculated from the tokenized terms for feature extraction. As a sample calculation, consider the term 'able' in document 2. The term 'able' has 1 occurrence in the second document (tf), and 7966 documents (df) out of a total of 8676 documents (D) contain the term 'able', so the TF-IDF value obtained is 0.0370535295092869. This calculation is then performed for each term.

TABLE III. SAMPLE TF-IDF CALCULATION FOR TERM 'ABLE' IN DOCUMENT 2

tf  df    D     D/df         idf = log(D/df)     W = tf.idf
1   7966  8676  1.089064319  0.0370535295092869  0.0370535295092869
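
The non-word feature extraction and the cleaning steps described in this section can be sketched with regular expressions as follows. This is only an illustration: the regular expressions, the short emoticon pattern, and the tiny stopword list are assumptions made for the example rather than the patterns used in the study, and stemming is omitted.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumed emoticon pattern covering a few common forms such as :) :( :P ;D
EMOTICON_RE = re.compile(r"[:;]-?[()DPp]")
# Tiny illustrative stopword list; a real run would use a full English list.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "i"}

def non_word_features(post):
    """Count mentions and emoticons after dropping links."""
    text = URL_RE.sub(" ", post)
    return {
        "count_mention": len(MENTION_RE.findall(text)),
        "count_emoticons": len(EMOTICON_RE.findall(text)),
    }

def clean_and_tokenize(post):
    """Remove links, mentions, and emoticons, then normalize and tokenize the text."""
    text = URL_RE.sub(" ", post)
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = text.lower()                       # case folding
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation, digits, and symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse excess white space
    return [tok for tok in text.split() if tok not in STOPWORDS]

# Hypothetical post combining the components listed above.
post = "Good   one @friend :) see http://www.urlname/imagename.jpg"
features = non_word_features(post)
tokens = clean_and_tokenize(post)
```

Links are removed before counting so that characters inside a URL are not mistaken for mentions or emoticons. The counts are taken from the raw text, while the tokens come from the cleaned text, which mirrors the separation between the non-word features and the TF-IDF features.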


TABLE IV. SAMPLE TF-IDF RESULTS

Document  Type  ability               able                  absolutely            accept                accurate              across  act
1         I     0.0                   0.0                   0.0                   0.0                   0.0                   0.0     0.0
2         E     0.0                   0.037053529509286975  0.0                   0.0                   0.0                   0.0     0.0
3         I     0.13289477277047257   0.04818354752171105   0.11610724718000343   0.0                   0.0695374102627311    0.0     0.0
4         I     0.0                   0.0745434612065534    0.0                   0.04738923833432656   0.0                   0.0     0.0
5         E     0.0                   0.0                   0.0                   0.0                   0.0                   0.0     0.0
6         I     0.0                   0.029795764448288273  0.0                   0.0                   0.0                   0.0     0.0
7         I     0.0                   0.03215746805276569   0.0                   0.0                   0.0                   0.0     0.0
8         I     0.0                   0.0                   0.0                   0.0                   0.0                   0.0     0.034854145887464334
9         I     0.0                   0.0                   0.0                   0.0                   0.0                   0.0     0.055327862051235965
10        I     0.0                   0.0                   0.08093263104160807   0.0                   0.0                   0.0     0.0

This process produces high-dimensional vectors. More than 8000 terms (features) are produced by TF-IDF, and not all of them have a usable TF-IDF weight. For this reason, feature selection is performed to reduce the processing required. The selection is based on the accumulated weight of each term, and the 800 terms with the largest accumulated TF-IDF weight are kept.

Evaluation

Based on previous research, one of the good methods for this case, where the classes have unequal amounts of data, is the SVM method. For this reason, the classification of posts into Introvert and Extrovert types is done using the SVM logistic regression method. Two SVM experiments were conducted with a maximum of 1000 iterations. The first experiment used the selected TF-IDF features only, and the second used the selected TF-IDF features plus the non-word count features. The results show that the second experiment has higher accuracy and precision. This is based on the attachment of extroverts and introverts to external environmental factors. However, the improvement is not significant, and the recall in particular shows no significant difference. These values were obtained with 10-fold cross-validation.

TABLE V. RESULTS

Feature                                         accuracy          recall I  recall E  avg recall
tfidf (800 term + emot count + mention count)   84.07% +/- 0.61%  95.45%    46.07%    70.76%
tfidf (800 term)                                83.69% +/- 0.68%  94.71%    46.87%    70.79%
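
Two of the steps in this part lend themselves to a short sketch: keeping the terms with the largest accumulated TF-IDF weight, and averaging the per-class recall as in the last column of Table V. The function names and the toy weights below are hypothetical, and the tie-breaking rule is an assumption; the study does not say how ties were resolved.

```python
from collections import defaultdict

def select_top_terms(doc_weights, k):
    """Return the k terms with the largest TF-IDF weight summed over all documents."""
    totals = defaultdict(float)
    for weights in doc_weights:
        for term, weight in weights.items():
            totals[term] += weight
    # Sort by accumulated weight (descending), breaking ties alphabetically.
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:k]

def average_recall(recall_by_class):
    """Unweighted mean of the per-class recall values."""
    return sum(recall_by_class.values()) / len(recall_by_class)

# Hypothetical TF-IDF weights for three documents (term -> weight).
doc_weights = [
    {"able": 0.04, "act": 0.01},
    {"able": 0.03, "accept": 0.05},
    {"ability": 0.13, "act": 0.02},
]
top_terms = select_top_terms(doc_weights, k=2)

# Per-class recall from the first row of Table V, in percent.
avg = average_recall({"I": 95.45, "E": 46.07})
```

In the study k is 800. The averaged value of 70.76 agrees with the avg recall column, which suggests that the column is the unweighted mean of the two class recalls. Because the classes are unbalanced, this average sits well below the accuracy, which is dominated by the larger Introvert class.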


TABLE VI. RESULTS

Feature                                         precision I  precision E  avg precision
tfidf (800 term + emot count + mention count)   85.53%       75.18%       80.36%
tfidf (800 term)                                85.62%       72.64%       79.13%

Compared with previous studies that used SVM with TF-IDF, this experiment provides a better performance, with an accuracy of 84.07%. This is attributed to the feature selection and preprocessing conducted in this study.

TABLE VII. COMPARISON TO PREVIOUS RESEARCH

Feature                                                      Accuracy
Previous study (TF-IDF)                                      73.9%
TF-IDF selection feature + emoticon count + mention count    84.07%

CONCLUSION

This study shows that the selection of the right features can improve the accuracy, precision, and recall of previous MBTI classification studies, specifically in determining the Introvert and Extrovert classes. The features used are the TF-IDF features selected by highest value, in this case 800 features, together with additional information on the number of mentions of others and the number of emoticons used. In this study the SVM method works well because it is suitable for high-dimensional data.

REFERENCE

[1] D. Brinks and H. White, “Detection of Myers-Briggs Type Indicator via Text Based Computer-Mediated Communication,” pp. 53-56.
[2] V. C. Akkapon Wongkoblap, Miguel A. Vadillo, “Detecting and Treating Mental Illness on Social Networks,” p. 5090, 2017.
[3] E. Saravia, C. Chang, R. J. De Lorenzo, and Y. Chen, “MIDAS: Mental Illness Detection and Analysis via Social Media,” pp. 1418-1421, 2016.
[4] S. Bharadwaj, S. Sridhar, R. Choudhary, and R. Srinath, “Persona Traits Identification based on Myers-Briggs Type Indicator (MBTI) - A Text Classification Approach,” 2018 Int. Conf. Adv. Comput. Commun. Informatics, pp. 1076-1082, 2018.
[5] B. Y. Pratama, “Personality Classification Based on Twitter Text Using Naive Bayes, KNN and SVM,” pp. 170-174, 2015.
[6] Z. Erenel and H. Altinc, “Nonlinear transformation of term frequencies for term weighting in text categorization,” vol. 25, pp. 1505-1514, 2012.
[7] Q. Luo, E. Chen, and H. Xiong, “A semantic term weighting scheme for text categorization,” Expert Syst. Appl., vol. 38, no. 10, pp. 12708-12716, 2011.
[8] S. Song and S. Hyon, “A novel term weighting scheme based on discrimination power obtained from past retrieval results,” Inf. Process. Manag., vol. 48, no. 5, pp. 919-930, 2012.
[9] V. N. Patodkar and I. R. Sheikh, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining,” vol. 5, no. 12, pp. 320-322, 2016.
[10] S. Mohammad and A. Intelligence, “A Novel Text Mining Approach Based on TF-IDF and Support Vector Machine for News,” no. March, pp. 16-20, 2016.
[11] T. H. C. L. K. Yang and S. Wang, “Using TF-IDF to hide sensitive itemsets,” pp. 502-510, 2013.
[12] K. Chen, Z. Zhang, J. Long, and H. Zhang, “Turning from TF-IDF to TF-IGM for term weighting in text classification,” Expert Syst. Appl., vol. 66, pp. 245-260, 2016.
[13] J. Lilleberg, “Support Vector Machines and Word2vec for Text Classification with Semantic Features,” pp. 136-140, 2015.
[14] M. Komisin and C. Guinn, “Identifying Personality Types Using Document Classification Methods,” pp. 232-237, 1999.
