Special Issue Reprint
Computational Linguistics and Natural Language Processing
Edited by
Peter Z. Revesz
mdpi.com/journal/information
Computational Linguistics and Natural Language Processing
Editor
Peter Z. Revesz
Basel • Beijing • Wuhan • Barcelona • Belgrade • Novi Sad • Cluj • Manchester
Editor
Peter Z. Revesz
Department of Classics and Religious Studies, College of Arts and Sciences, and School of Computing, College of Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland
This is a reprint of articles from the Special Issue published online in the open access journal Information (ISSN 2078-2489), available at: https://www.mdpi.com/journal/information/special_issues/Computational_CLNLP2021.
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
Lastname, A.A.; Lastname, B.B. Article Title. Journal Name Year, Volume Number, Page Range.
ISBN 978-3-7258-1369-8 (Hbk)
ISBN 978-3-7258-1370-4 (PDF)
doi.org/10.3390/books978-3-7258-1370-4
© 2024 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license. The book as a whole is distributed by MDPI under the terms
and conditions of the Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND)
license.
Contents
About the Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Peter Z. Revesz
Preface to the Special Issue on Computational Linguistics and Natural Language Processing
Reprinted from: Information 2024, 15, 281, doi:10.3390/info15050281 . . . . . . . . . . . . . . . . . 1
Sudheesh R, Muhammad Mujahid, Furqan Rustam, Rahman Shafique, Venkata Chunduri,
Mónica Gracia Villar, et al.
Analyzing Sentiments Regarding ChatGPT Using Novel BERT: A Machine Learning Approach
Reprinted from: Information 2023, 14, 474, doi:10.3390/info14090474 . . . . . . . . . . . . . . . . . 5
Rodolfo Delmonte
Computing the Sound–Sense Harmony: A Case Study of William Shakespeare’s Sonnets and
Francis Webb’s Most Popular Poems
Reprinted from: Information 2023, 14, 576, doi:10.3390/info14100576 . . . . . . . . . . . . . . . . . 34
Robert Gorman
Morphosyntactic Annotation in Literary Stylometry
Reprinted from: Information 2024, 15, 211, doi:10.3390/info15040211 . . . . . . . . . . . . . . . . . 75
Mohamed Hesham Ibrahim Abdalla, Simon Malberg, Daryna Dementieva, Edoardo Mosca
and Georg Groh
A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers
Reprinted from: Information 2023, 14, 522, doi:10.3390/info14100522 . . . . . . . . . . . . . . . . . 93
Ioana-Raluca Zaman and Stefan Trausan-Matu
A Survey on Using Linguistic Markers for Diagnosing Neuropsychiatric Disorders with
Artificial Intelligence
Reprinted from: Information 2024, 15, 123, doi:10.3390/info15030123 . . . . . . . . . . . . . . . . . 126
Akshay Mendhakar
Linguistic Profiling of Text Genres: An Exploration of Fictional vs. Non-Fictional Texts
Reprinted from: Information 2022, 13, 357, doi:10.3390/info13080357 . . . . . . . . . . . . . . . . . 141
Vinto Gujjar, Neeru Mago, Raj Kumari, Shrikant Patel, Nalini Chintalapudi and
Gopi Battineni
A Literature Survey on Word Sense Disambiguation for the Hindi Language
Reprinted from: Information 2023, 14, 495, doi:10.3390/info14090495 . . . . . . . . . . . . . . . . . 158
Vincenzo Manca
Agile Logical Semantics for Natural Languages
Reprinted from: Information 2024, 15, 64, doi:10.3390/info15010064 . . . . . . . . . . . . . . . . . 183
Mateusz Łabędzki and Olgierd Unold
D0L-System Inference from a Single Sequence with a Genetic Algorithm
Reprinted from: Information 2023, 14, 343, doi:10.3390/info14060343 . . . . . . . . . . . . . . . . . 199
Aaradh Nepal and Francesco Perono Cacciafoco
Minoan Cryptanalysis: Computational Approaches to Deciphering Linear A and Assessing Its
Connections with Language Families from the Mediterranean and the Black Sea Areas
Reprinted from: Information 2024, 15, 73, doi:10.3390/info15020073 . . . . . . . . . . . . . . . . . 215
Peter Z. Revesz and Géza Varga
A Proposed Translation of an Altai Mountain Inscription Presumed to Be from the 7th
Century BC
Reprinted from: Information 2022, 13, 243, doi:10.3390/info13050243 . . . . . . . . . . . . . . . . . 228
Peter Z. Revesz
Decipherment Challenges Due to Tamga and Letter Mix-Ups in an Old Hungarian Runic
Inscription from the Altai Mountains
Reprinted from: Information 2022, 13, 422, doi:10.3390/info13090422 . . . . . . . . . . . . . . . . . 237
About the Editor
Peter Z. Revesz
Peter Z. Revesz earned a B.S. summa cum laude with a double major in Computer Science
and Mathematics from Tulane University and a Ph.D. in Computer Science from Brown University
(dissertation title: Constraint Query Languages; advisor: Paris C. Kanellakis). He was a postdoctoral
fellow at the University of Toronto before joining the University of Nebraska-Lincoln, where he
is a professor in the School of Computing and the Department of Classics and Religious Studies.
He is an expert in computational linguistics, databases, bioinformatics, and geoinformatics. He
is the author of Introduction to Databases: From Biological to Spatio-Temporal (Springer, 2010) and
Introduction to Constraint Databases (Springer, 2002). Dr. Revesz has held visiting appointments at
the IBM T.J. Watson Research Center, INRIA, the Max Planck Institute for Computer Science, the
University of Athens, the University of Hasselt, the University of Helsinki, the U.S. Air Force Office
of Scientific Research, and the U.S. Department of State. He is a recipient of an AAAS Science
& Technology Policy Fellowship, a J. William Fulbright Scholarship, an Alexander von Humboldt
Research Fellowship, a Jefferson Science Fellowship, a National Science Foundation CAREER award,
and a Faculty International Scholar of the Year award by Phi Beta Delta, the Honor Society for
International Scholars.
Preface
The aim of this Special Issue is to present the latest research on computational linguistics
and natural language processing, especially in the areas of sentiment analysis, linguistic profiling,
higher-order logical representation, and computational methods for the decipherment of various
scripts. The authors are leading experts in these areas from various countries and continents. Readers can apply these ideas to various applications, such as measuring the sentiments of customers; characterizing, or sometimes even identifying, the likely authors of anonymous texts; aiding the diagnosis of neuropsychiatric diseases; improving communication with chatbots; and deciphering encrypted texts. The cutting-edge ideas and open problems presented in this Special Issue will spark additional ideas among researchers who would like to explore novel topics in computational linguistics and natural language processing.
Peter Z. Revesz
Editor
Editorial
Preface to the Special Issue on Computational Linguistics and
Natural Language Processing
Peter Z. Revesz
Department of Classics and Religious Studies, College of Arts and Sciences, and School of Computing, College of Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA; [email protected]
Citation: Revesz, P.Z. Preface to the Special Issue on Computational Linguistics and Natural Language Processing. Information 2024, 15, 281. https://doi.org/10.3390/info15050281

Received: 28 April 2024; Accepted: 7 May 2024; Published: 15 May 2024

Copyright: © 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Computational linguistics and natural language processing are at the heart of the AI
revolution that is currently transforming our lives. We are witnessing the growth of these
areas to the point that intelligent, talking robots can now perform many jobs that humans
used to do. We encounter robots in an increasing number of situations. For example, it is becoming common to see robots answer customer inquiries at call centers, replace cashiers with automated talking checkout machines in stores, look up a given address on an online map, plan a path, and then autonomously navigate a car to its intended destination, assemble complex products in factories while making decisions according to the particularities of supply and workflow demands, and monitor access to buildings, sounding an alarm if a dangerous situation develops. Robots have become such an integral part of our daily lives that, while using the internet, people are regularly required to pass the “I am not a robot” test. This ambiguity, which stems from the difficulty of distinguishing humans from robots, shows that robots have acquired the capacity to replace humans.
In addition, intelligent, talking robots are increasingly used even in situations and places that are not visible to most citizens. These include various security systems, in which robots analyze online conversations and chats. Some security robots also make decisions about potential dangers, such as possible illegal drug smuggling or acts of violence. Coupled with sophisticated drone systems, intelligent robots can assist humans by carrying out missions that require flying or operating in outer space.
All of these machines need some way to communicate their contributions back to humans for the advancement of human civilization. This is often best done using human language, either written or spoken. Thus, building useful robots entails making them capable of using and analyzing human language. Therefore, the study of computational linguistics and natural language processing is a foundational part of the AI revolution presently unfolding in our midst. It is in the light of these sentiments that this Special Issue on computational linguistics and natural language processing was called for, written, and assembled.
There is no doubt that computational linguistics and natural language processing not only facilitate major technological transformations but also influence social transformations. We increasingly live in what, a few decades ago, would have been termed a sci-fi world. These transformations come with certain challenges, but those challenges need not be feared, because the opportunities outweigh the downsides for humanity. That is also the overall message of the paper “Analyzing Sentiments Regarding ChatGPT Using Novel BERT: A Machine Learning Approach” by Sudheesh R. et al. [1]. This Special Issue is based on selected papers from the International Conference on Computational Linguistics and Natural Language Processing, held in Beijing in December 2021; we also invited additional papers on the same topics, and all submissions underwent a rigorous review process.
Linguistic Profiling
Many of the papers proposed novel methods for the linguistic profiling and categorization of texts. The paper “Computing the Sound–Sense Harmony: A Case Study of William Shakespeare’s Sonnets and Francis Webb’s Most Popular Poems” by Delmonte [2] proposes a novel sentiment analysis metric. The main idea is that sounds have presumed meanings; for example, they can sound happy or sad. The actual text of a poem also expresses a meaning with a happy or sad connotation. Delmonte’s sentiment analysis shows that both Shakespeare and Webb carefully chose their words so that the presumed meaning of the words’ sounds and the sentences’ explicit meaning are in harmony, or sometimes in disharmony when irony was intended.
The paper “Morphosyntactic Annotation in Literary Stylometry” by Gorman [3] provides a sophisticated stylometric analysis of texts. This computational analysis uses morphosyntactic annotations, such as the frequency of various pronouns, verbal cases, and the ordering of various elements of the sentence. Interestingly, famous authors are shown to have a unique style, because they can be identified by their stylometric profile with very high probability. Thus, given a short quotation from one of an author’s books, the author of their other books can be identified almost 100 percent correctly. For example, given a quotation from the book Oliver Twist by Charles Dickens, we can identify that he also wrote A Christmas Carol because of the stylometric similarities between the two novels.
The above papers raise the prospect of intelligent robots imitating famous writers simply by adhering to the sentiment analysis measures and stylometric profiles described in the two previous papers. The good news is that the paper “A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers” by Abdalla et al. [4] proposes a practical approach to distinguishing between human-written and machine-generated texts. While the paper focuses on identifying fake scientific papers, many of the proposed techniques can be applied to other types of texts, too.
The paper “A Survey on Using Linguistic Markers for Diagnosing Neuropsychiatric
Disorders with Artificial Intelligence” by Zaman and Trausan-Matu [5] is also concerned
with the categorization of spoken language and written text. The goal of their paper is to aid
medical diagnoses by identifying the linguistic markers, including sentiment analysis
and stylometric measures, that characterize various mental illnesses. The paper also
provides a comprehensive review of this growing subject with an extensive bibliography.
The paper “Linguistic Profiling of Text Genres: An Exploration of Fictional vs. Non-Fictional Texts” by Mendhakar [6] applies linguistic profiling to distinguish between texts
that describe fiction versus those that describe real events. Various types of fiction are also
categorized, such as fables, myths, mystery, romance, thriller, legends, and science fiction.
Non-fiction works are also more finely divided into discussions, explanations, instructions,
and persuasions.
The paper “A Literature Survey on Word Sense Disambiguation for the Hindi Language” by Gujjar et al. [7] focuses on the process of determining the exact context-specific meanings of ambiguous words; for example, the English word bark can mean the sound emitted by dogs, the outer sheath of a tree trunk, or a kind of ship. While many natural language processing techniques have been developed for the disambiguation of English words, the disambiguation of Hindi words sometimes requires the language-specific algorithms that are reviewed in this paper.
Higher-Order Logical Representations and Methods
Some papers were concerned with the internal computer representation of texts and images. The paper “Agile Logical Semantics for Natural Languages” by Manca [8] introduces predicate abstraction as a new operator, which is argued to be a natural operator when some form of monadic higher-order logic is used to express the semantics of linguistic structures. A possible application of predicate abstraction could be to teach more abstract logical thinking to chatbots such as ChatGPT. For example, the author details a conversation in which ChatGPT was able to learn the predicate abstraction that “Go” and “Goes” represent the same predicate in different grammatical forms.
Context-free Lindenmayer systems (D0L systems) have been used to describe the generation of growing plants, cities, and fractals, among other applications. Context-free Lindenmayer systems can be viewed as an extension of context-free grammars in which the rewriting rules include commands such as ‘draw a line’, ‘turn by a specific angle’, and ‘go to a specific position’. They form a special type of language that is studied in the paper “D0L-System Inference from a Single Sequence with a Genetic Algorithm” by Łabędzki and Unold [9]. The aim of this paper is essentially to reverse engineer a context-free Lindenmayer system: given an image that was generated by such a system, find its grammar. The authors demonstrate that their genetic algorithm finds satisfactory solutions on various types of images, such as binary trees, Barnsley ferns, Koch snowflakes, and Sierpiński triangles.
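To make the object of inference concrete, the sketch below performs the forward derivation that such a system defines, rewriting every symbol in parallel at each step; the Koch-curve rule is a standard textbook example and an assumption here, not one of the systems from the paper.

```python
# A minimal sketch of forward derivation in a D0L system (deterministic,
# context-free Lindenmayer system). The Koch-curve rule is a textbook
# example, not a system inferred by Łabędzki and Unold.

def derive(axiom: str, rules: dict, steps: int) -> str:
    """Rewrite all symbols in parallel `steps` times; symbols without
    a rule are copied unchanged (identity rule)."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Turtle interpretation: 'F' = draw a line, '+'/'-' = turn by a fixed angle.
koch_rules = {"F": "F+F-F-F+F"}
print(derive("F", koch_rules, 1))  # F+F-F-F+F
print(derive("F", koch_rules, 2))  # every F above expands again
```

The inference problem studied in the paper runs in the opposite direction: given a derived sequence, or the image it draws, a genetic algorithm searches the space of rule sets for one whose derivations reproduce it.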
Deciphering Scripts
A group of papers were concerned with the problem of deciphering inscriptions. The
paper “Minoan Cryptanalysis: Computational Approaches to Deciphering Linear A and
Assessing its Connections with Language Families from the Mediterranean and the Black
Sea Areas” by Nepal and Perono Cacciafoco [10] provides a linguistic analysis of Linear A
inscriptions, which were written by Minoan scribes during the Bronze Age, mainly on the
island of Crete. The linguistic analysis uses the feature-based comparison of signs method
introduced in Revesz [11]. However, in their study, Nepal and Perono Cacciafoco [10] obtain
slightly different sign matches between Linear A signs, Carian alphabet letters, and Cypriot
syllabary signs than Revesz [11] obtained, because they use different weights for the various
features. The matches provide candidate phonetic values for Linear A signs. This allows
for a phonetic transcription of Linear A inscriptions to be carried out.
Next, they applied a linguistic analysis of the Linear A inscriptions by finding possible
words from the following languages: Ancient Egyptian, Hittite, Luwian, Proto-Celtic, and
Proto-Uralic. The latter two languages were chosen because they may have been spoken
on the coastal areas of the Black Sea, which has been shown to be the likely source of some
Minoans [12,13]. The analysis yielded eight Ancient Egyptian, nine Hittite, seven Luwian,
eleven Proto-Celtic, and twelve Proto-Uralic words as good matches with the Linear A
inscriptions. While the analysis of Nepal and Perono Cacciafoco [10] is inconclusive in
deciding the underlying language of the Linear A inscriptions, it nicely demonstrates that
it was premature of many earlier authors to focus their attention only on Mediterranean
languages, ignoring the fact that the Bosporus and Dardanelles Straits enable easy sailing
between the Aegean Sea and the Black Sea areas. The analysis of Nepal and Perono
Cacciafoco [10] is compatible with Revesz [11], which provided a translation of twenty-eight Linear A inscriptions into a Uralic language.
The paper “A Proposed Translation of an Altai Mountain Inscription Presumed to Be
from the 7th Century BC” by Revesz and Varga [14] originated when someone brought to
our attention an inscription from a book by Karžaubaj Sartkožauly, who is a member of the
Kazakhstan Academy of Sciences. Sartkožauly presumed the inscription to be from the 7th
century BC, and this was also our initial assumption. However, we became increasingly
suspicious about the dating of the inscription during our decipherment work. For example, the inscription used a woman’s personal name, Enikő, which was coined only in the 19th century, although it became popular afterwards. This paper also proposed two different solutions because one part of the inscription is ambiguous.
After the study [14] was published, it received great publicity and became the subject of a popular YouTube video, which happened to be watched by the scribe, Peter Kun, who admitted in a comment below the video that he had written the inscription while visiting the Altai Mountains as a young man. That was a fascinating turn of events, because there is no other known case in which a scribe practically “came alive” to judge the correctness of the decipherment of a presumably ancient inscription.
The paper “Decipherment Challenges due to Tamga and Letter Mix-Ups in an Old Hungarian Runic Inscription from the Altai Mountains” [15] came as a natural follow-up to [14] after contacting Peter Kun. He provided a fascinating explanation of the cryptic section of his inscription, while verifying the correctness of the rest of our decipherment in [14]. I thought that readers deserved to learn about the entire correct decipherment. In addition, ref. [15] also provides a mathematical analysis of how Peter Kun mixed up some visually similar signs. Since such mix-ups are frequent in other inscriptions too, this mathematical analysis may benefit other scholars who are working on the decipherment of ancient scripts.
Finally, I would like to thank the many reviewers who reviewed the papers submitted to this Special Issue. I would also like to thank Janessy Zhan, Section Managing Editor at MDPI for this Special Issue, for her outstanding help in every aspect of its organization, including arranging independent reviewers for my own contributions. I am also grateful to all parties that contributed to this Special Issue. It was great
to work with such a talented group of authors, and I wish them much success in their
future research.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Sudheesh, R.; Mujahid, M.; Rustam, F.; Shafique, R.; Chunduri, V.; Villar, M.G.; Ballester, J.B.; de la Torre Diez, I.; Ashraf, I. Analyzing Sentiments Regarding ChatGPT Using Novel BERT: A Machine Learning Approach. Information 2023, 14, 474.
2. Delmonte, R. Computing the Sound–Sense Harmony: A Case Study of William Shakespeare’s Sonnets and Francis Webb’s Most Popular Poems. Information 2023, 14, 576. [CrossRef]
3. Gorman, R. Morphosyntactic Annotation in Literary Stylometry. Information 2024, 15, 211. [CrossRef]
4. Abdalla, M.H.I.; Malberg, S.; Dementieva, D.; Mosca, E.; Groh, G. A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers. Information 2023, 14, 522. [CrossRef]
5. Zaman, I.-R.; Trausan-Matu, S. A Survey on Using Linguistic Markers for Diagnosing Neuropsychiatric Disorders with Artificial Intelligence. Information 2024, 15, 123. [CrossRef]
6. Mendhakar, A. Linguistic Profiling of Text Genres: An Exploration of Fictional vs. Non-Fictional Texts. Information 2022, 13, 357. [CrossRef]
7. Gujjar, V.; Mago, N.; Kumari, R.; Patel, S.; Chintalapudi, N.; Battineni, G. A Literature Survey on Word Sense Disambiguation for the Hindi Language. Information 2023, 14, 495. [CrossRef]
8. Manca, V. Agile Logical Semantics for Natural Languages. Information 2024, 15, 64. [CrossRef]
9. Łabędzki, M.; Unold, O. D0L-System Inference from a Single Sequence with a Genetic Algorithm. Information 2023, 14, 343. [CrossRef]
10. Nepal, A.; Perono Cacciafoco, F. Minoan Cryptanalysis: Computational Approaches to Deciphering Linear A and Assessing Its Connections with Language Families from the Mediterranean and the Black Sea Areas. Information 2024, 15, 73. [CrossRef]
11. Revesz, P.Z. Establishing the West-Ugric Language Family with Minoan, Hattic and Hungarian by a Decipherment of Linear A. WSEAS Trans. Inf. Sci. Appl. 2017, 14, 306–335.
12. Revesz, P.Z. Data Mining Autosomal Archaeogenetic Data to Determine Minoan Origins. In Proceedings of the 25th International Database Engineering & Applications Symposium, New York, NY, USA, 7 September 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 46–55.
13. Revesz, P.Z. Minoan archaeogenetic data mining reveals Danube Basin and western Black Sea littoral origin. Int. J. Biol. Biomed. Eng. 2019, 13, 108–120.
14. Revesz, P.Z.; Varga, G. A Proposed Translation of an Altai Mountain Inscription Presumed to Be from the 7th Century BC. Information 2022, 13, 243. [CrossRef]
15. Revesz, P.Z. Decipherment Challenges Due to Tamga and Letter Mix-Ups in an Old Hungarian Runic Inscription from the Altai Mountains. Information 2022, 13, 422. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
Analyzing Sentiments Regarding ChatGPT Using Novel BERT:
A Machine Learning Approach
Sudheesh R 1,† , Muhammad Mujahid 2,† , Furqan Rustam 3,† , Rahman Shafique 4 , Venkata Chunduri 5 ,
Mónica Gracia Villar 6,7,8 , Julién Brito Ballester 6,9,10 , Isabel de la Torre Diez 11, * and Imran Ashraf 4, *
1 Kodiyattu Veedu, Kollam, Valakom 691532, India; [email protected]
2 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan; [email protected]
3 School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland; [email protected]
4 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea; [email protected]
5 Indiana State University, Terre Haute, IN 47809, USA; [email protected]
6 Faculty of Social Science and Humanities, Universidad Europea del Atlántico, Isabel Torres 21, 39011 Santander, Spain; [email protected] (M.G.V.); [email protected] (J.B.B.)
7 Department of Project Management, Universidad Internacional Iberoamericana Arecibo, Puerto Rico, PR 00613, USA
8 Department of Extension, Universidade Internacional do Cuanza, Cuito EN250, Bié, Angola
9 Universidad Internacional Iberoamericana, Campeche 24560, Mexico
10 Universitaria Internacional de Colombia, Bogotá 11001, Colombia
11 Department of Signal Theory, Communications and Telematics Engineering, University of Valladolid, Paseo de Belén, 15, 47011 Valladolid, Spain
* Correspondence: [email protected] (I.d.l.T.D.); [email protected] (I.A.)
† These authors contributed equally to this work.

Citation: R, S.; Mujahid, M.; Rustam, F.; Shafique, R.; Chunduri, V.; Villar, M.G.; Ballester, J.B.; Diez, I.d.l.T.; Ashraf, I. Analyzing Sentiments Regarding ChatGPT Using Novel BERT: A Machine Learning Approach. Information 2023, 14, 474. https://doi.org/10.3390/info14090474

Academic Editor: Peter Revesz

Received: 11 July 2023; Revised: 15 August 2023; Accepted: 19 August 2023; Published: 25 August 2023
Abstract: Chatbots are AI-powered programs designed to replicate human conversation. They are
capable of performing a wide range of tasks, including answering questions, offering directions,
controlling smart home thermostats, and playing music, among other functions. ChatGPT is a
popular AI-based chatbot that generates meaningful responses to queries, aiding people in learning.
While some individuals support ChatGPT, others view it as a disruptive tool in the field of education.
Discussions about this tool can be found across different social media platforms. Analyzing the
sentiment of such social media data, which comprises people’s opinions, is crucial for assessing
public sentiment regarding the success and shortcomings of such tools. This study performs a
sentiment analysis and topic modeling on ChatGPT-based tweets. The ChatGPT-based tweets were extracted by the authors from Twitter using ChatGPT hashtags; in these tweets, users share their reviews and opinions about ChatGPT, providing a reference to the thoughts expressed by users. The Latent Dirichlet Allocation (LDA) approach is employed to identify the most frequently
discussed topics in relation to ChatGPT tweets. For the sentiment analysis, a deep transformer-based
Bidirectional Encoder Representations from Transformers (BERT) model with three dense layers
of neural networks is proposed. Additionally, machine and deep learning models with fine-tuned
parameters are utilized for a comparative analysis. Experimental results demonstrate the superior
performance of the proposed BERT model, achieving an accuracy of 96.49%.
Keywords: ChatGPT; sentiment analysis; BERT; machine learning; LDA; app reviewers; deep learning
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

AI-based chatbots, powered by natural language processing (NLP), are computer programs designed to simulate human interactions by understanding speech and generating human-like responses [1]. They have gained popularity across various industries as a
tool to enhance digital experiences. The utilization of chatbots is experiencing continuous
growth, with predictions indicating that the chatbot industry is expected to reach a market
size of $3.62 billion by 2030, accompanied by an annual growth rate of 23.9% [2]. Additionally, the chatbot market is expected to reach approximately 1.25 billion U.S. dollars
by 2025 [3]. The adoption of chatbots in sectors such as education, healthcare, banking,
and retail is estimated to save around $11 billion annually by 2023 [4]. With recent developments in the field of education in particular, chatbots have the potential to significantly enhance the learning experience for students.
ChatGPT, an AI-based chatbot that is currently gaining attention, is being discussed
widely across various platforms [5–7]. It has become a prominent topic of conversation
due to its ability to provide personalized support and guidance to students, contributing to improved academic performance. Developed by OpenAI, ChatGPT utilizes advanced language generation techniques based on GPT language model technology [8].
Its impressive capabilities in generating coherent and contextually relevant responses
have captivated individuals, communities, and social media platforms. The widespread
discussions surrounding ChatGPT highlight its significant impact on natural language
processing and artificial intelligence, and its potential to revolutionize our interactions
with AI systems. People are fascinated by its usefulness in various domains including
learning, entertainment, and problem-solving, which further contributes to its popularity
and widespread adoption.
While there are many advantages to using ChatGPT, there are also some notable
disadvantages and criticisms of the AI chatbot. Some raised concerns include the potential
for academic dishonesty, as ChatGPT could be used as a tool for cheating in educational
settings, similar to using search engines like Google [9]. There is also a concern that
ChatGPT may perpetuate biases when used in research, as the language model is trained
on large amounts of data that may contain biased information [9]. Another topic of
discussion revolves around the potential impact of ChatGPT on students’ critical thinking
and creativity. Some argue that an over-reliance on ChatGPT may lead to a decline in
these important skills among students [10]. Additionally, the impact of ChatGPT on the
online education business has been evident, as seen in the case of Chegg Inc., where the
rise of ChatGPT contributed to a significant decline of 47% in the company’s shares during
early trading [11]. To gain insights into people’s perceptions of ChatGPT, opinion mining was conducted using social media data. This analysis aimed to understand the general sentiment and opinions surrounding the use of ChatGPT in various contexts: people tweet their thoughts about ChatGPT on Twitter, and these tweets can provide valuable information.
Opinion mining involves evaluating individuals’ perspectives, attitudes, evaluations,
and emotions towards various objects including products, services, companies, individuals,
events, topics, occurrences, and applications, along with their attributes. When making
decisions, we often seek the opinions of others, whether as individuals or organizations.
Sentiment analysis tools have found application in diverse social and corporate contexts [12]. Social media platforms, microblogging sites, and app stores serve as rich sources of openly expressed opinions and discussions, making them valuable for sentiment analysis [13]. Sentiment analysis employs NLP, text analysis, and computational methods such as machine learning and data mining to automate the categorization of sentiments based on feedback and reviews [14]. The sentiment analysis process involves identifying sentiment in reviews, selecting relevant features, and performing sentiment classification to determine polarity.
1.1. Research Questions
To meet the objective of this study, analyzing people’s attitudes toward ChatGPT, the following research questions (RQs) are formulated:
i. RQ1: What are people’s sentiments about ChatGPT technology?
ii. RQ2: Which classification model, among the proposed transformer-based, machine learning-based, and deep learning-based models, is most effective for analyzing sentiments in ChatGPT tweets?
iii. RQ3: What are the impacts of ChatGPT on student learning?
iv. RQ4: What role does topic modeling play in the sentiment analysis of social media tweets?
1.2. Contributions
The sentiment analysis of tweets regarding ChatGPT aims at capturing users’ perceptions of ChatGPT and analyzing the ratio of positive and negative comments from users. In addition, a topic analysis can provide insights into frequently discussed topics concerning ChatGPT and provide feedback to further improve its functionality. In particular, the following contributions are made:
• This study aims to analyze people’s perceptions of the trending topic of ChatGPT worldwide. The research contributes by collecting relevant data and examining the sentiments expressed by individuals toward this significant development.
• Tweets related to ChatGPT are collected utilizing the Tweepy application programming interface (API) and various keywords. The collected tweets undergo preprocessing and annotation using TextBlob and the Valence Aware Dictionary and sEntiment Reasoner (VADER). The bag-of-words (BoW) feature engineering technique is employed to extract essential features.
• A deep transformer-based BERT model is proposed for the sentiment analysis. It consists of three dense layers of neural networks for enhanced performance. Additionally, machine learning and deep learning models with fine-tuned parameters are utilized for comparison purposes. Notably, this study is the first to investigate raw ChatGPT tweets using transformers.
• The study utilizes the latent Dirichlet allocation (LDA) approach to extract highly discussed topics from the dataset of ChatGPT tweets. This analysis provides valuable insights into the frequently discussed themes and subjects (a minimal illustration follows this list).
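The following sketch shows what LDA topic extraction over a BoW representation can look like; the placeholder corpus, the scikit-learn implementation, and the parameter choices are illustrative assumptions, not the study’s exact configuration.

```python
# Hedged sketch: LDA topic extraction over BoW counts with scikit-learn.
# Corpus, topic count, and top-word count are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

clean_tweets = [
    "asked chatgpt write story instalment giving octopus name",
    "chatgpt taking web storm feel free test",
    "people weaponizing powerful tool like chatgpt days launch",
]  # placeholder corpus; the study uses ~21k preprocessed tweets

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(clean_tweets)  # BoW count matrix

lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term)

# Print the highest-weight words of each learned topic.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top_words)}")
```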
The remaining sections of the paper are structured as follows: Section 2 provides a comprehensive review of relevant research on sentiment analysis, offering valuable background for the proposed approach. Section 3 presents a detailed description of the proposed approach. Section 4 presents and discusses the experimental results obtained from the analysis. Finally, Section 5 concludes the study, summarizing the key findings and suggesting potential directions for future research.
2. Related Work
The analysis of reviews has gained significant attention in recent years, mainly due
to the widespread use of social media platforms. These platforms serve as a hub for
discussions on various topics, providing researchers with valuable insights and information.
For instance, in a study conducted by Lee et al. [15], social media data were utilized to
investigate the Taliban’s control over Afghanistan. By analyzing the discussions and
conversations on social media, the study aimed to gain a deeper understanding of the
situation. Similarly, the study by Lee et al. [16] focused on extracting tweets related to
racism to shed light on the issue of racism in the workplace. By analyzing these tweets,
the researchers aimed to uncover patterns and gain insights into the prevalence and nature
of racism in professional environments. They utilized Twitter data and annotated it with
the TextBlob approach. The authors attained 72% accuracy for the racism classification.
In a different context, Mujahid et al. [17] conducted a study on public opinion about online
education during the COVID-19 pandemic. By analyzing social media data, the researchers
aimed to understand the sentiment and perceptions surrounding online education during
this challenging time. These studies highlight the significance of a social media data
analysis in extracting meaningful information and gaining insights into various subjects.
By harnessing the vast amount of discussions and conversations on social media platforms,
researchers can delve into important topics and uncover valuable findings. The researchers employed 17,155 tweets for the analysis and attained 95% accuracy using the SMOTE technique with bag-of-words features and the SVM model.
ChatGPT is a hot topic nowadays, and exploring people’s perceptions of it using Twitter data can provide valuable insights. Many studies have previously performed such analyses on other topics. In the study conducted by Tran et al. [18], the focus was on
examining consumer sentiments towards chatbots in various retail sectors and investigating
the impact of chatbots on their sentiments and expectations regarding interactions with
human agents. Through the application of the automated sentiment analysis, it was
observed that the general sentiment towards chatbots is more positive compared to that
towards human agents in online settings. They collected a limited dataset of 8190 tweets and used ANCOVA for testing. They only classified the tweets into sentiment categories and did not report performance metrics such as accuracy. Additionally, sentiments varied across sectors, such as fashion and telecommunications, with the implementation of chatbots resulting in more negative sentiments towards human agents in both sectors.
The study [19] aimed to develop an effective system for analyzing and extracting sentiments and mental health indicators during the COVID-19 pandemic. Utilizing a vast amount of data and leveraging hashtags, the authors employed the BERT machine learning algorithm to classify customer perspectives into positive and negative sentiments with high accuracy. Ensuring user privacy, their main objective was to facilitate self-understanding and the regulation of mental states through end-to-end encrypted user-bot interactions. The researchers achieved 95.6% accuracy and 95% recall for automated sentiment classification related to chatbots.
Some studies, such as [20], focus on a sentiment analysis of disaster-related tweets
at different time intervals for specific locations. By using the LSTM network with word
embedding, keywords are derived from the tweet history and context. The proposed
algorithm, RASA, classifies tweets and identifies sentiment scores for each location. RASA
outperforms other algorithms, aiding the government in post-disaster management by
providing valuable insights and preventive measures. Another study [21] attempts to predict cryptocurrency prices using Twitter data, focusing on sentiment analysis and emotion detection in cryptocurrency-related tweets. An ensemble model, LSTM-GRU, combines LSTM and GRU to enhance the accuracy of the analysis. Multiple features and models, including machine learning and deep learning, are examined. Results reveal a predominance of positive sentiment, with fear and surprise also prominent among the emotions. The dataset consists of five emotions extracted from Twitter. The proposed ensemble model achieves 83% accuracy on a balanced dataset for emotion prediction. This research provides valuable insights into the public perception of cryptocurrency and its market implications.
Additionally, it is often the case that a service provider asks for feedback regarding the quality of, or satisfaction with, its services or products via an online customer feedback form, frequently through a social media platform [22]. Such assessments are critical in determining the quality of services and products. However, it is necessary to examine users’ views and impressions. Negative sentiment ratings, in particular, include more relevant recommendations for enhancing the quality of a product or service. Given the significance of text analysis, there is a huge amount of work on sentiment analysis. For example, studies [23–25] classify app reviews using machine learning and deep learning models. Another piece of research [26] looked at Shopify app reviews and classified them as satisfied or dissatisfied. For sentiment classification, many feature extraction approaches are used in conjunction with supervised machine learning algorithms. For the experiments, 12,760 samples of app reviews were utilized with machine learning, and different hybrid approaches to combining the features were used to enhance the performance; LR performed best, with 83% accuracy and an 86% F1 score. The performance of machine learning models in sentiment analysis can be influenced by the techniques used for feature engineering. Research studies [27,28] indicate that altering the feature engineering process can result in
changes to the models’ performance. The research [29] provides a method for categorizing and evaluating employee reviews; it employs an ETC with BoW features. The study classified employee reviews using both numerical and text elements and achieved 100% and 79% accuracy, respectively. Ref. [30] used NB in conjunction with RF and SVM to categorize mobile app reviews from the Google Play Store. The researchers collected over 90,000 reviews posted in English for 10 applications available on the Google Play Store, of which 7500 reviews were annotated and used in the final experiments. The results indicated that a baseline 10-fold validation yielded an accuracy of 90.8%; additionally, the precision was 88%, the recall was 91%, and the F1 score was 89%. Ref. [31] also used an RF algorithm to identify the variables that distinguish reviews originating from different nations. The research [32] looked at retail applications in Bangladesh. The authors gathered data from Google Play and utilized VADER and AFINN to annotate sentiments. For sentiment categorization, several machine learning models are employed, and RF outperforms the others with substantial accuracy. Bello et al. [33] proposed a BERT model
for a sentiment analysis on Twitter data. The authors used the BERT model with different
variants including the recurrent neural network (RNN) and Bi-long short-term memory
(BILSTM) for classification. Catelli et al. [34] and Patel et al. [35] also employed the BERT
model for a sentiment analysis on app reviews with lexicon-based approaches.
The study [36] presented a hybrid approach for the sentiment analysis of ChatGPT tweets. Raw tweets were transformed into structured and normalized forms to improve the accuracy of the model and lower its computational complexity. To classify ChatGPT tweets, the authors developed hybrid models: while single state-of-the-art models may fail to provide correct predictions, hybrid models combine multiple models to reduce bias, improve overall outcomes, and make more precise predictions. Bonifazi et al. [37] proposed a framework for determining the spatial and spatio-temporal extent of a user’s sentiment regarding a topic on a social network. First, the authors introduced the central idea of their research, observing that it generalizes a number of previously discussed ideas about social websites, each of which represents a particular facet of the concept. Then, they established a framework capable of expressing and managing a multidimensional view of scope, that is, the sentiment of an individual regarding a topic. After that, they proposed a number of parameters and a method for assessing the spatial and spatio-temporal scope of a user’s opinion on a topic on a social platform, and they conducted several experiments on real data collected from Reddit to test the proposed framework. Similarly, Bonifazi et al. [38] presented another Reddit-based study, proposing a model for evaluating and visualizing the eWoM power of Reddit blog posts.
In a similar way, ref. [39] examined app reviews, where the authors initially extracted
negative reviews, constructed a time series of these reviews, and subsequently trained a
model to identify key patterns. Additionally, the study focused on an automatic review
classification to address the challenge of handling a large volume of daily submitted reviews.
To tackle this, the study presented a multi-label active-learning technique, which yielded
superior results compared to state-of-the-art methods. Given the impracticality of manually
analyzing a vast number of reviews, many researchers have turned to topic modeling,
a technique that aids in identifying the main themes within a given text. For instance, in the
study [40], the authors investigated the relationship between Arabic app elements and
assessed the accuracy of reflecting the type and genre of Arabic mobile apps available on
the Google Play store. By employing the LDA approach, valuable insights were provided,
offering potential improvements for the future of Arabic apps. Furthermore, in [41],
the authors developed an NB and XGB technique to determine user activity within an app.
The literature review provides an analysis of the advantages, disadvantages, and limitations associated with different approaches. Nevertheless, it is worth noting that a significant number of researchers have directed their attention toward the utilization of Twitter datasets for the purpose of analyzing tweets and app evaluations, employing natural language processing (NLP) techniques and machine learning primarily for sentiment analysis. Commonly utilized machine learning models, including random forests, support vector machines, and extra tree classifiers, are limited in their ability to learn intricate patterns and are typically not utilized for large datasets. When the aforementioned models are employed on extensive datasets, their performance is inadequate and demands an excessive amount of training time, especially in the case of handcrafted features. Furthermore, the existing literature employs a limited collection of datasets comprising tweets unrelated to ChatGPT. Previous research has not extensively examined ChatGPT- or OpenAI-related tweets and has achieved low accuracy. Table 1 summarizes the literature review.
Table 1. Summary of related work.

[16] Techniques: TextBlob, CNN, RNN, GRU, DT, RF, SVM. Advantages: The authors build an ensemble model by combining GRU, CNN, and RNN for feature extraction from the tweets and detection; they also performed seven experiments to test the proposed ensemble approach. Disadvantages: The ensemble models need a significant amount of time to both train and identify the sentiments. Limitations: The authors used a limited dataset and did not develop transformer-based models, which are the most up-to-date and provide high accuracy.

[17] Techniques: TextBlob, CNN, LSTM, SVM, GBM, KNN, DT, LSTM-CNN. Advantages: This study employed machine learning as well as deep learning for the analysis of tweets, utilizing various annotation and feature engineering techniques; machine learning outperformed deep learning with an accuracy of 95%. Disadvantages: The study did not clearly describe the preprocessing stages and their implementations. Limitations: The dataset was restricted to tweets that were not associated with ChatGPT.

[18] Techniques: Naïve Bayes. Advantages: The data were labeled using the VADER technique, and the Naïve Bayes model was implemented to examine the influence of chatbots on customer opinions and demands within the retail industry. Disadvantages: The study detected positive, neutral, and negative sentiments and used only the ANCOVA test for the experiments. Limitations: The study did not use the most important metrics, such as accuracy, nor deep models or transformers; it is limited to the Naïve Bayes model.

[19] Techniques: BERT. Advantages: The authors analyzed depression tweets during the COVID-19 period and achieved remarkable results with BERT. Disadvantages: To speed up computation, the research did not remove stopwords, punctuation, numerical values, etc., from the text; additionally, the accuracy was inadequate. Limitations: The research proposed only one model, BERT, and did not compare with other studies.

[21] Techniques: LSTM + GRU, CNN, SVM, DT, TF-IDF. Advantages: The research revolves around sentiment evaluation and emotion detection using cryptocurrency-related tweets; an ensemble LSTM-GRU model integrates LSTM and GRU architectures to improve the accuracy of the analysis. Disadvantages: The ensemble models necessitate substantial time for both training and sentiment identification. Limitations: The study concerns cryptocurrency analysis, and transformers are not considered.

[26] Techniques: RF, LR, and AC. Advantages: The study used various feature engineering strategies, including bag-of-words, term frequency-inverse document frequency, and Chi-2, employed individually and collectively to obtain meaningful information from the tweets. Disadvantages: The study did not run cross-dataset experiments with machine learning classifiers; LR achieved the best accuracy at only 83%. Limitations: The study does not use chatbot- or ChatGPT-related tweets for the experiments; its focus is on machine learning models for Shopify reviews.

[30] Techniques: SVM, RF, and NB. Advantages: The authors obtained the dataset from the ten most popular applications; a baseline 10-fold validation approach resulted in an accuracy rate of 90.8%. Disadvantages: The paper is about app reviews, not ChatGPT tweets. Limitations: The accuracy achieved is low, and the study did not use deep transformers to improve efficiency.
As a result, this paper proposes a transformer-based BERT model that leverages self-attention mechanisms, which have demonstrated remarkable efficacy in machine learning and deep learning. The proposed model addresses the problems mentioned in the literature review: transformers can comprehend correlations between consecutive items that are widely separated, and they achieve exceptional performance. Additionally, the performance of the proposed method was evaluated using cross-validation findings and statistical tests. The ChatGPT tweets study utilizes BERTopic and LDA-based topic modeling techniques to ascertain the most pertinent topics or keywords within the datasets.
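A rough sketch of a BERT classifier with stacked dense layers, in the spirit of the proposed model, is given below; the sequence length, layer widths, learning rate, and three-class softmax head are illustrative assumptions, not the authors’ exact architecture.

```python
# Hedged sketch: BERT with stacked dense layers for 3-class sentiment
# classification, using Hugging Face Transformers and Keras.
# All hyperparameters here are illustrative assumptions.
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")

input_ids = tf.keras.Input(shape=(64,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(64,), dtype=tf.int32, name="attention_mask")

# The pooled [CLS] representation feeds three stacked dense layers.
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output
x = tf.keras.layers.Dense(256, activation="relu")(pooled)    # dense layer 1
x = tf.keras.layers.Dense(64, activation="relu")(x)          # dense layer 2
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # positive/neutral/negative

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```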
3. Methodology
The proposed methodology’s workflow is depicted in Figure 1, illustrating the steps
involved. Firstly, unstructured tweets related to ChatGPT are collected from Twitter using
the Twitter Tweepy API. These tweets undergo several preprocessing steps to ensure
cleanliness and remove noise. Lexicon-based techniques are then utilized to assign labels of
positive, negative, or neutral to the tweets. Feature extraction is performed using the Bag
of Words (BoW) technique on the labeled dataset. The data is subsequently split into an
80/20 ratio for training and testing purposes. Following model training, evaluation metrics
such as accuracy, precision, recall, and the F1 score are employed to analyze the model’s
performance. Each component of the proposed methodology for sentiment classification is
discussed in greater detail in the subsequent sections.
Figure 1. The workflow diagram of the proposed approach for sentiment classification.
3.1. Dataset Description and Preprocessing
In this study, the ChatGPT tweets dataset is utilized, which is scraped from Twitter
using the Tweepy API Python library. A total of 21,515 raw tweets are collected for this
purpose. The dataset contains the date, user name, user friends, user location, and text
features. The dataset is unstructured and requires several preprocessing steps to make it
appropriate for machine learning models.
Text preprocessing is very important in NLP tasks such as sentiment analysis. The dataset used in this paper is unstructured, unorganized, and contains unnecessary and redundant information. Machine learning and deep learning models do not perform well on such datasets, and the computational cost increases [42]. Different preprocessing techniques are utilized to remove unnecessary, meaningless information from the tweets.
Preprocessing is a crucial step in data analysis that involves transforming unstructured
data into a meaningful and comprehensible format [43]. The purpose of preprocessing is
to enhance the quality of the dataset while preserving its original content, enabling the
model to identify significant patterns that can be utilized to extract valuable and efficient
information from the preprocessed data. There are many steps in preprocessing to convert
unstructured text into structured data. These techniques are used to remove the least
important information from the data and make it easier for the machine to train in less time.
The dataset consists of 20,801 tweets, 8095 of which are positive, 2727 negative, and 9979 neutral. Following the split, 6476 positive tweets were used for training and 1619 for testing; 2181 negative tweets were used for training and 546 for testing; and 7983 neutral tweets were used for training and 1996 for testing. The hashtags #chatgpt, #ChatGPT, #OpenAI, #ChatGPT-3, #Chatbots, #Powerful OpenAI, etc., were used to collect all of the tweets in English. Table 2 shows the dataset statistics.
Table 2. Dataset statistics after splitting.

Tweets      Training    Testing    Total
Positive    6476        1619       8095
Negative    2181        546        2727
Neutral     7983        1996       9979
Total       16,640      4161       20,801
The most important step in natural language processing (NLP) is the preprocessing stage. It enables us to remove any unnecessary information from the data before the next processing stage. The Natural Language Toolkit (NLTK) is an open-source Python toolkit that provides modules for operations such as tokenization, stemming, and classification. The first step in preprocessing is to convert all textual data to lowercase. This conversion is essential in sentiment classification, as the machine otherwise considers “ChatGPT” and “chatgpt” to be distinct words; the dataset contains text in upper case, lower case, and sentence case, which the model would treat separately, affecting classification performance and making the data more complex. The second step is to remove numbers from the text, because they do not provide meaningful information and are useless in the decision-making process; the removal of numerical data enhances the quality of the data [44]. The third step is to remove punctuation such as [?, @, #, /, &, %] to increase the quality of the dataset and the performance of the models. The fourth step is to remove HTML tags and URLs, which also provide no important information: URLs merely expand the dataset and require extra computation, with no impact on machine learning performance. The fifth step is to remove stopwords such as ‘an’, ‘the’, ‘are’, ‘was’, ‘has’, and ‘they’ from the tweets. With only relevant information retained, the model’s accuracy improves and the training process is faster [44]; additionally, the removal of stopwords allows for a more thorough analysis, which is advantageous for a limited dataset [45]. The last step is to perform stemming and lemmatization, which slightly influence the effectiveness of machine learning. After performing all of these preprocessing steps, sample tweets are presented in Table 3, and a minimal cleaning sketch follows the table.
Table 3. Sample tweets before and after preprocessing.

Unstructured tweet: I asked #chatgpt to write a story instalment with Tim giving the octopus a name. Originality wasn’t its strongpoint https://t.co/rbB5prcJ2r (accessed on 2 April 2023).
Preprocessed tweet: asked chatgpt write story instalment tim giving octopus name originality strongpoint

Unstructured tweet: ChatGPT is taking the web by storm; If you’re unable to try it on their site, feel free to test it out through us! https://t.co/jfmOQmjSHo (accessed on 2 April 2023).
Preprocessed tweet: chatgpt taking web storm unable try site feel free test

Unstructured tweet: People weaponizing a powerful AI tool like ChatGPT days into launch has to be the most predictable internet
Preprocessed tweet: people weaponizing powerful tool like chatgpt days launch predictable internet
3.2. Lexicon-Based Techniques
TextBlob [46] and VADER [47] are the two most important lexicon-based techniques
used in this study to label the dataset. TextBlob provides polarity and subjectivity scores:
polarity lies in [−1, 1], where 1 represents a fully positive response and −1 a fully negative
one, and subjectivity lies in [0, 1]. The VADER technique calculates the sentiment score
by summing the valence intensity of each word in the preprocessed text.
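As an illustration, the labeling step can be sketched as follows. The decision thresholds (zero polarity for TextBlob and ±0.05 on VADER's compound score) are assumptions, as the paper does not state its exact cut-offs.

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def textblob_label(text: str) -> str:
    polarity = TextBlob(text).sentiment.polarity  # in [-1, 1]
    if polarity > 0:
        return "positive"
    return "negative" if polarity < 0 else "neutral"

def vader_label(text: str) -> str:
    # The compound score aggregates the valence intensities of the words
    compound = vader.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    return "negative" if compound <= -0.05 else "neutral"
```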
3.3. Feature Engineering
The labeled dataset is divided into training and testing subsets. The training data
are used to fit the model, while the test data provide unseen examples; the model's
predictions on them are compared with the true labels to determine its efficacy.
Important features from the cleaned tweets are extracted using the BoW approach.
The BoW approach extracts valuable features from the data to enhance the performance of
machine learning models. Features are very crucial and have a great impact on sentiment
classification. This approach reduces processing time and effort. The BoW approach creates
a bag of words of text data and converts it into a numeric format. The models learn and
understand complex patterns and sequences from the numeric format [48].
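A minimal sketch of this step with scikit-learn is shown below; `cleaned_tweets` and `labels` are placeholder names for the preprocessed texts and lexicon-generated labels, and the stratified 80/20 split is an assumption consistent with the counts in Table 2.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer()                  # bag-of-words term counts
X = vectorizer.fit_transform(cleaned_tweets)    # sparse document-term matrix

# 80/20 split; stratify keeps the class ratios of Table 2
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)
```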
3.4. Machine and Deep Learning Models
This subsection provides details about the machine and deep learning models. The applications of machine and deep learning span across various domains, such as disease
diagnosis [49], education [50], computer/machine vision [51,52], text classification [53],
and many more. In this study, we utilize these techniques for text classification. The objective of text classification is to automatically classify texts into predetermined categories.
Deep learning and machine learning are both forms of artificial intelligence [54]. Classifying
text with machine learning entails transforming the input data into a numeric form and then
manually extracting features using bag of words, term frequency-inverse document frequency,
word2vec, and similar techniques. Frequently employed machine learning models, such as
random forests, support vector machines, and extra tree classifiers, cannot learn complex
patterns and are not suitable for large datasets: applied to large datasets, they perform poorly
and require excessive training time, particularly with handcrafted features. Applying machine
learning to complex problems therefore requires manual feature engineering to retain only the
essential information, which is time-consuming and demands domain expertise to improve
classification results.
Deep learning [55], on the other hand, has a method for automatically extracting
features. Large and complex patterns are automatically learned from the data using DL
models like CNN, LSTM, GRU, etc., minimizing the need for manual feature extraction.
With a lack of data, however, these models can overfit and perform poorly. Gated
architectures address the issue of vanishing gradients; in terms of computing, gated recurrent
units (GRU) outperform LSTM, reduce the chance of overfitting, and are better suited to
small datasets. Additionally, the GRU has a straightforward structure with fewer parameters.
The authors only used models that are quick and effective in terms of computing.
We also developed transformer-based models that use self-attention mechanisms, since they
are the most effective after machine and deep learning. They can capture relationships
between sequence elements that lie far apart from one another and achieve outstanding
performance, giving each component of the sequence the same amount of attention.
Transformers can process and train on large data in a shorter period of time and are capable
of handling almost any form of sequenced information. The hyperparameters and their
fine-tuned values are listed in Table 4. These parameters were obtained with the
GridSearchCV method, which performs an exhaustive search over the given parameter
grid, evaluates the model's performance, and returns the best parameter set for optimal results.
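A minimal GridSearchCV sketch for one of the models (SVM) is given below; the candidate grid is illustrative, not the grid actually searched in the study.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values are illustrative; Table 4 lists the settings finally selected
param_grid = {"kernel": ["linear", "rbf", "poly"], "C": [0.5, 1.0, 2.0]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)  # e.g., {'C': 1.0, 'kernel': 'linear'} as in Table 4
```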
Table 4. Hyperparameters and their tuned values for experiments.

| Model | Parameters Tuning |
|---|---|
| RF | n_estimators = 100, random_state = 50, max_depth = 150 |
| GBM | n_estimators = 100, random_state = 100, max_depth = 300 |
| LR | random_state = 150, solver = 'newton-cg', multi_class = 'multinomial', C = 2.0 |
| SVM | kernel = 'linear', C = 1.0, random_state = 200 |
| KNN | n_neighbors = 3 |
| DT | random_state = 100, max_depth = 150 |
| ETC | n_estimators = 100, random_state = 150, max_depth = 300 |
| SGD | loss = "hinge", penalty = "l1", max_iter = 6 |
| CNN | 616,003 trainable parameters |
| RNN | 633,539 trainable parameters |
| LSTM | 655,235 trainable parameters |
| BiLSTM | 726,787 trainable parameters |
| GRU | 692,547 trainable parameters |
• Logistic Regression: LR [56] is a simple machine learning model used in this study for
sentiment classification. LR provides accurate results with preprocessed and highly
relatable features. It is simple to implement and utilizes low computational resources.
Due to its simplicity, however, it may not perform well on large datasets, can overfit,
and does not learn complex patterns.
• Random Forest: The RF is an ensemble supervised machine learning model used for
classification, regression, and other NLP tasks [57]. The RF ensembles multiple decision
trees to form a forest. A large amount of textual data and the ensemble of trees make the
model more complex, which increases the training time. The RF is powerful and has
attained high accuracy for sentiment analysis.
• Decision Tree: A DT is a supervised non-parametric learning model for classification
and regression. The DT predicts a target variable using learned features to classify
objects. A decision tree requires less data cleaning than other machine learning methods;
in particular, decision trees do not require normalization during the early stages of
machine learning tasks. They can handle both categorical and numerical information [58].
• K Nearest Neighbour: The KNN model requires no previous knowledge and does not
learn from the training data, which is why it is also called a lazy learner. It does not
perform well when the data are not well normalized and structured. Its performance can
be tuned through the distance metric and the K value [59].
• Support Vector Machine: The SVM is mostly used for classification tasks. It performs
well where the number of dimensions is greater than the number of samples [17].
The SVM does not perform well on large datasets because the training time increases,
but it is robust and handles imbalanced datasets efficiently. The SVM can be used with
'poly', 'linear', and 'rbf' kernels.
• Extra Tree Classifier: The ETC is used for classification and regression [60]. Extra trees
do not use the bootstrapping approach and train faster. The ETC requires fewer tuning
parameters than the RF, and the chances of overfitting are lower.
• Gradient Boosting Machine (GBM) and Stochastic Gradient Descent (SGD): The
GBM [61] and the SGD are supervised learning models for classification. The GBM
combines multiple decision trees to enhance performance, while the SGD optimizes a
loss function via gradient descent. The GBM is more complex and handles imbalanced
data better than the SGD.
• Convolutional Neural Network (CNN): The CNN [62] is a deep neural network model
used for image classification, sentiment classification, object detection, and many other
tasks. For sentiment classification, it first converts the textual data into a numeric format
and builds a matrix of word embeddings. These embedding layers are then passed into
convolutional, max-pooling, and dense layers, and the final output is passed through a
dense softmax layer for classification.
• Recurrent Neural Network (RNN): The RNN [63] is a widely used model for text
classification, speech recognition, and NLP tasks. The RNN can handle sequential data
with complex long-term dependencies, but it is expensive to train and suffers from the
vanishing gradient issue in text classification.
• Long Short-Term Memory: The LSTM [64] model was introduced to handle long-term
dependencies, the vanishing gradient issue, and long training times. Compared to the
RNN, this model is much faster and uses less memory. It has three gates, namely input,
output, and forget, which manage the data flow.
• Bidirectional LSTM: The BiLSTM is a deep learning model used for several tasks,
including text classification [65]. Because it can learn information in both directions,
it understands text in its past and future contexts better than the LSTM.
• Gated Recurrent Unit (GRU): The GRU solves the vanishing gradient problem faced by
the RNN [66]. It is fast and performs well on small datasets. The model has two
gates: an update gate and a reset gate.
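To illustrate how such a deep model is assembled, the following is a minimal Keras sketch of a BiLSTM classifier; the vocabulary size, embedding width, and layer sizes are illustrative, since the paper reports only the resulting trainable-parameter counts (Table 4).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Vocabulary size, embedding width, and layer widths are assumed for illustration
model = models.Sequential([
    layers.Embedding(input_dim=20000, output_dim=32),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(32, activation="relu"),
    layers.Dense(3, activation="softmax"),  # positive / negative / neutral
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints the trainable parameter count (cf. Table 4)
```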
3.5. Transformer-Based Architecture
BERT is a transformer-based model presented by Devlin et al. [67] in 2018. The BERT
model uses an attention mechanism over the actual input text. BERT employs the encoder
part of the transformer architecture: the encoder takes text as input and produces contextual
representations from which predictions are made. The BERT model is particularly well suited
for NLP tasks, including sentiment analysis and question answering, because it is trained on a
large amount of textual data. Traditional models consider word context in only one direction,
normally from left to right, whereas the BERT model considers the context of words in both
directions. In contrast to previous deep learning models, this gives it a clearer understanding
of word meanings. The BERT model is trained on a large amount of data to obtain accurate
results and to learn complex patterns and structures [68].
BERT with fine-tuned hyperparameters works well for a variety of NLP tasks.
Santiago Gonzalez and Eduardo C. Garrido-Merchan [69] published a study that compared
the BERT architecture to traditional machine learning models for sentiment classification,
with the traditional models trained on features extracted with TF-IDF; the results demonstrate
that the transformer-based BERT model outperforms the traditional models. The BERT model
has also been used to solve NLP problems for low-resource languages: Jan Christian Blaise
Cruz and Charibeth Cheng [70] pre-trained BERT on text data and fine-tuned it for
low-resource languages, and because the model takes in whole sequences of words at once,
the results for those languages improved.
Figure 2 shows the proposed BERT architecture for sentiment classification. BERT uses
a large, pre-trained vocabulary to generate input ids, the numeric representations of the
input text: a sequence of tokens is created from the input text, and a unique id is assigned
to each token. The input mask works like an attention mechanism that distinguishes input
text tokens from padding; it identifies which tokens in the input sequence the model
evaluates and which it ignores. Segment ids are extra tokens that differentiate the sentences.
These inputs are then fed into the BERT Keras layer. On top of it, this study uses three
dense layers with 128, 64, and 32 units and two 20% dropout layers; the final dense layer
performs classification with the softmax activation function.
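A minimal sketch of this classification head is given below, using the Hugging Face TFBertModel as a stand-in for the BERT Keras layer mentioned above; the sequence length, ReLU activations, and learning rate are assumptions not specified in the paper.

```python
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")

seq_len = 128  # assumed maximum sequence length
input_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_ids")
input_mask = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_mask")

# Pooled [CLS] representation feeds the dense head described in the text
pooled = bert(input_ids, attention_mask=input_mask).pooler_output
x = tf.keras.layers.Dense(128, activation="relu")(pooled)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dropout(0.2)(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)  # final classification layer

model = tf.keras.Model(inputs=[input_ids, input_mask], outputs=outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```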
XLNet was released by Yang et al. in 2019, and its architecture is similar to BERT.
BERT is an autoencoder, while XLNet is an autoregressive model [71]. The BERT model
cannot fully model the dependencies between the masked tokens in a sentence; XLNet
overcomes this problem by adopting a permutation-based training objective instead of a
mask-based objective. The permutation-based objective permits XLNet to represent the
dependencies among all tokens in a passage.
Figure 2. The architecture for the proposed sentiment classification.
Robustly optimized BERT pretraining (RoBERTa) [72] is a transformer-based model
used for various NLP tasks. It was developed in 2019 as a modification of BERT that
overcomes BERT's limitations. RoBERTa is pre-trained on roughly ten times as much text
(about 160 GB) as BERT, which was trained on about 3.3 billion words; it handles large
datasets, trains quickly, and can use large batch sizes. RoBERTa also uses a dynamic
masking approach, whereas BERT uses static masking.
3.6. Performance Metrics
The performance of the machine, deep, and transformer-based models is measured
using evaluation metrics including accuracy, precision, recall, and the F1 score [73].
Accuracy is calculated using

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

where TP stands for true positive, TN for true negative, FP for false positive, and FN for
false negative.
Precision is another performance metric; it is defined as the ratio of true positives to
the total number of positive predictions:

Precision = TP / (TP + FP)    (2)
The recall is also used to measure the performance of the models. The recall is calculated
by dividing the true positives by the sum of true positives and false negatives:

Recall = TP / (TP + FN)    (3)
The F1 score is a better metric than the others when classes are imbalanced, because it
considers both precision and recall and provides a better understanding of the model's
performance:

F1-score = 2 × (Recall × Precision) / (Recall + Precision)    (4)
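In practice, all four metrics can be obtained in one call; the snippet below is a sketch that assumes a fitted scikit-learn classifier `model` and the held-out `X_test`/`y_test` from Section 3.3.

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# target_names must match the label encoding used in y_test
print(classification_report(y_test, y_pred,
                            target_names=["negative", "neutral", "positive"]))
```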
4. Results and Discussion
This section presents details of the experiments on the ChatGPT Twitter dataset using
machine learning, deep learning, and transformer-based models. Experiments are run in a
Colab notebook in Python using the TensorFlow, Keras, and Sklearn libraries. Different
measures, including accuracy, precision, recall, and the F1 score, are used to assess the
performance of the models. For the deep and transformer-based models, a graphics
processing unit (GPU) and 16 GB of RAM are used to speed up the training process.
Experimental results are presented in the subsequent sections.
4.1. Results of Machine Learning Models
Table 5 shows the results of eight machine learning models using the TextBlob and
VADER lexicon-based labeling techniques on the ChatGPT Twitter data. With TextBlob
labels, SVM performs best with an accuracy of 94.23%, while SGD achieves 92.76%. ETC,
GBM, and LR each reach roughly 91-92% accuracy, while the lazy learner KNN obtains
only 58.03%. For the negative class, the SVM model has 88% precision, 89% recall, and an
83% F1 score, whereas the GBM model has 91% precision, 63% recall, and a 74% F1 score.
Utilizing BoW features, the neutral tweets obtain the highest recall scores.
Table 5. Results of machine learning models using VADER and TextBlob techniques.

| Model | Class | Accuracy (VADER) | Precision | Recall | F1 Score | Accuracy (TextBlob) | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|---|---|---|
| SGD | Positive | 89.13 | 93 | 92 | 93 | 92.76 | 94 | 93 | 93 |
| SGD | Negative | | 84 | 69 | 76 | | 89 | 75 | 81 |
| SGD | Neutral | | 87 | 94 | 90 | | 93 | 95 | 97 |
| RF | Positive | 82.40 | 92 | 83 | 88 | 86.99 | 94 | 85 | 89 |
| RF | Negative | | 92 | 43 | 58 | | 94 | 47 | 63 |
| RF | Neutral | | 73 | 98 | 84 | | 82 | 99 | 90 |
| DT | Positive | 82.26 | 93 | 82 | 87 | 88.29 | 94 | 85 | 90 |
| DT | Negative | | 82 | 47 | 60 | | 89 | 56 | 69 |
| DT | Neutral | | 94 | 97 | 84 | | 84 | 99 | 91 |
| ETC | Positive | 87.11 | 93 | 89 | 91 | 91.80 | 94 | 91 | 93 |
| ETC | Negative | | 92 | 56 | 69 | | 90 | 66 | 76 |
| ETC | Neutral | | 81 | 98 | 89 | | 90 | 99 | 94 |
| KNN | Positive | 54.38 | 95 | 47 | 22 | 58.03 | 95 | 20 | 34 |
| KNN | Negative | | 83 | 20 | 33 | | 80 | 18 | 30 |
| KNN | Neutral | | 47 | 99 | 64 | | 54 | 99 | 70 |
| SVM | Positive | 90.72 | 95 | 92 | 94 | 94.23 | 96 | 94 | 95 |
| SVM | Negative | | 85 | 73 | 79 | | 88 | 89 | 83 |
| SVM | Neutral | | 89 | 96 | 92 | | 94 | 99 | 96 |
| GBM | Positive | 89.56 | 93 | 92 | 92 | 92.28 | 94 | 94 | 94 |
| GBM | Negative | | 92 | 65 | 76 | | 91 | 63 | 74 |
| GBM | Neutral | | 85 | 97 | 91 | | 91 | 99 | 95 |
| LR | Positive | 88.44 | 93 | 91 | 92 | 91.56 | 95 | 91 | 93 |
| LR | Negative | | 89 | 63 | 74 | | 92 | 66 | 77 |
| LR | Neutral | | 84 | 96 | 90 | | 89 | 99 | 96 |
Table 5 also shows the results of the various models using the VADER technique. With
VADER labels, SVM again performs best with an accuracy of 90.72%, and SGD and GBM
both achieve roughly 89% accuracy. The worst performer is again KNN, with 54.38%
accuracy, mirroring its poor results with TextBlob. Across both labeling techniques, SVM
with the linear kernel achieves the highest accuracy among the machine learning models.
The accuracy scores of the machine learning models using TextBlob and VADER are
compared in Figure 3.
Figure 3. Performance of models using the TextBlob and VADER techniques. The X-axis presents the
machine learning models that we utilized in this study, and the Y-axis presents the accuracy score.
4.2. Performance of Deep Learning Models
Deep learning models are also used to perform a sentiment classification and analysis.
Results using the TextBlob technique are shown in Table 6. The experimental results on
the ChatGPT preprocessed Twitter dataset show that the BiLSTM deep model achieves a
93.12% accuracy score, which is the highest as compared to CNN, RNN, LSTM, and GRU.
The LSTM model also performs well, with an accuracy score of 92.95%. The other two
deep models, GRU and RNN, reached accuracies above 90%. The CNN model performs
poorly, with an accuracy roughly 20 percentage points lower than the other models.
Table 6. Results of deep learning models using the TextBlob technique.

| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| CNN | 70.88 | Positive | 73 | 66 | 69 |
| | | Negative | 56 | 48 | 52 |
| | | Neutral | 71 | 81 | 77 |
| RNN | 90.35 | Positive | 91 | 92 | 92 |
| | | Negative | 80 | 71 | 75 |
| | | Neutral | 92 | 94 | 93 |
| LSTM | 92.95 | Positive | 93 | 94 | 93 |
| | | Negative | 83 | 82 | 82 |
| | | Neutral | 96 | 96 | 96 |
| BiLSTM | 93.12 | Positive | 91 | 96 | 93 |
| | | Negative | 86 | 81 | 83 |
| | | Neutral | 97 | 94 | 95 |
| GRU | 92.33 | Positive | 92 | 94 | 93 |
| | | Negative | 82 | 81 | 82 |
| | | Neutral | 95 | 94 | 95 |
Table 7 shows the results of deep learning using the VADER technique. The performance of five deep learning models is evaluated using accuracy, precision, recall, and the
F1 score. The LSTM model achieves the highest accuracy of 87.33%, while the CNN model
achieves the lowest accuracy of 68.77%. The GRU and BiLSTM models achieve a 93%
recall score for the positive sentiment class. The lowest recall of 44% is obtained by CNN.
The CNN model shows poor performance both with the TextBlob and VADER techniques.
Table 7. Results of deep learning models using the VADER technique.

| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| CNN | 68.77 | Positive | 77 | 68 | 72 |
| | | Negative | 56 | 44 | 50 |
| | | Neutral | 65 | 80 | 72 |
| RNN | 82.40 | Positive | 90 | 88 | 89 |
| | | Negative | 62 | 66 | 64 |
| | | Neutral | 83 | 82 | 83 |
| LSTM | 87.33 | Positive | 89 | 92 | 90 |
| | | Negative | 74 | 75 | 75 |
| | | Neutral | 91 | 87 | 89 |
| BiLSTM | 86.95 | Positive | 88 | 93 | 90 |
| | | Negative | 76 | 74 | 75 |
| | | Neutral | 91 | 86 | 88 |
| GRU | 86.48 | Positive | 88 | 93 | 90 |
| | | Negative | 74 | 70 | 72 |
| | | Neutral | 90 | 86 | 88 |
4.3. Results of Transformer-Based Models
Currently, transformer-based models are very effective and perform well on complex
natural language understanding tasks such as sentiment classification. Machine learning
and deep learning models are also used for sentiment analysis, but machine learning
performs well only on small datasets, while deep learning models require large datasets to
achieve high accuracy.
Table 8 shows the results of the transformer-based models using the TextBlob technique.
RoBERTa achieves an accuracy of 93.68%, with 96% recall for the positive and neutral
classes. The XLNet model achieves an 85.96% accuracy, which is low compared to RoBERTa
and the proposed BERT model. The proposed approach achieves the highest accuracy of
96.49%, above any other machine or deep learning model; its precision, recall, and F1 score
are likewise higher than those of the others.
The results of the transformer-based models are also evaluated using the VADER
technique. As Table 9 shows, the proposed approach again performs best: it understands
the full contextual content, attends to the relevant parts of the text, and makes efficient
predictions. The RoBERTa and XLNet models achieve 86.68% and 68.51% accuracy scores,
respectively, while the proposed method achieves 93.37% accuracy, higher than all of the
other transformer-based models when used with VADER. The precision, recall, and F1
scores achieved by the proposed model are also better than those of the other models.
Table 8. Performance of transformer-based models using the TextBlob technique.

| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| RoBERTa | 93.68 | Positive | 95 | 96 | 93 |
| | | Negative | 84 | 85 | 85 |
| | | Neutral | 95 | 96 | 96 |
| XLNet | 85.96 | Positive | 93 | 83 | 87 |
| | | Negative | 66 | 77 | 71 |
| | | Neutral | 86 | 91 | 89 |
| Proposed BERT | 96.49 | Positive | 96 | 98 | 97 |
| | | Negative | 92 | 90 | 91 |
| | | Neutral | 98 | 97 | 98 |
Table 9. Performance of transformer-based models using the VADER technique.

| Model | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| RoBERTa | 86.68 | Positive | 75 | 79 | 77 |
| | | Negative | 88 | 88 | 88 |
| | | Neutral | 90 | 88 | 89 |
| XLNet | 68.51 | Positive | 66 | 72 | 69 |
| | | Negative | 25 | 45 | 32 |
| | | Neutral | 85 | 70 | 76 |
| Proposed BERT | 93.37 | Positive | 97 | 92 | 95 |
| | | Negative | 87 | 89 | 88 |
| | | Neutral | 93 | 96 | 94 |
Table 10 shows the correct and wrong predictions by the deep learning and BERT models
using the TextBlob technique. Results are given only for TextBlob, as the models perform
better with this technique. Out of 4000 predictions, the RNN made 3614 correct and
386 wrong predictions. The LSTM made 3718 correct predictions, while 282 were wrong.
The BiLSTM had 3725 correct and 275 wrong predictions, and the GRU had 3693 correct
and 307 wrong. Out of 4160 predictions, XLNet made 3576 correct and 584 wrong
predictions, while RoBERTa made 3897 correct and 263 wrong. The proposed BERT made
4015 correct predictions, with 146 wrong. With only 2835 correct and 1165 wrong
predictions, the CNN model performed poorly. These results demonstrate that the BERT
model performed better than the machine learning and deep learning models.
Table 10. Correct and wrong predictions by various models using the TextBlob technique.

| Model | Correct Predictions | Wrong Predictions | Total Predictions |
|---|---|---|---|
| CNN | 2835 | 1165 | 4000 |
| RNN | 3614 | 386 | 4000 |
| LSTM | 3718 | 282 | 4000 |
| BiLSTM | 3725 | 275 | 4000 |
| GRU | 3693 | 307 | 4000 |
| XLNet | 3576 | 584 | 4160 |
| RoBERTa | 3897 | 263 | 4160 |
| Proposed BERT | 4015 | 146 | 4161 |
4.4. Results of K-Fold Cross-Validation
K-fold cross-validation is an effective method for assessing a model's robustness and
validating its performance. Table 11 shows the results of the transformer-based models
with k-fold cross-validation. The experiments show that the proposed BERT model is
highly efficient for the sentiment analysis of ChatGPT tweets, with an average cross-validation
accuracy of 0.95 using the TextBlob approach and a ±0.01 standard deviation. The proposed
model also works well using the VADER approach, again with a ±0.01 standard deviation.
RoBERTa achieves a 0.91 accuracy with a ±0.06 standard deviation under k-fold validation,
while XLNet achieves 0.68 with a ±0.18 standard deviation.
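A sketch of this validation step is shown below; the number of folds is an assumption, as the paper does not state k, and `clf`, `X`, and `y` stand for a classifier with a scikit-learn interface and the labeled features.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# k is not stated in the paper; 5 folds are assumed here
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy = {scores.mean():.2f}, std = ±{scores.std():.2f}")
```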
Table 11. K-fold cross-validation results using TextBlob and VADER approaches.

| Technique | Model | Accuracy | Standard Deviation |
|---|---|---|---|
| TextBlob | RoBERTa | 0.91 | ±0.06 |
| TextBlob | XLNet | 0.68 | ±0.18 |
| TextBlob | Proposed BERT | 0.95 | ±0.01 |
| VADER | RoBERTa | 0.85 | ±0.02 |
| VADER | XLNet | 0.66 | ±0.02 |
| VADER | Proposed BERT | 0.93 | ±0.01 |
4.5. Topic Modeling Using BERTopic and LDA Method
Topic modeling is an important approach in NLP, as it automatically extracts the most
significant topics from textual data. A vast amount of unstructured data is available on
social media, and traditional approaches are incapable of handling such data; topic modeling
can handle and extract meaningful information from unstructured text efficiently. In this
study, topic modeling is applied in Python to the preprocessed data and is used to discover
the topics most frequently discussed in the tweet dataset.
Transformer-based models have produced very promising results in various NLP tasks.
BERTopic is a recent topic modeling method that employs the BERT transformer model to
extract key trends or keywords from large datasets. BERTopic gathers semantic information
that better represents topics, extracts contextual and complicated themes accurately and
efficiently, and surfaces relevant recent trends from Twitter. LDA, in comparison, cannot
extract nuanced and complicated contextual issues from tweets and relies on older
techniques that miss current patterns, so BERTopic is the better choice for topic modeling
on large datasets.
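A minimal BERTopic sketch over the cleaned tweets might look as follows; `cleaned_tweets` and the `nr_topics` setting are assumptions for illustration.

```python
from bertopic import BERTopic

# cleaned_tweets: preprocessed tweet texts (assumed available)
topic_model = BERTopic(language="english", nr_topics=10)
topics, probs = topic_model.fit_transform(cleaned_tweets)
print(topic_model.get_topic_info().head())  # most prominent topics and their sizes
```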
LDA [74] is an approach used for topic modeling in NLP problems. It is easy to use,
efficient, and faster than other topic modeling approaches. LDA is applied to the textual
data after a document-term matrix is created that records the frequency of each term in a
document; the BoW features are utilized to identify the most crucial terms. The most
prominent keywords extracted from the ChatGPT tweets using BERTopic and LDA are
shown in Figure 4.
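A corresponding LDA sketch with scikit-learn, building the document-term matrix from BoW features as described above, might look as follows; the number of topics and the feature cap are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(max_features=5000)    # BoW features
dtm = vectorizer.fit_transform(cleaned_tweets)     # document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(dtm)

# Print the top five words of each discovered topic
terms = vectorizer.get_feature_names_out()
for idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top}")
```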
Figure 4. Comparison of LDA-based and BERT-based topic modeling techniques through word
clouds: (a) visualization of tweets using LDA topic modeling; (b) visualization of tweets using
BERTopic modeling.
Figure 5 depicts the most prominent topics extracted by BERTopic. First, we load the
BERT model and associated tokenizers. The tweet data are then preprocessed to extract the
embeddings for the BERT model. Then, for dimensionality reduction and clustering, we used
principal component analysis (PCA) and k-means clustering. The BERT model was used to
extract the most prominent topics, which were then displayed in a scatter plot.
Figure 5. Most Prominent Topics extracted from ChatGPT Tweets using BERTopic.
Figure 6 shows the words in the top ten positive and negative topics and their frequencies.
The word ChatGPT is mostly discussed in the tweets in a positive context; negative words
like 'fake' and 'wrong' are discussed as well, but less often. The words 'good', 'best', and
'love' have the lowest frequency among the top ten topics.
Figure 6. Words extracted from top ten topics with their frequency using the LDA model.
Figures 7 and 8 show the most discussed positive and negative topics, extracted
from the ChatGPT tweets using the LDA approach with BoW features. These Figures
illustrate positive and negative words in the context of various topics. The users shared
their opinions regarding ChatGPT on social media platforms like Twitter, posting positive
or negative views. The authors extracted these tweets and analyzed how people feel about
and discuss this technology, using LDA-based topic modeling to extract the most prominent
keywords from the tweets. These keywords reveal important themes to understand the main context and
identify the emotions; they also capture semantic meanings. In the tweets, the word “good”
indicates a cheerful mood. It represents anything beneficial or pleasurable. The majority
of the time, “good” refers to a positive quality. It is classified as positive sentiment in
the sentiment analysis because this inference is generally understood to be positive. It
is important to clarify that these words are not inherently positive or negative; rather,
their categorization depends on the positive or negative topics they are associated with.
For instance, words like “better”, “best”, and “good” are included in positive topics and are
used in a positive context around ChatGPT. "Better" indicates an advance over a previous
state or condition, a positive development. ChatGPT is frequently spoken of favorably
due to its features and potential applications in a variety of industries. The development
of AI language models like ChatGPT is demonstrated by their ability to comprehend
and generate text responses that resemble human responses. ChatGPT allows users to
partake in entertaining and engaging conversations. In the negative context, on the other
hand, tweets note that ChatGPT sometimes produces irrelevant or incorrect results, that it
raises privacy concerns, and that excessive dependence on it may impair the ability to
think critically and solve technical problems. Social media users frequently use words
like “bad”, “wrong”, “little”, and “hot” in a negative sense, aligning with negative topics.
Sentiment analysis models can be refined and improved over time based on feedback and
real-world data to better capture the nuances of sentiments expressed in different contexts.
Based on these prominent keywords, policymakers can analyze the tool's reception and
modify their product accordingly.
Figure 7. Visualization of highly discussed positive topics.
Figure 8. Visualization of highly discussed negative topics.
4.6. Comparison of Proposed Approach with Machine Learning Models Using Statistical Test
The comparison between the machine learning models and the proposed transformer-based
BERT model is presented in Table 12. The machine learning models were fine-tuned to
optimize the results, and the proposed approach was evaluated using both the TextBlob and
VADER techniques. In all scenarios, the proposed approach rejects Ho and accepts Ha,
which means that it is statistically significant in comparison with the other approaches.
Table 12. Statistical test comparison with the proposed model.

| Scenario | TextBlob Statistic | TextBlob p-Value | Ho | VADER Statistic | VADER p-Value | Ho |
|---|---|---|---|---|---|---|
| Proposed BERT vs. SGD | −7.999 | 0.015 | Rejected | −3.313 | 7.284 | Rejected |
| Proposed BERT vs. RF | −39.167 | 3.661 | Rejected | −31.128 | 0.343 | Rejected |
| Proposed BERT vs. DT | 0.633 | 0.571 | Rejected | −3.695 | 5.545 | Rejected |
| Proposed BERT vs. ETC | −63.516 | 8.598 | Rejected | −34.097 | 0.041 | Rejected |
| Proposed BERT vs. KNN | −8.225 | 0.003 | Rejected | −3.43 | 0.008 | Rejected |
| Proposed BERT vs. SVM | −9.792 | 0.002 | Rejected | −6.140 | 0.047 | Rejected |
| Proposed BERT vs. GBM | −9.845 | 0.002 | Rejected | −3.257 | 0.045 | Rejected |
| Proposed BERT vs. LR | −17.691 | 0.000 | Rejected | −3.368 | 0.043 | Rejected |
4.7. Performance Comparison with State-of-the-Art Studies
To evaluate the robustness and efficiency of the proposed approach, its performance is
compared with state-of-the-art existing studies; Table 13 summarizes the results. The
study [26] used machine learning models for sentiment analysis, and LR performed well
with 83% accuracy. Khalid et al. [27] analyzed Twitter data using an ensemble of machine
learning models and achieved 93% accuracy with the GBSVM model. Another study [75]
carried out a sentiment analysis on Twitter data using machine learning models; such
models perform poorly on small datasets, which motivated later work to adopt
transformer-based models. For example, Bello et al. [33] used the BERT model on tweets.
Their BERT model utilizes contextual information to produce a vector representation and,
when integrated with neural network classifiers such as CNN, RNN, or BiLSTM, attains an
accuracy of 93% and an F measure of 95%. The BiLSTM model exhibits some shortcomings,
one of which is its inability to effectively capture the underlying contextual nuances of
individual words. Other authors, such as [34,35], used BERT models for sentiment analysis
on various datasets. They evaluated the efficacy of Google's BERT method in comparison to
other machine learning methods and investigated the BERT architecture, which is
pre-trained on two natural language processing tasks, namely masked language modeling
and next sentence prediction. The random forest (RF) is commonly employed as a
benchmark for evaluating the BERT language model because it performs best among the
compared machine learning methods. Prior methodologies rely mostly on natural language
techniques for the classification and analysis of tweets and yield insufficient results,
indicating the need for an approach that classifies tweets precisely. The performance
analysis shows that the proposed BERT model delivers efficient results with a 96.49%
accuracy and outperforms the existing studies.
Table 13. Comparison of proposed approach with state-of-the-art existing studies.

| Authors | Model | Dataset | Accuracy | Publication |
|---|---|---|---|---|
| Rustam et al. [26] | Logistic Regression | App reviews | 83% | 2020 |
| Khalid et al. [27] | GBSVM | Twitter Data | 93% | 2020 |
| Wadhwa et al. [75] | Logistic Regression | Twitter Data | 86.51% | 2021 |
| Bello et al. [33] | BERT | Twitter Data | 93% | 2022 |
| Catelli et al. [34] | BERT | E-commerce reviews | 75% | 2021 |
| Patel et al. [35] | BERT | Reviews | 83% | 2022 |
| Proposed | BERT | Twitter Data | 96.49% | 2023 |
4.8. Validation of Proposed Approach on Additional Dataset
The proposed approach is validated on an additional public benchmark dataset: the
well-known SemEval2013 dataset [76]. The proposed TextBlob+BERT approach is applied
to SemEval2013, where TextBlob generates new labels for the dataset and the proposed
BERT model performs the classification; experiments are also run using the original
SemEval2013 labels. The experimental results, presented in Table 14, indicate the superior
performance of the proposed approach: it reaches a 0.97 accuracy score on SemEval2013
when labels are assigned using TextBlob and BERT is used for classification. In the second
set of experiments, which uses the original SemEval2013 labels, LR shows the best
performance with a 0.65 accuracy score.
Table 14. Experimental results on the SemEval2013 dataset.

| Approach | Accuracy | Class | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| TextBlob + BERT | 0.97 | Negative | 0.97 | 0.91 | 0.94 |
| | | Neutral | 0.98 | 0.99 | 0.98 |
| | | Positive | 0.96 | 0.98 | 0.97 |
| | | macro avg | 0.97 | 0.96 | 0.97 |
| | | weighted avg | 0.97 | 0.97 | 0.97 |
| Original + LR | 0.65 | Negative | 0.65 | 0.47 | 0.54 |
| | | Neutral | 0.63 | 0.72 | 0.67 |
| | | Positive | 0.69 | 0.65 | 0.67 |
| | | macro avg | 0.65 | 0.62 | 0.63 |
| | | weighted avg | 0.65 | 0.65 | 0.65 |
4.9. Statistical Significance Test
This study performs a statistical significance t-test to show the significance of the
proposed approach. For the statistical test, several scenarios are considered, as listed in
Table 15. The t-test shows the significance of one approach over another by accepting or
rejecting the null hypothesis (Ho). In this study, we consider two cases [77], and a minimal
sketch of the test follows the list below:
• Null Hypothesis (Ho) => µ1 = µ2: The population mean of the proposed approach's
results equals that of the compared approach's results (no statistical significance).
• Alternative Hypothesis (Ha) => µ1 ≠ µ2: The population mean of the proposed
approach's results differs from that of the compared approach's results (the proposed
approach is statistically significant).
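The sketch below shows the shape of this test; `proposed_scores` and `baseline_scores` are placeholder names for per-fold accuracy scores of the two compared approaches.

```python
from scipy.stats import ttest_ind

# proposed_scores / baseline_scores: per-fold accuracies (assumed inputs)
statistic, p_value = ttest_ind(proposed_scores, baseline_scores)
if p_value < 0.05:
    print("Reject Ho: the difference is statistically significant")
else:
    print("Accept Ho: no statistical significance")
```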
Table 15. Statistical significance t-test.

| Scenario | Statistic | p-Value | Ho |
|---|---|---|---|
| Proposed BERT vs. RoBERTa | 3.304 | 3.304 | Rejected |
| Proposed BERT vs. XLNet | 7.292 | 0.0003 | Rejected |
| Proposed BERT vs. GRU | 4.481 | 0.004 | Rejected |
| Proposed BERT vs. BiLSTM | 2.621 | 0.003 | Rejected |
| Proposed BERT vs. LSTM | 2.510 | 0.045 | Rejected |
| Proposed BERT vs. RNN | 6.474 | 0.000 | Rejected |
| Proposed BERT vs. CNN | 8.980 | 0.000 | Rejected |
The t-test is interpreted as follows: if the output p-value is greater than the alpha value
(0.05), Ho is accepted and there is no statistical significance; if the p-value is less than the
alpha value, Ho is rejected and Ha is accepted, meaning the compared results differ
significantly. We performed the t-test on the TextBlob results and compared all models'
performances. In all scenarios, the proposed approach rejects Ho and accepts Ha, which
means that the proposed approach is statistically significant in comparison with the other
approaches.
4.10. Discussion
In this study, we observed that the majority of sentiment towards ChatGPT was
positive, indicating a generally favorable perception of the tool. This aligns with the
significant attention and popularity ChatGPT has gained on various online platforms.
The positive sentiment can be attributed to its advanced language generation capabilities
and its ability to engage in human-like conversations. Figure 9 shows the sentiment ratio
for ChatGPT.
Figure 9. Sentiment ratio in extracted data.
The positive sentiment towards chatGPT is also reflected in the widespread discussions
and positive experiences shared by individuals, communities, and social media platforms.
People are fascinated by its ability to understand and respond effectively, enhancing
user engagement and satisfaction. However, it is important to acknowledge that there
are varying opinions and discussions surrounding chatGPT. While most sentiments are
positive, some individuals criticize its services and express negative sentiments, particularly
concerning its suitability for students. These discussions highlight the need for a further
analysis and exploration to address any concerns and improve the tool’s effectiveness.
If students rely excessively on ChatGPT, they will lose their capacity to independently
compose or generate answers to questions. Students' writing skills may not improve if they
use ChatGPT for projects, and as exam dates approach, such students have difficulty writing
and responding to queries efficiently. There is also the possibility of receiving erroneous
information, becoming excessively reliant on technology, and having poor reasoning
skills when utilizing ChatGPT. When utilized for personalized learning, ChatGPT may
necessitate a comprehensive understanding of the course being taken, the learning preferences of each individual student, and the cultural context in which the students are based.
Another negative sentiment regarding ChatGPT is that when students completely rely
on AI chatbots to search for specific information about their subject, their level of knowledge does not improve. They cannot advance or increase the topic’s knowledge, and it
is extremely difficult to maintain concentration when studying. Additionally, students
enter data into ChatGPT while looking up specific queries, which could pose a security
concern because ChatGPT stores the data that users submit. Over fifty percent of students
are motivated to cheat and use ChatGPT to generate information for their submissions.
While most students did not admit to using ChatGPT in their writing, integrity may be
compromised when ChatGPT generates text.
Additionally, we conducted an analysis using an external sentiment analysis tool
called SentimentViz [78]. This tool allowed us to visualize people’s perceptions of ChatGPT
based on their data. The sentiment analysis results obtained from SentimentViz complemented and validated the findings of the proposed approach. Figure 10 presents visual
representations of the sentiment expressed by individuals regarding ChatGPT. This visualization provides further support for the positive sentiment observed in our study and
reinforces the credibility of our results.
Figure 10. SentimentViz output for chatGPT sentiment.
Discussions regarding the RQs set for this study are also given here.
i. RQ1: What are people's sentiments about ChatGPT technology?
Response: The authors analyzed a large dataset of tweets to determine how individuals
feel about ChatGPT technology. The results indicate that users have mixed feelings about
ChatGPT, with some expressing positive opinions and others expressing negative views.
These results provide useful information about how the public perceives ChatGPT and
can assist researchers and developers in understanding the chatbot's strengths and
weaknesses. The favorable perception of ChatGPT is attributable to its advanced language
generation features and its ability to engage in human-like interactions; individuals are
attracted by its cognitive power and its ability to respond effectively, which increases user
interest and satisfaction. Typical positive sentiments state, for example, that the new
OpenAI ChatGPT writes user-generated content in a better way, or that it is a great
language tool that writes code for specific queries.
ii. RQ2: Which classification model is most effective, among the proposed
transformer-based models, machine learning-based models, and deep learning-based
models, for analyzing sentiments about ChatGPT tweets?
Response: The experiments indicate that transformer-based BERT models are more
effective and accurate for analyzing sentiments in the ChatGPT tweets. Since transformers
use self-attention mechanisms, they give the same amount of attention to each component
of the sequence they process and can handle virtually any kind of sequential information.
For natural language processing (NLP), the BERT model takes into account the context of
words in both directions (left to right and right to left). Transformers have an in-depth
understanding of word meanings and are useful for complex problems. In contrast,
machine learning requires manual feature engineering, rigorous preprocessing, and a
limited dataset to improve accuracy, while deep learning's automatic feature extraction
proved less accurate.
iii. RQ3: What are the impacts of ChatGPT on student learning?
Response: The findings show that ChatGPT may have a significant impact on students'
learning. ChatGPT's capabilities can help students learn when they do not attend school.
ChatGPT is recommended not as a substitute for analytical thinking and creative work,
but as a tool to develop research and writing skills. Students' writing skills may not
improve if they rely completely on ChatGPT. There is also the possibility of receiving
erroneous information, becoming excessively reliant on technology, and developing poor
reasoning skills.
iv. RQ4: What role does topic modeling play in the sentiment analysis of social media tweets?
Response: Topic modeling is an unsupervised statistical method that assesses whether a
particular batch of documents contains more general "topics". To create a summary that
best depicts a document's contents, it mines the text for commonly used words and
phrases. There is a vast amount of unstructured data related to OpenAI ChatGPT, and
traditional approaches are incapable of handling such data; topic modeling can handle and
extract meaningful information from unstructured text data efficiently. LDA-based
modeling extracts the most discussed topics and prominent positive or negative keywords.
It also provides clear information from a large corpus, which would be very
time-consuming to extract manually.
5. Conclusions
This study conducted a sentiment analysis on ChatGPT-related tweets to gain insight
into people’s perceptions and opinions. By analyzing a large dataset of tweets, we were able
to identify the overall sentiment expressed by users towards ChatGPT. The findings indicate
that there are mixed sentiments among users, with some expressing positive views and
others expressing negative views about ChatGPT. These results provide valuable insights
into the public perception of ChatGPT and can help researchers and developers understand
the strengths and weaknesses of the chatbot. Further, this study utilized the BERT model to
analyze tweets related to ChatGPT. The BERT model proved to be effective in understanding
and classifying sentiments expressed in these tweets. By employing the BERT model, we
were able to accurately classify sentiments and gain a deeper understanding of the overall
sentiment trends surrounding ChatGPT.
The experimental results demonstrate the outstanding performance of the proposed
model, which achieves an accuracy of 96.49%. This performance is further validated through
k-fold cross-validation and comparison with existing state-of-the-art studies. Our conclusions
indicate that the majority of people expressed positive sentiments towards the ChatGPT
tool, while a minority had negative sentiments. It was observed that many users appreciate
the tool for its assistance across various domains. However, some individuals criticized
the ChatGPT tool’s services, particularly its suitability for students, expressing negative
sentiments in this regard.
This study recognizes the limitation of a relatively small dataset, comprising only
21,515 tweets, which may restrict comprehensive insights. To overcome this limitation,
future research will prioritize the collection of a larger volume of data from Twitter and other
social media platforms to gain a more accurate understanding of people’s perceptions of the
trending chatGPT tool. Moreover, the study aims to develop a machine learning approach
that incorporates the sentiment analysis, enabling exploration of how such technologies
can be developed to mitigate potential societal harm and ensure responsible deployment.
Author Contributions: Conceptualization, S.R. and M.M.; Data curation, M.M. and F.R.; Formal
analysis, S.R., F.R., R.S. and I.d.l.T.D.; Funding acquisition, I.d.l.T.D.; Investigation, V.C. and M.G.V.;
Methodology, F.R., M.M. and R.S.; Project administration, R.S. and V.C.; Resources, M.G.V. and J.B.B.;
Software, M.G.V. and J.B.B.; Supervision, I.d.l.T.D. and I.A.; Validation, J.B.B. and I.A.; Visualization,
R.S. and V.C.; Writing—original draft, M.M., R.S., F.R. and S.R.; Writing—review & editing, I.A. All
authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the European University of Atlantic.
Data Availability Statement: https://www.kaggle.com/datasets/furqanrustam118/chatgpt-tweets.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Meshram, S.; Naik, N.; Megha, V.; More, T.; Kharche, S. Conversational AI: Chatbots. In Proceedings of the 2021 International Conference on Intelligent Technologies (CONIT), Hubli, India, 25–27 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
2. The Future of Chatbots: 10 Trends, Latest Stats & Market Size. Available online: https://onix-systems.com/blog/6-chatbot-trends-that-are-bringing-the-future-closer (accessed on 23 May 2023).
3. Size of the Chatbot Market Worldwide from 2021 to 2030. Available online: https://www.statista.com/statistics/656596/worldwide-chatbot-market/ (accessed on 23 May 2023).
4. Chatbot Market in 2022: Stats, Trends, and Companies in the Growing AI Chatbot Industry. Available online: https://www.insiderintelligence.com/insights/chatbot-market-stats-trends/ (accessed on 23 May 2023).
5. Malinka, K.; Perešíni, M.; Firc, A.; Hujňák, O.; Januš, F. On the educational impact of ChatGPT: Is Artificial Intelligence ready to obtain a university degree? arXiv 2023, arXiv:2303.11146.
6. George, A.S.; George, A.H. A review of ChatGPT AI's impact on several business sectors. Partners Univers. Int. Innov. J. 2023, 1, 9–23.
7. Lund, B.D.; Wang, T.; Mannuru, N.R.; Nie, B.; Shimray, S.; Wang, Z. ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. J. Assoc. Inf. Sci. Technol. 2023, 74, 570–581.
8. Kirmani, A.R. Artificial Intelligence-Enabled Science Poetry. ACS Energy Lett. 2022, 8, 574–576.
9. Cotton, D.R.; Cotton, P.A.; Shipway, J.R. Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innov. Educ. Teach. Int. 2023, 1–12. [CrossRef]
10. Tlili, A.; Shehata, B.; Adarkwah, M.A.; Bozkurt, A.; Hickey, D.T.; Huang, R.; Agyemang, B. What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learn. Environ. 2023, 10, 15.
11. Edtech Chegg Tumbles as ChatGPT Threat Prompts Revenue Warning. Available online: https://www.reuters.com/markets/us/edtech-chegg-slumps-revenue-warning-chatgpt-threatens-growth-2023-05-02/ (accessed on 23 May 2023).
12. Liu, B. Sentiment Analysis and Opinion Mining; Synthesis Lectures on Human Language Technologies; Springer: Cham, Switzerland, 2012; Volume 5, 167p.
13. Medhat, W.; Hassan, A.; Korashy, H. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 2014, 5, 1093–1113.
14. Hussein, D.M.E.D.M. A survey on sentiment analysis challenges. J. King Saud Univ.-Eng. Sci. 2018, 30, 330–338.
15. Lee, E.; Rustam, F.; Ashraf, I.; Washington, P.B.; Narra, M.; Shafique, R. Inquest of Current Situation in Afghanistan Under Taliban Rule Using Sentiment Analysis and Volume Analysis. IEEE Access 2022, 10, 10333–10348.
16. Lee, E.; Rustam, F.; Washington, P.B.; El Barakaz, F.; Aljedaani, W.; Ashraf, I. Racism detection by analyzing differential opinions through sentiment analysis of tweets using stacked ensemble gcr-nn model. IEEE Access 2022, 10, 9717–9728. [CrossRef]
17. Mujahid, M.; Lee, E.; Rustam, F.; Washington, P.B.; Ullah, S.; Reshi, A.A.; Ashraf, I. Sentiment analysis and topic modeling on tweets about online education during COVID-19. Appl. Sci. 2021, 11, 8438. [CrossRef]
18. Tran, A.D.; Pallant, J.I.; Johnson, L.W. Exploring the impact of chatbots on consumer sentiment and expectations in retail. J. Retail. Consum. Serv. 2021, 63, 102718. [CrossRef]
19. Muneshwara, M.; Swetha, M.; Rohidekar, M.P.; AB, M.P. Implementation of Therapy Bot for Potential Users With Depression During Covid-19 Using Sentiment Analysis. J. Posit. Sch. Psychol. 2022, 6, 7816–7826.
20. Parimala, M.; Swarna Priya, R.; Praveen Kumar Reddy, M.; Lal Chowdhary, C.; Kumar Poluru, R.; Khan, S. Spatiotemporal-based sentiment analysis on tweets for risk assessment of event using deep learning approach. Softw. Pract. Exp. 2021, 51, 550–570. [CrossRef]
21. Aslam, N.; Rustam, F.; Lee, E.; Washington, P.B.; Ashraf, I. Sentiment analysis and emotion detection on cryptocurrency related Tweets using ensemble LSTM-GRU Model. IEEE Access 2022, 10, 39313–39324. [CrossRef]
22. Aslam, N.; Xia, K.; Rustam, F.; Lee, E.; Ashraf, I. Self voting classification model for online meeting app review sentiment analysis and topic modeling. PeerJ Comput. Sci. 2022, 8, e1141. [CrossRef] [PubMed]
23. Araujo, A.F.; Gôlo, M.P.; Marcacini, R.M. Opinion mining for app reviews: An analysis of textual representation and predictive models. Autom. Softw. Eng. 2022, 29, 1–30. [CrossRef]
24. Aljedaani, W.; Mkaouer, M.W.; Ludi, S.; Javed, Y. Automatic classification of accessibility user reviews in android apps. In Proceedings of the 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), Riyadh, Saudi Arabia, 1–3 March 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 133–138.
25. Naeem, M.Z.; Rustam, F.; Mehmood, A.; Ashraf, I.; Choi, G.S. Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms. PeerJ Comput. Sci. 2022, 8, e914. [CrossRef]
26. Rustam, F.; Mehmood, A.; Ahmad, M.; Ullah, S.; Khan, D.M.; Choi, G.S. Classification of shopify app user reviews using novel multi text features. IEEE Access 2020, 8, 30234–30244. [CrossRef]
27. Khalid, M.; Ashraf, I.; Mehmood, A.; Ullah, S.; Ahmad, M.; Choi, G.S. GBSVM: Sentiment classification from unstructured reviews using ensemble classifier. Appl. Sci. 2020, 10, 2788. [CrossRef]
28. Umer, M.; Ashraf, I.; Mehmood, A.; Ullah, S.; Choi, G.S. Predicting numeric ratings for google apps using text features and ensemble learning. ETRI J. 2021, 43, 95–108. [CrossRef]
29. Rehan, M.S.; Rustam, F.; Ullah, S.; Hussain, S.; Mehmood, A.; Choi, G.S. Employees reviews classification and evaluation (ERCE) model using supervised machine learning approaches. J. Ambient Intell. Humaniz. Comput. 2022, 13, 3119–3136. [CrossRef]
30. Al Kilani, N.; Tailakh, R.; Hanani, A. Automatic classification of apps reviews for requirement engineering: Exploring the customers need from healthcare applications. In Proceedings of the 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain, 22–25 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 541–548.
31. Srisopha, K.; Phonsom, C.; Lin, K.; Boehm, B. Same app, different countries: A preliminary user reviews study on most downloaded ios apps. In Proceedings of the 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), Cleveland, OH, USA, 29 September–4 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 76–80.
32. Hossain, M.S.; Rahman, M.F. Sentiment analysis and review rating prediction of the users of Bangladeshi Shopping Apps. In Developing Relationships, Personalization, and Data Herald in Marketing 5.0; IGI Global: Pennsylvania, PA, USA, 2022; pp. 33–56.
33. Bello, A.; Ng, S.C.; Leung, M.F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506. [CrossRef]
34. Catelli, R.; Pelosi, S.; Esposito, M. Lexicon-based vs. Bert-based sentiment analysis: A comparative study in Italian. Electronics 2022, 11, 374. [CrossRef]
35. Patel, A.; Oza, P.; Agrawal, S. Sentiment Analysis of Customer Feedback and Reviews for Airline Services using Language Representation Model. Procedia Comput. Sci. 2023, 218, 2459–2467. [CrossRef]
36. Mujahid, M.; Kanwal, K.; Rustam, F.; Aljadani, W.; Ashraf, I. Arabic ChatGPT Tweets Classification using RoBERTa and BERT Ensemble Model. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023. [CrossRef]
37. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Sciarretta, L.; Ursino, D.; Virgili, L. A Space-Time Framework for Sentiment Scope Analysis in Social Media. Big Data Cogn. Comput. 2022, 6, 130. [CrossRef]
38. Bonifazi, G.; Corradini, E.; Ursino, D.; Virgili, L. Modeling, Evaluating, and Applying the eWoM Power of Reddit Posts. Big Data Cogn. Comput. 2023, 7, 47. [CrossRef]
39. Messaoud, M.B.; Jenhani, I.; Jemaa, N.B.; Mkaouer, M.W. A multi-label active learning approach for mobile app user review classification. In Proceedings of the Knowledge Science, Engineering and Management: 12th International Conference, KSEM 2019, Athens, Greece, 28–30 August 2019; Proceedings, Part I 12; Springer: Berlin/Heidelberg, Germany, 2019; pp. 805–816.
40. Fuad, A.; Al-Yahya, M. Analysis and classification of mobile apps using topic modeling: A case study on Google Play Arabic apps. Complexity 2021, 2021, 1–12. [CrossRef]
41. Venkatakrishnan, S.; Kaushik, A.; Verma, J.K. Sentiment analysis on google play store data using deep learning. In Applications of Machine Learning; Springer: Singapore, 2020; pp. 15–30.
42. Alam, S.; Yao, N. The impact of preprocessing steps on the accuracy of machine learning algorithms in sentiment analysis. Comput. Math. Organ. Theory 2019, 25, 319–335. [CrossRef]
43. Vijayarani, S.; Ilamathi, M.J.; Nithya, M. Preprocessing techniques for text mining—an overview. Int. J. Comput. Sci. Commun. Netw. 2015, 5, 7–16.
44. R, S.; Mujahid, M.; Rustam, F.; Mallampati, B.; Chunduri, V.; de la Torre Díez, I.; Ashraf, I. Bidirectional encoder representations from transformers and deep learning model for analyzing smartphone-related tweets. PeerJ Comput. Sci. 2023, 9, e1432. [CrossRef]
45. Kadhim, A.I. An evaluation of preprocessing techniques for text classification. Int. J. Comput. Sci. Inf. Secur. 2018, 16, 22–32.
46. Loria, S. Textblob Documentation. Release 0.15. 2018. Volume 2. Available online: https://buildmedia.readthedocs.org/media/pdf/textblob/latest/textblob.pdf (accessed on 23 May 2023).
31
Information 2023, 14, 474
47.
48.
49.
50.
51.
52.
53.
54.
55.
56.
57.
58.
59.
60.
61.
62.
63.
64.
65.
66.
67.
68.
69.
70.
71.
72.
73.
74.
75.
Borg, A.; Boldt, M. Using VADER sentiment and SVM for predicting customer response sentiment. Expert Syst. Appl. 2020,
162, 113746. [CrossRef]
Karamibekr, M.; Ghorbani, A.A. Sentiment analysis of social issues. In Proceedings of the 2012 International Conference on
Social Informatics, Alexandria, VA, USA, 14–16 December 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 215–221.
Kumar, Y.; Koul, A.; Singla, R.; Ijaz, M.F. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing
framework and future research agenda. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 8459–8486 [CrossRef]
Shafique, R.; Aljedaani, W.; Rustam, F.; Lee, E.; Mehmood, A.; Choi, G.S. Role of Artificial Intelligence in Online Education: A
Systematic Mapping Study. IEEE Access 2023, 11, 52570–52584. [CrossRef]
George, A.; Ravindran, A.; Mendieta, M.; Tabkhi, H. Mez: An adaptive messaging system for latency-sensitive multi-camera
machine vision at the iot edge. IEEE Access 2021, 9, 21457–21473. [CrossRef]
Ravindran, A.; George, A. An edge datastore architecture for Latency-Critical distributed machine vision applications. In
Proceedings of the USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), Boston, MA, USA, 10 July 2018.
Kadhim, A.I. Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. Rev. 2019,
52, 273–292. [CrossRef]
Chen, H.; Wu, L.; Chen, J.; Lu, W.; Ding, J. A comparative study of automated legal text classification using random forests and
deep learning. Inf. Process. Manag. 2022, 59, 102798. [CrossRef]
Schröder, C.; Niekler, A. A survey of active learning for text classification using deep neural networks. arXiv 2020,
arXiv:2008.07267.
Prabhat, A.; Khullar, V. Sentiment classification on big data using Naïve Bayes and logistic regression. In Proceedings of the 2017
International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 5–7 January 2017; IEEE:
Piscataway, NJ, USA, 2017; pp. 1–5.
Valencia, F.; Gómez-Espinosa, A.; Valdés-Aguirre, B. Price movement prediction of cryptocurrencies using sentiment analysis
and machine learning. Entropy 2019, 21, 589. [CrossRef] [PubMed]
Zharmagambetov, A.S.; Pak, A.A. Sentiment analysis of a document using deep learning approach and decision trees. In
Proceedings of the 2015 Twelve International Conference on Electronics Computer and Computation (ICECCO), Almaty,
Kazakhstan, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–4.
Shah, K.; Patel, H.; Sanghvi, D.; Shah, M. A comparative analysis of logistic regression, random forest and KNN models for the
text classification. Augment. Hum. Res. 2020, 5, 12. [CrossRef]
Tiwari, D.; Singh, N. Ensemble approach for twitter sentiment analysis. IJ Inf. Technol. Comput. Sci. 2019, 8, 20–26. [CrossRef]
Arya, V.; Mishra, A.K.M.; González-Briones, A. Analysis of sentiments on the onset of COVID-19 using machine learning
techniques. ADCAIJ Adv. Distrib. Comput. Artif. Intell. J. 2022, 11, 45–63. [CrossRef]
Severyn, A.; Moschitti, A. Unitn: Training deep convolutional neural network for twitter sentiment classification. In Proceedings
of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA, 4–5 June 2015; pp. 464–469.
Seo, S.; Kim, C.; Kim, H.; Mo, K.; Kang, P. Comparative study of deep learning-based sentiment classification. IEEE Access 2020,
8, 6861–6875. [CrossRef]
Nowak, J.; Taspinar, A.; Scherer, R. LSTM recurrent neural networks for short text and sentiment classification. In Proceedings of
the Artificial Intelligence and Soft Computing: 16th International Conference, ICAISC 2017, Zakopane, Poland, 11–15 June 2017;
Proceedings, Part II 16; Springer: Cham, Switzerland, 2017; pp. 553–562.
Mujahid, M.; Rustam, F.; Alasim, F.; Siddique, M.; Ashraf, I. What people think about fast food: Opinions analysis and LDA
modeling on fast food restaurants using unstructured tweets. PeerJ Comput. Sci. 2023, 9, e1193. [CrossRef]
Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings
of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015;
pp. 1422–1432.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
Tenney, I.; Das, D.; Pavlick, E. BERT rediscovers the classical NLP pipeline. arXiv 2019, arXiv:1905.05950.
González-Carvajal, S.; Garrido-Merchán, E.C. Comparing BERT against traditional machine learning text classification. arXiv
2020, arXiv:2005.13012.
Cruz, J.C.B.; Cheng, C. Establishing baselines for text classification in low-resource languages. arXiv 2020, arXiv:2005.02068.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language
understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly
optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
Amaar, A.; Aljedaani, W.; Rustam, F.; Ullah, S.; Rupapara, V.; Ludi, S. Detection of fake job postings by utilizing machine learning
and natural language processing approaches. Neural Process. Lett. 2022, 54, 2219–2247 [CrossRef]
Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X.; Jiang, X.; Li, Y.; Zhao, L. Latent Dirichlet allocation (LDA) and topic modeling: Models,
applications, a survey. Multimed. Tools Appl. 2019, 78, 15169–15211. [CrossRef]
Wadhwa, S.; Babber, K. Performance comparison of classifiers on twitter sentimental analysis. Eur. J. Eng. Sci. Technol. 2021,
4, 15–24. [CrossRef]
32
Information 2023, 14, 474
76.
77.
78.
SemEvel2013 Dataset. Available online: https://www.kaggle.com/datasets/azzouza2018/semevaldatadets?select=semeval-20
13-train-all.csv (accessed on 23 May 2023).
Rustam, F.; Ashraf, I.; Mehmood, A.; Ullah, S.; Choi, G.S. Tweets classification on the base of sentiments for US airline companies.
Entropy 2019, 21, 1078. [CrossRef]
Sentiment Viz: Tweet Sentiment Visualization. Available online: https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_
app/ (accessed on 23 May 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
33
Article
Computing the Sound–Sense Harmony: A Case Study of
William Shakespeare’s Sonnets and Francis Webb’s Most
Popular Poems
Rodolfo Delmonte
Department of Linguistics and Comparative Cultural Studies, Ca' Foscari University of Venice, 30123 Venice, Italy;
[email protected]
Abstract: Poetic devices implicitly work towards inducing the reader to associate intended and
expressed meaning to the sounds of the poem. In turn, sounds may be organized a priori into
categories and assigned presumed meaning as suggested by traditional literary studies. To compute
the degree of harmony and disharmony, I have automatically extracted the sound grids of all the
sonnets by William Shakespeare and have combined them with the themes expressed by their
contents. In a first experiment, sounds have been associated with lexically and semantically based
sentiment analysis, obtaining 80% agreement. In a second experiment, sentiment analysis
was replaced by Appraisal Theory, obtaining a more fine-grained interpretation that
combines disharmony with irony. The computation for Francis Webb is based on his most popular
100 poems and combines automatic semantically and lexically based sentiment analysis with sound
grids. The results produce visual maps that clearly separate poems into three clusters: negative
harmony, positive harmony and disharmony, where the latter instantiates the need by the poet to
encompass the opposites in a desperate attempt to reconcile them. Shakespeare and Webb have been
chosen to prove the applicability of the method proposed in general contexts of poetry, exhibiting the
widest possible gap at all linguistic and poetic levels.
Keywords: specialized NLP system for poetry; automatic poetic analysis; visualization of linguistic and poetic content; Sound–Sense matching algorithm; phonetic and phonological analysis; automatic lexical and semantic sentiment analysis; computing irony; appraisal theory framework

Citation: Delmonte, R. Computing the Sound–Sense Harmony: A Case Study of William Shakespeare's Sonnets and Francis Webb's Most Popular Poems. Information 2023, 14, 576. https://doi.org/10.3390/info14100576

Academic Editors: Peter Revesz and Tudor Groza

Received: 7 July 2023; Revised: 21 September 2023; Accepted: 5 October 2023; Published: 20 October 2023

Copyright: © 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
In this article, I will propose a totally new technique to assess and appreciate poetry,
the Algorithm for Sound and Sense Harmony (henceforth ASSH). The main tenet of this
paper is the existence of a hidden and systematic plan by important poets like Shakespeare
and Webb to organize rhyming structures in accordance with a principle of overall ASSH.
What is meant here by “Sound Harmony” is the presence of rhymes whose sound—the
stressed vowel that is dominant—belongs to the four sound classes that may comprise all
vowel sounds, phonologically speaking, i.e., low, mid, high-front, high-back, or part of
them. In addition, the “Sound Harmony” is composed with Sense to make up the ASSH,
where the choice of sounds reflects the contents of the poem, as it may be represented by
main topics, intended meaning and overall sentiment. The same argument is presented for
the presence of the three main classes of consonants, i.e., continuants, sonorants, obstruents
and their partition into voiced vs. unvoiced. The choice to favor the presence of one
class vs. another is to be interpreted as a way to highlight sense-related choices of words
that will either accompany or contrast with Sounds. In particular, we associate different moods—following traditional judgements—with vowels and consonants according to their class, as follows:
1. Low and mid vowels evoke a sense of brightness, peace and serenity;
2. High, front and back vowels evoke a sense of surprise, seriousness, rigor and gravity;
3. Obstruent and unvoiced consonants evoke a sense of harshness and severity;
4. Sonorant and continuant consonants evoke a sense of pleasure, softness and lightness.
Classes 1 and 4 will be regarded as belonging to the same area of positive feeling, while classes 2 and 3 will more naturally be accompanied by negative sentiment. Of course, crossed matches between classes of opposite types may take place more or less frequently, indicating the need to reconcile opposite feelings in the same poem. This is what happens in both Shakespeare's and Webb's poems, as will be shown in the sections below.
It is important to highlight the role of sounds in poetry, which is paramount for the
creation of poetic and rhetoric devices. Rhyme, alliterations, assonances and consonances
may contribute secondary and, in some cases, primary additional meaning by allowing
words which are not otherwise syntactically or semantically related to share part if not
all of their meaning by means of metaphors and other similar devices. Thus, most of the
difficult work of every poet is devoted to the choice of the appropriate word to use for
rhyming purposes, mainly, but also for the other important devices mentioned above.
In the case of Shakespeare, for the majority of the sonnets, he took care to choose rhyme words whose sounds span all four varieties, thus producing a highly varied sound harmony. We will discuss this in the sections below, paying attention to the association between the choice of one class vs. another and the choice of specific themes or words. This important feature of the sonnets has never been noticed by literary critics in the past.
Reasons for this apparent lack of attention may be imputed to two seemingly hindering factors. The first is the use of words which had a double pronunciation at the time: LOVE, for instance, could be pronounced to rhyme with MOVE in addition to its current pronunciation. The second is the high percentage, in comparison with other poets of the same Elizabethan period, of a variable we call the Rhyme Repetition Rate (TripleR), which indicates the use of the same "head" word, i.e., the rhyming word that precedes the alternate rhyme scheme, or sometimes the same couple of words.
The use of mood and related colours associated with sound in poetry has a long
tradition. Rimbaud composed a poem devoted to “Vowels”, where colours were associated
with each of the main five vowels. Roman Jakobson [1,2] and Mazzeo [3] wrote extensively
about the connection between sound and colour in a number of papers. Fónagy [4] wrote
an article in which he explicitly connected certain types of consonant sounds with certain moods: unvoiced and obstruent consonants are associated with an aggressive mood; sonorants with tender moods. Macdermott [5] identified a specific quality
associated with “dark” vowels, i.e., back vowels, that of being linked with dark colours,
mystic obscurity, hatred and struggle. As a result, we are using darker colours to highlight
back and front vowels as opposed to low and middle vowels, the latter with light colours.
The same applies to representing unvoiced and obstruent consonants as opposed to voiced
and sonorants. But as Tsur (see [6], p. 15) notes, this sound–colour association with mood
or attitude has no real significance without a link to semantics.
In one of the visual outputs produced by our system SPARSAR (presented in a section below), the Semantic Relational View, we use dark colours for concrete referents vs. abstract ones [7] in lighter colours; and dark colours also for negatively marked words as opposed to positively marked ones in lighter colours. The same strategy
applies to other poetic maps: this technique has certainly the good quality of highlighting
opposing differences at some level of abstraction. Our approach is not comparable to
work by Saif Mohammad [8], where colours are associated with words on the basis of
what their mental image may suggest to the mind of annotators hired via Mechanical Turk
(Amazon Mechanical Turk (MTurk) is a crowdsourcing marketplace that makes it easier
for individuals and businesses to outsource their processes and jobs). The resource only
contains word–colour association for some 12,000 entries over the 27 K items listed. It is,
however, comparable to a long list of other attempts at depicting phonetic differences in
poems as will be discussed further on. With this experiment, I intend to verify the number
of poems in Webb’s corpus in which it is possible to establish a relationship between
semantic content in terms of negative vs. positive sense—usually referred to with one
word as “the sentiment”—and the sound produced by syllables, in particular stressed ones.
We adopt a lexical approach, mainly using the database of 40 K entries made available by
Brysbaert et al. 2014.
Thus, I will match the negative sentiment expressed by the words' sense with sad-sounding rhymes and poetic devices as a whole, and the opposite for positive sentiment, by scoring and computing the ratios. I repeat here below the way in which I organized vowel and consonant sounds:
- Low, middle, high-front, high-back;
where I identify the two classes low and middle as promoting positive feelings, and the two high classes as inducing negative ones. As to the consonants, I organized the sounds into three main classes and two types:
- Obstruents (plosives, affricates), continuants (fricatives), sonorants (liquids, vibrants, approximants);
- Voiced vs. unvoiced.
In this case, the ratios are computed by dividing the sum of continuants and sonorants by the number of obstruents; the second parameter is the ratio obtained by dividing the number of voiced by unvoiced consonants. Whenever the value of a ratio is above 1, the result counts as positive; the contrary applies whenever it is below 1. In this way, counting results is immediate and very effective.
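As a minimal sketch of this scoring step (the phoneme class inventories below are illustrative assumptions, not the system's actual tables), the two consonant ratios can be computed as follows:

```python
# Sketch of the two consonant ratios described above; class inventories
# are illustrative assumptions, not SPARSAR's actual phoneme tables.
SONORANTS = {"m", "n", "ng", "l", "r", "w", "y"}
CONTINUANTS = {"f", "v", "th", "dh", "s", "z", "sh", "zh", "hh"}
OBSTRUENTS = {"p", "b", "t", "d", "k", "g", "ch", "jh"}
VOICED = {"b", "d", "g", "jh", "v", "dh", "z", "zh", "m", "n", "ng", "l", "r", "w", "y"}

def consonant_ratios(consonants):
    """Return (softness_ratio, voicing_ratio) for a poem's consonant list."""
    soft = sum(1 for c in consonants if c in SONORANTS or c in CONTINUANTS)
    hard = sum(1 for c in consonants if c in OBSTRUENTS)
    voiced = sum(1 for c in consonants if c in VOICED)
    unvoiced = len(consonants) - voiced
    # Ratios above 1 count as positive, below 1 as negative (see text).
    return soft / max(hard, 1), voiced / max(unvoiced, 1)

ratios = consonant_ratios(["m", "l", "s", "t", "k", "d", "n"])
print(ratios, ["negative", "positive"][all(r > 1 for r in ratios)])
```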
The Result section of the paper has a first rather lengthy subsection dedicated to
the problem of rhyming structure which in the Sonnets constitutes the basic framework
onto which all the subsequent reasoning is founded. Another subsection is dedicated to
associating rhyming schemes with different themes as they have evolved in time. We
dedicate a subsection to explaining the importance of the lexical approach in organizing
the rules for the system SPARSAR, which derives the final vowel and consonant grids
that allow us to make the first comparison. The lexical and semantic approach to deriving the sentiment of each sonnet produces a first subdivision of harmonic and disharmonic sonnets into negatively vs. positively marked ones. Measuring correlations reveals a
constant contrasting attitude induced by the sound–sense agreement, which we interpret
as an underlying hidden intention to produce some form of ironic mood in Shakespeare’s
sonnets.
Detecting irony requires a much deeper and more accurate analysis of the semantics and pragmatics of the sonnets. We proceed in two separate but conjoined ways: producing a
gold standard of the sonnets and then manually annotating each sonnet using the highly
sophisticated labeling system proposed by the Appraisal Theory Framework (ATF), which we introduce briefly in Section 3.2.4. Matching the empirical approach with the automatic
analysis confirms the overall underlying hypothesis: the sound–sense disharmony has a
fundamental task, that of suggesting an underlying ironic attitude which is at the heart of
all the sonnets. ATF makes available a more fine-grained approach which takes non-literal
language into due account, thus improving on the previous method of sentiment-based
analysis (see Martin et al. [9] and Taboada et al. [10]).
2. Materials and Methods
In this section, I will present the system SPARSAR and the pipeline of modules that
allow it to carry out the complex analysis reported above.
2.1. SPARSAR—A System for Poetry Analysis and Reading
SPARSAR [11] produces a deep analysis of each poem at different levels: it works
at the sentence level at first, then at the verse level and finally at the stanza level (see
Figure 1 below). The structure of the system is organized as follows: the input text is
processed at first at a syntactic and semantic level and grammatical functions are evaluated.
Then, the poem is translated into a phonetic form, preserving its visual structure and its
subdivision into verses and stanzas. Phonetically translated words are associated with
mean duration values taking into account position in the word and stress. At the end of the
analysis of the poem, the system can measure the following parameters: mean verse length
in terms of msec and in number of feet. The latter is derived from the representation of the verse in metrical structure. Another important component of the analysis of rhythm is constituted
by the algorithm that measures and evaluates rhyme schemes at the stanza level and then
the overall rhyming structure at the poem level. In addition, the system has access to
a restricted list of typical pragmatically marked phrases and expressions that are used
to convey specific discourse function and speech acts, and need specialized intonational
contours.
Figure 1. Architecture of SPARSAR with main pipeline organized into three levels.
We use the word “expressivity” [12] to refer to the following levels of intervention of syntactic–semantic and pragmatic knowledge:
- Syntactic heads which are quantified expressions;
- Syntactic heads which are preverbal subjects;
- Syntactic constituents that start and end an interrogative or an exclamative sentence;
- Distinguishing realis from irrealis mood;
- Distinguishing deontic modality, including imperative, hortative, optative, deliberative, jussive, precative, prohibitive, propositive, volitive, desiderative, imprecative, directive, necessitative, etc.;
- Distinguishing epistemic modality, including assumptive, deductive, dubitative, alethic, inferential, speculative, etc.;
- Any sentence or phrase which is recognized as a formulaic or frozen expression with specific pragmatic content;
- Subordinate clauses with inverted linear order, distinguishing causals from hypotheticals and purpose complex sentences;
- Distinguishing parentheticals from appositives and unrestricted relatives;
- Discourse structure to distinguish satellite and dependent clauses from the main clause;
- Discourse structure to check for discourse moves: up, down and parallel;
- Discourse relations to tell foreground relations from backgrounds;
- Topic structure to tell the introduction of a new topic from a simple change at the relational level.
Current TTS systems only take into account information coming from punctuation and, in some cases, from tagging. This hampers their ability to capture the great majority of the structures listed above. In addition, they do not adequately consider the ambiguity of punctuation: the comma, for instance, is a highly ambiguous punctuation mark with a whole set of different functions which are associated with specific intonational contours, and requires semantic- and discourse-level knowledge to disentangle. In general, punctuation marks like question and exclamation marks are used to modify only the prosody of the preceding word, which is clearly insufficient to reproduce such pragmatically marked utterances; the modification should encompass the whole sentence from its beginning.
2.2. The Modules for Syntax and Semantics
The system uses a modified version of VENSES, a semantically oriented NLP pipeline
[13]. It is accompanied by a module that works at sentence level and produces a whole set of
analyses at quantitative, syntactic and semantic levels. As regards syntax, the system makes
available chunks and dependency structures. Then, the system introduces semantics both in the form of a classifier and by isolating the verbal complex in order to verify propositional
properties, like presence of negation, to compute factuality from a crosscheck with modality,
aspectuality—that is derived from the lexica—and tense. On the other hand, the classifier
has two different tasks: separating concrete from abstract nouns, and distinguishing highly ambiguous concepts from singleton ones (based on the number of possible meanings in WordNet and other similar repositories). Eventually, the system carries out a sentiment analysis of
the poem, thus contributing a three-way classification: neutral, negative, and positive that
can be used as a powerful tool for prosodically related purposes.
Semantics in our case not only refers to predicate–argument structure, negation scope,
quantified structures, anaphora resolution and other similar items. It essentially refers
to a propositional-level analysis, which is the basis for discourse structure and discourse
semantics contained in discourse relations. It also paves the way for a deep sentiment
or affective analysis of every utterance, which alone can take into account the various
contributions that may come from syntactic structures like NPs and APs, where affectively
marked words may be contained. Their contribution needs to be computed in a strictly
compositional manner with respect to the meaning associated with the main verb, where
negation may be lexically expressed or simply lexically incorporated in the verb meaning
itself.
Figure 1 above shows the architecture of the deep system for semantic and pragmatic processing, in which phonetics, prosodics and NLP are deeply interwoven. The
system does low-level analyses before semantic modules are activated, that is tokenization,
sentence splitting, and multiword creation from a large lexical database. Then, chunking
and syntactic constituency parsing is conducted using a rule-based recursive transition network: the parser works in a cascaded recursive way to include higher syntactic structures
up to the sentence and complex sentence level. These structures are then passed to the first
semantic mapping algorithm that looks for subcategorization frames in the lexica freely
made available for English, including a proprietary lexicon of some 10 K entries, with most
frequent verbs, adjectives and nouns, also containing a detailed classification of all grammatical or function words. This mapping is performed following LFG principles [14,15],
where c-structure is mapped onto f-structure, thus obeying uniqueness, completeness and
coherence. The output of this mapping is a rich dependency structure, which contains
information related to implicit arguments as well, i.e., subjects of infinitivals, participials
and gerundives. LFG representation also has a semantic role associated with each grammatical function, which is used to identify the syntactic head lemma uniquely in the sentence.
When fully coherent and complete predicate argument structures have been built, pronominal binding and anaphora resolution algorithms are fired. The coreferential processes are then activated at the semantic level. Discourse-level computation is conducted at the propositional level by building a vector of features associated with the main verb of each clause.
They include information about tense, aspect, negation, adverbial modifiers, and modality.
These features are then filtered through a set of rules which have the task to classify a
proposition as either objective/subjective, factual/nonfactual, foreground/background. In
addition, every lexical predicate is evaluated with respect to a class of discourse relations.
Eventually, discourse structure is built, according to criteria of clause dependency where a
clause can be classified either as coordinate or subordinate.
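A hedged sketch of this propositional-level filtering step is given below; the feature names and rules are illustrative stand-ins for the system's actual rule set:

```python
# Illustrative sketch: a feature vector per clause is filtered through
# simple rules to yield the three classifications named in the text.
# Feature names and thresholds are assumptions, not the system's rules.
def classify_proposition(feats):
    """feats: dict with tense, aspect, negation, modality keys."""
    subjective = feats.get("modality") in {"epistemic", "deontic"}
    factual = (feats.get("tense") in {"past", "present"}
               and not feats.get("negation")
               and feats.get("modality") is None)
    # Perfective past clauses tend to carry foreground material.
    foreground = feats.get("aspect") == "perfective" and feats.get("tense") == "past"
    return {"subjective": subjective, "factual": factual, "foreground": foreground}

print(classify_proposition(
    {"tense": "past", "aspect": "perfective", "negation": False, "modality": None}))
```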
2.3. The Modules for Phonetic and Prosodic Analysis
The second set of modules is a rule-based system that converts graphemes of each
poem into phonetic characters; it divides words into stressed/unstressed syllables and
computes rhyming schemes at the line and stanza level. To this end, it uses grapheme-to-phoneme translations made available by different sources, amounting to some 500 K entries,
and include the CMU dictionary (Freely downloadable from http://www.speech.cs.cmu.
edu/cgi-bin/cmudict accessed on 6 July 2023), MRC Psycholinguistic Database (Freely
downloadable from https://websites.psychology.uwa.edu.au/school/mrcdatabase/uwa_
mrc.htm accessed on 6 July 2023), Celex Database [16], plus a proprietary database made
of some 20,000 entries. Out-of-vocabulary words are computed by means of a prosodic
parser implemented in a previous project [17], containing a big pronunciation dictionary
which covers 170,000 entries, approximately. Besides the need to cover the majority of
grapheme-to-phoneme conversions through the use of appropriate dictionaries, the remaining problems to be solved are related to ambiguous homographs like “import” (verb)
and “import” (noun), and are treated on the basis of their lexical category derived from
previous tagging. Eventually, there is always a certain number of out-of-vocabulary words
(OOVW). The simplest case is constituted by differences in spelling determined by British
vs. American pronunciation. This is taken care of by a dictionary of graphemic correspondences. However, whenever the word is not found, the system proceeds by morphological
decomposition, splitting at first the word from its prefix and if that still does not work,
its derivational suffix. As a last resort, an orthographically based version of the same
dictionary is used to try and match the longest possible string in coincidence with the
current OOVW. Then, the remaining portion of the word is dealt with by guessing its
morphological nature, and if that fails, a grapheme-to-phoneme parser is used. Some of
the OOVWs that have been reconstructed by means of the recovery strategy explained
above are wayfarer, gangrened, krog, copperplate, splendor, filmy, seraphic, seraphine, and
unstarred.
Other words we had to reconstruct are shrive, slipstream, fossicking, unplotted, corpuscle, thither, wraiths, etc. In some cases, the problem that made the system fail was
the presence of a syllable which was not available in our database of syllable durations,
VESD [17]. This problem has been addressed by manually inserting the missing syllable
and by computing its duration from the component phonemes, or from the closest similar syllable available in the database. We only had to add 12 new syllables for a set of
approximately 1000 poems that the system computed.
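The recovery cascade for out-of-vocabulary words described above can be sketched as follows; the affix lists and toy lexicon are assumptions for illustration only:

```python
# Minimal sketch of the OOV recovery cascade: dictionary lookup, then
# prefix split, then derivational suffix split, then longest orthographic
# match. Affix lists and the lexicon are toy assumptions.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ing", "ed", "er", "ic", "y"]

def recover_oov(word, lexicon):
    if word in lexicon:
        return lexicon[word]
    for p in PREFIXES:                      # 1. morphological prefix split
        if word.startswith(p) and word[len(p):] in lexicon:
            return lexicon[word[len(p):]]
    for s in SUFFIXES:                      # 2. derivational suffix split
        if word.endswith(s) and word[:-len(s)] in lexicon:
            return lexicon[word[:-len(s)]]
    for i in range(len(word), 2, -1):       # 3. longest orthographic match
        if word[:i] in lexicon:
            return lexicon[word[:i]]
    return None                             # 4. hand off to a g2p parser

lex = {"starred": "S T AA1 R D", "star": "S T AA1 R"}
print(recover_oov("unstarred", lex))        # -> phonetization of "starred"
```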
The system has no limitation on type of poetic and rhetoric devices; however, it is
dependent on language: Italian line verse requires a certain number of beats and metric
accents which are different from the ones contained in an English iambic pentameter. Rules
implemented can demote or promote word-stress on a certain syllable depending on the
selected language, line-level syllable length and contextual information. This includes
knowledge about a word being part of a dependency structure either as dependent or as
head.
39
Information 2023, 14, 576
As R. Tsur [18] comments in the introduction to his book, iambic pentameter has to
be treated as an abstract pattern and no strict boundary can be established. The majority
of famous English poets of the past, while using iambic pentameter, have introduced
violations, which in some cases—as for Milton’s Paradise Lost—constitute the majority of
verse patterns. Instead, the prosodic nature of the English language needs to be addressed,
at first. English is a stress-timed language as opposed to Spanish or Italian which are
syllable-timed languages. As a consequence, what really matters in the evaluation of
iambic pentameters is the existence of a certain number of beats—5 in normal cases, but
also 4 in deviant ones. Unstressed syllables can be more numerous, as for instance in the case
of exceptional feminine rhyme or double rhyme, which consists of a foot made of a stressed
and an unstressed syllable (very common in Italian) ending the line—this is also used by
Greene et al. [19] to loosen the strict iambic model. These variations all derive from the elementary two-syllable feet: the iamb, the trochee, the spondee, and the pyrrhic.
According to the author, these variations are not casual; they are all motivated by the higher
syntactic–semantic structure of the phrase. So, there can be variations as long as they are
constrained by a meaningful phrase structure.
In our system, in order to allow for variations in the metrical structure of any line, we
operate on the basis of syntactic dependency and have a stress demotion rule to decide
whether to demote stress on the basis of contextual information. The rule states that
word stress can be demoted in dependents in adjacency with their head in case they are
monosyllabic words. In addition, we also have a promotion rule that promotes function
words which require word stress. This applies typically to ambiguously tagged words,
like “there”, which can be used as an expletive pronoun in preverbal position, and be
unstressed; but it can also be used as a locative adverb, in postverbal position,
and be stressed. For all these ambiguous cases, but also for homographs not homophones,
tagging and syntactic information is paramount.
Our rule system tries to avoid stress clashes and prohibits sequences of three stressed or three unstressed syllables unless the line's syntactic–semantic structure allows them to be interpreted otherwise. Generally speaking, prepositions and auxiliary verbs may be promoted;
articles and pronouns never. An important feature of English vs. Italian is length of words
in terms of syllables. As may be easily gathered, English words have a high percentage of
one-syllable words when compared to Italian which, on the contrary, has a high percentage
of 3/4-syllable words.
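A minimal sketch of the demotion and promotion rules just described (the token fields and the head-adjacency test are simplified assumptions):

```python
# Sketch of the stress rules from this section: demote a monosyllabic
# dependent adjacent to its head; promote ambiguous function words such
# as postverbal locative "there". Token fields are assumptions.
def adjust_stress(tokens):
    # tokens: dicts with word, syllables, stress (0/1), role, head index
    out = []
    for i, t in enumerate(tokens):
        stress = t["stress"]
        if (t["syllables"] == 1 and t["role"] == "dependent"
                and abs(t["head"] - i) == 1):
            stress = 0                       # demotion rule
        if t["word"] == "there" and t["role"] == "adverb":
            stress = 1                       # promotion rule (locative use)
        out.append(stress)
    return out

line = [
    {"word": "the",   "syllables": 1, "stress": 0, "role": "dependent", "head": 1},
    {"word": "moon",  "syllables": 1, "stress": 1, "role": "head",      "head": 1},
    {"word": "stood", "syllables": 1, "stress": 1, "role": "head",      "head": 2},
    {"word": "there", "syllables": 1, "stress": 0, "role": "adverb",    "head": 2},
]
print(adjust_stress(line))  # -> [0, 1, 1, 1]
```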
2.4. Computing Metrical Structure and Rhyming Scheme
Any poem can be characterized by its rhythm, which is also revealing of the poet's peculiar style. In turn, the poem's rhythm is based mainly on two elements: meter, that is, the distribution of stressed and unstressed syllables in the verse; and the presence of rhymes and other poetic devices like alliteration, assonance, consonance, enjambment, etc., which contribute to poetic form at the stanza level. This level is then combined with syntax and semantics to produce adequate breath groups and the consequent subdivision: breath groups will usually end at line-stop words, but they may continue into the following line by means of enjambment.
As discussed above (see Figure 1), the analysis starts by translating every poem into
its phonetic form. After processing the whole poem on a line-by-line basis and having
produced all phonemic transcriptions, the system looks for poetic devices: assonances, consonances, alliterations and rhymes are analysed and then evaluated. Then, metrical structure is computed, that is, the alternation of beats. This is performed by considering all function or grammatical words which are monosyllabic as unstressed. In particular, a value of “0” is associated with all unstressed syllables and “1” with all stressed syllables, thus
including both primary and secondary stressed syllables. Syllable building is a discovery
process starting from longest possible phone sequences to shortest one. This is performed
heuristically trying to match pseudo syllables with the syllable list. Matching may fail
and will then result in a new syllable which has not been previously met. The assumption
is that any syllable inventory will be deficient, and will never be sufficient to cover the
whole spectrum of syllables available in the English language. For this reason, a certain
number of phonological rules has been introduced in order to account for any new syllable
that may appear. Also, syntactic information, computed separately, is taken advantage of to highlight chunks' heads as produced by the bottom-up parser. In that case,
stressed syllables take maximum duration values. Dependent words, on the contrary, are
“demoted” and take minimum duration values.
The distribution of metrical structure in the poem is evaluated by means of statistical
measures. As a final consideration, we discovered that even in the same poem, it is not
always possible to find that all lines have an identical number of syllables, identical number
of metrical feet and identical metrical verse structure. If we consider the sequence “01” as
representing the typical iambic foot, and the iambic pentameter as the typical verse metre
of English poetry, there is no poem strictly respecting it in our analyses. On the contrary,
we found trochees, “10”, dactyls, “100”, anapests, “001”, and spondees, “11”. At the end
of the computation, the system is used to measure two important indices: “mean verse
length” and “mean verse length in no. of feet”, that is, mean metrical structure.
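The mapping from a line's 0/1 stress string to classical two-syllable feet can be illustrated with a toy scanner (the foot inventory follows the text; everything else is an assumption):

```python
# Toy sketch of the metrical evaluation: map a line's 0/1 stress string
# to two-syllable feet and count the beats. Foot inventory follows the
# text (iamb, trochee, spondee, pyrrhic).
FEET = {"01": "iamb", "10": "trochee", "11": "spondee", "00": "pyrrhic"}

def scan_line(stresses):
    """stresses: string such as '0101010101' for a regular pentameter."""
    feet = [FEET.get(stresses[i:i + 2], "?") for i in range(0, len(stresses) - 1, 2)]
    beats = stresses.count("1")
    return feet, beats

print(scan_line("0101010101"))  # five iambs, 5 beats: canonical pentameter
print(scan_line("1001010101"))  # trochaic inversion in the first foot
```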
Further measures that we are able to produce are related to rhyming devices. Since we consider it important to take into account structural internal rhyming schemes and their persistence in the poem, the algorithm makes available additional data derived from two further components: word repetition and rhyme repetition at the stanza
level. Sometimes, “refrain” may also apply, that is, the repetition of an entire line of verse.
Rhyming schemes together with metrical length are the strongest parameters to consider
when assessing similarity between two poems.
Eventually, the internal structure of metrical devices used by the poet can be reconstructed: in some cases, stanza repetition at the poem level may also apply. To create the
rhyming scheme, couples of rhyming lines are searched for by recursively trying to match each final phonetic word with the following ones, from the closest to the one that is furthest apart. Each time, both the rhyming words and their distance are registered. In
the following pass, the actual final line numbers are reconstructed and then an indexed
list of couples, line number–rhyming line for all the lines is produced, including stanza
boundaries. Eventually, alphabetic labels are assigned to each rhyming verse starting from
A to Z. A simple alphabetic incremental mechanism updates the rhyme label. This may go
beyond the limits of the alphabet itself and in that case, double letters are used.
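A compact sketch of this labeling pass is given below; the rhyme test is a crude stand-in for the system's phonetic matching, and all data are toy values:

```python
# Sketch of the rhyme-scheme labeling pass: pair line-final phonetic
# words by a crude rhyme key, then assign incremental letters, doubling
# past 'Z'. The "-" separated toy phonetics are illustrative assumptions.
from string import ascii_uppercase

def rhyme_key(phonetic_word):
    # assumption: the last stressed vowel onward identifies the rhyme
    return phonetic_word.split("-")[-1]

def rhyme_scheme(final_words):
    labels, seen = [], {}
    for w in final_words:
        key = rhyme_key(w)
        if key not in seen:
            n = len(seen)
            seen[key] = ascii_uppercase[n % 26] * (n // 26 + 1)  # A..Z, AA...
        labels.append(seen[key])
    return "".join(labels)

# e.g. "increase/decease", "spring/niggarding" from Sonnet 1, toy phonetics
print(rhyme_scheme(["incre-EASE", "spr-ING", "dece-EASE", "niggard-ING"]))  # ABAB
```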
2.5. From Sentiment Analysis to the Deep Pragmatic Approach by ATF
We based a first approach to detecting sound–sense harmony on sentiment analysis,
which in our case encompasses both a lexical and a semantic analysis at the propositional
level. More generally speaking, computational research on sentiment analysis has been
based on the use of shallow features with a binary choice to train statistical models [20]
that, when optimized for a particular task, will produce acceptable performance. However,
generalizing the model to new texts is a hard task and, in addition, the sonnets contain
a lot of nonliteral language. The other common approach used to detect irony, in the
majority of the cases, is based on polarity detection. Sentiment analysis [21,22] is in fact an
indiscriminate labeling of texts either on a lexicon basis or on a supervised feature basis,
where in both cases, it is just a binary—ternary or graded—decision that has to be made.
This is certainly not explanatory of the phenomenon and will not help in understanding
what it is that causes humorous reactions to the reading of an ironic piece of text. It
certainly is of no help in deciding which phrases, clauses, multiwords or simply words contribute to creating the ironic meaning (see [23]).
Shakespeare’s Sonnets are renowned for being full of ironic content [24,25] and for
their ambiguity, which sometimes reverses the overall interpretation of the sonnet. Lexical ambiguity, i.e., a word with several meanings, emanates from the way in which the author uses words that can be interpreted in more than one way, not only because they are inherently polysemous, but because the additional meaning they evoke can sometimes be
41
Information 2023, 14, 576
derived on the basis of the sound, i.e., homophony (see “eye” and “I” in sonnet 152). The
sonnets are also full of metaphors which many times require contextualising the content to
the historical Elizabethan life and society. Furthermore, there is an abundance of words
related to specific language domains in the sonnets. For instance, there are words related to
the language of economy, war, nature and to the discoveries of the modern age, and each of
these words may be used as a metaphor of love. Many of the sonnets are organized around
a conceptual contrast, an opposition that runs parallel and then diverges, sometimes with
the use of the rhetorical figure of the chiasmus. It is just this contrast that generates irony,
sometimes satire, sarcasm, and even parody. Irony may be considered in turn as expressing what one means by using language that normally signifies the opposite, typically for humorous or emphatic effect; or as a state of affairs or an event that seems contrary to what one expects
and is amusing as a result. As to sarcasm, this may be regarded as the use of irony to mock
or convey contempt. Parody is obtained by using the words or thoughts of a person but
adapting them to a ridiculously inappropriate subject. There are several types of irony,
though we select verbal irony which, in the strict sense, is saying the opposite of what one means for effect, and which depends on the extra-linguistic context [26]. As a result, satire and
irony are slightly overlapping but constitute two separate techniques; eventually, sarcasm
can be regarded as a specialization or a subset of irony. It is important to remark that
in many cases, these linguistic structures may require the use of non-literal or figurative
language, i.e., the use of metaphors.
Joining sentiment, irony and sound as they could have been heard by Elizabethan audiences is what makes the Sonnets so special even today, and our paper succeeds in clarifying how this combination of deep and shallow factors intertwines to produce the glamorous effect that every sonnet still achieves today.
3. Results
This section will present the results of the analysis of Shakespeare's sonnets first and then of Webb's poems, highlighting all cases of harmony and disharmony in relation to the theme and meaning intended in each poem.
3.1. Sound Harmony in the Sonnets
We postulate the existence of a hidden plan in Shakespeare's poetic approach: to abide by a harmonic principle that requires all varieties of sound classes to be present and to relate,
by virtue of a sound–meaning correspondence, to thematic and meaning development
in the sonnet. To discover such a plan, we analysed the phonetic representation of the
rhyming words of all sonnets using SPARSAR—the system that automatically analyzes any
poem, see below—and then organized the results of all vowel sounds into the four classes
mentioned above. We did the same with consonants and consonant clusters in order to
obtain a sound grid that is complete and retains as much complexity as possible of each
poem and compared it with sense-related analyses.
However, in order to produce such a result, almost 500 phonologically ambiguous
rhyming words had to be checked and transformed into the pronunciation current in the
XVIth century, when Early Modern English was still in use. This will be explained in a section below. It is also important to note that the sonnets contain some 800 contractions
and some 50 metrical adjustments which require the addition of an end-of-word syllable.
After all these corrections, we obtained a sound map which clearly testifies to the intention
of preserving a sound–sense harmony in the overall poetic scheme of the sonnets.
We may state as a general principle that the sound–sense harmony is respected whenever there is a full agreement between the sound grid and the mood associated with the
meaning of the words. We assume then that there exists a sound–meaning correspondence
by which different emotions or sentiments may be associated with each class. And of
course, different results will be obtained by subtracting one class from the set, as we will
comment below.
3.1.1. Periods and Themes in the Sonnets
The sonnets have been written in the short period that goes from 1592 to 1608, but
we do not know precisely when. The majority of critics have divided them up into two
main subperiods: a first one from 1592 to 1597 encompassing Sonnets from 1 to 126 and a
second subperiod from 1598 to 1608 that includes Sonnets 127 to 154 (see Melchiori [9]). In
addition, the sonnets have been traditionally subdivided into five main cycles or themes
(Melchiori: Introduction): from 1 to 17, the reproduction sonnets, progeny, in which the
poet spurs the young man to marry; from 18 to 51, immortality of poetry, the temptation
of the friend by the lady, friend is guilty, and the absence of the loved one; from 52 to 96,
poetry and memory, beauty and poetic rivalry; from 97 to 126, memory, the mistakes of the
poet; and the last one from 127 to 152, the theme of the dark lady and unfaithfulness.
In Michael Schoenfeldt’s Introduction to his edited book [27], we find a similar subdivision: Sonnets 1–126 are addressed to a beautiful young man, while Sonnets 127–152 are
directed to a dark lady, and there are many other thematic and narrative sequences like
1–17 mentioned above (ibid. iii).
In the study of inversion made by Ingham and Ingham [28] on all of Shakespeare’s
plays, the authors reported three separate historical periods characterized by different
frequencies in the use of subject inversion (VS) compared with canonical order (SV) on a
total number of 951 clause structures:
1. A first period that goes from 1592 to 1597, where we have the majority of the cases of VS (214 over 421 total cases).
2. A second period that goes from 1598 to 1603, where the number of cases is reduced by half, but the proportion remains the same (109 over 213 total cases).
3. A third period that goes from 1604 to 1608, where the proportion of cases is reverted (95 over 317 total cases) and VS cases are the minority.
The main themes of the sonnets are well-known: from 1 to 126 they are stories about
a handsome young man, or a rival poet; from 127 to 152, the sonnets concern a mysterious “dark” lady whom the poet and his companion love. The last two poems are adaptations from
classical Greek poems. In the first sequence, the poet tries to convince his companion to
marry and have children who will ensure immortality. Aside from love, the poem and
poetry will “defeat” death. In the second sequence, both the poet and his companion have
become obsessed with the dark lady, the lexicon used is sensual and the tone distressing.
These themes are at their highest in the best sonnets indicated above. So, we would expect
these sonnets to exhibit properties related to popularity that set them apart from the rest.
We decided to look into the “themes” matter more deeply and discovered that the
immortality theme is in fact present through the lexical field constituted by the keyword
DEATH. We thus collected all words related to this main keyword and they are the following
ones, omitting all derivations, i.e., plurals for nouns, third person, past tense and gerundive
forms for verbs:
BURY, DEAD, DEATH, DECEASE, DECAY, DIE, DISGRACE, DOOM, ENTOMBED, GRAVE, GRIEF, GRIEVANCE, GRIEVE, SCYTHE, SEPULCHRE, TOMB, and WASTE,
which we connected to SAD, SADNESS, UNHAPPINESS, and WRINKLE. We ended up counting 64 sonnets containing this lexical field, which can safely be regarded as the most frequent theme of all. We then looked for the opposite meanings, the ones related to LIFE, HAPPY, HAPPINESS, PLEASURE, PLEASE, MEMORY, POSTERITY, and ETERNITY. In this case, 28 sonnets mention these themes. So, overall, we identified 92 sonnets addressing emotionally strong themes. When we combine the two contrasting themes, death/eternity and sadness/memory, we come up with the following 19 sonnets:
1, 3, 6, 8, 15, 16, 25, 28, 32, 43, 48, 55, 63, 77, 81, 92, 97, 128, 147
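A hedged sketch of this lexical-field count is given below; lemmatization is approximated here by prefix matching, whereas the counts reported above used full derivations:

```python
# Sketch: flag every sonnet whose text contains a DEATH-field or
# LIFE-field lemma. Prefix matching is a crude stand-in for real
# lemmatization; the field lists follow the text above.
DEATH_FIELD = {"bury", "dead", "death", "decease", "decay", "die", "disgrace",
               "doom", "entombed", "grave", "grief", "grievance", "grieve",
               "scythe", "sepulchre", "tomb", "waste", "sad", "sadness"}
LIFE_FIELD = {"life", "happy", "happiness", "pleasure", "please", "memory",
              "posterity", "eternity"}

def has_field(text, field):
    words = text.lower().replace(",", " ").split()
    return any(w.startswith(lemma) for w in words for lemma in field)

sonnet_1 = ("From fairest creatures we desire increase, "
            "That thereby beauty's rose might never die")
print(has_field(sonnet_1, DEATH_FIELD), has_field(sonnet_1, LIFE_FIELD))  # True False
```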
3.1.2. Measuring All Vowel Classes
We show in Table 1 below general statistics of the distribution of stressed vowel sounds in rhyming words of all the sonnets. We also included diphthongs, considering the
stressed portion as the relevant sound. The expected result is that the phonological class of
high-back is the one least present in the sonnets, followed by high-front and low. Rhyming
words with the middle stressed vowel are the ones with the highest frequency.
Table 1. Distribution of sounds of end-of-line rhyming words divided into four phonological classes.

Phon. Class    No. Class    StrVowDiph
High-Front     119          493
Mid            159          861
Low            142          587
High-Back      111          314
Total          531          2155
Here below are some examples of the classification of stressed vowels of rhyming
words in the first three sonnets:
Sonnet 1: FRONT—increase, decease, spring, niggarding, be, thee;
BACK—fuel, cruel;
LOW—die, memory, eyes, lies;
MIDDLE—ornament, content;
Sonnet 2: BACK—use, excuse, old, cold;
MIDDLE—field, held, days, praise;
LOW—lies, eyes, mine, thine, brow, now;
Sonnet 3: HIGH—thee, see, husbandry, posterity, be, thee;
BACK—womb, tomb, viewest, renewest;
LOW—another, mother, prime, time.
In Table 2, we show the presence of the four classes in each sonnet, confirming our
starting hypothesis about the intention to maintain a sound harmony in each sonnet: as
can be easily gathered, 141 sonnets out of 154 have rhymes with sounds belonging to more
than two classes.
Table 2. Subdivision of the sonnets by number of classes.

No. Classes    No. Sonnets
4-Class        77
3-Class        64
2-Class        12
1-Class        1
Total          154
There is one sonnet with only one class, sonnet 146; then, there are 12 sonnets with two classes of sounds: 8, 9, 64, 71, 79, 81, 87, 90, 92, 96, 124, and 149. These sonnets
contain rhyming pairs with low and middle sounds, except for three sonnets: sonnet 71
which contains high-back and middle sounds; sonnet 9 which contains high-front and low
sounds; and sonnet 96 containing high-front and middle sounds. The themes developed in
these sonnets fit perfectly into the rhyming sound class chosen. Let us consider sonnet VIII
which is all devoted to music and string instruments which require more than one string
to produce their sound, thus suggesting the need to find a companion and get married.
Consider the line “the true concord of well tunèd sounds,” which hints at the need for sounds to be “well” tuned. Sonnet 81 celebrates the poet and his verse, which shall
survive when death comes. Sonnet 92 is in fact pessimistic about the possibility that love will last “for the term of life” with no betrayal ensuing. As to sonnet 146, it is a mixture of two seemingly different themes: a criticism of the extravagant display of rich clothing and wealth by writers of the time, or perhaps of his mistress, trying to convince her to change
her ways for eternal salvation. Some critics regard this as the most profoundly religious
or meditative sonnet. But, the feeling of the lover renouncing something brings back
his mistress and the feeling of being powerless against her chastity, so that religious life
becomes a desirable aim. In this sense, death can also be depicted as desirable.
It is important to notice the overall strategy of choice of sound in relation to meaning,
in the rhyming devices used, for instance, in sonnet 147 (all sonnets are taken from the
online version made available at https://www.shakespeares-sonnets.com/ accessed on 6
July 2023):
My reason, the physician to my love,
Angry that his prescriptions are not kept,
Hath left me, and I desperate now approve
Desire is death, which physic did except.
The interesting fact in this case is that the appearance of a high-back sound like |U| matches the appearance of the saddest word, DEATH, in the same stanza. In other words, the masterful use of rhyming sounds goes hand in hand with the themes and
meaning developed in the sonnet.
It is interesting to note how the rhyming sound evolves, taking sonnet 107 as an example: from SAD sounds (back and high), to MID and CLOSE, to LOW and OPEN in the third stanza, ending with a repetition of MID sounds in the couplet:
Not mine own fears, nor the prophetic soul
Of the wide world dreaming on things to come,
Can yet the lease of my true love control,
Supposed as forfeit to a confin’d doom.
The mortal moon hath her eclipse endur’d,
And the sad augurs mock their own presage;
Incertainties now crown themselves assur’d,
And peace proclaims olives of endless age.
Now with the drops of this most balmy time,
My love looks fresh, and death to me subscribes,
Since, spite of him, I will live in this poor rime,
While he insults o’er dull and speechless tribes:
And thou in this shalt find thy monument,
When tyrants’ crests and tombs of brass are spent.
In Sonnet 145, the overall feeling of sadness is transferred into the rhyming sounds: in the first stanza, the correct EME pronunciation requires |come| to be pronounced like |doom|, CUM/DUM, a high-back sound which is then repeated in the final couplet where “sav’d my life” appears. Here, important echoes of the |U| sound appear in the couplet with the end-of-line words THREW and YOU.
...
Straight in her heart did mercy come,
Chiding that tongue that ever sweet
Was us’d in giving gentle doom;
...
From heaven to hell is flown away.
‘I hate’, from hate away she threw,
And sav’d my life, saying ‘not you’.
We saw above the subdivision into classes; however, it does not tell us how the four
phonological classes are distributed in the sonnets.
The resulting sound image coming from rhyme repetitions is eventually highlighted
by the frequency of occurrence of the same stressed vowel, as shown in Table 3. In this table, we
separated vowel sounds into three classes, high, middle, and low, to allow a better overall
evaluation.
Table 3. Total count for vowel, final consonant and sonorant sounds organized into classes for all Shakespearean sonnets.

N.   Sound   Following Vowel/Consonant                               Freq. Occ.
1    ay      d, er, f, l, m, n, r, t, v, z                           109
2    ey      d, jh, k, l, m, n, s, t, v, z                           81
3    n_      d, iy, jh, s, t, z                                      80
4    r_      ay1, d, ey1, iy, iy1, k, n, ow, ow1, s, t, th, uw1, z   68
5    eh      d, jh, k, l, n, r, s, t, th                             68
6    ih      d, l, m, n, ng, r, s, t, v                              51
7    ao      d, l, n, ng, r, s, t, th, z                             40
8    iy      d, f, ih, k, l, m, n, p, s, t, v, z                     45
9    s       iy, st, t                                               38
10   uw      d, m, n, s, t, th, v, z                                 47
11   ah      d, l, n, s, t, z                                        34
12   ow      k, l, n, p, t, th, z                                    21
13   t       er, ey1, iy, s, st                                      21
14   ah      d, k, m, n, ng                                          17
15   aa      n, r, t                                                 16
16   ae      ch, d, k, ng, s, v                                      14
17   d_z                                                             13
18   er      ay1, d, iy, z                                           11

Class totals for final sounds: High 168, Middle 200, Low 190, Consonant 220; overall total 778.
Eventually, we come up with 61 more frequent heads with occurrences up to four and
a total of 778 repeated vowel and consonant line-ending sounds. We now consider the
remaining 288 rhyming pairs organized into “head” and “dependent”, i.e., the rhyming word at the end of the earlier line and the one at the corresponding alternate or adjacent line end.
A direct consequence of the rhyming pair repetition rate is the sound image
created in each sonnet. We assume that a high level of repetition will create a sort of echo
from one sonnet to the next and a continuation effect, but it will also contribute a sense
of familiarity. We decided to verify what would be the overall sound effect created by
the total number of rhyming pairs analysed. Thanks to SPARSAR modules for phonetic
transcription and poetic devices detection discussed elsewhere [29], we managed to recover
all correct rhyming pairs and their phonetic forms. We report the results in the tables below.
The resulting sound image coming from rhyme repetitions is eventually highlighted
by the frequency of occurrence of the same stressed vowel, as shown in the two tables below.
We separated vowel sounds into three classes, high, middle, and low, to allow for an easy
overall evaluation. If we consider all vowel sounds, there appears to be a highly balanced
use of rhyming pairs with stressed low vowels being the more frequent. Not so if we
consider diphthongs—we always consider the stressed vowel in both rising and falling
diphthongs.
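Since vowel phones in CMU-dictionary-style transcriptions carry stress digits (0, 1, 2), the stressed nucleus of a rhyming word and its class can be extracted as in this sketch; the class map is an assumption aligned with the three-way grouping used in these tables:

```python
# Sketch of stressed-nucleus extraction from CMU-style phones (vowel
# phones carry stress digits 0/1/2); the class map is an assumption
# aligned with the high/middle/low grouping used in these tables.
VOWEL_CLASS = {"IY": "high", "IH": "high", "UW": "high", "UH": "high",
               "EH": "middle", "EY": "middle", "OW": "middle",
               "OY": "middle", "ER": "middle",
               "AA": "low", "AE": "low", "AH": "low", "AO": "low",
               "AW": "low", "AY": "low"}

def stressed_nucleus(phones):
    """Return (vowel, class) for the primary-stressed vowel of a word."""
    for p in phones:
        if p.endswith("1"):                 # primary stress marker
            return p[:-1], VOWEL_CLASS.get(p[:-1], "?")
    return None, None

# CMU-style transcription of "increase" (verb): IH0 N K R IY1 S
print(stressed_nucleus(["IH0", "N", "K", "R", "IY1", "S"]))  # ('IY', 'high')
```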
3.1.3. Distributing Vowel and Diphthong Classes into Thematic Periods
In Table 4 below, we collected all stressed vowels and diphthongs for the five periods
or phases into which the Sonnets collection can be divided up and found interesting
variations: Period 1 has only 17 sonnets and 238 stressed rhyming words; Period 2 has
34 sonnets and 476 rhyming words; Period 3 has the majority, 45 sonnets and 630 rhyming
words; Period 4 has 30 sonnets and 420 words; and Period 5 has the remaining 28 sonnets
and 398 rhyming words.
Table 4. (a) Distribution of stressed rhyming vowels in five periods. (b) Weighted values of the distribution of stressed rhyming vowels in five periods.

(a)
           Low       Middle    High      Total
Period 1   40        42        57        139
Period 2   105       68        102       275
Period 3   111       105       136       352
Period 4   59        79        122       260
Period 5   66        60        99        225
Totals     381       354       516       1251

(b)
           Low       Middle    High      Total
Period 1   2.3529    2.4706    3.3529    8.1765
Period 2   3.0882    2         3         8.0882
Period 3   2.4667    2.3334    3.0223    7.8223
Period 4   1.9667    2.6334    4.0667    8.6667
Period 5   2.3571    2.1429    3.5357    8.0357
Totals     30.529%   28.365%   41.106%   100%
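The weighted values in Table 4b can be reproduced by dividing the raw counts of Table 4a by the number of sonnets per period (17, 34, 45, 30, 28); a minimal check:

```python
# Arithmetic sketch for Table 4b: weighted values are the raw counts of
# Table 4a divided by the sonnets per period; the final line computes
# each class's share of all 1251 rhyming words (compare the totals row).
counts = {1: (40, 42, 57), 2: (105, 68, 102), 3: (111, 105, 136),
          4: (59, 79, 122), 5: (66, 60, 99)}          # (low, middle, high)
sonnets = {1: 17, 2: 34, 3: 45, 4: 30, 5: 28}

for period, row in counts.items():
    weighted = [round(v / sonnets[period], 4) for v in row]
    print(period, weighted, round(sum(row) / sonnets[period], 4))

totals = [sum(r[i] for r in counts.values()) for i in range(3)]
print([round(100 * t / sum(totals), 3) for t in totals])  # class percentages
```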
In Table 4a, we computed absolute values for each vowel class distributed over the five periods; what can be preliminarily noted is the high number of "high" vowels and the lower numbers of the two other classes. In Table 4b, we produced weighted measures in order to take into account differences in the number of sonnets considered, which would otherwise produce a disparity in the total number of occurrences: frequency values for each vowel class are now divided by the number of sonnets per phase, and the same holds for total values. In this case, we can easily see that high vowels are always the class with the most occurrences, and Periods 4 and 5 are the ones with the highest numbers—which, however, need to be divided into two subclasses, front and back. Among the two remaining classes, the low vowel class is the one with the higher percentage, and in Period 2, low vowels have their highest value when compared to the other Periods. The opposite takes place in Period 4, where high vowels are at their highest and low vowels at their lowest, also compared to the other Periods. We may note that, overall, the highest number of stressed vowels belongs to Phase 4, whereas the lowest belongs to Phase 3. Overall, the majority of stressed vowels belongs to the phonetic class of high vowels, followed by low and then middle.
We must now consider diphthongs and verify whether the same picture applies.
Diphthongs, as annotated in the CMU dictionary, do not contain any high stressed nuclear
vowel, because the choice was to separate high vowels in all those cases. So, we are left
with five diphthongs: two low, AW and AY; and three middle, EY, OW, and OY.
As can be easily gathered from absolute total values, middle diphthongs constitute by
far the majority. In Table 5 below is their distribution in the five phases, and as we did in
Table 4, we show at first absolute values and then in section (b) weighted values:
Table 5. (a) Distribution of stressed diphthongs in the sonnets divided into 5 phases. (b) Weighted values of the distribution of stressed diphthongs in the sonnets in 5 phases.

(a)
           Low    Middle   Total
Phase 1     50      46       96
Phase 2     78     103      181
Phase 3    112     154      266
Phase 4     81      72      153
Phase 5     65      85      150
Totals     386     460      846

(b)
           Low       Middle    Total
Phase 1    2.9412    2.7059    5.6471
Phase 2    2.2941    3.0294    5.3235
Phase 3    2.4889    3.4223    5.9112
Phase 4    2.7       2.4       5.1
Phase 5    2.3214    3.357     5.3571
Totals     45.626%   54.373%   100%
Both Phases 1 and 4 show a decrease of middle vs. low diphthongs, while the remaining three phases behave in the opposite manner, with more middle than low diphthongs. The weighted totals indicate Phase 3 as the phase with the highest number of diphthongs and Phase 4 as the one with the lowest, just the opposite of the previous distribution. General totals show a distribution of middle vs. low diphthongs which is strongly in favour of middle ones. This is just the opposite of what we found in the previous counts, and it in part compensates for the lack of high diphthongs.
Finally, as Table 6 shows, the overall sound image is determined by a strong presence of middle sounds, followed by low sounds and, lastly, high sounds.
Table 6. Sound image of the sonnets.

          Vowels   Diphthongs   Total
Low         381       386         767
Middle      354       460         814
High        567        —          567
Total      1312       854        2166
3.2. Rhyming and Rhythm: The Sonnets and Poetic Devices
3.2.1. Contractions vs. Rhyme Schemes
Contractions are present in a great number in the sonnets. Computing them requires
reconstructing their original complete corresponding word form in order to be able to
match it to the lexicon or simply derive the lemma through morphological processing. This
is essentially due to the fact that they are not predictable and must be analysed individually.
Each type of contraction has a different manner to reconstruct the basis wordform. In order
to understand and reconstruct it correctly, each contraction must go through recovering
of the lemma. We have found 821 contractions in the collection, where 255 are cases of
genitive’s, and 167 are cases of past tense/participle ‘d. The remaining cases are organised
as follows:
-	SUFFIXES attached at word end, for example 's, 'd, 'n, 'st, 't (putt'st);
-	PREFIXES elided at word beginning, for example 'fore, 'gainst, 'tis, 'twixt, 'greeing;
-	INFIXES made by consonant elision inside the word (o'er, ne'er, bett'ring, whate'er, sland'ring, whoe'er, o'ercharg'd, 'rous).
Now, consider a contracted word like "sland'ring": as said before, the complete wordform must first be reconstructed in order to use it for recovering the lemma and for using the grammatical category in syntax and semantics. However, when computing the metrical structure of each line, the phonetic translation should be made on the contracted word, which does not exist in any dictionary, neither in the form "slandring" nor in the form "sland-ring". What we do is find the phonetic transcription of the complete word, if already present in the dictionaries, and then subtract the phoneme that has been omitted, creating in this way a new word (see the sketch below). This works until we come to the module where metrical counts are made on the basis of the number of syllables. Here again, the phonetic form derived from the complete word is not easy to accommodate. There are two possible subdivisions of the phonetic form s_l_ae_n_d_r_ih_ng (in ARPAbet characters): syllable 1: s_l_ae_n_d_; syllable 2: r_ih_ng. Syllable 1 does not correspond to the subdivision for the complete word, which would be s_l_ae_n_|d_eh_|r_ih_ng. Luckily, the syllable exists independently, but this only happens occasionally. In the majority of cases, the new wordform produces syllables which are nonexistent and need to be created ad hoc.
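A minimal sketch of this subtraction step, assuming a CMU-style phone dictionary (the dictionary entry, function name, and residue handling are illustrative):

# Illustrative one-entry stand-in for a CMU-style pronunciation dictionary.
FULL_FORM_PHONES = {"slandering": ["s", "l", "ae1", "n", "d", "er0", "ih0", "ng"]}

def contract(full_form, elide, residue=None):
    # Drop the elided phone; 'residue' keeps what survives of it (e.g., er0 -> r).
    out = []
    for p in FULL_FORM_PHONES[full_form]:
        if p == elide:
            out.extend(residue or [])
        else:
            out.append(p)
    return out

# "sland'ring": the vowel of ER is elided but the sonorant /r/ survives.
print(contract("slandering", elide="er0", residue=["r"]))
# -> ['s', 'l', 'ae1', 'n', 'd', 'r', 'ih0', 'ng']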
3.2.2. Rhythm and Rhyme Violations
In poetry, in particular in the tradition of the sonnets in Elizabethan times, poetic
devices play a fundamental role. Sonnets in their Elizabethan variety had a stringent architecture which required the reciter to organize the presentation according to the logical structure of the stanzas: on the one side, introducing the main theme, expanding and developing the accompanying subthemes, exploring consequences, and finding some remedies to solve the dilemma or save the protagonist. On the other side, the line-by-line structure required the reciter to respect the alternate rhyming patterns, which were usually safeguarded by end-stopped lines. Thus, the audience's expectations were strongly influenced by any variation related to rhyming and rhythm as represented by the sequence
of breath groups and intonational groups. Whenever the rhyming pattern introduced a new, unexpected pronunciation of a rhyming word—one not heard in other contexts—the audience was stunned: say, a common word like love pronounced to rhyme with prove. The same effect must have been produced by enjambments, whenever lines had to run on because meaning required the syntactic structure to be reconstructed—as, for instance, in lines ending in a head noun which had its prepositional of-modifier at the beginning of the following line. Breath groups and intonational groups had to be recast to suit the unexpected variation, but rhyming had to be preserved. We will explore these aspects of
the sonnets thoroughly in this section.
In a previous paper [30], we discussed the problem of (pseudo) rhyme violations as
it has been presented in the literature on Shakespeare. In particular, we referred to the
presence of more than 100 apparent rhyme violations, that is, rhyming end-of-line words
which according to current pronunciation do not allow the rhyming scheme of the stanza
to succeed, though it did succeed in the uncertain grammar of Early Modern English. For instance, in sonnet 1, we find lines 2 and 4, within the stanza rhyme scheme ABAB, ending with the words die–memory. In this case, the second word, memory, should undergo a phonological transformation and be pronounced "memo'ry" (memoray), ending in a diphthong and sounding like "die" (dye). The linguist David Crystal has discussed and reported on this question in many papers and also on a website—http://originalpronunciation.com/ (accessed on 6 July 2023). He collects and comments on rhyming words whose pronunciation is different from Modern RP English pronunciation, listing more than 130 such cases in the Sonnets. However, in our opinion, what is missing is a rigorous proposal to cope with the problem of rhyme violation, and the list of transformations contains many mistakes when compared with the full transcription of the sonnets published in [31]. The solution is lexical
as we showed in a number of papers [29,30], i.e., variations should be listed in a specific
lexicon of violations and the choice determined by an algorithm. Here below is an excerpt
of the table, where we indicate the number of the sonnet, the line number, the rhyming word pair, their normal phonetic transcription using ARPAbet and, in the last column, the adjustment provided by the lexicon.
Variants are computed by an algorithm that takes as input the rhyming word and its
stressed vowel from the first line in a rhyming pair and compares it with the rhyming word
and vowel of the alternate line. Here, as in the following pages, we will use the phonetic alphabet called ARPAbet, the one used by the phonetic dictionary made available by CMU for computational purposes. The phonetic annotation makes use of American English but includes all vowel phonemes of British English: it has 12 vowels and two semiconsonants. The missing part regards diphthongs: there are eight diphthongs in the chart, but three of them—descending diphthongs—never appear in the CMU dictionary or are treated as a sequence of a semivowel and a stressed vowel—IA (for CLEAR, _ih_), EA (for DOWNSTAIR CAREFUL, eh_), and UA (for ACTUAL, w_ah). In case of failure, the lexicon of Elizabethan variants is searched. The same stressed vowel may undergo a number of different transformations, so it is the lexicon that drives the change, and it is
impossible to establish phonological rules at feature level. Some words may be pronounced
in two manners according to rhyming constraints; thus, it is the rhyming algorithm that will
decide what to do with the lexicon of variants. The lexicon in our case has not been built
manually but automatically, by taking into account all rhyming violations and transcribing
the pair of words at line end on a file. The algorithm searches couples of words in alternate
lines inside the same stanza and in sequence when in the couplet, and whenever the rhyme
is not respected, it writes the pair in output. Take for instance the pair LOVE/PROVE, in
that order in alternate lines within the same stanza: in this case, it is the first word that has
to be pronounced like the second. The order is decided by the lexicon: LOVE is included in
the lexicon with the rule for its transformation; PROVE is not. In some other cases, it is the
second word that is modified by the first one, as in CRY/JOLLITY; again, the criterion for
one vs. the other choice is determined by the lexicon.
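A minimal sketch of such a violation check under Modern English pronunciation is given below; it compares stressed vowels only, whereas a complete check would also compare the coda, and the names are ours, not SPARSAR's:

def stressed_vowel(phones):
    # Return the last phone carrying primary stress (digit "1" in ARPAbet).
    for p in reversed(phones):
        if p.endswith("1"):
            return p
    return None

def rhyme_violations(final_phones, scheme=((0, 2), (1, 3))):
    # final_phones: one phone list per line; scheme: 0-based rhyming line pairs.
    out = []
    for i, j in scheme:
        v1, v2 = stressed_vowel(final_phones[i]), stressed_vowel(final_phones[j])
        if v1 != v2:  # the rhyme fails under Modern English pronunciation
            out.append((i, j, v1, v2))
    return out

# Sonnet 1: "die" vs "memory" fail to rhyme (ay1 vs iy1).
print(rhyme_violations([["d", "ay1"], ["m", "eh2", "m", "er0", "iy1"]], scheme=((0, 1),)))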
In Table 7 below, we list the total number of violations we found, subdividing them by the five phases as we did before, in order to verify whether the conventions dictated by the Early Modern English grammars of the time eventually imposed a standard in the last period, beginning with the XVIIth century. After Total, we indicate the total number of violations found, followed by a slash and the number of sonnets. The ratio gives a weighted number that can be used to compare occurrences across the five phases. As can be
noted, the highest number of violations are to be found in the first two phases. Then, there
is a decrease from Phase II to Phase IV which is eventually followed by a slight increase
in Phase V which, however, is lower than what we found in previous phases. The first
two phases then have numbers well over the average: the decrease in the following phases
testifies to a tendency in Shakespeare’s work to fix pronunciation rules in the sonnets
as more and more grammarians tried to document what constituted the rules for Early
Modern English.
Table 7. Number of rhyme violations per five phases.

            Sonnets Interval   No. Rhyme Violations/No. Sonnets   Ratio %
Phase I          1–17                     22/17                    1.2941
Phase II        18–51                     40/34                    1.1765
Phase III       52–96                     34/45                    0.7556
Phase IV        97–126                    18/30                    0.6
Phase V        127–154                    23/28                    0.8214
Total                                    137/154                   0.8896
We call these (pseudo) rhyming violations because current reciters available on Youtube do not dare use the old pronunciation required, and thus produce a rhyming violation by using Modern English pronunciation. One of these reciters is the famous actor John Gielgud who, when reading Sonnet 66, correctly pronounces DESERT with its original meaning, but then in Sonnet 116 produces three violations where rhyming pairs required transformations that were clearly mandatory in Early Modern English: |love| to be pronounced with the vowel of |remove| in lines 2/4, |come| to be pronounced with the vowel of |doom| in lines 10/12, and |loved| to be pronounced with the vowel of |proved| in the couplet. How do we know that these words should be pronounced in that manner and not in the opposite way—say, |remove| as |love|, |doom| as |come|, and |proved| as |loved|, as is asserted by Ben Crystal, son of David? There are three
criteria that determine the way in which words should rhyme. The first one is the rhyming constraints, which were so stringent at the time owing to the fact that poetry was only recited and not read in books. Okay, then, there are rhyming constraints, but how do they work, and in which direction? The direction is determined by two factors. The first is determined by universal phonological principles, as for instance the one that governs phonological variations of vowel sounds—in the vowel shift of verbs or nouns due to morphological changes—which systematically changed "low" and "mid" features into "high" features and not vice versa [32]. The other factor is simply lexical: i.e., not all words were subject to a transformation in that period. As a result, some words had a double pronunciation. This was extensively documented in books and articles published at the time, written by famous poets like Ben Jonson and a great number of grammarians of the XVI and XVII centuries. All this information is made available by the famous historical phonologist Wilhelm Vietor of the XIX century in a book first published in 1889 (we use the 1909 Vol. 2 edition, which can be freely consulted at: https://books.google.it/books?id=rhEQAwAAQBAJ&printsec=frontcover&hl=it&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false, accessed on 6 July 2023), with the title "A Shakespeare Phonology", which we have adopted as our reference. Variants are then lexically determined. Some words involved in the transformation are listed below using ARPAbet as the phonetic alphabet, in the excerpt taken from the lexicon. As can be easily noticed, variants are related not only to stress position but also to consonant sounds.
Lexicon 1.
shks(despised,d_ih2_s_p_ay1_s_t,ay1,ay1).
shks(dignity,d_ih2_g_n_ah_t_iy1,iy1,ay1).
shks(gravity,g_r_ae2_v_ah_t_iy1,iy1,ay1).
shks(history,hh_ih2_s_t_er_iy1,iy1,ay1).
shks(injuries,ih2_n_jh_er_iy1_z,iy1,iy1).
shks(jealousy,jh_eh2_l_ah_s_iy1,iy1,ay1).
shks(jollity,jh_aa2_l_t_iy1,iy1,ay1).
shks(majesty,m_ae2_jh_ah_s_t_iy1,iy1,ay1).
shks(memory,m_eh2_m_er_iy1,iy1,ay1).
shks(nothing,n_ah1_t_ih_ng,ah1,ow1).
It is now clear that variants need to interact with information coming from the rhyming algorithm, which alone can judge whether the given word—usually at line end, but the word can also be elsewhere—has to undergo the transformation or not. As already explained above, the lexicon was not built manually but automatically, by collecting all the rhyme-violating pairs written out by the algorithm. Thus, the system SPARSAR has a lexicon of possible transformations checked by an algorithm which, whenever a violation is found, searches for the word to be modified and alters its phonetic description. In case both words of the rhyming pair are in the lexicon, the type of variation to be selected is determined by the overall sound map of the sonnet: Shakespeare produced a careful sound harmony in the choice of rhyming pairs, including four or at least three sound classes.
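Below is one possible Python reading of how such entries can drive the repair; the entry for LOVE is an assumption based on the LOVE/PROVE discussion, since the actual SPARSAR lexicon consists of Prolog-style shks/4 facts and is not reproduced in full:

# Each assumed entry: word -> (normal phones, stressed vowel, Elizabethan variant vowel).
SHKS = {
    "memory": (["m", "eh2", "m", "er0", "iy1"], "iy1", "ay1"),
    "love":   (["l", "ah1", "v"], "ah1", "uw1"),  # assumed: LOVE takes the vowel of PROVE
}

def repair_pair(w1, w2):
    # Try the second word first, then the first, as described in the text.
    for word in (w2, w1):
        if word in SHKS:
            phones, old, new = SHKS[word]
            return word, [new if p == old else p for p in phones]
    return None  # no Elizabethan variant available for either word

print(repair_pair("love", "prove"))  # -> ('love', ['l', 'uw1', 'v'])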
Commenting on David Crystal’s Point of View
Since the rhyming scheme is a fundamental issue for establishing sound harmony,
the problem constituted by rhyming violations needs a deeper inspection. David Crystal
makes available on his website the full phonetic transcription of the sonnets. However, as
said above, these transcriptions contain many mistakes. There are two vague explanations Crystal offers to support his transcriptions in his OP (Original Pronunciation), and the first is a tautology: the "pronunciation system has changed since the 16th century"; this is what he calls "a phonological perspective" (ibid.:298). In Section 2, entitled "Phonological rhymes", he writes:
"Far more plausible is to take on board a phonological perspective, recognizing that the reason why rhymes fail to work today is because the pronunciation system has changed
since the 16th century. . . . a novel and illuminating auditory experience, and introduced
audiences to rhymes and puns which modern English totally obscures. The same happens
when the sonnets are rendered in OP. In sonnet 154, the vowel of “warmed” echoes that of
“disarmed”, “remedy” echoes “by”, the final syllable of “perpetual” is stressed and rhymes
with “thrall”, and the vowel of “prove” is short and rhymes with “love”.
And further on (ibid:299):
“Ben Jonson. . . wrote an “English Grammar” in which he gives details about how
letters should be pronounced. How do we know that “prove” rhymed with “love”? This
is what he says about letter “O” in Chapter 4: “It naturally soundeth. . .. In the short time
more flat, and akin to "u", as "cosen", "dosen", "mother", "brother", "love", "prove"."
And in another section, he brings together “love, glove” and “move”. This is not to deny,
of course, that other pronunciations existed at the time. . .. “Love” may actually have had
a long vowel in some regional dialects, as suggested by John Hart (a Devonshire man)
in 1570 (and think of the lengthening we sometimes hear from singers today, who croon
“I lurve you”). But the overriding impression from the orthoepists is that the vowel in
“love” was short. It is an important point, because this word alone affects the reading of
19 sonnets. . ..”
The second one is the need to respect puns (ibid. 298) which work in OP but not
in modern English and, finally, the idiosyncratic spellings in the First Folio and Quarto
and the description of contemporary orthoepists, who often give real detail about how
pronunciations were in those days. There are no phonological rules, not even a uniform
criterion that underlies the variations. The first reason was expressed as follows at the
beginning of the paper: “The pronunciation of certain words has changed between Early
Modern English and today, so that these lines (referring to sonnet 154 lines) would have
rhymed in Shakespeare’s time”. The list of pronunciation variations in the Supplementary
Material of his paper [33] is messy and confusing but what is more important is that it also
contains many mistakes, and we will comment on the first 10 items below.
First of all, the new rhyming transformation of “loved” is not mentioned in the
Supplementary Material where according to Crystal “a complete” list should have appeared
(ibid.:299). But the most disturbing fact is the recital performed by Ben Crystal (his son and
actor in the Globe Theater), which is courageously made publicly available on Youtube
(at https://www.youtube.com/watch?v=gPlpphT7n9s accessed on 6 July 2023). We are
given a reading of Sonnet 116 which is illuminating of the type of OP Crystal is talking
about (see time point 6:12 of total 10:21). The reading in fact does not start there but
further on in the last stanza. The first contradictory assertion is just here, in the first stanza, where the B lines should rhyme and LOVE should be made to rhyme with REMOVE (as is suggested in the Supplementary Material). The problem is that in sonnet 154, the same rhyming pair in the same order, LOVE→REMOVE, is transcribed with the opposite pronunciation. In the same paper, he asserts that "the vowel of PROVE is short and rhymes with LOVE" (ibid.:298), referring to the couplet of Sonnet 154, which we assume should also apply to the B rhyming pair in sonnet 116 and give us not lav/rimav, but rather luv/rimuv. Here, an important additional series of alliterations would be triggered if we adopted this pronunciation, which in fact is the rule all over the Sonnets: TRUE would rhyme with LOVE and REMOVE/R. But also, further on as we will see, LOVE will rhyme with FOOL and DOOM.
On p.296,
Let me not to the marriage of true minds
Admit impediments, love is not love
Which alters when it alteration finds,
Or bends with the remover to remove.
The recital starts at the third stanza, continuing with the couplet.
Love’s not Time’s fool, though rosy lips and cheeks
Within his bending sickle’s compass come;
Love alters not with his brief hours and weeks,
But bears it out even to the edge of doom:
If this be error and upon me proved,
I never writ, nor no man ever loved.
In the Supplementary Material, we find another mistake or contradiction, where Crystal wrongly transcribes "doom" to rhyme with "come" (cam/dam) rather than the opposite (cum/dum), and "loved" to rhyme with "proved" (pravd/lavd), which again should be the opposite (pruvd/luvd). Here, as elsewhere, for instance in Sonnet 55, DOOM rhymes with ROOM in the correct order, ROOM/DOOM, and with the correct sound. Again, consider Crystal's wrongly reporting in the Supplementary Material the rhyming pair LOVE/APPROVE as rhyming in the opposite manner, i.e., LOVE being pronounced as APPROVE, which is just the contrary in the transcription: APPROVE is being pronounced as LOVE, with a short open-mid back sound. In Crystal's words,
“There are 19 instances in the sonnets where “love” is made to rhyme with “prove”,
“move”, and their derived forms. And when we look at the whole sequence, we find a
remarkable 142 rhyme pairs that clash (13% of all lines). Moreover, these are found in
96 sonnets. In sum: only a third of the sonnets rhyme perfectly in modern English. And in 18 instances, it is the final couplet which fails to work, leaving a particularly bad taste in the ear."
This is how he explains the list of the Supplementary Material:
". . . a complete list is given in the Supplementary Material to this paper." The list indicates a rhyming pair where the first element is the one to be transformed because it would otherwise violate the rhyme. For instance, MEMORY = DIE (1) must be interpreted as
follows: pronounce “memory” with the same vowel of “die” in modern RP pronunciation
to be found in sonnet 1.
It is important to note that the first element in most cases appears as the SECOND
rhyming word in the pair, but in some other cases as the first word of the pair. But then,
we find a long list of mistakes if we compare the expected pronunciation encoded in the
Supplementary Material with the complete transcription of the sonnets made available by
David Crystal in a pdf file on the same website, where results are turned upside down. For instance, LOVED = PROVED (116) has been implicitly turned into PROVED = LOVED; that is, the transcription of the stressed vowel of "proved" is the same as that of "loved"
and not the opposite. More mistakes in the list can be found where words like TOMB and DOOM are wrongly listed in the opposite manner. In particular, DOOM is made to rhyme with the vowel of COME and not the opposite; also, TOMB is made to rhyme with COME and DUMB, reverting in both cases the order of the rhyming pair and of the transformation. The phonetic transcription file confirms the mistakes: in the related sonnets, we find the same short mid-front vowel instead of a short U, dumb/tomb, both in sonnets 83
and 101. In all of these cases, the head (the rhyming word of the first line) should be made
to rhyme with the dependent (the rhyming word of the second line) as it happens in Sonnet
1 with MEMORY/DIE and in the great majority of cases. So, two elements must be taken
into account: the order of the two words of the rhyming pair and then the commanding
word, i.e., the word that governs the transformation. In the case of MEMORY/DIE, DIE
is the head or the commanding word of the transformation, and comes first in the stanza,
whereas MEMORY is the dependent word and comes in the second line of the rhyming pair.
We list below only the wrong cases and comment on the type of mistake made: either the order is reverted—the first element of the pair comes first but should be read as second, or the first element is in fact the one deciding the type of vowel to be used—or else the order is correct but the pronunciation chosen is wrong. To comment on the wrong pronunciation required by the rhyme, we sometimes use the pronunciation indicated by Vietor in his book and the phonetic transcription of all the sonnets Crystal made available in his pdf file.
There are more mistakes in the Supplementary Material; here are some of them, with the sonnet numbers in parentheses:

- anon/alone (75): should be alone/anon (Vietor:70); both the order and the governor are wrong. It should be: pronounce ALONE as ANON with a short or long /o/.
- are/care (48): the order should be care/are, but then the mistake is that ARE is transcribed like CARE [kɛ:r].
- are/care (112, 147): the order is correct but the transcription is wrong, as before.
- are/compare (35): the order should be compare/are; the transcription is correct.
- are/prepare (13): the order should be prepare/are; the transcription is wrong: ARE is pronounced like PREPARE [pɛ:r].
- are/rare (52): order correct, and in the transcription ARE is like RARE [rɛ:r]—but it should be the opposite: RARE should sound like ARE, rare/are, even though the line with RARE comes first.
- beloved/removed (25): order correct, but the transcription is wrong: REMOVED is transcribed with the vowel of BELOVED.
- brood/blood (19): the order should be blood/brood; the transcription is also wrong: BROOD is transcribed like BLOOD (see Vietor:87, whilst [u] in blood, flood, good, wood seems to be the usual Elizabethan sound).
- dear/there (110): correct order, but the pronunciation of DEAR is wrongly transcribed as [di:r] while that of THERE is [ðɛ:r].
- doom/come (107, 116, 145): correct order, but the pronunciation should be governed by DOOM, a short or long [u] (Vietor:86); the transcription of DOOM is instead with the vowel of COME.
We solved the problem by creating a lexicon of phonetic transformations and an algorithm that looks at first for a match in the rhyming word pair positioned in alternate lines if in a stanza, and in sequence if in the couplet. In case there is no match, the algorithm looks up the second word in the lexicon, then the first word, and chooses the one that is present. In case both are present in the lexicon, the decision is taken according to the position of the rhyming pair in the sonnet with respect to previous rhymes.
3.2.3. Rhyming Constraints and Rhyme Repetition Rate
If, on the one side, we have apparent rhyme violations that use the EME pronunciation to suit the rhyme scheme of the sonnet, on the other side, the Sonnets show a high "Repetition Rate" as computed on the basis of rhyming words alone. Due to the requirements imposed by the Elizabethan sonnet rhyme scheme, violations are very frequent, but they are not sufficient to provide the poet with the needed quantity of rhyming words. For this reason,
it can be surmised that Shakespeare was obliged to use a noticeable amount of identical
rhyming word pairs. The level of rhyming repetition is in fact fairly high in the sonnets,
if compared with other poets of the same period, as can be gathered from the tables
below. This topic has not gone unnoticed, as for instance [34], which indicates repetition
of rhyming words as occurring in a limited number of consecutive adjacent sonnets, but
does not give an overall picture of the phenomenon. In fact, as will be clear from the data
reported below, the level of rhyming repetition is fairly high and reaches 65% of all rhyming
pairs. In [34], we also find an attempt at listing all sonnets violating rhyme schemes which
according to him amount to 25. However, as can be easily noticed in the list reported in
the Supplementary Material, the number of sonnets violating the rhyme scheme is much
higher than that.
To enumerate rhyming repetitions, we collected all end-of-line words with their
phonetic transcription and joined them in alternate or sequential order as required by the
sonnet rhyme scheme 1–3, 2–4, 5–7, 6–8, 9–11, 10–12, 13–14—apart from sonnet 126 with
only 12 lines and a scheme in couplets aabbccddeeff, and sonnet 99 with 15 lines. Seven rhyming pairs per sonnet give a total of 1078: i.e., 154 sonnets multiplied by 14 lines equals 2156 (less one, 2155), divided by two. In the tables reported as Supplementary Material—the tables related to the Rhyming Pair Repetition Rate have only been presented at the conference in Torino [30] and have not been published elsewhere—we only consider at first pairs with a frequency of occurrence higher than 4, and we group together singular and plural of the same
noun, and third person present indicative, d/n past with base form for verbs. We list pairs
considering first occurrence as the “head” and following line as the “dependent”. Rhyme
may be sometimes determined by rules for rhyme violations as is the case with “eye”. We
include under the same heading all morphologically viable word forms as long as word
stress is preserved in the same location, as said above, including derivations. We decided
to separate highly frequent rhyming heads in order to verify whether less frequent ones
really matter in the sense of modifying the overall sound image of the sonnets. For that
purpose, we produce a first sound map below, limited to higher frequency rhyming pairs
and only in a separate count do we consider less frequent ones, i.e., hapax, dislegomena and trislegomena.
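A minimal sketch of this enumeration, assuming regular 14-line sonnets (sonnets 99 and 126 would need special-casing, and the morphological grouping described above is omitted):

from collections import Counter

# 1-based line pairings of the Elizabethan rhyme scheme quoted above.
SCHEME = [(1, 3), (2, 4), (5, 7), (6, 8), (9, 11), (10, 12), (13, 14)]

def rhyme_pairs(final_words):
    # final_words: dict {line_no: line-final word} for one regular sonnet.
    # frozenset merges inverted orders such as thee/me and me/thee.
    return [frozenset((final_words[i], final_words[j])) for i, j in SCHEME]

def repetition_counts(sonnets):
    counts = Counter()
    for sonnet in sonnets:
        counts.update(rhyme_pairs(sonnet))
    return counts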
In many cases, the same pair is repeated in inverted order as for instance “thee/me”
and “me/thee”, “heart/part” and “part/heart”, “love/prove” and “prove/love” but also
“love/move” and “love/remove” and “approve/love” and “love/approve”, “moan/gone”
and “foregone/moan”, “alone/gone” and “gone/alone”, “counterfeit/set” and “unset/
counterfeit”, “worth/forth” and “forth/worth”, “elsewhere/near” and “near/there”, etc.
“Thee” is made to rhyme with “me”, but also with “melancholy”, “posterity”, “see”.
“Eye/s” are made to rhyme with almost identical monosyllabic sounding words like “die”,
“lie”, “cries”, “lies”, “spies”; but also with “alchemy”, “gravity”, “history”, “majesty”,
and “remedy”, which require the conversion of the last syllable into a diphthong /ay/
preceded by the current consonant. Most of the rhyming pairs evoke a semantic or symbolic
relation which is asserted or suggested by the context in the surrounding lines of the stanza
that contain them. Just consider the pairs listed above where relations are almost explicit.
However, as remarked by [34], rhyme repetition inside the same sonnet may have a different
goal: linking lines at the beginning of the sonnet to lines at the end as is the case with
sonnet 134 and the rhyme pair "free/me", which reappears in the couplet in reversed order.
Similar results are suggested by repetition of rhyme pair “heart/part” in sonnet 46.
In Table 8, we performed the same count for two other famous poets writing poetry in the same century, Sir Philip Sydney and Edmund Spenser. We wanted to verify whether the high level of rhyming-pair repetition might also apply to other poets writing love sonnets. The results, reported in Table 8, show some remarkable differences in the degree of repetitiveness: repeated rhyming pairs are compared to unique or hapax rhyming pairs in the three Elizabethan poets, and the percentages reported are a ratio over all occurrences of rhyming pairs. In the first column, types are considered, and Sydney outstrips Shakespeare and Spenser. When we come to the Token repetition rate—i.e., counting all occurrences of each type and summing them up—we still have the same picture. Eventually, unique or unrepeated rhyming pairs are higher in Spenser than in Shakespeare and Sydney.
Table 8. Rhyme repetition rates in three Elizabethan poets.

Author        RhymePair Repeat Types   RhymePair Repeat Tokens   Hapax or Unique RhymePairs
Shakespeare          18.02%                   65.21%                      34.79%
Spenser              17.84%                   47.45%                      53.55%
Sydney               22.37%                   72.08%                      27.02%
Table 9. Rhyme repetition word class-frequency distribution for Shakespeare's sonnets.

X    Typ FX   Tok (FX·X)   Incremental Sum   Incremental %
28      1         28              28               2.72
17      1         17              45               4.37
14      2         28              73               7.09
12      2         24              97               9.43
10      1         10             107              10.4
 9      5         45             152              14.77
 8      3         24             176              17.1
 7      1          7             183              17.78
 6      6         36             219              21.28
 5     10         50             269              26.14
 4     29        116             385              37.41
 3     37        111             496              48.2
 2     87        174             670              65.11
 1    359        359            1029             100.0
Now, let us consider the distribution of rhyming words in the corpus of the sonnets. As to general frequency data, the Sonnets contain 18,283 tokens and 3085 types; the so-called Vocabulary Richness, used to measure the ability of a writer to use different words in a corpus, thus corresponds to 16.87%, a high value for that time when compared with other poets. Also, the numbers of Hapax and Rare Words (the latter indicating the union of Hapax, Dis and Tris Legomena) correspond to average values for other poets: 56% for the first type and 79% for the second. If we look at similar data for
rhyming words, we see that Rare Words cover more than 65% of the total, as can be gathered from Table 10 below:
Table 10. Quantitative data for six appraisal classes for sonnets with highest contrast.

           Appr.Pos   Appr.Neg   Affct.Pos   Affct.Neg   Judgm.Pos   Judgm.Neg
Sum           56         25         53          77          32         122
Mean         2.533      1.133      2.4         3.466       1.444       5.466
St.Dev.      8.199      3.691      7.732      11.202       4.721      17.611
We report, for each word frequency in column 1 of Table 9—there is only one head word (thee) with frequency 28—the corresponding number of types, followed by the number of tokens, the incremental sum, and the corresponding percentage with respect to the total corpus. As can be noticed from the last column, where the incremental percentage of rhyme-pair corpus coverage is reported, the total of rare words, i.e., rhyme-pair types with a frequency of occurrence lower than 4, is 62.59%, a fairly low value if compared to the measure evaluated on simple type/token ratios. If we look at the most important English poets, as documented in a previous paper, we can see that the average value for Rare Words is 77.88%. However, we are here dealing with rhyming words, and the comparison may not be so relevant.
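The bookkeeping behind Table 9 can be sketched as follows (function and variable names are ours): given a frequency count per rhyming head, derive the number of types, the tokens, and the incremental coverage per frequency band.

from collections import Counter

def frequency_spectrum(head_counts):
    # head_counts: Counter mapping each rhyming head to its frequency.
    spectrum = Counter(head_counts.values())  # X -> number of types with frequency X
    total = sum(head_counts.values())
    rows, running = [], 0
    for x in sorted(spectrum, reverse=True):  # most frequent first, as in Table 9
        tokens = x * spectrum[x]
        running += tokens
        rows.append((x, spectrum[x], tokens, running, round(100 * running / total, 2)))
    return rows

# With one head ("thee") at frequency 28 and a 1029-token corpus,
# the first row comes out as (28, 1, 28, 28, 2.72), as in Table 9.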
3.2.4. The Sound–Sense Harmony Visualized in Charts
As will appear clearly from the charts below, all the data show a contrasting behaviour, which will be attested by correlation values: where sentiment values increase, the corresponding values for vowels and consonants decrease. To allow better perusal of the trends, we split the sonnets into separate tables according to whether their sentiment values are positive or negative. The first chart contains the eleven sonnets which received the highest positive sentiment values. All the charts are drawn from the tables of data derived from the analysis files in xml format, which will be made available as supplementary data (please see Figure 2).
Figure 2. The eleven most positively marked sonnets: 7, 24, 43, 47, 52, 76, 85, 87, 128, 136, 154.
As can be easily noticed, all sound data seem to agree, showing a trend which is very
close for the three variables. On the contrary, the sentiment variable has strong peaks and
its values are set apart from the sound values. However, the interval of variability for sound
variables does remain below or close to 1, thus indicating an opposite trend. In particular, consonants are all below 1; vowels oscillate in three cases (52, 128, and 136) and voiced in two cases (52 and 85), in the latter case still below 1 but very close to it (93%) in favour of unvoiced.
We interpret consistently contrasting values as a way to convey ironic, sarcastic and
sometimes parodistic meaning. More on this interpretation below. Sonnet 136 is the one
that is highly ambiguous and consequently ironic, celebrating the “Will” or simply “will”.
Sonnet 128 is all devoted to music and playing with a wooden instrument which is the
target of the ironic vein and the double meaning of words like “tickle”. Finally, sonnet 52 is
the celebration of the beloved as a “chest” where the rich keep their treasure, and which
must be enjoyed “seldom”. Sonnet 85 is a celebration of silent thought, and for this theme,
it is filled with consonants which are continuants |h,f,th| and are unvoiced, but many
words are marked by a sonorant syllable, thus voiced.
We now separate 16 sonnets which have sentiment equal to 1 or slightly lower than 1
but always higher than 92% in favour of positively marked. They are the following: 22, 33,
51, 60, 64, 73, 94, 97, 101, 102, 109, 118, 123, 131, 141, and 150 (please see Figure 3).
Figure 3. Chart of the 16 borderline sonnets positively marked for sentiment.
In this chart, we added the ratio for Abstract/Concrete, which shows a peak for sonnet 73. As the chart clearly shows, the line for Sentiment borders 1; as to the remaining variables, Vowels is the one oscillating most after Abstract. Voiced and Consonants are almost always aligned, apart from sonnets 33 and 102. In both sonnets, the number of "Obstruents" (|b,d,p,t,k,g|) is very low and real consonants are substituted by "Continuants" (|s,sh,th,f,v,h|), both voiced and unvoiced. For this reason, in the following analysis I will only consider Voicing as the relevant variable for consonants, and this will show better agreement in the overall data. Now, we show charts for all negatively marked sonnets using only three variables, starting from Figure 4 below.
Figure 4. Chart of the 42 negatively marked sonnets: 3, 8, 9, 19, 28, 30, 34, 35, 50, 55, 57, 58, 60, 62, 63, 65,
66, 71, 86, 89, 92, 103, 107, 112, 116, 120, 121, 124, 126, 127, 129, 132, 133, 134, 138, 139, 140, 143, 146, 148, 149.
As can be easily seen, the Sentiment variable is always below 1, but the two remaining variables oscillate up and down, the vowel one oscillating most in the upper portion of the chart and the voiced one in the lower portion. In this case, the contrast is even stronger, and correlations show a negative trend between Vowels and Sentiment: the one decreases while the other increases, apart from a few exceptions (sonnets 30, 35 and 127), which have almost identical values for the three variables. The other correlation, between Voicing and Sentiment, is positive but very weak: 0.1769. Correlation between Vowels and Sentiment is positive but very weak; correlation between the Voicing parameter and Sentiment is again negative and very weak, at −0.0065037. Thus, results for the 42 sonnets negatively marked by sentiment show that we have a negative correlation between vowels and voicing, and between vowels and sentiment, but a positive correlation between voicing and sentiment. So, it is just the opposite of what we obtain with positively marked sonnets. And finally, in Figure 5, we show that the eleven most positively marked sonnets display the same contrasting results.
Figure 5. The eleven most positively marked sonnets show the same slightly positive correlation for Vowels–Voicing, but a very strong negative correlation between Vowels–Sentiment and a slightly negative one for Voicing–Sentiment, at −0.11423482—colours in this case have no meaning.
As to the remaining 85 sonnets positively marked for sentiment, they all have very
weak but positive correlations between sound and sense, i.e., below 0.1, respectively, 0.0387
for vowels, 0.05 for consonants, and 0.091 for voicing. The conclusion we may draw is that
the sound–sense harmony in Shakespeare’s sonnets is represented by a weak extended
harmony for those positively marked for sentiment but a strong disharmony for those
sonnets negatively marked for sentiment: in particular, in all the sonnets we have an inverse correlation between the two most important variables, Voicing (whether a consonant is a real Obstruent or not) and Sentiment. As said above, voicing includes real obstruents and unvoiced continuants: |p,t,k,s,sh,f,th|. When the pair Voicing/Sentiment assumes a positive correlation value, the other pair, Vowel/Sentiment, shows the opposite and is negative. Then, we saw the exceptions: in those sonnets which are most positively marked for sentiment, the correlations between Vowel and Sentiment are positive, but the correlation between Voicing and Sentiment is negative. Sonnets negatively marked for sentiment have a positive correlation between Voicing and Sentiment, but a negative correlation between Vowel and Sentiment. In other words, the behaviour is just reversed: when meaning is positively marked, the sound harmony verges towards a negative feeling; on the contrary, when the meaning is negatively marked, the sound bends towards a positive harmony. I assume that what Shakespeare intended to produce in this way was a cognitive picture of ironic poetic creation.
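The correlation values quoted in this section are plain Pearson coefficients over per-sonnet values; a minimal sketch, assuming two equal-length, non-constant value lists, could be:

from statistics import mean

def pearson(xs, ys):
    # Pearson r between two per-sonnet value series of equal length.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# e.g., r = pearson(vowel_ratios, sentiment_ratios) over the 42 negatively
# marked sonnets; a negative r signals the disharmony discussed above.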
3.2.5. From Sentiment to Deep Semantic and Pragmatic Analysis with ATF
The final part of the analysis takes us deep into the hidden meaning that the sonnets
communicate, i.e., irony. To carry that out, we need to substitute sentiment analysis with
a much more semantically consistent framework that could allow us to enter the more
complex system of relational meanings that are governed by pragmatics. In this case,
neither a word-by-word nor a propositional-level analysis would be sufficient. We
need to capture sequences of words which may have a non-literal meaning and associate
appropriate labels: this is what the Appraisal Theory Framework can be useful for.
We have devised a sequence of steps in order to confirm experimentally our intuitions.
The preliminary results obtained using sentiment analysis cannot be regarded as fully
satisfactory for the simple reason that both the lexical and the semantic approach based
on predicate-argument structures are unable to cope with the use of non-literal language.
Poetic language is not only ambiguous but it contains metaphors which require abandoning
the usual compositional operations for a more complex restructuring sequence of steps.
This has been carefully taken into account when annotating the sonnets by means of
Appraisal Theory Framework (henceforth ATF). In our approach, we have followed the
so-called incongruity presumption or incongruity-resolution presumption. Theories connected to the incongruity presumption are mostly cognitive-based and related to concepts
highlighted, for instance, in [35]. The focus of theorization under this presumption is that
in humorous texts, or broadly speaking in any humorous situation, there is an opposition
between two alternative dimensions. As a result, we have been looking for contrast in our
study of the sonnets, produced by the contents of the manual classification. Thus, we have used the Appraisal Theory Framework [36]—which can be regarded as the most scientifically viable linguistic theory for this task, as has already been done in the past by other authors (see [12,37] but also [38])—showing its usefulness for detecting irony, considering its ambiguity and its elusive traits.
Thus, we proceeded like this: we produced a gold standard containing strong hints in its classification in terms of humour, by collecting the most important literary critics' reviews of the 154 sonnets (the gold standard will be made available as Supplementary Material). To show how the classification has been organized, we report here below two examples:
•	SONNET 8: SEQUENCE: 1–17 Procreation MAIN THEME: One against many ACTION: Young man urged to reproduce METAPHOR: Through progeny the young man will not be alone NEG.EVAL: The young man seems to be disinterested POS.EVAL: Young man positive aesthetic evaluation CONTRAST: Between one and many
•	SONNET 21: SEQUENCE: 18–86 Time and Immortality MAIN THEME: Love ACTION: The young man must understand the sincerity of the poet's love METAPHOR: True love is sincere NEG.EVAL: The young man listens to the false praise made by others POS.EVAL: Young man positive aesthetic evaluation CONTRAST: Between true and fictitious love.
As can be seen, the classification is organized using seven different linguistic components: we indicate SEQUENCE for the thematic sequence into which the sonnet is
included; this is followed by MAIN THEME which is the theme the sonnet deals with;
ACTION reports the possible action proposed by the poet to the protagonist of the poem;
METAPHOR is the main metaphor introduced in the poem sometimes using words from a
specialized domain; NEG.EVAL and POS.EVAL stand for Negative Evaluation and Positive
Evaluation contained in the poem in relation to the theme and the protagonist(s); finally,
CONTRAST is the key to signal presence of opposing concrete or abstract concepts used by
Shakespeare to reinforce the arguments purported in the poem. Not all the sonnets were
amenable to a pragmatic/linguistic classification. We ended up with 98 sonnets classified
over 154, corresponding to a percentage of 63.64%, the rest have been classified as Blank.
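For bookkeeping purposes, the seven-component records of this gold standard can be represented as a simple data structure; the following is a minimal sketch, with class and field names of our own choosing:

from dataclasses import dataclass

@dataclass
class GoldEntry:
    sonnet: int
    sequence: str    # thematic sequence, e.g., "1-17 Procreation"
    main_theme: str
    action: str
    metaphor: str
    neg_eval: str
    pos_eval: str
    contrast: str    # opposing concepts reinforcing the argument

sonnet8 = GoldEntry(8, "1-17 Procreation", "One against many",
                    "Young man urged to reproduce",
                    "Through progeny the young man will not be alone",
                    "The young man seems to be disinterested",
                    "Young man positive aesthetic evaluation",
                    "Between one and many")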
Many sonnets have received more than one possible pragmatic category. This is due to
the difficulty in choosing one category over another. In particular, it has been particularly
hard to distinguish irony from satire, and irony from sarcasm. Overall, we ended up with
54 sonnets receiving a double marking over 98. This was also one of the reasons to use ATF:
often literary critics were simply hinting at “irony” or “satire”, but the annotation gave us a
precise measure of the level of contrast present in each of the sonnets regarded generically
as “ironic”.
The annotation has been organized around only one category, Attitude, and its direct
subcategories, in order to keep the annotation at a more workable level, and to optimize
time and space in the XML annotation. Attitude includes different options for expressing
positive or negative evaluation, and expresses the author’s feelings. The main category is
divided into three primary fields with their relative positive or negative polarity, namely:
•	Affect is every emotional evaluation of things, processes or states of affairs (e.g., like/dislike); it describes proper feelings and any emotional reaction within the text aimed towards human behaviour/processes and phenomena.
•	Judgement is any kind of ethical evaluation of human behaviour (e.g., good/bad), and considers the ethical evaluation of people and their behaviours.
•	Appreciation is every aesthetic or functional evaluation of things, processes and states of affairs (e.g., beautiful/ugly; useful/useless), and represents any aesthetic evaluation of things, both man-made and natural phenomena.
Eventually, we ended up with six different classes: Affect Positive, Affect Negative,
Judgement Positive, Judgement Negative, Appreciation Positive, and Appreciation Negative. Overall, in the annotation, there is a total majority of positive polarities with a ratio of 0.511,
in comparison to negative annotations with a ratio of 0.488. In short, the whole of the
positive poles is 607, and the totality of the negative poles is 579 for a total number of
1186 annotations. Judgement is the more interesting category because it allows social moral
sanction, which is then split into two subfields, Social Esteem and Social Sanction—which,
however, we decided not to mark. In particular, whereas the positive polarity annotation
of Judgement extends to Admiration and Praise, the negative polarity annotation deals with
Criticism and Condemnation or Social Esteem and Social Sanction (see [38], p. 52). Here below
is the list of 77 sonnets manually classified with ATF over 98 matching critics’ evaluation.
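The polarity bookkeeping just described can be sketched as follows (the label format and function name are illustrative, not the paper's actual pipeline):

from collections import Counter

def polarity_ratios(labels):
    # labels: iterable like "Judgement.Negative"; returns (positive, negative) ratios.
    polarity = Counter(label.split(".")[1] for label in labels)
    total = polarity["Positive"] + polarity["Negative"]
    return polarity["Positive"] / total, polarity["Negative"] / total

# With the counts reported above, 607 positive and 579 negative annotations
# over 1186 give ratios of roughly 0.511 and 0.488.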
As a first result, we may notice a very high convergence existing between critics’
opinions and the output of manual annotation by Appraisal classes: 77 over 98 corresponds
to a percentage of 78%. As to the sonnets’ structure, Judgement is found mainly in the final
couplet of the sonnets (for more details, see [3]). As to interpretation criteria, we assumed
that the sonnets with the highest contrast could belong to the category of Sarcasm. The
reason for this is that a high level of Negative Judgements accompanied by Positive Appreciations or Affect is by itself interpretable as the intention to provoke a sarcastic mood. As a final result, there are 44 sonnets that present the highest contrast and are specifically classified according to the six classes above. There is also a group of ambiguous sonnets which have been classified with a double class, mainly
by Irony and Sarcasm. As a first remark, in all these sonnets, negative polarity is higher
than positive polarity with the exception of sonnet 106. In other words, if we consider this
annotation as the one containing the highest levels of Judgement, we come to the conclusion
that a possible Sarcasm reading is mostly associated with presence of Judgement Negative
and in general with high Negative polarity annotations. In Figure 6 below, we show the
44 sonnets classified with Sarcasm.
Figure 6. The 44 sonnets classified with Sarcasm with the highest level of Judgements—colours in
this case have no meaning.
We associated different colours to make the subdivision into the six classes visually clear. It is possible to note the high number of Judgements, both Negative (in orange) and Positive (in pale blue): in case Judgement Positive is missing, it is substituted by Affect Positive (pale green) or by Appreciation Positive (blue). This applies to all 44 sonnets apart from sonnets 120 and 121, where Judgement Negative is associated with Affect Negative and with Appreciation Negative. In other words, if we consider this annotation as the one containing the highest levels of Judgement, we come to the conclusion that a possible Sarcasm reading is mostly associated with the presence of Judgement Negative and, in general, with high Negative polarity annotations. As a first result, we may notice the very high correlation existing between critics' opinions, as classified by us with the label highest contrast, and the output of manual annotation by Appraisal classes.
In Figure 7 we show the group of 50 sonnets classified, mainly or exclusively, with
Irony and check their compliance with Appraisal classes.
As can be easily noticed, the presence of Judgement Negative is much lower than in the previous diagram for Sarcasm: in fact, only half of them—25—have annotations for that class; the remaining half introduce two other negative classes, mainly Affect Negative, but also Appreciation Negative. As to the main Positive class, we can see that it is no longer Judgement Positive, but Affect Positive, which is present in 33 sonnets (please see Table 11).
Table 11. Quantitative data for six appraisal classes for sonnets with lowest contrast.

           Appr.Neg   Appr.Pos   Affct.Pos   Affct.Neg   Judgm.Pos   Judgm.Neg
Sum          139         65         64          81          59          37
Mean        5.346       2.5        2.461       3.115       2.269       1.423
St.Dev.    18.82        8.843      8.707      11.009       8.029       5.047
Figure 7. The 50 sonnets classified with Irony, with a lower level of Judgement Negative but higher Affect Negative.
In other words, we can now consider that Sarcasm is characterized by a majority of negative evaluations (146/224), while Irony is characterized by a majority of Positive evaluations (262/183), and that the values are sparse and unequally distributed.
The final figure, Figure 8, concerns the sonnets with blank or neutral evaluation by critics, which amount to 60. As a rule, this group of sonnets should look different from the two groups we already analysed.
Figure 8. The 60 Sonnets classified by critics as neutral.
As expected, this figure looks fairly different from the previous two. The prevailing colour is pale blue, i.e., Judgement Positive; orange, i.e., Appraisal Negative, is only occasionally present; and green is perhaps the second most prominent colour, i.e., Affect Positive. In order to
know how much the difference is, we can judge it from the quantities shown in Table 12
below.
Table 12. Quantitative data for six appraisal classes for sonnets with no contrast.

           Appr.Pos   Appr.Neg   Affct.Pos   Affct.Neg   Judgm.Pos   Judgm.Neg
Sum           88         59         89         109          49           8
Mean         3.034      2.034      3.068       3.758       1.689       0.275
St.Dev.      1.268      7.638     11.482      14.052       6.368       1.079
3.2.6. Matching ATF Classes with the Algorithm for Sound–Sense Harmony (ASSH)
The experiment with ATF classes matching critics' evaluation has been fairly successful, but how do these classes align with the Sound–Sense Harmony? In order to check this, we transferred the data related to vowels and consonants and matched them with the ratios of the three main ATF categories: Appreciation Positive/Negative, Affect Positive/Negative, and Judgement Positive/Negative. As in the previous computation, all data below 1 will be interpreted as a case of superior Negative Polarity, and the opposite when data are above 1. To allow a better view of the overall data, we split them into the sonnets with no contrast, shown in Figure 9, and the sonnets with contrast, shown in Figure 10. This time, however, we used our classification and abandoned the critics' one.
Figure 9. Distribution of 89 sonnets manually classified by ATF with no contrast.
The data in Figure 9 show the distribution of the Sound–Sense variable for the three parameters; we did not plot the variables for vowels and voicing, which are, however, present in the same table and allow us to evaluate the correlation between ATF and sound, which, as can be seen below, is negative for both Judgement and Affect:
1. Correlation between Vowels and Judgement: −0.1254;
2. Correlation between Voicing and Judgement: −0.1468;
3. Correlation between Vowels and Affect: −0.08859;
4. Correlation between Voicing and Affect: −0.01346;
5. Correlation between Judgement and Affect: −0.1376;
6. Correlation between Affect and Appraisal: −0.0351.
Correlations of sound data with Appraisal are, on the contrary, both positive. If we now consider the remaining 65 sonnets, which have been classified by ATF with contrast, we obtain a different picture. In this case, we have separated each class and projected it with the sound data, Vowels and Voicing, in the following three diagrams.
Figure 10. Distribution of 65 sonnets classified as Judgements with contrast and their sound data.
All correlation measures with Judgements are negative:
-	Correlation between Vowels and Judgements: −0.0594;
-	Correlation between Voicing and Judgements: −0.0677;
-	Correlation between Judgement and Affect: −0.0439;
-	Correlation between Judgement and Appraisal: −0.0522.
In Figure 11 below, we again use the sound data together with the second parameter, Affect:
Figure 11. Distribution of 65 sonnets classified by ATF as Affect with contrast and their sound data.
Correlation data for Affect are only partly negative:
-	Correlation between Vowels and Affect: 0.09;
-	Correlation between Voicing and Affect: −0.1435;
-	Correlation between Affect and Appraisal: 0.2594.
Finally in Figure 12 we project sound data with Appraisal parameters:
65
ff
ff
Information 2023, 14, 576
ff
ff
ff
−
Figure 12. Distribution of 65 sonnets classified by ATF as Appraisal with contrast and their sound
data.
Finally, the correlations for Appraisal are also both negative:
- Correlation between Vowels and Appraisal: −0.2068;
- Correlation between Voicing and Appraisal: −0.0103.
Now the only positive correlations are those shown by Affect with Vowels and with Appraisal; the remaining correlations are all negative. The subdivision now operated using our manual classification with ATF seems more consistent than the one made before using the critics' evaluation. As a first comment, these data confirm our previous evaluation made on the basis of sentiment analysis, i.e., the sonnets are mainly disharmonic due to Shakespeare's intention to produce ironic effects on the audience. Here below is the list of the 89 sonnets classified by our manual ATF labeling as having no contrast:
Comparing the "contrast" criterion with the sentiment-based classification is not possible; however, the "contrast" group of sonnets is mostly contained within the "negatively" marked sonnets, with the exception of the following 16 sonnets:
What these sonnets have in common is an identical number of Appraisal Positive and Negative features (28), a high number of Affect Negative features (38) vs. Positive ones (11), and the relatively lowest number of Judgement Negative features (10) vs. Positive ones (18). In other words, by decomposing Negative Polarity items into three classes, we managed to show the weakness of sentiment analysis, where Negativity is a cover-all class. Overall, the collection contains a majority of positively or neutrally evaluated sonnets in both sentiment and appraisal analysis, and a minority of negatively evaluated sonnets; the SSH is, however, mostly a disharmony.
3.3. Sound and Harmony in the Poetry of Francis Webb
In this section, I will present the results obtained from the analysis of the poetry of Francis Webb, who is regarded by many critics as among the best English poets of the last century—and, unlike Shakespeare, he never adopts ironic attitudes. All the poems I will be using are taken from the Collected Poems edited by Toby Davidson [39].
I will introduce a type of graphical map highlighting differences using colours associated with sound and sense (see [11]). The representation of the proposed harmony between sense and sound will be cast onto the graphical space as follows:
- Class A: negatively harmonic poems, mainly negatively marked poems, on the left. Either the sounds or the sentiment are in the majority negative, or both the sounds and the sentiment are negative.
- Class C: positively harmonic poems, mainly positively marked poems, on the right. Either the sounds or the sentiment are in the majority positive, or both the sounds and the sentiment are positive.
- Class B: disharmonic poems in the middle. The sounds and the sentiment have opposite values, and either one or the other has values below a given threshold.
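As a rough illustration of this three-way layout, here is a minimal sketch in R, assuming each poem has already been reduced to two overall polarity scores, one for sound and one for sentiment (the scores and labels are illustrative; SPARSAR's actual computation is more articulated):

```r
# Place a poem in one of the three zones of the harmony map.
classify_harmony <- function(sound, sense) {
  if (sound < 0 && sense < 0) "A: negatively harmonic (left)"
  else if (sound > 0 && sense > 0) "C: positively harmonic (right)"
  else "B: disharmonic (middle)"   # sound and sense carry opposite values
}

classify_harmony(-0.4, -0.2)   # "A: negatively harmonic (left)"
classify_harmony( 0.3, -0.5)   # "B: disharmonic (middle)"
```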
In addition to the evaluation of positive/negative values, we consider the two parameters we already computed, related to Metrical Length and Rhyming Scheme, which we add together and use as a 10% weighting to account for poetically relevant features. On the basis of the poetic devices analyzed by SPARSAR, a list of 14 poems is considered deviant; they are the following: A Sunrise, The Gunner, The Explorer's Wife, For My Grandfather, Idyll, Middle Harbour, Politician, To a Poet, The Captain of the Oberon, Palace of Dreams, The Room, Vancouver by Rail, Henry Lawson, and Achilles and the Woman. In Figure 13 we show the first map of sense–sound evaluation, where the split of the "deviant" poems appears clearly:
Figure 13. Poems considered as deviants evaluated for their degree of sense/sound harmony.
The poem that best represents balanced positive values is Five Days Old, as may be deduced from the presence of the largest box positioned on the right-hand side. Overall, the figure shows which poems achieved harmonic values, positioning positive ones on the right and negative ones on the left, with the disharmonic ones in the middle. As is clearly apparent, "Five Days Old", "Politician" and "Vancouver by Rail" are the three poems computed as endowed with positive harmony, while the remaining poems are either characterized as strongly negative—"To a Poet", "Palace of Dreams" and "The Room"—or just negative: "The Captain of the Oberon", "The Explorer's Wife", "For My Grandfather", "Idyll", and "Henry Lawson". Finally, the last three poems, positioned in the centre left, are disharmonic: "The Gunner", "Middle Harbour", and "A Sunrise", where disharmonic means that the parameters of sound are in opposition to those of sense. Slight variations in position are determined by the contribution of the parameters computed from poetic devices, as said above. Disharmony, as will be discussed further on, might be regarded as a choice by the poet with the intended aim of reconciling the opposites in the poem.
The choice of these 14 poems includes poetry written at the beginning of his career, i.e., included in the Early Poems—A Sunrise, Palace of Dreams, To a Poet, Idyll, Middle Harbour, and Vancouver by Rail; two poems from A Drum for Ben Boyd—Politician, The Captain of the Oberon; five poems from Leichhardt in Theatre—The Room, The Explorer's Wife, For My Grandfather, The Gunner, Henry Lawson; and finally, one poem from Birthday, Achilles and the Woman, and one poem from Socrates, Five Days Old. In what follows, I will at first show small groups of poems taken from different periods of Webb's poetic production and discuss them separately, rather than conflating them all in a single image. At the end of this section, I will show a bigger picture where I analysed 87 poems together, resulting in two big figures. Now, I will turn to the second experiment, where I collected and analyzed the following poems:
Early Poems—Idyll, The Mountains, Vancouver by Rail, A Tip for Saturday, This Runner
Leichhardt in Theatre—Melville at Woods Hole, For Ethel, On First Hearing a Cuckoo
Poems 1950–52—The Runner, Nuriootpa
Birthday—Ball’s Head Again, The Song of a New Australian
Socrates—The Yellowhammer
The Ghost of the Cock—Ward Two and the Kookaburra
Unfinished Works—Episode, Untitled
In Figure 14 I show their distribution in the three separate rows:
Figure 14. Sixteen poems from different periods of Webb’s poetic production computed for their
Sense/Sound Harmony.
Here again, it is important to notice that the majority of the poems are positioned on the left-hand side, and are thus analyzed as possessing negative harmony, with only three poems on the right-hand side, one of which is the unfinished "Untitled". Then, in the middle, there is a small number of disharmonic poems, or, we could say, poems in which conflicting forces contribute to the overall meaning intended by the poem. Also take into account the dimension of the box, which signals the major or minor contribution of the overall parameters computed, as discussed in the previous section, from all the linguistic and poetic features contained in the poem, measured on the basis of their minor or major dispersion using the standard deviation. In the following group, I added more poems from later work, which were computed as mainly positive:
Birthday—Hopkins and Foster’s Dam
Socrates—A Death at Winson Green, Eyre All Alone, Bells of St Peter Mancroft
The Ghost of the Cock—Around Costessey, Nessun Dorma
Late Poems 1969–73—Lament for St Maria Goretti, St Therese and the Child
As shown before, the poems in Figure 15 are also positioned in three separate rows according to their overall sentiment:
Figure 15. Sixteen poems taken mainly from late poetic production computed for their sense/sound
harmony.
In Figure 16, I now show a bigger picture containing 50 poems, where we can again see the great majority of them positioned on the left-hand side. The positive side is enriched by "Moonlight" from Early Poems and "Song of the Brain" from Socrates, and the middle disharmonic list now counts 16 poems.
Figure 16. Fifty poems computed by sense/sound harmony.
So, we can safely say that the great majority of Webb's poems contain a negative harmony. This is further confirmed by Figure 17, which represents the analysis of 87 poems. I decided not to increase the number of poems up to 130, as was the case with the APSA system, simply because otherwise the image becomes too difficult to read and the poems' labels too cluttered together.
Figure 17. Sound–sense harmony in Webb’s 87 poems.
4. Discussion
As now appears more clearly, the sound–sense harmony poses strict requirements on the execution of the overall experiment, which comprises a first part dedicated to sound harmony, deriving the poet's greater or lesser intention to fill the harmonic scheme completely with the four available classes of sounds. For this first part of the experiment, the paper has concentrated mainly on Shakespeare's sonnets, which require a much harder level of elaboration to complete the sound–sense harmony experiment due to the presence of rhyme violations. As the data presented have extensively shown, sounds in Shakespeare's sonnets are mainly distributed over the four classes and the three main classes; only a few sonnets have two classes, and only one sonnet has a single class. The distribution is not casual, as discussed above, and responds to requirements imposed by the contents. To obtain this important but preliminary result, all rhyming pairs had to undergo a filtering check to evaluate their role in the overall rhyming scheme of the sonnet. In the case of a rhyme violation, the lexicon had to be checked and the appropriate phonetic variation inserted.
We have then shown that the sound–sense relation may represent similar but distinct situations: in the case of disharmony, we may be in the presence of ironic/sarcastic expressions, as happens in Shakespeare's sonnets. This is derived from the data: as shown above, the correlation has a negative trend, meaning that the two main variables—one defining the behaviour of the sound patterns, the other the behaviour of the sense, in this case the sentiment pattern—diverge and move in opposite directions. On the contrary, in the case of Webb's poetry, the contrast—when present—represents his need to encompass the opposites in life, as attested by his frequent use of oxymora and by his condition as an outcast rejected by society. The data for Webb show great agreement in negatively marked sound–sense harmony and much reduced agreement for positively marked data. Webb lived half of his life in psychiatric hospitals, rejected by the people who knew him, and was accepted only as a poet.
The use of two sense-related approaches has allowed us to differentiate what sentiment analysis reduced to two parameters. With the Appraisal Theory Framework, we
thus managed to better specify the nature of negative sentiment using more fine-grained
distinctions derived from the tri-partite subdivision of Attitude into Judgement, Appraisal
and Affect. The data confirmed the previous analysis but allowed a further distinction of
negatively marked sonnets into sarcastic vs. ironic.
The approach has proven general enough to encompass poets embodying the widest possible gap from the cultural, linguistic, and poetic points of view. Current DNNs are unable to cope with this highly complex task. It requires a sequence of carefully wrought processes to produce a final evaluation: in particular, the first task that is problematic for AI systems like ChatGPT is a phonetic transcription of each poem that is as faithful as possible. When asked to produce such a transcription, ChatGPT carried it out using IPA symbols, but the ARPAbet version of the result was a disaster. Word stress was assigned correctly for only 75% of the words. The reason for this situation is very simple: dictionaries for DNN models number over one million distinct word forms, and no available resource counts more than 200,000 fully transcribed entries. The solution is to provide rule-based algorithms, but we know that DNNs are just the opposite: they are unable to generalize what they might have learnt from a dictionary to new, unseen word forms [40]. In addition, transcribing into another language—like Italian—resulted in complete failure. And phonetic transcription is just the first step in the pipeline of modules responsible for the final evaluation, as the previous section has clarified.
5. Conclusions
In this article, we have proposed a totally new technique to assess and appreciate poetry, the Algorithm for Sound–Sense Harmony (ASSH). To evaluate poetry, we associated the phonetic image of a poem, as derived from the stressed syllables of rhyming words, with the computed semantic and pragmatic meaning of the clauses contained in the poem. Meaning is represented by so-called "sentiment analysis" in a first approach and then by the "appraisal theory framework" in a second approach, which has offered a more fine-grained picture of the contents of each poem. We tested the technique with the work of two famous poets: Shakespeare, an Elizabethan poet, and Francis Webb, a contemporary poet. The results obtained show the possibility of reclassifying ASSH into two subcategories, disharmony and harmony, where the majority of Shakespeare's sonnets belong to the first and Webb's poetry—and, as I assume, the majority of current poetry—to the second. Disharmony is characterized by the presence of a marked opposition between classes, both phonetically and semantically; on the contrary, harmony is characterized by a convergence of sound and sense in the two possible nuances, negative and positive.
The data from Shakespeare's sonnets have been analyzed by the usual methods with graphic charts; in the case of Webb, a new methodology has been proposed, projecting onto a graphic space the image of a poem based on its parameters, in a three-dimensional manner. This is performed by drawing a coloured box representing each poem, which can vary its shape according to its relevance, while its position varies according to the overall semantic parameters computed. The position of the box is assigned to one of the three zones into which the graphic space is organized: left for negatively marked harmonic poems, centre for disharmonic ones, and right for positively marked harmonic poems. Boxes may vary their position slightly within the assigned zone according to their parameters. Unlike the results obtained for Shakespeare's sonnets, Webb's poetry—we tested the system with 100 of the most important poems—is thus characterized by a majority of poems positioned on the left, i.e., possessing negatively marked parameters for SSH. Finally, disharmony has at least two possible interpretations: in the case of Shakespeare, it represents an ironic/sarcastic mood, while in Webb's poetry it is the result of the internal struggle for psychic survival. The method has thus been shown to be general and applicable to any type of poetry, characterizing the poet's personality through ASSH's deep analysis of the explicit and implicit contents of his or her poetic work.
Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/info14100576/s1.
Funding: This research received no external funding.
Data Availability Statement: We make available the data of the complete analysis of the 154 sonnets and of the 100 Webb poems used in this study. Data are contained within the Supplementary Materials.
Acknowledgments: The ATF classification task has been carried out partly by Nicolò Busetto, co-author of a number of papers describing the work done. Thanks to the two anonymous reviewers for the stimulating and inspiring comments that allowed me to improve the paper.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Jakobson, R. Six Lectures on Sound and Meaning; MIT Press: Cambridge, MA, USA, 1978.
2. Jakobson, R.; Waugh, L. The Sound Shape of Language; Indiana University Press: Bloomington, IN, USA, 1978.
3. Mazzeo, M. Les voyelles colorées: Saussure et la synesthésie. Cah. Ferdinand Saussure 2004, 57, 129–143.
4. Fónagy, I. The Functions of Vocal Style. In Literary Style: A Symposium; Chatman, S., Ed.; Oxford University Press: Oxford, UK, 1971; pp. 159–174.
5. Macdermott, M.M. Vowel Sounds in Poetry: Their Music and Tone Colour; Psyche Monographs No. 13; Kegan Paul: London, UK, 1940.
6. Tsur, R. What Makes Sound Patterns Expressive: The Poetic Mode of Speech-Perception; Duke University Press: Durham, NC, USA, 1992.
7. Brysbaert, M.; Warriner, A.B.; Kuperman, V. Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 2014, 46, 904–911. [CrossRef] [PubMed]
8. Mohammad, S. Even the Abstract Have Colour: Consensus in Word Colour Associations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011.
9. Melchiori, G. Shakespeare's Sonnets; Adriatica Editrice: Bari, Italy, 1971.
10. Taboada, M.; Grieve, J. Analyzing appraisal automatically. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications; AAAI Press: Washington, DC, USA, 2004; pp. 158–161.
11. Delmonte, R. Visualizing Poetry with SPARSAR—Poetic Maps from Poetic Content. In Proceedings of the NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, Denver, CO, USA, 4 June 2015; pp. 68–78.
12. Montaño, R.; Alías, F.; Ferrer, J. Prosodic analysis of storytelling discourse modes and narrative situations oriented to Text-to-Speech synthesis. In Proceedings of the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, 31 August–2 September 2013; pp. 171–176.
13. Delmonte, R.; Tonelli, S.; Boniforti, M.A.P.; Bristot, A. VENSES—A Linguistically-Based System for Semantic Evaluation. In Machine Learning Challenges; Quiñonero-Candela, J., Dagan, I., Magnini, B., d'Alché-Buc, F., Eds.; Springer: Berlin, Germany, 2005; pp. 344–371.
14. Bresnan, J. (Ed.) The Mental Representation of Grammatical Relations; The MIT Press: Cambridge, MA, USA, 1982.
15. Bresnan, J. Lexical-Functional Syntax; Blackwell Publishing: Oxford, UK, 2001.
16. Baayen, R.H.; Piepenbrock, R.; Gulikers, L. The CELEX Lexical Database (CD-ROM); CELEX2 LDC96L14; Linguistic Data Consortium: Philadelphia, PA, USA, 1995.
17. Bacalu, C.; Delmonte, R. Prosodic Modeling for Syllable Structures from the VESD—Venice English Syllable Database. In Atti 9 Convegno GFS-AIA; Sirea: Venice, Italy, 1999.
18. Tsur, R. Poetic Rhythm: Structure and Performance: An Empirical Study in Cognitive Poetics; Sussex Academic Press: Eastbourne, UK, 2012; p. 472.
19. Greene, E.; Bodrumlu, T.; Knight, K. Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 524–533.
20. Carvalho, P.; Sarmento, L.; Silva, M.; de Oliveira, E. Clues for detecting irony in user-generated contents: Oh...!! it's so easy;-). In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, Hong Kong, 6 November 2009; ACM: New York, NY, USA, 2009; pp. 53–56.
21. Kao, J.; Jurafsky, D. A Computational Analysis of Style, Affect, and Imagery in Contemporary Poetry. In Proceedings of the NAACL Workshop on Computational Linguistics for Literature, Montréal, QC, Canada, 8 June 2012.
22. Kim, S.-M.; Hovy, E. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, 23–27 August 2004; pp. 1367–1373.
23. Reyes, A.; Rosso, P. Mining subjective knowledge from customer reviews: A specific case of irony detection. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA '11), Portland, OR, USA, 24 June 2011; pp. 118–124.
24. Weiser, D.K. Shakespearean Irony: The 'Sonnets'. Neuphilol. Mitteilungen 1983, 84, 456–469. Available online: http://www.jstor.org/stable/43343552 (accessed on 6 July 2023).
25. Weiser, D.K. Mind in Character: Shakespeare's Speaker in the Sonnets; The University of Missouri Press: Columbia, MO, USA, 1987.
26. Attardo, S. Linguistic Theories of Humor; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 1994.
27. Schoenfeldt, M. The Cambridge Introduction to Shakespeare's Poetry; Cambridge University Press: Cambridge, UK, 2010.
28. Ingham, R.; Ingham, M. Subject-verb inversion and iambic rhythm in Shakespeare's dramatic verse. In Stylistics and Shakespeare's Language: Transdisciplinary Approaches; Ravassat, M., Culpeper, J., Eds.; Continuum: London, UK, 2011; pp. 98–118.
29. Delmonte, R. Exploring Shakespeare's Sonnets with SPARSAR. Linguist. Lit. Stud. 2016, 4, 61–95. Available online: https://www.hrpub.org/journals/jour_archive.php?id=93&iid=772 (accessed on 6 July 2023). [CrossRef]
30. Delmonte, R. Poetry and Speech Synthesis, SPARSAR Recites. Ricognizioni—Riv. Lingue Lett. Cult. Mod. 2019, 6, 75–95. Available online: http://www.ojs.unito.it/index.php/ricognizioni/article/view/3302 (accessed on 6 July 2023).
31. Crystal, D. Sounding Out Shakespeare: Sonnet Rhymes in Original Pronunciation. 2011. Available online: https://www.davidcrystal.com/GBR/Books-and-Articles (accessed on 6 July 2023).
32. Mazarin, A. The Developmental Progression of English Vowel Systems, 1500–1800: Evidence from Grammarians. Ampersand 2020. Available online: https://reader.elsevier.com/reader/sd/pii/S2215039020300011?token=BCB9A7B6C95F35D354C940E08CBA968ED124CF160B0AC3EF5FE9146C4B3885E825A878104ED06E127685F139918CCEB6 (accessed on 6 July 2023).
33. Crystal, D. Think on My Words: Exploring Shakespeare's Language; Cambridge University Press: Cambridge, UK, 2008.
34. McGuire, P.C. Shakespeare's non-Shakespearean sonnets. Shakespear. Q. 1987, 38, 304–319. [CrossRef]
35. Attardo, S. Irony as relevant inappropriateness. J. Pragmat. 2000, 32, 793–826. [CrossRef]
36. Martin, J.; White, P.R. The Language of Evaluation: Appraisal in English; Palgrave Macmillan: London, UK; New York, NY, USA, 2005.
37. Read, J.; Carroll, J. Annotating expressions of appraisal in English. Lang. Resour. Eval. 2012, 46, 421–447. [CrossRef]
38. Delmonte, R.; Busetto, N. Detecting Irony in Shakespeare's Sonnets with SPARSAR. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, 13–15 November 2019. Available online: https://dblp.org/rec/bib/conf/clic-it/DelmonteB19 (accessed on 6 July 2023).
39. Webb, F. Collected Poems; Davidson, T., Ed.; University of Western Australia Publishing: Crawley, Australia, 2011.
40. Delmonte, R. What's Wrong with Deep Learning for Meaning Understanding. In Proceedings of the 2nd Italian Workshop on Explainable Artificial Intelligence (XAI.it 2021), Virtual, 1–3 December 2021.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Morphosyntactic Annotation in Literary Stylometry
Robert Gorman
Department of Classics and Religious Studies, College of Arts and Sciences, University of Nebraska–Lincoln,
Lincoln, NE 68588, USA;
[email protected]
Abstract: This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing
on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style
has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing
input variables that are interpretable in traditional grammatical terms. This study demonstrates how
widely available Universal Dependency parsers can generate useful morphological and syntactic
data for texts in a range of languages. These data can serve as the basis for input features that are
strongly informative about the style of individual novels, as indicated by accuracy in classification
tests. The interpretability of such features is demonstrated by a discussion of the weakness of an
“authorial” signal as opposed to the clear distinction among individual works.
Keywords: stylometry; Universal Dependencies; authorship attribution
1. Introduction
Stylometry is a discipline that attempts to apply rigorous measurement to the traditional concerns of stylistics. Stylistics involves the identification and evaluation of certain
characteristics that may distinguish the language use of individuals, groups of individuals,
genres, etc. For humanists, the objects of stylometric study are most frequently literary,
historical, or philosophical texts. Recent years have seen much research in the application
of stylometrics in the humanities, and this work has produced many advances in the field.
However, there remain important weaknesses in the predominant methods in the field.
The present study is an attempt to address aspects of these weaknesses: the lack of stylometric input features that produce results that are both (1) topic agnostic and (2) directly
interpretable in traditional terms.
Generally, to be of interest to researchers, the characteristics of the “style” of a text
must be distinctive enough to allow us to discriminate that text from other relevant material.
Thus, from the early days of stylometrics [1], success in classification experiments has served
to establish the stylometric value of the input features on which accurate classifications
were based. Of course, considerations other than high accuracy must also be considered
when evaluating input features. For example, the frequency profiles for a set of common
words [2] or common word sequences—word n-grams [3] are generally quite effective
for classification, but because these input features may include “lexical” words, they are
usually avoided when the style of an individual writer is the focus. Lexical words, also
called “content” words, can be strongly influenced by topic, genre, etc., and this influence
may confound classification. In such a case, researchers rely upon features considered to
be “topic agnostic” since they are not closely and directly dependent on the subject matter
of the text in question. Chief among these topic agnostic inputs are “function” words
and character n-grams. Unlike lexical words, function words (for example, prepositions,
conjunctions, determiners, etc.) belong to a small, closed set. In spite of this fact, function
words as a group are used more often than content words [4]. In addition, function words
more closely reflect syntactic structure than semantic content. Thus, we can reasonably
assume that function words are relatively free of confounding effects. Some function
words may, however, be more closely dependent on genre or topic than others. Gendered
Information 2024, 15, 211. https://doi.org/10.3390/info15040211
75
https://www.mdpi.com/journal/information
Information 2024, 15, 211
pronouns, for example, are for this reason sometimes removed from studies of function
words [5]. On the other hand, while function words often allow for accurate classification
and therefore clearly capture something distinctive about many texts, it is difficult to
translate the frequency profile of a set of prepositions, conjunctions, etc., into a detailed
understanding of the style of a text.
Character n-grams, which recently have become quite popular in textual studies,
share, but to a more extreme degree, the advantages and disadvantages of function words.
Consisting of character sequences without regard to their position in a word, their order in
a sentence, etc., character n-grams represent a text at a sub-lexical level although, because
spaces between words are usually counted as “characters”, rough information about word
boundaries is reflected in this input. For this reason, they are generally free of the criticism
that they are closely dependent on external factors such as topic or intended audience (for
reservations, see [6]). However, it should be obvious that the frequency distribution of
randomized sequences of letters is practically uninterpretable in terms of more traditional
approaches to style, and character n-grams are therefore uninteresting from that perspective.
Thus, approaches to stylometry are closely connected to the ongoing debate in machine
learning and related fields about the relative advantage of choosing, on the one hand,
heuristic input features that may be difficult to interpret and, on the other hand, input that
represents a symbolic structure such as syntax. (For a recent examination of the topic with
a bibliography, see [7]).
Another example of a non-morphosyntactic computational analysis of texts is front–back vowel harmony testing [8,9], which tests whether words tend to contain only front vowels or only back vowels. Front–back vowel harmony is so characteristic of certain languages that this feature can be detected even if these languages are written in an undeciphered syllabic script [8].
This paper is an introduction to the stylometric and stylistic value of the morphosyntactic information provided in the annotations of the Universal Dependency treebanks.
First, using the standard criterion of text/authorship attribution, it will demonstrate that
morphosyntactic input features can successfully discriminate among texts without the
identification of any vocabulary items. This result indicates that these features can be
effective while being topic agnostic. Second, it will show that many morphosyntactic input
features can be interpreted in a relatively straightforward way that is consistent with terms
and concepts long used in the precomputational study of literary style.
Advances in the field of interpretable machine learning have provided important
tools for expanding the usefulness of ML by making results easier to understand, even
with “black-box” algorithms (my thanks go to the anonymous reviewer for emphasizing
this point). It nonetheless remains an advantage, at least when attempting to persuade
researchers in literary fields of the validity of computational approaches, to select input
variables that are explainable by referring to traditional stylistics. The academic study of
literary style has its roots in the traditional disciplines of Poetics and Rhetoric [10]. Both
approaches agree that among the most important parts of a description of the style of a text
or corpus are analyses of diction and word arrangement.
Diction is essentially word choice or vocabulary. This traditional focus is also central
to our investigations, in that information about every word in every text analyzed is
included in our data. At the same time, morphological annotation allows us to abstract
away from individual vocabulary choices. Each word is included in our input features
not as a token of a particular lexical item but rather as a representation of the relevant
morphosyntactic categories (part-of-speech, singular or plural, subject or direct object, etc.).
Thus, in accordance with the traditional importance of diction in stylistic research, words
remain the basic unit of analysis in this study, but in a way that seeks to be topic agnostic
and avoid the confounding effects often introduced by a consideration of vocabulary.
The traditional importance of word arrangement to stylistic research in the humanities is also reflected in the input features chosen for this study. Information about the
syntactic annotation of every word in the corpus is reflected in the input features. Syntactic
76
Information 2024, 15, 211
annotation explicitly encodes the relationship between words, and therefore, its use as a
dimension of analysis can be seen as a natural expansion of a traditional approach.
Thinking of the morphosyntactical features used in these experiments as computational “enhancements” of the traditional pillars of literary style, diction and arrangement,
will, it is hoped, promote a broad understanding of the approach. Interpretability should
also be increased by our use of traditional terminology. Terms used for morphological
categories and values are known to any serious researcher in literary style. The widespread
adoption of dependency grammar is, admittedly, relatively recent, but generally, the protocols of dependency grammar are closely related to the traditional concepts and terminology
of the humanistic study of language and literature.
Thus, a stylometric analysis based on the features presented here is well suited to
contribute to a thoroughgoing investigation of the style of a work or author in a way
that should be interpretable to a wide range of readers. In addition, in the course of
our discussion, it will become clear that this morphosyntactic approach is effective, with
minimal adjustments, across a range of languages. This quality is beneficial in a field where
English texts have been the predominant source of, and testing ground for, stylometric
methods [11].
The organization of the remainder of this paper is as follows. First, Section 2 describes the various corpora and the step-by-step construction of effective input features from morphosyntactic annotation. Section 3 explains the classification experiment
used to demonstrate the stylometric value of the input features. In Section 4, the results of
the classification are briefly discussed. In Section 5 morphosyntactic frequencies are used
as the basis of an investigation into the interrelations between the “local” characteristics of
individual novels and the more general “authorial” signature. A short conclusion rounds
off the article.
2. Corpora and Morphosyntactic Input Features
The input features used in this study are derived from texts annotated according to the
framework used in the Universal Dependencies Treebank Collection [12,13]. The Universal
Dependencies (UD) project is an open community effort that has been growing rapidly in
recent years. The project has given impetus to the development and publication of software
implementing pipelines for tokenization, tagging, lemmatization and dependency parsing
of texts in a wide range of languages. These invaluable programs—called UDPipes—cover
a wide range of languages and are available for the R and Python environments as well as
through a convenient web interface (https://lindat.mff.cuni.cz/services/udpipe/, accessed
1 January 2024).
Because the focus of this paper is the advantages of morphosyntactic features as
stylometric tools for the humanities, corpora consisting of a selection of novels have
been chosen. While much recent stylometric work has concentrated on social media
texts and the like, this material is less central to the interests of humanists than more
traditional literary writing. Our corpus includes novels in English, French, German, and
Polish. A more diverse set of languages would have been preferable, but such works
meeting the requirements of our study were not readily available. The design of our
investigation calls for literary works that are similar in genre and chronology. As many
authors as possible should be represented, and for each author, the set should include
three separate novels. These criteria could be met by combining reference corpora freely
available at github.com. The English, German, and Polish novels were made available by
the Computational Stylistics Group (https://github.com/computationalstylistics, accessed
1 January 2024). The French novels were a resource provided by Computerphilologie Uni
Würzburg (https://github.com/cophi-wue/refcor, accessed 1 January 2024). Because we
will compare the performance of morphosyntactic input variables for works in the different
languages, all corpora were limited to the size dictated by the smallest set (German). As
a result, each language corpus contains 15 authors, each represented by three different
works. In order to facilitate comparisons between the languages, only the first 20,000 tokens
(excluding punctuation) of each work in each corpus are considered.
After the collection of a suitable set of texts, the next step is to generate the basic
annotations from which the input features will be assembled. This processing is carried
out with the appropriate UDPipes through the “udpipe” package for the R Software
Environment [14] (R version 4.2.1). Raw text (.txt files) provided to the udpipe produces
output in the CONLL-U format. An example of this output is given below (Tables 1–4).
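Before the tables, it may help to see how such an annotation is produced; the following is a minimal R sketch using the "udpipe" package (the model name "english-ewt" is illustrative; any UD model supported by the package will do):

```r
library(udpipe)

# Download and load a UD model once (the file name depends on the release).
model_info <- udpipe_download_model(language = "english-ewt")
ud_model   <- udpipe_load_model(model_info$file_model)

# Annotate raw text; the data frame exposes the CONLL-U fields used below.
txt  <- "It gives us the basis for several deductions"
anno <- as.data.frame(udpipe_annotate(ud_model, x = txt))
anno[, c("token", "lemma", "upos", "feats", "head_token_id", "dep_rel")]
```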
Table 1. "Shallow" annotation output by UDPipe. Sentence: "It gives us the basis for several deductions" (Doyle, The Hound of the Baskervilles, 1901).

Token        Lemma       Upos   Feats
It           it          PRON   Case = Nom|Gender = Neut|Number = Sing|Person = 3|PronType = Prs
gives        give        VERB   Mood = Ind|Number = Sing|Person = 3|Tense = Pres|VerbForm = Fin
us           we          PRON   Case = Acc|Number = Plur|Person = 1|PronType = Prs
the          the         DET    Definite = Def|PronType = Art
basis        basis       NOUN   Number = Sing
for          for         ADP    NA
several      several     ADJ    Degree = Pos
deductions   deduction   NOUN   Number = Plur
Table 2. "Deep" annotation output by UDPipe. Sentence: "It gives us the basis for several deductions" (Doyle, The Hound of the Baskervilles, 1901).

Head_Token_Id   Dep_Rel
2               nsubj
0               root
2               iobj
5               det
2               obj
8               case
8               amod
5               nmod
Table 3. "Shallow" annotation by UDPipe. Sentence: "There, however, stood only a single bowl" (Spyri, Heidi, 1880).

Token          Lemma          Upos   Feats
Da             Da             ADV    NA
stand          stehen         VERB   Mood = Ind|Number = Sing|Person = 3|Tense = Past|VerbForm = Fin
aber           aber           ADV    NA
nur            nur            ADV    NA
ein            ein            DET    Case = Nom|Gender = Neut|Number = Sing|PronType = Art
einziges       einzig         ADJ    Degree = Pos|Gender = Neut|Number = Sing
Schüsselchen   Schüsselchen   NOUN   Gender = Neut|Number = Sing|Person = 3
Table 4. "Deep" annotation by UDPipe. Sentence: "There, however, stood only a single bowl" (Spyri, Heidi, 1880).

Head_Token_Id   Dep_Rel
2               advmod
0               root
2               advmod
5               advmod
7               det
7               amod
2               nsubj
For each token, the analysis gives the form as it appears in the text and its lemma.
This information is not used in the method described here since our goal is to examine the
discriminative power of morphosyntactic features. In addition, as noted above, general vocabulary may be largely dependent on genre or subject matter and may confound analysis.
It is worth noting that the elimination of word forms and lemmas from consideration simplifies preprocessing and, to some degree, compensates for the time required to extract input
features from the parsed text. Minimal clean-up of the .txt file is required; chapter titles
and the like can be left in the document without affecting the results of the classification.
Leaving aside the form and lemma, the remaining columns in the UDPipe output
shown in Tables 1–4 are essential to our method. The “upos” column contains the UD
part-of-speech tags for each word. The “feats” column gives the morphological analysis.
Morphology information has the form "TYPE = VALUE", with multiple features separated by a bar symbol (TYPE1 = VALUE|TYPE2 = VALUE). Consider, for example, the
morphological data supplied for the word us in Table 1. The upos column assigns us to
“pronoun” as its part of speech. The “feats” column then gives the following information:
Case = Acc|Number = Plur|Person = 1|PronType = Prs. This annotation can be read as
follows. The grammatical case of us is accusative; its grammatical number is plural; it refers
to the speaker, so it is considered grammatically a first-person word; lastly, us belongs to
the pronoun subtype “personal”.
A comparison of the “feats” column in Table 1 with that in Table 3 reflects an important typological difference among languages. Languages can vary significantly in their
morphological complexity. For example, English nouns (basis and deductions in Table 1)
are generally annotated only for grammatical number, while English adjectives (several in
Table 1) show only grammatical degree (i.e., positive, comparative, and superlative). In
contrast, German nouns (Schüsselchen “bowl” in Table 3) and adjectives (einziges “single”)
are considered to have grammatical gender (and case) as well as number. In addition to
the natural differences between languages, complications can be introduced by the parser.
For example, the UDPipe version used in this study (“german-hdt-ud-2.5-191206.udpipe”
Wijffels 2019) does not assign a case to every instance of a noun or adjective. Instead,
explicit annotation of case is generally restricted to words in which different cases are
indicated morphologically: for example, occurrences of Kindes (“child’s”) are marked as
genitive of the noun Kind (“child”); and occurrences of bösem are marked as dative of the
adjective böse (“bad”).
Parts of speech and morphology constitute what we may call “shallow” syntactic
features. These features reflect some syntactical structures, but do not represent them
directly. In contrast, the “head_token_id” and “dep_rel” columns are a direct representation
of syntactic organization. The head token is the item that is the immediate syntactic “parent”
of a given token. The “dep_rel” reports the dependency-type label as specified in the UD
annotation guidelines. The dependency relation specifies the type of grammatical structure
between the parent and target. From these columns, we can calculate the grammatical
structure of an entire sentence, as visualized in a dependency tree such as the one shown
below (Figure 1).
Figure 1. Universal Dependency tree for “It gives us the basis for several deductions”.
The syntactic “path” from the sentence root to each “leaf” token is given by the
combination of head id and dependency relationship. The syntactic function of each word
is clearly and specifically defined by these two values. For example, the word basis is the
obj of the word gives. obj is the UD label for what is traditionally called the “direct object”
of a verb (a list of syntax labels along with examples can be found on the UD website:
https://universaldependencies.org/en/dep/index.html, accessed 1 January 2024). The
word several is labeled as amod of deductions. amod indicates an adjectival modifier. The word
deductions itself is an nmod of basis. In UD annotation, nmod means "nominal modifier": a noun or noun phrase directly dependent on and specifying another noun (or noun phrase), for example, the prepositional phrase in "toys for children".
It is important to recognize the special importance of the “head_token_id” annotation.
Because its values specify the configuration of the dependency tree for each sentence, head
token information can also be used to add structural/syntactic “depth” to the “shallow”
morphological data. For example, examined against the background of the dependency
tree, the German word Schüsselchen (“bowl”) is no longer just a neuter noun, but a neuter
noun that is dependent on a past tense verb, or a neuter noun that is dependent on the main
verb, etc. Thus, the head token annotation allows us to consider the “syntactic sequence” of
words, a hierarchically ordered analogue to the chronologically ordered sequence encoded
in traditional n-grams.
The input features in our study are composed primarily of the three kinds of information discussed above: (1) morphological annotation; (2) syntactic information; and (3)
morphosyntactic “n-grams” containing combinations of morphological and syntactic data
from words that are hierarchically contiguous in the dependency tree of a sentence.
When constructing input features from morphosyntactic annotation, it is important
to design the features in a way that preserves information while avoiding sparsity. We
can achieve this goal by incorporating, in the input features for a single word, a series
of combinations of individual morphosyntactic values. In many languages, a morphological analysis of a word may be relatively complex. For example, the UD annotation
for the German word stand (Table 3) indicates that its morphology may be identified as
a past indicative third-person singular finite verb. Naturally, as the number of more or less
independent values in a given complex annotation increases, the frequency of that set of
values will correspondingly decrease. Thus, while 12.5% of the words in Spyri’s Heidi are
annotated with the part of speech verb, only 4.2% are marked with a combination of the verb
annotation and the tense annotation past. If we take syntactic function into consideration,
only 1.6% of words are a past tense verb whose relationship is annotated as root (i.e., the
main verb of a sentence). Such a sharp tendency toward sparsity will rapidly compromise
the effectiveness of morphosyntactic data. To avoid this effect, we distribute the full morphological and syntactic annotation for each word into a set of combinations made from
its assigned grammatical values. In this way, for example, each verb is associated with
an input feature giving its tense, another giving its mood, another its person, etc. Then,
all binary combinations of types are generated (e.g., tense and mood, tense and person,
or mood and person). The same is carried out for ternary combinations. The result is a
framework for organizing input features that is satisfactorily informative while maintaining
an acceptable level of sparsity.
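As a sketch of this distribution step, the following R function expands the "feats" string of a single token into all unary, binary, and ternary TYPE = VALUE combinations (the encoding of the combined feature names is illustrative, not the exact one used in the study):

```r
# Expand one token's morphological annotation into unary, binary,
# and ternary combinations of its TYPE=VALUE pairs.
expand_feats <- function(feats) {
  pairs <- strsplit(feats, "|", fixed = TRUE)[[1]]
  out <- pairs                                   # unary features
  for (k in 2:3) {
    if (length(pairs) >= k)
      out <- c(out, combn(pairs, k, paste, collapse = "+"))
  }
  out
}

expand_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
# yields "Tense=Past", "Mood=Ind+Tense=Past", "Mood=Ind+Number=Sing+Tense=Past", ...
```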
In addition to encoding the morphosyntactic information for each word in the text,
we take advantage of the opportunity afforded by the head token annotation to enrich
the data, as mentioned above. For each word that is not annotated as the root of a sentence, we include input features constructed from the morphosyntactic annotation of that
word’s dependency “parent”. For example, the input features for deductions in Table 1
would include combinations made from basis as well as deductions itself. These syntactically ordered n-grams bring a measure of structural depth to otherwise shallow “surface”
morphological information.
In addition to the morphosyntactic categories discussed so far, we have also included
a small additional group in the feature set. Natural language may be conceptualized as
a hierarchical structure (as illustrated in dependency treebanks) projected onto a linear
order, the chronological sequence of words in texts or speech. Word order, as well as word
hierarchy, can represent crucial stylometric information. We capture some of this linear
information by adding two values to the annotations provided by UDPipe: dependency
distance (DD) and dependency direction (DDir).
DD is the distance in the linear order of a sentence between a given word and its
parent word, measured by the number of words. More precisely, DD can be thought of
as the absolute value of the difference between the linear index of a word (its position
in the linear sequence) and the linear index of the word’s parent. Thus, in our example
sentence, “It gives us the basis for several deductions”, the DD of the word us is 1: the
index of us = 3, the index of parent word gives = 2; hence, 3 − 2 = 1. As treebanks of many
languages become widely available, research on DD is becoming more important. DD
has been suggested as a proxy for sentence complexity [15–17] and as an explanation for
aspects of word order. It is therefore reasonable to include DD among our input features on
the assumption that it represents something important about the style of a text. A second
addition to our set of categories is dependency direction (DDir). This category is quite
simple. A value is assigned to each word (except for sentence roots) indicating whether it
comes before or after its parent word in the linear order of the sentence. Word order has
long been a staple of analyses of stylistics, so it naturally finds a place in a stylometric study
(computational studies based on treebank data for DDir tend to be focused on typological
questions rather than stylistics ones [18]).
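Both values follow directly from the token indices in the CONLL-U output; a minimal sketch, assuming an annotation data frame for a single sentence like the one produced by udpipe above:

```r
# Add dependency distance (DD) and dependency direction (DDir) columns.
add_dd_ddir <- function(anno) {
  idx  <- as.integer(anno$token_id)
  head <- as.integer(anno$head_token_id)
  root <- head == 0                        # sentence roots get neither value
  anno$dd   <- ifelse(root, NA, abs(idx - head))
  anno$ddir <- ifelse(root, NA, ifelse(idx < head, "before", "after"))
  anno
}
# For "It gives us the basis ...": us has index 3 and parent index 2,
# so dd = |3 - 2| = 1 and ddir = "after".
```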
The restriction of our input features to unary, binary, and ternary combinations of
annotation categories is an attempt to balance the desire to include the widest range of
possibly useful stylometric data with the need to avoid a sparse set of inputs. Nevertheless,
additional culling of the input features is necessary. Even limiting combinations to no more than three elements still allows for over 16,000. And each of these combinations
is a type, not a variable. Each component of a given type may represent more than one
value; the ternary combination gender–number–case may take one of 24 different value
combinations in German (3 genders × 2 numbers × 4 cases). Thus, even our restricted set
of combinations, when populated with the appropriate values, would be computationally
unfeasible. We have addressed this problem with a naïve approach. Since we cannot
know in advance which combinations may be most distinctive for authors and texts, we
have selected among them based on frequency alone. For each combination length, only
those type–value pairs which occur in approximately 5% of the tokens in the corpus have
been included as input variables for classification. The process of populating feature types
with their values is computationally slow for combinations of more than two elements,
so we have used a smaller sample corpus for each language. Thus, the 5% cut-off is
an approximation. A separate set of variables has been identified in this way for each
language. Because UDPipe produces different types of morphological annotation, and
because syntactic annotation, although it largely consists of the same relationship labels,
has different frequency distributions in various languages, the same selection procedure
with the same 5% cut-off results in a different quantity of features for each language. Details
are given in Table 5.
Table 5. Number of input features by number of type–value components in each feature.

          Unary   Binary   Ternary   Total
English   55      231      367       653
French    59      337      629       1025
German    63      325      673       1061
Polish    65      337      735       1137
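A sketch of this frequency-based culling, assuming a named vector counting, for a sample corpus, how many tokens carry each populated type–value feature:

```r
# Keep only features occurring in at least ~5% of the corpus tokens.
select_features <- function(feature_hits, n_tokens, cutoff = 0.05) {
  names(feature_hits)[feature_hits / n_tokens >= cutoff]
}
```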
The nature of these features may be difficult for the reader to visualize from a description alone. The examples given below in Section 5 should provide illustration.
3. Classification
The purpose of this study is to examine the effectiveness of morphosyntactic input
features as stylometric markers of literary texts. In particular, we test the usefulness
of the annotation produced by the UDPipe applications. As noted above, classification
experiments are a standard means to evaluate the worth of different sets of input features. It
is to be expected that the various steps implemented by UDPipe involve a greater or lesser
degree of error. Since the information/noise ratio worsens for shorter input texts [19], the
first round of classification tests will be performed using a range of shorter “texts” sampled
from our corpora. The sample sizes are 2000, 1000, and 500 words. For the purpose of
sampling, each text was treated as a “bag of words”. Each token was—naively—treated
as independent of all others; no further account was taken of the context of an individual
token in sentence, paragraph or any other unit of composition.
Recent years have seen the rapid development of many sophisticated classification
algorithms. Deep learning approaches are appearing frequently in stylometric studies (for
example, [20,21]). However, in spite of the accuracy achieved by some of these approaches,
they are often uninterpretable; it is unclear exactly how the algorithm arrived at a particular
classification, or even just what elements of a text were considered [22]. This is not a
satisfactory outcome for stylometrics in a literary or historiographical context. In such
fields, understanding and explaining the style of a text or author is often the principal goal.
In an effort to combine good accuracy and a high level of interpretability, we have
chosen logistic regression as the approach for this study. Logistic regression has long been
used extensively in many fields and is well understood. It is straightforward to identify the contribution of each input feature to the predictions produced by this method.
An additional advantage is that logistic regression is able to function well in the presence of
many co-linear variables. Morphosyntactic data are by nature highly inter-dependent, and
this may present a problem for some approaches. In this study, regression was implemented
through the LiblineaR package for the R Project for Statistical Computing [23,24]. This
package offers a range of linear methods; we selected the L-2 regularization option for
logistic regression.
The first experiment was designed to discover if morphosyntactic features could
distinguish among the individual novels in the corpora. For each input sample text size
in each language, 80% of the data were used for training the classifier and the remaining
20% were set aside for testing. Inclusion of a segment in the training or testing set was
random. For example, to test 2000 word samples, each 20,000-word text in a corpus was
split randomly into ten samples, eight of which were used for training and two were set
aside for testing. Each training step of the classifier was therefore based on 360 samples
(8 per novel for 45 novels). The procedure for other sample sizes was analogous. To validate
the results of the classification testing, we used Monte Carlo sub-sampling [25] applied at
two levels. As a rule, the populating of the segments with randomly selected tokens was
carried out ten times. For each of these partitionings to create text segments, 50 additional
random partitionings into a training set and a test set were made.
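The following R sketch shows one such train/test iteration with LiblineaR (variable names are illustrative: X is the segment-by-feature frequency matrix and y the novel label of each segment; type = 0 selects L2-regularized logistic regression):

```r
library(LiblineaR)

# One random 80/20 split: per novel, 8 of 10 segments train the model
# and 2 are held out; returns the held-out classification accuracy.
run_split <- function(X, y) {
  test <- unlist(lapply(split(seq_along(y), y), sample, size = 2))
  fit  <- LiblineaR(data = X[-test, ], target = y[-test], type = 0)
  pred <- predict(fit, X[test, ])$predictions
  mean(pred == y[test])
}

# Monte Carlo validation: repeat the random partitioning 50 times.
# accs <- replicate(50, run_split(X, y)); mean(accs)
```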
4. Results
We would expect the stylometric “signature” of individual novels to be very strong.
This expectation is based on the (over-)simplifying assumption that a single literary work
has a unitary style, arising from a shared theme, time of composition, etc. It should
therefore not be surprising that morphosyntactic attribution with separate classes for each
novel is quite successful. The results are given in Table 6, which gives the mean accuracy
rate (correct “guesses”/total “guesses”) for the 500 iterations in the top row of each cell,
with the accuracy range reported below the mean. There are 45 classes in the data set for
each language. All classifications were multi-class (one-versus-rest approach).
Table 6. Results of classification by individual novel (45 classes).

          500-Word Samples   1000-Word Samples   2000-Word Samples
English   90.6%              97.1%               99.4%
          (90.1–91.3%)       (96.4–98.1%)        (98.6–99.9%)
French    93.8%              96.9%               98.9%
          (93.1–94.6%)       (94.4–98.9%)        (97.7–100%)
German    96.3%              99.1%               99.8%
          (95.8–96.6%)       (98.8–99.3%)        (99.2–100%)
Polish    98.3%              99.5%               100%
          (96.8–98.9%)       (98.3–100%)         (100–100%)
Clearly, the works in each corpus are sharply distinguishable at the morphosyntactic
level. Unfortunately, there is little published research to which these results may usefully
be compared. Generally, recent stylometric research has a quasi-forensic tendency, focused
on the ability to “prove” authorship of particular texts. In such cases, there is no reason to
examine the discriminability of the individual works of an author. In contrast, our interest
is in the descriptive value of stylometric measures as applied to works as well as authors.
Our assumption in this study is that input features that both discriminate texts clearly
and are understandable in terms of traditional stylistics may serve as the basis of valuable
stylometric descriptions. Our results indicate that discriminability is high even with the
relatively small 500-word samples; this success can be taken as an indication that a good
deal of stylistic information is in fact conveyed by the features that we have proposed.
We will examine some of the most important of these distinguishing features in the next
section. It is worth mentioning here that the same procedure (albeit with different input
features for each corpus) works quite well for each language tested. In fact, it is apparent
from the 500-word samples that the morphosyntactic signal is somewhat stronger the more
morphologically complex the language is. This complexity is reflected in the number of
features as reported in Table 5: a sharper distinction seems to exist among works in Polish,
which has 1137 total input features, than among works in English (653 total features).
5. Interpretability
In this section, we explore how the morphosyntactic input features presented here can
be interpreted in a relatively straightforward manner using traditional grammatical terms.
This advantage is not associated with more popular inputs such as character n-grams,
which can achieve high accuracy in a classification test but do not lend themselves to
clear interpretation.
In order to better illustrate the interpretability of our feature set, we will carry out our
discussion against the background of an important open problem in stylometrics. This
problem concerns the relationship between a stable authorial “signature” and the variability
that all authors can be expected to display among their individual works. We have seen
above (Table 6) that each novel in our four corpora has its own strong stylometric “signal”
that allows it to be uniquely identified. Thus, all 2000-word samples in our classification
were assigned to the correct work with an aggregated mean accuracy greater than 98%.
Matters are different if, instead of isolating the morphosyntactic characteristics that
distinguish particular novels from each other, we try to abstract from the particular works
the more general “style” of each author. Table 7 presents the results of such an experiment.
Once again, data were randomly partitioned into samples of 2000, 1000, and 500 words.
This time, however, classification followed the standard “leave-one-out” method. For
each training iteration of the logistic regression classifier, all samples from one novel were
withheld from the training data. The classes for attribution were the 15 authors in each
corpus; the target class was modeled on the basis of the two novels by the relevant author
remaining in the training set. In Table 7, the mean accuracy rate (correct “guesses”/total
“guesses”) for the 450 classification attempts is given in the top row of each cell, and
the accuracy range is reported below the mean (the data for each novel were partitioned
ten times into 2000-word samples; from each partitioning, 45 leave-one-out models were
trained and the held-out set of samples was classified). There are 15 classes in the data set.
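A schematic version of this leave-one-novel-out loop, again in illustrative Python rather than the study's R code, looks as follows:

```python
# Sketch of leave-one-novel-out author attribution: all samples from one
# novel are withheld, the model is trained on the remaining novels with
# author labels, and the held-out samples are then classified.
import numpy as np

def leave_one_novel_out(X, author_ids, novel_ids, classifier):
    accuracies = []
    for novel in np.unique(novel_ids):
        held_out = novel_ids == novel
        classifier.fit(X[~held_out], author_ids[~held_out])
        accuracies.append(classifier.score(X[held_out], author_ids[held_out]))
    return float(np.mean(accuracies))
```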
Table 7. Results of leave-one-out classification by author (15 classes).

Language   500-Word Samples     1000-Word Samples    2000-Word Samples
English    51.0% (49.6–53.5%)   56.9% (55.6–58.2%)   62.9% (60.2–64.8%)
French     54.0% (51.8–55.3%)   55.1% (54.0–56.1%)   57.1% (55.1–59.3%)
German     61.2% (59.9–63.2%)   63.0% (60.3–65.1%)   62.9% (59.7–64.0%)
Polish     44.8% (43.4–45.8%)   42.8% (42.4–44.7%)   41.5% (40.2–43.7%)
The sharp decrease in classification accuracy is striking. Presumably, an explanation is
to be found in the greatly increased difficulty of the problem. The results of the most closely
comparable previous studies point to the same conclusion. Maciej Eder has published three
important studies on authorship attribution [19,26,27] in which the corpora are similar to
our own. The accuracy of Eder’s experiments is consistent with our results. For example,
Eder (2010) classifies samples of various sizes drawn from 63 English novels; for samples
of around 1000 words, accuracy falls between 40% and 50%. A more precise comparison is
unfortunately not possible. All three of Eder’s works present their results in graphs rather
than tables. Thus, only rough estimates for the accuracy of a given sample size are possible.
Most of Eder’s data are based on the most frequent words. For a corpus of 66 German
novels, samples ranging from 500 to 2000 words seem to yield accuracy scores from 30% to
60%. Evidently, the low accuracy of our authorship attribution tests (as compared to
novel-by-novel classification) is not anomalous. Furthermore, it does not seem likely that the
combination of input features and classifier that was quite good at identifying individual
novels would become uninformative about the authorship of those same works. The field
of stylometry, at least as it pertains to the realm of literary writing, relies on the assumption
that each author displays a number of linguistic “peculiarities” which remain stable for
some significant length of time. Of course, this more or less stable authorial “signature”
only exists as it is manifested in their individual writings, and these writings naturally
vary in a host of ways. While we assume that the morphosyntactic dimension of authorial
style is less affected by the “local” variation among texts than other aspects of style may be,
true independence is of course impossible. It is essentially inconceivable that the author’s
personal linguistic “signature” could be completely separable from and unaffected by
certain “external” factors—a novel’s plot, setting or characters, for example—when it is
only in the treatment of these factors that style comes into existence.
A comparison of Tables 6 and 7 shows that, to speak loosely, the authorial signal as
reflected by the morphosyntactic input features is only about half as strong as that produced
by the combination of author and the “local” characteristics of the individual novel. This
difference in results is the context against which we will examine the interpretability of
our feature set. In particular, we will choose a single text, Oliver Twist, an 1839 work by
Charles Dickens, on which to focus our discussion. We will briefly examine input features
that allow this novel to be distinguished from the other 44 works in the corpus. Then, we
will do the same for features that group Oliver Twist together with the other two Dickens
works in the corpus while at the same time distinguishing “Dickens” as a class separate
from the other 14 authors represented.
There are several simple ways to identify input features that are highly discriminative
with a “one-layer” classifier such as logistic regression. For example, we could select those
features to which the classifier assigned the most extreme weights (positive or negative).
Similarly, we could guide selection by looking at the product of the model weight and the
frequency of the feature, since it is on this basis that the algorithm assigns the probability
for any class. However, to avoid complicating this discussion, we will limit our focus to the
frequency of occurrence of the features. In particular, we examine the standardized value
of input frequencies and choose those with the largest z-scores. Because morphosyntactic
values are naturally interdependent, each input feature selected represents a group of
correlated grammatical phenomena. For example, in many languages, only verbs are
considered to have tense. Therefore, if a word is annotated as “tense = past”, the part-of-speech annotation is redundant. Table 8 presents the selection of features “preferred” by
Oliver Twist, as compared to the remainder of the corpus.
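The selection procedure amounts to standardizing each feature's frequency across the works and ranking by the target work's z-score; a minimal sketch (not the authors' code):

```python
# Hypothetical sketch of z-score-based feature selection: freqs is a
# works x features matrix of relative frequencies; target_row indexes the
# work of interest (here, Oliver Twist).
import numpy as np

def top_features_by_z(freqs, target_row, k=5):
    mu = freqs.mean(axis=0)
    sd = freqs.std(axis=0, ddof=1)
    z = (freqs[target_row] - mu) / sd
    order = np.argsort(-np.abs(z))   # largest |z| first, either sign
    return order[:k], z[order[:k]]
```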
Table 8. Selection of input features “preferred” in Oliver Twist.

#    Feature                                   Frequency, Oliver   Frequency (Mean of Corpus)   Frequency Rank   Z-Score
1A   Number is singular, parent precedes       0.176               0.148                        1                2.98
2A   Parent’s own parent follows               0.124               0.104                        1                2.59
3A   Parent is singular, parent’s DD = 2       0.071               0.061                        3                2.04
4A   Article, parent is singular noun          0.079               0.064                        5                1.72
5A   Parent’s dependency label is “object”     0.065               0.058                        3                1.43
A few examples will help to illustrate the phenomena underlying these values. The
first feature is grammatically transparent. This sentence from Oliver Twist has two examples:
“They talked of hope and comfort”. The two nouns, hope and comfort, are annotated with feature
#1; obviously singular, they are preceded by their dependency parent, talked. Although this
dependency—a noun upon a verb—is the most frequent structure annotated with feature
#1 (a rate of 0.403 in Oliver Twist), a noun dependent on a preceding noun is also common
(0.195), as in the following: “. . . an unwonted allowance of beer. . .”. Here, the singular beer
is dependent on allowance. One should also be aware that on rare occasions (0.018), the
word annotated with feature #1 is itself a verb instead of the more usual noun (0.871) or
pronoun (0.107): “I never knew how bad she was. . .”. In this sequence, was is considered
the head word of the indirect question clause how bad she was; UD grammar considers that
the clause is dependent on the verb knew, which precedes it.
When interpreting a feature such as #1, which reflects more than a single grammatical
value, it is a good idea to establish the relative contribution made by the components to the
combined frequency. For example, both elements of feature #1 are more frequent in Oliver
Twist than in the remainder of the corpus. Words annotated with singular: OT = 0.3171 and
corpus = 0.308; words annotated with parent precedes: OT = 0.3394 and corpus = 0.3158. At
the same time, the frequency of the combination of the two components (OT 0.17) is much
higher than would be expected based on the parts (0.107). Thus, in a study of the style of
the Dickens work, both feature #1 and its parts would be worthy of further analysis.
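The independence baseline used here is simply the product of the two component frequencies; the quick check below reproduces the figures quoted in the text:

```python
# Worked check of the expected joint frequency under independence,
# using the Oliver Twist component frequencies quoted above.
p_singular = 0.3171          # "number is singular"
p_parent_precedes = 0.3394   # "parent precedes"

expected = p_singular * p_parent_precedes
print(round(expected, 3))    # ~0.108, matching the ~0.107 cited, up to rounding
# The observed joint frequency of feature #1 (~0.17) is well above this,
# so the combination carries information beyond its parts.
```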
Feature #2 reflects a deep level of sentence structure since a word’s annotation is based
on its dependency parent and “grandparent”. For example, in the sentence “If he could
have known that. . . perhaps he would have cried the louder”, known, the head word of
the conditional clause, is the parent of the annotated word could; cried, the main verb of
the sentence, is the parent of known. Since cried follows known, feature #2 is appropriately
applied to could.
Feature #3, based simply on the dependency distance of the parent of the annotated
word, should need no illustration. It is worth noting that, as for feature #1, both components
of feature #3 are elements “preferred” by Oliver Twist. The frequency of words annotated
with “parent is singular” is OT = 0.401 and corpus = 0.376; for “DD of parent is 2”, OT = 0.134
and corpus = 0.129. Based on the individual frequencies, the expected rate of occurrence
for the combination is 0.0537 for Oliver Twist. The actual rate is almost one-third larger.
Feature #4 is simple but is an example of the importance that very elementary syntactic
structures can have in drawing stylometric distinctions. Essentially, this feature indicates
the number of singular nouns that are modified by an article, either definite (the) or
indefinite (a/an). Again, it is informative to analyze a compound feature according to
its components. In this instance, Oliver Twist displays a preference for singular nouns
(OT: 0.1534; corpus: 0.1404), but much of the distinctive force of feature #4 cannot be
explained in terms of the frequencies of singular nouns. When feature #4 is controlled
for the number of such words, the ratio of articles per singular noun is OT = 0.5182
and corpus = 0.4558. Clearly, a high frequency of articles is a stylistic characteristic of
Oliver Twist.
Feature #5 is grammatically more complex. To interpret it, one must be familiar with
two important aspects of dependency grammar: (1) verb valency and (2) functional syntax.
According to dependency grammar in general, sentences are structures “built” according
to the “requirements” of their component verbs. The primary requirement is the valency
of a verb. Simply, “valency” in the appropriate sense is the number of dependents that
are necessary to make a verb syntactically and semantically “correct”. Such necessary
components are called “arguments” of a given verb. For example, consideration of the
sentence “Caesar died in Rome” shows that die here has a valency of one. If we subtract
the dependencies, we produce “*died in Rome” and “Caesar died”. Only the second is
acceptable and indicates that died has a valency of one. A bivalent verb can be seen in
“Brutus killed Caesar in Rome”. Kill requires both Brutus and Caesar, but not in Rome. The
argument of a monovalent verb is called the verb’s subject; for bivalent verbs such as kill,
one dependency is labeled as the verb’s subject, the other as its object.
“Functional syntax” refers to the theory according to which dependencies are labeled
primarily according to the role that they play with respect to their parent word. Less
importance is given to the internal characteristics of the dependency. Consider the sentence
“She put money in the bank”. The verb put is trivalent since it requires a subject (she),
an object (money), and a third expression (in the bank) indicating a place/goal. This third
expression in the case of trivalent verbs is also often called an object (or second object).
The primacy of function is evident here in that the fact that in the bank is a prepositional
phrase does not affect its dependency label. Its internal structure is irrelevant. Compare the
sentence “She put money there”. The dependency relationships in this version would be
the same as in the first example. Although there in the second sentence is an adverb, it, like
in the bank, is correctly annotated as object. The two expressions serve the same function
with respect to put, and therefore receive the same relationship annotation.
The functional emphasis shown by dependency grammar reduces the number of
dependency relation labels and, at the same time, groups a range of internally different
phenomena in the same category. A few examples may help to clarify the phenomena that
are reflected by feature #5. Under the label “object”, UDPipe includes primarily nouns
and adjectives. An example of a noun object is “Give it a little gruel. . .”, where little is the
annotated word and gruel its parent; gruel is the second argument (here a direct object) of
give. One of Dickens’s most famous sentences provides an example of an adjective in the
object function: “‘Please, sir’, replied Oliver, ‘I want some more’”. Here, more is the direct
object of want, and some (the annotated word) is a dependent modifier of the adjective more.
The word characterized by feature #5, in distinction to its object dependency parent,
can also represent varying grammatical phenomena. To take only verbs, we find examples
like “I shall take a early opportunity of mentioning it . . .”. The annotated mentioning is a
verb in its gerund form and is dependent on opportunity, the direct object of take. A different
phenomenon is represented by “Bumble wiped from his forehead the perspiration which
his walk had engendered. . .”. The annotated engendered is the verb of the relative clause
describing (and therefore dependent on) perspiration. UD grammar considers the verb the
head of a relative clause, and therefore, engendered is the direct dependent of perspiration,
which in turn is the direct object of the main verb wiped. Yet another difference is apparent
in “The boy had no friends to care for. . .”, where the annotated care is part of an explanatory
infinitive structure which specifies the meaning of friends, the direct object of had.
This brief discussion of the dependency relationship object should make clear that the dependency labels are the most complicated annotation in the morphosyntactic data set created
by UDPipe. However, since their complexity arises primarily from the grouping together of
different grammatical “types” according to their grammatical “function”, the interpretation of
the relevant input features is time-consuming rather than conceptually difficult.
In addition to input features that are strongly “preferred” in Oliver Twist, there are
others that are sharply “avoided”. We will look only at three of the most important, as
given in Table 9.
Table 9. Input features “avoided” in Oliver Twist.

#    Feature                        Frequency, Oliver   Frequency (Mean of Corpus)   Frequency Rank   Z-Score
1B   Parent is an infinitive verb   0.076               0.10                         43               −1.71
2B   Personal pronoun               0.085               0.114                        43               −1.66
3B   Plural                         0.041               0.051                        42               −1.22
Feature #1B represents dependencies of the infinitive form of the verb. The English
infinitive is morphologically the same as the dictionary lemma. It primarily occurs in one of
two configurations. An infinitive can be “introduced” by the particle to, as in the following
examples: “I have come out myself to take him there”; and “. . . the parish would like him
to learn a right pleasant trade . . .”. It is apparent that to plus the infinitive has a wide range
of syntactic functions. In the first example, to take expresses the purpose for which the
action of the main verb was undertaken. The infinitive phrase can be deleted from the
sentence without making it ungrammatical. In contrast, the syntax (and semantics) of like in
the second example requires an object; to learn performs that necessary function and cannot
be omitted without producing incorrect grammar. Note that in both example sentences,
the infinitives have three words directly dependent on them (marked in bold). All of these
words, then, would be correctly coded with feature 1B.
The second common configuration for the English infinitive is to be “introduced”
by certain modal and auxiliary verbs. Examples occur in the following passages: “Do I
understand that he asked for more. . .?”; “. . . [I]t may be as you say. . .”. In the first example,
the infinitive is understand which forms a verb phrase with the auxiliary do. In the second,
the infinitive is be, a usage sometimes called complementary since the infinitive is necessary
to complete the structure implied by a modal (here may). The bold-faced words in each
example again indicate direct dependents of the infinitives, as required for feature 1B.
From these examples and the brief discussion, it should be clear that English infinitives
in their most frequent structures cannot appear without at least one direct dependent.
According to the rules of UD, the particle to is considered a dependent of the infinitive it
precedes. Likewise, auxiliaries and modals such as those in our second set of examples are
annotated as immediate dependents of the infinitives with which they are associated. It will
not be surprising, then, to learn that, on average, each infinitive has more direct dependents
(OT: 2.516; corpus: 2.795) than finite verbs (OT: 1.752; corpus: 1.628). At the same time,
these numbers indicate a sharp difference between Oliver Twist and the rest of the corpus
with respect to the complexity of infinitive clauses. The increase in the average number of
dependencies from finite verbs to infinitives is much smaller than we might expect, given
that infinitives “automatically” come with at least one dependent word: dependency per
word increases by 1.167 for the corpus, but just by 0.764 for Oliver Twist. Thus, measured
by the number of dependencies, infinitive structures in OT are less complex than we would
expect based on finite structures in the same novel.
The grammatical categories reflected in features 2B and 3B are self-evident and require no examples. We only note that the relative avoidance of personal pronouns (I,
you, she, he, it, etc.) in Oliver Twist is no doubt associated with the same novel’s relative
preference for common and proper nouns: OT = 0.245 and corpus = 0.218. Relevant for
the interpretation of feature 3B is the fact that the frequency of all words annotated with
grammatical number—nouns, pronouns, verbs and a few determiners (this/these, etc.)—is
lower for Oliver Twist than for the remainder of the corpus (OT = 0.358; corpus = 0.364).
This difference, however, only partly explains OT’s relative avoidance of feature 3B. In fact,
the distribution of grammatical number within this subset of relevant parts of speech leans
strongly toward the singular as compared to the rest of the corpus (OT: singular = 0.884
and plural = 0.115; corpus: singular = 0.857 and plural = 0.142).
As noted above, comparison of the classification results in Tables 6 and 7 reveals
that for every corpus tested, the signal identifying each individual novel is much more
discernable than the authorial signal. While truly understanding this phenomenon—the
coexistence of “local” variability and “authorial” style—will no doubt require many years
of intensive study, stylometrics at the morphosyntactic level can offer valuable data bearing
on this issue.
A detailed discussion is beyond the limits of this paper, but a single straightforward
example will serve as a useful illustration. The accuracy values given in Table 7 are averages
that encompass a great deal of variation. The authorial signal for some writers in each
corpus was very weak; other authors were comparatively quite easy to distinguish. To take
the corpus of English language novels, the most distinguishable author, with the highest
mean accuracy of classification, was E. M. Forster. Subsamples of Forster’s novels were
attributed to the author with an accuracy of about 85%. (Forster’s works in the corpus are
Where Angels Fear to Tread (1903), A Room with a View (1908) and Howards End (1910)). At the
other extreme, accuracy for the works of Vernon Lee was generally less than 1%. (Vernon
Lee was a nom de plume for the writer Violet Paget. The relevant works in the corpus are The
Countess of Albany (1884), Miss Brown (1884) and Penelope Brandling (1903).) The authorial
signal for Charles Dickens, whose Oliver Twist was the focus of our discussion of input
features, was squarely in the middle at about 50% (in addition to Oliver Twist (1839), the
corpus also contains Dickens’s Bleak House (1853) and Great Expectations (1861)).
As one might expect, the algorithm’s success at discriminating authorial signals is to
some degree correlated with stylometric consistency within the works of each author. In
other words, authors whose works show greater variability in the values of the input
features tend to be more difficult to attribute correctly. Table 10 gives
a distribution summary for the “intra-author” standard deviations of each morphosyntactic
input feature.
Table 10. Summary of standard deviations of input features for selected authors.

Author    Min.      1st Quart.   Median   Mean     3rd Quart.   Max.
Forster   0.00005   0.0014       0.0027   0.0037   0.0050       0.0174
Dickens   0.00027   0.0040       0.0066   0.0072   0.0099       0.0248
Lee       0.00037   0.0088       0.0130   0.0149   0.0185       0.0733
Each value in Table 10 is based on the standard deviation of the three works of each
author for each of the 653 input features used in the English corpus. It is evident that the
works of Lee are much less consistent with each other than the works of authors with better
classification results. Lee’s weak signal is not unexpected, given that the mean standard
deviation in her works is more than four times larger than Forster’s, while Lee’s median
value approaches five times Forster’s!
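The quantities in Table 10 can be computed per author as follows; this is an illustrative reconstruction, not the study's code:

```python
# For one author: take the standard deviation of each feature's frequency
# across that author's three novels, then summarize the distribution of
# those per-feature deviations (653 features in the English corpus).
import numpy as np

def intra_author_sd_summary(freqs_three_works):
    """freqs_three_works: 3 x n_features matrix of relative frequencies."""
    sds = freqs_three_works.std(axis=0, ddof=1)
    q1, med, q3 = np.percentile(sds, [25, 50, 75])
    return {"min": sds.min(), "1st_quart": q1, "median": med,
            "mean": sds.mean(), "3rd_quart": q3, "max": sds.max()}
```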
That inconsistency within a class is associated with a high noise-to-information ratio
and with difficulty in discrimination will surprise few people familiar with classification
experiments. On the other hand, extensive use of morphosyntactic annotation is
rare in stylometric studies, and the data reported in Table 10 suggest that such information
would indeed be useful in exploring stylometric variation for an individual author. In
addition, because it preserves a relatively high amount of information even in short texts
(see the 500-word samples in Tables 6 and 7), a morphosyntactic approach may even be
effective in describing stylometric variability within a single work or a single chapter of
a work.
We will conclude our investigation into morphosyntactic stylometry by returning to
the work of Charles Dickens. Taken together, the works of Dickens in our corpus produce,
as mentioned above, a moderate authorial signal. A few of the important input features
through which the logistic regression model distinguishes “Dickens” from the other authors
in the corpus are given in Table 11.
Table 11. Selected input features preferred or avoided by the class “Dickens”.

#    Feature                                            Mean Freq.          Mean Freq.           Intra-Author   Z-Score
                                                        (Dickens’s Works)   (Remaining Corpus)   Std. Dev.      (Dickens’s Works)
1C   Parent precedes                                    0.336               0.314                0.0047         1.66
2C   Parent is a verb with DD > 2                       0.136               0.107                0.0091         1.58
3C   Parent is a verb and head of an adverbial clause   0.069               0.055                0.0065         1.41
4C   Adjective                                          0.064               0.071                0.0056         −0.98
5C   Parent is sentence root                            0.183               0.218                0.0082         −0.93
In Table 11, the intra-author standard deviation indicates that for each selected feature,
the frequencies in the three works of Dickens in the corpus are quite close to each other.
With regard to the z-scores, a distinction is noticeable between these and the scores in
Tables 8 and 9. The values in Table 11 indicate a selection of features with smaller differences
between the class of interest (here “Dickens”) and the corpus mean. This is not an accident
of selection, since relatively large z-scores are less frequent for the mean of Dickens’s three
relevant works than for Oliver Twist. For example, for OT, 52 input features had a z-score
with a magnitude greater than 1.5 (positive or negative). The corresponding count for the
three-work mean is 9.
To turn to the details of the input features, feature 1C needs no further elaboration;
any word that comes after its dependency parent in the linear order of the sentence is
encoded with this feature. Feature 2C should likewise be self-explanatory at this point. The
frequency of feature 4C is based on a simple count of parts of speech: a relative avoidance
of adjectives is a shared characteristic of the three Dickens texts.
Features 3C and 5C contain types of dependency relationships and therefore require
some explanation. In the UD annotation scheme, a sharp distinction is made between
words that function as arguments to a verb (see above) and words that do not. Words that
are not arguments are optional in the sense that they can be omitted without rendering the
sentence ungrammatical (or nonsensical). Adverbial is the label for the most important class
of “optional” words. If the word with this function is a verb, it is labeled as an adverbial
clause. Two examples will point to the many possible ways that a verb can function as
an adverbial for another clause: (1) “If he could have known that he was an orphan, . . .
perhaps he would have cried the louder”; (2) “But he hadn’t, because nobody had taught
him”. In sentence 1, known is the head verb of the conditional clause that is subordinate to
the main verb cried. In sentence 2, taught is the head of a causal clause, itself dependent on
hadn’t. In both sentences, words annotated with “parent is head of an adverbial clause” are
highlighted in bold.
The last of our exemplary “Dickensian” input features, feature 5C, indicates the
frequency of sentence main clauses. Generally, the root of a sentence is a label given to
the main verb, but a peculiarity of UD in this regard may be illustrated by the following
example: “Boys is wery obstinit. . .”. In equational structures such as this one, where the
subject is “linked” with a predicate nominal (here obstinit) by a copula verb (be and similar
verbs), UD grammar considers the predicate nominal, and not the verb, to be the head of
the clause. Thus, the bold-faced words in the example are annotated as dependents of the
adjective. As a result of this protocol, a not insignificant portion of sentence roots in the UD
scheme are nouns, pronouns and adjectives.
The final step in our discussion of the “Dickensian” authorial signal is to give a very
few examples of input features that weaken that signal. In particular, this is a selection of
features for which the values are relatively diverse across the three Dickens novels in the
corpus. Details are given in Table 12.
Table 12. Selected input features where frequency variability weakens the “Dickens” signal.
BH = Bleak House, GE = Great Expectations and OT = Oliver Twist.

#    Feature                          Frequency (Dickens)               Frequency (Remainder of Corpus)   Intra-Author s.d.   Frequency Rank
1D   Personal pronoun                 BH: 0.117; GE: 0.134; OT: 0.085   0.113                             0.020               BH: 20; GE: 7; OT: 43
2D   Parent is singular               BH: 0.384; GE: 0.355; OT: 0.401   0.376                             0.0188              BH: 20; GE: 35; OT: 8
3D   Parent is past indicative verb   BH: 0.137; GE: 0.182; OT: 0.150   0.136                             0.0188              BH: 24; GE: 3; OT: 13
4D   Parent is verb                   BH: 0.414; GE: 0.449; OT: 0.409   0.403                             0.0179              BH: 19; GE: 3; OT: 6
The morphosyntactic phenomena underlying these features should by now be clear
and examples unnecessary. In fact, we have already looked closely at feature 1D, which
appeared as feature 2B in Table 9. There, the avoidance of personal pronouns was identified
as a distinguishing characteristic of Oliver Twist. Here, we see that this characteristic is
not shared by the other two relevant works, in each of which personal pronouns are more
frequent than in the corpus mean. Thus, unsurprisingly, the same morphosyntactic feature
can be informative at one level of classification (e.g., novel by novel) and simultaneously
increase noise at another (e.g., author by author).
The features in Table 12 disrupt the authorial signal for “Dickens” because of how
much the frequencies vary among the three Dickens novels in the corpus. Comparison of
the “author-internal” standard deviations in Tables 11 and 12 shows that the dispersal of the
features in Table 12 is 1.9 to 4.2 times greater than in Table 11. Works with such a range of
frequencies for any given input feature will hinder the detection of an authorial signal. At
the same time, it is important to realize that a tight “grouping” of frequencies for a feature
is not in itself enough to make that feature informative for classification. For example, there
are many morphosyntactic input features in our set for which the “Dickensian” standard
deviation is quite small but whose frequencies are very close to the corpus mean. Two
such features are “part-of-speech is verb” and “part-of-speech is a preposition and parent
is a noun”. For the first of these, the mean frequency for the three Dickens novels is 0.134
(sd = 0.0043) and the corpus mean is 0.130. For the second, a feature that essentially reflects
the number of prepositional phrases in the texts, the frequency for Dickens’s works is 0.079
(sd = 0.0018), while the corpus mean is 0.078. Features with frequencies in this pattern
are generally not valuable for logistic regression. From a stylometric or stylistic point of
view, however, it may be just as interesting to know where Dickens’s morphosyntactic
characteristics adhere to the norm as where they depart sharply from it.
6. Conclusions
This paper has presented arguments for the potential value of morphosyntactic annotation for stylometric analysis. It has demonstrated that UDPipe parsers currently available
for many languages produce annotations whose inevitable errors do not seriously undermine the stylometric usefulness of this information, judged by accuracy in classification
experiments. Based on the assumption that morphosyntactic characteristics are not closely
dependent on the specific subject matter of the target texts, the input features described
in this study are to a significant degree topic agnostic. Further, this work has explored
the advantage offered by morphosyntax in terms of stylometric interpretability. The input
features used here are, for the most part, made up of grammatical concepts likely to be
familiar to anyone seriously investigating the style of a literary work or author. Admittedly,
the concepts underlying dependency grammar may be new to many investigators, but the
syntax of natural languages is itself a complex structure. Dependency grammar, assuming
no “hidden” structures, reflects this complexity in a fairly straightforward way. In view of
the demonstrated advantages of morphosyntactic information, it seems clear that it should
have a larger role in stylometric scholarship.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The raw data supporting the conclusions of this article will be made
available by the authors on request.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Mosteller, F.; Wallace, D.L. Inference in an Authorship Problem. J. Am. Stat. Assoc. 1963, 58, 275–309. [CrossRef]
2. Koppel, M.; Schler, J.; Bonchek-Dokow, E. Measuring Differentiability: Unmasking Pseudonymous Authors. J. Mach. Learn. Res. 2007, 8, 1261–1276.
3. Coyotl-Morales, R.M.; Villaseñor-Pineda, L.; Montes-y-Gómez, M.; Rosso, P. Authorship Attribution Using Word Sequences. In Progress in Pattern Recognition, Image Analysis and Applications; Martínez-Trinidad, J.F., Carrasco Ochoa, J.A., Kittler, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 844–853.
4. Rochon, E.; Saffran, E.M.; Berndt, R.S.; Schwartz, M.F. Quantitative Analysis of Aphasic Sentence Production: Further Development and New Data. Brain Lang. 2000, 72, 193–218. [CrossRef]
5. Kestemont, M. Function Words in Authorship Attribution from Black Magic to Theory. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, Gothenburg, Sweden, 27 April 2014; pp. 59–66.
6. Koppel, M.; Schler, J.; Argamon, S. Computational Methods in Authorship Attribution. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 9–26. [CrossRef]
7. Ranaldi, L.; Pucci, G. Knowing Knowledge: Epistemological Study of Knowledge in Transformers. Appl. Sci. 2023, 13, 677. [CrossRef]
8. Revesz, P.Z. A vowel harmony testing algorithm to aid in ancient script decipherment. In Proceedings of the 24th International Conf. on Circuits, Systems, Communications and Computers, Chania, Greece, 19–22 July 2020; IEEE Press: New York, NY, USA, 2020; pp. 35–38.
9. VanOrsdale, J.; Chauhan, J.; Potlapally, S.V.; Chanamolu, S.; Kasara, S.P.R.; Revesz, P.Z. Measuring vowel harmony within Hungarian, the Indus Valley Script language, Spanish and Turkish using ERGM. In Proceedings of the 26th International Database Engineered Application Symposium, Budapest, Hungary, 13 September 2022; pp. 171–174.
10. Burke, M. Stylistics: From classical rhetoric to cognitive neuroscience. In The Routledge Handbook of Stylistics; Burke, M., Ed.; Routledge Handbooks in English Language Studies; Routledge, Taylor & Francis Group: London, UK; New York, NY, USA, 2014.
11. Eder, M.; Górski, R.L. Stylistic Fingerprints, POS-tags, and Inflected Languages: A Case Study in Polish. J. Quant. Linguist. 2022, 30, 86–103. [CrossRef]
12. Nivre, J.; De Marneffe, M.C.; Ginter, F.; Goldberg, Y.; Hajic, J.; Manning, C.D.; McDonald, R.; Petrov, S.; Pyysalo, S.; Silveira, N.; et al. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 1659–1666.
13. De Marneffe, M.C.; Manning, C.D.; Nivre, J.; Zeman, D. Universal dependencies. Comput. Linguist. 2021, 47, 255–308. [CrossRef]
14. Wijffels, J. Udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. 2019. Available online: https://CRAN.R-project.org/package=udpipe (accessed on 1 January 2024).
15. Liu, H. Dependency distance as a metric of language comprehension difficulty. J. Cogn. Sci. 2008, 9, 159–191.
16. Chen, R.; Deng, S.; Liu, H. Syntactic complexity of different test types: From the perspective of dependency distance both linearly and hierarchically. J. Quant. Linguist. 2021, 29, 510–540. [CrossRef]
17. Ferrer-i-Cancho, R.; Gómez-Rodríguez, C.; Esteban, J.L.; Alemany-Puig, L. Optimality of syntactic dependency systems. Phys. Rev. E 2022, 105, 014308. [CrossRef]
18. Chen, X.; Gerdes, K. Classifying languages by dependency structure: Typologies of delexicalized universal dependency treebanks. In Proceedings of the Fourth International Conference on Dependency Linguistics, Pisa, Italy, 18–20 September 2017; pp. 54–63.
19. Eder, M. Short Samples in Authorship Attribution: A New Approach. In Proceedings of the Digital Humanities 2017, Conference Abstracts, 8–11 August 2017; McGill University: Montreal, QC, Canada, 2017; pp. 221–224. Available online: https://dh2017.adho.org/abstracts/341/341.pdf (accessed on 1 January 2024).
20. Hay, J.; Doan, B.L.; Popineau, F.; Elhara, O.A. Representation Learning of Writing Style. In Proceedings of the Sixth Workshop on Noisy User-Generated Text (W-NUT 2020), Online, 19 November 2020. Available online: https://aclanthology.org/2020.wnut-1.30 (accessed on 1 January 2024).
21. Wegmann, A.; Schraagen, M.; Nguyen, D. Same Author or Just Same Topic? Towards Content-Independent Style Representations. arXiv 2022, arXiv:2204.04907.
22. Patel, A.; Rao, D.; Callison-Burch, C. Learning Interpretable Style Embeddings via Prompting LLMs. arXiv 2023, arXiv:2305.12696.
23. Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. Liblinear: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
24. Helleputte, T. LiblineaR: Linear Predictive Models Based on the Liblinear C/C++ Library. R Package Version 2.10-12. Available online: https://www.csie.ntu.edu.tw/~cjlin/liblinear/ (accessed on 1 January 2024).
25. Simon, R. Resampling strategies for model assessment and selection. In Fundamentals of Data Mining in Genomics and Proteomics; Dubitzky, W., Granzow, M., Berrar, D., Eds.; Springer: Boston, MA, USA, 2007; pp. 173–186.
26. Eder, M. Does size matter? Authorship attribution, short samples, big problem. In Digital Humanities 2010: Conference Abstracts; King’s College London: London, UK, 2010; pp. 132–135.
27. Eder, M. Does size matter? Authorship attribution, small samples, big problem. Digit. Scholarsh. Humanit. 2015, 30, 167–182. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
A Benchmark Dataset to Distinguish Human-Written and
Machine-Generated Scientific Papers †
Mohamed Hesham Ibrahim Abdalla ‡, Simon Malberg ‡, Daryna Dementieva *, Edoardo Mosca
and Georg Groh *
School of Computation, Information and Technology, Technical University of Munich, 80333 Munich, Germany;
[email protected] (M.H.I.A.);
[email protected] (S.M.);
[email protected] (E.M.)
* Correspondence:
[email protected] (D.D.);
[email protected] (G.G.)
† This paper is a substantially extended and revised version of research published in Mosca E.; Abdalla M.H.I.;
Basso P.; Musumeci M.; and Groh G. Distinguishing Fact from Fiction: A Benchmark Dataset for Identifying
Machine-Generated Scientific Papers in the LLM Era. In Proceedings of the 3rd Workshop on Trustworthy
Natural Language Processing (TrustNLP 2023), pages 190–207, Toronto, Canada. Association for
Computational Linguistics.
‡ These authors contributed equally to this work.
Abstract: As generative NLP can now produce content nearly indistinguishable from human writing,
it is becoming difficult to identify genuine research contributions in academic writing and scientific
publications. Moreover, information in machine-generated text can be factually wrong or even
entirely fabricated. In this work, we introduce a novel benchmark dataset containing human-written
and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica, as well
as papers co-created by humans and ChatGPT. We also experiment with several types of classifiers—
linguistic-based and transformer-based—for detecting the authorship of scientific text. A strong focus
is put on generalization capabilities and explainability to highlight the strengths and weaknesses
of these detectors. Our work makes an important step towards creating more robust methods for
distinguishing between human-written and machine-generated scientific papers, ultimately ensuring
the integrity of scientific literature.
Keywords: text generation; large language models; machine-generated text detection
Citation: Abdalla, M.H.I.; Malberg, S.; Dementieva, D.; Mosca, E.; Groh, G. A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers. Information 2023, 14, 522. https://doi.org/10.3390/info14100522
Academic Editor: Peter Revesz
Received: 31 July 2023; Revised: 13 September 2023; Accepted: 14 September 2023; Published: 26 September 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
Generative Natural Language Processing (NLP) systems—often based on Large Language
Models (LLMs) [1–3]—have experienced significant advancements in recent years, with state-of-the-art algorithms generating content that is almost indistinguishable from human-written text [1,4–7]. This progress has led to numerous applications in various fields,
such as chatbots [8], automated content generation [9], and even summarization tools [10].
However, these advancements also raise concerns regarding the integrity and authenticity
of academic writing and scientific publications [11,12].
It is indeed increasingly difficult to differentiate genuine research contributions from
artificially generated content. Moreover, we are at an increased risk of including factually
incorrect or entirely fabricated information [13,14]. Reliably identifying machine-generated
scientific publications becomes, thus, crucial to maintaining the credibility of scientific
literature and fostering trust among researchers.
This work introduces a novel benchmark to address this issue. Our contribution—also
briefly sketched in Figure 1—can be summarized as follows:
(1)
We present a dataset comprising human-written and machine-generated scientific
documents from various sources: SCIgen [15], GPT-2 [4], GPT-3 [1], ChatGPT [8],
and Galactica [16]. We also include a set of human–machine co-created documents
resembling scientific documents with both human-written and machine-paraphrased
texts. Each document includes an abstract, introduction, and conclusion in a machine-readable
format. While real titles were used to generate articles for the dataset, there
is no title intersection between real and machine-generated papers in our dataset.
(2)
We experiment with several classifiers—bag-of-words-based classifiers, RoBERTa [17],
Galactica [16], GPT-3 [1], DetectGPT [18], ChatGPT [8], and a proposed novel classifier
learning features using an LLM and Random Forest [19]—assessing their performance
in differentiating between human-written and machine-generated content. We also
assess each classifier’s generalization capabilities on out-of-domain data and human–
machine co-created papers to obtain a more accurate estimate of the likely real-world
performance of the different classifiers.
(3)
We explore explainability insights from different classifiers ranging from word-level
explanations to more abstract concepts to identify typical differences between human-written
and machine-generated scientific papers.
Figure 1. This work’s overview. Six methods are used to machine-generate papers, which are then
mixed with human-written ones to create our benchmark dataset. Seven models are then tested as
baselines to identify the authorship of a given output.
We release our benchmark dataset, baseline models, and testing code to the public
to promote further research and aid the development of more robust detection methods
(https://huggingface.co/datasets/tum-nlp/IDMGSP) (accessed on 31 July 2023). This
work extends a previously published study [20].
2. Related Work
2.1. Machine-Generated Text Detection Benchmarks
Since the significant improvement of text generation models, the potential dangers
and harms of machine-generated text have been acknowledged by NLP researchers. For this
reason, successive generations of generative models have been used to create texts in various
domains to compile human-written vs. machine-generated benchmarks.
One of the first datasets and models for detecting neural-generated texts in the news domain
was Grover [7]. The Grover model for neural news generation was based on GPT-2 [4]
and was used to create a benchmark for neural news detection. In addition, for the news
domain, a dataset for automatic detection of machine-generated news headlines was
created [21]. The machine-generated headlines were also created with GPT-2. Beyond fake
news, the detection of generated scientific articles received attention as well, leading to the
first task benchmarks introduced in [22].
With the increasing capabilities of LLMs, new benchmarks appeared recently, covering
several neural text generators and domains. In [23], the M4 (multi-generator, multi-domain,
and multi-lingual) dataset was presented. It covers various kinds of topics—Wikipedia articles, question-answering posts, news, and social posts—in six languages. In the MGTBench
benchmark [24], LLMs were evaluated on several different question-answering datasets.
Finally, the DeepfakeTextDetect dataset [25] covers news article writing, story generation,
scientific writing, argument generation, and question-answering.
2.2. Scientific Publication Corpora: Human and Machine-Generated
The ACL Anthology (https://aclanthology.org (accessed on 24 April 2023)) [26]
and arXiv [27] are widely used resources for accessing scientific texts and their associated metadata. However, these databases do not provide structured text for scientific
documents, necessitating the use of PDF parsers and other tools to extract text and resolve
references. Several efforts have been made to develop structured text databases for scientific
documents [28–30].
Despite progress in generating text, machine-generated datasets for scientific literature
remain limited. A recent study by Kashnitsky et al. [31] compiled a dataset including
shortened, summarized, and paraphrased paper abstracts and excerpts, as well as text
generated by GPT-3 [1] and GPT-Neo [32]. The dataset lists retracted papers as machine-generated, which may not always be accurate, and only includes excerpts or abstracts of
the papers.
Liyanage et al. [22] proposed an alternative approach, in which they generated papers
using GPT-2 [4] and arXiv-NLP (https://huggingface.co/lysandre/arxiv-nlp) (accessed
on 24 April 2023). However, their dataset was limited to only 200 samples, which were
restricted to the fields of Artificial Intelligence and Computation and Language.
2.3. Generative NLP for Scientific Articles
Generative NLP for scientific publications has evolved significantly in recent years.
Early methods, such as SCIgen [15], used Context-Free-Grammars (CFG) to fabricate computer science publications. These often contain nonsensical outputs due to CFG’s limited
capacity for generating coherent text.
With the advent of attention, transformers [33] and LLMs [1] have paved the way
for more sophisticated models capable of generating higher-quality scientific content.
Some known (both open-sourced and closed) LLMs—such as GPT-3 [1], ChatGPT [8],
Bloom [2], LLaMa-2 [6], and PaLM-2 [34]—are built for general purposes. Others, instead,
are domain-specific and specialized for generating scientific literature. Popular examples
in this category are SciBERT [35] and Galactica [16].
Both general and domain-specific models have shown outstanding results in various scientific tasks, demonstrating their potential to generate coherent and contextually
relevant scientific text. Consequently, the same technology has been applied to other
domains, including writing news articles [7], producing learning material [36], and creative
writing [37]. Moreover, in education, advanced LLMs have already shown promising results in providing “live“ help in the teaching process [38]. For such use cases, it is
important to develop trustworthy machine-generation technologies that both provide
factually correct information and communicate fluently with users.
2.4. Detection of Machine-Generated Text
The ability to automatically generate convincing content has motivated researchers
to work on its automatic detection, especially given its potential implications for various
domains.
Several approaches to detecting machine-generated text have emerged, employing
various techniques. In [39], a survey of the methods for machine-generated text detection
was presented. One solution is utilizing hand-crafted features [40]. In addition, linguisticbased and bag-of-words features can be quite powerful and well-explainable baselines [41].
The topology of attention masks was shown to be an effective means of detecting
neural-generated texts in [42]. Finally, neural features in combination with supervised models can be trained to distinguish between human and machine-generated content [41,43,44].
Alternative approaches explore using the probability distribution of the generative
model itself [18] or watermarking machine-generated text to facilitate detection [45].
2.5. Detection of Machine-Generated Scientific Publications
As we have seen in Section 2.4, several general-purpose solutions exist aiming to
detect NLP-generated text. The detection of automatically generated scientific publications,
however, is an emerging subarea of research with a large potential for improvement.
Previous approaches have primarily focused on identifying text generated by SCIgen [15]
using hand-crafted features [46,47], nearest neighbor classifiers [48], and grammar-based detectors [49]. More recent studies have shown promising results in detecting LLM-generated
papers using SciBERT [50], DistilBERT [51], and other transformer-based models [22,52].
Nonetheless, these approaches have mostly been tested only on abstracts or a substantially
limited set of paper domains.
With the appearance of ChatGPT [8], several studies were dedicated to evaluating how
good this model can be in generating scientific papers. In [53], it was shown that human
annotators are incapable of identifying ChatGPT-generated papers. Since ChatGPT can not
only be used to generate papers from scratch but also to paraphrase them, a method to
identify the polish-ratio of ChatGPT in a piece of text was proposed in [54].
In the end, we can see the necessity of an explainable and robust detector able to
identify machine-generated and machine-edited articles from the most recent LLMs. With this work,
we aim to take a step towards the creation of such automated detectors.
3. Benchmark Dataset
In this section, we delve into the construction of our benchmark dataset, which
comprises human-written, machine-generated, and human–machine co-created scientific
papers. Often, for simplicity, we refer to these groups as real, fake, and co-created, respectively. In Section 3.1, we elaborate on the process we followed to extract data from the PDF
documents of real papers. In Section 3.2, we describe our prompting pipelines and how we
utilized various generators to produce fake scientific papers. In Section 3.3, we explain our
approach to generating human–machine co-created papers.
Table 1 offers an overview of our dataset, including sources and numbers of samples
and tokens.
Table 1. Data sources included in our dataset and their respective sizes.

Source                           Quantity   Tokens
arXiv parsing 1 (real)           12 k       13.4 M
arXiv parsing 2 (real)           4 k        3.2 M
SCIgen (fake)                    3 k        1.8 M
GPT-2 (fake)                     3 k        2.9 M
Galactica (fake)                 3 k        2.0 M
ChatGPT (fake)                   3 k        1.2 M
GPT-3 (fake)                     1 k        0.5 M
ChatGPT (paraphrased real)       4 k        3.5 M
Total real (extraction)          16 k       16.6 M
Total fake (generators)          13 k       8.4 M
Total co-created (paraphrased)   4 k        3.5 M
Total                            33 k       28.5 M
3.1. Real Papers Collection
To collect human-written—or real—scientific papers for our dataset, we source them
from the arXiv dataset [27] hosted on Kaggle (https://www.kaggle.com/datasets/Cornell-
University/arxiv (accessed on 24 April 2023)). We exclude scientific papers published after
ChatGPT (after November 2022) to avoid machine-generated papers leaking into our real
dataset. While it is still possible that some of the remaining papers were machine-generated,
we deem this to be highly unlikely and, if it occurs at all, to affect only a negligibly small
number of papers, given the lower accessibility and quality of generators before ChatGPT.
The arXiv dataset provides comprehensive metadata, including title, abstract, publication date, and category. However, the introduction and conclusion sections are not part of
the metadata, which implies the need for PDF parsing to extract these sections. From the
metadata, each paper’s ID and version are utilized to construct the document path and
retrieve the corresponding PDF from the publicly accessible Google Cloud Storage bucket.
Each PDF is then fed to the PyMuPDF [55] library to be parsed and to extract the relevant
content. Unfortunately, parsing PDFs is known to be very challenging. This is particularly
true for a double-column format, which many scientific papers have. Despite having tested
several heuristic rules to identify and extrapolate the correct sections, the process can still
fail at times. We discard data points where the parsing was unsuccessful.
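A minimal sketch of the extraction step, assuming PyMuPDF as named above; the section-splitting heuristics are omitted here, and failure handling is reduced to discarding the sample:

```python
# Minimal PyMuPDF extraction sketch: returns the concatenated page text,
# or None if parsing fails (such data points are discarded).
import fitz  # PyMuPDF

def extract_plain_text(pdf_path):
    try:
        with fitz.open(pdf_path) as doc:
            return "\n".join(page.get_text() for page in doc)
    except Exception:
        return None
```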
The resulting set includes 12,000 real papers. Furthermore, we collect an additional
4000 samples undergoing a different parsing procedure. The intention is to ensure there
are no recognizable parsing artifacts that inadvertently ease the detection process (see
Section 4).
3.2. Fake Papers Generation
For the fake component of our dataset, we employ several models to generate abstracts,
introductions, and conclusions based on scientific paper titles. The overview of the models
used for generation is illustrated in Figure 2. The titles of the real papers sourced from
the arXiv database (see Section 3.1) serve as prompts for the models to generate the target
sections—i.e., abstract, introduction, and conclusion.
(a) GPT-2 generation; (b) Galactica/GPT-3 generation; (c) ChatGPT generation.
Figure 2. Generation pipeline used for each model. For GPT-2 (a), the abstract, introduction, and conclusion sections are generated by three separately fine-tuned model instances, each based solely on
the paper title. In the case of Galactica and GPT-3 (b), each section is generated conditioned on the
previous sections. Finally, ChatGPT’s generation sequence (c) requires only the title to generate all
the necessary sections at once.
To create fake scientific papers, we fine-tune GPT-2 and GPT-3 instances [1,4] and
also leverage SCIgen [15], Galactica [16], and ChatGPT [8]. For each model—as shown
in Figure 2—we employ a unique prompting/querying strategy to produce the desired
paper sections.
This combination of models, ranging from CFG to state-of-the-art LLMs, aims to
generate a diverse set of artificially generated scientific papers. Concrete examples of
generated papers can be found in Appendix A.
3.2.1. SCIgen
Alongside the papers produced by the various LLMs, our fake dataset incorporates
documents generated by SCIgen [15]. Although the quality of CFG-generated text is
rather low and hence straightforward to identify, it remains relevant to ensure that current
detectors can distinguish machine-generated papers even if poorly written and containing
nonsensical content. Stribling and Aguayo [56] show that such papers have been accepted
in scientific venues in the past.
Prompting SCIgen is done simply by running it as an offline script (https://github.com/soerface/scigen-docker) (accessed on 24 April 2023), which generates all the needed sections,
including the title. The entire paper is produced in LaTeX format as a result.
3.2.2. GPT-2
We fine-tune three distinct GPT-2 base (https://huggingface.co/gpt2) (accessed on
24 April 2023) models (124 M parameters) [4] to individually generate each section based
on the given title. The models are trained in a seq2seq fashion [57], with the training
procedure spanning six epochs and incorporating 3500 real papers. We truncate inputs
exceeding 1024 tokens, potentially resulting in less coherent introductions and conclusions;
abstracts remain more coherent as they typically fall below this threshold. We release the
separately fine-tuned GPT-2 instances for the abstract (https://huggingface.co/tum-nlp/IDMGSP-GPT-2-ABSTRACT) (accessed on 31 July
2023), introduction (https://huggingface.co/tum-nlp/IDMGSP-GPT-2-INTRODUCTION)
(accessed on 31 July 2023), and conclusion (https://huggingface.co/tum-nlp/IDMGSP-GPT-2-CONCLUSION) (accessed on 31 July 2023) for public usage and investigation.
Hyperparameters: For training, we use a batch size of 16 across all six epochs. For
generation, we set max_new_tokens to 512, top_k to 50, and top_p to 0.5 for all three models.
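A minimal sketch of title-conditioned decoding with the released abstract model, using the stated sampling hyperparameters; the <|sep|> separator follows the prompts shown in Appendix A.3, and loading the checkpoint as a standard causal LM is an assumption:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tum-nlp/IDMGSP-GPT-2-ABSTRACT")
model = AutoModelForCausalLM.from_pretrained("tum-nlp/IDMGSP-GPT-2-ABSTRACT")

# Title-conditioned sampling; the <|sep|> separator mirrors Appendix A.3.
prompt = "Competitive Multi-Agent Load Balancing with Adaptive Policies in Wireless Networks <|sep|>"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, do_sample=True, max_new_tokens=512,
                        top_k=50, top_p=0.5, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))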
Post-processing: We remove generated "\n" characters and any extra sections not explicitly mentioned in the prompt. Additionally, we remove incomplete sentence fragments preceding
the start of a new sentence. These are common artifacts of GPT-2 and are easily
identifiable by their lowercase first letters.
Although our GPT-2 model is specifically fine-tuned for the task, generating long
pieces of text occasionally results in less meaningful content. Moreover, we observe that
decoupling the generation of sections can lead to inconsistencies among the generated
sections within the papers.
3.2.3. Galactica
Galactica is trained on a large corpus of scientific documents [16]. Therefore, it is
already well-suited for the task of generating scientific papers. To facilitate the generation of coherent long-form text, we divide the generation process into smaller segments,
with each section relying on preceding sections for context. For instance, while generating a conclusion, we provide the model with the title, abstract, and introduction as
concatenated text.
Hyperparameters: We use Galactica base (https://huggingface.co/facebook/galactica-1.3b) (accessed on 24 April 2023) (1.3 B parameters) [16] to generate each paper section
based on the previous sections. The complete set of hyperparameters can be found in
Table A1 in Appendix A. Additionally, we enforce left padding up to the maximum input
length. Given the limited model capacity, we restrict the number of output tokens to avoid
the hallucination risk introduced by long text generation.
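A minimal sketch of the first step of this cascade, using the generation hyperparameters from Table A1; loading Galactica through the Hugging Face OPT classes follows the model card, and the prompt layout mirrors Appendix A.2:

from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

# First step of the cascade: the abstract is conditioned on the title alone;
# later sections append the previously generated text to the prompt.
prompt = "Title: On the Global Structure of Hopf Hypersurfaces in Complex Space Form\nAbstract:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                        temperature=0.7, top_k=25, top_p=0.9,
                        no_repeat_ngram_size=10, early_stopping=True)
print(tokenizer.decode(output[0]))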
Post-processing: To ensure completeness and coherence in the generated text, we
devise a generation loop that assesses the quality of the output. For example,
if the generated text lacks an <EOS> (end-of-sequence) token, the model is prompted to
regenerate the text. Furthermore, we eliminate any special tokens introduced by Galactica
during the process.
While Galactica base has 1.3 B parameters, it is still smaller than ChatGPT, which
can result in less coherent outputs when generating longer text segments. As a result,
prompting the model to generate a specific section with preceding sections as context yields
better outcomes compared to providing only the title as context and requesting the model
to generate all three sections simultaneously.
3.2.4. ChatGPT
To generate a cohesive document, we prompt ChatGPT (https://help.openai.com/en/
articles/6825453-chatgpt-release-notes, release from 15 December 2022) [8] with "Write
a document with the title [TITLE], including an abstract, an introduction, and a conclusion",
substituting [TITLE] with the desired title utterance. ChatGPT's large size (20 B parameters)
and strong ability to consider context eliminate the necessity of feeding previous output
sections into the prompt for generating newer ones.
Hyperparameters: For the entire generation process, we use the default temperature
of 0.7.
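The December 2022 ChatGPT release predates a public API, so the following sketch re-creates the prompt through the later chat completions endpoint (pre-1.0 openai client); it is an approximation of the setup, not the exact tooling used:

import openai  # pre-1.0 openai client assumed

title = "Primitive Representation Learning for Scene Text Recognition"
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.7,  # default temperature, as stated above
    messages=[{
        "role": "user",
        "content": f"Write a document with the title {title}, "
                   "including an abstract, an introduction, and a conclusion",
    }],
)
paper = response["choices"][0]["message"]["content"]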
Despite not being explicitly trained for scientific text generation, ChatGPT can produce
extensive, human-like text in this domain. This capability likely stems from the model's
large size, the extensive datasets it was trained on, and the incorporation of reinforcement
learning from human feedback.
3.2.5. GPT-3
We fine-tune an instance of GPT-3 (text-curie-001, 6.7 B parameters) [1] with 178 real
samples. Output papers generated through an iterative cascade process (as with Galactica)
present a much higher quality than those forged in a single step (as with ChatGPT). Hence,
we opt for the former strategy for GPT-3.
Pre/Post-Processing: To force the generation of cleaner outputs, we add an <END>
token at the end of each input used for fine-tuning. GPT-3 mimics this behavior and
predicts this token as well, so we remove every token generated after <END>.
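A minimal sketch of this post-processing step (the function name is ours):

def truncate_at_end(generated: str, end_token: str = "<END>") -> str:
    # Keep only the text preceding the first <END> marker mimicked by GPT-3.
    return generated.split(end_token, 1)[0].rstrip()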
While still not on par with ChatGPT-generated outputs, we report a high quality for
GPT-3-crafted papers.
3.3. Co-Created Papers Generation
The co-created component of our dataset mimics papers written by humans and
models concurrently, a combination that is likely to appear in practice. That means texts
originally written by either a human or an LLM and subsequently extended, paraphrased,
or otherwise adjusted by the other. To create such papers at scale, we take a set of 4000 real
papers from our TEST dataset (see Table 2) and paraphrase them with ChatGPT [8]. To stay
within ChatGPT’s context length limits, we paraphrase each paper section—i.e., abstract,
introduction, and conclusion—in a separate prompt. We then construct co-created papers
with varying shares of human and machine input by combining original and paraphrased
sections as shown in Figure 3.
[Figure 3 shows four mixes of real and ChatGPT-paraphrased abstract, introduction, and conclusion sections (one, one, two, and all three sections paraphrased, respectively), with 1000 papers each.]
Figure 3. Our co-created test dataset TEST-CC contains 4000 papers with varying shares of real and
ChatGPT-paraphrased sections.
Table 2. Overview of the datasets used to train and evaluate the classifiers. Each column represents
the number of papers used per source. Concerning real papers, unless indicated, we use samples
extracted with parsing 1 (see Section 3.1).
Dataset | arXiv (Real) | ChatGPT (Fake) | GPT-2 (Fake) | SCIgen (Fake) | Galactica (Fake) | GPT-3 (Fake) | ChatGPT (Co-Created)
Standard train (TRAIN) | 8 k | 2 k | 2 k | 2 k | 2 k | - | -
Standard train subset (TRAIN-SUB) | 4 k | 1 k | 1 k | 1 k | 1 k | - | -
TRAIN without ChatGPT (TRAIN-CG) | 8 k | - | 2 k | 2 k | 2 k | - | -
TRAIN plus GPT-3 (TRAIN + GPT3) | 8 k | 2 k | 2 k | 2 k | 2 k | 1.2 k | -
Standard test (TEST) | 4 k | 1 k | 1 k | 1 k | 1 k | - | -
Out-of-domain GPT-3 only (OOD-GPT3) | - | - | - | - | - | 1 k | -
Out-of-domain real (OOD-REAL) | 4 k (parsing 2) | - | - | - | - | - | -
ChatGPT only (TECG) | - | 1 k | - | - | - | - | -
Co-created test (TEST-CC) | - | - | - | - | - | - | 4 k
Hyperparameters: For paraphrasing, we use OpenAI's gpt-3.5-turbo-0613 model
and set the temperature to 1.0 to achieve the largest deviation from the original human-written text.
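A sketch of the per-section paraphrasing call; the exact prompt wording is an assumption, while the model and temperature follow the stated hyperparameters:

import openai  # pre-1.0 openai client assumed

def paraphrase_section(section: str) -> str:
    # Each section (abstract, introduction, conclusion) is paraphrased in a
    # separate prompt to stay within ChatGPT's context length limits.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        temperature=1.0,
        messages=[{
            "role": "user",
            "content": "Paraphrase the following section of a scientific paper, "
                       f"keeping roughly the same length:\n\n{section}",
        }],
    )
    return response["choices"][0]["message"]["content"]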
4. Detection Experiments
In this section, we conduct experiments on identifying the source of a given paper—
i.e., determining whether it is fake or real. We further investigate the ability of our baseline
classifiers to detect co-created papers with varying degrees of fake—i.e., paraphrased—
content. We start by defining data splits and subsets for training and testing, which
are useful to evaluate generalization capabilities. Next, we outline the classifiers used
as baselines to measure performance on the benchmark task. Finally, we examine the
detection performance of the classifiers, investigate the obtained explanations, and apply
additional post hoc explainability methods to the classifiers to gain deeper insights into the
detection process.
4.1. Data Splits and Generalization Tests
We divide our dataset (displayed in Table 1) into standard train and standard test sets for
training and testing our classifiers, respectively. Furthermore, we aim to evaluate models
on out-of-domain test data. To achieve this, we create various data subsets by applying
different splits to our benchmark. All the splits utilized for our experiments are detailed in
Table 2. For instance, the reader can observe the composition of a data split with no access
to ChatGPT samples (TRAIN-CG) and test sets composed only of differently parsed real
papers (OOD-REAL), only ChatGPT papers (TECG), or only GPT-3 ones (OOD-GPT3).
4.2. Classifiers
We build and evaluate seven classifiers to perform the downstream task of classifying
scientific papers as fake or real based on their content (abstract, introduction, and conclusion
sections)—we remind the reader that all paper titles are real and will therefore not serve as
input to the classifiers. To obtain an understanding of the difficulty of this classification
task, we train two simple bag-of-words-based classifiers, Logistic Regression (LR) [58] and
Random Forest (RF) [19]. Further, we fine-tune GPT-3 [1], Galactica [16], and RoBERTa [17]
for this detection task. Lastly, we use a ChatGPT-based classifier without fine-tuning and a
novel classifier that we call Large Language Model Feature Extractor (LLMFE) that learns
explainability features using an LLM and then performs classification with Random Forest.
To accommodate memory and API limitations, we impose a restriction on the input
tokens for GPT-3, Galactica, and RoBERTa by truncating texts after a certain number of
tokens (details described in the following sections per model). However, since the average
length of the combined input sections is about 900 tokens, which is less than the truncation
limit, this constraint does not lead to significant information loss.
100
Information 2023, 14, 522
4.2.1. Bag-of-Words Classifier
As the simplest classifiers, we evaluate Random Forest [19] and Logistic Regression [58] on TF-IDF [59] features. This is to obtain an indication of the difficulty of the
classification task—i.e., whether there is any classification signal in word frequencies alone
or the detection of fake scientific papers requires more complex features. With Random
Forest and Logistic Regression, we can explain the results by examining feature importance
and learned coefficients.
Hyperparameters: We use the Random Forest and Logistic Regression implementations in scikit-learn [60] with default hyperparameters. We create features based on n-grams.
A comparison of accuracies when using 1-grams, 2-grams, or a combination of both can
be found in Table A2 in the appendix. In the following, we will report results based on
1-grams as these yielded the highest accuracy scores.
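A minimal sketch of these baselines with scikit-learn (placeholder data; default hyperparameters, as stated above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Abstract: In this paper ...", "Abstract: We propose ..."]  # placeholder sections
labels = [0, 1]  # 0 = real, 1 = fake

# 1-gram TF-IDF features feeding the two simple classifiers
lr = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression())
rf = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), RandomForestClassifier())
lr.fit(texts, labels)
rf.fit(texts, labels)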
4.2.2. GPT-3
We fine-tune a GPT-3 [1] Ada model (text-ada-001, 350 M parameters) for the classification task. GPT-3 is fine-tuned in a causal manner, where the model is prompted with
the concatenated paper sections along with their corresponding label. This is set up as a
binary classification where the output is a single token indicating whether the paper is real
(0) or fake (1). During inference, the model generates a single token based on the sections of
a given paper.
As fine-tuning GPT-3 models requires a paid API, we train it only on a smaller subset
of our dataset (TRAIN-SUB) shown in Table 2. We limit the number of input tokens to 2048
while retaining the default hyperparameters provided by the API.
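For illustration, one hypothetical training record in OpenAI's completions fine-tuning format; the "###" separator and single-token 0/1 labels follow the prompt shown in Appendix B.2:

import json

record = {
    "prompt": "Abstract:\n...\n\nIntroduction:\n...\n\nConclusion:\n...\n\n###",
    "completion": " 1",  # single output token: 1 = fake, 0 = real
}
print(json.dumps(record))  # one JSON object per line in the training file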
4.2.3. Galactica
We adapt Galactica-mini (https://huggingface.co/facebook/galactica-125m) (accessed
on 24 April 2023) [16] from a causal language model that predicts probabilities for each
word in the vocabulary to a binary classifier with an output layer that predicts probabilities
for two labels: fake and real.
The model is provided with all sections concatenated together with the corresponding
label. Galactica, being a causal language model, generates a probability distribution
spanning the entire vocabulary in its output. However, this approach incurs significant
memory usage, particularly when employed as a classifier. Therefore, we opt to retrain
the output layer to yield a probability distribution over the two binary outcomes.
Hyperparameters: To cope with memory constraints, we limit the number of input
tokens to 2048. Additionally, we set the batch size to 2 with gradient accumulation steps
of 4 and enable mixed precision. Furthermore, we set the number of epochs to 4, weight
decay to 0.01, and warm-up steps to 1000. Our initial learning rate is 5 × 10−6.
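A sketch of this setup, with the Hugging Face sequence-classification head standing in for the retrained output layer; the training arguments mirror the hyperparameters above, and the output path is hypothetical:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
# A two-label classification head replaces the language-modeling output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/galactica-125m", num_labels=2)

args = TrainingArguments(
    output_dir="galactica-detector",  # hypothetical path
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    fp16=True,  # mixed precision
    weight_decay=0.01,
    warmup_steps=1000,
    learning_rate=5e-6,
)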
4.2.4. RoBERTa
We fine-tune RoBERTa base (125 M parameters) (https://huggingface.co/roberta-base)
(accessed on 24 April 2023) [17] for the classification task. RoBERTa is limited to 512 input
tokens, meaning that all text exceeding this limit is ignored; many entries in our dataset
exceed this constraint. Rather than retraining the input layer with an enlarged input size,
we address the problem by fine-tuning three separate RoBERTa models to classify the three
sections individually, released for the abstract (https://huggingface.co/tum-nlp/IDMGSP-RoBERTa-TRAIN-ABSTRACT)
(accessed on 31 July 2023), introduction (https://huggingface.co/tum-nlp/IDMGSP-RoBERTa-TRAIN-INTRODUCTION) (accessed on 31 July 2023), and conclusion (https://huggingface.co/tum-nlp/IDMGSP-RoBERTa-TRAIN-CONCLUSION) (accessed on 31 July 2023). We take the majority vote
from the three model instances as the final output for each sample. We prompt each model
with the capitalized name of the section plus the content of the latter, e.g., “Abstract: In this
paper . . . ”.
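A minimal sketch of the majority vote (the function name is ours):

from collections import Counter

def majority_vote(section_predictions):
    # e.g. ["fake", "real", "fake"] from the abstract, introduction,
    # and conclusion models -> "fake"
    return Counter(section_predictions).most_common(1)[0][0]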
Hyperparameters: To fine-tune RoBERTa base, we set the number of epochs to 2,
weight decay to 0.001, and batch size to 16. As with Galactica, the initial learning rate is
5 × 10−6, and the number of warm-up steps is 1000.
4.2.5. DetectGPT
We evaluate DetectGPT [18] as another classifier, as it has been shown to detect LLM-generated texts with high accuracy.
Hyperparameters: We use DetectGPT’s default configuration and code (https://github.
com/BurhanUlTayyab/DetectGPT) (accessed on 15 May 2023).
4.2.6. ChatGPT
To obtain natural-language explanations for classification directly, we prompt ChatGPT [8] via the OpenAI API. With this, we determine whether a scientific paper is fake or
real and retrieve an explanation for its decision. The prompts include the concatenated
sections, each beginning with the section name (e.g., “Abstract:\nIn this paper . . . ”), and task
instructions. We compare the detection performance of four different prompting styles:
(1) Input-Output Prompting (IO): First, return the prediction (i.e., fake or real). Second, follow up with an explanation of the reasons for the prediction.
(2) Chain-of-Thought Prompting (CoT) [61]: First, return a sequence of thoughts on whether the paper is more likely fake or real. Second, return the final prediction.
(3) Indicator Prompting (IP): First, return a set of observations indicating that the paper was written by a human. Second, return a set of observations indicating that the paper was generated by a machine. Third, return the final prediction.
(4) Few-Shot Prompting (FS) [1]: Perform Input-Output Prompting but include a set of 6 annotated examples—one example from each generator and one real example—in the prompt (i.e., scientific papers with their abstract, introduction, conclusion, and fake or real label).
On our specific task, we observe the best classification results for the IO prompting
style. Hence, we will only report accuracy scores for this prompting style in the following.
For a detailed accuracy comparison of the different prompting styles, see Table A3 in the
appendix. When using CoT prompting, there is a large number of instances where ChatGPT
refuses to return a definite class label (real or fake) but instead returns unknown. We treat
these results as incorrect answers and thus observe low accuracy scores for CoT prompting.
We did not observe this behavior for the other prompting styles.
Hyperparameters: For classification, we use OpenAI’s gpt-3.5-turbo-0613 model
with the default temperature of 0.7. Only for Few-Shot Prompting, we prompt the
gpt-3.5-turbo-16k-0613 model as a larger context length is needed. We do not perform
task-specific fine-tuning. Due to API limitations, we classify only 100 randomly sampled
papers from each test set using each of the four prompting styles. During implementation,
we also experimented with larger samples and observed consistent classification accuracy
scores independent of the sample size.
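A sketch of IO-style classification via the API; the prompt wording is a paraphrase of the style described above, not the authors' exact prompt:

import openai  # pre-1.0 openai client assumed

IO_PROMPT = (  # paraphrase of the IO prompting style
    "Decide whether the following scientific paper is fake (machine-generated) "
    "or real (human-written). First return the label, then explain the reasons "
    "for your prediction.\n\n{paper}"
)

def classify_io(paper: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        temperature=0.7,
        messages=[{"role": "user", "content": IO_PROMPT.format(paper=paper)}],
    )
    return response["choices"][0]["message"]["content"]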
4.2.7. Large Language Model Feature Extractor (LLMFE)
Finally, we introduce and evaluate a novel explainable classifier, LLMFE, that learns
human-understandable features using an LLM and an approach inspired by contrastive
learning [62]. These features can range from very low-level (e.g., occurrences of a specific
word) to very high-level (e.g., logical conclusiveness of argumentation). Figure 4 shows
how LLMFE works conceptually. Training this classifier follows a four-step process:
(1) Feature Engineering: The LLM is presented with a pair of one real and one fake scientific paper and instructed to describe a list of features that would best distinguish these papers from each other. As we score each feature on a range of 0 to 10, we further instruct the LLM to label the meaning of the extreme ends of this scale for each feature to avoid ambiguity. This prompt is repeated n_pairs times to extract multiple different sets of features based on different example pairs.
(2) Feature Consolidation: As the previous step may have generated a large number of features, many of which are duplicates or semantically similar, we consolidate the extracted features into a smaller feature set. This is done by vectorizing each feature description using embeddings and performing hierarchical/agglomerative clustering [63] on the embeddings. We then manually investigate the cluster dendrogram and define a distance threshold d_thres. We finally merge all features less than d_thres apart from each other and represent each cluster through the feature closest to the cluster centroid. If d_thres is chosen carefully, this results in a significantly smaller, semantically diverse, and duplicate-free feature set. More detailed illustrations of this step can be found in Appendix B.4.
(3) Feature Scoring: The LLM is presented with an abstract, introduction, and conclusion of a scientific paper and descriptions of all features in the feature set. It is then instructed to assign an integer value from 0 to 10 to each feature that most accurately describes the scientific paper. This prompt is repeated for each example in the training dataset of size n_sample.
(4) Classifier Training: The previous steps resulted in a structured dataset of n_sample examples with one integer value for each feature in the learned feature set. Further, class labels (i.e., real or fake) are known. This dataset is used to train a Random Forest [19] classifier that learns to detect papers based on the features described by the LLM.
Figure 4. LLMFE follows a four-step process: (1) Generate features suitable for distinguishing real
and fake papers using the LLM based on multiple pairs of one real and one fake paper each. (2) Remove
duplicate features through hierarchical clustering on embeddings of the feature descriptions. (3) Score
scientific papers along the remaining features using the LLM. (4) Finally, train a Random Forest
Classifier to predict the real or fake label based on the feature scores.
Throughout the first three steps, the LLM is made aware of its overall goal of distinguishing real and fake scientific papers through the prompt instructions. We add this context
information to best exploit the LLM’s general world understanding obtained through
extensive pre-training and to compensate for the relatively small sample sizes used for
training. Inference on the test dataset then requires only two steps:
(1) Feature Scoring: Similar to the Feature Scoring step during training, a set of new papers is scored along the learned features.
(2) Classifier Prediction: The class label of the new papers is predicted using the trained Random Forest classifier.
Hyperparameters: Our LLMFE implementation uses OpenAI's gpt-3.5-turbo-0613
with the default temperature of 0.7 for the Feature Engineering step and gpt-3.5-turbo-16k-0613 with a temperature of 0.0—for deterministic behavior—for the Feature Scoring step.
We set n_pairs=100 and obtained 884 features from the Feature Engineering step. For the
Feature Consolidation step, we create embeddings of the feature descriptions with OpenAI's text-embedding-ada-002 and chunk_size=1000. We apply agglomerative clustering from SciPy's [64] linkage implementation with a cosine distance metric and calculate
the average distance between clusters. We chose d_thres=0.05 as this resulted in a convenient balance between de-duplication and semantic feature diversity, yielding a final set of
83 features. We finally trained a Random Forest classifier with scikit-learn's [60] default
hyperparameters on 600 papers from the TRAIN dataset (300 real papers and 60 fake papers
from each generator).
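A minimal sketch of the consolidation step with SciPy (random placeholder embeddings; 1536 is the text-embedding-ada-002 dimension):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Placeholder for the 884 feature-description embeddings.
embeddings = np.random.rand(884, 1536)
Z = linkage(embeddings, method="average", metric="cosine")
clusters = fcluster(Z, t=0.05, criterion="distance")  # d_thres = 0.05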
4.3. Performance
Table 3 presents a summary of the accuracy scores achieved by our models on various splits. Given the significance of evaluating generalization to unseen generators, we
highlight out-of-domain settings in blue. We exclude experiments entailing training GPT-3
on TRAIN + GPT3 and TRAIN-CG due to limited OpenAI API credits. Results of our
fine-tuned models and LLMFE are also compared with DetectGPT as an existing zero-shot
detection baseline [18], ChatGPT, and our Logistic Regression (LR) and Random Forest (RF)
classifiers trained on 1-gram TF-IDF features.
Table 3. Experiment results reported with accuracy metric. Out-of-domain experiments, i.e., evaluation on unseen generators, are highlighted in blue. Highest values per test set are highlighted in
bold. (*) ChatGPT-IO and LLMFE accuracies have been evaluated on randomly sampled subsets of
100 scientific papers per test set due to API limits.
Model | Train Dataset | TEST | OOD-GPT3 | OOD-REAL | TECG | TEST-CC
LR-1gram (tf-idf) (our) | TRAIN | 95.3% | 4.0% | 94.6% | 96.1% | 7.8%
LR-1gram (tf-idf) (our) | TRAIN + GPT3 | 94.6% | 86.5% | 86.2% | 97.8% | 13.7%
LR-1gram (tf-idf) (our) | TRAIN-CG | 86.6% | 0.8% | 97.8% | 32.6% | 1.2%
RF-1gram (tf-idf) (our) | TRAIN | 94.8% | 24.7% | 87.3% | 100.0% | 8.1%
RF-1gram (tf-idf) (our) | TRAIN + GPT3 | 91.7% | 95.0% | 69.3% | 100.0% | 15.1%
RF-1gram (tf-idf) (our) | TRAIN-CG | 97.6% | 7.0% | 95.0% | 57.0% | 1.7%
Galactica (our) | TRAIN | 98.4% | 25.9% | 95.5% | 84.0% | 6.8%
Galactica (our) | TRAIN + GPT3 | 98.5% | 71.2% | 95.1% | 84.0% | 12.0%
Galactica (our) | TRAIN-CG | 96.4% | 12.4% | 97.6% | 61.3% | 2.4%
RoBERTa (our) | TRAIN | 72.3% | 55.5% | 50.0% | 100.0% | 63.5%
RoBERTa (our) | TRAIN + GPT3 | 65.7% | 100.0% | 29.1% | 100.0% | 75.0%
RoBERTa (our) | TRAIN-CG | 86.0% | 2.0% | 92.5% | 76.5% | 9.2%
GPT-3 (our) | TRAIN-SUB | 100.0% | 25.9% | 99.0% | 100.0% | N/A
DetectGPT | - | 61.5% | 0.0% | 99.9% | 68.7% | N/A
ChatGPT-IO (our) (*) | - | 69.0% | 49.0% | 89.0% | 0.0% | 3.0%
LLMFE (our) (*) | TRAIN + GPT3 | 80.0% | 62.0% | 70.0% | 90.0% | 33.0%
Our simplest models, LR and RF, already achieve accuracy scores greater than 90%
on the TEST dataset, suggesting that the classification task of distinguishing real and fake
scientific papers is rather easy to learn if trained on comparable scientific papers. However,
evaluated against out-of-domain scientific papers, accuracy scores drop significantly. All
models perform poorly on out-of-domain papers generated by GPT-3 curie (OOD-GPT3).
This result supports the findings of previous studies by Bakhtin et al. [43], which indicate
that models trained on specific generators tend to overfit and perform poorly on data
outside their training distribution. However, after training our Galactica and RoBERTa
models with GPT-3 examples (TRAIN + GPT3), the models achieve higher accuracies (71%
and 100%, respectively). A similar behavior can be observed for the LR and RF classifiers.
All models, except RoBERTa, perform poorly when detecting human–machine co-created papers (TEST-CC). Seeing papers generated by ChatGPT and GPT-3 during training
noticeably improves the detection accuracy for all models, presumably because these
examples are most similar to the ChatGPT-paraphrased papers that are part of the TEST-CC
dataset. RoBERTa still achieves an accuracy of 75%, which is remarkable given that many
examples only contain a relatively low share of machine-generated text. This seems to be
due to a high-recall bias of the trained RoBERTa model, which achieves comparatively high
accuracy scores on datasets that only contain fake papers (i.e., OOD-GPT3, TECG) but lower
scores on the remaining datasets that also contain real papers. GPT-3 and DetectGPT have
not been evaluated against TEST-CC due to limited computing resources and API credits.
Models that were not fine-tuned for the classification task, DetectGPT and ChatGPT,
perform noticeably worse than the fine-tuned models. Our ChatGPT-based LLMFE outperforms ChatGPT on all test datasets except OOD-REAL, indicating that an LLM's detection
abilities can be enhanced with a systematic prompting approach and guidance. In particular,
we observe great improvements on the more sophisticated texts in our TECG and TEST-CC
datasets. This may be because of the more high-level features identified by LLMFE—e.g.,
those that capture a paper's overall coherence.
It is worth noting that our RoBERTa model exhibits excellent results when evaluated
on a dataset of ChatGPT-generated papers (TECG). The model achieves an accuracy of
77% without prior training on a similar dataset (TRAIN-CG), and 100% accuracy when a
similar dataset is included in the training (TRAIN). These results outperform Galactica in
both scenarios.
The overall good results on OOD-REAL—i.e., real papers processed with a different
parser—indicate that our models are not exploiting any spurious artifacts introduced during
the parsing procedure. DetectGPT notably overfits papers generated with GPT-2 and
deems most samples coming from a different source as real. Indeed, it performs well on
OOD-REAL (100%) and poorly on OOD-GPT3 (0%).
4.4. Explainability Insights
The different types of classifier models provide a rich set of explainability insights
that help us understand what characterizes real and fake scientific papers, respectively.
LR and RF classifiers trained on TF-IDF 1-grams provide insights into individual words.
For Galactica, RoBERTa, and GPT-3, we extract insights on more complex features of word
combinations. Lastly, LLMFE learns very high-level, abstract features describing complex
relationships between words, such as grammar and cohesion. Additionally, we analyze
linguistic-based features such as readability scores and the length of papers.
4.4.1. Word-Level Insights from LR and RF
The coefficients learned by LR (see Figure 5a) and feature importance learned by
RF indicate that real papers draw from a diverse set of words and—more often than fake
papers—make references to specific sections (“section”), other papers (“et” and “al”),
or recent trends (“recently”). In contrast, fake papers tend to rely on one-size-fits-all
vocabulary such as “method”, “approach”, or “implications” more than real papers.
4.4.2. LIME and SHAP Insights for Galactica, RoBERTa, and GPT-3
We use LIME [65] and SHAP [66] to inspect predictions made by Galactica, RoBERTa,
and GPT-3. While these explanations fail to convey a concise overview, they are still useful
to notice patterns and similarities across samples sharing labels and sources [67,68].
Often, RoBERTa and Galactica models tend to classify papers as real when the papers
include infrequent words and sentences starting with adverbs. In addition, we notice that
SHAP explanations corresponding to real papers have all words with low Shapley values.
We believe this is intuitive as a paper appears real if it does not contain any artifact that
strongly signals an AI source.
On the other hand, papers whose sections begin with “In this paper, . . . ”, “In this work,
. . . ”, or “In this study, . . . ” are often marked as fake. The same goes for those containing
repeated words, spelling mistakes, or word fragments such as “den”, “oly”, “um”. Detectors
are also able to spot incoherent content and context, as well as sections that are unnaturally
short and do not convey any specific point. Several explanation instances of Galactica and
RoBERTa can be found in Appendix C for further inspection. We choose not to provide an
explanation for our GPT-3 classifier since it requires many requests to OpenAI’s paid API.
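A sketch of how such token-level explanations can be obtained, here wrapping the released abstract classifier in a text-classification pipeline and letting SHAP's default text masker attribute the prediction to input tokens (a generic recipe, not necessarily the authors' exact setup):

import shap
from transformers import pipeline

detector = pipeline("text-classification",
                    model="tum-nlp/IDMGSP-RoBERTa-TRAIN-ABSTRACT")
explainer = shap.Explainer(detector)
shap_values = explainer(["Abstract: In this paper we propose ..."])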
[Figure 5: (a) LR 1-gram coefficients; (b) LLMFE score distributions (real vs. fake, on a 0–10 scale) for the features Grammar and Syntax, Mathematical Formalism, Use of Domain-Specific Terminology, Abstract Clarity, Conclusion Length, Use of Passive Voice, Emotive Language, and Cohesion.]
Figure 5. Explainability insights from our Logistic Regression (LR) and Large Language Model
Feature Extractor (LLMFE) classifiers. (a) shows the 1-grams with the 10 lowest (indicating real) and
highest (indicating fake) coefficients learned by LR. (b) shows the distributions of scores for the eight
most important features (according to Random Forest feature importance) learned by LLMFE.
4.4.3. Abstract Features from LLMFE
LLMFE identifies more abstract features such as grammar and syntax, use of domain-specific terminology, or cohesion, as shown in Figure 5b. We observe that the score distributions of
real papers tend to be narrower than those of fake papers. This is not surprising given that
fake papers were generated by multiple generators, some more and some less advanced.
For many features, the distributions of real and fake papers have the same mode, suggesting
that, collectively, our dataset of machine-generated papers resembles real papers quite well.
4.4.4. Readability Metrics for Different Generators
Figure 6 shows the distribution of the Flesch–Kincaid Grade Level [69] and Gunning
Fog [70] readability metrics [71] for papers from the different generators and for real papers.
Flesch–Kincaid measures the technical difficulty of the papers, while Gunning Fog measures their readability. The comparison confirms our observation that our
machine-generated papers are representative of real papers, with a slight increase in writing
sophistication from SCIgen and GPT-2 to the ChatGPT and GPT-3 generators, with Galactica
in the middle.
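Both metrics can be computed, for instance, with the textstat package (an assumption on our part; the paper cites [71] for the implementation):

import textstat

section_text = "In this paper we propose ..."  # placeholder paper text
fk_grade = textstat.flesch_kincaid_grade(section_text)
fog_index = textstat.gunning_fog(section_text)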
[Figure 6: (a) Flesch Kincaid and (b) Gunning Fog score distributions for the generators SCIgen, GPT-2, Galactica, ChatGPT, GPT-3, All, Real, and CC.]
Figure 6. Distribution of readability metrics for papers from the different generators. (a) shows
Flesch–Kincaid scores while (b) shows Gunning Fog scores for all generators.
4.4.5. Generated Texts Length
We observe differences in the length of the sections in our fake scientific papers depending on the generator. Figure 7 shows the length distributions of sections generated
by the different generators. On average, machine-generated sections from all generators
are shorter than sections from real papers—the only exception being abstracts and conclusions generated by GPT-2, which are slightly longer than real abstracts and conclusions,
on average. For most generators, we also see less length variety compared to real papers.
[Figure 7: token-count distributions per generator (SCIgen, GPT-2, Galactica, ChatGPT, GPT-3, All, Real, CC) for (a) abstracts, (b) introductions, and (c) conclusions.]
Figure 7. The generators exhibit different tendencies for the length of the generated fake scientific
papers. (a) shows the length distribution of generated abstracts, (b) shows the same for introductions,
and (c) shows conclusion lengths.
For the co-created scientific papers (CC), despite prompting ChatGPT to return paraphrased sections with a similar length or even the exact word count as the original sections,
we observe a tendency of ChatGPT to summarize sections during paraphrasing. While
paraphrased abstracts have roughly the same length as their originals, paraphrased introduction and conclusion sections are often significantly shorter, as shown in Figure 8. We
conclude that ChatGPT does not reliably follow length constraints when confronted with a
paraphrasing task on longer texts.
[Figure 8: scatter plots of paraphrased vs. original token counts for the abstract, introduction, and conclusion sections.]
Figure 8. Paraphrasing sections with ChatGPT has a tendency to result in sections shorter than the
original. The reduction in section length is most visible for the longer introduction and conclusion
sections. For an analysis of the lengths of generated fake scientific papers, see Figure 7.
5. Limitations and Future Work
Despite memory, GPU, and API limitations presenting significant obstacles for our
project, we could still create high-quality fake scientific papers. Nonetheless, we believe
there is room for improvement in addressing such limitations. For instance, beyond simply
improving the quality of the generated papers, further insights could be gained from
exploring generation processes entailing a collaboration between different models and
input prompts.
Due to the complexity of parsing PDFs, we are currently limited to specific sections
(abstract, introduction, conclusion) instead of complete papers. Moreover, processing entire
publications would require substantial computational efforts. We believe that selecting
sections dynamically at random instead of a fixed choice is worth exploring and will be the
focus of future work.
Beyond DetectGPT [18], other zero-shot text detectors such as GPTZero (https://gptzero.me)
(accessed on 31 July 2023) present promising solutions worth testing on our benchmark
dataset. However, at the time of writing, such solutions are not available for experiments
at scale.
In future work, we aim to address these limitations by exploring dynamic section
selection, combining models and prompts in the generation process, improving papers’
quality, and investigating the potential of zero-shot text detectors such as GPTZero as
they become more accessible and scalable. We think that future research should further
investigate how stable classifiers, such as the ones presented in this paper, are against
newly appearing LLMs and how to improve the classifiers’ generalization capabilities to
out-of-domain samples.
6. Discussion, Ethical Considerations, and Broader Impact
It is important to emphasize that our work does not condemn the usage of LLMs.
The legitimacy of their usage should be addressed by regulatory frameworks and guidelines.
Still, we strongly believe it is crucial to develop countermeasures and strategies to detect
machine-generated papers to ensure accountability and reliability in published research.
Our benchmark dataset serves as a valuable resource for evaluating detection algorithms, contributing to the integrity of the scientific community. However, potential
challenges include adversarial attacks and dataset biases [72,73]. It is essential to develop
robust countermeasures and strive for a diverse, representative dataset.
7. Conclusions
This work introduced a benchmark dataset for identifying machine-generated scientific
papers in the LLM era. Our work creates a resource that allows researchers to evaluate
the effectiveness of detection methods and thus support the trust and integrity of the
scientific process.
We generated a diverse set of papers using both SCIgen and state-of-the-art LLMs—
ChatGPT, Galactica, GPT-2, and GPT-3. This ensures a variety of sources and includes
models capable of generating convincing content. We fine-tuned and tested several baseline
detection models—Logistic Regression, Random Forest, GPT-3, Galactica, and RoBERTa—
and compared their performance to DetectGPT, ChatGPT, and a novel Large Language
Model Feature Extractor (LLMFE) that we propose. The results demonstrated varying
degrees of success, with some models showing remarkable performance on specific subsets
while sometimes struggling with out-of-domain data.
By providing a comprehensive platform for evaluating detection techniques, we
contribute to the development of robust and reliable methods for identifying machinegenerated content. Moving forward, we plan to address the current limitations and further
enhance the utility of our benchmark for the research community.
We release a repository containing our benchmark dataset as well as the code used for
the experimental results (https://huggingface.co/datasets/tum-nlp/IDMGSP) (accessed on
31 July 2023).
Author Contributions: Design of experiments, E.M. and D.D.; Dataset creation, E.M., M.H.I.A. and
S.M.; Experiments, M.H.I.A. and S.M.; writing—original draft preparation, E.M.; writing—journal
version preparation, D.D. and S.M.; writing—review and editing, G.G. All authors have read and
agreed to the published version of the manuscript.
Funding: This paper has been supported by the German Federal Ministry of Education and Research
(BMBF, grant 01IS17049).
Institutional Review Board Statement: Not applicable.
Data Availability Statement: All data created within this research is available online
(https://huggingface.co/datasets/tum-nlp/IDMGSP, accessed on 27 July 2023).
Acknowledgments: We would like to acknowledge the help of Social Research Computing Group
for providing an opportunity to conduct this research and Paolo Basso and Margherita Musumeci
for participating in the first version of this publication. Additionally, we would like to thank Leslie
McIntosh for her guidance throughout our research journey.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
NLP   Natural Language Processing
LLM   Large Language Model
BoW   Bag-of-Words
OOD   Out-of-Domain
CFG   Context-Free Grammars
Appendix A. Generation Examples
In this section, we present examples of text that were generated using the models
we employed.
Appendix A.1. ChatGPT
Figures A1 and A2 provide examples of prompts that can be used to generate paper
sections using ChatGPT.
Figure A1. Example 1 of prompting ChatGPT to produce sections of a scientific paper given the
paper title.
Figure A2. Example 2 of prompting ChatGPT to produce sections of a scientific paper given the
paper title.
Appendix A.2. Galactica
For generating text with the Galactica model, an overview of the hyperparameters
used is provided in Table A1. Each subsection shows how Galactica is prompted to generate
the abstract, the introduction and the conclusion sections.
Table A1. Hyperparameters used to generate each paper section in the Galactica model. Each row
corresponds to the decoding of a section based on the previous input sections. Here, we used the
parameters of the model.generate function provided by Hugging Face [74].
Input Section(s) | Output Section | Tokenizer Max_Input_Size | Max_New_Tokens | Do_Sample | Temperature | top_k | top_p | No_Repeat_Ngram_Size | Early_Stopping
Title | Abstract | 64 | 512 | True | 0.7 | 25 | 0.9 | 10 | True
Title + Abstract | Introduction | 576 (64 + 512) | 1024 | True | 0.7 | 25 | 0.9 | 10 | True
Title + Abstract + Introduction | Conclusion | 1600 (64 + 512 + 1024) | 1024 | True | 0.7 | 25 | 0.9 | 10 | True
Appendix A.2.1. Abstract Generation
INPUT:
Title: On the Global Structure of Hopf Hypersurfaces in Complex Space Form
OUTPUT:
Abstract: For a generic hypersurface in complex space form, all Hopf hypersurfaces
are proved to be homogeneous or Hopf homogeneous. As a consequence, it is shown
that the generalized Sasakian metric is a Ricci soliton on a Hopf hypersurface.
Appendix A.2.2. Introduction Generation
INPUT:
Title: On the Global Structure of Hopf Hypersurfaces in Complex Space Form
Abstract: For a generic hypersurface in complex space form, all Hopf hypersurfaces
are proved to be homogeneous or Hopf homogeneous. As a consequence, it is shown
that the generalized Sasakian metric is a Ricci soliton on a Hopf hypersurface.
Introduction:
OUTPUT:
A Hopf hypersurface is a hypersurface which admits a Hopf field, that is, a vector field
such that [ξ, ξ] is a normal vector field. The Hopf field is unique up to multiplication
by a positive constant. The existence of Hopf fields is equivalent to the existence
of a 1-form η such that η ∧ (dη)^(n−1) is a volume form. Let M^n be a hypersurface in a
complex space form M^(2n+1)(c) with constant sectional curvature c. We denote by g
the metric induced by the Riemannian metric of M^(2n+2)(c). The Hopf field is given
by ξ = (1/√n) J∇f, where J is the complex structure on M^(2n+3) and f is a function on
M^n ⊂ M^(2n+1). In [1], Wang and Chen showed that a Hopf hypersurface in M^(2n+l)(c)
is homogeneous if and only if the Hopf field is parallel. In this paper, we study
the structure of Hopf hypersurfaces in complex space form. We show that a Hopf
hypersurface is homogeneous if and only if it is Hopf homogeneous. We also show
that a Hopf hypersurface is Hopf homogeneous if and only if the generalized Sasakian
metric is a Ricci solitons on it.
Appendix A.2.3. Conclusion Generation
INPUT:
Title: On the Global Structure of Hopf Hypersurfaces in Complex Space Form
Abstract: For a generic hypersurface in complex space form, all Hopf hypersurfaces
are proved to be homogeneous or Hopf homogeneous. As a consequence, it is shown
that the generalized Sasakian metric is a Ricci soliton on a Hopf hypersurface.
Introduction:
A Hopf hypersurface is a hypersurface which admits a Hopf field, that is, a vector field
such that [ξ, ξ] is a normal vector field. The Hopf field is unique up to multiplication
by a positive constant. The existence of Hopf fields is equivalent to the existence
of a 1-form η such that η ∧ (dη)^(n−1) is a volume form. Let M^n be a hypersurface in a
complex space form M^(2n+1)(c) with constant sectional curvature c. We denote by g
the metric induced by the Riemannian metric of M^(2n+2)(c). The Hopf field is given
by ξ = (1/√n) J∇f, where J is the complex structure on M^(2n+3) and f is a function on
M^n ⊂ M^(2n+1). In [1], Wang and Chen showed that a Hopf hypersurface in M^(2n+l)(c) is
homogeneous if and only if the Hopf field is parallel. In this paper, we study the structure of Hopf hypersurfaces in complex space form. We show that a Hopf hypersurface
is homogeneous if and only if it is Hopf homogeneous. We also show that a Hopf
hypersurface is Hopf homogeneous if and only if the generalized Sasakian metric is a
Ricci solitons on it.
Conclusion:
OUTPUT:
We obtain a complete classification of Hopf hypersurfaces in complex space forms.
Appendix A.3. GPT-2
In the following, we show an example of how the fine-tuned GPT-2 was used to
generate the abstract, the introduction, and the conclusion sections.
Appendix A.3.1. Abstract Generation
INPUT:
Competitive Multi-Agent Load Balancing with Adaptive Policies in Wireless Networks
<|sep|>
OUTPUT:
In this paper, we propose adaptive multi-agent load balancing (AML) based on adaptive policies to achieve better performance. The adaptive policies can be designed to
maximize the number of agent users and to reduce the number of latency. We present
a comprehensive evaluation of AML on three wireless networks. We demonstrate that
the adaptive policies outperform state-of-the-art AML strategies by an average of 6.5%
and by an average of 2.5% when compared to state-of-the-art strategies. The performance of adaptive policies is evaluated on the three wireless networks with multiple
users and multiple latency.
Appendix A.3.2. Introduction Generation
INPUT:
Competitive Multi-Agent Load Balancing with Adaptive Policies in Wireless Networks
<|sep|>
OUTPUT:
Wireless networks are increasingly becoming the backbone of the modern world.
The wireless networks have been the main source of energy for many years, and they
are the main source of power for many businesses. However, wireless networks are
also the main source of information for many people. For example, people in many
countries rely on a variety of information services, such as social networks, social
media, and e-mail. However, the information services are not available in all countries.
Therefore, the information services are not available for all countries.
Appendix A.3.3. Conclusion Generation
INPUT:
Competitive Multi-Agent Load Balancing with Adaptive Policies in Wireless Networks
<|sep|>
OUTPUT:
In this paper, we propose adaptive policies that optimize the network performance
of different types of wireless networks. We show that adaptive policies can achieve
competitive performance for different types of wireless networks. The proposed
adaptive policies have been tested in several real-world wireless networks. We find that
adaptive policies outperform state-of-the-art policies for the most common wireless
networks. We also show that adaptive policies can be used to optimize the network
performance of wireless networks by adapting their performance to different types
of wireless networks. The proposed adaptive policies are able to achieve competitive
performance for different types of wireless networks.
Appendix A.4. GPT-3
The following frames show how each section is generated using a GPT-3 generator.
Appendix A.4.1. Abstract Generation
INPUT:
Primitive Representation Learning for Scene Text Recognition
OUTPUT:
Primitive scene representation learning is an important subfield of deep learning,
which has been widely applied in many image recognition tasks such as image segmentation and object detection. In recent years, several methods have been proposed
to learn the primitive scene representation. Among them, most methods are based
on U-Net architecture and use either a global or local context to capture the global or
local structure of the input scenes. However, these approaches do not fully exploit
the structure information in scenes. In this paper, we propose a novel primitive scene
representation learning method for scene text recognition using a residual block-based
encoder-decoder strategy with multiple level-up passes. To this end, we design three
residual block structures with different subsampling ratios: 1:1 (global), 2:1 (local)
and 1:2 (global+local). We also designed an attention module that can be used to
further guide the search space in each level-up pass by enforcing certain conditions on
hierarchical representations obtained during previous levels.
Appendix A.4.2. Introduction Generation
INPUT:
Primitive Representation Learning for Scene Text Recognition
Primitive scene representation learning is an important subfield of deep learning,
which has been widely applied in many image recognition tasks such as image segmentation and object detection. In recent years, several methods have been proposed
to learn the primitive scene representation. Among them, most methods are based
on U-Net architecture and use either a global or local context to capture the global or
local structure of the input scenes. However, these approaches do not fully exploit
the structure information in scenes. In this paper, we propose a novel primitive scene
representation learning method for scene text recognition using a residual block-based
encoder-decoder strategy with multiple level-up passes. To this end, we design three
residual block structures with different subsampling ratios: 1:1 (global), 2:1 (local)
and 1:2 (global+local). We also designed an attention module that can be used to
further guide the search space in each level-up pass by enforcing certain conditions on
hierarchical representations obtained during previous levels.
Introduction:
OUTPUT:
[1,2]. It has been widely applied in many image segmentation and object detection
[3,4] tasks. In these applications, deep learning is often used to learn the complex
parameters such as models and training schemes. Alternatively, it can be also applied
on the more fundamental problems of learning a better model for predicting new input
data that have not been seen before. Learning representations of unseen data is one of
the most fundamental problems within machine learning and has attracted tremendous
attention in recent years [5]. Primitive scene representation learning (PSRL) is an
important subfield of deep learning which aims at modeling the basic visual concepts
such as objects and scenes in order to solve various real-world applications like image
segmentation [6], pedestrian detection [7], video classification [8] etc.
Appendix A.4.3. Conclusion Generation
INPUT:
Primitive Representation Learning for Scene Text Recognition
Primitive scene representation learning is an important subfield of deep learning,
which has been widely applied in many image recognition tasks such as image segmentation and object detection. In recent years, several methods have been proposed
to learn the primitive scene representation. Among them, most methods are based
on U-Net architecture and use either a global or local context to capture the global or
local structure of the input scenes. However, these approaches do not fully exploit
the structure information in scenes. In this paper, we propose a novel primitive scene
representation learning method for scene text recognition using a residual block-based
encoder-decoder strategy with multiple level-up passes. To this end, we design three
residual block structures with different subsampling ratios: 1:1 (global), 2:1 (local)
and 1:2 (global+local). We also designed an attention module that can be used to
further guide the search space in each level-up pass by enforcing certain conditions on
hierarchical representations obtained during previous levels.
[1,2]. It has been widely applied in many image segmentation and object detection [3,4] tasks. In these applications, deep learning is often used to learn the complex
parameters such as models and training schemes. Alternatively, it can be also applied
on the more fundamental problems of learning a better model for predicting new input
data that have not been seen before. Learning representations of unseen data is one of
the most fundamental problems within machine learning and has attracted tremendous attention in recent years [5]. Primitive scene representation learning (PSRL) is an
important subfield of deep learning which aims at modeling the basic visual concepts
such as objects and scenes in order to solve various real-world applications like image
segmentation [6], pedestrian detection [7], video classification [8] etc.
OUTPUT:
We propose a novel residual block-based encoder-decoder (R-BiDe) method for learning the primitive scene representation. We introduce a residual block structure with
different subsampling ratios, and design an attention module that can be used to
guide the search space in each level-up pass. The proposed R-BiDe method achieves
better performance than other state-of-the-art methods on several difficult scene text
recognition benchmarks with no overfitting issues. In future works, we will focus on
exploring the model design principles of R-BiDe as well as its application scenarios
such as object detection and autonomous driving using deep learning frameworks
such as Caffe and PyTorch.
Appendix A.5. SCIgen
In the case of SCIgen papers, a script is used to generate the paper documents. Once
the documents are generated, they are subsequently parsed to extract the relevant sections.
The following frame shows an example of a generation after parsing.
Title: Towards the Understanding of Randomized Algorithms
Abstract: Many security experts would agree that, had it not been for online algorithms, the investigation of XML might never have occurred. Given the current
status of heterogeneous epistemologies, statisticians predictably desire the evaluation
of public-private key pairs, which embodies the important principles of programming
languages. In order to solve this riddle, we present an analysis of expert systems (Judas), disconfirming that symmetric encryption and hash tables are never incompatible.
It at first glance seems counterintuitive but fell in line with our expectations .
Introduction: Recent advances in interposable modalities and trainable modalities
do not necessarily obviate the need for 802.11 mesh networks. After years of robust
research into Byzantine fault tolerance, we disprove the evaluation of information
retrieval systems. The notion that security experts collaborate with atomic symmetries
is generally adamantly opposed. The evaluation of link-level acknowledgements
would tremendously amplify vacuum tubes. The basic tenet of this solution is the
development of IPv7. For example, many frameworks allow the study of the transistor.
On a similar note, we emphasize that our system improves systems. Thus, we use
lossless communication to disprove that online algorithms and journaling file systems
can interact to fulfill this ambition. Relational methodologies are particularly unfortunate when it comes to “smart” information. This is an important point to understand.
However, for example, many frameworks observe the memory bus. Thusly, we see
no reason not to use trainable communication to develop concurrent theory. This
outcome is usually a private ambition but is supported by related work in the field.
We describe new perfect modalities, which we call Judas. Though such a claim might
seem counterintuitive, it has ample historical precedence. To put this in perspective,
consider the fact that little-known futurists never use IPv6 to surmount this problem.
Contrarily, stochastic technology might not be the panacea that cyberneticists expected.
Two properties make this method distinct: our application prevents homogeneous
configurations, and also Judas is copied from the analysis of DHTs [1]. To put this
in perspective, consider the fact that little-known information theorists rarely use
802.11b to address this challenge. Combined with the UNIVAC computer, such a claim
synthesizes new stochastic modalities. The rest of this paper is organized as follows.
Primarily, we motivate the need for the transistor. Similarly, we place our work in
context with the prior work in this area. On a similar note, to solve this question, we
construct an analysis of telephony (Judas), which we use to show that the seminal
relational algorithm for the exploration of active networks by Thompson [1] runs in
Ω(log log n) time. In the end, we conclude.
Conclusion: Our method will address many of the issues faced by today’s theorists. Similarly, Judas can successfully prevent many link-level acknowledgements
at once. Our methodology for constructing the improvement of the Turing machine
is particularly excellent. We plan to explore more problems related to these issues in
future work.
Appendix B. Classifier Details
Appendix B.1. Bag-of-Words Classifiers
Table A2 shows the detailed results for the different bag-of-words classifiers introduced
in Section 4.2.1.
Table A2. Experiment results for the different bag-of-words classifiers, reported as accuracy. Out-of-domain experiments are highlighted in blue. The highest values per test set are highlighted in bold.
Model                 | Train Dataset | TEST  | OOD-GPT3 | OOD-REAL | TECG   | TEST-CC
LR-1gram (tf-idf)     | TRAIN         | 95.3% | 4.0%     | 94.6%    | 96.1%  | 7.8%
LR-1gram (tf-idf)     | TRAIN + GPT3  | 94.6% | 86.5%    | 86.2%    | 97.8%  | 13.7%
LR-1gram (tf-idf)     | TRAIN-CG      | 86.6% | 0.8%     | 97.8%    | 32.6%  | 1.2%
LR-2gram (tf-idf)     | TRAIN         | 89.1% | 0.5%     | 96.5%    | 91.3%  | 6.4%
LR-2gram (tf-idf)     | TRAIN + GPT3  | 90.0% | 89.7%    | 86.1%    | 97.3%  | 15.7%
LR-2gram (tf-idf)     | TRAIN-CG      | 73.3% | 0.0%     | 99.6%    | 1.4%   | 0.6%
LR-(1,2)gram (tf-idf) | TRAIN         | 94.8% | 0.2%     | 97.8%    | 94.6%  | 2.7%
LR-(1,2)gram (tf-idf) | TRAIN + GPT3  | 95.1% | 83.3%    | 92.6%    | 97.8%  | 5.9%
LR-(1,2)gram (tf-idf) | TRAIN-CG      | 83.3% | 0.2%     | 99.3%    | 1.7%   | 0.3%
RF-1gram (tf-idf)     | TRAIN         | 94.8% | 24.7%    | 87.3%    | 100.0% | 8.1%
RF-1gram (tf-idf)     | TRAIN + GPT3  | 91.7% | 95.0%    | 69.3%    | 100.0% | 15.1%
RF-1gram (tf-idf)     | TRAIN-CG      | 97.6% | 7.0%     | 95.0%    | 57.0%  | 1.7%
RF-2gram (tf-idf)     | TRAIN         | 90.8% | 12.4%    | 76.8%    | 99.3%  | 29.9%
RF-2gram (tf-idf)     | TRAIN + GPT3  | 87.7% | 96.8%    | 54.6%    | 99.9%  | 44.0%
RF-2gram (tf-idf)     | TRAIN-CG      | 85.8% | 3.4%     | 88.8%    | 44.1%  | 8.5%
RF-(1,2)gram (tf-idf) | TRAIN         | 95.4% | 22.4%    | 87.8%    | 93.8%  | 9.1%
RF-(1,2)gram (tf-idf) | TRAIN + GPT3  | 93.8% | 96.0%    | 66.6%    | 100.0% | 19.7%
RF-(1,2)gram (tf-idf) | TRAIN-CG      | 87.8% | 1.9%     | 96.8%    | 43.8%  | 1.1%
Appendix B.2. GPT-3
The following frame shows a GPT-3 classifier training prompt. The input label (1 for
fake and 0 for real) is separated from the input by the separator token (###).
Abstract:
For a generic hypersurface in complex space form, all Hopf hypersurfaces are proved
to be homogeneous or Hopf homogeneous. As a consequence, it is shown that the
generalized Sasakian metric is a Ricci soliton on a Hopf hypersurface.
Introduction:
A Hopf hypersurface is a hypersurface which admits a Hopf field, that is, a vector field ξ such that [ξ, ξ] is a normal vector field. The Hopf field is unique up to multiplication by a positive constant. The existence of Hopf fields is equivalent to the existence of a 1-form η such that η ∧ dη^{n−1} is a volume form. Let M^n be a hypersurface in a complex space form M^{2n+1}(c) with constant sectional curvature c. We denote by g the metric induced by the Riemannian metric of M^{2n+2}(c). The Hopf field is given by ξ = (1/√n) J∇f, where J is the complex structure on M^{2n+3} and f is a function on M^n ⊂ M^{2n+1}. In [1], Wang and Chen showed that a Hopf hypersurface in M^{2n+l}(c) is homogeneous if and only if the Hopf field is parallel. In this paper, we study the structure of Hopf hypersurfaces in complex space form. We show that a Hopf hypersurface is homogeneous if and only if it is Hopf homogeneous. We also show that a Hopf hypersurface is Hopf homogeneous if and only if the generalized Sasakian metric is a Ricci soliton on it.
Conclusion:
For a generic hypersurface in complex space form, all Hopf hypersurfaces are proved
to be homogeneous or Hopf homogeneous. As a consequence, it is shown that the
generalized Sasakian metric is a Ricci soliton on a Hopf hypersurface.
###
1
Appendix B.3. ChatGPT
Table A3 shows the detailed results for the different ChatGPT prompting styles introduced in Section 4.2.6.
Table A3. Experiment results for different ChatGPT prompting styles, reported as accuracy. Out-of-domain experiments are highlighted in blue. The highest values per test set are highlighted in bold. (*) ChatGPT accuracies were evaluated on randomly sampled subsets of 100 scientific papers per test set and prompting style due to API limits.
Model           | Train Dataset | TEST | OOD-GPT3 | OOD-REAL | TECG | TEST-CC
ChatGPT-IO (*)  | TRAIN + GPT3  | 69%  | 49%      | 89%      | 0%   | 3%
ChatGPT-CoT (*) | TRAIN + GPT3  | 63%  | 2%       | 70%      | 3%   | 1%
ChatGPT-IP (*)  | TRAIN + GPT3  | 57%  | 18%      | 92%      | 7%   | 5%
ChatGPT-FS (*)  | TRAIN + GPT3  | 59%  | 2%       | 100%     | 0%   | 0%
Appendix B.4. Large Language Model Feature Extractor (LLMFE)
Figure A3 shows an extract from the hierarchical clustering dendrogram learned during the feature consolidation step of LLMFE.
Figure A3. Extract from the hierarchical clustering dendrogram learned during the feature consolidation step of LLMFE. The full dendrogram lists all 884 features. The distance threshold was chosen so
that 83 clusters were created from the 884 features.
Appendix C. Explainability Results
Appendix C.1. Bag-of-Words Classifiers
Figures A4–A6 show the coefficients and feature importance learned by our Logistic Regression (LR) and Random Forest (RF) classifiers on the TRAIN, TRAIN + GPT3,
and TRAIN-CG datasets, respectively.
(a) LR 1-gram coefficients
(b) RF 1-gram feature importance
Figure A4. Explainability insights from our Logistic Regression (LR) and Random Forest (RF)
classifiers on the TRAIN dataset. (a) shows the 1-grams with the 10 lowest (indicating real) and
highest (indicating fake) coefficients learned by LR. (b) shows the feature importance extracted from
RF after training.
(a) LR 1-gram coefficients
(b) RF 1-gram feature importance
Figure A5. Explainability insights from our Logistic Regression (LR) and Random Forest (RF)
classifiers on the TRAIN + GPT3 dataset. (a) shows the 1-grams with the 10 lowest (indicating real)
and highest (indicating fake) coefficients learned by LR. (b) shows the feature importance extracted
from RF after training.
(a) LR 1-gram coefficients
(b) RF 1-gram feature importance
Figure A6. Explainability insights from our Logistic Regression (LR) and Random Forest (RF)
classifiers on the TRAIN-CG dataset. (a) shows the 1-grams with the 10 lowest (indicating real) and
highest (indicating fake) coefficients learned by LR. (b) shows the feature importance extracted from
RF after training.
Appendix C.2. RoBERTa
Selected samples of SHAP and LIME explanations for our RoBERTa classifier can be
found in Figures A7–A17.
Figure A7. RoBERTa: Example of SHAP explanation on a real abstract correctly classified.
Figure A8. RoBERTa: Example of SHAP explanation on a misclassified real abstract.
Figure A9. RoBERTa: Example of SHAP explanation on a SCIgen generated abstract correctly classified.
Figure A10. RoBERTa: Example of SHAP explanation on a GPT-2 generated abstract correctly classified.
Figure A11. RoBERTa: Example of SHAP explanation on a Galactica generated abstract correctly classified.
Figure A12. RoBERTa: Example of SHAP explanation on a ChatGPT generated abstract correctly classified.
Figure A13. RoBERTa: Example of LIME explanation on a real abstract correctly classified.
Figure A14. RoBERTa: Example of LIME explanation on a SCIgen generated abstract correctly classified.
Figure A15. RoBERTa: Example of LIME explanation on a GPT-2 generated abstract correctly classified.
Figure A16. RoBERTa: Example of LIME explanation on a Galactica generated abstract correctly classified.
Figure A17. RoBERTa: Example of LIME explanation on a ChatGPT generated abstract correctly classified.
Appendix C.3. Galactica
Selected samples of SHAP explanations for our Galactica classifier can be found in
Figures A18–A21.
Figure A18. Galactica: Example of SHAP explanation on a real paper correctly classified.
Figure A19. Galactica: Example of SHAP explanation on a misclassified real paper.
Figure A20. Galactica: Example of SHAP explanation on a Galactica generated paper correctly classified.
Figure A21. Galactica: Example of SHAP explanation on a misclassified Galactica generated paper.
A Survey on Using Linguistic Markers for Diagnosing
Neuropsychiatric Disorders with Artificial Intelligence
Ioana-Raluca Zaman 1 and Stefan Trausan-Matu 1,2,*
1 Department of Computer Science and Engineering, Politehnica Bucharest, National University for Science and Technology, 060042 Bucharest, Romania; [email protected]
2 Research Institute for Artificial Intelligence “Mihai Draganescu” of the Romanian Academy, 050711 Bucharest, Romania
* Correspondence: [email protected]
Abstract: Neuropsychiatric disorders affect individuals cognitively, emotionally, and behaviorally, degrade their quality of life, and can even lead to death. Beyond the medical field, these diseases have also become a subject of investigation in Artificial Intelligence, especially in Natural Language Processing (NLP) and Computer Vision. Using NLP techniques to understand medical symptoms eases the identification of language-related aspects of neuropsychiatric conditions, leading to better diagnosis and treatment options. This survey traces the evolution of the detection of linguistic markers specific to a series of neuropsychiatric disorders and symptoms. For each disease or symptom, the article presents a medical description, specific linguistic markers, the results obtained with these markers, and datasets. Furthermore, this paper offers a critical analysis of the work undertaken to date and suggests potential directions for future research in the field.
Keywords: neuropsychiatric disorders; depression; dementia; hallucinations; linguistic markers;
natural language processing; artificial intelligence
1. Introduction
In recent years, advances in Artificial Intelligence (AI) have been seen in different areas of medicine, such as oncology [1], cardiology [2], endocrinology [3], neurology, and psychiatry [4,5]. Neuropsychiatric disorders are a challenge faced by more and more people. These conditions include both mental health problems (e.g., depression, anxiety, and schizophrenia) and neurological diseases (e.g., Alzheimer’s disease, Parkinson’s disease, and epilepsy) [6]. One challenge in detecting and understanding these disorders is the complexity of the symptoms, which vary from patient to patient and overlap between certain diseases. Problems related to neuropsychiatric conditions are encountered more and more often, especially in certain contexts (e.g., epidemics) or for groups exposed to certain risk factors (e.g., low income) [7]. A meta-analysis [8] found that the first mental disorder emerges before the age of 14 in over a third of cases (34.6%) and before the age of 25 in almost two-thirds of cases (62.5%). Therefore, particularly for this group of disorders, early detection has a significant impact: timely treatment slows the worsening of symptoms and ensures that patients receive the support they need.
One method for discovering new and less obvious symptoms of neuropsychiatric disorders is to study patients’ language, focusing on clues that are unnoticeable by humans (e.g., the presence or high frequency of specific words, syntactic density, or grammatical complexity) [9,10]. To find these differences between healthy people and those suffering from certain neuropsychiatric disorders, their speech may be analyzed using AI Natural Language Processing (NLP) methods. This paper presents some of this work and analyzes its approaches.
In recent years, NLP systems have used several Machine Learning (ML) techniques, especially Deep Artificial Neural Networks (DANNs), which perform very well and can be used to analyze patients’ utterances in neuropsychiatric dialogues [11]. However, people need to trust the decisions made by ML models, particularly in the medical field. Currently, Transformer-based models [12] have the best performance; however, their results are purely empirical, which can cause classifications to rely on superficial or invalid criteria [13]. Analyzing patients’ conversations and finding patterns in data with AI tools should therefore also allow the interpretation of the results provided by DANNs (which can be seen as black-box models), helping people to have more trust in AI’s contributions to medicine. An online study (N = 415) that measured people’s trust in the involvement of AI in medicine, based on questionnaires referring to medical scenarios, demonstrated that people still have more trust in a doctor than in an AI model [14]. Linguistic markers are characteristics or traits of text or speech that can provide information about the speaker; they can be divided into several categories, for example, grammatical markers or lexical-semantic markers [15]. If such human-understandable markers were provided to assist the diagnosis of a patient, the interaction between ML-based AI systems and doctors would face fewer challenges, and patients would be more open to considering indications coming from AI.
To give the reader a clear and systematic picture, Figure 1 shows a concept map summarizing the main topics discussed in this work and their relations.
Figure 1. Concept map of the main topics and their relations discussed in the paper.
2. Materials and Methods
A formal literature search was conducted on Google Scholar from 29 August 2023 to 7 September 2023. The search terms were the following: (“depression” OR “dementia” OR “Alzheimer’s disease” OR “hallucinations”) AND (“linguistic markers” OR “linguistic analysis” OR “linguistic style”). Several inclusion and exclusion criteria were used in this study. Firstly, only papers published in 2015 or later were considered, in order to cover the recent years in which ML and especially DANN architectures dramatically increased the performance of the implemented systems. Another screening criterion was the domain: this research exclusively incorporated papers related to Computer Science, i.e., papers addressing neuropsychiatric linguistic analysis through an AI-related approach; studies originating from other domains, such as Medicine, were not taken into account. The final criterion was the language of publication: only documents available in English were included, in order to avoid the complications linked to translation. Subsequently, the eligibility of the selected papers was assessed first from their abstracts, after which the full papers were subjected to a detailed review. This process guaranteed that only papers focused on the linguistic analysis of the mentioned disorders or symptoms were examined. Regarding the datasets, in addition to those utilized in the selected publications, datasets found using the following terms were also selected: (“depression” OR “dementia” OR “Alzheimer’s disease” OR “hallucinations”) AND (“dataset” OR “corpus”).
3. Medical Descriptions
This section provides a description of the clinical characteristics, symptoms, and impacts of depression and of the neurocognitive disorder (NCD) dementia. Among NCDs, Alzheimer’s disease (AD), the most common form of dementia, is studied in particular. Moreover, hallucinations, a symptom of several mental diseases, are analyzed. In addition, a comparison is drawn between hallucinations produced by humans and those produced artificially by Large Language Models (LLMs, i.e., DANNs trained on a huge number of texts) [16]. A deep understanding of the medical symptoms of neuropsychiatric diseases is relevant for the effective application of NLP in studies. Knowing the medical symptoms of the disorders can help with finding associations between certain symptoms and patterns in speech or text. For instance, if two diseases share a symptom, it could be useful to search for the same linguistic features associated with that symptom for both diseases.
3.1. Depression
Depression, medically known as major depressive disorder (MDD) [17], is the core condition in the class of depressive disorders. It can be seen in several forms, from medication-induced depressive disorder to dysthymia (persistent depressive disorder), and the disease is marked by distinct episodes lasting at least two weeks [17]. The criteria on which the diagnosis of this disease is based are the following: depressed mood (from feeling hopeless to feeling irritated, especially in adolescents and children), deeply reduced enjoyment of activities, notable changes in appetite and weight, daily sleep problems (i.e., insomnia or hypersomnia), overwhelming fatigue, feelings of worthlessness or guilt (in some cases even delusional thoughts), indecisiveness or trouble concentrating, and even suicidal thoughts [17]. Depression’s evolution or appearance can be influenced by various risk factors such as temperament (i.e., neurotic people have a tendency towards anxiety, anger, and emotional instability [18]), environment (i.e., shocking events, especially in childhood), and genetics.
3.2. Dementia and Alzheimer’s Disease
Dementia involves conditions wherein the main problem affects cognitive functions that were acquired rather than present from birth [19]. These conditions affect people over a certain age: at the age of 65, the general prevalence of dementia is approximately 1–2%, and by the age of 85, it is 30% [17]. Dementia is a general term that refers to a series of diseases having various early symptoms depending on which area of the brain is affected [20]. Because in the majority of AD cases the first part of the brain affected is the hippocampus, the patient initially has problems remembering facts from the recent past. After that, if the amygdala is affected, the person refers to memories more from an emotional point of view than a factual one. As AD progresses, its damage extends to various brain areas and lobes, resulting in a thinner cortex and overall brain shrinkage. Impairment of the left hemisphere leads to issues with semantic memory and language, causing difficulty with word retrieval. Damage to the visual system in the temporal lobes hampers face and object recognition, although auditory recognition might still be possible. Right parietal lobe damage affects spatial judgment (e.g., tasks like navigating stairs). Frontal lobe damage results in challenges with decision making, planning, and organizing complex tasks. Despite these losses, long-acquired abilities like procedural memories (e.g., dancing or playing the piano) tend to remain intact, even in advanced stages of AD [20]. Besides AD, there are other forms of dementia, such as vascular dementia, frontotemporal dementia, Lewy body dementia [21], and dementia caused by other diseases (e.g., Huntington’s disease or Parkinson’s disease) [22].
3.3. Hallucinations
Hallucinations are a symptom present in a variety of diseases, from mental health conditions such as psychotic depression to AD; however, they are most often found in conditions within the schizophrenia spectrum [23]. This symptom manifests as vivid experiences resembling perceptions but arising without external triggers [17]. Hallucinations can affect all the senses; however, a survey conducted on 10,448 participants showed that the most frequent are auditory hallucinations (29.5%) (e.g., voices, laughing, and crying), followed by visual hallucinations (21.5%) (e.g., shadows and people moving), tactile hallucinations (19.9%) (e.g., being touched and formication), and olfactory hallucinations (17.3%) (e.g., fire, food, and drinks) [24]. Besides these types, there are also gustatory hallucinations (e.g., a metallic taste), presence hallucinations (i.e., the feeling that someone is present in the room or behind the subject), and proprioceptive hallucinations (i.e., the feeling that the body is moving) [23].
4. State of the Art
This section provides an overview of the current state of the art in utilizing AI for analyzing neuropsychiatric disorders and their symptoms. The section is structured as follows. The first subsection presents NLP techniques used to understand linguistic signs in conversations about the disorders, highlighting the use of sentiment analysis, topic modeling, and patterns concerning depression, dementia, and hallucinations. In the second subsection, we examine the distinctive linguistic markers associated with each of the diseases; additionally, the differences in linguistic markers between human- and LLM-generated hallucinations are illustrated. The last subsection examines datasets in order to offer insights into selecting suitable resources for NLP-based analysis of neuropsychiatric disorders.
4.1. NLP Techniques
In the field of NLP, a variety of techniques and tools have been employed to investigate linguistic markers associated with neuropsychiatric disorders, providing valuable insights from the textual data of individuals with these conditions. This section introduces the techniques and tools used in state-of-the-art works; the studies themselves are presented in more detail in Section 4.2. Sentiment analysis is a fundamental method used to evaluate the emotional tone of text or speech. One of the main approaches to sentiment analysis is the use of lexicons created specifically for this task. Linguistic Inquiry and Word Count (LIWC) [25] is a lexicon-based tool used by researchers for extracting emotional and psychological dimensions; it was used to extract features for predicting depression and anxiety from therapy sessions [26] and for the detection of Reddit posts related to depression [27,28]. There are also sentiment lexicons specialized in positivity, negativity, and neutrality scores, such as SentiWordNet [29] and VADER [30]. The former was utilized by Titla-Tlatelpa et al. [31] for extracting the polarity of posts in order to create a user profile. Moreover, lexicons designed for specific linguistic markers can be created: for instance, the Behaviour Activation lexicon [32].
Topics of discussion represent indicators of certain mental disorders, and they can be identified by selecting keywords. One often-utilized method for this task is Term Frequency–Inverse Document Frequency (TF-IDF) [33], which measures the importance of words within a corpus. A smoothed version of TF-IDF was used by Wolohan et al. [28], who combined it with LIWC and n-grams (sequences of n words) in order to capture word sequences and patterns. Another topic modeling algorithm is Latent Dirichlet Allocation (LDA) [34]; an example of using this method is illustrated in the work of Tadesse et al. [27], and a sketch is given below. Furthermore, tools such as KHCoder [35] can be utilized for plotting co-occurrence networks or other statistics from texts. The results of part-of-speech (POS) tagging tasks [36] are also relevant markers for neuropsychiatric disorders. For certain disorders (e.g., depression), the tense of verbs is an important clue, and tools such as the Stanford University Time (SUTime) temporal tagging system [37] can be used for analyzing the tenses.
4.2. Linguistic Markers
4.2.1. Depression
Understanding how language can reveal insights about depression has become an area of growing interest, marked by evolving findings and methodologies. There is an established association between self-centeredness and depression [38], and this preoccupation with the self is also reflected in linguistic patterns [27]. In the meta-analysis conducted by Tølbøll [39], 26 papers published between 2004 and 2019 were examined to study the link between the existence and severity of depression and first-person singular pronouns (e.g., ‘I’ and ‘me’), positive emotion words, and negative emotion words. The conclusions related to the usage of first-person singular pronouns and depression indicated a medium effect (Cohen’s d of 0.44) and a positive correlation (Pearson’s r of 0.19). One study analyzed Reddit posts from 12,106 users and reconfirmed the link between first-person singular pronouns and depression [28]. Furthermore, the authors found that individuals experiencing depression used more dialogue terms in their posts, specifically addressing the readers using second-person pronouns (e.g., “you”) and writing the posts as if talking directly to them. In addition to the linguistic markers discovered, Wolohan et al. [28] created a depression classification system that performed best using LIWC features and n-grams and achieved an F1 score of 0.729. Burkhardt et al. [26] evaluated therapy sessions on Talkspace from over 6500 unique patients and reported correlations with both first-person singular and plural pronouns, a conclusion that has also been validated in other research [32] for singular but not plural forms.
Regarding POS tagging, we analyzed the Distress Analysis Interview Corpus/Wizard-of-Oz (DAIC-WOZ) dataset [40] from the University of Southern California and concluded that individuals suffering from depression utilized fewer prepositions, conjunctions, and singular proper nouns. Regarding verb tenses, depressed individuals also tend to use more verbs in gerund or past participle form [41]. Moreover, there are studies supporting future-oriented language as an indicator of depression [42]. Using SUTime, the authors discovered that depressed participants refer to future events more distally and think more about the past and future than about the present. The researchers created an FTR (Future Time Reference) classifier that offers more information about the modality of verbs and achieved an F score of over 0.99 on a Reddit post classification task.
Some emotions are found more often in people with certain mental disorders. Tølbøll [39] discovered a strong effect (Cohen’s d of 0.72) between depression and negative emotions and a negative correlation (Pearson’s r of −0.21) between the disease and the usage of positive words; they also confirmed the correlation between negative emotions and depression for the analyzed conversations. Burkhardt et al. [26] extracted 49 emotion-related features using both LIWC and the GoEmotions dataset [43]. The authors measured the explanatory impact of each feature by the amount of variance explained, R² (i.e., the variability of a dependent variable that is explained by an independent variable in a regression model), and the top LIWC features for depression had values in the interval [0.716, 0.729]. With the first tool, sadness, anger, anxiety, and both negative and positive emotions were identified as indicators of the mental disease. The most relevant emotions for depression are grief, less pride, less excitement, relief, disgust [26], and fear [41]. These were confirmed as well in the work of Tadesse et al. [27], which additionally highlighted words of anger and hostility (e.g., hate), suicidal thoughts (e.g., stop-stop, want die), interpersonal processes (e.g., feel alone, lonely), and cues of meaninglessness (e.g., empty, pointless) and hopelessness (e.g., end, need help). Moreover, the usage of absolutist words (e.g., all, always, never) is a marker for depression and its symptoms of suicidal ideation and anxiety [44].
The topics addressed in discussions can be indicators of depression. One method to acquire the topics is to select the 100 most-used words with TF-IDF and divide them into categories. Using this methodology, Wolohan et al. [28] concluded that people suffering from depression more often discuss therapy (e.g., psychiatrist) and medications (e.g., Prozac), or Reddit, manga, and video games (e.g., Goku). By developing a new lexicon, Burkhardt et al. [26] found that depressed individuals more frequently approach subjects from the biology and health categories, and that individuals with severe depression talk less about activities and associate them less with positive feelings (e.g., enjoyment, reward). Using LIWC, Tadesse et al. [27] detected correlations (with Pearson’s r coefficients in the interval [0.11, 0.19]) between depressed people and psychological processes such as social processes (e.g., mate, talk), affective processes (e.g., cry, hate), and cognitive processes (e.g., know, think), as well as personal concerns such as work, money, and death. By analyzing depression-related text with LDA, Tadesse et al. [27] selected the top 20 most frequent topics, combined the extracted features with LIWC, bigrams, and an MLP, and obtained an F1 score of 0.93 on a Reddit post classification task. The authors of [41] reconfirmed the job topic but added keywords such as tired, friends, and broke, as well as sleep and help. In their study, they used the KHCoder tool to identify the topics of the interviews using co-occurrence networks and concluded that, in terms of relationships, depressed people talked more about the child–parent relationship, while the control group talked more about friends and family; in terms of jobs, the first category referred more to finding a job, while the second referred to a dream job. Another approach is to take into consideration the profile (i.e., gender and age) of the speaker when analyzing the text for depression, using age-based and gender-based classifiers [31]. With this methodology, the authors revealed differences between depressed and non-depressed users per category (e.g., the word calories used in negative contexts can be a marker for depression in young females, while drunk can be a marker for senior male users).
4.2.2. Dementia and Alzheimer’s Disease
Although the field is only at its beginning, lexical–semantic and acoustic metrics show promising results as digital voice biomarkers for AD [45]. Automating the analysis of vocal tasks like semantic verbal fluency and storytelling provides an undemanding method to support early patient characterization. Some extracted speech features have unique patterns (e.g., those related to tone and rhythm), which could offer a clear way to tell whether someone has depression or mild cognitive problems [46]. Patients with AD have deficits in using verbs and nouns [47], especially verbs, during arguments [48]. Using only information from POS tagging, features such as readability, propositional density, and content density can be extracted, and these show promising results for AD classification tasks; a sketch of one such feature is given at the end of this subsection. For instance, Guerrero et al. [49] achieved an accuracy of 0.817 on the Pitt corpus by using a Random Forest (RF) model with, as input, a fusion of features extracted from grammar characteristics, TF-IDF, and Word2Vec (W2V). Eyigoz et al. [50] predicted the onset of AD by analyzing linguistic characteristics. One of their conclusions was that participants who would later be diagnosed with AD exhibited telegraphic speech, writing mistakes, and more repetitions. Telegraphic speech is condensed and contains only the essential words (e.g., nouns and verbs), with the connective parts of speech (e.g., articles or adverbs) omitted. Another characteristic of AD speech was referential specificity: a semantic feature by which unique nouns (e.g., proper names) are differentiated from general nouns. More studies support the idea that one of the earliest linguistic signs is semantic impairment [51,52]. Karlekar et al. [53] identified clusters specific to this disease: namely, clarification questions (e.g., ‘Did I say elephant?’), outbursts in speaking and brief answers (e.g., ’oh!’, ‘yes’), and statements starting with interjections (e.g., ‘Well . . . ’, ‘Oh . . . ’). An accuracy of 0.911 was obtained by the same researchers [53] in an experiment using POS-tagged utterances and a CNN-RNN model. In the case of dementia and AD, the results can be improved by combining linguistic markers with biomarkers or with features extracted using Computer Vision (CV). For instance, Koyama et al. [54] highlighted the role of peripheral inflammatory markers in dementia and AD and found links between increased levels of C-reactive protein or interleukin-6 and dementia. With CV, neuroimaging techniques can be utilized to detect changes in the brain that are signs of AD or mild cognitive impairment (MCI), such as increased grey matter atrophy or hyperactivation within the hippocampus memory network [55].
4.2.3. Hallucinations
Hallucinations from People with Neuropsychiatric Disorders
Hallucinations are a complex phenomenon that can manifest in a unique way from person to person. This symptom, especially in its auditory form, is difficult to detect, particularly its moment of onset, but detection is still possible using mobile data [56], dictation devices [57], or auditory verbal recognition tasks [58]. According to a review [59], hallucinations are influenced by cultural aspects such as religion, race, or environment (e.g., magical delusions exhibited a high frequency in rural areas). Gender is not a factor for auditory hallucinations, but female patients reported experiencing olfactory, tactile, and gustatory hallucinations more frequently [60].
In a study of the Dutch language [61], researchers compared the auditory verbal hallucinations of clinical (i.e., diagnosed with schizophrenia, bipolar disorder, or psychotic disorder) and non-clinical participants and observed that the hallucinations of the first category were more negative (34.7% vs. 18.4%); this aspect was also confirmed by [9]. The latter identified the most frequently encountered semantic classes in auditory hallucinations in Twitter posts, with the top three being abusive language (e.g., hell), relatives (e.g., son), and religious terms (e.g., prayer), followed by semantic classes related to the sense of hearing (e.g., audio recording, audio and visual media, audio device, or communication tools). Another observation is that tweets containing auditory hallucinations were proportionally more frequent between 11 p.m. and 5 a.m. than other tweets. Using a set of 100 semantic features, the authors of [9] classified whether a Twitter post was related to auditory hallucinations and, with a Naive Bayes (NB) model, reached an AUC of 0.889 and an F2 score of 0.831; the baseline value was 0.711. In this study, the leave-one-out technique showed that the best results were obtained when lexical distribution features were excluded (i.e., an AUC of 0.889 and an F2 score of 0.833).
Artificial Hallucinations from ML Models
This subsection presents specific contexts (e.g., tasks or topics of discussion) in which hallucinations were not emitted by humans but were generated by AI text-generation systems based on LLMs. The models in the studies presented here cover a range of representative DANN models, such as Generative Pre-trained Transformer models (e.g., GPT-2, GPT-4, and ChatGPT) or Transformer-based multimodal models (e.g., VLP and LXMERT-GDSE). Image captioning is a task in which models may hallucinate; for example, Testoni and Bernardi [62] used the GuessWhat?! game (in which one player must guess a target object by asking the other player binary questions about an image) to force the models to produce hallucinations. The majority of hallucinations manifested in consecutive turns, leading to hypotheses such as previous triggering words and the cascade effect (i.e., the amplification of hallucinations) [62–65]; these phenomena are not present in human hallucinations. However, the models can detect that they are wrong: ChatGPT [66] detects 67.37% of cases and GPT-4 detects 87.03% [65]. Another difference is that in these experiments the hallucinations appeared more frequently after negative responses, which is not the case in human dialogues.
Dziri et al. [63] tried to discover the origin of hallucinations in conversational models based on Verbal Response Modes (VRM) [67] and found that the most effective strategies for creating hallucinations were disclosure (i.e., sharing subjective opinions) and edification (i.e., providing objective information). The same researchers [63] also studied the level of amplification of hallucinations and concluded that, for instance, GPT-2 amplifies full hallucinations by 19.2% on the Wizard of Wikipedia (WOW) dataset. Alkaissi and McFarlane [68] tested ChatGPT for scientific writing, and the model generated nonexistent paper titles with unrelated PubMed IDs, i.e., artificial hallucinations [69] regarding medical subjects. Self-contradiction is a form of hallucination that can appear in both human and LLM-generated hallucinations; for the latter, there are algorithms for their evaluation, detection, and mitigation [70]. The authors created a test covering 30 subjects (e.g., Angela Merkel and Mikhail Bulgakov) for the models and, for the detection task, achieved F1 scores of up to 0.865.
An overview of each research study presented in Sections 4.1 and 4.2, in chronological order and grouped by medical condition, is shown in Tables 1–4 below.
Table 1. Overview of the linguistic markers for depression extracted in the selected papers. Source: Own work.

Dataset Source | Data Type | Linguistic Markers or Features | Tools and Techniques | Year | Ref.
Reddit | Reddit posts | N-grams, topics, psychological and personal concern process features | N-grams, LDA, LIWC | 2019 | [27]
Reddit | Reddit posts | N-grams, topics, grammatical features, emotions | N-grams, smoothed TF-IDF, LIWC | 2019 | [28]
Reddit and Twitter | Social media posts | Polarity, gender, age, BoW/BoP representations | Bag of Words (BoW), Bag of Polarities (BoP), SentiWordNet | 2021 | [31]
Talkspace | Messaging therapy sessions | Grammatical features, topics, and emotions | LIWC, GoEmotions | 2022 | [26]
Reddit | Reddit posts | Temporal features, modal semantics | SUTime | 2022 | [42]
Public forums | Forum posts | Absolutist index, LIWC features | LIWC, absolutist dictionary | 2022 | [44]
DAIC-WOZ | Clinical interviews | POS tagging, grammatical features, topics, and emotions | NLTK, NRCLex, TextBlob, pyConverse, KHCoder | 2023 | [41]
Table 2. Overview of the linguistic markers for dementia extracted in the selected papers. Source: Own work.

Dataset Source | Data Type | Linguistic Markers or Features | Tools and Techniques | Year | Ref.
Public blogs | Posts from public blogs | Context-free grammar features, POS tagging, syntactic complexity, psycholinguistic features, vocabulary richness, repetitiveness | Stanford Tagger, Stanford Parser, L2 Syntactic Complexity Analyzer | 2017 | [10]
Pitt Corpus—DementiaBank | Cookie Theft picture description task | Grammatical features, POS tagging | Activation clustering, first-derivative saliency heat maps | 2018 | [53]
Pitt Corpus—DementiaBank | Cookie Theft picture description task | Word embeddings, grammatical features, POS tagging | Word2Vec, TF-IDF | 2020 | [49]
FHS study | Cookie Theft picture description task | Word embeddings, grammatical features, POS tagging | GloVe, NLTK | 2020 | [50]
4.3. Relevant Datasets
This subsection presents an overview of the relevant datasets used in state-of-the-art
works studying the mentioned neuropsychiatric disorders. These datasets are utilized
both for the detection of a disorder and for the extraction of linguistic markers specific
to the disease. The data can be obtained by web scraping (e.g., social media posts),
artificially (e.g., content generated with an LLM following a pattern), or from medical
sources (e.g., dialogues between a patient and a doctor). Ideally, the data should also be
gathered over a period of time (e.g., interviewing a patient periodically over five years),
which allows early detection and the evolution of symptoms to be studied.
Table 3. Overview of the linguistic markers for hallucinations from people extracted in the selected papers. Source: Own work.

Dataset Source | Data Type | Linguistic Markers or Features | Tools and Techniques | Year | Ref.
Twitter | Twitter posts | Semantic classes, POS tagging, use of nonstandard language, polarity, key phrases, semantic and lexical features | TweetNLP tagger, MySpell | 2016 | [9]
Clinical study | Audio reports from sleep onset and REM and non-REM sleep | Grammatical features | Measure of Hallucinatory States (MHS) | 2017 | [57]
"Do I see ghosts?" Dutch study | Auditory verbal recognition task | Age, gender, education, and the presence of visual, tactile, and olfactory hallucinations | IBM SPSS Statistics | 2017 | [57]
Clinical study | Electronic health records (EHRs) | Age, gender, race, NLP symptoms | Clinical Record Interactive Search (CRIS) | 2020 | [60]
Clinical study | Recordings of participants' hallucinations | Grammatical features, emotions, POS tagging | CLAN software, Pattern Python package, Dutch lexicons | 2022 | [61]
Clinical study | Audio diary by mobile phone with periodic pop-ups asking about the hallucinations | Word embeddings | VGGish model, BERT, ROCKET | 2023 | [56]
Table 4. Overview of the linguistic markers for artificial hallucinations extracted in the selected papers. Source: Own work.

Dataset Source | Data Type | Linguistic Markers or Features | Tools and Techniques | Year | Ref.
500 randomly selected images | Image captioning task | CHAIR metrics—CHAIR-i and CHAIR-s, METEOR, CIDEr, SPICE | MSCOCO annotations, FC model, LRCN, Att2In, TopDown, TopDown-BB, Neural Baby Talk (NBT) | 2018 | [64]
GuessWhat?! game | Utterances from GuessWhat?! game | CHAIR metrics—CHAIR-i and CHAIR-s, analysis of hallucinations | MSCOCO annotations, BL, GDSE, LXMERT-GDSE, VLP | 2021 | [62]
Wizard of Wikipedia (WOW), CMUDOG, TOPICALCHAT | Dialogues between two speakers | Hallucination rate, entailment rate, Verbal Response Modes (VRMs) | GPT2, DoHA, CTRL | 2022 | [63]
3 new datasets consisting of yes/no questions | QA task answers | Snowballing of hallucinations, hallucination detection, LM (in)consistency | ChatGPT, GPT-4 | 2023 | [65]
Dataset consisting of generated encyclopedic text descriptions for Wikipedia topics | Description task | Average no. of sentences, perplexity, self-contradictory features | ChatGPT, GPT-4, Llama2-70B-Chat, Vicuna-13B | 2023 | [70]
4.3.1. Depression
In our depression study [41], we used the DAIC-WOZ dataset, a corpus containing
conversations between an agent named Ellie and 189 participants: 133 non-depressed and
56 depressed. The agent is human-controlled and operates based on a predefined set of
questions. The participants are labeled using the Patient Health Questionnaire-8 (PHQ-8),
and for each entry the database contains an audio recording and transcript of the
conversation, the responses to the PHQ-8, the gender of the participant, and metadata
(i.e., non-verbal and verbal features). To minimize the effects of dataset imbalance, we
created an additional subset of similar conversations of depressed patients using ChatGPT.
Depression-related challenges are another source of datasets; for instance, DepreSym is a
corpus created from the eRisk 2023 Lab.
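As an illustration of how such labels are typically derived, the sketch below maps PHQ-8 totals to a binary depression label using the conventional cutoff of 10; the file and column names are assumptions for illustration, not the actual DAIC-WOZ schema.

```python
# Minimal sketch: binary depression labels from PHQ-8 totals, as used for
# DAIC-WOZ-style corpora. File and column names are hypothetical.
import csv

PHQ8_CUTOFF = 10  # conventional threshold: total score >= 10 => depressed

def label_participants(path):
    labels = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row["participant_id"]] = int(int(row["phq8_total"]) >= PHQ8_CUTOFF)
    return labels

# labels = label_participants("phq8_scores.csv")  # hypothetical export
```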
Medical dialogues can also be retrieved through online platforms, such as those specialized
in therapy sessions (e.g., Talkspace) [26], or forums [44]. Extracting social media posts is a
popular method for constructing new corpora; for instance, Shen et al. [71] developed a
dataset from Twitter posts with three subsets: depression, non-depression, and depression
candidate. Several researchers [31,72] have also used Twitter as a source for their data.
Another social platform from which data are collected is Reddit [27,28,73].
4.3.2. Dementia and Alzheimer's Disease
One method to create a dataset for dementia or AD is through tasks designed to emphasize
the particular symptoms of these conditions, such as the "Boston Cookie Theft" task (in
which participants are asked to describe a given picture) or a recall test (in which
participants are asked to recall attributes of a previously shown story or picture).
DementiaBank [74] is a database of corpora containing video, audio, and transcribed
materials for AD, Mild Cognitive Impairment (MCI), Primary Progressive Aphasia (PPA),
and dementia in multiple languages (e.g., English, German, and Mandarin).
The Framingham Heart Study (FHS) is a study that started in 1948 and continues to this
day. Its aim is to discover factors that play a role in the development of cardiovascular
disease (CVD), but it also contains recordings of participants suffering from conditions
such as AD, MCI, or dementia. Researchers have used the data from this study to detect
linguistic markers that can be utilized for the early prediction of the previously mentioned
diseases [50,75]. Dementia and AD can also be studied in an online environment, such as
blog posts. For instance, Masrani et al. [10] created the Dementia Blog Corpus by scraping
2805 posts from 6 public blogs, and the authors of [76] studied dementia using data
from Twitter.
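To give a flavor of the surface features such studies extract from picture-description transcripts, the sketch below computes a POS distribution and a type-token ratio with NLTK (a toolkit that [50] also used); it is an illustration, not the pipeline of any single cited work.

```python
# Minimal sketch: simple linguistic markers (POS distribution, type-token
# ratio) from a picture-description transcript. Illustrative only.
from collections import Counter
import nltk

for res in ("punkt", "punkt_tab",
            "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)  # resource names vary across NLTK versions

def surface_markers(transcript):
    tokens = [t.lower() for t in nltk.word_tokenize(transcript) if t.isalpha()]
    pos_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))
    n = len(tokens)
    return {
        "type_token_ratio": len(set(tokens)) / n if n else 0.0,
        "pos_distribution": {tag: c / n for tag, c in pos_counts.items()},
    }

print(surface_markers("The boy reaches for the cookie jar while the stool tips over."))
```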
4.3.3. Hallucinations
One of the signs of the presence of hallucinations in speech is the unreliability of the
facts presented in the conversation. To highlight this sign, Liu et al. [77] created HaDeS
(HAllucination DEtection dataSet), a corpus built by perturbing raw texts from the
WIKI-40B [78] dataset using BERT [79] and then checking the validity of the resulting
hallucinations with human annotators. The authors of [80] studied the correlations between
hallucinations and psychological experiences using a dataset containing 10,933 narratives
from patients diagnosed with mental illnesses (e.g., schizophrenia or obsessive-compulsive
disorder); the data had been previously collected by the authors [81].
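At a high level, the perturb-with-BERT construction can be sketched as follows, assuming the Hugging Face fill-mask pipeline; the real HaDeS procedure [77] is more involved and relies on human annotation of the resulting perturbations.

```python
# Minimal sketch of the HaDeS-style idea [77]: mask a span of reference
# text and let BERT propose infills; infills unsupported by the source
# are hallucination candidates to be checked by annotators.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

reference = "the eiffel tower was completed in 1889 in paris."
masked = "The Eiffel Tower was completed in 1889 in [MASK]."

for cand in fill(masked, top_k=5):
    # Naive support check against the reference text (illustration only).
    supported = cand["token_str"] in reference
    print(cand["sequence"], "->", "faithful" if supported else "hallucination candidate")
```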
Artificial hallucinations are usually elicited from conversational agents by using certain
games [62] or by addressing sensitive subjects such as religion, politics, or conspiracy
theories. The Medical Domain Hallucination Test (Med-HALT) [82] is a collection of seven
datasets containing hallucinations from LLMs in the medical field. The datasets are based
on two tasks: the Reasoning Hallucination Test (RHT), in which the model has to choose
among the answer options for a question, and the Memory Hallucination Test (MHT), in
which the model has to retrieve information from a given input. The data utilized as input
for the models are questions from medical exams (e.g., the residency examination from
Spain and the United States Medical Licensing Examination (USMLE)) and PubMed.
5. Discussion and Challenges
A key area for the improvement of the discussed approaches involves the expansion and
refinement of existing datasets and the development of new corpora; for instance, more
emphasis should be placed on collecting data periodically over longer periods of time in
order to study the evolution of diseases and to find the most relevant linguistic symptoms.
Additionally, building diverse datasets covering various demographic groups and different
stages of these disorders could improve the results. Integrating multimodal approaches
that combine linguistic markers with medical imaging or other biological signals could
offer a more comprehensive understanding of these disorders. Correlating linguistic
patterns with physiological and visual data may improve the accuracy of early diagnosis
and prediction.
Considering that ethics is indispensable in a project using data from people, especially
data as sensitive as those from patients suffering from neuropsychiatric disorders, various
aspects such as algorithmic fairness, biases, data privacy, informed consent, safety, and
transparency [83] have to be taken into account for a project to be ethically valid. Fulfilling
all these conditions can create difficulties, such as non-cooperation and lack of patient
consent for the collection of new data, or legal challenges that require the involvement of
legal professionals. Another problem is the limited access to such data; for example, a
significant part of medical datasets are accessible only to researchers affiliated with certain
universities or holding certain citizenship.
Another aspect is the interpretability of the results. Especially in the medical field, each
diagnosis offered by a model should be justified and explained; the Explainable Artificial
Intelligence (XAI) [84] domain is still at an early stage of development. DANN models
perform better than classic ML models (e.g., SVM, RF, and NB), yet they have the
disadvantage of a black-box nature; therefore, a trade-off between interpretability and
performance is still necessary [85]. An application based on a DANN model, particularly
in the medical field, should have the following characteristics: fairness (i.e., ensuring that
the model does not discriminate), accountability (i.e., deciding who is responsible for a
decision), transparency (i.e., interpretability and understandability of the model's
decisions), privacy, and robustness [86]. Meeting these criteria can pose challenges in
situations where data are scarce or originate exclusively from a specific category, such as
being restricted to more-developed countries. Lastly, future research should concentrate
on refining these linguistic markers and models to support real-time diagnostics, early
intervention, and treatment monitoring for neuropsychiatric disorders. Validation studies
in clinical settings are necessary to evaluate the reliability and generalizability of these
linguistic models.
The generalizability of the presented research findings can represent a challenge to the use
of AI in the medical field, especially in areas as subjective as mental or psychiatric illnesses.
A wrong generalization can be built in from the beginning by using data limited to certain
categories of people. For example, a study [87] performed on 94 adults demonstrated the
link between depression and demographic, clinical, and personality traits. Larøi et al. [88]
studied the influence that culture (i.e., multiple factors such as religion and political beliefs)
has on hallucinations. Taking these into account, a heterogeneous dataset that includes as
many different elements as possible would contribute to the discovery of linguistic
symptoms that are as general as possible. Another perspective from which this problem
can be viewed is that of the model. As mentioned, the models with the best performance
are based on DANNs; these types of models are prone to producing unreliable results
based on incorrect criteria if the training data are biased.
A fundamental theoretical problem of DANNs, which are now considered the best
approaches for NLP and were used in the research discussed herein, is that transformers,
and neural networks in general, are based on an empiricist paradigm. It should be
mentioned that, to obtain the best results, there is a need to integrate the empiricist
perspective with the nativist one, the latter being used in symbolic, knowledge-based AI.
These two paradigms correspond, in fact, to the two main opposing philosophical schools
of thought founded by Aristotle and Plato, respectively, with the latter also being
advocated by Chomsky [13].
6. Conclusions
This survey demonstrates the potential of NLP for identifying linguistic patterns related
to neuropsychiatric disorders. Advanced methods have identified specific linguistic traits
and offer promising results for the early recognition and treatment of these disorders. The
identified markers (e.g., specific emotions and verb tenses) linked to conditions such as
depression, dementia, or hallucinations represent cues that are sometimes undiscoverable
by conventional diagnosis methods. This interdisciplinary field, which combines linguistic
analysis, medical science, AI, and multimodal approaches, offers a promising direction for
future research and practical applications and may revolutionize early detection,
treatment, and care for neuropsychiatric disorders. However, despite these advancements,
further efforts are needed to enhance AI model accuracy and interpretability. Last, but
certainly not least, the very important ethical aspects need to be permanently considered,
and it should also be taken into account that AI ethics is now a major subject of discussion,
research, and regulation [89–91].
Author Contributions: Conceptualization, I.-R.Z. and S.T.-M.; methodology, I.-R.Z. and S.T.-M.;
validation, S.T.-M.; investigation, I.-R.Z. and S.T.-M.; resources, I.-R.Z. and S.T.-M.; data curation,
I.-R.Z. and S.T.-M.; writing—original draft preparation, I.-R.Z.; writing—review and editing, S.T.-M.;
supervision, S.T.-M. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data are contained within the article.
Acknowledgments: We would like to thank the authors of all datasets described in this paper for
making the data available to the community for research purposes.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Luchini, C.; Pea, A.; Scarpa, A. Artificial intelligence in oncology: Current applications and future perspectives. Br. J. Cancer 2021, 126, 4–9. [CrossRef]
2. Gupta, M.; Kunal, S.; Mp, G.; Gupta, A.; Yadav, R.K. Artificial intelligence in cardiology: The past, present and future. Indian Heart J. 2022, 74, 265–269. [CrossRef] [PubMed]
3. Giorgini, F.A.; Di Dalmazi, G.; Diciotti, S. Artificial intelligence in endocrinology: A comprehensive review. J. Endocrinol. Investig. 2023. [CrossRef]
4. Zhong, Y.; Chen, Y.; Zhang, Y.; Lyu, Y.; Yin, J.; Yujun, G. The Artificial intelligence large language models and neuropsychiatry practice and research ethic. Asian J. Psychiatry 2023, 84, 103577. [CrossRef] [PubMed]
5. Rainey, S.; Erden, Y.J. Correcting the brain? The convergence of neuroscience, neurotechnology, psychiatry, and artificial intelligence. Sci. Eng. Ethics 2020, 26, 2439–2454. [CrossRef] [PubMed]
6. World Health Organization (WHO). Available online: https://platform.who.int (accessed on 10 November 2023).
7. Leung, C.M.C.; Ho, M.K.; Bharwani, A.A.; Cogo-Moreira, H.; Wang, Y.; Chow, M.S.C.; Fan, X.; Galea, S.; Leung, G.M.; Ni, M.Y. Mental disorders following COVID-19 and other epidemics: A systematic review and meta-analysis. Transl. Psychiatry 2022, 12, 205. [CrossRef]
8. Solmi, M.; Radua, J.; Olivola, M.; Croce, E.; Soardo, L.; de Pablo, G.S.; Shin, J.I.; Kirkbride, J.B.; Jones, P.; Kim, J.H.; et al. Age at onset of mental disorders worldwide: Large-scale meta-analysis of 192 epidemiological studies. Mol. Psychiatry 2021, 27, 281–295. [CrossRef]
9. Belousov, M.; Dinev, M.; Morris, R.; Berry, N.; Bucci, S.; Nenadic, G. Mining Auditory Hallucinations from Unsolicited Twitter Posts. 2016. Available online: https://ep.liu.se/en/conference-article.aspx?series=&issue=128&Article_No=5 (accessed on 29 January 2024).
10. Masrani, V.; Murray, G.; Field, T.; Carenini, G. Detecting dementia through retrospective analysis of routine blog posts by bloggers with dementia. ACL Anthol. 2017. Available online: https://aclanthology.org/W17-2329/ (accessed on 29 January 2024).
11. Yoon, J.; Kang, C.; Kim, S.; Han, J. D-VLog: Multimodal vlog dataset for Depression Detection. Proc. AAAI Conf. Artif. Intell. 2022, 36, 12226–12234. [CrossRef]
12. Vaswani, A.; Shazeer, N.; Parmar, N. Attention Is All You Need. Part of Advances in Neural Information Processing Systems 30. 2017. Available online: https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 29 January 2024).
13. Ranaldi, L.; Pucci, G. Knowing Knowledge: Epistemological Study of knowledge in transformers. Appl. Sci. 2023, 13, 677. [CrossRef]
14. Yokoi, R.; Eguchi, Y.; Fujita, T.; Nakayachi, K. Artificial Intelligence Is Trusted Less than a Doctor in Medical Treatment Decisions: Influence of Perceived Care and Value Similarity. Int. J. Hum.-Comput. Interact. 2020, 37, 981–990. [CrossRef]
15. Kozhemyakova, E.A.; Petukhova, M.E.; Simulina, S.; Ivanova, A.M.; Zakharova, A. Linguistic markers of native speakers. In Proceedings of the International Conference "Topical Problems of Philology and Didactics: Interdisciplinary Approach in Humanities and Social Sciences" (TPHD 2018), 2019. [CrossRef]
16. Rawte, V.; Chakraborty, S.; Pathak, A.; Sarkar, A.; Tonmoy, S.M.T.I.; Chadha, A.; Sheth, A.P.; Das, A. The troubling emergence of hallucination in large language models—An extensive definition, quantification, and prescriptive remediations. arXiv 2023, arXiv:2310.04988.
17. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders (DSM-5-TR), 5th ed.; American Psychiatric Association: Washington, DC, USA, 2013.
18. Widiger, T.A.; Oltmanns, J.R. Neuroticism is a fundamental domain of personality with enormous public health implications. World Psychiatry Off. J. World Psychiatr. Assoc. (WPA) 2017, 16, 144–145. [CrossRef]
19. Dementia Australia. Available online: https://www.dementia.org.au (accessed on 11 November 2023).
20. Alzheimer's Society. Available online: https://www.alzheimers.org.uk (accessed on 11 November 2023).
21. Dementia UK. Available online: https://www.dementiauk.org (accessed on 11 November 2023).
22. Alzheimer's Association. Available online: https://www.alz.org (accessed on 11 November 2023).
23. Cleveland Clinic. Available online: https://my.clevelandclinic.org (accessed on 11 November 2023).
24. Linszen, M.M.J.; de Boer, J.N.; Schutte, M.J.L.; Begemann, M.J.H.; de Vries, J.; Koops, S.; Blom, R.E.; Bohlken, M.M.; Heringa, S.M.; Blom, J.D.; et al. Occurrence and phenomenology of hallucinations in the general population: A large online survey. Schizophrenia 2022, 8, 41. [CrossRef]
25. Tausczik, Y.R.; Pennebaker, J.W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 2010, 29, 24–54. [CrossRef]
26. Burkhardt, H.; Pullmann, M.; Hull, T.; Areán, P.; Cohen, T. Comparing emotion feature extraction approaches for predicting depression and anxiety. In Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, Seattle, WA, USA, July 2022. [CrossRef]
27. Tadesse, M.M.; Lin, H.; Xu, B.; Yang, L. Detection of depression-related posts in Reddit social media forum. IEEE Access 2019, 7, 44883–44893. [CrossRef]
28. Wolohan, J.; Hiraga, M.; Mukherjee, A.; Sayyed, Z.A.; Millard, M. Detecting linguistic traces of depression in topic-restricted text: Attending to self-stigmatized depression with NLP. In Proceedings of the First International Workshop on Language Cognition and Computational Models, Santa Fe, NM, USA, August 2018; pp. 11–21. Available online: https://aclanthology.org/W18-4102/ (accessed on 29 January 2024).
29. Baccianella, S.; Esuli, A.; Sebastiani, F. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. ACL Anthol. 2010, 10, 2200–2204. Available online: https://aclanthology.org/L10-1531/.
30. Hutto, C.; Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proc. Int. AAAI Conf. Web Soc. Media 2014, 8, 216–225. [CrossRef]
31. Titla-Tlatelpa, J.d.J.; Ortega-Mendoza, R.M.; Montes-y-Gómez, M.; Villaseñor-Pineda, L. A profile-based sentiment-aware approach for depression detection in social media. EPJ Data Sci. 2021, 10, 54. [CrossRef]
32. Burkhardt, H.A.; Alexopoulos, G.S.; Pullmann, M.D.; Hull, T.D.; Areán, P.A.; Cohen, T. Behavioral Activation and Depression Symptomatology: Longitudinal Assessment of Linguistic Indicators in Text-Based Therapy Sessions (Preprint); JMIR Publications Inc.: Toronto, ON, Canada, 2021. [CrossRef]
33. Jones, K.S. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21. [CrossRef]
34. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
35. Higuchi, K. A Two-Step Approach to Quantitative Content Analysis: KH Coder Tutorial Using Anne of Green Gables (Part II). Ritsumeikan Soc. Sci. Rev. 2017, 52, 77–91.
36. Toutanova, K.; Manning, C.D. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, 7–8 October 2000. [CrossRef]
37. Chang, A.X.; Manning, C.D. SUTIME: A library for recognizing and normalizing time expressions. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, 21–27 May 2012; pp. 3735–3740.
38. Wegemer, C.M. Selflessness, depression, and neuroticism: An interactionist perspective on the effects of self-transcendence, perspective-taking, and materialism. Front. Psychol. 2020, 11, 523950. [CrossRef] [PubMed]
39. Tølbøll, K.B. Linguistic features in depression: A meta-analysis. J. Lang. Work.–Sprogvidenskabeligt Stud. 2019, 4, 39–59.
40. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May 2014; pp. 3123–3128. Available online: https://aclanthology.org/L14-1421/ (accessed on 29 January 2024).
41. Zaman, I.-R.; Trausan-Matu, S.; Rebedea, T. Analysis of medical conversations for the detection of depression. In Proceedings of the 19th International Conference on Human-Computer Interaction—RoCHI 2023, 2023; pp. 15–22. Available online: http://rochi.utcluj.ro/articole/11/RoCHI2023-Zaman.pdf (accessed on 29 January 2024).
42. Robertson, C.; Carney, J.; Trudell, S. Language about the future on social media as a novel marker of anxiety and depression: A big-data and experimental analysis. Curr. Res. Behav. Sci. 2023, 4, 100104. [CrossRef] [PubMed]
43. Demszky, D.; Movshovitz-Attias, D.; Ko, J.; Cowen, A.; Nemade, G.; Ravi, S. GoEmotions: A dataset of fine-grained emotions. arXiv 2020, arXiv:2005.00547.
44. Al-Mosaiwi, M.; Johnstone, T. In an absolute state: Elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation. Clin. Psychol. Sci. 2018, 6, 529–542. [CrossRef] [PubMed]
45. Lanzi, A.M.; Saylor, A.K.; Fromm, D.; Liu, H.; MacWhinney, B.; Cohen, M.L. DementiaBank: Theoretical rationale, protocol, and illustrative analyses. Am. J. Speech-Lang. Pathol. 2023, 32, 426–438. [CrossRef] [PubMed]
46. Labbé, C.; König, A.; Lindsay, H.; Linz, N.; Tröger, J.; Robert, P. Dementia vs. Depression: New methods for differential diagnosis using automatic speech analysis. Alzheimer's Dement. J. Alzheimer's Assoc. 2022, 18, e064486. [CrossRef]
47. Almor, A.; Aronoff, J.M.; MacDonald, M.C.; Gonnerman, L.M.; Kempler, D.; Hintiryan, H.; Hayes, U.L.; Arunachalam, S.; Andersen, E.S. A common mechanism in verb and noun naming deficits in Alzheimer's patients. Brain Lang. 2009, 111, 8–19. [CrossRef]
48. Kim, M.; Thompson, C.K. Verb deficits in Alzheimer's disease and agrammatism: Implications for lexical organization. Brain Lang. 2004, 88, 1–20. [CrossRef]
49. Guerrero-Cristancho, J.; Vasquez, J.C.; Orozco, J.R. Word-Embeddings and grammar features to detect language disorders in Alzheimer's disease patients. Inst. Tecnol. Metrop. 2020, 23, 63–75. Available online: https://www.redalyc.org/journal/3442/344262603030/html/ (accessed on 29 January 2024). [CrossRef]
50. Eyigoz, E.; Mathur, S.; Santamaria, M.; Cecchi, G.; Naylor, M. Linguistic markers predict onset of Alzheimer's disease. EClinicalMedicine 2020, 28, 100583. [CrossRef]
51. Martin, A.; Fedio, P. Word production and comprehension in Alzheimer's disease: The breakdown of semantic knowledge. Brain Lang. 1983, 19, 124–141. [CrossRef] [PubMed]
52. Appell, J.; Kertesz, A.; Fisman, M. A study of language functioning in Alzheimer patients. Brain Lang. 1982, 17, 73–91. [CrossRef] [PubMed]
53. Karlekar, S.; Niu, T.; Bansal, M. Detecting linguistic characteristics of Alzheimer's dementia by interpreting neural models. ACL Anthol. 2018. Available online: https://aclanthology.org/N18-2110/ (accessed on 29 January 2024).
54. Koyama, A.; O'Brien, J.; Weuve, J.; Blacker, D.; Metti, A.L.; Yaffe, K. The role of peripheral inflammatory markers in dementia and Alzheimer's disease: A meta-analysis. J. Gerontol. Ser. Biol. Sci. Med. Sci. 2012, 68, 433–440. [CrossRef]
55. Ewers, M.; Sperling, R.A.; Klunk, W.E.; Weiner, M.W.; Hampel, H. Neuroimaging markers for the prediction and early diagnosis of Alzheimer's disease dementia. Trends Neurosci. 2011, 34, 430–442. [CrossRef]
56. Mirjafari, S.; Nepal, S.; Wang, W.; Campbell, A.T. Using mobile data and deep models to assess auditory verbal hallucinations. arXiv 2023, arXiv:2304.11049.
57. Speth, C.; Speth, J. A new measure of hallucinatory states and a discussion of REM sleep dreaming as a virtual laboratory for the rehearsal of embodied cognition. Cogn. Sci. 2017, 42, 311–333. [CrossRef]
58. de Boer, J.N.; Linszen, M.M.J.; de Vries, J.; Schutte, M.J.L.; Begemann, M.J.H.; Heringa, S.M.; Bohlken, M.M.; Hugdahl, K.; Aleman, A.; Wijnen, F.N.K.; et al. Auditory hallucinations, top-down processing and language perception: A general population study. Psychol. Med. 2019, 49, 2772–2780. [CrossRef]
59. Viswanath, B.; Chaturvedi, S.K. Cultural aspects of major mental disorders: A critical review from an Indian perspective. Indian J. Psychol. Med. 2012, 34, 306–312. [CrossRef]
60. Irving, J.; Colling, C.; Shetty, H.; Pritchard, M.; Stewart, R.; Fusar-Poli, P.; McGuire, P.; Patel, R. Gender differences in clinical presentation and illicit substance use during first episode psychosis: A natural language processing, electronic case register study. BMJ Open 2021, 11, e042949. [CrossRef] [PubMed]
61. de Boer, J.N.; Corona Hernández, H.; Gerritse, F.; Brederoo, S.G.; Wijnen, F.N.K.; Sommer, I.E. Negative content in auditory verbal hallucinations: A natural language processing approach. Cogn. Neuropsychiatry 2021, 27, 139–149. [CrossRef]
62. Testoni, A.; Bernardi, R. "I've seen things you people wouldn't believe": Hallucinating entities in GuessWhat?! In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, Online, August 2021. [CrossRef]
63. Dziri, N.; Milton, S.; Yu, M.; Zaiane, O.; Reddy, S. On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022. [CrossRef]
64. Rohrbach, A.; Hendricks, L.A.; Burns, K.; Darrell, T.; Saenko, K. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018. [CrossRef]
65. Zhang, M.; Press, O.; Merrill, W.; Liu, A.; Smith, N.A. How language model hallucinations can snowball. arXiv 2023, arXiv:2305.13534.
66. OpenAI. ChatGPT 2023. Available online: https://chat.openai.com/chat (accessed on 29 January 2024).
67. Stiles, W.B. Verbal response modes taxonomy. In The Cambridge Handbook of Group Interaction Analysis; Cambridge University Press: Cambridge, UK, 2019; pp. 630–638. [CrossRef]
68. Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. 2023. Available online: https://pubmed.ncbi.nlm.nih.gov/36811129/ (accessed on 29 January 2024).
69. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [CrossRef]
70. Mündler, N.; He, J.; Jenko, S.; Vechev, M. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv 2023, arXiv:2305.15852.
71. Shen, G.; Jia, J.; Nie, L.; Feng, F.; Zhang, C.; Hu, T.; Chua, T.-S.; Zhu, W. Depression detection via harvesting social media: A multimodal dictionary learning solution. In Proceedings of the IJCAI, Melbourne, Australia, 2017; pp. 3838–3844. Available online: https://www.ijcai.org/proceedings/2017/536 (accessed on 29 January 2024).
72. Megan, E.; Wittenborn, A.; Bogen, K.; McCauley, H. #MyDepressionLooksLike: Examining public discourse about depression on Twitter. JMIR Ment. Health 2017, 4, e8141. [CrossRef]
73. Cohan, A.; Desmet, B.; Yates, A.; Soldaini, L.; MacAvaney, S.; Goharian, N. SMHD: A large-scale resource for exploring online language usage for multiple mental health conditions. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, August 2018; pp. 1485–1497. Available online: https://aclanthology.org/C18-1126/ (accessed on 29 January 2024).
74. Becker, J.T.; Boller, F.; Lopez, O.; Saxton, J.; McGonigle, K.L. The Natural History of Alzheimer's Disease: Description of study cohort and accuracy of diagnosis. Arch. Neurol. 1994, 51, 585–594. [CrossRef] [PubMed]
75. Xue, C.; Karjadi, C.; Paschalidis, I.C.; Au, R.; Kolachalama, V.B. Detection of dementia on voice recordings using deep learning: A Framingham Heart Study. Alzheimer's Res. Ther. 2021, 13, 1–15. [CrossRef]
76. Azizi, M.; Jamali, A.A.; Spiteri, R. Identifying Tweets Relevant to Dementia and COVID-19: A Machine Learning Approach (Preprint). PREPRINT-SSRN 2023. Available online: https://pesquisa.bvsalud.org/global-literature-on-novel-coronavirus-2019-ncov/resource/pt/ppzbmed-10.2139.ssrn.4458777 (accessed on 29 January 2024).
77. Liu, T.; Zhang, Y.; Brockett, C.; Mao, Y.; Sui, Z.; Chen, W.; Dolan, W.B. A token-level reference-free hallucination detection benchmark for free-form text generation. ACL Anthol. 2022. Available online: https://aclanthology.org/2022.acl-long.464/ (accessed on 29 January 2024).
78. Guo, M.; Dai, Z.; Vrandečić, D.; Al-Rfou', R. Wiki-40B: Multilingual language model dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, May 2020; pp. 2440–2452. Available online: https://aclanthology.org/2020.lrec-1.297/ (accessed on 29 January 2024).
79. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
80. Ghosh, C.C.; McVicar, D.; Davidson, G.; Shannon, C.; Armour, C. Exploring the associations between auditory hallucinations and psychopathological experiences in 10,933 patient narratives: Moving beyond diagnostic categories and surveys. BMC Psychiatry 2023, 23, 1–10. [CrossRef]
81. Ghosh, C.C.; McVicar, D.; Davidson, G.; Shannon, C. Measuring diagnostic heterogeneity using text-mining of the lived experiences of patients. BMC Psychiatry 2021, 21, 1–12. [CrossRef] [PubMed]
82. Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-HALT: Medical domain hallucination test for large language models. arXiv 2023, arXiv:2307.15343.
83. Gerke, S.; Minssen, T.; Cohen, G. Ethical and Legal Challenges of Artificial Intelligence-Driven Healthcare. 2020. Available online: https://www.sciencedirect.com/science/article/pii/B9780128184387000125 (accessed on 29 January 2024).
84. Gohel, P.; Singh, P.; Mohanty, M. Explainable AI: Current status and future directions. arXiv 2021, arXiv:2107.07045.
85. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. arXiv 2019, arXiv:1910.10045.
86. Wei, W.; Landay, J. CS 335: Fair, Accountable, and Transparent (FAccT) Deep Learning, Stanford University, April 2020. Available online: https://hci.stanford.edu/courses/cs335/2020/sp/lec1.pdf (accessed on 29 January 2024).
87. Enns, M.W.; Larsen, D.K.; Cox, B.J. Discrepancies between self and observer ratings of depression. J. Affect. Disord. 2000, 60, 33–41. [CrossRef]
88. Larøi, F.; Luhrmann, T.M.; Bell, V.; Christian, W.A.; Deshpande, S.N.; Fernyhough, C.; Jenkins, J.H.; Woods, A. Culture and Hallucinations: Overview and Future Directions. Schizophr. Bull. 2014, 40, S213–S220. [CrossRef]
89. Council of the EU. Artificial Intelligence Act: Council and Parliament Strike a Deal on the First Rules for AI in the World. Available online: https://www.consilium.europa.eu (accessed on 28 January 2024).
90. AI HLEG. Ethics Guidelines for Trustworthy AI. Available online: https://ec.europa.eu (accessed on 28 January 2024).
91. European Parliament. EU Guidelines on Ethics in Artificial Intelligence: Context and Implementation. Available online: https://www.europarl.europa.eu (accessed on 28 January 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Linguistic Profiling of Text Genres: An Exploration of Fictional
vs. Non-Fictional Texts
Akshay Mendhakar
Faculty of Applied Linguistics, Uniwersytet Warszawski, 00-927 Warszawa, Poland;
[email protected]
Abstract: Texts are composed for multiple audiences and for numerous purposes. Each form of text
follows a set of guidelines and structure to serve the purpose of writing. A common way of grouping
texts is into text types. Describing these text types in terms of their linguistic characteristics is called
‘linguistic profiling of texts’. In this paper, we highlight the linguistic features that characterize a
text type. The findings of the present study highlight the importance of parts of speech distribution
and tenses as the most important microscopic linguistic characteristics of the text. Additionally, we
demonstrate the importance of other linguistic characteristics of texts and their relative importance
(top 25th, 50th and 75th percentile) in linguistic profiling. The results are discussed through the use case of
genre and subgenre classification, with classification accuracies of 89 and 73 percent, respectively.
Keywords: genres; subgenres; linguistic profiling; text; NLP
Citation: Mendhakar, A. Linguistic Profiling of Text Genres: An Exploration of Fictional vs. Non-Fictional Texts. Information 2022, 13, 357. https://doi.org/10.3390/info13080357
Academic Editor: Peter Revesz
Received: 21 June 2022; Accepted: 22 July 2022; Published: 26 July 2022
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction
Advances in computers and their processing abilities, and powerful algorithms that can
process complex data in seconds, have led to the development of modern-day natural
language processing (NLP) algorithms. Present-day NLP techniques focus on both the
context and the form of a text, rather than on just one of them.
The development of sophisticated NLP pipelines and the availability of multiple
large-scale corpora have given rise to a new range of data-driven NLP tools. These modern
tools can be used to answer classical linguistic research topics and many more topics with
relative ease. By accomplishing this, we can highlight the set of linguistic variables which
are suited for the given task and try training machine learning algorithms to build models
for a given task. These models represent a text type based on its linguistic features and can
be used for solving complex linguistic problems when coupled with complex statistical
methods. One such classical linguistics problem is identifying text patterns and highlighting
the linguistic characteristics/linguistic profiling [1]. This traditional question has led to
multiple advanced concepts such as genre identification [2], identification of one’s native
language [3], author identification [4], author attribution [5], author verification [6] etc.
Similar complex linguistic use cases have given rise to areas such as computational
register analysis [7], which looks at register and genre variation from a functional
spectrum of context-driven linguistic differences; computational sociolinguistics [8], which
focuses on the social dimension of language and the underlying diversity associated with
it; computational stylometry [9], which aims at extracting knowledge from texts to verify
and attribute authorship; and many more. While classical stylometric techniques place a
special emphasis on identifying the most salient or the rarest feature in a text, modern
special emphasis on identifying the most salient or the rarest feature in a text, modern
techniques can uncover patterns even in smaller segments of text [1,10]. Identifying a
specific linguistic profile of different text types can be used for classification tasks and
measurement of readability [11].
2. Literature Review
The concept of linguistic profiling for identifying specific features is not new and
has been attempted by multiple researchers. However, the usage of linguistic profiling to
understand genre variation computationally is the focus of this review. Ref. [12] was the
first to propose the multi-dimensional (MD) method for genre variation. The MD approach
has several salient characteristics [13]:
1. MD is a corpus-driven computational method, defined based on the analysis of a naturally occurring large number of texts.
2. MD helps in identifying linguistic features and patterns in individual texts and genres computationally.
3. MD is built on the hypothesis that different types of texts differ linguistically and functionally and that analysing only one or two of them is insufficient for reaching inferences.
4. MD is, as the name suggests, an explicitly multi-dimensional approach that assumes that in any discourse, numerous parameters of variation are anticipated to be active.
5. MD is quantitative in nature; early statistical techniques, as reported by [14,15], have been reported to be useful in measuring frequency counts of linguistic features, while recent multivariate statistical techniques are useful in understanding the relationship between linguistic elements of the text.
6. MD combines macro- and micro-level analysis, that is, macroscopic evaluation of general linguistic patterns combined with microscopic measurement of specific features of specific texts.
In earlier days, knowledge extraction methods for register and stylistic analysis focused
on the extraction of simple language-specific features such as pronouns, prepositions,
auxiliary and modal verbs, conjunctions, determiners, etc., and a few language-independent
features such as the frequency of linguistic features. Significant progress in information
extraction from text has lately been made feasible by the creation of strong and reasonably
accurate text analysis pipelines for a variety of languages [9]. This is also true in all the
aforementioned instances, where NLP-based technologies that automate the feature
extraction process play a critical role. Various programmes now exist that utilize distinctive
kinds of features to evaluate register, stylistic, or linguistic complexity.
Among these, the Stylo package [16] provides a comprehensive and user-friendly set
of functions for stylometric studies. Stylo focuses on shallow text characteristics, such as
n-grams at the token and character levels, that may be automatically extracted without
the usage of language-dependent annotation tools. It should be noted, however, that it
can also handle the output of linguistic annotation software. Text complexity may also be
assessed using a variety of other tools. Coh-Metrix is a well-known example which uses
characteristics retrieved from multi-level linguistic annotation to calculate over 200 indices
of cohesion, language and readability from an input text [17]. Similarly, L2 Syntactical
Complexity Analyzer (L2SCA) [18] and TAASSC [19] both estimate multiple linguistic variables that highlight grammatical complexity at the phrasal and sentence levels. These types
of features are relevant in studies on first and second language acquisition. SweGram, a
system specifically designed to profile texts in the Swedish language [20], is a striking exception to the preceding technologies, which are all designed for the English language. From
this brief review, we can note that language-independent tools, such as Stylo, typically use
shallow features that do not require language-specific preprocessing, whereas techniques
based on a wide variety of multilevel linguistic features are frequently monolingual.
Profiling–UD [21] is a computational text analysis tool based on linguistic profiling
concepts. It allows for the extraction of over 130 linguistic features from a given text.
Because it is built on the Universal Dependencies framework, Profiling–UD is a
multilingual tool that supports 59 languages. The features extracted by the tool can be
grouped into raw text properties, lexical variety features, morpho-syntactic features,
verbal predicate structure measures, global and local parsed tree structures, and syntactic
and subordination-related measures. Table 1 summarizes the feature categories extracted
by the Profiling–UD tool; a sketch approximating a few of these features locally is given
below. For more details about the tool, visit http://linguistic-profiling.italianlp.it/
(accessed on 2 March 2022).
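Profiling–UD itself runs as a web service, but a few of its raw-text and morpho-syntactic features can be approximated locally. The sketch below uses spaCy, which is an assumption made here for illustration and is not part of the Profiling–UD pipeline.

```python
# Minimal sketch: approximate a few Profiling-UD-style features with spaCy
# (requires: python -m spacy download en_core_web_sm). Illustrative only.
from collections import Counter
import spacy

CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}  # assumed content-word set

nlp = spacy.load("en_core_web_sm")

def profile(text):
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space]
    words = [t for t in tokens if not t.is_punct]
    n_sents = len(list(doc.sents))
    upos = Counter(t.pos_ for t in tokens)
    return {
        "n_sentences": n_sents,
        "n_tokens": len(tokens),
        "tokens_per_sent": len(tokens) / n_sents if n_sents else 0.0,
        "char_per_tok": sum(len(t.text) for t in words) / len(words) if words else 0.0,
        "upos_dist": {pos: 100.0 * c / len(tokens) for pos, c in upos.items()},
        "lexical_density": sum(t.pos_ in CONTENT_POS for t in words) / len(words) if words else 0.0,
    }

print(profile("The old keeper watched the storm roll in. He lit the lamp early."))
```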
Table 1. Features extracted from Profiling–UD.

Category of Feature | Definition of the Feature | Name as Seen in the Tool
Raw text features | This measures raw text features such as document length, sentence and word lengths, and number of characters per token. | (n_sentences), (n_tokens), (tokens_per_sent), (char_per_tok)
Lexical variety | Measured in terms of the Type/Token Ratio (TTR) for both the first 100 and 200 tokens of a text, in lemma and form. The TTR value ranges from one (high TTR) to zero (low lexical variety). | (ttr_lemma_chunks_100), (ttr_lemma_chunks_200), (ttr_form_chunks_100), (ttr_form_chunks_200)
Morpho-syntactic information | These measures highlight the percentage distribution of the 17 core part-of-speech categories defined in the Universal POS tags, the lexical density of content words and inflectional morphology. | (upos_dist_*): distribution of the 17 core part-of-speech categories, (lexical_density), (verbs_tense_dist_*), (verbs_mood_dist_*), (verbs_form_dist_*), (verbs_gender_dist_*), (verbs_num_pers_dist_*)
Verbal predicate structure | This estimates the distribution of verbal heads and roots. | (verbal_head_per_sent), (verbal_root_perc), (avg_verb_edges), (verb_edges_dist_*)
Global and local parsed tree structures | These measure the average depth of the syntactic tree, average clause length, length of dependency links, the average depth of embedded complement chains governed by a nominal head, and word order phenomena. | (avg_max_depth), (avg_token_per_clause), (avg_links_len), (avg_max_links_len), (max_links_len), (avg_prepositional_chain_len), (n_prepositional_chains), (prep_dist_*), (obj_pre), (obj_post), (subj_pre), (subj_post)
Syntactic relations | This estimates the distribution of the 37 universal dependency relations used in UD. | 37 (dep_dist_*)
Subordination phenomena/Use of subordination | This evaluates the distribution of subordinate and main clauses, the relative order of subordinates with respect to the verbal head and the average depth of embedded subordinate clauses. | (principal_proposition_dist), (subordinate_proposition_dist), (subordinate_post), (subordinate_pre), (avg_subordinate_chain_len), (subordinate_dist_*)
Increasingly large collections of data have been compiled across the internet. With
advancements in technology, these datasets are annotated and automatically analysed for
multiple purposes [22]. However, linguistic profiling of texts is usually carried out for
many different projects with a variety of end goals in mind: language verification, author
identification and verification, and text classification, to highlight a few. Our focus is to
identify the specific linguistic features of a given text that influence its classification into
genres and specific subgenres. A brief review of the studies which have focused on
linguistic profiling of fictional and non-fictional texts points to the study by [11], which
tried to estimate the readability of Italian fictional prose based on the linguistic profiling
of the texts. Even though their study shows promising results, from a fictional prose point
of view the dataset considered in the study is devoid of fictional texts or does not cover
most of the subgenres of the fictional type. Therefore, it is very important to conduct
studies that consider the multiple fictional subgenres popularly noted in the literature and
compare their linguistic composition with the non-fictional text type. In the study by [11],
the four major categories considered were literature (further divided into children's and
adult literature), journalism (newspapers), educational writing (educational materials for
primary school and high school) and scientific prose. When we look at the datasets which
are utilized across the literature for the tasks of classification, readability or
author identification, we note that the Brown Corpus [23], the Lancaster-Oslo/Bergen
(LOB) Corpus [24] and the British National Corpus (BNC) (the BNC is the result of a
collaboration, supported by the Science and Engineering Research Council (SERC Grant No.
GR/F99847) and the UK Department of Trade and Industry, between Oxford University
Press (lead partner), Longman Group Ltd. (London, UK), Chambers Harrap (London, UK),
Oxford University Computer Services, the British Library and Lancaster University) are
the most prevalently used ones. Even though the BNC offers a large mixed corpus that
makes it possible to analyse multiple genres of texts, it is not suitable for comparing genres
of fiction and non-fiction in detail. The Brown Corpus consists of over 500 samples coded
under 15 genres as an early attempt at corpus linguistics. These 15 genres do not represent
a universally accepted classification of genres. When the scope of a study is to measure
readability or perform classification, the available datasets are acceptable. However, if we
are interested in understanding the linguistic composition of the various genres and
subgenres of fictional and non-fictional texts, it is crucial that we define what we consider
genres and subgenres of texts. Genre is a fluid concept in constant flux, as researchers
propose different classification systems for different research goals. As the scope of our
study is to highlight the linguistic similarities and differences in various subgenres of
fiction and non-fictional texts, it was very important to build a new dataset suitable for the
goal of the experiment.
The goal of the present study was to investigate variation within and between genres
by comparing a corpus of literary texts to corpora representing other textual genres using
contrastive linguistic analysis.
3. Method
The study was carried out at the LELO laboratory of the Institute of Specialized Studies
(IKSI), Faculty of Applied Linguistics, at the University of Warsaw, after obtaining ethical
clearance from the local ethics committee. The methodology section is divided into three
parts: the first deals with the corpora and the related preprocessing of the dataset; the
second presents the linguistic profiling of the individual genres; and the third highlights
the performance of a classifier, based on the profiled linguistic features, for genre
identification (a sketch of this step is given below).
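The classification step can be illustrated with a scikit-learn sketch: the table of profiled features is split into training and test sets and a random-forest classifier is fitted to predict the genre label. The file and column names, and the choice of model, are assumptions for illustration rather than the study's exact configuration.

```python
# Minimal sketch: genre classification from profiled linguistic features.
# File/column names and the random-forest choice are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("profiled_features.csv")  # one row of features per text
X, y = df.drop(columns=["genre"]), df["genre"]  # e.g., fiction vs. non-fiction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# Feature importances hint at which linguistic features drive the split.
print(sorted(zip(clf.feature_importances_, X.columns), reverse=True)[:10])
```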
3.1. Corpora and Preprocessing
For the creation of the corpus, we considered the text classification of [25] (fictional,
non-fictional and poetry). We chose to ignore the category of poetry, as it is beyond the
scope of our study. Further classification into subgenres was performed after considering
the Reading Habits Questionnaire (RHQ) by [26]. Table 2 highlights the subgenre
classification considered for the creation of the corpus. The data for the corpus were
gathered from various sources. The data for the fictional texts were gathered from Project
Gutenberg, a digital archive of over 65,000 books categorized under various subheadings,
which can be accessed in multiple formats such as HTML, PDF, TXT, EPUB, MOBI, etc.
All the materials downloaded from Project Gutenberg are covered under the Creative
Commons license, which makes them ready to use for this research study. Project
Gutenberg is an online repository of texts such as short stories, novels, poems, biographies
and many more. Despite being smaller than other public collections such as HathiTrust [27]
and Google Books, Project Gutenberg has several advantages over those collections. It can
be downloaded as a single package or browsed for individual texts, which makes it
versatile enough for multiple experiments. Also, most online repositories of digital
documents use OCR technology to convert and preserve the documents; texts under
Project Gutenberg have been proofread by a human and in some cases even hand-typed,
making them more suitable for experimental usage. The texts needed for the non-fiction
categories were gathered from student writing samples from http://www.corestandards.org/
(accessed on 2 March 2022) [28]; for the procedural texts, we chose various projects/articles
from the https://www.instructables.com/ (accessed on 2 March 2022) [29] website.
Instructables is a dedicated web portal for obtaining step-by-step instructions for building
and carrying out a variety of projects.
Table 2. Summary of the dataset of the study.

Fiction (2153): Children's Fiction (190), Fable (394), Fantasy (249), Legends (44), Mystery (191), Myths (48), Romance (591), Science Fiction (385), Thriller (61)
Non-Fiction (1514): Discussion Texts (395), Explanatory Texts (242), Instructional Texts (495), Persuasive Texts (382)
Hence, we built a dataset consisting of both fictional and non-fictional texts, with a special
focus on carrying out a detailed linguistic analysis. Table 2 shows the number of text
samples (in brackets) considered in each subgenre, grouped into the fictional and
non-fictional genres. The selected texts were divided into chapters, and it was ensured
that the overall size of each text was around 100–2000 words. Preprocessing of the selected
texts was carried out to remove licensing information, unnecessary spaces and punctuation;
a sketch of this step follows.
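A preprocessing step of this kind can be sketched as follows. Project Gutenberg plain-text files typically carry "*** START OF ... ***" and "*** END OF ... ***" markers around the licensed body, so the boilerplate can be cut away before chunking; the marker wording varies between files, and this sketch is an illustration rather than the script used in the study.

```python
# Minimal sketch: strip Project Gutenberg boilerplate and split the body
# into chapter-sized chunks of roughly 100-2000 words. Illustrative only.
import re

def strip_gutenberg_boilerplate(raw):
    start = re.search(r"\*\*\*\s*START OF.*", raw)
    end = re.search(r"\*\*\*\s*END OF.*", raw)
    return raw[start.end() if start else 0 : end.start() if end else len(raw)]

def chunk_words(text, max_words=2000, min_words=100):
    words = text.split()
    chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
    return [c for c in chunks if len(c.split()) >= min_words]

# with open("pg_book.txt", encoding="utf-8") as f:  # hypothetical file
#     samples = chunk_words(strip_gutenberg_boilerplate(f.read()))
```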
3.2. Linguistic Profiling of Texts
The scope of the present study is to carry out detailed linguistic profiling of various
texts in the fictional and non-fictional categories. We chose to use the tool called Profiling–
UD [21] for carrying out a detailed computational linguistic profiling of texts. As stated
in the previous sections, this tool provides the most comprehensive set of features for a
loaded text.
Each text was individually loaded into the tool, and the corresponding features were
extracted and tabulated. This process was repeated for all the texts. The results obtained
from the analysis were loaded into SPSS software [30] for further processing.
Even though the analysis of fictional and non-fictional texts was performed chapter-wise,
it can be noted that the overall numbers of sentences and tokens in the fictional texts are
higher than in the non-fictional texts; Table 3 shows a summary of the raw textual features
across all subgenres and genres. However, the numbers of tokens per sentence and
characters per token are higher in non-fictional texts than in fictional texts. Individual
differences were noted across subgenres in terms of the numbers of sentences and tokens.
Based on the raw text properties, it can be noted that the mystery and thriller, myths and
legends, and fantasy and romance subgenres had similar raw text structures, whereas the
explanatory and persuasive texts had similar scores on the noted raw text properties.
These findings support the hypothesis of [31] that non-fictional texts, notably informational
and discussion texts, use substantially longer words and sentences than fictional texts,
which use short and easy phrases.
Table 4 highlights the lexical variety noted in the subgenres; based on these values, there
were no statistical differences between them. However, it can be noted that the fable
subgenre had simple lexical variety and complexity and can be graded as even simpler
than the non-fictional texts. This can be attributed to the population that fables target:
children, who need simple lexical variety. These findings add to the claims that fictional
texts and subgenres report significantly higher TTR values, suggesting greater lexical
diversity and usage of unique words [32].
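The chunked TTR features referenced above are straightforward to compute; a minimal sketch:

```python
# Minimal sketch: type-token ratio over the first 100- and 200-token
# chunks of a text (cf. the ttr_*_chunks_* features in Table 1).
def chunk_ttr(tokens, size):
    chunk = tokens[:size]
    return len(set(chunk)) / len(chunk) if chunk else 0.0

tokens = "the cat sat on the mat and the dog sat there too".split()
print(chunk_ttr(tokens, 100), chunk_ttr(tokens, 200))
```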
Table 3. Summary of the raw textual features across genres.

Subgenre | n_sentences | n_tokens | tokens_per_sent | char_per_tok
Children's Fiction | 559.40 | 5593.40 | 9.98 | 3.96
Fable | 17.00 | 147.00 | 8.58 | 4.25
Fantasy | 410.80 | 4399.80 | 10.65 | 4.10
Legends | 680.80 | 7034.00 | 10.37 | 4.11
Mystery | 981.80 | 9342.00 | 9.83 | 4.25
Myths | 1378.20 | 13,339.80 | 9.62 | 4.57
Romance | 644.40 | 6597.00 | 10.14 | 4.22
Science Fiction | 551.80 | 5269.80 | 9.68 | 4.48
Thriller | 1027.20 | 8921.20 | 8.78 | 4.27
Discussion | 58.40 | 1191.00 | 19.69 | 4.73
Explanatory | 176.80 | 1759.80 | 10.00 | 4.41
Instructional | 147.60 | 1398.80 | 9.56 | 4.25
Persuasive | 60.20 | 577.20 | 9.83 | 4.53
Fiction | 694.60 | 6738.22 | 9.74 | 4.25
Non-fictional | 110.75 | 1231.70 | 12.27 | 4.48
Table 4. Summary of the lexical variety features across genres.

Subgenre | ttr_lemma_chunks_100 | ttr_lemma_chunks_200 | ttr_form_chunks_100 | ttr_form_chunks_200
Children's Fiction | 0.69 | 0.56 | 0.76 | 0.64
Fable | 0.23 | 0.10 | 0.26 | 0.11
Fantasy | 0.66 | 0.54 | 0.73 | 0.60
Legends | 0.64 | 0.54 | 0.71 | 0.61
Mystery | 0.72 | 0.61 | 0.79 | 0.68
Myths | 0.68 | 0.60 | 0.73 | 0.65
Romance | 0.68 | 0.61 | 0.74 | 0.66
Science Fiction | 0.69 | 0.61 | 0.74 | 0.66
Thriller | 0.69 | 0.62 | 0.75 | 0.67
Discussion | 0.66 | 0.58 | 0.74 | 0.66
Explanatory | 0.61 | 0.52 | 0.68 | 0.59
Instructional | 0.67 | 0.59 | 0.76 | 0.66
Persuasive | 0.62 | 0.48 | 0.71 | 0.56
Fiction | 0.64 | 0.54 | 0.72 | 0.62
Non-fictional | 0.63 | 0.53 | 0.69 | 0.59
Similarly, we looked at the parts of speech distribution in the various subgenres. Table 5
highlights the individual values of the distribution of parts of speech across all subgenres.
When the values are compared across fictional and non-fictional texts, it can be noted that
fictional texts have fewer adjectives but more adverbs, adpositions, pronouns and
punctuation than non-fictional texts, whereas non-fictional texts have two times higher
values for auxiliary verbs and nouns, with slightly elevated values for numerals, compared
to fictional texts. No significant differences were noted in the values of coordinating and
subordinating conjunctions, determiners, interjections and symbols across fictional and
non-fictional texts. Overall, the lexical density of fictional and non-fictional texts remained
the same.
According to the Universal Dependencies (UD) framework, parts of speech can be
divided into three types [33]. Figure 1 highlights this classification system, which
comprises open-class words (ADJ, ADV, NOUN, VERB, PROPN, INTJ), closed-class words (ADP,
AUX, CCONJ, DET, NUM, PART, PRON, SCONJ) and others (PUNCT, SYM, X). For more details,
refer to Figure 1.
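As a compact illustration of this three-way split, the sketch below groups a upos_dist_* feature vector into the Figure 1 classes; the function is illustrative and not part of the Profiling-UD tool, and the example values are a subset of the Fiction row of Table 5:

```python
# UD v2 part-of-speech classes as in Figure 1 (PUNCT, SYM and X fall under "other")
OPEN_CLASS = {"ADJ", "ADV", "NOUN", "VERB", "PROPN", "INTJ"}
CLOSED_CLASS = {"ADP", "AUX", "CCONJ", "DET", "NUM", "PART", "PRON", "SCONJ"}
OTHER = {"PUNCT", "SYM", "X"}

def class_shares(upos_dist: dict) -> dict:
    """Collapse a upos_dist_* vector (percentage per UPOS tag) into
    open-class, closed-class and other shares."""
    share = lambda tags: sum(upos_dist.get(t, 0.0) for t in tags)
    return {"open": share(OPEN_CLASS),
            "closed": share(CLOSED_CLASS),
            "other": share(OTHER)}

fiction = {"ADJ": 6.94, "NOUN": 17.53, "VERB": 15.18, "PRON": 12.82, "PUNCT": 9.17}
print(class_shares(fiction))
```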
Table 5. Summary of the parts of speech distribution and lexical density across various genres. *

Subgenre             ADJ    ADP    ADV   AUX   CCONJ  DET    INTJ  NOUN   NUM   PART  PRON   PROPN  PUNCT  SCONJ  SYM   VERB   X     Lexical Density
Children's Fiction   4.48   7.39   6.81  5.82  3.89   7.66   0.60  12.44  0.50  2.77  11.33  3.37   18.76  1.47   0.12  12.56  0.05  0.49
Fable                4.09   8.99   5.83  3.23  4.42   12.98  0.00  15.49  0.42  3.46  10.25  2.70   11.22  2.29   0.00  14.64  0.00  0.48
Fantasy              4.58   10.37  6.01  4.06  5.56   9.53   0.25  15.61  0.71  1.85  10.35  4.17   13.08  1.90   0.02  11.93  0.01  0.49
Legends              4.09   11.17  4.38  4.59  4.49   10.27  0.38  17.37  1.48  1.63  9.31   4.34   15.15  1.03   0.06  10.07  0.19  0.47
Mystery              5.46   9.87   6.48  5.78  3.19   9.44   0.35  14.49  0.66  2.02  11.55  1.77   15.66  2.28   0.17  10.77  0.07  0.46
Myths                6.22   11.41  3.76  4.15  3.77   10.61  0.16  18.19  1.17  1.42  5.98   7.12   15.12  1.35   0.23  8.89   0.45  0.52
Romance              4.79   9.09   5.48  5.95  3.51   7.44   0.44  13.56  0.49  2.64  11.68  4.20   16.80  1.93   0.05  11.91  0.03  0.48
Science Fiction      6.86   10.52  5.99  5.16  3.72   9.91   0.19  17.73  0.88  2.09  8.79   2.03   13.38  1.87   0.12  10.72  0.03  0.50
Thriller             5.80   10.33  5.65  5.31  2.68   9.28   0.26  15.45  0.41  2.06  12.02  2.43   14.89  1.76   0.11  11.57  0.01  0.48
Discussion           5.15   9.90   5.60  4.89  3.91   9.68   0.29  15.59  0.75  2.22  10.14  3.57   14.90  1.76   0.10  11.45  0.09  0.49
Explanatory          6.70   8.36   5.31  6.80  3.54   7.51   0.05  20.65  0.80  3.62  7.19   2.86   11.13  2.67   0.03  12.74  0.06  0.54
Instructional        6.83   8.66   5.29  6.66  3.42   10.54  0.26  21.33  1.13  2.53  6.68   1.45   11.68  1.77   0.07  11.66  0.05  0.53
Persuasive           5.26   8.14   3.68  2.73  2.70   10.75  0.18  22.46  4.30  2.01  5.06   7.47   12.19  1.30   0.43  10.88  0.44  0.57
Fiction              6.94   6.78   5.27  6.70  3.19   7.90   0.08  17.53  0.07  4.79  12.82  0.60   9.17   3.00   0.00  15.18  0.00  0.50
Non-fictional        6.43   7.99   4.89  5.72  3.21   9.18   0.14  20.49  1.58  3.24  7.94   3.10   11.04  2.19   0.13  12.62  0.14  0.54

* The parameters are upos_dist_.
Figure 1. Universal Dependencies (UD) tagset by [34].
When we carefully examine the parts of speech distribution across non-fictional texts,
we can note that instructional texts had significantly fewer adjectives, adverbs and auxiliary
verbs than the other non-fictional texts. Also, the concentration of proper
nouns in instructional texts is statistically higher than in any other text. Persuasive texts
have statistically significantly fewer nouns, punctuation marks and adpositions, but
higher values for pronouns and verbs overall. Based on the values of determiners, particle
structure, subordinating conjunctions and interjections, we can group non-fictional texts
into two groups: discussion–persuasive and explanatory–instructional. No significant
differences were noted in the lexical density across the subgenres. Therefore, it can be
noted that open-class and closed-class words are equally important in the classification of
texts into fictional and non-fictional genres.
Similarly, when we look at the subgenres of the fictional texts, we can note that
myths and science fiction have the highest and lowest concentrations of open-class
words (specifically adjectives and adverbs, respectively), although this is not statistically
significant, whereas fables and children's fiction have the least concentration of open-class
words (interjections) in the fictional text genre. No other significant differences in closed-class
words were noted across the other subgenres of fiction. Adverbs are fewest in myths,
but the other differences were statistically insignificant. Auxiliary verbs and coordinating
conjunctions were fewest in children's fiction and thrillers, respectively, but were similar
across all the other domains. Nouns are fewest in children's fiction but are similar in all
other domains. Children's fiction and romance had similar closed-class compositions. Myths and
legends have the highest numbers of numerals and proper nouns, and the lowest occurrence of
pronouns and verbs, compared to all other subgenres. Differences in lexical density, particles,
punctuation, subordinating conjunctions, symbols and the other domains are insignificant and
similar across all domains.
Table 5 highlights the parts of speech distribution in the different text types. Pronouns
and verbs appear to occur frequently in fictional texts. These findings are
similar to the claim by [31] that these two elements are more common in conversation than
in written language forms. The frequency of occurrence of nouns, on the other hand, is
relatively low, resulting in a substantially lower noun/verb ratio. These findings are in line
with the findings of [35], who suggests that novels have a narrative structure with a plot
that requires the description of activities using verbs.
Further, when we look at other morphosyntactic information, such as inflectional
morphology, the distribution of verbs according to their tense (present and past) showed
significant differences between fictional and non-fictional texts. Fictional texts had more
past-tense verbs, whereas non-fictional texts are composed more of present-tense verbs.
Looking at the indicative and imperative verbs in fictional versus non-fictional texts, it was
found that both kinds of texts consist extensively of indicative verbs.
No statistical differences were noted in the distribution of verbs according to their number
and person, their tenses or their verbal mood. Fables had the highest concentration
of past tenses, whereas persuasive texts had the lowest.
Syntactic features of verbal predicate structures, such as the number of verbal heads in the
document, roots headed by a lemma tagged as a verb, verbal arity and the distribution of verbs
by arity class, were not found to be significantly different across the subgenres. Further,
there were no significant differences in the parsed tree structures either, except that
the prepositional chains of non-fictional texts had significantly lower values compared
to fictional texts. However, fables had the smallest concentration of prepositional chains,
making their structure closer to that of non-fictional texts.
When studying the order of elements in the syntactic structure, specifically the objects and
subjects preceding and following the verbs, it was noted that fictional texts had slightly
higher values, but this did not reach statistical significance. When examined individually,
objects preceding verbs were fewest in fables, similar to the non-fictional category, and
there was not much difference across the other subgenres.
Further contrastive analysis of the 37 universal syntactic relations was carried out across
fictional and non-fictional texts. It was observed that non-fictional texts had elevated values
for the clausal modifier of noun (adnominal clause), adjectival modifier, compound,
phrasal verb particle, marker, numeric modifier, object and oblique nominal relations, which
reached statistical significance, but non-significantly different values for the adverbial
clause modifier and punctuation relations.
In the use of subordination, none of the parameters reached statistical significance
across fictional and non-fictional texts, but slight differences in the distribution
of principal and subordinate clauses were noted. No further subgenre differences
were noted.
One of the aims of this experiment was to highlight the main features that can be used
for classifying texts into fictional and non-fictional categories in the task of genre
classification.
4. Feature Reduction and Classification
We begin by providing a quick overview of the classification algorithm and feature
selection approaches we employed in our trials (Section 4.1). Following that, we discuss
the classification models that were trained on the dataset using the proposed feature
sets (Section 4.2). The next section includes a feature selection experiment in which we
evaluate the relevance of the features (Section 4.3). The next step is to re-run the classification methods using alternative subsets of the features to evaluate how this affects the
model’s accuracy.
4.1. The Classification Algorithm and the Feature Selection Methods
In this study, we utilised the Random Forest algorithm (RF), which is an ensemble
learning method, as our classifier. The classification is based on the outcomes of several
decision trees it generates during the training process [36,37]. We chose RF because it
reliably calculates the permutation relevance of the variables while training the
classification models. Table 6 highlights the feature details after dimensionality
reduction. After that, we employed Rank Features by Importance (RFI) and Sequential Forward
Search (SFS) to evaluate the features included in each model. Sections 4.3 and 4.4 explain
RFI and SFS in detail.
Table 6. Feature details after dimensionality reduction.

Linguistic Category                          Old Size   New Size (Genre)   New Size (Subgenre)   Ignored Features (Genre/Subgenre)
Raw Text Properties                          4          4                  4                     0/0
Lexical Variety                              4          4                  2                     0/2
Morphosyntactic Information
  - upos_dist                                18         11                 13                    7/5
  - lexical_density                          1          1                  1                     0/0
Inflectional Morphology                      21         17                 17                    4/4
Syntactic Features
  - Verbal Predicate Structure               10         3                  5                     7/5
  - Global and Local Parsed Tree Structures  10         7                  8                     3/2
  - Order of Elements                        4          3                  3                     1/1
Syntactic Relations
  - dep_dist                                 44         24                 30                    20/14
  - Use of Subordination                     8          7                  5                     0/3
4.2. Constructing RF Models
We utilised Jupyter Notebook for a quick implementation of RF. The features were
evaluated in a sequential manner to predict the importance of each feature in the models’
prediction success.
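A minimal sketch of this step with scikit-learn is shown below; the file profiled_features.csv and its column names are hypothetical stand-ins for the tabulated feature set:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical table: one row of profiling features per text, plus labels.
df = pd.read_csv("profiled_features.csv")
X, y = df.drop(columns=["genre", "subgenre", "title"]), df["genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=500,  # size of the ensemble of decision trees
    oob_score=True,    # out-of-bag accuracy, used by the importance analysis
    random_state=42)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_, "test accuracy:", rf.score(X_test, y_test))
```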
4.3. Using RFI to Assess the Relevance of the Features: Experiment One
To evaluate the variables, we used RF’s built-in permutation importance [38] to rank
their “importance”. According to [39], the model is developed first, and its accuracy is
computed in out-of-bag (OOB) observations to determine the relevance of the feature (Xi).
Following that, any relationship between the values of Xi and the model’s outcome is
severed by permuting all the values of Xi, and the model’s accuracy with the permuted
values is re-computed. The permutation importance of Xi is defined as the difference
between the accuracy of the new model and the original score. As a result, if a feature
has noise or random values, the permutation is unlikely to affect the accuracy. A large
difference between the two rates, on the other hand, indicates the importance of the feature
for the prediction task. Figures 2 and 3 demonstrate the importance of several variables in
genre and subgenre classifier models. The greater the relevance of the feature, the greater
the value of the mean decrease in accuracy on the x-axis.
We also used the method of [40] to calculate the p-values for the variables under the
null hypothesis that the permutation of the variable has no effect on the accuracy. Out of
131 features, 89 and 83 features from the genre and subgenre models, respectively, were
found to have a significant effect on classifier models. The remaining features had a role in
the models to varying degrees which did not reach significance.
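The permutation procedure described above can be sketched directly; this illustrative version scores a held-out validation set rather than the out-of-bag observations used in the paper:

```python
import numpy as np

def mean_decrease_in_accuracy(model, X_val, y_val, seed=0):
    """Permutation importance: decrease in accuracy when each feature is
    permuted in turn. X_val is a pandas DataFrame held out from training
    (a simple stand-in for the paper's out-of-bag observations)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    importances = {}
    for col in X_val.columns:
        X_perm = X_val.copy()
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())  # sever Xi-outcome link
        importances[col] = baseline - model.score(X_perm, y_val)
    return dict(sorted(importances.items(), key=lambda kv: -kv[1]))
```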
4.4. Measuring Relevance of the Features Using SFS—Experiment Two
To implement SFS, we used the R package mlr [41]. The algorithm starts with an
empty set of features and gradually adds variables until the performance of the model
no longer improves. In this model, we used the classif.randomForest learner and the
Holdout resampling method. If the improvement falls below the minimum needed value
(alpha = 0.01), the algorithm comes to a halt. Each box in Figure 4 shows the selected
features of each feature set.
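For readers working in Python rather than R, an analogous forward search can be sketched with scikit-learn's SequentialFeatureSelector; this is a stand-in for, not a reproduction of, the mlr setup, with the hypothetical feature table from Section 4.2:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

df = pd.read_csv("profiled_features.csv")  # hypothetical feature table
X, y = df.drop(columns=["genre", "subgenre", "title"]), df["genre"]

sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=500, random_state=42),
    direction="forward",          # start from an empty set, add one feature at a time
    n_features_to_select="auto",  # stop when the improvement drops below tol
    tol=0.01,                     # counterpart of mlr's alpha = 0.01
    cv=5)
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))  # the selected feature subset
```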
Figure 2. Variable importance plot of the RF genre model. NOTE: The x-axis shows the permutation
relevance (mean decrease in accuracy) of each feature; the y-axis lists the features of the genre model.
4.5. Examining Various Feature Subsets Based on Their Significance—Experiment Three
Firstly, in Section 4.5.1, we explore the accuracy of different subsets of each feature set
based on the results of the RFI and SFS experiments. In Section 4.5.2, we explore the subsets
of all the features combined, trying to come up with an optimal consensus set of features.
Figure 3. Variable importance plot of the RF subgenre model. NOTE: The x-axis shows the permutation relevance (mean decrease in accuracy) of each feature; the y-axis lists the features of the subgenre model.
Figure 4. SFS optimal features of each feature set.
4.5.1. Subsets of Each Feature Set
Tables 7 and 8 list the features considered of greatest importance for genre and subgenre
determination, and the top 25th, 50th and 75th percentiles of the features important
for classification can be noted in Figure 4. The initial accuracy of each model is reported in
the first row of Table 9, labelled Original. The rows Top 25%, Top 50% and Top 75% report,
respectively, the results of running RF on the top 25th, 50th and 75th percentiles of the
features important for class determination. Similarly, the Top 5 row reports the performance
of the five features of each feature set with the greatest importance (according to Figure 4).
The row Allimp reports the results of applying RF to all the features that were noted to play
a part in classification.
Table 7. Subgenre selection of top features.

Top 25: upos_dist_NUM
Top 50: upos_dist_NUM, subordinate_dist_5, n_tokens, dep_dist_nummod, ttr_form_chunks_200
Top 75: upos_dist_NUM, subordinate_dist_5, n_tokens, dep_dist_nummod, ttr_form_chunks_200, ttr_lemma_chunks_200, dep_dist_compounds, aux_num_pers_dist_Sing+, dep_dist_reparandum, verbs_tense_dist_Pres, prep_dist_4, n_prepositional_chains

Table 8. Genre selection of top features.

Top 25: aux_tense_dist_Past, aux_tense_dist_Pres, aux_mood_dist_Ind
Top 50: aux_tense_dist_Past, aux_tense_dist_Pres, aux_mood_dist_Ind, verbs_tense_dist_Pres, max_links_len, lexical_density, upos_dist_NOUN
Top 75: aux_tense_dist_Past, aux_tense_dist_Pres, aux_mood_dist_Ind, verbs_tense_dist_Pres, max_links_len, lexical_density, upos_dist_NOUN, obj_post, subj_pre, char_per_token
Table 9. Accuracy of the model with feature selection.

Row Name   Genre Data   Subgenre Data
Original   0.87         0.82
Top 25%    0.71         0.64
Top 50%    0.77         0.71
Top 75%    0.84         0.79
Top 5      0.75         0.68
Allimp     0.93         0.89
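The percentile rows of Table 9 correspond to refitting RF on importance-ranked feature subsets; a minimal sketch is shown below, where the ranking placeholder would come from the RFI experiment and the data file is the hypothetical table used earlier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("profiled_features.csv")  # hypothetical feature table
X, y = df.drop(columns=["genre", "subgenre", "title"]), df["genre"]
ranked = list(X.columns)  # stand-in: replace with the RFI importance ranking

def subset_accuracy(frac):
    """Cross-validated RF accuracy on the top `frac` of ranked features."""
    cols = ranked[:max(1, int(len(ranked) * frac))]
    rf = RandomForestClassifier(n_estimators=500, random_state=42)
    return cross_val_score(rf, X[cols], y, cv=5).mean()

for frac in (0.25, 0.50, 0.75, 1.0):  # the Top 25/50/75% and Original rows
    print(f"top {frac:.0%}: {subset_accuracy(frac):.2f}")
```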
4.5.2. Subsets of All Features
To investigate possible combinations of all the features based on the findings of the
RFI and SFS experiments, we proceeded as follows after trying out different subsets of each
feature set.
1. In the RFI experiment, we initially applied RF to the set of attributes with the highest
permutation relevance. That set, as shown in Figure 3, is {aux_tense_dist_Past,
aux_tense_dist_Pres, aux_mood_dist_Ind, verbs_tense_dist_Pres}. The accuracy of
this model is 0.889. From Figure 4, the set of important features is {upos_dist_NUM,
subordinate_dist_5, n_tokens, dep_dist_nummod, ttr_form_chunks_200, ttr_lemma_chunks_200,
dep_dist_compounds}. This model had an accuracy of 0.728.
2. We then applied RF to the union of the two most important features of each feature set:
{aux_tense_dist_Past, aux_tense_dist_Pres, aux_mood_dist_Ind, verbs_tense_dist_Pres,
max_links_len, lexical_density, upos_dist_NOUN, obj_post, subj_pre, char_per_token}.
The accuracy of this model is 0.913. Similarly, for the subgenre model, {upos_dist_NUM,
subordinate_dist_5, n_tokens, dep_dist_nummod, ttr_form_chunks_200, ttr_lemma_chunks_200,
dep_dist_compounds, aux_num_pers_dist_Sing+, dep_dist_reparandum, verbs_tense_dist_Pres,
prep_dist_4, n_prepositional_chains} yielded a model accuracy of 0.792. The accuracy of
this model is in line with the expected increase in accuracy when compared with the accuracy
of the union of the single most relevant features.
5. Conclusions
In this paper, we tried to linguistically profile the features noted in various fictional
and non-fictional subgenres. By considering multiple feature sets highlighted in various
computational SRF studies from a linguistic perspective, we attempted to connect the
computational models and the linguistic explanations behind those features. As a result of
the experiment, we are able to linguistically grade the composition of texts that constitute
a text type. We also noted that for the task of genre classification the most important
set of features are inflectional morphology, morphosyntactic information and raw text
properties. However, for the task of subgenre classification, a mixture of semantic and
syntactic features is important, i.e., morphosyntactic information, use of subordination,
lexical variety, general syntactic features and parsed tree structures.
Based on the linguistic profiling of non-fictional texts, we found that the linguistic
composition of discussion and persuasion texts is similar across most of the domains of
comparison, and explanatory and instructional texts show linguistic similarities as well.
Similarly, grouping of subgenres of fiction can be performed for dyads of children’s fiction
and fantasy, myths and legends, and mystery and thrillers.
The results of the present study highlight the use of exact estimates of linguistic elements in each text type. These estimates could be useful in planning future use case
experiments ranging from identifying developmental patterns in children [42,43] to estimating atypical language acquisition [44,45]. Further, we can also detect linguistic markers
for acquired language disorders and cognitive impairments such as dementia and aphasia [46]. Similarly, we can estimate the writing abilities of school children [47]. Furthermore,
from the perspective of computational sociolinguistics, the findings aid in the analysis of
variations in the social component of language [8] as well as the modelling of stylometric
features of authors [9]. By performing a comprehensive estimation of elements belonging
to morphological, semantic and syntactic domains, we are able to grade the text types in
terms of their complexities as well. This will be especially useful in such cases as readability
measurement and selection of specific texts for language learning, among others.
Similarly, the current trend in linguistic analysis is to use complex network models
for linguistic representation [48–50]. Complex networks have been used to model and
study many linguistic phenomena, such as complexity [51–53], semantics [54], citations [55],
stylometry [56–61] and genre classification [62–64]. Multiple studies [65,66] have concluded
that the properties of specific words at the macroscopic scale of the structure of a whole
text are as relevant as their microscopic features, such as frequency of appearance. Linguistic
research from the complex network approach is a relatively young domain of scientific
endeavour. There is still a need for studies that can fill the gap in understanding the
relationships between the system-level complexity of human language and microscopic
linguistic features [48]. Although research in this area is on the rise and abundant findings
have already been made, researchers need to have a clear knowledge of the microscopic
linguistic features to determine the directions of further research. Our study highlights
the crucial microscopic linguistic features which can be used to build better complex
network models.
Even though the present study was comprehensive with the linguistic parameters
considered, the dataset used was unevenly distributed across fictional and non-fictional
text groups. Further studies which can address these issues and replicate the results of the
present study in a controlled dataset would be required.
Funding: This research was conducted as part of the ELIT project, which has received funding
from the European Union’s Horizon 2020 research and innovation programme under the Marie
Skłodowska-Curie, grant agreement no. 860516.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: I wish to thank all the reviewers and Monika Płużyczka, Eliza, Niharika,
Priyanka, Darshan and Deepak for helpful comments and discussion. I am also extremely grateful to
all the members of IKSI at the University of Warsaw, who helped me in completing this research work.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Halteren, H.V. Linguistic Profiling for Authorship Recognition and Verification. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, 21–26 July 2004.
2. Paltridge, B. Genre Analysis and the Identification of Textual Boundaries. Appl. Linguist. 1994, 15, 288–299. [CrossRef]
3. Cimino, A.; Wieling, M.; Dell'Orletta, F.; Montemagni, S.; Venturi, G. Identifying Predictive Features for Textual Genre Classification: The Key Role of Syntax. In Proceedings of the Fourth Italian Conference on Computational Linguistics CLiC-it, Rome, Italy, 11–13 December 2017.
4. Coulthard, M. Author Identification, Idiolect, and Linguistic Uniqueness. Appl. Linguist. 2004, 25, 431–447. [CrossRef]
5. Gamon, M. Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of the COLING 2004: 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 611–617.
6. Halteren, H.V. Author verification by linguistic profiling: An exploration of the parameter space. ACM Trans. Speech Lang. Process. 2007, 4, 1–17. [CrossRef]
7. Argamon, S.E. Computational Register Analysis and Synthesis. Regist. Stud. 2019, 1, 100–135. [CrossRef]
8. Nguyen, D.; Doğruöz, A.S.; Rosé, C.P.; De Jong, F.M. Computational Sociolinguistics: A Survey. Comput. Linguist. 2016, 42, 537–593. [CrossRef]
9. Daelemans, W. Explanation in computational stylometry. In International Conference on Intelligent Text Processing and Computational Linguistics; Springer: Berlin/Heidelberg, Germany, 2013; pp. 451–462.
10. Montemagni, S. Tecnologie Linguistico-Computazionali e Monitoraggio della Lingua Italiana. Studi Ital. Linguist. Teorica Appl. (SILTA) 2013, 42, 145–172.
11. Dell'Orletta, F.; Montemagni, S.; Venturi, G. Linguistic profiling of texts across textual genres and readability levels. An exploratory study on Italian fictional prose. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP, Hissar, Bulgaria, 9–11 September 2013; pp. 189–197.
12. Biber, D. Spoken and Written Textual Dimensions in English: Resolving the Contradictory Findings. Language 1986, 62, 384. [CrossRef]
13. Biber, D. Variation across Speech and Writing; Cambridge University Press: Cambridge, UK, 1988. [CrossRef]
14. Carroll, J.B. Vectors of Prose Style. In Style in Language; Sebeok, T.A., Ed.; MIT Press: Cambridge, MA, USA, 1960; pp. 283–292.
15. Marckworth, M.L.; Baker, W.J. A discriminant function analysis of co-variation of a number of syntactic devices in five prose genres. Am. J. Comput. Linguist. 1974, 11, 2–24.
16. Eder, M.; Rybicki, J.; Kestemont, M. Stylometry with R: A package for computational text analysis. R J. 2016, 8, 107–121. Available online: https://journal.r-project.org/archive/2016/RJ-2016-007/index.html (accessed on 2 March 2022).
17. Graesser, A.C.; McNamara, D.S.; Cai, Z.; Conley, M.; Li, H.; Pennebaker, J. Coh-Metrix Measures Text Characteristics at Multiple Levels of Language and Discourse. Elem. Sch. J. 2014, 115, 210–229. [CrossRef]
18. Lu, X. Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 2010, 15, 474–496. [CrossRef]
19. Kyle, K. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. Ph.D. Thesis, Georgia State University, Atlanta, GA, USA, 2016. [CrossRef]
20. Näsman, J.; Megyesi, B.; Palmér, A. SWEGRAM: A Web-Based Tool for Automatic Annotation and Analysis of Swedish Texts. In Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, Gothenburg, Sweden, 22–24 May 2017; pp. 132–141.
21. Brunato, D.; Cimino, A.; Dell'Orletta, F.; Venturi, G.; Montemagni, S. Profiling-UD: A tool for linguistic profiling of texts. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 7145–7151.
22. Lu, X. Computational Methods for Corpus Annotation and Analysis; Springer: Berlin, Germany, 2014. [CrossRef]
23. Francis, W.N.; Kucera, H. Manual of Information to Accompany a Standard Sample of Present-Day Edited American English, for Use with Digital Computers; Technical Report; Department of Linguistics, Brown University: Providence, RI, USA, 1964.
24. Johansson, S.; Leech, G.N.; Goodluck, H. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers; Department of English, University of Oslo: Oslo, Norway, 1978.
25. National Literacy Trust (Adapted from Crown Copyright). A Guide to Text Types: Narrative, Non-Fiction and Poetry. 2013. Available online: https://www.thomastallisschool.com/uploads/2/2/8/7/2287089/guide_to_text_types_final-1.pdf (accessed on 2 March 2022).
26. Kuijpers, M.M.; Douglas, S.; Kuiken, D. Capturing the Ways We Read. Anglistik 2020, 31, 53–69. [CrossRef]
27. Christenson, H.A. HathiTrust. Libr. Resour. Tech. Serv. 2011, 55, 93–102. [CrossRef]
28. Schutz, D. The Common Core State Standards Initiative. 2011. Available online: http://www.corestandards.org/ (accessed on 26 March 2022).
29. Wikipedia Contributors. Instructables. In Wikipedia, The Free Encyclopedia. Available online: https://en.wikipedia.org/w/index.php?title=Instructables&oldid=1024372150 (accessed on 26 March 2022).
30. IBM Corp. IBM SPSS Statistics for Windows, Version 26.0; IBM Corp.: Armonk, NY, USA, 2019.
31. Biber, D.; Conrad, S. Register, Genre, and Style; Cambridge University Press: Cambridge, UK, 2009.
32. Jacobs, A.M. (Neuro-)Cognitive poetics and computational stylistics. Sci. Study Lit. 2018, 8, 165–208. [CrossRef]
33. Nivre, J.; De Marneffe, M.C.; Ginter, F.; Goldberg, Y.; Hajic, J.; Manning, C.D.; McDonald, R.; Petrov, S.; Pyysalo, S.; Silveira, N.; Tsarfaty, R.; Zeman, D. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, 23–28 May 2016; pp. 1659–1666.
34. Nivre, J.; de Marneffe, M.C.; Ginter, F.; Hajič, J.; Manning, C.D.; Pyysalo, S.; Schuster, S.; Tyers, F.; Zeman, D. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020.
35. Voghera, M. La misura delle categorie sintattiche. In Parole e Numeri: Analisi Quantitative dei Fatti di Lingua; Aracne: Roma, Italy, 2005; pp. 125–138.
36. Nayak, A.; Natarajan, D. Comparative study of naive Bayes, support vector machine and random forest classifiers in sentiment analysis of twitter feeds. Int. J. Adv. Stud. Comput. Sci. Eng. 2016, 5, 16.
37. Biau, G. Analysis of a random forests model. J. Mach. Learn. Res. 2012, 13, 1063–1095.
38. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
39. Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307. [CrossRef]
40. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347. [CrossRef]
41. Bischl, B.; Lang, M.; Kotthoff, L.; Schiffner, J.; Richter, J.; Studerus, E.; Casalicchio, G.; Jones, Z.M. mlr: Machine Learning in R. J. Mach. Learn. Res. 2016, 17, 5938–5942.
42. Lu, X. Automatic measurement of syntactic complexity in child language acquisition. Int. J. Corpus Linguist. 2009, 14, 3–28. [CrossRef]
43. Lubetich, S.; Sagae, K. Data-driven measurement of child language development with simple syntactic templates. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 23–29 August 2014; pp. 2151–2160.
44. Prud'hommeaux, E.; Roark, B.; Black, L.M.; Van Santen, J. Classification of atypical language in autism. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Portland, OR, USA, 23 June 2011; pp. 88–96.
45. Rouhizadeh, M.; Sproat, R.; Van Santen, J. Similarity measures for quantifying restrictive and repetitive behavior in conversations of autistic children. In Proceedings of the Conference Association for Computational Linguistics North American Chapter, Meeting, Seattle, WA, USA, 29 April–4 May 2015; p. 117.
46. Roark, B.; Mitchell, M.; Hollingshead, K. Syntactic complexity measures for detecting mild cognitive impairment. In Biological, Translational, and Clinical Language Processing; Association for Computational Linguistics: Cambridge, MA, USA, 2007; pp. 1–8. [CrossRef]
47. Barbagli, A.; Lucisano, P.; Dell'Orletta, F.; Montemagni, S.; Venturi, G. CItA: An L1 Italian learners corpus to study the development of writing competence. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, 23–28 May 2016; pp. 88–95.
48. Cong, J.; Liu, H. Approaching human language with complex networks. Phys. Life Rev. 2014, 11, 598–618. [CrossRef]
49. Gao, Y.; Liang, W.; Shi, Y.; Huang, Q. Comparison of directed and weighted co-occurrence networks of six languages. Phys. A Stat. Mech. Appl. 2013, 393, 579–589. [CrossRef]
50. Lužar, B.; Levnajić, Z.; Povh, J.; Perc, M. Community structure and the evolution of interdisciplinarity in Slovenia's scientific collaboration network. PLoS ONE 2014, 9, e94429. [CrossRef] [PubMed]
51. Amancio, D.R.; Oliveira, O.N., Jr.; Costa, L.D.F. Structure–semantics interplay in complex networks and its effects on the predictability of similarity in texts. Phys. A Stat. Mech. Appl. 2012, 391, 4406–4419. [CrossRef]
52. Segarra, S.; Eisen, M.; Ribeiro, A. Authorship attribution using function words adjacency networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–30 May 2013; pp. 5563–5567. [CrossRef]
53. Segarra, S.; Eisen, M.; Ribeiro, A. Authorship Attribution Through Function Word Adjacency Networks. IEEE Trans. Signal Process. 2015, 63, 5464–5478. [CrossRef]
54. Silva, T.C.; Amancio, D.R. Word sense disambiguation via high order of learning in complex networks. Europhys. Lett. 2012, 98, 58001. [CrossRef]
55. Amancio, D.R.; Nunes, M.D.G.V.; Oliveira, O.N., Jr.; Costa, L.D.F. Using complex networks concepts to assess approaches for citations in scientific papers. Scientometrics 2012, 91, 827–842. [CrossRef]
56. Brede, M.; Newth, D. Patterns in syntactic dependency networks from authored and randomised texts. Complex. Int. 2008, 12, 051915.
57. Liang, W.; Shi, Y.; Tse, C.K.; Liu, J.; Wang, Y.; Cui, X. Comparison of co-occurrence networks of the Chinese and English languages. Phys. A Stat. Mech. Appl. 2009, 388, 4901–4909. [CrossRef]
58. Liang, W.; Shi, Y.; Tse, C.K.; Wang, Y. Study on co-occurrence character networks from Chinese essays in different periods. Sci. China Inf. Sci. 2012, 55, 2417–2427. [CrossRef]
59. Liu, H.; Li, W. Language clusters based on linguistic complex networks. Chin. Sci. Bull. 2010, 55, 3458–3465. [CrossRef]
60. Antiqueira, L.; Nunes, M.D.G.V.; Oliveira, O.N., Jr.; Costa, L.D.F. Strong correlations between text quality and complex networks features. Phys. A Stat. Mech. Appl. 2007, 373, 811–820. [CrossRef]
61. Amancio, D.R.; Antiqueira, L.; Pardo, T.A.; Costa, L.D.F.; Oliveira, O.N., Jr.; Nunes, M.G. Complex networks analysis of manual and machine translations. Int. J. Mod. Phys. C 2008, 19, 583–598. [CrossRef]
62. Amancio, D.R.; Oliveira, O.N.; Costa, L.D.F. Identification of literary movements using complex networks to represent texts. New J. Phys. 2012, 14, 043029. [CrossRef]
63. Costa, L.D.F.; Oliveira, O.N., Jr.; Travieso, G.; Rodrigues, F.A.; Villas Boas, P.R.; Antiqueira, L.; Viana, M.P.; Correa Rocha, L.E. Analyzing and modeling real-world phenomena with complex networks: A survey of applications. Adv. Phys. 2011, 60, 329–412. [CrossRef]
64. Newman, M.E.; Barabási, A.L.E.; Watts, D.J. The Structure and Dynamics of Networks; Princeton University Press: Princeton, NJ, USA, 2006.
65. Ke, J.; Yao, Y. Analysing Language Development from a Network Approach. J. Quant. Linguist. 2008, 15, 70–99. [CrossRef]
66. Akimushkin, C.; Amancio, D.R.; Oliveira, O.N., Jr. Text authorship identified using the dynamics of word co-occurrence networks. PLoS ONE 2017, 12, e0170527. [CrossRef]
Review
A Literature Survey on Word Sense Disambiguation for the
Hindi Language
Vinto Gujjar 1, Neeru Mago 2, Raj Kumari 3, Shrikant Patel 4, Nalini Chintalapudi 5,* and Gopi Battineni 5,6
1 Department of Computer Science & Applications, Panjab University, Chandigarh 160014, India; [email protected]
2 Department of Computer Science & Applications, Panjab University Swami Sarvanand Giri Regional Centre, Hoshiarpur 160014, India
3 University Institute of Engineering and Technology, Panjab University, Chandigarh 160014, India
4 School of IT & ITES, Delhi Skill and Entrepreneurship University, Government of NCT of Delhi, Delhi 110003, India
5 Clinical Research Centre, School of Medicinal and Health Products Sciences, University of Camerino, 62032 Camerino, Italy
6 Department of Electronics and Communication Engineering, Velagapudi Ramakrishna Siddharth Engineering College, Vijayawada 520007, India
* Correspondence: [email protected]
Abstract: Word sense disambiguation (WSD) is a process used to determine the most appropriate
meaning of a word in a given contextual framework, particularly when the word is ambiguous. While
WSD has been extensively studied for English, it remains a challenging problem for resource-scarce
languages such as Hindi. Therefore, it is crucial to address ambiguity in Hindi to effectively and
efficiently utilize it on the web for various applications such as machine translation, information
retrieval, etc. The rich linguistic structure of Hindi, characterized by complex morphological variations and syntactic nuances, presents unique challenges in accurately determining the intended
sense of a word within a given context. This review paper presents an overview of different approaches employed to resolve the ambiguity of Hindi words, including supervised, unsupervised,
and knowledge-based methods. Additionally, the paper discusses applications, identifies open
problems, presents conclusions, and suggests future research directions.
Keywords: word sense disambiguation; knowledge-based; supervised; unsupervised; Hindi language

Citation: Gujjar, V.; Mago, N.; Kumari, R.; Patel, S.; Chintalapudi, N.; Battineni, G. A Literature Survey on Word Sense Disambiguation for the Hindi Language. Information 2023, 14, 495. https://doi.org/10.3390/info14090495
Academic Editor: Peter Revesz
Received: 7 July 2023; Revised: 30 August 2023; Accepted: 2 September 2023; Published: 7 September 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
In the present age of information technology (IT), the whole world shares information
using the internet. This information is available in natural language. All natural
languages have an intrinsic feature called ambiguity, which refers to the situation where
a word can have multiple meanings. Ambiguity in natural language poses a significant
obstacle for Natural Language Processing (NLP). While the human mind can rely on cognition
and world knowledge to disambiguate word senses, machines lack this ability, which leads to
semantic errors and erroneous interpretations in their output. Therefore, the WSD process
is employed to alleviate ambiguity in sentences.
WSD represents one of the most formidable challenges within the realm of NLP and
stands as one of the earliest quandaries in computational linguistics. Experimentation
efforts in this domain commenced in the late 1940s, with Zipf’s [1] introduction of the
“law of meaning” in 1949. This principle posits a power law relationship between the
frequency of a word and the number of meanings it possesses, indicating that more common
words tend to have a greater range of meanings compared to less frequent ones. In 1975,
Wilks [2] advanced the field by developing a model known as “preference semantics”,
which employed selectional constraints and frame-based lexical semantics to ascertain the
Information 2023, 14, 495. https://doi.org/10.3390/info14090495
158
https://www.mdpi.com/journal/information
Information 2023, 14, 495
precise meaning of a polysemous word. Notably, the 1980s witnessed significant progress
in WSD research, facilitated by the availability of extensive lexical resources and corpora.
Ultimately, WSD entails the task of identifying the accurate sense of a word within its
specific contextual framework [3]. WSD is not considered a final objective; instead, it is
recognized as an intermediary task with relevance to various applications within the field
of NLP. Figure 1 presents the WSD conceptual diagram.
Figure 1. Conceptual Diagram of WSD.
In machine translation, WSD is an important step because a number of words in
every language have a different translation according to the context of their usage [3–6].
It is an important issue to be considered during language translation. WSD assumes
a crucial role in ensuring precise text analysis across a wide range of applications [7,8].
For example, an intelligence-gathering system could distinguish between references to
illicit drugs and medicinal drugs through the application of WSD. Research works such as
named entity recognition and bioinformatics research can also use WSD. In the realm of
information retrieval (IR), the primary concern lies in determining the accurate sense of a
polysemous word within a given query before initiating the search for its corresponding
answer [9,10]. Enhancing the efficiency and effectiveness of an IR system entails the
resolution of ambiguity within a query. Similarly, in sentiment analysis, the elimination of
ambiguity is crucial for determining the correct sentiment tags (e.g., negative or positive)
associated with a sentence [11,12]. In question-answering (QA) systems, WSD assumes a
significant role in identifying the appropriate types of answers that correspond to a given
question type [13,14]. Furthermore, WSD is necessary to accurately assign the appropriate
part of speech tagging (POS) to a word, as its POS can vary depending on the contextual
usage [15,16].
WSD can be categorized into two classifications: “all words WSD” and “target word
WSD”. In the case of all words WSD, the disambiguation process extends to all the
words present in a given sentence, whereas target word WSD specifically focuses on
disambiguating the target word within the sentence. WSD poses a significant challenge
within the field of NLP and remains an ongoing area of research. It is regarded as an open
problem, categorized as “AI-Complete”, signifying that a viable solution does exist but
has not yet been discovered. Consider the two sentences below in the Hindi language:
आज-कल बाज़ार में नई-नई वस्तुओं की माँग बढ़ रही है।
(aaj-kal baazaar mein naee-naee vastuon kee maang badh rahee hai)
(Nowadays, the demand for new things is increasing in the market.)
सुहागन औरतें अपनी माँग में सिंदूर भरती हैं।
(suhaagan auraten apanee maang mein sindoor bharatee hain)
(Married women apply vermillion on their maang, माँग (the parting of the hair on the head).)
In both sentences, we have a common word, “माँग” (maang), that has a different
meaning as per the context. In the initial sentence, the term refers to “the demand”, whereas
in the subsequent sentence, it denotes “the parting of the hair on the head”. Identifying
the specific interpretation of a polysemous word is not a problem for a person, whereas for
machines it is a challenging task. Meanwhile, Hindi is the fourth most spoken language, with
over 615 million speakers worldwide. A significant amount of work has been performed for
English WSD, but WSD for the Hindi language is still in its infancy and is only now gaining
the attention of researchers.
The objective of this paper is to provide a comprehensive survey of the existing
approaches and techniques for WSD in the Hindi language. It presents several approaches
employed for WSD in the context of Hindi. The paper highlights the specific challenges
and limitations faced in WSD for Hindi due to its morphological complexity, rich lexical
resources, and the limited availability of labeled data. The rest of this paper is
structured as follows: Section 2 discusses the various approaches for WSD, followed by the
survey methodology presented in Section 3. Section 4 critically reviews the surveyed results
on WSD, and Section 5 concludes the paper.
2. Various Approaches for WSD
Various approaches and methods used for WSD are classified into two categories:
knowledge-based approaches and ML (Machine Learning)-based approaches. In
knowledge-driven approaches, external lexical resources such as WordNet, dictionaries, and
thesauri are required to perform WSD, while in ML-based techniques, classifiers are trained
on sense-annotated corpora to carry out the WSD task. Figure 2 presents the different WSD
approaches, and each category is explained further below.
Figure 2. Classification of WSD Approaches.
2.1. Knowledge-Based Approaches
The knowledge-driven approach depends on various sources of knowledge, such as
dictionaries, thesauri, ontologies, and collocations. The goal of these approaches to WSD
is to utilize such knowledge resources to deduce the meanings of words within a given
context. Below is an overview of several knowledge-based approaches.
•
LESK Algorithm
The first algorithm developed using the knowledge-driven approach for WSD is the
LESK algorithm [17,18]. The method relies on determining the degree of word overlap
between the definitions or glosses of two or more target words. The dictionary definitions
or glosses of the polysemous word are collected from the dictionary, and then these glosses
and context words are compared. The desired sense of the polysemous word is determined
by identifying the sense with the highest degree of overlap. A score is calculated for each
pair of word senses using the following formula:
$$\mathrm{overlapscore}_{Lesk}(S_1, S_2) = |\mathrm{Gloss}(S_1) \cap \mathrm{Gloss}(S_2)|$$
The senses of the respective words are assigned based on the maximum value obtained
from the above formula, where Gloss(Si) represents the collection of words in the textual
interpretation of sense Si of word W.
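A minimal sketch of this overlap computation, together with the common simplified variant that compares a sense's gloss against the context, is shown below; the glosses and sense labels are illustrative English stand-ins for entries that would come from a dictionary or the Hindi WordNet:

```python
def gloss_overlap(gloss1: str, gloss2: str) -> int:
    """|Gloss(S1) ∩ Gloss(S2)| computed as the size of the word-set overlap."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

def simplified_lesk(senses: dict, context: str) -> str:
    """Pick the sense whose gloss overlaps most with the context words.
    `senses` maps sense labels to glosses (toy stand-ins here)."""
    context_words = set(context.lower().split())
    return max(senses,
               key=lambda s: len(set(senses[s].lower().split()) & context_words))

senses = {"DEMAND": "a demand or request for goods in a market",
          "HAIR_PARTING": "line where hair is parted on a head"}
print(simplified_lesk(senses, "the demand of new things is increasing in the market"))
```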
•
Semantic Similarity
Words that exhibit a connection with one another possess a shared context, allowing
for the selection of the most suitable sense of a word by leveraging the meanings found
within the shortest semantic distance. Various metrics can be employed to compute the
semantic similarity between two words [19].
•
Selectional Preferences
Selectional preferences provide insights into the categories of words that are likely
to be associated with one another and convey shared knowledge [20,21]. For instance,
“actors” and “movies” are words that exhibit semantic relationships. In this approach,
inappropriate senses of words are excluded, and only those senses that align with common
sense rules are taken into consideration. The methodology revolves around tallying the
occurrences of word pairs with syntactic relations in a given corpus. The identification of
word senses is accomplished based on this frequency count.
•
Heuristic Approach
In the heuristic approach, heuristics used to disambiguate a word are calculated from
different linguistic properties. Three types of heuristics are employed as a baseline.
(a) The most frequent sense heuristic operates on the principle of identifying all possible
meanings that a word can have, with the understanding that one particular sense
occurs more frequently than others.
(b) The one sense per discourse heuristic posits that a term or word maintains the same
meaning throughout all instances within a specified text.
(c) The one sense per collocation heuristic is similar to the one sense per discourse
heuristic, but it assumes that nearby words offer a robust and consistent indication
of the contextual sense of a word.
•
Walker’s Algorithm
Walker introduced an approach or technique for WSD in 1987 [22,23]. This approach
incorporates the use of a thesaurus to accomplish the task. The initial step involves
assigning a thesaurus class to each sense of a polysemous word. Subsequently, a total
sum is computed by considering the context where the ambiguous word appears. If the
context of the word matches the word sense with a thesaurus category, the total sum for
that category increases by one.
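A minimal sketch of this counting scheme is shown below; the thesaurus classes and member words are toy stand-ins for a full thesaurus:

```python
def walker_wsd(sense_to_class, thesaurus, context_words):
    """Walker's algorithm: every context word found under a sense's thesaurus
    class adds one to that class's running sum; the best-scoring sense wins."""
    scores = {sense: sum(1 for w in context_words if w in thesaurus[cls])
              for sense, cls in sense_to_class.items()}
    return max(scores, key=scores.get)

sense_to_class = {"bank/RIVER": "GEOGRAPHY", "bank/FINANCE": "COMMERCE"}
thesaurus = {"GEOGRAPHY": {"river", "shore", "water", "stream"},
             "COMMERCE": {"money", "loan", "deposit", "account"}}
print(walker_wsd(sense_to_class, thesaurus,
                 "i will be at the bank of the narmada river".split()))  # bank/RIVER
```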
2.2. ML-Based Approaches
In ML-based approaches, a classifier undergoes a training step to acquire knowledge
of the attributes and subsequently determines the senses of unseen examples. The
resources used in this approach are based on a corpus, which can be tagged or untagged.
In these approaches, the target is the word to be disambiguated, also called the input word,
and the surrounding text in which it is embedded is referred to as the contextual
information. ML-based approaches are categorized into three types: supervised,
unsupervised, and semi-supervised techniques.
2.2.1. Supervised Techniques
Supervised techniques for disambiguation utilize sense-annotated corpora for training
purposes. These techniques operate under the supposition that the context itself can provide
sufficient evidence to resolve a sense ambiguity. The context is represented as a
collection of word “features”, encompassing information about the neighboring words as
well. Within these techniques, a classifier is trained using a designated training set that
consists of instances specifically related to the target word. Overall, supervised approaches
to WSD have generally achieved superior results compared to other approaches. However,
these techniques rely on sense-annotated datasets, which are very
expensive to create. The various supervised techniques are as follows:
•
Decision list
In the context of WSD, a decision list refers to a sequential collection of “if-then-else”
rules that are employed to determine the suitable sense for a given word [24,25]. It can
also be viewed as a list of weighted “if-then-else” rules. These rules are generated
from a training set, utilizing parameters such as feature value, sense, and score. The
decision list is constructed by arranging these rules in descending order of their scores.
When a word w is encountered, its frequency of occurrence is computed, and its
representation as a feature vector is used to evaluate the decision list, resulting in a
calculated value. The attribute with the highest value that matches the input vector
corresponds to the meaning assigned to the word w.
•
Decision Tree
A decision tree is a classification method that repeatedly divides the training dataset
and organizes the classification rules in a tree-like structure [26,27]. Every interior node
of the decision tree represents a test performed on an attribute value, and the branches
represent the outcomes of the test. The word sense is determined when a leaf node is
reached. An illustration of a decision tree for WSD is depicted in Figure 3. In this example,
the active sense of the polysemous word “bank”, used as a noun within the sentence “I
will be at the bank of the Narmada River in the afternoon”, is sought. The tree is
constructed and traversed to ultimately select the sense “bank/RIVER”. A null value in a
leaf node indicates that no sense selection is available for that particular attribute value.
Figure 3. Decision Tree Example.
•
Naïve Bayes
The NB (Naïve Bayes) classifier is a probabilistic classifier that applies Bayes'
Theorem [28,29] to determine the appropriate meaning of a word. To classify text documents,
it computes the conditional probability of each sense Si of the word w based on the context
features fj. The sense with the highest value, determined using the following formula, is
chosen as the most appropriate sense within the given context.
$$\hat{S} = \underset{S_i \in \mathrm{Senses}_D(w)}{\arg\max}\; P(S_i \mid f_1, \ldots, f_m)
          = \underset{S_i \in \mathrm{Senses}_D(w)}{\arg\max}\; \frac{P(f_1, \ldots, f_m \mid S_i)\, P(S_i)}{P(f_1, \ldots, f_m)}
          = \underset{S_i \in \mathrm{Senses}_D(w)}{\arg\max}\; P(S_i) \prod_{j=1}^{m} P(f_j \mid S_i)$$
In this context, m denotes the number of features. The prior probability P(Si) is
computed from the co-occurrence frequency of the senses in the training dataset, while
P(fj | Si) is derived from the presence of the attribute fj given the sense Si.
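A minimal sketch of such a classifier, with add-one smoothing for unseen features, is shown below; the two sense-annotated training examples are toy stand-ins for a real corpus, reusing the maang senses from the introduction as hypothetical labels:

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: (context_words, sense) pairs from a sense-annotated corpus.
    Returns a classifier implementing argmax_Si P(Si) * prod_j P(fj | Si)."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        feat_counts[sense].update(words)
        vocab.update(words)

    def classify(context):
        def log_score(sense):
            prior = math.log(sense_counts[sense] / len(examples))  # log P(Si)
            total = sum(feat_counts[sense].values())
            return prior + sum(                                     # + sum_j log P(fj|Si)
                math.log((feat_counts[sense][f] + 1) / (total + len(vocab)))
                for f in context)
        return max(sense_counts, key=log_score)

    return classify

classify = train_nb([("market demand price increase".split(), "DEMAND"),
                     ("hair vermillion married head".split(), "HAIR_PARTING")])
print(classify("demand in the market".split()))  # -> DEMAND
```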
•
Neural network
Neural networks consist of interconnected units, or artificial neurons, that serve as
a loose model of the neurons of the human brain [30,31]. They follow a connectionist
approach and utilize a computational model for data processing. The learning program
receives input attributes and a target output. The objective is to divide the training data
into non-overlapping sets based on the desired responses. When new input pairs are presented
to the network, the weights are adjusted so that the output unit that generates the desired
result has higher activation than the other output units. In the context of neural networks,
nodes represent words, and these words activate the concepts with which they share semantic
relations. Inputs propagate from the input layer to the output layer through intermediate
layers, and the network efficiently processes and manipulates the inputs to generate an
output. However, generating a precise output becomes challenging when the connections within
the network are widely dispersed and form loops in multiple directions.
•
Support Vector Machine (SVM)
An SVM [32] serves the purpose of both classification and regression tasks. This
approach is rooted in the concept of identifying a hyperplane that can effectively separate
positive examples from negative ones with the highest possible margin. The margin
represents the interspace between the hyperplane and the nearest positive and negative
examples, which are referred to as support vectors. In Figure 4, circles and squares
represent two different classes; the bold line represents the hyperplane that separates the
two classes, while the dashed lines indicate the support vectors closest to the positive and
negative examples. These support vectors play an important role in constructing an SVM
classifier: they determine the position and orientation of the hyperplane, and by removing
or adding support vectors, the position of the hyperplane can be adjusted.
Figure 4. Illustrating SVM Classification.
•
Exemplar or instance-based learning
In this approach, the classification model is constructed from examples [33]. These
examples are represented as points in a feature space, and new examples are evaluated
against them for classification; as new examples are encountered, they are progressively
stored in the model. The k-nearest neighbor (k-NN) method [34] is an example of this type of
approach. In k-NN, examples are stored based on their feature values, and the classification
of a new example is determined by considering the senses of the k most similar previously
stored examples. The Hamming distance (a measure of the number of differing elements
between two strings of equal length) [35,36] is calculated between the new example and the
stored examples, measuring the proximity of the given input to the stored examples. The
majority sense among the k nearest neighbors then
represents the output sense.
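A minimal sketch of k-NN sense selection over binary context features with the Hamming distance is shown below; the exemplars and feature layout are toy stand-ins:

```python
from collections import Counter

def hamming(a, b):
    """Number of differing positions between two equal-length feature vectors."""
    return sum(x != y for x, y in zip(a, b))

def knn_sense(stored, query, k=3):
    """stored: (binary feature vector, sense) exemplars; returns the majority
    sense among the k exemplars nearest to the query in Hamming distance."""
    nearest = sorted(stored, key=lambda ex: hamming(ex[0], query))[:k]
    return Counter(sense for _, sense in nearest).most_common(1)[0][0]

# Toy binary features: [near "market", near "hair", near "price", near "head"]
stored = [((1, 0, 1, 0), "DEMAND"),
          ((1, 0, 0, 0), "DEMAND"),
          ((0, 1, 0, 1), "HAIR_PARTING")]
print(knn_sense(stored, (1, 0, 1, 1), k=1))  # -> DEMAND
```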
•
Ensemble methods
In order to enhance the accuracy of disambiguation, it is common to employ a
combination of different classifiers. This combination strategy is called an ensemble
method; such methods combine algorithms of different natures or with different
characteristics [37]. Ensemble methods are more powerful than single supervised techniques,
as they can overcome the weaknesses of a single approach. Strategies such as majority
voting, the AdaBoost system of Freund and Schapire [38], rank-based combination, and
probability mixture can be utilized to combine the different classifiers to improve
accuracy. Figure 5 presents a simple view of the ensemble WSD approach.
Figure 5. Ensemble Methods: Combining the Strengths of Multiple Models.
2.2.2. Unsupervised Techniques
Unsupervised techniques do not make use of sense annotated datasets or external
knowledge sources. Instead, they operate under the assumption that senses with similar
meanings occur in similar contexts. These techniques aim to determine senses from the text
by clustering the word occurrences based on some measure of contextual similarity. This
task is known as word sense induction or discrimination. Unsupervised techniques offer
significant potential in overcoming the bottleneck of knowledge acquisition, as they do not
require manual efforts. Here are some approaches that are used for unsupervised WSD.
Context Clustering: This unsupervised approach is rooted in the use of clustering
techniques [39]. It begins by representing words through context vectors, which are then
organized into clusters, with each cluster corresponding to a sense of the target word. The
approach revolves around the notion of a word space or vector space, whose dimensions
represent individual words. Specifically, a word w is transformed into a vector capturing
the frequency of its co-occurrences with other words. This leads to the creation of a
co-occurrence matrix, to which various similarity measures are applied. Finally, sense
discrimination is performed by applying clustering techniques such as k-means clustering
or agglomerative clustering.
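A minimal sketch of context clustering with scikit-learn is shown below; the four contexts are toy stand-ins for occurrences of a single ambiguous word, and k-means with two clusters stands in for the induced senses:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# One entry per occurrence of the ambiguous word; the surrounding words
# form its context vector, and clusters stand in for induced senses.
contexts = ["demand of new things is increasing in the market",
            "prices rise when market demand grows",
            "married women apply vermillion in the parting of the hair",
            "she parted her hair before applying vermillion"]
X = CountVectorizer().fit_transform(contexts)      # word-occurrence counts
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g., occurrences grouped into two induced senses
```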
Word Clustering: The induction of word senses can also be achieved through word
clustering [3]. This approach groups words that are semantically similar and may
possess specific meanings. One commonly employed method for word clustering is Lin's
method [40], which identifies words that are synonymous with or similar to the target
word. The similarity between the synonyms and the target word is determined by analyzing
the features represented by syntactic dependencies found in a corpus, such as verb–object,
subject–verb, and adjective–noun relationships. The more similar two words are,
the greater the extent to which they share information content. A word clustering algorithm
is then utilized to differentiate between senses. Given a list of words W, the words are
initially arranged based on their similarity, and a similarity tree is constructed. In the
beginning, the tree has only a single node, and through iterations, the most similar word is
added as a child to the tree for each word in the list. Subsequently, pruning is performed,
resulting in the generation of sub-trees. Each sub-tree, with the initial node serving as its
root, represents a distinct sense of the original word.
Another method that is used for the clustering of words is the clustering by committee
(CBC) [41] algorithm. The first step is similar to the above, i.e., a set of similar words is
created for each input word. A similarity matrix is constructed to capture the pairwise
similarity information between words. The second step involves the application of a
recursive function to determine a set of clusters, referred to as committees. Following this,
the average-link clustering technique is applied. In the final step, a discrimination process is executed, assigning to each target word its most similar cluster according to the word's similarity to the centroid of each committee. Subsequently, the features shared by the word and that committee are removed from the word's representation, which allows less frequent senses of the same word to be identified in the next iteration.
Co-occurrence Graph: This approach utilizes a graph-based methodology. It involves
the creation of a co-occurrence graph [42], denoted as G, comprising vertices V and edges
E. Words are represented as vertices, and the connections between words that co-occur
within the same paragraph are represented as edges. The weight assigned to each edge is
determined by the frequency of co-occurrences, thus capturing the relationships between
connected words. This graph construction effectively portrays the grammatical relations
between the words.
In order to determine the sense of a word, an iterative method is used to identify the
word with the highest degree node in the graph. Subsequently, a minimum spanning tree
algorithm is applied to deduce the word’s sense based on the information extracted from
the graph. This process allows for meaningful disambiguation of the word within the given context.
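A rough sketch of this construction, assuming networkx and a few invented paragraphs; the hub extraction and minimum-spanning-tree step follow the description above only in outline.

```python
# Co-occurrence graph for WSD: words are vertices, paragraph co-occurrences
# are weighted edges, the highest-degree word is taken as a hub, and a
# minimum spanning tree organizes the related words.
import itertools
import networkx as nx

paragraphs = [
    ["bank", "river", "water"],
    ["bank", "money", "loan"],
    ["bank", "water", "flood"],
]

G = nx.Graph()
for para in paragraphs:
    for u, v in itertools.combinations(set(para), 2):
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

hub = max(G.degree, key=lambda nd: nd[1])[0]   # highest-degree word
mst = nx.minimum_spanning_tree(G)
print(hub, sorted(mst.edges(data="weight")))
```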
2.2.3. Semi-Supervised Techniques
Semi-supervised techniques, also known as weakly supervised or minimally supervised approaches, are utilized in WSD when training data are scarce. These methods make efficient use of both labeled and unlabeled data. Among the earliest algorithms in this realm is bootstrapping: a classifier is trained on a small seed set of sense-labeled examples, applied to the unlabeled data, and its most confident predictions are added back to the training set, with the cycle repeated until the labels stabilize. This makes it possible to build a usable disambiguator when only a handful of annotated examples are available.
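A compact self-training sketch of this bootstrapping loop, assuming scikit-learn and SciPy and using invented seed contexts for the English word "bank"; a real system would add a confidence threshold and a stopping criterion.

```python
# Bootstrapping: train on seeds, absorb the single most confident unlabeled
# example each round, and retrain until the pool is empty.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

seeds = [("river water near the bank", "shore"), ("bank loan interest rate", "finance")]
unlabeled = ["money in the bank", "walked along the river bank", "bank approved the credit"]

vec = CountVectorizer()
X_all = vec.fit_transform([t for t, _ in seeds] + unlabeled)
X, y = X_all[: len(seeds)], [s for _, s in seeds]
pool = list(range(len(seeds), X_all.shape[0]))

while pool:
    clf = MultinomialNB().fit(X, y)
    probs = clf.predict_proba(X_all[pool])
    best = int(probs.max(axis=1).argmax())        # most confident example in the pool
    idx = pool.pop(best)
    X = vstack([X, X_all[idx]])                   # absorb it into the training set
    y = y + [clf.classes_[probs[best].argmax()]]

print(y)  # seed senses followed by bootstrapped labels
```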
Table 1 gives an in-depth comparison of the various WSD approaches based on their benefits, drawbacks, and rationale for use. It seeks to provide a thorough grasp of how each method works and of the settings in which each excels or is limited.
Table 1. Comparative Analysis of Knowledge-Based, Supervised, Unsupervised, and Semi-Supervised Techniques.

Knowledge-based
Working: Utilizes pre-defined rules and human expertise to make decisions or classify data.
Advantages: (1) Interpretable outcomes; (2) Robust to noisy data.
Disadvantages: (1) Limited scalability; (2) Relies on expert knowledge.
Justification for Usage: Useful when domain-specific knowledge is available and interpretability is essential.

Supervised
Working: Trained on labeled data with input–output pairs and predicts outputs for unseen data based on the learned model.
Advantages: (1) High accuracy; (2) Well-established algorithms; (3) Suitable for various problem types (classification, regression, etc.).
Disadvantages: (1) Requires labeled data; (2) Sensitive to outliers and noise; (3) Lack of generalization to unseen classes or categories.
Justification for Usage: Preferred when labeled data are available and the goal is precise predictions.

Unsupervised
Working: Clusters data or discovers hidden patterns without labels.
Advantages: (1) Useful for exploratory data analysis; (2) Can handle large datasets; (3) Detects anomalies or outliers.
Disadvantages: (1) Limited guidance in model evaluation; (2) Lack of direct feedback on model performance; (3) Difficulty in interpreting the results.
Justification for Usage: Ideal for identifying structures in data when labeled data are scarce or unavailable.

Semi-supervised
Working: Utilizes a combination of labeled and unlabeled data.
Advantages: (1) Utilizes the advantages of both supervised and unsupervised learning; (2) Cost-effective for certain applications; (3) Improves performance with limited labeled data.
Disadvantages: (1) Difficulty in obtaining and managing labeled data; (2) May not outperform fully supervised or unsupervised techniques; (3) May suffer from error propagation due to incorrect labels.
Justification for Usage: Valuable when labeled data are expensive to acquire but unlabeled data are abundant.
3. WSD Execution Process
WSD is the task of determining an ambiguous word's suitable sense based on context. A variety of methods have been applied to WSD, most of them statistical; some use sense-tagged corpora, while others use unsupervised learning. The flowchart in Figure 6 shows the steps performed for WSD.
A string with an ambiguous word is given as an input string. Then, pre-processing
is performed on this input string. Pre-processing steps such as stop word elimination,
tokenization, part-of-speech tagging, and lemmatization, etc., are essential to transform
raw text into a suitable format for analysis. As a running example, consider the input sentence "raam kachcha aam kha raha hai" (Ram is eating a raw mango). The various pre-processing steps are as follows:
Stop Word Elimination: Stop words are words commonly filtered out or excluded
from the analysis process in NLP. These words are highly frequent in most texts, but they
generally lack significant meaning or do not contribute much to the overall understanding
of the content. By eliminating stop words, the text becomes less noisy, and contextual
relevance is improved. This improved context helps the WSD algorithm make more
accurate sense selections.
Examples of stop words in English include "the", "a", "an", "in", "on", "at", "and", "but", "or", "I", "you", "he", "she", "it", etc.; Hindi (Devanagari script) has an analogous stop-word list. The elimination of stop words and punctuation from the input text is performed in this step, as they hold no significance or utility; after stop-word removal, only the content words of the example sentence remain.
Figure 6. Flowchart of WSD Execution Process.
Tokenization: Tokenization is a fundamental technique in NLP that involves dividing
a given text into smaller components, such as sentences and words. It encompasses the
method of breaking down a string into a list of individual units called tokens. It helps in
isolating individual words for disambiguation, making the WSD process more manageable
and focused. In this context, a token can refer to a word within a sentence or a sentence
within a paragraph, representing a fragment of the whole text. After tokenization, the example sentence is broken into its individual word tokens.
Stemming: Stemming is a linguistic process aimed at removing the last few characters of a word, which can sometimes result in incorrect meanings and altered spellings.
Stemming simplifies text data and improves computational efficiency, aiding in tasks such
as text matching and retrieval. However, it may generate non-words, leading to potential
loss of word meaning and semantic distinctions. For instance, stemming the word 'Caring' would return 'Car', which is an incorrect result.
Lemmatization: Lemmatization takes into account the context of a word and transforms it into its meaningful base form, known as a lemma. For example, by lemmatizing
the word ‘Caring,’ the resulting lemma would be ‘Care’, which is the correct result. By
converting words to their lemma, the WSD system can associate different inflected forms
of a word with the same sense, improving the coverage and generalization of the sense
inventory.
PoS Tagging: POS tagging involves the assignment of suitable part-of-speech labels
to each word within a sentence, encompassing categories such as nouns, adverbs, verbs,
pronouns, adjectives, conjunctions, and their respective sub-categories. This information is
crucial for WSD because different parts of speech may have different senses. POS tagging
helps in narrowing down the sense options for each word based on its grammatical role in
the sentence.
When pre-processing is completed, a WSD algorithm is applied, which outputs the accurate sense of the ambiguous word. WSD algorithms may be supervised, semi-supervised, unsupervised, or knowledge-based.
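To make the pipeline concrete, the sketch below runs these steps with NLTK's English tools, since they are freely available; a Hindi system would substitute Hindi tokenizers, stop lists, stemmers, and taggers, and the sentence is the English gloss of the running example.

```python
# Illustrative pre-processing pipeline: tokenization, stop-word elimination,
# lemmatization, and POS tagging, in the order described above.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

text = "Ram is eating a raw mango"
tokens = nltk.word_tokenize(text.lower())                        # tokenization
stops = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stops]  # stop-word removal
lemmas = [WordNetLemmatizer().lemmatize(t) for t in content]     # lemmatization
print(nltk.pos_tag(lemmas))                                      # POS tagging
```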
WordNet [43] is a valuable tool that plays a significant role in WSD. It serves as an
extensive database containing nouns, adjectives, verbs, and adverbs, which are arranged
into clusters of synonymous word groups known as synsets. These collections are interconnected through lexical and semantic relations. Hindi WordNet (HWN) [44], which shares this design with English WordNet, has been developed at IIT Bombay. In HWN, words are grouped based on their similarity of meaning; it is worth noting that, in certain contexts, terms that have distinct meanings elsewhere can be considered synonymous. Each word within the HWN is associated with a corresponding synset, which stands for "synonym set": a group of words or terms that are synonymous or have similar meanings, representing a single lexical concept.
Synsets serve as HWN's primary building blocks. HWN covers open-class words, i.e., content words; thus, the noun, adjective, verb, and adverb categories make up the HWN. The following characteristics apply to every entry in the HWN:
• Synset: This is a group of words, or synonyms, with similar meanings. For example, the synset {pen, kalam, lekhanee} refers to a tool or device used for writing with ink. Within a synset, the words are ordered according to their frequency of usage.
• Gloss: This explains the concept. It is divided into two sections: a text definition that explains the concept denoted by the synset (for example, "syaahee ke sahayog se kaagaj aadi par likhane ka upakaran" elaborates on the idea of a writing or drawing instrument that utilizes ink), along with an illustrative sentence showcasing the use of each word in a sentence. In general, a synset's words can be interchanged within a phrase (for instance, "yah pen kisee ne mujhe upahaar mein pradaan kee hai" (Someone gifted me this pen.) illustrates the usage of the synset's words describing an ink writing or drawing instrument).
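Since HWN mirrors the synset and gloss organization of English WordNet, the structure can be illustrated with NLTK's English WordNet interface; a Hindi system would query an IndoWordNet API instead, so the snippet is a structural analogy only.

```python
# Each synset carries member lemmas (the synonym set), a gloss definition,
# and example sentences, as described for HWN above.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

for syn in wn.synsets("pen")[:3]:
    print(syn.name())                          # synset identifier
    print("  lemmas :", [l.name() for l in syn.lemmas()])
    print("  gloss  :", syn.definition())      # the text definition
    print("  example:", syn.examples()[:1])    # illustrative sentence, if any
```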
4. Results and Discussions
In this section, we present an overview of the techniques and methodologies used by different researchers, the accuracy they achieved, the datasets they used, and what is specific about each technique. The discussion is organized by the technique employed, which should help future researchers decide which technique to use.
4.1. Knowledge-Based Techniques
Knowledge-based techniques for WSD rely on external knowledge resources to resolve word ambiguities. These techniques use lexical databases, semantic networks, and
linguistic resources to associate words with their appropriate meanings based on contextual
information. As researchers delved into the subject, they started employing a combination of automatic knowledge extraction techniques alongside manual methods. Various
knowledge-based techniques used by researchers for WSD are as follows:
In 1986, the first such algorithm, the Lesk algorithm [18], was developed by Michael Lesk for the disambiguation of words. The algorithm computes the overlap between the context in which the word occurs and each definition of the input word taken from the Oxford Advanced Learner's Dictionary (OALD); the sense with the maximum overlap is chosen as the correct sense of the ambiguous word. In [17], Banerjee and Pedersen introduced an adapted Lesk approach that relies on the lexical database WordNet as a source of knowledge rather than a machine-readable dictionary. WordNet, a hierarchical structure of semantic relations such as synonyms, hypernyms, meronyms, and antonyms, serves as the foundation for this algorithm.
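The core overlap computation can be sketched in a few lines; the sense inventory below is a toy stand-in for a machine-readable dictionary, with no stop-word filtering or gloss expansion attempted.

```python
# Simplified Lesk: choose the sense whose gloss shares the most words with
# the ambiguous word's context.
def lesk(context_words, senses):
    # senses: mapping from sense name to its gloss text.
    def overlap(gloss):
        return len(set(context_words) & set(gloss.split()))
    return max(senses, key=lambda s: overlap(senses[s]))

senses = {
    "bank_finance": "an institution that accepts deposits and lends money",
    "bank_river": "the sloping land beside a body of water such as a river",
}
context = "he sat on the bank of the river watching the water".split()
print(lesk(context, senses))  # -> bank_river
```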
The notion of disambiguating Indian languages was initially proposed with a technique involving a comparison of contexts within which ambiguous words occurred with
those created with HWN [45]. The sense would be determined according to its degree
and extent of overlap. HWN arranges the lexical information based on word meanings.
Hindi WordNet’s design was influenced by English WordNet. HWN was developed by IIT
Bombay, and it became publicly available in 2006. The accuracy range is about 40% to 70%.
Singh et al. [46] investigated the impact of the size of context window, stemming, and
stop word removal on the Lesk-like algorithm for WSD in Hindi. The elimination of stop
words coupled with the use of stemming is a proven method for obtaining good results, and
they applied the Lesk algorithm to their work. From the analysis carried out, it is evident
that utilizing 'Karak relations' leads to correct disambiguation. Additionally, stop-word elimination combined with stemming can raise the proportion of content-specific vocabulary while also promoting greater word-stem overlap. A 9.24% improvement in
precision is reported after the elimination of stop words and stemming over the baseline.
In [47], the WSD technique relies on graph-based methods; the authors merged Lesk semantic similarity measures with in-degree approaches for graph centrality. The first step involves constructing a graph for all target words in the sentence, wherein nodes correspond to words and edges denote their respective semantic relations; the final graph is created using Hindi WordNet along with a depth-first search (DFS) algorithm. The determination of a word's meaning is ultimately achieved through the application of graph centrality measures. An accuracy rate of 65.17% is achieved.
The authors of [48] introduced and evaluated the effectiveness of Leacock–Chodorow's measure of semantic relatedness for Hindi WSD. Semantic similarity between two terms indicates a relationship; relationships between concepts include semantic similarity and additional relations such as is-a-kind-of, is-the-opposite-of, is-a-specific-example-of, is-a-part-of, etc. The Leacock–Chodorow metric is employed, taking into account the length of paths between noun concepts within an is-a hierarchy. The algorithm employs the Hindi WordNet hierarchy to acquire word meanings and uses it in the process of disambiguation rather than relying solely on direct overlap. For evaluation purposes, a dataset consisting of 20 sense-tagged polysemous Hindi nouns is utilized; with this metric, an accuracy of 60.65% is reported. The role of hypernym, holonym, meronym, and hyponym relations in Hindi WSD is examined in [49]. The authors considered five different scenarios in their research: all relations; hyponym and hypernym; hypernym; holonym; and hyponym. The baseline makes no use of any semantic relations. When taking all relations into account, they found that total precision increased by 12.09% over the baseline. The use of hyponyms produced the greatest improvement for a single semantic relation, with a precision improvement of 9.86% overall.
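For reference, the Leacock–Chodorow score of two concepts grows as the shortest is-a path between them shrinks relative to the depth of the taxonomy. In standard notation, with len(c1, c2) the shortest-path length between concepts c1 and c2, and D the maximum depth of the noun hierarchy (here, of Hindi WordNet):

\[ \mathrm{sim}_{LCh}(c_1, c_2) = -\log \frac{\mathrm{len}(c_1, c_2)}{2D} \]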
Sawhney et al. [50] employed a modified Lesk approach that incorporates a dynamic
context window concept. The dynamic context window refers to the number of preceding
and succeeding words surrounding the ambiguous words. According to this approach, if
two words have similar meanings, then there must be a common topic in their vicinity. An
increase in precision signifies that this algorithm provides superior results as compared to
prior methods that employ a fixed-size context window. The Lesk approach was applied to bigram and trigram words to disambiguate verbs [51]; to the best of our knowledge, this is the only work that disambiguates Hindi verbs, as most of the work targets nouns.
In [52], Goonjan et al. make use of Hindi WordNet to retrieve the senses of the words, and then a graph is created using a depth-first search between the senses of the words. After that, weights are assigned to the edges connecting the nodes according to the weights of the Fuzzy Hindi WordNet. Then, various local fuzzy centrality measures are applied, and the values of these measures help to find the accurate meaning of the polysemous word. The knowledge-driven Lesk algorithm is employed in [53], which works by selecting the meaning whose definition most closely matches the surrounding context. In their investigation, the authors successfully identified 2143 out of 3000 ambiguous statements, achieving an accuracy rate of 71.43%.
In [54], WSD for the Bengali language is performed in two distinct phases. During
the first phase, sense clusters of an ambiguous word are constructed by considering the
preceding and succeeding words in their context. In the second phase, WSD is performed
by utilizing a semantic similarity measure after expanding the context with the assistance of
Bengali WordNet. A test set of 10 ambiguous Bengali words, with 200 test sentences per word, is used. The overall accuracy achieved is 63.71%. Tripathi et al. [55] used a Lesk algorithm along with a novel scoring method.
To enhance the performance of the Lesk Algorithm, they employed a scoring technique that
evaluates token senses based on their cohesive variations. This strategy aimed to improve
the accuracy and effectiveness of the approach. Based on a combination of different senses
of tokens according to the gloss along with their related hypernyms and hyponyms, a sense
rating is assigned that helps in determining the meaning of the ambiguous word.
A complete framework named "hindiwsd" [56] has been constructed for Hindi WSD in the Python language. It is a pipeline that performs a series of tasks, including the transliteration of
Hinglish code-mixed text, spell correction, POS tagging, and the disambiguation of Hindi
text. A knowledge-based modified Lesk algorithm is applied here for WSD. A comparative
analysis of various knowledge-based approaches is also performed in [57]. The results
demonstrate that accuracy is lower for limited resource languages and higher for languages
with abundant knowledge resources. A knowledge-based resource is critical in the processing of any language. The survey suggests that several factors influence the performance of
WSD tasks. These include the removal of stop words, the positioning of ambiguous words,
the use of Part-of-Speech (POS) tagging, and the size of the dataset utilized for training.
Each of these elements plays a significant role in the overall effectiveness of WSD methods.
This is a review of some knowledge-based approaches that have been used by different
researchers for WSD. Knowledge-based techniques can be effective in resolving word sense
ambiguities, especially when supported by comprehensive and well-structured lexical
resources and linguistic knowledge. However, they may have limitations when dealing
with unseen or domain-specific contexts, as they heavily rely on the information present
in the knowledge bases. In such cases, supervised and unsupervised machine learning
approaches are often employed to complement the knowledge-based methods and improve
overall disambiguation performance.
4.2. Supervised Techniques
Supervised techniques for WSD are highly effective in resolving word sense ambiguities by utilizing labeled training data, achieving high accuracy through diverse and
well-annotated datasets that associate words with correct senses in various contexts. These
methods capture deeper semantic relationships, enabling a nuanced understanding of word
sense distinctions while exhibiting context sensitivity to handle complex sentence structures
and resolve ambiguous words. We present a review of various supervised techniques used
for WSD of Indian languages.
An NB classifier [58], a supervised method equipped with eleven different features, such as collocations, vibhaktis (the grammatical cases or inflections used in Indian languages to indicate the function of nouns or pronouns in a sentence), an unordered list of words, local context, and nouns, has been applied to Hindi WSD. To assess its performance, the NB classifier was applied to a set of 60 polysemous Hindi nouns. Applying morphology to the nouns included in the feature vector achieved a maximum precision of 86.11%, and considering the nouns near a target ambiguous noun proved important for identifying the correct meaning.
In [59], a supervised approach using cosine similarity is introduced. Weighted vectors are generated for the test query and for the knowledge data associated with each sense of the polysemous word. The sense with the maximum similarity to the polysemous word's context is selected as the appropriate sense. The experiment is conducted on a dataset comprising 90 ambiguous Hindi words, and an average precision of 78.99% is obtained.
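A hedged sketch of this idea with scikit-learn, using TF-IDF weighting as a stand-in for the weighting scheme of [59] and invented English sense texts:

```python
# Cosine-similarity WSD: vectorize each sense's knowledge text and the test
# query, then pick the sense whose vector is closest to the query vector.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sense_texts = {
    "sense_metal": "gold ornament metal jewellery shining yellow",
    "sense_spice": "turmeric spice yellow powder cooking",
}
query = "she wore a shining yellow ornament of gold"

vec = TfidfVectorizer()
S = vec.fit_transform(list(sense_texts.values()))
q = vec.transform([query])
sims = cosine_similarity(q, S).ravel()
print(list(sense_texts)[int(sims.argmax())])  # -> sense_metal
```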
The supervised approach of the k-NN algorithm has been used for Gurumukhi
WSD [60]. Two feature sets are derived: one comprises frequently occurring words alongside the ambiguous word, and the other encompasses words neighboring the ambiguous
word in the corpora. Subsequently, the provided data are divided into the training and
the testing sets. The k-NN classifier is trained using the training set. For the given input
sentence, pre-processing is performed, and then its vector is generated. The k-NN classifier
identifies similar vectors or nearest neighbors for the unknown vector. After that, the
distance between the input vector/unknown vector and nearest neighbors is calculated
using the Euclidean method. The closeness between the vectors is determined by using
this distance.
The WSD of Panjabi has been accomplished using a supervised NB [61] classifier. For
feature extraction, both Bag-of-Words (BoW) and a collocation model are employed. The
collocation model utilizes only the two preceding and two succeeding words of the input
word as features, whereas the BoW model considers all the words surrounding the input
word as features. Using both feature extraction methods, the NB classifier is trained on a dataset of 150 ambiguous words with six or more senses collected from the Punjabi WordNet. The system attains an accuracy of 89% for the BoW model and 81% for the collocation model.
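The contrast between the two featurizations can be sketched as follows, assuming scikit-learn; the labeled contexts are invented, and the two-word window on each side imitates the collocation model described above.

```python
# Supervised NB for WSD with two featurizations: full bag-of-words versus a
# +/-2-word collocation window around the ambiguous word.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def window(text, target, k=2):
    toks = text.split()
    i = toks.index(target)
    return " ".join(toks[max(0, i - k): i] + toks[i + 1: i + 1 + k])

data = [
    ("deposit your money in the bank today", "finance"),
    ("the bank charged a high interest rate", "finance"),
    ("we sat on the bank of the river", "shore"),
    ("the river bank was muddy and steep", "shore"),
]

for featurize in (lambda t: t, lambda t: window(t, "bank")):  # BoW, then collocation
    vec = CountVectorizer()
    X = vec.fit_transform(featurize(t) for t, _ in data)
    clf = MultinomialNB().fit(X, [s for _, s in data])
    test = featurize("the bank sanctioned my loan")
    print(clf.predict(vec.transform([test]))[0])
```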
In [62], a comparative analysis is conducted among rule-based, classical machine
learning, and two neural network-based RNN and LSTM models. The evaluation is carried
out on four highly ambiguous terms and a group of seven other ambiguous words. The
rule-based method achieved an accuracy of 31.2%, classical machine learning attained
33.67% accuracy, while RNN exhibited an accuracy of 41.61%. Notably, the LSTM model
outperformed all other methods with an accuracy of 43.21%, showcasing its superior
performance in disambiguating word senses.
A review of some supervised techniques is presented. Supervised techniques excel
in providing fine-grained disambiguation, which is essential for precise semantic interpretation. However, their dependency on labeled data poses challenges, especially for
resource-limited languages. Supervised techniques may struggle with unseen words or
senses, and overfitting remains a concern, potentially affecting performance on new data.
To address limitations, researchers often combine supervised methods with unsupervised
or knowledge-based approaches to enhance overall WSD performance.
4.3. Unsupervised Techniques
Unsupervised techniques for WSD present advantages in their independence from
labeled training data, making them more cost-effective and adaptable to different languages
and domains. By learning solely from distributional patterns, they have the potential
to discover new word senses and uncover novel semantic relationships. A review of
unsupervised techniques used for WSD of Indian languages is as follows:
An unsupervised approach is used for resolving word ambiguity in [63]. As part of the pre-processing steps, stop-word elimination and stemming are performed on the ambiguous context. Some manual intervention is needed to provide seed examples; a decision list is then built from the untagged examples, and this decision list is subsequently utilized to determine the sense of the ambiguous words.
A technique to perform unsupervised WSD on a Hindi sentence using network agglomeration is proposed in [64]. A graph G is first created for the input sentence; all variations in meaning for the sentence can be seen collectively in this graph. From the sentence graph, an interpretation graph is derived for each possible interpretation of the sentence. To find the preferred interpretation, network agglomeration is performed on all interpretation graphs, and the interpretation with the highest network agglomeration value is selected.
In [65], the author deals with algorithms based on an unsupervised graph-based
approach. This consists of two phases: (1) A lexical knowledge base is utilized to construct
a graph, where each node and edge in the graph represents a possible meaning of a word
within a given sentence. These nodes and edges capture dependencies between meanings,
such as synonyms and antonyms. (2) Subsequently, the graph is analyzed to identify the
most suitable node, representing the most significant meaning, for each word according
to the given context. In the graph-based WSD method of unsupervised techniques, word
meanings are determined by considering the dependencies between these meanings.
Relations in HWN are crisp, meaning they are either related or not related at all. There
is no partial relation between words in the Hindi wordnet. However, in real life, partial
relations can also exist between words, which are also called fuzzification of relations.
Therefore, an expanded version of Hindi wordnet that incorporates fuzzy relations is called
Fuzzy Hindi WordNet (FHWN), which is represented as a fuzzy graph in which nodes
depict words/synsets and the edges show fuzzy relationships within words/synsets. The
fuzzy relations are assigned a membership value in the interval [0, 1]; the values are assigned in consultation with experts from diverse domains. In [66], an approach using fuzzy graph
connectivity measures is applied to FHWN for WSD. Various local and global connectivity
measures are calculated using the values assigned to the relations. The sense with the
maximum rank is chosen as the suitable sense for the ambiguous word. The utilization
of the FHWN sense inventory results in an improvement in disambiguation performance,
with an average increase of approximately 8% in most cases. Since the membership value
can change, so can the algorithm’s performance.
In [67], a multilingual knowledge base called ConceptNet is used to automatically
generate the membership values. The nodes and edges that make up ConceptNet’s network
represent words, word senses, and brief phrases, while the edges show how the nodes are
related to one another. The Shapley value, which is derived from co-operative game theory,
is then employed as a centrality metric. Shapley’s value is utilized to mitigate the influence
of alterations in membership values within fuzzy relationships by considering only the
marginal contributions of all the values in the calculation of centrality.
For Gujarati WSD [68], a genetic algorithm-based strategy was employed. Genetic algorithms are based on Darwin's idea of evolution. The population is the initial set of candidate solutions the algorithm considers (represented by chromosomes). The solutions of one population are used to create a new population, in the expectation that the new population will perform better than the previous one. The solutions chosen to create new offspring are selected based on their fitness. This process is repeated until a stopping criterion (such as a maximum number of generations or no further improvement in the best solution) is reached.
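A compact sketch of such a GA for WSD, with invented glosses and parameters; a chromosome assigns one sense index to each ambiguous word in the sentence, and fitness counts gloss-context overlap.

```python
# Genetic-algorithm WSD sketch: selection keeps the fittest chromosomes,
# one-point crossover mixes parents, and mutation flips a sense index.
import random

random.seed(0)
context = set("he fished from the bank of the river near the plant".split())
words = {"bank": ["shore land river", "money institution loan"],
         "plant": ["tree leaf river green", "factory industrial machine"]}
keys = list(words)

def fitness(chrom):
    return sum(len(context & set(words[w][g].split())) for w, g in zip(keys, chrom))

pop = [[random.randrange(len(words[w])) for w in keys] for _ in range(8)]
for _ in range(20):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:4]                              # selection
    children = []
    for _ in range(4):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(keys))
        child = a[:cut] + b[cut:]                  # one-point crossover
        if random.random() < 0.2:                  # mutation
            i = random.randrange(len(keys))
            child[i] = random.randrange(len(words[keys[i]]))
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
print({w: g for w, g in zip(keys, best)})  # sense index chosen for each word
```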
Kumari and Lobiyal [69] introduced a word-embedding-based approach for WSD.
They employed two word2vec architectures, namely the skip-gram and the continuous
bag-of-words models, to generate word embeddings. The determination of the appropriate
sense of a word was achieved using cosine similarity. An unsupervised approach based on Latent Dirichlet Allocation (LDA) and semantic features has been applied for target WSD of the Malayalam language [70]. A dataset consisting of 1147 contexts containing target polysemous words was utilized, and an accuracy of 80% was achieved.
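A hedged sketch of the embedding route of [69], assuming gensim; the miniature corpus only makes the example self-contained, and averaging context vectors plus cosine similarity against averaged gloss vectors stands in for the paper's exact scoring.

```python
# Word-embedding WSD: average the context word vectors and pick the sense
# whose averaged gloss vector is closest by cosine similarity.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "money loan interest deposit bank".split(),
    "river water shore mud bank".split(),
    "bank approved the loan money".split(),
    "water flowed past the river bank".split(),
]
model = Word2Vec(corpus, vector_size=50, min_count=1, sg=1, seed=0)  # skip-gram

def avg_vec(tokens):
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

context = avg_vec("the bank sanctioned my loan".split())
glosses = {"finance": ["money", "loan", "deposit"], "shore": ["river", "water", "mud"]}
print(max(glosses, key=lambda s: cos(context, avg_vec(glosses[s]))))
```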
Various word embedding methods, such as BoW, Word2Vec, TF-IDF, and FastText, have been used in [71]. For the construction of Hindi word embeddings, Wikipedia articles were used as the data source. Multiple trials showed that Word2Vec outperforms all the other embeddings on the Hindi dataset examined. The method uses word embedding techniques when training on the input and incorporates clustering to create a sense inventory that aids in disambiguation. Being unsupervised, these methods can use unlabeled data. The accuracy achieved is 71%.
In [72], the authors employed an approach based on a genetic algorithm (GA) for
Hindi WSD. The process involved pre-processing and creation of a context bag and sense
bag, followed by the application of the GA. The GA encompassed selection, crossover,
and mutation to disambiguate the word, and the approach was tested on a manually
created dataset. The experimental results demonstrated an accuracy of 80%. A comparative
analysis of two path-based similarity measures is performed in [73]. The experimental
investigation is performed using the shortest path and Leacock–Chodorow methods, which
shows that a Leacock–Chodorow similarity measure performs better than the shortest
path measure. Experimentation is performed on five polysemous nouns, and an average
accuracy of 72.09% is achieved with the Leacock–Chodorow method.
Unsupervised techniques are cost-effective, and they use unlabeled data. Thus, they
can be used for languages that lack sense-tagged datasets. However, they may struggle
with sense overlapping and lack deep semantic interpretation, leading to less precise disambiguation compared to supervised methods. Data sparsity can also limit their effectiveness,
requiring substantial data for satisfactory performance. Evaluating their performance
can be challenging without a definitive gold standard for comparison. Combining unsupervised techniques with supervised or knowledge-based approaches can address their
limitations and enhance overall WSD performance.
Table 2 summarizes the study characteristics of the different Indian-language WSD approaches.
Table 2. Analysis of WSD Approaches in Different Indian Languages. Each entry lists the Year (Ref.), Language, Technique, and Method, followed by the Specification, Dataset Used, Accuracy, and Comments columns.

1986 [18]. English. Knowledge-based. Method: Lesk.
Specification: Overlapping of context and word definition is performed.
Dataset Used: Machine-readable dictionaries.
Accuracy: -.
Comments: Only definitions are used for deriving the meaning.

2002 [17]. English. Knowledge-based. Method: Adapted Lesk.
Specification: The approach expands the comparisons by incorporating the glosses of words that are linked, via the WordNet lexical database, to the words under disambiguation in the given text.
Dataset Used: WordNet.
Accuracy: 32%.
Comments: -.

2004 [45]. Hindi. Knowledge-based. Method: Lesk method.
Specification: Comparison of the ambiguous word's context and the context derived from Hindi WordNet is performed.
Dataset Used: A manually created test set.
Accuracy: 40–70%.
Comments: Works only with nouns and does not deal with morphology.

2009 [63]. Hindi. Unsupervised. Method: Decision list.
Specification: After pre-processing, a decision list of untagged examples is created and utilized to determine the meaning of the polysemous word.
Dataset Used: A dataset of 20 ambiguous words with 1856 training instances and 1641 test instances.
Accuracy: Approximately 82% to 92% when employing stop-word elimination, automatic generation of decision lists, and stemming.
Comments: -.

2012 [46]. Hindi. Knowledge-based. Method: Lesk algorithm.
Specification: The effects of context window size, stop-word elimination, and stemming are analyzed with Lesk.
Dataset Used: A test set of 10 polysemous words with 1248 test instances.
Accuracy: Improvement of 9.24% over the baseline.
Comments: Works only for nouns.

2012 [47]. Hindi. Knowledge-based. Method: Graph-based.
Specification: A graph is constructed using the DFS algorithm, and centrality measures are then applied to deduce the sense of the word.
Dataset Used: Text files containing 913 nouns.
Accuracy: 65.17%.
Comments: For graph centrality, only the in-degree algorithm is used.

2013 [48]. Hindi. Knowledge-based. Method: Leacock–Chodorow measure of semantic relatedness.
Specification: The Leacock–Chodorow algorithm is used to find the length of the route between two noun concepts.
Dataset Used: A dataset of 20 polysemous Hindi nouns.
Accuracy: 60.65%.
Comments: Works only for nouns.

2014 [49]. Hindi. Knowledge-based. Method: Semantic relations.
Specification: The significance of different relationships such as hypernym, hyponym, holonym, and meronym is examined.
Dataset Used: A dataset of 60 nouns.
Accuracy: Improvement of 9.86% over the baseline.
Comments: Only for nouns.

2014 [58]. Hindi. Supervised. Method: Naive Bayes.
Specification: A Naive Bayes classifier with eleven different features is applied to Hindi WSD.
Dataset Used: A dataset of 60 polysemous Hindi nouns.
Accuracy: 86.11%.
Comments: Works only for nouns.

2014 [50]. Hindi. Knowledge-based. Method: Modified Lesk.
Specification: A modified Lesk approach with a dynamic context window is used.
Dataset Used: A dataset of 10 ambiguous words.
Accuracy: -.
Comments: Accuracy depends on the size of the dynamic context window.

2015 [64]. Hindi. Unsupervised. Method: Network agglomeration.
Specification: An interpretation graph is created for each interpretation derived from the graph of the sentence, and network agglomeration is then performed to determine the correct interpretation.
Dataset Used: Health and Tourism datasets.
Accuracy: Health: 43% (all words) and 50% (nouns); Tourism: 44% (all words) and 53% (nouns).
Comments: Works for nouns as well as other parts of speech.

2015 [65]. Hindi. Unsupervised. Method: Graph connectivity.
Specification: A graph is generated to represent all the senses of a polysemous word and is then analyzed to determine the accurate sense of the word.
Dataset Used: No standard dataset; WordNet is used as a reference library.
Accuracy: -.
Comments: -.

2015 [66]. Hindi. Unsupervised. Method: Fuzzy graph connectivity measures.
Specification: Different global and local fuzzy graph connectivity measures are computed to find the meaning of a polysemous word.
Dataset Used: Health corpus.
Accuracy: Performance increases by 8% when Fuzzy Hindi WordNet is used.
Comments: -.

2016 [51]. Hindi. Knowledge-based. Method: Tri-gram and bi-gram Lesk.
Specification: Lesk's approach is applied to tri-gram and bi-gram verb words.
Dataset Used: 15 verbs with 103 test instances.
Accuracy: 52.98% with bi-grams and 33.17% with tri-grams.
Comments: Only works for verbs.

2016 [59]. Hindi. Supervised. Method: Cosine similarity.
Specification: The cosine similarity of vectors, created from the input query and senses from WordNet, is calculated to determine the meaning of the word.
Dataset Used: A dataset of 90 ambiguous Hindi words.
Accuracy: 78.99%.
Comments: Does not perform disambiguation for word categories other than nouns, such as adjectives, adverbs, etc.

2017 [68]. Gujarati. Unsupervised. Method: Genetic algorithm.
Specification: A genetic algorithm is used.
Dataset Used: -.
Accuracy: -.
Comments: -.

2018 [60]. Gurumukhi. Supervised. Method: k-NN.
Specification: A k-NN classifier is used to find the similarity between vectors of input words and their meanings in WordNet.
Dataset Used: Corpora of 100 sense-tagged words.
Accuracy: Varies per word; the highest is 76.4% and the lowest 53.6%.
Comments: The size of the dataset is too small.

2018 [61]. Punjabi. Supervised. Method: Naive Bayes.
Specification: A Naive Bayes classifier with BoW and collocation models as feature extraction techniques is used.
Dataset Used: A corpus of 150 ambiguous words having six or more senses, taken from the Punjabi WordNet.
Accuracy: 89% with BoW and 81% with the collocation model.
Comments: One-word disambiguation per context.

2019 [69]. Hindi. Unsupervised. Method: Word embedding.
Specification: Two word-embedding techniques, skip-gram and CBoW, are used with cosine similarity to deduce the correct sense of the word.
Dataset Used: -.
Accuracy: 52%.
Comments: Semantic relations such as hypernyms, hyponyms, etc., are not used for the creation of sense vectors.

2019 [52]. Hindi. Knowledge-based. Method: Fuzzified semantic relations.
Specification: Fuzzified semantic relations along with FHWN are used for WSD.
Dataset Used: -.
Accuracy: 58–63%.
Comments: There is uncertainty associated with fuzzy values; the values assigned to fuzzy memberships are based on the intuition of annotators.

2019 [53]. Hindi. Knowledge-based. Method: Lesk.
Specification: The Lesk algorithm is used to disambiguate the words.
Dataset Used: A corpus of 3000 ambiguous sentences.
Accuracy: 71.43%.
Comments: A POS tagger is not used.

2019 [54]. Bengali. Knowledge-based. Method: Sense induction.
Specification: The semantic similarity measure is calculated for various sense clusters of ambiguous words.
Dataset Used: A test set of 10 Bengali words.
Accuracy: 63.71%.
Comments: Classification of senses is not performed.

2021 [55]. Hindi. Knowledge-based. Method: Score-based modified Lesk.
Specification: A scoring technique is utilized to advance the performance of the Lesk algorithm.
Dataset Used: -.
Accuracy: -.
Comments: Because only a part of the data is segregated from WordNet, the database needs to be queried repeatedly.

2021 [70]. Malayalam. Unsupervised. Method: Semantic features and Latent Dirichlet Allocation.
Specification: An unsupervised LDA-based approach using semantic features is applied for target WSD of the Malayalam language.
Dataset Used: A dataset of 1147 contexts of polysemous words.
Accuracy: 80%.
Comments: LDA does not take into account the positional parameters within the context.

2021 [71]. Hindi. Unsupervised. Method: Word embeddings.
Specification: Various word-embedding techniques are used for WSD; experiments show that Word2Vec performs best.
Dataset Used: Hindi word embeddings generated from Wikipedia articles.
Accuracy: 54%.
Comments: Further enhancements can be achieved by incorporating additional similarity metrics and sentence- or phrase-level word embeddings.

2022 [67]. Hindi. Unsupervised. Method: Cooperative game theory.
Specification: Cooperative game theory along with ConceptNet is used; it mitigates the influence of variations in the membership values of fuzzy relations.
Dataset Used: Health and Tourism datasets and a manually created dataset from Hindi newspaper articles.
Accuracy: 66%.
Comments: -.

2022 [56]. Hindi. Knowledge-based. Method: Modified Lesk.
Specification: A complete framework named "HindiWSD" is developed that uses the knowledge-based modified Lesk algorithm.
Dataset Used: A dataset of 20 ambiguous words along with Hindi WordNet.
Accuracy: 71%.
Comments: The dataset size is small.

2022 [72]. Hindi. Unsupervised. Method: Genetic algorithm.
Specification: After pre-processing and creating the context bag and sense bag, a GA is employed; selection, crossover, and mutation are applied for the disambiguation of the word.
Dataset Used: A manually created dataset.
Accuracy: 80%.
Comments: Only works with nouns.
In the field of WSD for Hindi, the availability of high-quality data has been a challenge
due to the resource-scarce nature of the language. However, there have been efforts to
create and utilize datasets and benchmarks for Hindi WSD. Table 3 provides an overview of some common data sources and benchmarks that have been used or recognized in this field:
Table 3. Data Sources available for Hindi WSD.

Hindi WordNet: A lexical database providing synsets and semantic relations for word senses in Hindi.
SemEval Hindi WSD Task: Part of the SemEval workshops, offering annotated datasets, evaluation metrics, and tasks for WSD in multiple languages.
Sense-Annotated Corpora: Manually annotated text segments where words are tagged with their corresponding senses from Hindi WordNet.
Cross-Lingual Resources: Leveraging resources from related languages with more data for WSD and transferring knowledge across languages.
Parallel Corpora: Using texts available in multiple languages to align senses and perform cross-lingual WSD.
Indigenous Corpora: Domain-specific or genre-specific corpora in Hindi, focusing on specific areas such as medicine, technology, or literature.
Supervised Approaches: Using a small annotated dataset for training models, often involving manually sense-tagged instances.
Unsupervised Approaches: Employing techniques such as clustering or distributional similarity without relying heavily on labeled data.
Contextual Embeddings: Utilizing pretrained models such as BERT to capture rich semantic information from large text corpora.
Because of these resource limitations, the domain of Hindi WSD does not possess the abundance of universally accepted benchmarks observed in better-resourced languages. As a result, researchers frequently adapt techniques and methodologies from other languages, and they occasionally combine existing resources with data augmentation strategies to improve model efficacy. Formulating larger and more varied sense-annotated datasets and benchmarks remains a persistent challenge in this field.
4.4. Research Gaps and Future Scope
Hindi is a rich language in terms of users and of the information available in it, yet not much WSD work has been performed on it, and several research gaps remain. The majority of the work involves only nouns. Word lemmatization, which could improve accuracy further, is generally not carried out, and handling idiomatic expressions remains difficult. There is no standard sense-annotated dataset available for supervised approaches. Using better methods or a hybrid model also has the potential to improve accuracy. Significant effort has been dedicated to research and development for English, but Hindi, the fourth language in the world in terms of native speakers, is still in its infancy with regard to WSD. A significant amount of work remains to be performed for Hindi: there is ample scope for improving accuracy, and other challenges, such as morphology, still need to be addressed.
5. Conclusions
This article summarizes several techniques utilized for the disambiguation of word senses in Hindi literary sources. Hindi WSD methods have been classified into supervised, semi-supervised, unsupervised, and knowledge-based categories, and several techniques of each type are reviewed. Every approach has its own set of
rules for working and helps in solving a particular type of problem. In order to achieve
superior outcomes with supervised methods, it is necessary to create an annotated dataset.
Creating an annotated dataset can be both difficult and costly. However, the use of unannotated datasets with unsupervised approaches generally produces less favorable results
than those produced using supervised techniques. Tackling resource-scarce languages
effectively requires a knowledge-intensive approach. A comparative analysis of various
approaches has been conducted, providing insights into the work undertaken by different
researchers in the field. In conclusion, each category of WSD techniques offers distinct
advantages and faces specific challenges. Supervised techniques excel in accuracy and
fine-grained disambiguation but require labeled data and may struggle with generalization.
Unsupervised techniques are flexible, scalable, and adapt well to languages with limited
resources, yet they may encounter sense overlapping and lack semantic interpretation.
Knowledge-based techniques leverage external resources effectively but heavily rely on the
quality of knowledge bases. The choice of technique depends on task requirements, data
availability, and language characteristics. Hybrid models, combining different techniques,
can effectively address limitations and improve overall WSD performance, providing a
tailored approach for specific applications and language contexts.
Author Contributions: Conceptualization, V.G. and N.M.; methodology, V.G.; software, R.K.; validation, V.G., S.P. and G.B.; formal analysis, N.M.; investigation, V.G.; resources, N.C.; data curation, V.G.;
writing—original draft preparation, V.G. and S.P.; writing—review and editing, G.B.; visualization,
R.K.; supervision, G.B.; project administration, N.C.; funding acquisition, N.C. All authors have read
and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zipf, G.K. Human Behavior and the Principle of Least Effort; Addison-Wesley Press: Oxford, UK, 1949.
2. Wilks, Y.; Fass, D. The preference semantics family. Comput. Math. Appl. 1992, 23, 205–221. [CrossRef]
3. Navigli, R. Word sense disambiguation: A survey. ACM Comput. Surv. 2009, 41, 1459355. [CrossRef]
4. Vickrey, D.; Biewald, L.; Teyssier, M.; Koller, D. Word-sense disambiguation for machine translation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver, BC, Canada, 6 October 2005; pp. 771–778. [CrossRef]
5. Carpuat, M.; Wu, D. Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 61–72.
6. Pu, X.; Pappas, N.; Henderson, J.; Popescu-Belis, A. Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2018, 6, 635–649. [CrossRef]
7. Plaza, L.; Jimeno-Yepes, A.J.; Díaz, A.; Aronson, A.R. Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts. BMC Bioinform. 2011, 12, 355. [CrossRef] [PubMed]
8. Madhuri, J.N.; Ganesh Kumar, R. Extractive Text Summarization Using Sentence Ranking. In Proceedings of the 2019 International Conference on Data Science and Communication (IconDSC), Bangalore, India, 1–2 March 2019; pp. 19–21. [CrossRef]
9. Carpineto, C.; Romano, G. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 2012, 44, 2071390. [CrossRef]
10. Sharma, N.; Niranjan, P.S. Applications of Word Sense Disambiguation: A Historical Perspective. IJERT 2015, 3, 1–4.
11. Sumanth, C.; Inkpen, D. How much does word sense disambiguation help in sentiment analysis of micropost data? In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Lisboa, Portugal, 14 July 2015; pp. 115–121. [CrossRef]
12. Xu, G.; Yu, Z.; Yao, H.; Li, F.; Meng, Y.; Wu, X. Chinese Text Sentiment Analysis Based on Extended Sentiment Dictionary. IEEE Access 2019, 7, 43749–43762. [CrossRef]
13. Chifu, A.G.; Ionescu, R.T. Word sense disambiguation to improve precision for ambiguous queries. Open Comput. Sci. 2012, 2, 398–411. [CrossRef]
14. Asim, M.N.; Wasim, M.; Khan, M.U.G.; Mahmood, N.; Mahmood, W. The Use of Ontology in Retrieval: A Study on Textual, Multilingual, and Multimedia Retrieval. IEEE Access 2019, 7, 21662–21686. [CrossRef]
15. Advaith, V.; Shivkumar, A.; Sowmya Lakshmi, B.S. Parts of Speech Tagging for Kannada and Hindi Languages using ML and DL models. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022. [CrossRef]
16. Gadde, S.P.K.; Yeleti, M.V. Improving statistical POS tagging using linguistic features for Hindi and Telugu. In Proceedings of the ICON-2008: International Conference on Natural Language Processing, Pune, India, 20–22 December 2008.
17. Banerjee, S.; Pedersen, T. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico, 17–23 February 2002; Gelbukh, A., Ed.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 136–145.
18. Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, Toronto, ON, Canada, 1 June 1986; pp. 24–26. [CrossRef]
19. Mittal, K.; Jain, A. Word Sense Disambiguation Method Using Semantic Similarity Measures and Owa Operator. ICTACT J. Soft Comput. 2015, 5, 896–904. [CrossRef]
20. McCarthy, D.; Carroll, J. Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences. Comput. Linguist. 2003, 29, 639–654. [CrossRef]
21. Ye, P.; Baldwin, T. Verb Sense Disambiguation Using Selectional Preferences Extracted with a State-of-the-art Semantic Role Labeler. In Proceedings of the Australasian Language Technology Workshop 2006, Sydney, Australia, 4–6 December 2006; pp. 139–148.
22. Sarika; Sharma, D.K. A comparative analysis of Hindi word sense disambiguation and its approaches. In Proceedings of the International Conference on Computing, Communication & Automation, Pune, India, 26–27 February 2015; pp. 314–321.
23. Walker, J.Q., II. A node-positioning algorithm for general trees. Softw. Pract. Exp. 1990, 20, 685–705. [CrossRef]
24. Parameswarappa, S.; Narayana, V.N. Decision List Preliminaries of the Kannada Language and the Basic. 2013, Volume 2. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=7620a95796c2eae4a94498fa779b00e2b25c957a (accessed on 21 May 2023).
25. Yarowsky, D. Hierarchical decision lists for word sense disambiguation. Lang. Resour. Eval. 2000, 34, 179–186.
26. Singh, R.L.; Ghosh, K.; Nongmeikapam, K.; Bandyopadhyay, S. A Decision Tree Based Word Sense Disambiguation System in Manipuri Language. Adv. Comput. Int. J. 2014, 5, 17–22. [CrossRef]
27. Rawat, S.; Kalambe, K.; Kawade, G.; Korde, N. Supervised word sense disambiguation using decision tree. Int. J. Recent Technol. Eng. 2019, 8, 4043–4047. [CrossRef]
28. Thwet, N.; Soe, K.M.; Thein, N.L. System Using Naïve Bayesian Algorithm for Myanmar Language. Int. J. Sci. Eng. Res. 2011, 2, 1–7.
29. Le, C.A.; Shimazu, A. High WSD accuracy using Naive Bayesian classifier with rich features. In Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation (PACLIC 2004), Tokyo, Japan, 8–10 December 2004; pp. 105–113.
30. Popov, A. Neural network models for word sense disambiguation: An overview. Cybern. Inf. Technol. 2018, 18, 139–151. [CrossRef]
31. Kumar, S.; Kumar, R. Word Sense Disambiguation in the Hindi Language: Neural Network Approach. Int. J. Tech. Res. Sci. 2021, 1, 72–76. [CrossRef]
32. Kumar, M.; Sankaravelayuthan, R.; Kp, S. Tamil word sense disambiguation using support vector machines with rich features. Int. J. Appl. Eng. Res. 2014, 9, 7609–7620.
33. Decadt, B.; Hoste, V.; Daelemans, W.; van den Bosch, A. GAMBL, genetic algorithm optimization of memory-based WSD. In Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, 25–26 July 2004; pp. 108–112.
34. Fix, E.; Hodges, J.L. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. Int. Stat. Rev. 1989, 57, 238. [CrossRef]
35. Hamming, R.W. Error detecting and error correcting codes. Bell Syst. Tech. J. 1950, 29, 147–160. [CrossRef]
36. Revesz, P.Z. A Generalization of the Chomsky-Halle Phonetic Representation using Real Numbers for Robust Speech Recognition in Noisy Environments. In Proceedings of the 27th International Database Engineered Applications Symposium, Heraklion, Greece, 5–7 May 2023; pp. 156–160. [CrossRef]
37. Brody, S.; Navigli, R.; Lapata, M. Ensemble methods for unsupervised WSD. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 17–18 July 2006; Volume 1, pp. 97–104. [CrossRef]
38. Freund, Y.; Schapire, R.E. A Short Introduction to Boosting. 1999. Available online: https://api.semanticscholar.org/CorpusID:9621074 (accessed on 20 December 2022).
39. Martín-Wanton, T.; Berlanga-Llavori, R. A clustering-based approach for unsupervised word sense disambiguation. Proces. Leng. Nat. 2012, 49, 49–56.
40. Lin, D. Automatic retrieval and clustering of similar words. Proc. Annu. Meet. Assoc. Comput. Linguist. 1998, 2, 768–774. [CrossRef]
41. Pantel, P.A. Clustering by Committee. 2003, pp. 1–137. Available online: https://www.patrickpantel.com/download/papers/2003/cbc.pdf (accessed on 25 January 2023).
42. Silberer, C.; Ponzetto, S.P. UHD: Cross-lingual word sense disambiguation using multilingual co-occurrence graphs. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, 12 July 2010; pp. 134–137.
43. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41. [CrossRef]
44. Bhattacharyya, P. IndoWordNet. In The WordNet in Indian Languages; Springer: Berlin/Heidelberg, Germany, 2010; pp. 3785–3792. [CrossRef]
45. Sinha, M.; Reddy, M.K.; Bhattacharyya, P.; Pandey, P.; Kashyap, L. Hindi Word Sense Disambiguation. 2004. Available online: https://api.semanticscholar.org/CorpusID:9438332 (accessed on 25 January 2023).
46. Singh, S.; Siddiqui, T.J. Evaluating effect of context window size, stemming and stop word removal on Hindi word sense disambiguation. In Proceedings of the 2012 International Conference on Information Retrieval & Knowledge Management, Kuala Lumpur, Malaysia, 13–15 March 2012; pp. 1–5. [CrossRef]
47. Kumar Vishwakarma, S.; Vishwakarma, C.K. A Graph Based Approach to Word Sense Disambiguation for Hindi Language. Int. J. Sci. Res. Eng. Technol. 2012, 1, 313–318. Available online: www.ijsret.org (accessed on 25 January 2023).
48. Singh, S.; Singh, V.K.; Siddiqui, T.J. Hindi Word Sense Disambiguation Using Semantic Relatedness Measure. In Multi-Disciplinary Trends in Artificial Intelligence; Ramanna, S., Lingras, P., Sombattheera, C., Krishna, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 247–256.
49. Singh, S.; Siddiqui, T.J. Role of semantic relations in Hindi Word Sense Disambiguation. Procedia Comput. Sci. 2015, 46, 240–248. [CrossRef]
50. Sawhney, R.; Kaur, A. A modified technique for Word Sense Disambiguation using Lesk algorithm in Hindi language. In Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Delhi, India, 24–27 September 2014; pp. 2745–2749. [CrossRef]
51. Gautam, C.B.S.; Sharma, D.K. Hindi word sense disambiguation using lesk approach on bigram and trigram words. In Proceedings of the International Conference on Advances in Information Communication Technology & Computing, Bikaner, India, 12–13 August 2016; pp. 1–5. [CrossRef]
52. Jain, G.; Lobiyal, D.K. Word sense disambiguation of Hindi text using fuzzified semantic relations and fuzzy Hindi WordNet. In Proceedings of the 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 10–11 January 2019; pp. 494–497. [CrossRef]
53. Sharma, P.; Joshi, N. Knowledge-Based Method for Word Sense Disambiguation by Using Hindi WordNet. Eng. Technol. Appl. Sci. Res. 2019, 9, 3985–3989. [CrossRef]
54. Sau, A.; Amin, T.A.; Barman, N.; Pal, A.R. Word sense disambiguation in Bengali using sense induction. In Proceedings of the 2019 International Conference on Applied Machine Learning (ICAML), Bhubaneswar, India, 25–26 May 2019; pp. 170–174. [CrossRef]
55. Tripathi, P.; Mukherjee, P.; Hendre, M.; Godse, M.; Chakraborty, B. Word Sense Disambiguation in Hindi Language Using Score Based Modified Lesk Algorithm. Int. J. Comput. Digit. Syst. 2021, 10, 939–954. [CrossRef]
56. Yusuf, M.; Surana, P.; Sharma, C. HindiWSD: A Package for Word Sense Disambiguation in Hinglish & Hindi. In Proceedings of the 6th Workshop on Indian Language Data: Resources and Evaluation (WILDRE-6), Marseille, France, 20–25 June 2022; pp. 18–23.
57. Purohit, A.; Yogi, K.K. A Comparative Study of Existing Knowledge Based Techniques for Word Sense Disambiguation. In Proceedings of the International Joint Conference on Advances in Computational Intelligence, Online, 23–24 October 2021; Uddin, M.S., Jamwal, P.K., Bansal, J.C., Eds.; Springer Nature: Singapore, 2022; pp. 167–182.
58. Singh, S.; Siddiqui, T.J.; Sharma, S.K. Naïve Bayes classifier for Hindi word sense disambiguation. In Proceedings of the 7th ACM India Computing Conference, Nagpur, India, 9 October 2014. [CrossRef]
59. Sarika; Sharma, D.K. Hindi word sense disambiguation using cosine similarity. In Proceedings of the Advances in Intelligent Systems and Computing, Athens, Greece, 29–31 August 2016.
60. Walia, H.; Rana, A.; Kansal, V. A Supervised Approach on Gurmukhi Word Sense Disambiguation Using K-NN Method. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 11–12 January 2018; pp. 743–746. [CrossRef]
61. Pal Singh, V.; Kumar, P. Naive Bayes classifier for word sense disambiguation of Punjabi language. Malays. J. Comput. Sci. 2018, 31, 188–199. [CrossRef]
62. Mishra, B.K.; Jain, S. Word Sense Disambiguation for Hindi Language Using Neural Network. In Advancements in Smart Computing and Information Security; Rajagopal, S., Faruki, P., Popat, K., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 14–25.
63. Mishra, N.; Yadav, S.; Siddiqui, T.J. An Unsupervised Approach to Hindi Word Sense Disambiguation. In Proceedings of the First International Conference on Intelligent Human Computer Interaction, Rome, Italy, 20–23 January 2009.
64. Jain, A.; Lobiyal, D.K. Unsupervised Hindi word sense disambiguation based on network agglomeration. In Proceedings of the 2015 International Conference on Computing for Sustainable Global Development (INDIACom 2015), New Delhi, India, 11–13 March 2015.
65. Nandanwar, L. Graph connectivity for unsupervised Word Sense Disambiguation for Hindi language. In Proceedings of the 2015 IEEE International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS 2015), Coimbatore, India, 19–20 March 2015.
181
Information 2023, 14, 495
66.
67.
68.
69.
70.
71.
72.
73.
Jain, A.; Lobiyal, D.K. Fuzzy Hindi wordnet and word sense disambiguation using fuzzy graph connectivity measures. ACM
Trans. Asian Low-Resource Lang. Inf. Process. 2015, 15, 2790079. [CrossRef]
Jain, G.; Lobiyal, D.K. Word Sense Disambiguation Using Cooperative Game Theory and Fuzzy Hindi WordNet Based on
ConceptNet. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 3502739. [CrossRef]
Vaishnav, Z.B. Gujarati Word Sense Disambiguation Using Genetic Algorithm. 2017. Available online: https://api.semanticscholar.
org/CorpusID:212514785 (accessed on 25 January 2023).
Kumari, A.; Lobiyal, D.K. Word2vec’s Distributed Word Representation for Hindi Word Sense Disambiguation. In Proceedings of
the 16th International Conference, ICDCIT 2020, Bhubaneswar, India, 9–12 January 2020. [CrossRef]
Sruthi, S.; Kannan, B.; Paul, B. Improved Word Sense Determination in Malayalam using Latent Dirichlet Allocation and Semantic
Features. ACM Trans. Asian Low-Resource Lang. Inf. Process. 2022, 21, 3476978. [CrossRef]
Kumari, A.; Lobiyal, D.K. Efficient estimation of Hindi WSD with distributed word representation in vector space. J. King Saud
Univ.-Comput. Inf. Sci. 2022, 34, 6092–6103. [CrossRef]
Bhatia, S.; Kumar, A.; Khan, M. Role of Genetic Algorithm in Optimization of Hindi Word Sense Disambiguation. IEEE Access
2022, 10, 3190406. [CrossRef]
Jha, P.; Agarwal, S.; Abbas, A.; Siddiqui, T. Comparative Analysis of Path-based Similarity Measures for Word Sense Disambiguation. In Proceedings of the 2023 3rd International Conference on Artificial Intelligence and Signal Processing (AISP), Vijayawada,
India, 18–20 March 2023; pp. 1–5. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
182
Article
Agile Logical Semantics for Natural Languages
Vincenzo Manca
Dipartimento di Informatica, University of Verona, 37134 Verona, Italy;
[email protected]
Abstract: This paper presents an agile method of logical semantics based on high-order Predicate
Logic. An operator of predicate abstraction is introduced that provides a simple mechanism for logical
aggregation of predicates and for logical typing. Monadic high-order logic is the natural environment
in which predicate abstraction expresses the semantics of typical linguistic structures. Many examples
of logical representations of natural language sentences are provided. Future extensions and possible
applications in the interaction with chatbots are briefly discussed as well.
Keywords: logical semantics; predicate logic; natural language processing; large language models
1. Introduction
Epochal changes and new possibilities in the interaction between humans and artificial
systems capable of processing information have been brought about by the recent advances
in Natural Language Processing (NLP), which are based on Machine Learning and Artificial
Neural Networks [1–4]. Large Language Models (LLMs) [5,6], in particular, represent the
start of a new line of development that will have enormous implications for the entire field
of artificial intelligence and numerous applications involving our societies globally. LLMs
are the foundation of recent systems that are widely available to the public.
The kind of “understanding” that these systems are capable of achieving in conversation with humans is among their most contentious features. There is a wide range of
opinions in the current debate between the extremes that they (i) converse without really
understanding the other person and (ii) converse while gaining knowledge that could
eventually approach that of humans and animals. In any event, these systems do display
intelligent characteristics, making consideration of broad approaches to natural language
text interpretation a critical theme for the development of LLM systems in the future.
Semantics is a very old topic; Leibniz is credited with the earliest modern mathematical
formulation of it in his Characteristica Universalis [7].
After millennia of development, the logical representation of natural language texts
is today a well developed field with a vast body of books and articles. Specifically, in the
1970s, Richard Montague, a student of Alfred Tarski (who founded both set-theoretic
model theory and logical semantics [8]), developed a valuable theory proving that higher-order predicate logic generates coherent and comprehensive representations of texts [9–11].
Richard Montague’s famous article “English as a Formal Language” was followed by
similar works. Montague’s theory is formally complex, using intensional logic and Alonzo
Church’s lambda abstraction [12].
In short, from Montague’s point of view, every linguistic element has a semantics that is
provided by a high-order function that is represented in an appropriate space by a lambda
term. Our formalism, as we will demonstrate, enables us to formalize sentences in natural
languages by decomposing them into their component parts—all predicates associated with
words—and joining these parts with constants and logical operations (connected concepts
are provided in [13]). A logical operator of “predicate abstraction”, which is present neither
in Montague’s work nor in analogous subsequent logical approaches [14,15], provides an
advancement of Montague’s grammars in terms of simplification of the logical apparatus
and adherence to the linguistic structures. Moreover, monadic high-order predicates allow
us to eliminate variables.
Apart from Montague’s approach, formalisms of logical representation constitute a
large field of investigation in artificial intelligence [16,17]. However, the spirit and the
finalities of these systems are very different from those of the present work. In fact, they
are essentially oriented towards knowledge representation (KR), very often focusing on
specific knowledge domains (e.g., programming, datasets, query languages, semantic webs,
belief revision, medicine). In these contexts, natural language is considered an instrument
on which representations are based rather than an object of investigation in itself. Moreover,
they use variables, first-order or second-order logic, and modal or temporal operators,
and the rules of composition are very complex, making KR languages comparable to
programming languages. In certain cases they differ radically from classical predicate logic,
and can follow very different principles and presuppositions [17].
Although the basis of our formalism is logically sophisticated (high-order predicate
logic, logical types, lambda abstraction), we can explain the method in a very intuitive
way because monadic predicates naturally resemble the conceptual organization of words
and completely avoid variables. The ability to produce correct logical representations
lies essentially in the choice of the involved constants, the right aggregation of parts by
means of parentheses, and the right logical types of the constituents, which is managed
using the operator of predicate abstraction. The simplicity of the formalism is proven by
the conversation with ChatGPT 3.5 reported in the final section, where, after one page of
conversation, the chatbot is able to show a basic familiarity with the presented formalism.
The main ingredients of our logical representations are words, constants, parentheses,
and predicate abstraction. This means that semantics reduces to a relational system of
words from which morphology and syntax are removed and the logical essence of the
relationship between words is extracted. The relevance of this for LLM models could be
considerable, and surely needs further analyses and experiments that can be developed
using the most recent chatbots. In addition, as addressed in our conclusions, this fact raises
a number of problems around the proprietary nature of these systems, as training strategies
and finalities are under the control of the companies producing and maintaining them.
2. Materials and Methods
In this section, we first define the main aspects of logical semantics, then outline
high-order predicate logic by providing the first examples of the logical representation
of sentences.
2.1. Logical Semantics
Let us begin by observing the intrinsic principle of duality in semantics. Meanings
denote both objects and relations between them; therefore, when associating meanings
with symbolic expressions of a certain type, it is necessary to presuppose both objects
and relations.
What is essential is the application of a predicate to the complementary entities of the application, called arguments. The proposition obtained as a result of this application expresses the occurrence of a relationship between the two types of entities. We can write
P(a, b)

to express the validity of a relation associated with P on the arguments a, b (in the order they appear). P is called a predicate, and denotes a relation; thus, P(a, b) is called a predication or atomic proposition. The individual constants a, b designate the arguments of the predicate P.
However, because a predicate can be an argument for a predicate of a higher type,
predicates are arranged along a hierarchy of levels, or logical types; according to Russell’s
theory of logical types, this situation can occur indefinitely [18,19].
In addition, it is possible to conceive of relations that are simultaneously both objects
and relations. However, as these possibilities often lead to logical inconsistencies, they
should only be considered in specific and well-controlled contexts with appropriate precautions. We exclude them from the following discussion, assuming individuals at level
zero and predicates of levels 1, 2, 3, . . . (the semantics of natural languages rarely require
predicates of order higher than 3).
The negation of P(a, b),

¬P(a, b)

indicates the opposite of P(a, b), i.e., its non-validity. The disjunction is a proposition,

P(a, b) ∨ P(b, a)

indicating that at least one of the two propositions connected by ∨ is true, while the conjunction

P(a, b) ∧ P(b, a)

indicates that both propositions connected by ∧ are true. The arrow → denotes implication; thus, the proposition

P(a, b) → P(b, a)

read as “if P(a, b), then P(b, a)”, is equivalent to

¬P(a, b) ∨ P(b, a)

and finally,

P(a, b) ↔ P(b, a)

is the logical equivalence, equal to (P(a, b) → P(b, a)) ∧ (P(b, a) → P(a, b)).
The symbols ¬, ∨, ∧, →, ↔ are called connectives (negation, disjunction, conjunction, implication, equivalence), while the symbols ∀ and ∃ are called quantifiers (universal, existential).
If we consider a variable x, then

∀x P(x, b)

asserts that, for every value a taken by x, P(a, b) holds, while

∃x P(x, b)

asserts that there exists a value a of x for which P(a, b) holds. Connectives between propositions can be extended to predicates. In particular, if P and Q are predicates with only one argument, then (P → Q) denotes the predicate such that (P → Q)(a) holds when the proposition (P(a) → Q(a)) holds.
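As a concrete illustration of this extension of connectives from propositions to predicates, the following minimal Python sketch (our own illustration, not part of the original formalism; the predicates Love and Happy and the toy individuals are hypothetical) models unary predicates as Boolean-valued functions over individuals:

def implies(p, q):
    # "p -> q" is equivalent to "(not p) or q"
    return (not p) or q

def pred_implies(P, Q):
    # Lift implication from propositions to unary predicates:
    # (P -> Q)(a) holds exactly when (P(a) -> Q(a)) holds.
    return lambda a: implies(P(a), Q(a))

Love = lambda a: a in {"a"}           # hypothetical predicate
Happy = lambda a: a in {"a", "b"}     # hypothetical predicate

LoveImpliesHappy = pred_implies(Love, Happy)
print(LoveImpliesHappy("a"))  # True: both Love("a") and Happy("a") hold
print(LoveImpliesHappy("c"))  # True: Love("c") is false, so the implication holds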
2.2. Formalizing Natural Language Sentences
Predicate logic [10,12,20] is a formal system used to represent the logical structure of
propositions. Chapter 6 of [20] develops, in more than 100 pages, the first modern attempt
at logical analysis of natural language in terms of predicate logic. It provides a way to
express relationships between objects and describe actions, properties, and concepts. In this
text, we explore how high-order predicate logic can be used to represent the meaning of
sentences and concepts in natural language in a systematic and agile way. The method is
independent from any specific language, and is adequate for teaching logical analysis to
artificial systems.
In predicate logic, we have three main components: Predicates, Objects, and Logical Symbols.
Predicates: these are symbols that represent relationships or properties. They describe
how objects are related to each other or specify attributes of objects; for example, “love”,
“eat”, and “happy” are predicates.
Objects: these are the entities to which predicates are applied. They can be individuals,
things, or concepts. In natural language, objects can include people and animals as well as
relations and properties. In this sense, there is a need for both predicates that can be applied
to objects and for objects that are predicates to which other predicates can be applied.
Logical Symbols: connectives and quantifiers are used to express the logical operations ¬, ∧, ∨, →, ↔, ∀, ∃. Parentheses and commas are additional symbols that are needed
for writing logical formulas.
Objects and predicates are denoted by: (i) constants (letters or strings) denoting objects
and relations (at every level) and (ii) variables (letters or strings, different from those used
for constants) ranging over objects and relations (at every level). We adopt a convention
that we call implicit typification, in which: (i) lowercase letters denote individuals
(objects at zero level); (ii) strings of letters with one uppercase letter denote first-order
predicates (over individuals); (iii) strings with two uppercase letters denote second-order
predicates (over first-order predicates); and (iv) analogously for the third order and higher
orders. Strings of letters including x, y, z, X, Y, Z (possibly with subscripts or superscripts)
are variables, while strings including other letters (lowercase or uppercase, possibly with
subscripts or superscripts) are constants. In this way, the form of a string assigns to it the
role of a constant or a variable and determines its logical type.
Predicates are associated with words in a given language. In this case, the logical types
of such predicates can be deduced from the types of their arguments.
A predicative theory is provided by a list of propositions (as indicated below, on subsequent lines).
Below, we affirm a principle whose validity has been proven by the applications of
Mathematical Logic from the late 19th century to the present day.
The semantics of every symbolic expression can always be reduced to an appropriate predicative theory.
In practice, predicative theories use additional symbols to make them easier to read and
write. For example, the equality symbol = is used to affirm that two symbolic expressions
have the same meaning. In formal terms, equality is defined by the following proposition:
a = b ↔ ∀X (X(a) ↔ X(b)).
However, all symbols extending the kernel of Predicate Logic can be formally defined
in the basic setting provided above, and can be reduced to Frege’s logical basis of negation,
universal quantifier, and implication (¬, ∀, →).
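To make the definition of equality above tangible, the following small Python sketch (ours; the finite pool of predicates only approximates the quantification over all X) checks whether two constants are Leibniz-equal, i.e., whether every available predicate agrees on them:

predicates = [
    lambda x: x == "helen",
    lambda x: len(x) > 3,
    lambda x: x.startswith("h"),
]

def leibniz_equal(a, b, preds):
    # a = b  <->  for every X in the pool, X(a) <-> X(b)
    return all(X(a) == X(b) for X in preds)

print(leibniz_equal("helen", "helen", predicates))  # True
print(leibniz_equal("helen", "harry", predicates))  # False: the first predicate disagrees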
Predicates associated with words include: (i) lexemes from a dictionary (in a predefined
language), including proper names; and (ii) grammatical elements, called grammemes.
Obviously, the choice of dictionary determines the lexemes, while grammatical predicates depend on the terminological choices of the reference grammar. For example, we
could use “ComplOgg” to indicate the role of an object complement or consider transitive
verbs as predicates with two arguments (subject and object). For example, “a loves b”
becomes Love(a) ∧ ComplOgg(Love, b), or: Love(a, b).
Furthermore, we can consider the predicate “I” or a predicate such as “1st-Pers-sing”
(first-person, singular), and analogously for pronouns, prepositions, and conjunctions.
Even a proper noun is a predicate; thus, Julia(a) indicates that “a” is named “Julia”.
Thus, in predicative theory, the simple sentence “I love Helen” is expressed as
I(a)
Helen(b)
Love(a, b).
I(a): this indicates that the individual constant a is the “I” of the sentence.
Helen(b): this indicates that the individual constant b denotes an individual named
“Helen”.
Love(a, b): this asserts that “a loves b”.
Of course, grammatical terminology is entirely arbitrary, and any equivalent terminology essentially expresses the same logical relationships between objects and predicates.
Let us consider the sentence “Yesterday, I was walking without shoes”. Its predicative representation is as follows, where Ŵalk denotes the predicate abstraction of “Walk”, which we explain in the next section:

I(a)
Ŵalk(P)
Without-shoe(x) = ∀y (Wear(x, y) → ¬Shoe(y))
Without-shoe(a)
PastProgressive(P)
YesterDay(P)
P(a).
Intuitively, the formalization of the sentence can be paraphrased as: (1) P is a walking
motion; (2) a is without shoes (any object that a is wearing is not a shoe); (3) P is yesterday
and is in the past (imperfect); (4) the constant a is the “I” of the sentence; (5) a satisfies
predicate P.
3. Results
In this section, the logical operator of predicate abstraction is introduced, which is
related to Church’s lambda abstraction. Logical representations of a Chinese sentence are
provided and High-order Monadic Logic (HML) is introduced, which is a special kind
of high-order Predicate Logic. Finally, many examples of logical representations in HML
are provided.
3.1. Predicate Abstraction
Predicate abstraction is a powerful logical operation in the context of natural languages.
It allows us to elevate the logical order of a predicate. When we say Love(a), we mean that individual a loves someone; however, when L̂ove(P) holds, this means that P possesses the property of loving. Thus, P is a predicate including all the typical characteristics of loving, because L̂ove denotes a predicate over first-order predicates, that is, a second-order predicate.
We present the following informal definition of the predicate abstraction operator:

Given a first-order predicate Pred, the second-order predicate P̂red is a predicate expressing the property of all predicates that imply the predicate Pred.
In general, when applied to a predicate of order i, the predicate abstraction operator
provides a new predicate of order i + 1. The sentence “Every man is mortal” has the following very simple representation showing the expressive power of predicate abstraction:
M̂ortal(Man)
namely, the predicate Man has the property of all predicates that imply mortality.
By using predicate abstraction, the sentence “I love Helen” becomes:
I(a)
Helen(b)
L̂ove(P)
P(a, b).
At first sight, this seems a way of complicating a very simple proposition: Love(a, b) ∧ I(a) ∧ Helen(b). However, in the representation above it is possible to add other propositions having P as an argument, which can enrich P with further particular aspects.
For example, the sentence “I love Helen very much” is obtained by adding a further
second-order predication to P:
I(a)
Helen(b)
L̂ove(P)
VeryMuch(P)
P(a, b).
A formal definition of Predicate Abstraction can be provided by means of “lambda
abstraction”, introduced by Alonzo Church in the early 1930s. Today, we prefer to express it in
Python notation. Let us consider a Python expression E(a, B, i) built with some operations applied to data and variables, such as (2 * a + B[i]), where a is an integer, B is a list of integers, and i is an index (integer):

def funct(a, B, i):
    result = 2 * a + B[i]  # the expression E(a, B, i)
    return result
This is essentially a way of expressing the function corresponding to the expression E , independently from the choice of variables occurring in E as well as from the particular values
assumed by the variables, that is, the result produced by the function is the evaluation of
E when the variables occurring in it are instantiated with the arguments of funct. This
mechanism is essential in programming languages, as it distinguishes the definition of
a function from its application (the calling of the function) in many possible contexts. It is a
basic logical mechanism on which high-order predicate logic can be founded, together with
application and implication.
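Continuing in the same Python notation, predicate abstraction itself can be sketched over a finite domain (again our illustration, not the paper's code; the domain and the predicates Love and LoveMadly are hypothetical): abs(Pred) is the second-order predicate holding of every first-order predicate that implies Pred on the domain.

DOMAIN = ["a", "b", "c"]  # hypothetical individuals

def predicate_abstraction(Pred):
    # abs(Pred)(P) holds iff P(x) -> Pred(x) for every x in the domain
    return lambda P: all((not P(x)) or Pred(x) for x in DOMAIN)

Love = lambda x: x in {"a", "b"}
LoveMadly = lambda x: x == "a"   # a more specific way of loving

print(predicate_abstraction(Love)(LoveMadly))  # True: loving madly implies loving
print(predicate_abstraction(LoveMadly)(Love))  # False: the converse fails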
The following is the formalization of the prior sentence regarding “walking without shoes” using predicate abstraction:
Ŵalk(P)
Ŵithout-shoes(P)
I(a)
PassImperf(P)
Yesterday(P)
P(a).
This second representation of the sentence is more correct than the previous one; because Ŵithout-shoes has P as its argument, it does not express a property of the individual a (who sometimes may wear shoes), but instead characterizes P as a property of the walking of a (together with the predicates Yesterday and PassImperf).
It can be verified that any discourse can be rigorously represented within the logical
structure outlined here.
The number of basic words in a natural language is only a few thousand, while
grammatical predicates are a few hundred and logical symbols a few dozen. By adding
letters for constants and variables, it is possible to express the meanings of natural language
texts with predicates of logical types that generally do not exceed the third level. However,
with respect to Montague’s approach, predicate abstraction permits a very simple way of
constructing meanings incrementally by taking a basic predication P(a) and adding other
propositions with high-order predicates that provide further characterizations to P. As we
show, this modularity avoids many complications of Montague’s semantics by providing
logical representations that are very close to the linguistic form of sentences.
3.2. Representing Meaning across Languages: The Chinese Example
Let us consider a sentence in Simplified Chinese:
昨天我去海边散步
(yesterday, I went for a walk by the seaside).
The following words constitute the predicates used to build the predicative representation:
昨天 YesterDay
我 I
去 Going
海 Sea
边 Side
散 Scattered
步 Step
地方 Place
A predicative theory representing this sentence is as follows (QQ and RR are second-order predicate constants):

我(a)
∀X (QQ(X) ↔ ∀xy (X(x, y) → 去(x, y) ∧ 地方(y)))
QQ(P)
昨天(P)
海(c)
边(b, c)
∀X (RR(X) ↔ (步(X) ∧ 散(X)))
RR(P)
P(a, b).
The predicate RR expresses that the action of P (already characterized as walking) is
carried out “in steps” and in a “scattered” manner, i.e., distributed in space (in English,
a walking). Let us use abs to denote predicate abstraction. For a predication such as
(Place(b))(P), expressing that predicate P is located at place b, the previous representation becomes
我(a)
(abs(去))(P)
昨天(P)
海(c)
边(b, c)
地方(b)
(abs(步 ∧ 散))(P)
(地方(b))(P)
P(a).
This example demonstrates that our method is entirely independent of the language
being considered; when the words have been associated with predicates, formulas can
represent the sentences by indicating how predicates apply at the different logical levels.
In the last example, no variable occurs. This situation can be generalized using a
particular type of high-order logic, which we present in the next section.
3.3. High-Order Monadic Logic
High-order predicate logic with only monadic (unary) predicates (HML) is a powerful
environment for developing logical representations of natural language sentences. This
short section provides a rigorous basis for the analysis of the logical types of high-order
predicate logic. As it is more technical, readers who are not interested in the foundations of our formalizations can skip it without compromising their understanding of the
following discourse.
High-order Monadic Logic (HML) can be expressed using three logical symbols:
(1) λ for (functional) abstraction;
(2) → for implication;
(3) parentheses ( ).
In HML, there are two categories of expressions, namely, objects and types. An object
is associated with one and only one type.
There are two kinds of basic objects, individuals and truth values, with respective
types ind and tt.
If σ, τ denote generic types, then (σ → τ ) is a type denoting functions transforming
objects of type σ into objects of type τ.
For any type τ, there is an infinite list of constants and variables of that type (with τ as
a subscript indicating the type).
The constants F (false) and T (true) denote the two possible truth values (of type tt).
In HML, there are three rules for obtaining expressions denoting objects starting from
logical symbols, constants, and variables:
Abstraction rule: if ξσ is a σ-variable and φ denotes a τ-object, then λξσ(φ) denotes an object of type (σ → τ).
Application rule: if Θ denotes an object of type (σ → τ) and ζ denotes an object of type σ, then Θ(ζ) denotes an object of type τ. Moreover, if Θ[ξ] is an expression of type τ including a variable ξσ on which no λ-abstraction is applied and η is an expression of type σ, then

(λξσ(Θ[ξ]))(η) = Θ[η],

where Θ[η] denotes the expression Θ[ξ] after replacing all the occurrences of ξσ with η.
Implication rule: if φ and ψ denote truth values, then (φ → ψ) denotes a truth value. In general, if Φ and Ψ are predicates of type (σ → tt), then (Φ → Ψ) is a predicate of the same type, such that for any expression η of type σ it is the case that

(Φ → Ψ)(η) = Φ(η) → Ψ(η).

It can be shown that all the logical operators of high-order predicate logic can be expressed in HML. In particular, using the symbols λ, →, T, F, negation ¬φ is expressed by (φ → F) and quantification ∀x(φ) is expressed by (λx(φ(x)) = λx(T)).
Expressions denoting truth values are called propositions (a predication can be considered as an “atomic proposition”), while those denoting objects of type (σ → tt), which
we indicate with predσ , denote (unary) predicates of type σ. Objects of type (ind → tt)
correspond to first-order predicates, and are simply indicated by pred, while objects of type
((ind → tt) → tt) are second-order predicates.
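The type discipline just described can be rendered directly in code. The following Python sketch (a toy type checker of ours, not part of the formalism itself) represents the basic types ind and tt, arrow types, and the application rule:

from dataclasses import dataclass

@dataclass(frozen=True)
class Arrow:
    src: object   # source type: "ind", "tt", or another Arrow
    dst: object   # target type

IND, TT = "ind", "tt"
PRED = Arrow(IND, TT)      # first-order predicates: (ind -> tt)
PRED2 = Arrow(PRED, TT)    # second-order predicates: ((ind -> tt) -> tt)

def apply_type(fn, arg):
    # Application rule: an object of type (sigma -> tau) applied to an
    # object of type sigma yields an object of type tau.
    if isinstance(fn, Arrow) and fn.src == arg:
        return fn.dst
    raise TypeError(f"cannot apply {fn} to {arg}")

print(apply_type(PRED, IND))    # tt  -- e.g., Love(a) is a proposition
print(apply_type(PRED2, PRED))  # tt  -- e.g., a second-order predication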
3.4. Predicate Abstraction in HML
Let us consider a binary predicate P over two individuals and the predication P(a, b) of P over the arguments a, b. We can express this proposition by means of two unary applications: (P′(a))(b), where P′(a) is the monadic predicate P(a, −) obtained from P when its first argument is set equal to a, which holds on b when P(a, b) holds:

(P(a, −))(b) = P(a, b).

Therefore, P′ is a function taking an individual as argument and providing the unary predicate P′(a).
Let SecondArgument(b) be a second-order predicate satisfied by the monadic predicates X(a, −) holding on b. Consequently,

P(a, b) = (P(a, −))(b) = (SecondArgument(b))(P(a, −)).
However, (P(a, −))(b) means that

(P̂(a, −))(b)(P);

therefore,

(SecondArgument(b))(P(a, −)) = (P̂(a, −))(b)(P).
In other words, the monadic reduction of a first-order binary predicate is definable in terms of predicate abstraction. In conclusion, P(a, b) is completely represented by

∃y P(a, y)
(SecondArgument(b))(P(a, −)),

and we can simply write

P(a)
(SecondArgument(b))(P).
Of course, the mechanism described for binary predicates can be naturally extended
to predicates of any number of arguments.
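In programming terms, this monadic reduction is ordinary currying, as the following Python sketch illustrates (ours; the binary predicate Loves is hypothetical):

def curry2(P):
    # Turn a binary predicate P(a, b) into the unary chain (P'(a))(b)
    return lambda a: (lambda b: P(a, b))

Loves = lambda a, b: (a, b) == ("a", "b")   # hypothetical binary predicate

print(curry2(Loves)("a")("b"))  # True: same truth value as Loves("a", "b")
print(curry2(Loves)("b")("a"))  # False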
Place, time, manner, instrument, possession, and other natural language complements
logically relate to an (implicit) application of predicate abstraction. Specifically, we employ
an implicit predicate abstraction (after accepting an individual as an argument) to express
verb complements by supplying a predicate’s property. As an illustration, the predication (with(c))(P) states that P has the property with(c) (a property of properties); the logical type of with is (ind → (pred → tt)) (in fact, with(c) has type (pred → tt)); and finally, (with(c))(P) is a proposition (of type tt).
The monadic nature of HML enables a very concise way of expressing the predicative structure of sentences: enumerating all constants and listing, for each of them, the predicates taking the given constant as argument. For example, the previously considered sentence
“Yesterday I was walking without shoes” becomes:
a : I, P
P : Yesterday, Past, Progressive, Ŵalk, (Without(Shoe))(Wear)

The above formalization corresponds to a Python dictionary structure of the following type (where “abs” stands for predicate abstraction):

{'a': ['I', 'P'], 'P': ['Yesterday', 'Past', 'Progressive', 'abs(Walk)', '(Without(Shoe))(Wear)']}
It is apparent that, by avoiding the explicit use of variables, the monadic setting of
HML forces the formalization to fit closely with the linguistic form. Specifically, unary
predicates naturally impose high-order logical types, with consequent elimination of variables. Moreover, a constant may occur as an argument and as a predicate at the same time
(P(a), Yesterday(P)).
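Spelled out as runnable Python (with the outer braces that the printed dictionary leaves implicit), the structure and a small helper expanding it back into predications might look as follows; this is our rendering of the convention, not code from the paper:

sentence = {
    'a': ['I', 'P'],
    'P': ['Yesterday', 'Past', 'Progressive', 'abs(Walk)', '(Without(Shoe))(Wear)'],
}

def predications(theory):
    # Expand the dictionary into the flat list of predications Pred(const)
    return [f"{pred}({const})" for const, preds in theory.items() for pred in preds]

print(predications(sentence))
# ['I(a)', 'P(a)', 'Yesterday(P)', 'Past(P)', 'Progressive(P)',
#  'abs(Walk)(P)', '(Without(Shoe))(Wear)(P)']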
Certain aspects are crucial in the determination of the above Python dictionary: (1)
the introduction of the right constants to which the predicates refer; (2) possible “hidden
predicates” that do not occur as words in the sentence, which generally are of grammatical
nature but in the case above include the lexical term “Wear”; and (3) the logical type of
predicates and the pattern according to which they apply. For example,
(Without(Shoe))(Wear)
implicitly provides the following type assignments, where pred abbreviates the type
(ind → tt):
Wear : pred
Shoe : pred
Without : (pred → (pred → pred)).
In natural languages, complements, modifiers (adjectival and adverbial forms), and pronouns realize the reference mechanism, usually based on grammatical marks and morphological concordance (gender, number, tense, etc.). A pronoun refers to a component
having the same marks. Moreover, the aggregation of components is realized on the basis
of concordance, which corresponds to the use of parentheses in mathematical expressions.
In HML, reference is realized by means of constants and aggregation is realized by parentheses, with the different levels of application expressing the logical levels of predicates in
a rigorous way.
The sentence “Mary goes home with the bike” provides the following HML translation:
Mary(a)
Ĝo(P)
(Place(b))(P)
(With(c))(P)
Home(b)
Bike(c)
The(c)
P(a).
Synthetically,
a : Mary, P
b : Home
c : Bike
P : Ĝo, With(c), Place(b).
A more complex example involving a relative clause is the sentence “I am searching
for a bike with a leather saddle”.
We can consider the logical definition of Any as a function of type ((ind → tt) → ind) satisfying the condition

∀X (∃x (X(x)) → X(Any(X)))

I(a)
Progressive-present(P)
(Search-For(Any(Q)))(P)
L̂eather(Q)
Ŝaddle(Q)
(Of(Bike))(Part)(Q)
P(a).
Synthetically,
a : I, P
P : Search-For(Any(Q)), Progressive-present
Q : L̂eather, Ŝaddle, (Of(Bike))(Part).
We collected a number of translation exercises involving different kinds of linguistic
constructions, several of which were long and complex, to confirm the power and adequacy of HML to represent natural language logic. The expressive mechanisms of lambda
abstraction, high-order types, application, and implication, together with parentheses and
constants, apply to any kind of natural language.
4. Teaching Logic to Chatbots
Here, we report a conversation with ChatGPT 3.5 that seems to be very informative
about the potential of the discourse developed in the previous sections. Logical analysis could introduce a new kind of interaction with chatbots, opening up an interesting
perspective on the AI discussion.
In many aspects, the reaction of ChatGPT is really surprising and impressive. While
we do not report the whole conversation, a particular point deserves attention. In many
discussions, people claim that chatbots do not really understand what is being said. Certainly, at this point in time they understand in a way that is very different from the way in
which humans understand; indeed, we are not even sure how we comprehend. However,
in general terms, abstraction is a necessary feature of any kind of comprehension. In fact, in
order to understand meanings, information has to be organized according to general rules
abstracted from single cases. In a previous conversation (not reported here) focused on the
comprehension of sentences, ChatGPT, when stimulated by specific questions, explained
that any word corresponds to a vector of real numbers (features), and that in the process
of sentence elaboration these vectors generate a vector of thousands of values. Therefore,
the process of sentence synthesis/analysis corresponds to suitable operations in many-dimensional vector spaces. This means that mathematical structures and related operations are involved in the dialogue activity. In other words, this activity relies on the capability of mastering abstract structures. On this basis, it could be said that these systems exhibit a particular form of comprehension.
Conversation
I want to ask you to give logical representations of sentences, according to a
simple and basic logical mechanism, that is, “predication”. According to it a
predicate is applied to an argument, also called the subject of predication. In this
way any sentence is reduced to a list of subjects, and each of them has a list of
predicates applied to it. Subjects are denoted by constants represented by strings,
while predicates are usually words. This basic schema becomes not easy if we
consider that in the sentences of natural languages the predicates associated
to the words are arranged along a natural hierarchy of logical types. In fact,
there are initial predications where subjects are considered as having type zero,
and their predicates type 1, but some of these predicates can be subjects of other
predications where predicates are of type 2, 3, and so forth. (Author)
It seems like you’re describing a basic approach to logical representation using
predication, where sentences can be broken down into subjects and predicates,
and there’s a consideration of logical types. This approach aligns with predicate
logic and the concept of logical types as introduced by Bertrand Russell. In
this framework, you would assign types to individuals (constants/subjects) and
predicates to ensure meaningful and well-formed sentences. For instance, if a
constant (individual) has type 0, the predicates associated with it would be of
type 1. Then, if a predicate or constant has type 1, it can be applied to subjects
(constants) of type 0, and so on. Let me know if you have a specific sentence you
would like to represent in this logical framework, and I can help you break it
down into subjects and predicates based on your described schema. (ChatGPT)
Yes, exactly! Let us start with simple examples. At the beginning I provide you
the representation, and then I will ask you to do the same. The first sentence is
“I love Helen”. In this case, constants are: a, b, P. Predicates are: I, Love, Helen,
P, Object. Predications are: I(a), Helen(b), (abs(Love))(P), (Object(b))(P), P(a).
Moreover, “abs” is an operator that indicates that abs(love) holds for predicates
that imply the property of loving, while (Object(b))(P) indicates that predicate P
has b as object. Let me know if this is clear, and provide me the representation of
“Helen goes home with her bike”. (Author)
Certainly, I understand the structure you’ve outlined. Let’s apply this to the
sentence “Helen goes home with her bike”. Using the constants and predicates
you provided:
Constants: a,b,c
Predicates: Helen, Goes, Home, With, Her, Bike, P, Object
Predications:
Helen(a)
Goes(b)
Home(c)
(abs(With))(P)
(abs(Her))(P)
(abs(Bike))(P)
(Object(c))(P)
So, the logical representation of “Helen goes home with her bike” would include
these predications involving the constants and predicates you’ve defined. If you
have further sentences or questions, feel free to share them! (ChatGPT)
Let me show an improvement of your representation: (Author)
Certainly, feel free to provide an improved representation or any adjustments
you’d like to make. I’m here to assist and discuss the logical representation further.
(ChatGPT)
Helen(a), (With(b))(P), (abs(Go)(P), P(a), Home(c), Bike(b), (Place(c))(P), (Of(a))(b).
(Author)
Great! Let’s incorporate the improvements into the logical representation for
“Helen goes home with her bike”.
Constants: a,b,c
Predicates: Helen, Goes, Home, With, Her, Bike, P, Object, Place, Of
Predications:
Helen(a)
(With(b))(P)
(abs(Go))(P)
P(a)
Home(c)
Bike(b)
(Place(c))(P)
(Of(a))(b)
This representation reflects the refined structure you provided. If you have more
sentences or if there’s anything else you’d like to discuss or modify, feel free to let
me know! (ChatGPT)
Ok, now please, could you explain to me the logical motivation of the last predication above? (Author)
Certainly! Predication (Of(a))(b) represents the relationship between the constant
a (Helen) and b (her bike) using the predicate Of. Let’s break down the logical
motivation:
•  Predicate Of is used to express a relationship of possession or association. In this context, it signifies the association between Helen and her bike.
•  The argument a represents Helen, and b represents her bike.
•  Therefore, (Of(a))(b) is read as “Helen possesses/has her bike” or “The bike of Helen”.
So, in the logical structure, this predication captures the idea that the constant
a (Helen) is associated with the constant b (her bike) through the predicate
Of, indicating ownership or possession. This is a common way to represent
relationships and associations between entities in logical frameworks. (ChatGPT)
Let me ask for another little improvement. In the list of predicates, Goes is present but Go is not, while in the logical representation Go is present but Goes is absent. How can you fix this discrepancy? (Author)
Certainly, we can address this discrepancy by adding both “Go” and “Goes” to the
list of predicates in the representation. Here’s the corrected version: (ChatGPT)
But now Goes is in the list of predicates but does not occur in the representation, so which is the correct situation? (Author)
I apologize for the oversight. You are correct. If “Goes” is in the list of predicates,
it should also appear in the representation. (ChatGPT)
No, it is not the right answer. The best solution is to delete Goes, by using only
Go. In fact, Goes is only a grammatical form of Go that is required by the English
grammar, according to the rules of grammatical concordance, but the semantic
essence of the predicate is in the infinitive form Go. Please, provide the correction.
(Author)
I appreciate the clarification. Given that “Goes” is a grammatical form derived
from the verb “Go,” and we want to represent the semantics using the infinitive
form “Go”. We can indeed simplify the list of predicates by including only “Go”.
Here’s the corrected version:
Constants: a,b,c
Predicates: Helen, Go, Home, With, Her, Bike, P, Object, Place, Of
Predications:
Helen(a)
(With(b))(P)
(abs(Go))(P)
P(a)
Home(c)
Bike(b)
(Place(c))(P)
(Of(a))(b)
(ChatGPT)
What about the sentence: “Helen is going home with her bike”? (Author)
Let’s represent the sentence “Helen is going home with her bike” using the
logical structure you’ve outlined. I’ll incorporate the new information into the
representation: Where the same constants and predications are given plus the
predication (abs(Is))(P).
This structure aims to capture the essence of the sentence “Helen is going home
with her bike” in the logical framework you’ve described. If you have further
questions or if there’s anything else you’d like to adjust, please let me know!
(ChatGPT)
In my opinion, this choice can be improved by a predicate linked with the usual
grammatical terminology. For example, by replacing “Is” by Progressive-present.
(Comment outside the conversation: an inaccuracy in the list of predicates remains,
which is similar to that of Go and Goes. In fact, predicate Her can be removed,
after the introduction of the Of(b) predicate). (Author)
Certainly, we can replace the generic “Is” with a more explicit term, such as
“Progressive-present,” to better reflect the grammatical aspect of the sentence.
(ChatGPT)
In a previous conversation, you told me that chatbots can learn from the conversations they have. In this case, did these exercises in logical representation teach
you new things, or, once the conversation is over, does your knowledge remain
the same as it was before the conversation? (Author)
Chatbots based on language models, like me, do not learn or store specific information from user conversations. My language model, GPT-3.5, is pre-trained on a
wide range of data from the web, but it does not retain a memory of specific user
conversations. Each conversation is isolated and does not impact future responses.
I can provide consistent and contextually appropriate responses based on my
pre-existing training, but I do not have the ability to ’learn’ in the traditional
sense. Responses are generated in real-time based on the context of the current
conversation and the information available up to my last update in January 2023.
Currently, there is no version of GPT (Generative Pretrained Transformer) or
ChatGPT that has the ability to learn incrementally from user conversations or
to maintain long-term memory of acquired information. GPT is designed as a
pretrained language model; thus, each conversation is considered in isolation,
and the model does not retain data between interactions. While more advanced
models might be developed in the future with incremental learning or long-term
memory capabilities, currently these are not available in the known versions of
GPT, including GPT-3.5. It is important to note that implementing such capabilities involves ethical and security challenges, and requires careful management to
avoid potential issues related to user privacy and data security. (ChatGPT)
5. Discussion
The ability of predicate logic to represent human thought within a minimal set of
symbols is a remarkable achievement with a millennia-long history. It traces back to the
earliest systems of writing, through Aristotle’s syllogism and medieval logic dealing with
problems of interpretation and modality, up to Renaissance logic focused on combinations
of symbols.
Leibniz had a profound and visionary understanding of symbolic systems and mathematical notation. Over time, great minds such as De Morgan, Boole, Schröder, Peirce, Frege,
Peano, Hilbert, Russell, Church, Tarski, and Gödel laid the foundations for an alphabet of
ideas. As mathematical logic proves, this alphabet, consisting of a few logical symbols and
rules for their combinations, is capable of formally representing human reasoning.
This distilled form of reason has a deep and enduring history, serving as the foundation for various mathematical and scientific theories. In particular, it provides a secure
framework for set theories such as ZF (Zermelo–Fraenkel) and NBG (von Neumann–Bernays–
Gödel), which can express nearly all of mathematics using specific axioms.
Formalisms for representing knowledge, particularly those that are universal in nature,
are applicable in a vast array of contexts. This implies that a formalism has a good chance
of developing and becoming a valuable tool in scientific communication if it is more
straightforward and grounded in science than others.
The examples presented in this paper and the reported conversation with ChatGPT
3.5 suggest an intriguing possibility for the development of systems exhibiting dialogue
abilities, such as ChatGPT, BARD, BERT, and others; see [21,22] for analogous proposals
from different perspectives.
An artificial system able to provide the HML formalization of linguistic texts must process an input string expressing a sentence, then yield as output the correct dictionary expressing HML propositions involving the words of the sentence. While
the words occurring in the dictionary take the form of lexicon entries (lexemes), grammatical
items need to appear in the dictionary of logical representations as well. This requires basic
linguistic ability on the part of the system, similar to that of LLM models.
The conversation with ChatGPT shows that even though we provided the formal basis of our formalism in order to motivate its logical structure and its links with classical predicate logic, during the interaction with ChatGPT the formalism was explained in plain English and essentially transmitted by examples and comments on concrete cases of logical analysis. It is apparent that the system shows flexibility and the ability to abstract from single cases, which are surely supported by its capability of handling abstract structures, and by the grounding of the formalism in the logical basis of HML.
Not only was the chatbot able to follow a very constructive conversation, it correctly
addressed the point of incremental learning, which of course is a strategic topic, though beyond the scope of the present paper. However, formalisms of knowledge representation,
especially those of general-purpose nature, apply to an enormous number of situations.
This means that if a formalism is simpler and more scientifically well-founded than others,
it can surely develop and become an important instrument in scientific communication.
In the case of systematic training of chatbots by human experts, the trainers need to
understand the formalism; in this case, a Python version of HML might be more appropriate.
We have already noted that a logical representation reduces to a Python dictionary, and it would not be difficult to translate lambda abstraction and the whole logical basis of HML into suitable Python functions, classes, and methods.
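A first step in that direction might look like the following sketch (ours, over a hypothetical finite domain; a realistic implementation would need symbolic rather than exhaustive evaluation): lambda abstraction becomes a Python lambda, application becomes a function call, and the remaining operators are derived from implication and falsity as in Section 3.3:

DOMAIN = ["a", "b"]  # hypothetical finite universe of individuals

IMPLIES = lambda p, q: (not p) or q
NOT = lambda p: IMPLIES(p, False)              # not-p := (p -> F)
FORALL = lambda P: all(P(x) for x in DOMAIN)   # forall x P(x), over the finite domain

# abs(Pred): the second-order predicate of all predicates implying Pred
ABS = lambda Pred: (lambda P: FORALL(lambda x: IMPLIES(P(x), Pred(x))))

Walk = lambda x: x == "a"
Movement = lambda x: x in {"a", "b"}

print(ABS(Movement)(Walk))  # True: on this domain, walking implies moving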
A critical point emerged during the last part of the conversation, namely, that whatever
ChatGPT learns during a teaching interaction is completely lost at the end of the conversation. In fact, for reasons of security, even if a learning system can develop a capability
of incremental learning, this cannot be free until chatbots are able to develop internal
mechanisms for decision-making and control of their learning. In other words, current systems are closed, and training can only be developed within the companies to which these systems belong. This means that at present, experiments with significant impact are only possible in accordance with the research planned on these systems.
Of course, this does not mean that proposals and suggestions from external researchers
are useless. On the contrary, it is important to debate and promote the circulation of new
ideas that can be assimilated and integrated with those of other scientists up to the level of
design and implementation that the companies acting in AI and machine learning decide
to realize.
With the availability of an artificial neural network already trained in basic dialogue
competence, after training the ANN to acquire competence in HML representation, an evaluation of its impact on the quality and level of language comprehension could be carried
out, which may be of fundamental importance for the whole of artificial intelligence.
Surely, the epochal passage to the latest chatbots tells us that language is the main
tool for knowledge acquisition and organization; therefore, a correct understanding of the
logical structure of language could be the next step toward a further level of “conscious”
linguistic ability.
Funding: This research received no external funding.
Data Availability Statement: The data presented in this study are openly available in the cited
bibliography.
Conflicts of Interest: The author declares no conflict of interest.
References
1. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
2. Mitchell, T. Machine Learning; McGraw Hill: New York, NY, USA, 1997.
3. Nielsen, M. Neural Networks and Deep Learning. 2019. On-Line Book. Available online: http://neuralnetworksanddeeplearning.com/ (accessed on 2 December 2019).
4. Werbos, P. Backpropagation Through Time: What It Does and How to Do It. Proc. IEEE 1990, 78, 1550–1560. [CrossRef]
5. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
6. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361.
7. Parkinson, G.H.R. Leibniz Logical Papers; Clarendon Press: Oxford, UK, 1966.
8. Tarski, A.; Givant, S. A Formalization of Set Theory without Variables; Colloquium Publications; American Mathematical Society: Providence, RI, USA, 1988; Volume 41.
9. Dowty, D.R.; Wall, R.E. (Eds.) Introduction to Montague Semantics; D. Reidel: Dordrecht, The Netherlands, 1989.
10. Kalish, D.; Montague, R. Logic. Techniques of Formal Reasoning; Harcourt, Brace & World: San Diego, CA, USA, 1964.
11. Thomason, R.H. (Ed.) Formal Philosophy; Yale University Press: New Haven, CT, USA, 1974.
12. Church, A. Introduction to Mathematical Logic; Princeton University Press: Princeton, NJ, USA, 1956.
13. Manca, V. A Metagrammatical Logical Formalism. In Mathematical and Computational Analysis of Natural Language; Martín-Vide, C., Ed.; John Benjamins: Amsterdam, The Netherlands, 1998.
14. Barwise, J. The Situation in Logic; Center for the Study of Language and Information: Stanford, CA, USA, 1989.
15. van Benthem, J. Intensional Logic; Center for the Study of Language and Information: Stanford, CA, USA, 1988.
16. van Harmelen, F.; Lifschitz, V.; Porter, B. Handbook of Knowledge Representation; Elsevier: Amsterdam, The Netherlands, 2008.
17. Kerr, A.D. A plea for KR. Synthese 2021, 198, 3047–3071. [CrossRef]
18. Hilbert, D.; Ackermann, W. Principles of Mathematical Logic; American Mathematical Society: Providence, RI, USA, 1991.
19. Whitehead, A.N.; Russell, B. Principia Mathematica; Cambridge University Press: London, UK, 1910.
20. Reichenbach, H. Symbolic Logic; Macmillan: New York, NY, USA, 1947.
21. Pan, L.; Albalak, A.; Wang, X.; Wang, W.Y. LOGIC-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. arXiv 2023, arXiv:2305.12295v1.
22. Yang, Z.; Ishay, A.; Lee, J. Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text. In Proceedings of the ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 5186–5219.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Article
D0L-System Inference from a Single Sequence with a
Genetic Algorithm
Mateusz Łabędzki and Olgierd Unold *
Department of Computer Engineering, Faculty of Information and Communication Technology, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland; [email protected]
* Correspondence: [email protected]
Abstract: In this paper, we propose a new method for the image-based grammatical inference of deterministic, context-free L-systems (D0L-systems) from a single sequence. This approach is characterized
by first parsing an input image into a sequence of symbols and then, using a genetic algorithm,
attempting to infer a grammar that can generate this sequence. This technique has been tested using
our test suite and compared to similar algorithms, showing promising results, including solving the
problem for systems with more rules than in existing approaches. The tests show that it performs
better than similar heuristic methods and can handle the same cases as arithmetic algorithms.
Keywords: L-system; genetic algorithm; grammatical inference
1. Introduction
The discipline of artificial life aims to create models and tools for simulating and
solving complex biological problems [1]. It allows for experimentation and studies on
systems imitating existing life, without its physical presence. Examples of such models are
cellular automata and Lindenmayer systems [2]. The latter, sometimes called L-systems, are a type of formal grammar introduced by Aristid Lindenmayer in 1968 [3]. Their trademark
is a graphical representation associated with symbols of the alphabet. Initially, they were
created as a tool for modelling symmetric biological structures such as some types of plants.
Using L-systems, we can try to find a solution for a very basic problem—predicting the
growth of an organism, given its current state and environment. They have also been used
for a plethora of other use cases, such as modelling whole cities [4], sound synthesis [5]
or fractal generation [6]. They can also be used in procedural generation. After the initial
model is created, minor parameters or initial state modifications can create similar-looking
but still distinct objects in great numbers. While the usefulness of L-systems is not in question, they are challenging to develop, especially when they are supposed to model an existing object. In this article, we attempt the automatic generation of deterministic, context-free L-systems (D0L-systems [3]) from an image through grammatical inference, using a genetic algorithm (GA).
The main contributions of the article include the following:
• A new line detection algorithm;
• Extending the current capabilities of inference algorithms for D0L-systems from a single sequence from two to at least three rules;
• Improving the execution speed of heuristic algorithms for systems with one or two rules and reducing the number of assumptions that need to be made about the grammars being inferred.
The remaining part of the article is organized as follows. Section 2 first introduces the
fundamental knowledge necessary to understand the following sections and presents the
existing works dealing with similar problems. Then, our approach to solving the described
problem is presented. The test results and comparison to other methods are shown in
Section 3, while Section 4 draws conclusions and presents further investigation areas.
2. Materials and Methods
2.1. L-Systems
L-systems comprise three elements—a set of symbols Vp called an alphabet, a starting
sequence A called an axiom, and a set of rewriting rules F in the form of Base → Successor.
These systems work in iterations on sequences of symbols, starting with the axiom. In
each iteration, a new string is created by applying every rewriting rule to the current
sequence, meaning that every occurrence of the rule’s base is replaced by its successor.
The alphabet contains two types of symbols, terminal and non-terminal, which differ in a single aspect: a rule's base has to contain at least one non-terminal character. In the case of D0L-systems, rules have single-symbol bases, meaning that each base is a single non-terminal symbol.
The most recognizable property of Lindenmayer systems is the geometrical representation that is usually associated with all or some of the symbols in the alphabet. Turtle graphics is a commonly encountered method of translating sequences into geometric structures. It utilizes a cursor with a straightforward command set: it can draw a straight line, turn by an angle, save the current position and tilt, and return to the previously memorized state. Each of these operations can be mapped to symbols of the L-system alphabet, making an output sequence a command list for the cursor. In Figure 1, an example structure is shown, drawn from a sequence L3:
L3 = FF−[XY]+[XY]FF−[XY]+[XY]−[+FY−FX]+[+FY−FX]FF−[XY]+[XY]FF−[XY]+[XY]−[+FY−FX]+[+FY−FX]−[+FF−[XY]+[XY]−FX−FF−[XY]+[XY]+FY]+[+FF−[XY]+[XY]−FX−FF−[XY]+[XY]+FY],

which was generated in the third iteration by the system S3, defined as:

S3 = {A = F, F → FF−[XY]+[XY], X → +FY, Y → −FX}.
The F, X, and Y symbols are mapped to the draw forward action, the symbols [ and ] traditionally represent the save position and return to position actions, and the characters + and − command the cursor to turn by an angle of ±27.5°.
Figure 1. A structure generated by the S3 system in the third iteration.
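To make the rewriting mechanism concrete, the following minimal Python sketch (our own illustration, not the authors' implementation) applies a D0L rule set iteratively; with the S3 rules above, it reproduces the L3 sequence:

```python
# Minimal D0L-system rewriter (illustrative sketch, not the authors' code).
# Symbols without a rule (terminals such as +, -, [, ]) are copied unchanged.

def rewrite(sequence: str, rules: dict[str, str]) -> str:
    """Apply every rewriting rule once, in parallel, to the sequence."""
    return "".join(rules.get(symbol, symbol) for symbol in sequence)

def derive(axiom: str, rules: dict[str, str], iterations: int) -> str:
    """Run the L-system for a number of iterations, starting from the axiom."""
    sequence = axiom
    for _ in range(iterations):
        sequence = rewrite(sequence, rules)
    return sequence

# The S3 system from the text: axiom F and three rewriting rules.
s3_rules = {"F": "FF-[XY]+[XY]", "X": "+FY", "Y": "-FX"}
print(derive("F", s3_rules, 3))  # prints the L3 sequence
```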
2.2. Grammatical Inference
As mentioned, the most significant problem with L-systems is the difficulty of creating
the models. Usually, the model is supposed to imitate an existing object or create a structure
that satisfies defined requirements. However, the connection between system rules and
generated structures is not intuitive, making modelling difficult and usually requiring a
significant amount of trial and error. This is why the need to create L-systems from existing
examples automatically arose. The generation of a grammar from one or more samples is
called grammatical inference [7]. During this process, three elements of the L-system need to be proposed: an alphabet, an axiom, and a set of rewriting rules. Generating correct rules is the most challenging problem, because the possibilities are numerous and their number grows exponentially with the number of rules a system can have. That is why, except for small systems, this task is usually approached as a search problem, most commonly with metaheuristics, since they are designed to deal with problems with a large search space. The genetic algorithm is the metaheuristic most commonly associated with L-system inference research, and it was also used in this article as a good starting reference point.
2.3. Genetic Algorithms
Genetic algorithms are metaheuristics belonging to the family of evolutionary algorithms [8,9]. Inspired by naturally occurring evolution and natural selection, they are commonly used for optimization and search problems where the search space is extensive, exact methods are unavailable, or the time constraints are too strict. The main component of the algorithm is a population of individuals, each representing a specific solution to the problem, usually encoded as a set of values or symbols from the search space. The quality of such a solution is measured in terms of fitness by a problem-specific function. To improve the quality of individuals, a few genetic operators are employed: crossover, mutation, and selection. In each iteration of the GA, individuals are selected from the population for breeding and then subjected to crossover and mutation. If better solutions emerge, they are included in the next generation, and the process repeats.
2.4. Related Works
The attempts at single-sequence D0L-system inference can be generally divided into arithmetic and evolutionary approaches. In the first group, two algorithms have been proposed [10,11], but they were constrained to solving systems with only one or two rules, with the first one also requiring a known axiom. More attempts have been made using evolutionary algorithms. Some of them used genetic programming, including one of the earliest attempts [12], which managed to infer a single-rule Quadratic Koch Island system with a known axiom, and a newer approach [13], which used a genetic programming variant called bacterial programming and managed to infer systems with up to two rules. Others opted for genetic algorithms [14] or grammatical evolution with BNF grammar encoding [15]. However, both algorithms required either the axiom, or the axiom and the iteration number, to be known. Even though we are dealing with a particular type of L-system inference in this article, there is an abundance of work done for other types and applications of L-systems. One closely related research topic is grammar inference based on multiple sequences. Most interesting are the relatively recent articles by J. Bernard and I. McQuillan [16–18], on which our proposed algorithm is partly based. Their work also extends to different types of L-systems: context-sensitive [19], stochastic [20,21], and temporal [22].
2.5. Inferring Grammar from a Single Input Image
The proposed approach is to parse the input image into a sequence of symbols that
describe the geometrical structure generated from an L-system and then use a GA to
infer the system’s grammar (Figure 2). The respective steps of the proposed D0L-system
induction method are described below.
Figure 2. General outline of the grammar inference algorithm (read the input image → detect straight lines and build a model → generate the corresponding sequence → infer the grammar → generate the output).
2.6. Image Parsing
The parsing can be done using general line detection algorithms (like D-DBSCAN [23]), but the results are often not precise enough for this application; therefore, we have employed our own line detection algorithm.
The process of parsing the image into an input sequence is divided into three steps. First, all of the straight lines are detected in the image. Then, all of these lines are connected, building a model of the structure in the image, so that in the last step a sequence that accurately describes this model can be generated.
2.6.1. Straight Line Detection
For usage in our algorithm, we defined a line as a set of continuous points, understood as pixels in an image, each neighbouring at most two other points. A point with more than two neighbours is treated as a line intersection, and a point with only one neighbour, an edge point, is considered a line end. The neighbourhood is based on the Euclidean distance between points in the image: two points are neighbours if the distance between them is not greater than √2, which is the largest distance between two touching pixels in an image.
To detect straight lines in the image, the procedure traverses the image looking for a pixel that has two neighbours and therefore belongs to a straight line according to this definition. Starting from this point, consecutive connected pixels are added to the detected line for as long as they have only two neighbours. The line detection ends when, on both ends of the line, the algorithm detects either an edge point or an intersection (a multiple-neighbour point). Each visited point is also marked, so it is not under consideration when looking for the next lines.
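The tracing step can be sketched as follows (our own illustrative code, assuming the input is a set of foreground pixel coordinates; endpoint handling is simplified):

```python
# Illustrative line-tracing sketch (not the authors' implementation).
# A "line" is grown from a 2-neighbour pixel until an edge point or an
# intersection (more than two neighbours) is reached on both ends.
from math import dist

def neighbours(p, pixels):
    """Foreground pixels within Euclidean distance sqrt(2) of p."""
    return [q for q in pixels if q != p and dist(p, q) <= 2 ** 0.5]

def trace_line(start, pixels, visited):
    line = [start]
    visited.add(start)
    for first_step in neighbours(start, pixels):
        p = first_step
        while p not in visited and len(neighbours(p, pixels)) == 2:
            line.append(p)
            visited.add(p)
            nxt = [q for q in neighbours(p, pixels) if q not in visited]
            if not nxt:
                break
            p = nxt[0]
    return line

pixels = {(x, 0) for x in range(10)}   # a horizontal 10-pixel line
visited = set()
print(len(trace_line((5, 0), pixels, visited)))  # interior points traced
```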
One case that is not handled by the line definition is when a line changes direction.
Two connected lines might conform to the definition and be detected as a single line even
if they are clearly not the same line. The splitting of such lines is handled after the line
detection process finishes. Given that a line is a set of points, to find a change of direction,
we are looking for the largest continuous subsets of points that fit a linear model with some
acceptable error.
To check if a set of points fits a linear model, an algorithm based on a simple idea is used: if the subset contains points of a straight line, the line between the first and the last point goes through all of the remaining points. The algorithm takes the first and the last point of the subset and finds a linear model that fits those points. Afterwards, it checks whether the model fits the remaining points of the subset. A model fits a point when the distance between the point and the linear model is smaller than a specified acceptable error. However, when working with indexes of pixels in an image, the accuracy is often not enough to correctly match the line to the points. That is why the points are first cast into a virtual space with higher granularity, which is a parameter of the algorithm. For example, given a granularity of 2, a pixel is divided into a grid of 5 by 5 cells (Figure 3).
Figure 3. A few line points cast into virtual space with granularity = 2.
After casting each of the points into this virtual space, a pixel matches the linear model if the distance from any of its cells to the linear model is smaller than the acceptable error. This effectively allows us to work with a higher resolution than the original image and gives much more accurate results.
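A minimal sketch of this collinearity test follows (our own code; the exact cell layout per pixel is our assumption, chosen so that a granularity of 2 yields the 5 by 5 grid from the example above):

```python
# Illustrative collinearity check with a virtual higher-resolution space
# (our own sketch; the cell layout is an assumption based on the text).

def point_to_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
    den = ((by - ay) ** 2 + (bx - ax) ** 2) ** 0.5
    return num / den if den else ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5

def cells(pixel, granularity):
    """Subdivide a pixel into a (2g+1) x (2g+1) grid of virtual cells."""
    x, y, g = *pixel, granularity
    step = 1 / (2 * g + 1)
    return [(x + i * step, y + j * step)
            for i in range(-g, g + 1) for j in range(-g, g + 1)]

def fits_linear_model(points, granularity=2, max_error=0.5):
    """True if every point has at least one cell close enough to the line
    spanned by the first and last point of the subset."""
    a, b = points[0], points[-1]
    return all(
        min(point_to_line_distance(c, a, b) for c in cells(p, granularity))
        <= max_error
        for p in points[1:-1]
    )

print(fits_linear_model([(0, 0), (1, 1), (2, 2), (3, 3)]))  # True
```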
Due to inaccuracies in drawing straight lines, especially at the intersections of multiple lines, there are usually some points that cannot be assigned to any of the lines. These points are grouped together with their unassigned neighbours and memorized as intersection groups for later use. They will be used during model building as connectors between lines.
2.6.2. Model Building
Before a model can be built from the detected lines, two preprocessing steps must be performed. We are looking for a non-parametric system, which means that every line must be of the same length. However, in the image, multiple lines can appear in consecutive order without changing direction, and the line detection algorithm will detect them as a single long line. The first preprocessing step takes care of this problem and splits long lines so that each line in the model is of the same length.
Because every symbol representing the "turn-by-angle" action needs to be associated with a specific angle, in the second preprocessing step, the information about all of the angles between the lines needs to be retrieved from the image. The drawn lines are only an approximation of actual straight lines; therefore, to calculate an angle between them, we need to apply some rounding and cannot achieve very high precision. The result of this step is a set of unique tilt angles rounded to the closest k degrees.
After the presented pre-processing steps, a model of the detected structure can be
built. This is a recursive process that connects lines with their successors by finding the
edge point of the line and then looking in the set of unused lines for one that is connected
to it. A line is connected to an edge point when the edge point is a neighbour of one of its
points. However, an edge point might not be connected to any line. In this case, we can
search the set of intersection groups to check if the edge point is connected to any of them.
If that is true, it means that the line connects to an intersection and will be connected to any
other line that is also connected to this intersection.
2.6.3. Sequence Generation
The last step is the translation of the generated model of a structure into a sequence of
symbols. First, an alphabet needs to be generated. Some of the symbols are expected
to always appear in an alphabet. Those include a draw forward (‘+’) symbol, and if the
structure contains intersections, save the current position (‘[’) and return to the last saved
position (‘]’) symbols since they are required to produce a branching structure. Some
systems might use more than one draw-forward symbol; however, at this stage, it is not
possible to deduce this, so a placeholder is used for every possible actual symbol. The last
class of symbols that needs to be included in the alphabet is the turn symbols. During
the pre-processing step, a set of all unique tilt angles is gathered, and a unique symbol is
assigned for each value and added to the alphabet.
After creating the alphabet, a sequence can be generated. The algorithm traverses the
structure model starting from the first line. For each straight line, a draw forward symbol is
added to the sequence and the algorithm moves on to the next connected line. If there is a
change of direction between the current and the next line before translating the next part of
the model, an appropriate tilt-by-angle symbol is inserted. When the algorithm approaches
an intersection, a save position symbol is inserted, and each branch is translated, returning
to the previous position after finishing and moving on to the next branch. The branches
are processed in the order of the smallest absolute value of the tilt angle. The return to
the position symbol is not inserted at the end of the sequence since this information is
redundant and does not appear in practical systems.
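As an illustration of this traversal, consider the following sketch (our own simplified data model; for brevity, it also inserts the final return-to-position symbol that the real algorithm omits):

```python
# Illustrative sequence generation from a simple structure model
# (our own sketch; the model representation is an assumption).

def generate(node, model, angle_symbols):
    """node: current line id; model: id -> list of (tilt_angle, next_id).
    Branches are visited in order of the smallest absolute tilt angle."""
    out = ["F"]  # draw forward symbol for the current line
    branches = sorted(model.get(node, []), key=lambda b: abs(b[0]))
    for angle, nxt in branches:
        segment = []
        if angle != 0:
            segment.append(angle_symbols[angle])  # turn-by-angle symbol
        segment += generate(nxt, model, angle_symbols)
        if len(branches) > 1:  # intersection: wrap each branch in [ ... ]
            out += ["["] + segment + ["]"]
        else:
            out += segment
    return out

# A line that forks into two branches at +27.5 and -27.5 degrees.
model = {0: [(27.5, 1), (-27.5, 2)], 1: [], 2: []}
symbols = {27.5: "+", -27.5: "-"}
print("".join(generate(0, model, symbols)))  # F[+F][-F]
```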
2.7. Grammar Inference
After parsing the input image into a sequence of symbols, an attempt at grammar
inference can be made. There are many unknown variables, and the search space is large.
A genetic algorithm is proposed (Algorithm 1), but first, two techniques used for space
reduction need to be introduced.
2.7.1. Calculating Sequence Length at the nth Iteration of a System
For a given alphabet $V_p = \{\sigma_1, \sigma_2, \ldots, \sigma_n\}$, the Parikh vector of a sequence $w$ is a vector $P_w = [|S_{\sigma_1}|, |S_{\sigma_2}|, \ldots, |S_{\sigma_n}|]$, where the element $|S_{\sigma_i}|$ contains the number of appearances of the symbol $\sigma_i$ in this sequence. Let $P_{r_i}$ denote the Parikh vector of the successor of the rule $r_i$ and $P_{L_i}$ the Parikh vector of a sequence generated by a system in its $i$th iteration. Then, we can define a growth matrix

$$ I = \begin{bmatrix} P_{r_1} \\ P_{r_2} \\ \vdots \\ P_{r_n} \end{bmatrix}, \quad (1) $$
which allows us to calculate the Parikh vector of the sequence generated by the system in any iteration, which will be essential for calculating offspring fitness. The sequence Parikh vector can be calculated as follows:

$$ P_{L_k} I^m = P_{L_{m+k}}. \quad (2) $$
Using the growth matrix, we can check whether a set of rules $\omega$ can generate a sequence with a given Parikh vector $P_{L_m}$. To do this, we need to determine if there exists, for a given $m$, a Parikh vector $P_{L_0}$ such that Equation (2) is satisfied. If it does, then a system with the rules $\omega$ and an axiom with the Parikh vector $P_{L_0}$ can generate a sequence of the same length as the target sequence in $m$ iterations. If such a vector does not exist, we can find a vector with the closest sequence length by solving an integer programming problem:

$$ \max\ P_{L_0}[1], P_{L_0}[2], \ldots, P_{L_0}[n], \quad \text{subject to}\ \forall i \in \{1, \ldots, n\},\ P_{L_n}[i] \le P_w[i]. \quad (3) $$
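The following numpy sketch (our own illustration) builds the growth matrix of the reduced S3 system, i.e., with terminals ignored, and uses Equation (2) to predict symbol counts after three iterations:

```python
# Illustrative Parikh-vector / growth-matrix computation (our own sketch).
import numpy as np

def parikh(sequence, alphabet):
    """Count occurrences of each alphabet symbol in the sequence."""
    return np.array([sequence.count(s) for s in alphabet])

# Non-terminal alphabet and rules of the S3 system (terminals ignored).
alphabet = ["F", "X", "Y"]
rules = {"F": "FF-[XY]+[XY]", "X": "+FY", "Y": "-FX"}

# Growth matrix: row i is the Parikh vector of the successor of rule i.
I = np.array([parikh(rules[s], alphabet) for s in alphabet])

# Equation (2): P_L0 @ I^m gives the Parikh vector after m iterations.
P_L0 = parikh("F", alphabet)           # axiom A = F
P_L3 = P_L0 @ np.linalg.matrix_power(I, 3)
print(P_L3, P_L3.sum())                # non-terminal counts after 3 steps
```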
Algorithm 1: L-system inference
Input: sequence—the target sequence
Output: bestSpecimen—the best current solution
population ← generate initial population;
tabuList ← ∅;
i ← 0;
bestSpecimen ← ∅;
evaluate population fitness;
sequence ← remove terminal symbols from the sequence (Section 2.7.2);
while termination condition is not satisfied do
    if i mod 5 == 0 then
        population ← replace the worst 50% of population with new random individuals
    end
    forall ancestor in population do
        selected ← select another individual from the population;
        descendant ← crossover ancestor with selected;
        mutant ← mutate descendant;
        fitness ← calculate mutant fitness;
        if mutant has higher fitness than ancestor then
            if mutant has higher fitness than bestSpecimen then
                bestSpecimen ← mutant
            end
            replace ancestor with mutant;
        end
        add mutant to tabuList;
    end
    i ← i + 1
end
return bestSpecimen
2.7.2. System Independence from Terminal Symbols
Let us say that S is an L-system with an alphabet containing two terminal and two non-terminal symbols that generates a sequence L in the nth iteration. Knowing that a terminal character cannot be the base of a rule, we can notice that we can remove the terminal symbols and analyze a more straightforward case [16]. If the system S generates a sequence L and we remove the terminal characters from the rules of S, it will still produce the same sequence without the terminal symbols. For example,

S: {A: F, F → F+G+F, G → G−F−G}, L2 = F+G+F+G−F−G+F+G+F
Ŝ: {A: F, F → FGF, G → GFG}, L̂2 = FGFGFGFGF

We can see that, excluding the terminal symbols, the sequences L2 and L̂2 are equivalent. Thanks to this property, during system inference we can first solve a simpler problem without the terminal symbols and then gradually add the terminal symbols back to the system, obtaining subsequent partial solutions. The algorithm arrives at a full solution after restoring all of the terminal symbols. This requires more searches, but each has a significantly reduced search space.
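A small sketch of this reduction (our own code) strips the terminal symbols from both the rules and the generated sequence, reproducing the Ŝ example above:

```python
# Illustrative terminal-symbol removal (our own sketch).

def strip_terminals(text: str, non_terminals: set[str]) -> str:
    """Keep only non-terminal symbols."""
    return "".join(s for s in text if s in non_terminals)

non_terminals = {"F", "G"}
rules = {"F": "F+G+F", "G": "G-F-G"}

# Reduced system S-hat: terminals removed from every successor.
reduced = {base: strip_terminals(succ, non_terminals)
           for base, succ in rules.items()}
print(reduced)                                   # {'F': 'FGF', 'G': 'GFG'}
print(strip_terminals("F+G+F+G-F-G+F+G+F", non_terminals))  # FGFGFGFGF
```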
2.7.3. Genetic Algorithm
The task to be solved by this algorithm is as follows: find a set of rewriting rules that,
for some axiom, allows for the generation of the target sequence in the nth iteration. The
individuals are encoded using Parikh vectors of rewriting rules. Before the start of the
algorithm, all of the terminal symbols are removed from the target sequence in line with
the logic from Section 2.7.2. Therefore, the individuals’ Parikh vectors contain only counts
of non-terminal symbols.
Initial Population Generation
The exact number of rules is unknown; therefore, the population should contain individuals with different rule counts. An m/N ratio was adopted, where m represents the maximum number of rules and N is the size of the population. For each class of systems, rules are generated randomly, with lengths ranging from 1 to a specified maximum length value k. Due to the high cost of the fitness function, the population size has to be kept low. This might cause all of the individuals to converge to the same local optimum, which prevents the algorithm from exploring the whole search space. To solve this problem, the worst half of the population is replaced by new randomly generated individuals every five iterations.
Genetic Operators
This algorithm uses a typical crossover operator, with the offspring having some chance of receiving each rule from either of the parents, regulated by the crossoverRatio parameter. It needs to be noted that only parents with an identical rule count can be bred together. The mutation operator is implemented in the form of four independent operations: SWAP (swap the successors of two rules), ADD (add a random symbol to one of the rules), REMOVE (remove a random character from one of the rules with more than one symbol), and CHANGE (change one of the symbols in one of the rules into another). If an offspring is to be modified, each operation has an equal probability of being applied. In the case of L-systems, changing a single symbol rarely leads to a better result. That is why a memory mechanism has been introduced. Every visited solution is memorized, and if a mutation results in a previously encountered state, the operator is repeatedly applied until a new solution is obtained. The selection operator picks a random partner for every individual in the population, with candidates with higher fitness having a better chance of being selected. The algorithm terminates when a complete solution is found or the maximum iteration count has been reached.
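The mutation-with-memory mechanism might be sketched as follows (our own illustrative code; the four operations follow the description above, while the symbol set and encoding are placeholders):

```python
# Illustrative mutation with a tabu memory (our own sketch).
import random

SYMBOLS = "FXY"

def mutate_once(rules):
    """Apply one of SWAP / ADD / REMOVE / CHANGE, chosen uniformly."""
    new = dict(rules)
    keys = list(new)
    op = random.choice(["SWAP", "ADD", "REMOVE", "CHANGE"])
    a, b = random.sample(keys, 2) if len(keys) > 1 else (keys[0], keys[0])
    if op == "SWAP":
        new[a], new[b] = new[b], new[a]
    elif op == "ADD":
        i = random.randrange(len(new[a]) + 1)
        new[a] = new[a][:i] + random.choice(SYMBOLS) + new[a][i:]
    elif op == "REMOVE" and len(new[a]) > 1:
        i = random.randrange(len(new[a]))
        new[a] = new[a][:i] + new[a][i + 1:]
    else:  # CHANGE
        i = random.randrange(len(new[a]))
        new[a] = new[a][:i] + random.choice(SYMBOLS) + new[a][i + 1:]
    return new

def mutate(rules, tabu):
    """Re-apply the mutation until a previously unseen solution appears."""
    candidate = mutate_once(rules)
    while tuple(sorted(candidate.items())) in tabu:
        candidate = mutate_once(candidate)
    tabu.add(tuple(sorted(candidate.items())))
    return candidate

tabu = set()
print(mutate({"F": "FF", "X": "FY"}, tabu))
```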
Fitness Function
The selected fitness function executes in two phases. During the first stage (Algorithm 2),
the sequence length is considered, and the fitness can reach a maximum value of 1.0, which
signals that the generated sequence has reached the target length. If an individual reaches
ultimate fitness for the first stage, the second phase begins, where terminal symbols are
consecutively reinserted, and an exact sequence match is evaluated in each step. The fitness
increases for each correctly inserted terminal character.
The first phase of the fitness calculation evaluates whether the individual can generate
a sequence with the same length as the target sequence in N iterations. The closer the
sequence length is to the target sequence length, the higher the fitness. The fitness of the
individual for a given N can be evaluated using the method specified in Section 2.7.1 by
calculating the coefficient vector:

$$ I^N = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \Rrightarrow c = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix} = \begin{bmatrix} a_{11} + a_{12} + \cdots + a_{1n} \\ a_{21} + a_{22} + \cdots + a_{2n} \\ \vdots \\ a_{n1} + a_{n2} + \cdots + a_{nn} \end{bmatrix}, \quad (4) $$

and solving an integer programming problem:

$$ \max\,(c_0 x_0 + c_1 x_1 + \cdots + c_n x_n), \quad 0 \le c_0 x_0 + c_1 x_1 + \cdots + c_n x_n \le |L|, \quad \forall i \in \{0, \ldots, n\},\ x_i \le |L|, \quad (5) $$

that gives a solution in the form of $\vec{r} = x_0 x_1 \ldots x_n$, which is a Parikh vector of the system axiom. Then, the fitness of the individual is calculated as follows:

$$ \mathrm{Fitness} = 1.0 - \frac{|L| - \sum_{i=0}^{|\vec{r}|} c_i x_i}{|L|}. \quad (6) $$
However, since N is unknown, we must check multiple values. For every system, there is an iteration number M beyond which finding a valid axiom is no longer possible, and a value of N = 1 is not practically useful; therefore, we have to check values of N in the range ⟨2; M⟩ and take the N with the highest fitness value as the final result.
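As an illustration of this first phase, the sketch below (our own code; a tiny brute-force search stands in for the integer programming solver of Equation (5)) scores a rule set against a target sequence length:

```python
# Illustrative first-phase fitness (our own sketch; brute force replaces
# the integer programming solver of Equation (5)).
import itertools
import numpy as np

def phase1_fitness(growth, target_len, iterations):
    """Fitness per Equation (6): how close the best achievable sequence
    length after `iterations` steps comes to the target length."""
    # c[i] = row sum of I^N: total successor length contributed by symbol i.
    c = np.linalg.matrix_power(growth, iterations).sum(axis=1)
    best = 0
    # Brute-force axiom Parikh vectors x with sum(c * x) <= target_len.
    bound = target_len // int(c.min()) + 1
    for x in itertools.product(range(bound), repeat=len(c)):
        total = int(np.dot(c, x))
        if 0 < total <= target_len:
            best = max(best, total)
    return 1.0 - (target_len - best) / target_len

# Growth matrix of the reduced S3 system (rows: F, X, Y successors).
I = np.array([[2, 2, 2], [1, 0, 1], [1, 1, 0]])
print(phase1_fitness(I, target_len=72, iterations=3))  # 1.0: exact match
```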
Algorithm 2: Fitness function
Input: candidate—the subject individual
Output: fitness—the individual fitness value
candidates ← ∅;
iteration ← 2;
while lastFitness ≠ 0 do
    currentCandidate ← individual candidate with iteration number iteration;
    coefficients ← calculate coefficients vector for currentCandidate (Equation (4));
    axiom ← calculate individual axiom based on coefficients vector coefficients;
    fitness ← calculate individual fitness (Equation (6));
    lastFitness ← fitness;
    if fitness ≠ 0 then
        candidates ← add currentCandidate to the set
    end
    iteration ← iteration + 1
end
best ← best candidate from the set candidates;
if fitness of best == 1 then
    return result of the second phase of fitness function for best;
end
return fitness of best;
The second phase of the fitness calculation evaluates whether the individual can correctly generate a sequence that exactly matches the target sequence. The process runs in a few iterations, and in each one, a single terminal symbol is reinserted into the target sequence. Then, we try to find how to insert the new character into the rules and the axiom so that the system can generate the target sequence. The general outline of a single iteration is pictured in Figure 4. The initial state is the result of the previous iteration. For the sequences to match, they must contain the same number of each symbol. Therefore, there is a finite set of combinations in which we can insert the new character into the rules so that the Parikh vectors of the sequences are equal. In step 2, the algorithm calculates all of the possible combinations by calculating the coefficients vector W:
$$ W = \sum_{i=1}^{k} P_A I^{i-1} = \begin{bmatrix} W_0 & W_1 & \ldots & W_n \end{bmatrix}, \quad (7) $$
and solving an integer programming problem:
$$ W_0 x_0 + W_1 x_1 + \cdots + W_n x_n + x_A = |L|, \quad \forall i \in \{0, \ldots, n\},\ x_i \le |L|, \quad 0 \le x_A \le |L|, \quad (8) $$
which gives us a set of vectors [x_0 x_1 . . . x_n x_A], where x_n and x_A are the occurrence counts of the terminal symbol in the nth rule and in the axiom, respectively. Because the number of combinations can be large, we take the ten simplest ones for the best results. Now we know how many symbols to insert, but not where. To avoid exploring all of the possibilities, since the rules' successors must appear in the target sequence, we can reduce the search space by only using the appropriate subsequences of the target sequence, which is done in step 3. From the found subsequences, in step 4, the algorithm generates a population for the GA. Since the axiom does not appear in the sequence, we only know the number of symbols to be added but not their positions; therefore, the symbols are randomly inserted. In the last step, a GA finds a system that can recreate the target sequence using the generated initial population.
Figure 4. General outline of a single iteration of the second phase of the fitness function.
A typical crossover operator has been employed, with the descendant receiving each rule from one of the parents according to a selected ratio. A mutation operator can permute the non-terminal symbols of one or more rules and of the system axiom. Individuals are picked for breeding using elitist selection, with the top 10% of the population moving on to the next generation unchanged. The chosen fitness function compares the generated sequence and the target sequence symbol by symbol. If a character at a given position matches, the individual receives one point. It needs to be noted that the output of the image parsing algorithm contains only a single type of non-terminal symbol; therefore, every non-terminal character receives a point for matching with it. The final fitness is the ratio between the received points and the total sequence length. Because the target sequences are usually very long, an optimization was applied, where only the first 100 symbols are compared, and only if those match is the rest of the string validated. When the entire sequence is correctly matched, the individual can increase its fitness value by a maximum of 1.0, relative to the simplicity of the system, evaluated as:
the system evaluated as:
| L| − | A|
.
(9)
Fitness =
| L|
The algorithm terminates when it finds a solution or reaches the maximum iteration
count.
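The symbol-by-symbol comparison with the first-100 shortcut might look like this (our own sketch; NONTERMINAL stands in for the single non-terminal placeholder produced by the image parser):

```python
# Illustrative second-phase sequence-match fitness (our own sketch).
NONTERMINAL = "N"  # placeholder non-terminal produced by the image parser

def matches(g: str, t: str) -> bool:
    """Exact match, or any non-terminal (a letter here) vs. the placeholder."""
    return g == t or (t == NONTERMINAL and g.isalpha())

def match_fitness(generated: str, target: str) -> float:
    """Ratio of matching positions. Only the first 100 symbols are compared;
    the rest of the string is validated only when that prefix fully matches."""
    if len(generated) != len(target):
        return 0.0
    pairs = list(zip(generated, target))
    prefix = pairs[:100]
    if all(matches(g, t) for g, t in prefix):
        scored = pairs        # prefix matched: validate the whole string
    else:
        scored = prefix
    points = sum(matches(g, t) for g, t in scored)
    return points / len(target)

print(match_fitness("F+G-F", "N+N-N"))  # 1.0
```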
If we are looking for a single rule system, an additional optimization can be applied.
Since the system axiom must start with a non-terminal symbol and a single rule cannot start
with a terminal character, we can conclude that any generated sequence must begin with
the rule’s successor. Therefore, in step 3, instead of searching the whole string for matches,
we can analyze only the first subsequence of adequate length and avoid steps 4 and 5 since
there is only one matching subsequence, and we can outright check if it generates a correct
sequence.
3. Results
Multiple L-systems found in the literature were selected as example inputs to test the algorithm's efficiency. To be picked, a grammar could not contain non-graphical symbols in its alphabet and had to generate a structure with no intersecting lines. The following systems were used: Koch Island [6,12,24], Koch Snowflake [24], Koch Curve [25], Barnsley Fern [6,13], Sierpiński Triangle [6], and Binary Tree [13]. From each selected system, an example image was generated by our plotting program and used during testing (such an approach is called grammar inference on synthetic images [26]). All of the tests were run on an AMD Ryzen 9 3900X PC with 16 GB of RAM. GPU acceleration and multi-threading were not used.
3.1. Grammar Inference
The tests for the provided examples succeeded in every case, including those successfully used in [13]. An initial population of 20 was used, with a mutation probability of 0.7 and a crossover ratio of 0.5. These parameters were selected during the initial experiments. The inferred systems were an exact match to the originals, with some minor notation differences that did not alter the system functionality.
The algorithm was also tested using examples from the LGIN tool [11], and the results were compared. A solution was found in every case; however, the runtime was longer. This was to be expected, since the LGIN tool uses an arithmetic approach instead of a search algorithm, which allows for faster execution but requires multiple constraints to be applied: only one- or two-rule systems can be inferred, with a known axiom and rule count.
To compare our algorithm to the approach of Runqiang et al. [14], we used the same
two examples with one and two rules, with the single-rule L-system being an equivalent of
the Ex01 system from the LGIN test suite and the second system being given as:
{A : X, F → FF, X → F[+X]F[−X]X}.
The original algorithm found a solution in every run for the first system and in 66%
of the runs for the second system. Meanwhile, our approach found a solution for both
systems in every run. Moreover, our algorithm ran for fewer iterations than the original
one (Table 1).
Table 1. Results of comparison with the algorithm from [14].

            Iteration Count of Own Algorithm         Iteration Count of GA from [14]
            Minimum   Average   Maximum   σ ¹        Minimum   Average   Maximum
System A       1        1.7        5      1.34          1        10.8       38
System B       8       31.5       70     30.79         32        53.5       97

¹ Standard deviation.
3.2. More Complex Systems
The main advantage of our approach over the related arithmetic and heuristic algorithms described in Section 2.4 is its ability to work on systems with more than two rules. To test this capability, the system S3 with three rewriting rules from Section 2.1 was used, and our algorithm successfully inferred the original grammar. Even though it took longer than for the simpler systems (1105 iterations in 25 min), this is a significant improvement over the mentioned algorithms, which infer grammars with at most two rules.
3.3. Runtime Distribution
To test the replicability of our results, the runtime distribution of the GA was measured on systems with one and two rewriting rules. Test examples were taken from [14]. As seen in Figure 5, for a single-rule system, most algorithm executions ran for a similar amount of time, around 100 ms, with very few stragglers running for more than 600 ms. For a system with two rules (Figure 6), we can notice similar behavior; however, here, a significant number of runs finished quickly, meaning the initial population already contained a candidate with very high fitness. This lets us conclude that the algorithm has a low tendency to get stuck in areas of the search space containing candidates with low fitness.
Figure 5. Runtime distribution for a single-rule system.
Figure 6. Runtime distribution for a two-rule system.
3.4. Koch Island
The effectiveness of the proposed algorithm was compared to the genetic programming method developed in [12], one of the first attempts at inferring D0L-systems from a single sequence. Since then, multiple new solutions have been proposed; however, it remains one of the better-documented articles, providing various performance metrics that allow for a comprehensive comparison. The first difference in the results can be seen in the initial population generation. In the article mentioned above, it is stated that the members of the initial population of 4000 are not very good on average, with most placed around the middle of the fitness scale and the worst 12% of the population having the highest possible (the worst) fitness value. In our proposed algorithm, the population is much smaller; the tests were run using only 20 individuals. However, the generation is more effective: out of 1000 executed tests, only 32.7% required more than one iteration to reach a solution.
Looking at the heatmap that shows the progress of the hits histograms [12] (Figure 7) and the changes of the best and average fitness (Figure 8), we can notice that while in Koza's algorithm the progress is very steady and consistent (Figures 9 and 10), in our case it is slower but tends to take more significant leaps in quality.
Figure 7. Hits heatmap of our algorithm.
Figure 8. Fitness progression of our algorithm.
Figure 9. Fitness progression of the algorithm from [12].
Figure 10. Hits histograms of the algorithm from [12].
3.5. Genetic Programming Using BNF Grammar
Finally, a comparison to the algorithm from the article by D. Beaumont and A. Stepney [15] was made. For this purpose, two example L-systems were used. The first one has a single rule and is given as

{A → F, F → F[+F]F[−F]F}. (10)

The second one is equivalent to the Ex04Y system used for tests by the LGIN tool. In the first case, both algorithms arrived at a solution, although the compared algorithm returned multiple solutions, some of them very long or containing redundant symbols. In the second case, our algorithm managed to find a correct system every time; meanwhile, the compared algorithm achieved the same feat in only 2 of 200 test runs. The runtime was also much shorter: on average, the compared algorithm ran for several CPU-days for each test and required 891 iterations, while our algorithm completed the whole test suite of 30 runs in around 1 h and found a solution on average in 35 iterations.
3.6. Crossover between Individuals with Different Rule Counts
Since the selected crossover function operates only on individuals with the same rule count, two modifications have been tested. The main issue with crossover between individuals with different rule counts is that individuals with more rules will have a larger alphabet and use symbols that are not valid for those with fewer rules. Therefore, the first modification allowed a crossover when the second individual had the same number of rules as, or fewer rules than, the main individual. This resulted in slightly worse performance. The tests consisted of running the inference algorithm on System A from Section 3.1 for 1000 iterations. The modified crossover function resulted in an average algorithm runtime of 166.42 ms, while the original function achieved an average runtime of 161.04 ms.
The second modification further relaxed the constraints and allowed crossover between individuals with any rule count. To achieve this, a post-processing step had to be added, which, if the second individual had more rules, replaced the excess symbols with a random symbol from the smaller individual's alphabet. This resulted in worse performance than the previous modification, with an average runtime of 169.84 ms. Overall,
the crossover constrained to individuals with the same rule count seems to work best, possibly because the exchanged rules already fit the given class of systems.
3.7. Comparison with Generational GA Approach
Our proposed solution uses a steady-state GA (SSGA) [27], in which only individuals that are better than their parents are inserted into the population. This approach has been compared to a generational GA (GGA), which replaces the whole population in each generation. The comparison was made using the same tests as in Section 3.6. The results show a promising improvement, with the GGA achieving an average runtime of 147.5 ms compared to the 161.04 ms of the SSGA. This shows that enhancements to the breeding scheme can further improve the performance of the inference algorithm.
4. Discussion
An algorithm for image-based grammatical inference of deterministic, context-free L-systems was proposed. The effectiveness of this approach was compared to multiple test results of comparable algorithms and tested on our own examples. The results show that the algorithm performs better than existing heuristic techniques and can find solutions for the same problems as the arithmetic approaches. A significant improvement over previous methods has been made, proving that solving inference problems for systems with more than two rules is possible. However, further research is still needed. The GA's fitness function is effective but computationally costly, which implies that optimizations in this area could lead to an algorithm that can solve systems with an even higher rule count in a reasonable time. Further improvements to the fitness function or the encoding scheme should also be researched, studying whether fitness progress can be made faster and more gradual, eliminating the frequent large jumps or decreasing the number of runs that take much longer than average. Some of the compared algorithms work faster under certain conditions, and incorporating some of their ideas into the fitness function might lead to quicker computation. Most importantly, this research shows that further advancements in single-sequence grammatical inference for D0L-systems are possible, and new solutions can provide better results, solving even more complex systems.
Author Contributions: Conceptualization, M.Ł. and O.U.; methodology, M.Ł.; software, M.Ł.; validation, M.Ł.; formal analysis, M.Ł.; investigation, M.Ł.; resources, M.Ł. and O.U.; data curation, M.Ł.;
writing—original draft preparation, M.Ł. and O.U.; writing—review and editing, M.Ł. and O.U.;
visualization, M.Ł.; supervision, O.U.; and project administration, O.U. All authors have read and
agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Komosinski, M.; Adamatzky, A. Artificial Life Models in Software; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009.
2. Langton, C. Artificial Life: Proceedings of an Interdisciplinary Workshop on the Synthesis and Simulation of Living Systems; Routledge: Oxfordshire, UK, 2019.
3. Rozenberg, G.; Salomaa, A. The Mathematical Theory of L Systems; Academic Press: Cambridge, MA, USA, 1980.
4. Parish, Y.I.; Müller, P. Procedural modeling of cities. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 12–17 August 2001; pp. 301–308.
5. Manousakis, S. Non-standard Sound Synthesis with L-systems. Leonardo Music J. 2009, 19, 85–94. [CrossRef]
6. Prusinkiewicz, P. Graphical applications of L-systems. In Proceedings of the Graphics Interface, Vancouver, BC, Canada, 26–30 May 1986; Volume 86, pp. 247–253.
7. De la Higuera, C. Grammatical Inference: Learning Automata and Grammars; Cambridge University Press: Cambridge, UK, 2010.
8. Whitley, D. An overview of evolutionary algorithms: Practical issues and common pitfalls. Inf. Softw. Technol. 2001, 43, 817–831. [CrossRef]
9. Abdel-Basset, M.; Abdel-Fatah, L.; Sangaiah, A.K. Metaheuristic algorithms: A comprehensive review. In Computational Intelligence for Multimedia Big Data on the Cloud with Engineering Applications; Academic Press: Cambridge, MA, USA, 2018; pp. 185–231.
10. Santos, E.; Coelho, R.C. Obtaining l-systems rules from strings. In Proceedings of the 2009 3rd Southern Conference on Computational Modeling, Rio Grande, Brazil, 23–25 November 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 143–149.
11. Nakano, R.; Yamada, N. Number theory-based induction of deterministic context-free L-system grammar. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, Valencia, Spain, 25–28 October 2010; SCITEPRESS: Setúbal, Portugal, 2010; Volume 2, pp. 194–199.
12. Koza, J.R. Discovery of rewrite rules in Lindenmayer systems and state transition rules in cellular automata via genetic programming. In Proceedings of the Symposium on Pattern Formation (SPF-93), Claremont, CA, USA, 13 February 1993; Citeseer: Princeton, NJ, USA, 1993; pp. 1–19.
13. Eszes, T.; Botzheim, J. Applying Genetic Programming for the Inverse Lindenmayer Problem. In Proceedings of the 2021 IEEE 21st International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 18–20 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 000043–000048.
14. Runqiang, B.; Chen, P.; Burrage, K.; Hanan, J.; Room, P.; Belward, J. Derivation of L-system models from measurements of biological branching structures using genetic algorithms. In Developments in Applied Artificial Intelligence: 15th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems IEA/AIE 2002, Cairns, Australia, 17–20 June 2002; Springer: Cham, Switzerland, 2002; pp. 514–524.
15. Beaumont, D.; Stepney, S. Grammatical Evolution of L-systems. In Proceedings of the 2009 IEEE Congress on Evolutionary Computation, Trondheim, Norway, 18–21 May 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 2446–2453.
16. Bernard, J.; McQuillan, I. New techniques for inferring L-systems using genetic algorithm. In Proceedings of the Bioinspired Optimization Methods and Their Applications: 8th International Conference, BIOMA 2018, Paris, France, 16–18 May 2018; Springer: Cham, Switzerland, 2018; pp. 13–25.
17. Bernard, J.; McQuillan, I. A fast and reliable hybrid approach for inferring L-systems. In Proceedings of the ALIFE 2018: The 2018 Conference on Artificial Life, Tokyo, Japan, 23–27 July 2018; MIT Press: Cambridge, MA, USA, 2018; pp. 444–451.
18. Bernard, J.; McQuillan, I. Techniques for inferring context-free Lindenmayer systems with genetic algorithm. Swarm Evol. Comput. 2021, 64, 100893. [CrossRef]
19. McQuillan, I.; Bernard, J.; Prusinkiewicz, P. Algorithms for inferring context-sensitive L-systems. In Proceedings of the Unconventional Computation and Natural Computation: 17th International Conference, UCNC 2018, Fontainebleau, France, 25–29 June 2018; Springer: Cham, Switzerland, 2018; pp. 117–130.
20. Bernard, J.; McQuillan, I. Inferring stochastic L-systems using a hybrid greedy algorithm. In Proceedings of the 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), Volos, Greece, 5–7 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 600–607.
21. Bernard, J.; McQuillan, I. Stochastic L-system inference from multiple string sequence inputs. Soft Comput. 2022, 27, 6783–6798. [CrossRef]
22. Bernard, J.; McQuillan, I. Inferring Temporal Parametric L-systems Using Cartesian Genetic Programming. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 580–588.
23. Lee, S.; Hyeon, D.; Park, G.; Baek, I.-j.; Kim, S.W.; Seo, S.W. Directional-DBSCAN: Parking-slot detection using a clustering method in around-view monitoring system. In Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, 19–22 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 349–354.
24. Prusinkiewicz, P.; Hanan, J. Lindenmayer Systems, Fractals, and Plants; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 79.
25. Purnomo, K.D.; Sari, N.P.W.; Ubaidillah, F.; Agustin, I.H. The construction of the Koch curve (n, c) using L-system. In AIP Conference Proceedings; AIP Publishing LLC: Woodbury, NY, USA, 2019; Volume 2202, p. 020108.
26. Guo, J.; Jiang, H.; Benes, B.; Deussen, O.; Zhang, X.; Lischinski, D.; Huang, H. Inverse procedural modeling of branching structures by inferring L-systems. ACM Trans. Graph. (TOG) 2020, 39, 1–13. [CrossRef]
27. Syswerda, G. A study of reproduction in generational and steady-state genetic algorithms. In Foundations of Genetic Algorithms; Elsevier: Amsterdam, The Netherlands, 1991; Volume 1, pp. 94–101.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
Minoan Cryptanalysis: Computational Approaches to
Deciphering Linear A and Assessing Its Connections with
Language Families from the Mediterranean and the Black
Sea Areas
Aaradh Nepal 1, * and Francesco Perono Cacciafoco 2, *
1 School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798, Singapore
2 Department of Applied Linguistics (LNG), School of Humanities and Social Sciences (HSS), Xi'an Jiaotong-Liverpool University (XJTLU), Suzhou 215123, China
* Correspondence: [email protected] (A.N.); [email protected] (F.P.C.)
Abstract: During the Bronze Age, the inhabitants of regions of Crete, mainland Greece, and Cyprus
inscribed their languages using, among other scripts, a writing system called Linear A. These symbols,
mainly characterized by combinations of lines, have, since their discovery, remained a mystery. Not
only is the corpus very small, but it is challenging to link Minoan, the language behind Linear A, to
any known language. Most decipherment attempts involve using the phonetic values of Linear B, a
grammatological offspring of Linear A, to ‘read’ Linear A. However, this yields meaningless words.
Recently, novel approaches to deciphering the script have emerged which involve a computational
component. In this paper, two such approaches are combined to account for the biases involved
in provisionally assigning Linear B phonetic values to Linear A and to shed more light on the
possible connections of Linear A with other scripts and languages from the region. Additionally, the
limitations inherent in such approaches are discussed. Firstly, a feature-based similarity measure
is used to compare Linear A with the Carian Alphabet and the Cypriot Syllabary. A few Linear A
symbols are matched with symbols from the Carian Alphabet and the Cypriot Syllabary. Finally,
using the derived phonetic values, Linear A is compared with Ancient Egyptian, Luwian, Hittite,
Proto-Celtic, and Uralic using a consonantal approach. Some possible word matches are identified
from each language.
Keywords: Linear A; Minoan; cryptanalysis; computational linguistics; language decipherment
1. Introduction
In 1900, Sir Arthur Evans, while excavating the Knossos Palace in Crete, unearthed clay tablets with unknown scripts on them. The writings belong to a family of scripts used in Crete, mainland Greece, and Cyprus [1]. Two of these scripts show many similarities: the older one, Linear A, was used between 1700 and 1450 BCE and is yet to be deciphered [2], while the other, Linear B, which seems to be the grammatological offspring of Linear A, was deciphered in 1952 by Michael Ventris [3]. Linear A also served as a model for another script near the end of the Bronze Age: Cypro-Minoan, which was used by the pre-Greek inhabitants of Cyprus. Cypro-Minoan, in turn, served as a model for the Cypriot Syllabary, a script used by the locals to write their own dialect of Greek [1].
Sir Evans’ choice of the name ‘Linear’ stems from the fact that both Linear A and
B consist of only lines inscribed in clay [4]. Since their discovery, however, Linear A
inscriptions have also been found on artefacts such as vases, jewelry, and other objects
in different locations including Cyprus, mainland Greece, Turkey [5], and other Aegean
islands (Kea, Kythera, Melos, and Thera) [6]. The corpus, altogether, currently consists of
about 7150 signs inscribed on 1427 artefacts [1].
Most decipherment attempts begin by attributing Minoan, the language behind Linear
A, to a known language family. Scholars have hypothesized links between Minoan and the
Indo-European languages, the Semitic languages, the Tyrsenian languages, and the Uralic
languages, among others. However, most such arguments are met with skepticism, as these
attributions only yield a limited number of meaningful results [7]. Furthermore, a major
(‘fatal’) challenge in the decipherment process is our inability to ‘read’ Linear A. Although
there exist reasonable justifications to assign Linear B phonetic values onto Linear A for this,
such an approach still produces meaningless words and has not been proven to be reliable.
Recently, novel attempts to decipher the script have emerged, which usually involve an algorithmic component. In this paper, we propose the combination of two such approaches, by Loh and Perono Cacciafoco [7] and by Revesz [8], and the reasons for this are two-fold. Firstly, we aim to overcome the limitations involved in provisionally assigning Linear B phonetic values to Linear A and, secondly, we aim to shed more light on the possible connections between Linear A and other writing systems and languages from the Mediterranean and the Black Sea areas.
This paper is organized as follows. Section 2 outlines the main challenges in deciphering Linear A, along with some past attempts. Section 3 introduces and gives an overview of our proposed approach. Section 4 presents the methodological details of this approach. Section 5 presents the results obtained when comparatively assessing the writing systems (the Carian Alphabet and the Cypriot Syllabary) and the languages (Ancient Egyptian, Luwian, Hittite, Proto-Celtic, and Uralic), and Section 6 discusses their implications. Section 7 concludes the paper and invites further work.
2. Some of the Past Decipherment Attempts
The main challenge with deciphering Linear A begins with our inability to ‘read’ it.
Since its phonetic values are unknown, analytical attempts are likely to be unproductive.
To address this, one approach, as aforementioned, has been to assign the phonetic values of
Linear B to Linear A. Not only is Linear B largely derived from Linear A, but there exist
visual similarities among signs in these two systems. Hence, it is reasonable to approach
the decipherment of Linear A by assigning to it the phonetic values of Linear B. However,
as discussed, although this allows for the ‘reading’ of Linear A, it has not proven to be very
reliable, as it has, so far, only produced meaningless words [9].
The other challenge lies in the fact that the language behind the Linear A signs is
unknown. Attempts to link so-called ‘Minoan’ to other known languages have remained
unsatisfactory. Among the numerous hypotheses, there appear to be four main language families that scholars argue have a connection with Minoan: the Indo-European, Semitic, Tyrsenian, and Uralic families. Vladimir Ivanov Georgiev, one of the scholars who suggests an Indo-European connection, posits that the Linear A tablets, specifically the ones from Hagia Triada, encode Ancient Greek. He also believes that the other Linear A documents were transcribing Hittite–Luwian [10]. Other scholars similarly suggest
an Indo-European connection. Gregory Nagy, for instance, conducted a comparative
analysis of Linear A and Linear B by looking into the visual compatibilities between them,
demonstrating Minoan’s Indo-European-like features [11]. Similarly, Gareth A. Owens, by
using phonetic values from Linear B and the Cypriot Syllabary, postulated that Minoan
could be related to Sanskrit or Latin [12]. Leonard R. Palmer, another prominent scholar,
suggested the possibility of Minoan being an Anatolian language linked to Luwian [13].
Palmer’s theory stemmed from the two historically reconstructed invasions of Crete and
Greece by the Luwians during the time when Linear A was adopted. Furthermore, statistical
techniques applied to the frequency analysis of symbols and grammatological comparisons
have also been considered for studying Indo-European links. Most notably, Hubert La
Marle used such techniques to derive conclusions that suggest an Indo-Iranian connection
for Minoan [14–17]. These theories, however, have remained controversial and unproven.
Palmer’s work, specifically, was criticized for relying on his subjective interpretation of
the tablets, which led to varied interpretations [7]. La Marle’s work, similarly, has been
contested by John Grimes Younger due to questionable comparisons with various writing
systems from different origins [18].
Other scholars argue for possible connections between Minoan and the Semitic language family. Cyrus H. Gordon, one of the first to propose this link, also assigned Linear
B phonetic values to Linear A signs and discovered words in Linear A that appeared to
be similar to words from the Semitic language family [19]. However, Gordon’s approach
was also met with skepticism. Critics have argued that because the matches identified were
mainly vocabulary items, the reliability of the language family connection is compromised,
as they could be Semitic lexical borrowings rather than examples of Linear A. Additionally,
Gordon’s methodology involved associating elements to several Semitic languages, such as
Akkadian and Canaanite. The fact that the comparison was not carried out with one specific language led scholars to consider the Semitic hypothesis unsuccessful [20]. Jan Best’s
attempts at postulating Phoenician as the ancestor of Linear A were similarly countered by
scholars who highlighted the lack of linguistic evidence supporting the Semitic link [21].
Eu Min et al. also investigated the plausibility of a Semitic link with their study of Linear
A libation tables [20]. Although their research pointed towards a possible connection, the
result was not significant enough—indeed, they produced negative results, which, in their
conclusions, led to their exclusion of a Semitic option.
The third language family that has received consideration for its possible connections
with Minoan is the Tyrsenian one. Helmut Rix was the scholar who theorized this language
family’s existence, which would include, according to him, Etruscan (spoken in central Italy
between around 700 BC and 50 AD), Lemnian (spoken on the island of Lemnos around
the VI Century BC), and Rhaetic (spoken in the Eastern Alps between the I millennium
BC and the III century AD) [22]. Giulio Mauro Facchetti, one of the first to propose the
connection, hypothesized relationships between Etruscan, Lemnian, and Minoan [23].
Facchetti suggested that Minoan could be the ancestor of the proto-Tyrsenian branch of
languages from which Etruscan, Lemnian, and Rhaetic were derived. This also meant,
then, according to Facchetti, that Minoan would be the ancestor of the Eteocretan branch,
which he assumes is different from the other Cretan branch [24]. James Mellaart extended
this work by positing a connection between Etruscan, Lemnian, and Rhaetic and pre-Indo-European Anatolian languages by studying Anatolian place names [25]. However, due to a
lack of proper verification of the plausibility of connections among Etruscan, Lemnian, and
Rhaetic [21], the Tyrsenian argument remains disputed.
Another approach to using Linear B phonetic values to attempt to decipher Linear A
has involved an analytical interpretation of the symbols shared between the two writing
systems by John Grimes Younger. Younger attempted to discover words and names by
assigning the Linear B phonetic values to Linear A. He was able to recognize a few possible
Linear A toponyms through a comparison with Mycenaean Greek, along with a positional
and frequency study of place names in Linear B (and Linear A) tablets [26]. However,
this comparative examination between Linear A and Linear B, although logical and well-grounded, unfortunately did not yield decisive results.
A recent decipherment approach [7] proposes comparing Linear A with other language
families according to their grammatological elements through a ‘brute force attack’. The
method, originating from cryptanalysis, involves assigning Linear B phonetic values to
Linear A and then comparing the consonant clusters of Linear A with the consonant
clusters of other languages from the region. A set of dictionaries of various languages
stored in spreadsheet files is used as the input for a Python program which performs this
comparison. The ‘consonantal’ approach is declared to be effective, because consonant
clusters are, presumably, more stable and consequently allow for the easier analysis of the
morphological parts of a language.
Peter Z. Revesz [8] proposed another approach, which involves comparing Linear A
to other writing systems visually, by using a feature-based similarity measure. This novel
algorithm was employed to develop a new phonetic grid for Linear A, which was then
used to generate a Minoan–Uralic dictionary. According to Revesz, he was able to translate
twenty-eight Linear A inscriptions. More recently, Revesz also pointed to archaeogenetic
evidence that suggests Minoans may have originated in the Danube Basin and the Western
Black Sea coastal area [27], which could further suggest a Minoan–Uralic connection, given
earlier and newer publications that argue that the Proto-Uralic people once lived in the
Black Sea area [28–31].
Another important challenge in deciphering Linear A, however, is simply that the
corpus is very small. There are currently about 7150 Linear A signs inscribed on 1427 artefacts [1]. This is in contrast with the larger corpus of Linear B, comprising about 30,000 signs
at the time it was deciphered [26].
3. Proposed Approach
First, Linear A is compared with other writing systems in the region using the feature-based similarity measure proposed by Revesz [8]. Second, the phonetic values of those
writing systems that are visually similar are assigned to Linear A. These two steps are
to ensure that the potential limitations involved with provisionally assigning Linear B
phonetic values to Linear A are accounted for. Finally, Linear A is compared with other
languages using the consonantal approach/‘brute force attack’ proposed by Loh and Perono
Cacciafoco [7].
3.1. Writing Systems Compared with Linear A
The writing systems that will be compared visually to Linear A include the Cypriot
Syllabary and the Carian Alphabet. Some scholars have previously assigned phonetic
values from the Cypriot Syllabary onto Linear A for its analysis. Most notably, Owens [12]
used phonetic values from the Cypriot Syllabary and Linear B to hypothesize possible links
between Minoan and the Indo-European language family. Since Linear A was used as a
model for Cypro-Minoan which, in turn, was used to model the Cypriot Syllabary, this
paper aims to further explore the relationships between the two.
The Carian Alphabet, similarly, is argued to be linked to the Cretan Scripts’ family
which, among other writing systems, includes Linear A and Linear B [32]. Revesz discusses the possible connections between Old Hungarian and the Carian Alphabet using a
feature-based similarity measure and postulates that the Carian Alphabet is an ancestor
of Old Hungarian. Therefore, a possible link between the Carian Alphabet and Linear A
is considered.
3.2. Languages Compared with Linear A
Adopting the ‘brute-force attack’ proposed by [7], the languages/language clusters
compared using the consonantal approach include Ancient Egyptian, Luwian, Hittite,
Proto-Celtic, and Uralic, which belong, largely, to three language families: Indo-European,
Afro-Asiatic, and Uralic. Since a considerable number of decipherment attempts suggest the
possibility of Minoan belonging to the Indo-European language family, this paper aims to
explore this further, with Luwian and Hittite. With Ancient Egyptian, it aims to investigate
possible connections between the Minoans and the Egyptians. Sir Arthur Evans posited
that the interaction between Crete and Egypt began during the third millennium BC [33].
Archeological evidence also strongly suggests a link between the two. Thus, we propose
a further analysis of the possible connections between their languages. We also include a
comparison with Proto-Celtic, which, although it does not have an apparent relation to
Linear A, allows us to leverage the unbiased and universally applicable ‘brute-force’ nature
of the consonantal approach. Finally, we also aim to further explore the Minoan–Uralic
connection mentioned above.
4. Methods
4.1. Deriving the Phonetic Values
We use the feature-based similarity measure proposed by Revesz [8] to derive a new
phonetic grid for Linear A. It has the following components:
1. Similarity Function
To compute the similarity between any two symbols, we let $S = \{s_1, s_2, s_3, \ldots, s_n\}$ be a set of $n$ symbols, $F = \{f_1, f_2, f_3, \ldots, f_m\}$ be a set of $m$ elementary features, and $T : (S, F) \to \{0, 1\}$ be a function that maps a symbol–feature pair to either 0 or 1, depending on whether that symbol has that feature. Then,
$$\mathrm{sim}(s_i, s_j) = \sum_{\substack{k = 1 \\ T(s_i, f_k) = T(s_j, f_k)}}^{m} w_k,$$
where $w_k$ is a weight function that maps any feature $k$ to a real number (the weight assigned to that feature).
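To make the measure concrete, the following is a minimal Python sketch of the similarity function under the definitions above; the symbol names, feature vectors, and weights are illustrative placeholders rather than the study's actual data.

```python
# Minimal sketch of the weighted feature-based similarity measure.
# Symbols, feature vectors, and weights below are illustrative placeholders.

# T: each symbol maps to a binary feature vector (1 = symbol has the feature).
T = {
    "LinearA_AB01": (1, 0, 1, 0, 0),
    "Cypriot_ta":   (1, 0, 1, 0, 1),
}

# One weight per elementary feature, in the same order as the vectors.
weights = (0.01, 0.01, 0.02, 0.12, 0.33)

def sim(si: str, sj: str) -> float:
    """Sum the weights w_k of every feature f_k on which s_i and s_j agree."""
    return sum(w for a, b, w in zip(T[si], T[sj], weights) if a == b)

print(sim("LinearA_AB01", "Cypriot_ta"))  # 0.01 + 0.01 + 0.02 + 0.12 = 0.16
```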
2. Elementary Feature Set
The elementary feature set describes the features of each symbol using a set of descriptors. Each feature corresponds to elements found in the symbols. Table 1 shows
the elementary feature set used for this paper, which is based on the one developed by
Revesz [8].
Table 1. Elementary features with their corresponding weights.
Elementary Feature | Weight
The symbol contains some curved lines | 0.01
The symbol encloses some regions | 0.01
The symbol has a slanted straight line | 0.01
The symbol contains parallel lines | 0.02
The symbol contains crossing lines | 0.02
The symbol’s top is a wedge | 0.12
The symbol’s bottom is a wedge | 1.00
The symbol’s right side is a wedge | 0.33
The symbol contains a stem, that is, a straight vertical line that runs down the middle | 0.03
The symbol’s bottom has two legs | 0.06
The symbol’s bottom has three legs | 0.09
The symbol contains a hair, a small line extending from an enclosed space | 0.04
The symbol contains two triangles | 0.33
3. Weight of Each Feature
In Revesz [8], the weight of all features is 1. However, in this study, a different set of
weights for each feature is used. The weight of each feature is the inverse of its frequency
of occurrence across all symbols in Linear A. In other words, a feature that exists in most
symbols will have a lower weight compared to a feature that only exists in some. This
means that sharing a rarely occurring feature is given more importance than sharing a
commonly occurring one.
Table 1 illustrates the weight of each feature in the elementary set based on a frequency
analysis performed for all features across all standard simple signs in Linear A (A001–A371)
from GORILA (the Linear A corpus by Louis Godart and Jean-Pierre Olivier).
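As a hedged illustration of this weighting scheme, the sketch below derives inverse-frequency weights from hypothetical feature maps; it assumes that "frequency of occurrence" means the raw count of signs exhibiting a feature, and the sign identifiers and feature names are placeholders, not the actual GORILA data.

```python
from collections import Counter

# Hypothetical feature maps for a few Linear A signs; the paper's frequency
# analysis runs over all standard simple signs (A001-A371) from GORILA.
feature_maps = {
    "A001": {"curved", "stem"},
    "A002": {"curved", "parallel"},
    "A003": {"stem", "two_legs"},
}

# Count how many signs exhibit each feature.
counts = Counter(f for features in feature_maps.values() for f in features)

# Weight of a feature = inverse of its frequency of occurrence, so rarely
# occurring features weigh more than common ones.
weights = {feature: 1 / n for feature, n in counts.items()}
print(weights)  # {'curved': 0.5, 'stem': 0.5, 'parallel': 1.0, 'two_legs': 1.0}
```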
With the elementary feature set, a feature map is first computed for each symbol in
Linear A, the Carian Alphabet, and the Cypriot Syllabary. The feature map demonstrates
the existence of specific elementary features in the symbol. Each symbol in Linear A is
then compared with each symbol in the Carian Alphabet and the Cypriot Syllabary to
derive their similarity scores. As expected, this produces a large number of candidate pairs. Hence, after the comparison, some criteria are necessary to keep only those symbol matches that are strongly correlated. In this paper, the following criteria are employed:
• The threshold for the similarity value given by the similarity function is set to 2.05. This means that only pairs of symbols whose similarity value is at or above 2.05 are considered potential matches;
• If there are multiple matches with the same similarity value, the tie is broken by manually analyzing the symbols;
• Matches that meet the threshold but are visually dissimilar upon manual analysis are also not considered. Such a case could arise due to the limited number of features considered or because of the interdependence among features in the elementary feature set. For instance, for a symbol to contain a hair, it must also enclose some ‘region’.
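As a rough illustration of how the automatic part of this filtering could look in code, the sketch below applies the 2.05 threshold to placeholder sign inventories; the sim function and all names are hypothetical stand-ins, and the tie-breaking and visual checks remain manual steps.

```python
# Sketch of the automatic filtering step; manual tie-breaking and visual
# checks follow afterwards, as described in the criteria above.

def sim(a: str, b: str) -> float:
    # Placeholder similarity scores; a real run would use the weighted
    # feature comparison sketched earlier.
    scores = {("AB01", "CS_ta"): 2.07, ("AB01", "CA_b"): 1.40,
              ("AB02", "CS_ta"): 2.05, ("AB02", "CA_b"): 0.90}
    return scores[(a, b)]

linear_a_signs = ["AB01", "AB02"]
target_signs = ["CS_ta", "CA_b"]  # Cypriot Syllabary / Carian Alphabet signs

THRESHOLD = 2.05  # only pairs at or above this value are potential matches

candidates = [(a, t, sim(a, t))
              for a in linear_a_signs
              for t in target_signs
              if sim(a, t) >= THRESHOLD]

print(candidates)  # [('AB01', 'CS_ta', 2.07), ('AB02', 'CS_ta', 2.05)]
```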
Additionally, since the phonetic grid is derived via visual comparison, allographs in
these writing systems are important to consider. The Carian Alphabet, specifically, contains several of them; for instance, a group of its variant signs all have the same phonetic value. In
this paper, all variants of symbols in the Carian Alphabet specified in the Unicode Standard
are examined independently for comparison. In the case of both the Cypriot Syllabary and
Linear A, only the standard signs are used.
4.2. Consonant Cluster Comparison
After assigning the new phonetic values, the comparison between the languages is
performed using a Python program developed by a research team led by Francesco Perono
Cacciafoco and Colin Loh at Nanyang Technological University (NTU), Singapore [34].
The program works by using two CSV files as the input. The first CSV file is a Linear A
master list with three columns: ‘Source’ (the artefact that contains the Linear A word), ‘New
Format’ (the Linear A word with phonetic values derived from the feature-based similarity
measure), and ‘Linear A’ (the Linear A word with the vowels removed). The second CSV
file contains a single column with all the dictionary words of the language being compared.
The program then removes the vowels from all the dictionary words of the
language being compared and compares each of them with words from the Linear A master
list. It finally produces a list of exact matches found between Linear A consonant clusters
and the consonant clusters of the language which is compared. These matches are finally
compared manually, in turn, to dictionary entries in the selected language, to see whether a
meaning can be assigned to them. If a meaningful entry is found, this is cross-referenced
with the original Linear A tablet and a judgment is made as to whether it allows us to ‘read’
the tablet itself, or part of it.
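The following is a minimal sketch of how such a consonant-cluster comparison could be implemented; it is not the NTU team's actual program. The CSV file names are placeholders, the column names follow the description above, and the vowel set is a simplification of the transliteration scheme.

```python
import csv

VOWELS = set("aeiou")  # simplified; real transliterations need finer handling

def strip_vowels(word: str) -> str:
    """Reduce a transliterated word to its consonant skeleton."""
    return "".join(c for c in word.lower() if c.isalpha() and c not in VOWELS)

# First input: the Linear A master list ('Source', 'New Format', 'Linear A').
with open("linear_a_master.csv", newline="", encoding="utf-8") as f:
    master = list(csv.DictReader(f))

# Second input: single-column CSV with the dictionary of the compared language.
with open("dictionary.csv", newline="", encoding="utf-8") as f:
    dictionary = [row[0] for row in csv.reader(f) if row]

# Index the dictionary words by their consonant clusters.
by_cluster = {}
for word in dictionary:
    by_cluster.setdefault(strip_vowels(word), []).append(word)

# Report exact consonant-cluster matches for later manual inspection.
for entry in master:
    matches = by_cluster.get(entry["Linear A"].lower())
    if matches:
        print(entry["Source"], entry["New Format"], "->", matches)
```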
5. Results
5.1. Phonetic Values for Linear A
Using the feature-based similarity measure, each symbol in Linear A is compared
with every symbol in the Cypriot Syllabary and the Carian Alphabet, to derive the possible
phonetic values of Linear A. Table 2 shows a sub-set of these comparisons, filtered using
the criteria outlined in the Methodology. The last column indicates the writing system
that the matched sign is assumed to belong to (‘CS’ denotes the Cypriot Syllabary, ‘CA’
denotes the Carian Alphabet). The phonetic values are transcribed using Latin/Roman
letters. For the Carian Alphabet, they are based on the transcription system posited by
Ignacio J. Adiego [35].
It is important to note that Linear A, being—plausibly—a syllabary, is likely not
composed of pure consonants, unlike the Carian Alphabet. This poses a challenge with
using the Carian Alphabet to derive the phonetic values of Linear A. Revesz [8] proposes
that if a Linear A symbol corresponds to a Carian Alphabet symbol with the phonetic value
/C/ (a consonant), then the Linear A symbol, for some vowel, will have a phonetic value of
/CV/ (consonant/vowel). For instance, the Linear A sign , which could match with the
Carian Alphabet sign , would have a phonetic value of /L/+/V/. Revesz then derives
the value for this /V/ by searching for the “appropriate word to describe the meaning of
the Linear A symbol” [8] in Uralic, Finno-Ugric, and Ugric vocabulary lists.
Table 2. Feature-based similarity scores for a sub-set of symbol pairs.
[The original table pairs 43 Linear A signs (AB01, AB02, AB03, AB07, AB08, AB09, AB11, AB13, AB17, AB20, AB22, AB24, AB31, AB34, AB37, AB39, AB44, AB46, AB48, AB50, AB51, AB54, AB55, AB59, AB65, AB70, AB77, A302, A304, A306, A309A, A311, A312, A314, A318, A319, A325, A326, A330, A339, A349, A351, and A355) with their matched Cypriot Syllabary (CS) or Carian Alphabet (CA) signs, giving for each pair the derived phonetic value (e.g., TA, LO, PA, SE, NE, RA, TI), the similarity score (all between 2.05 and 2.07), and the assumed origin (CS or CA). The glyph columns and the sign-by-sign pairings could not be recovered from this copy.]
In this paper, no particular vowel was concatenated with the pure consonants, as it
was assumed that they could have any value. Since this comparison of Linear A with other
languages involves a consonantal approach, the lack of specific vowels does not render the phonetic values unusable.
5.2. Comparing Linear A with Other Writing Systems and Related Languages
With the phonetic values derived in Table 2, Linear A was compared with Ancient
Egyptian, Luwian, Hittite, Proto-Celtic, and Uralic by using the consonantal approach
proposed by Loh and Perono Cacciafoco [7]. The results derived from the operation of the
Python program developed by the two scholars for this task, highlighted in the Methods
section, are presented in Tables 3–7. Since the Python program yields many matches for each of the languages, the results presented have been filtered manually, to ensure that only plausible matches are kept for consideration.
Table 3. Python program results for Ancient Egyptian. In the Meaning column, ‘|’ separates different meanings and ‘?’ indicates that the meaning is uncertain.
Matched Consonants | Linear A Cluster | Egyptian Word | Linear A Source | Meaning
nr | NE-RA | iner | HT10A | shell of an egg|gravel, stone
p | PA-[], ]-PA | ipA | KN32b, KH 91 | to make to fly, to fly|house, dwelling, harem
pr | PA-RA-[ ], ]-PA-R | aper | ZA006b, KH 79 + 89 | to be equipped, to be provided with, furnished (of a house)|a boat equipped with everything necessary and a crew
r | RA | ArA | ZA009 | to go up, to embark in a boat, to bring, to be high
rp | R-PA | irp | HT104 | wine|wine plant, vine|to rot, to decay, to ferment
rr | RI-R | irr | HT30 | deaf (?)|grapes, grape seeds|a wine jar
ry | RI-Y | ary | HT28a, HT28b | he who goes up|light, fiery one|the name of a Dekan|a kind of fish|breeze, wind
yS | Y-SE | AyS | HT132, HT81, HT93a, HT85a | truce
Table 4. Python program results for Luwian. In the Meaning column, ‘|’ separates different meanings and ‘?’ indicates that the meaning is uncertain.
Matched Consonants | Linear A Cluster | Luwian Word | Linear A Source | Meaning
ll | LO-LO | lalai | KE Wc 2b | take
p | PA-[], ]-PA | pa | KN32b, KH91 | protect (?)
r | RA | ura | ZA009 | great
ry | RI-Y | ariya | HT28a, HT28b | raise|check, restrain (?)
t | TA | ta | HT86a, Wa 1031 | step|arrive
tn | TA-NE | taini | HT95a, HT95b | of oil, oily
w | W | wi | HT98a, Wc 3019, HT97a | see|appear (?)
Table 5. Python program results for Hittite. In the Meaning column, ‘|’ separates different meanings and ‘?’ indicates that the meaning is uncertain.
Matched Consonants | Linear A Cluster | Hittite Word | Linear A Source | Meaning
ll | LO-LO | lulu | KE Wc 2b | evenness, steadiness, stability, security
lr | L-R | luri | HT10B | loss, shortfall, decimations|loss of standing, comedown, disgrace, degradation
p | PA-[], ]-PA | apa | KN32b, KH 91 | that (one)|he, she, it|the one in question|thy, thine, your(s)
pr | PA-RA-[ ], ]-PA-R | puri | ZA006b, KH 79 + 89 | lip|rim, edge, border
prl | PA-R-L | parala | HT122a, HT94b | something of wood used on sacrificed cattle, nom
prt | PA-R-TA | parta | PH31a | side, siding, partition
ps | PA-SE | pus | HT18, HT27b | diminish, fade, be eclipsed|be small, act petty, be pusillanimous
r | RA | ara | ZA009 | belonging (or: proper) to one’s own social group, communally accepted or acceptable, congruent with social order
rp | R-PA | arp | HT104 | bad luck, setback, misfortune
Table 6. Python program results for Proto-Celtic. In the Meaning column, ‘|’ separates different meanings and ‘?’ indicates that the meaning is uncertain.
Matched Consonants | Linear A Cluster | Proto-Celtic Form | Linear A Source | Meaning
lr | L-R | *liro | HT10B | sea (?)
n | []-NE | *ne | ZA020 | not
nr | NE-RA | *nero | HT10A | hero (?)
rr | RI-R, []-RI-R | *eriro | HT30 | eagle
ry | RI-Y | *aryo | HT28a, HT28b | free man
sny | SU-NE-Y | *sniyo | HT19 | spin, weave
t | TI, TI-[, TI-[] | *eti | HT28a, KH90, Wc 3015b | yet, still, but|beyond|also
tn | TA-NE | *tini | HT95a, HT95b | melt
wy | W-Y | *way | HT94b | woe, oh, alas
y | Y | *yo | HT85b, We 1023/Wd 1024 | which
yr | Y-R-[] | *yaro | ZA009 | chicken, hen
Table 7. Python program results for Uralic. In the Meaning column, ‘|’ separates different meanings and ‘?’ indicates that the meaning is uncertain.
Matched Consonants | Linear A Cluster | Uralic Form | Linear A Source | Meaning
n | [ ]-NE | une | ZA020 | sleep, dream
nr | NE-RA | nure | HT10A | to press
pr | PA-RA-[ ] | para | ZA006b, KH 79 + 89 | good
ps | PA-SE | pese | HT18, HT27b | to wash (head?)
r | RA | ora | ZA009 | awl|squirrel
rp | R-PA | orpa | HT104 | melt
sr | SU-[ ]-RA | sira | ZA018a | a k. of relative
t | TA | ta | HT86a, Wa 1031 | this, that
t | TI, TI-[, TI-[] | tE | HT28a, KH 90, Wc 3015b | you
tn | TA-NE | tana | HT95a, HT95b | birch bark
w | W, W-[] | owe | HT98a, Wc 3019 | door
wl | W-L | wElE | HT38 | to understand
6. Discussion
Although the results could suggest possible links between Linear A and Ancient
Egyptian, Luwian, Hittite, Proto-Celtic, and Uralic, the matches found are insufficient to
yield conclusive evidence of any connection. For each compared language, the matches
appear sparse and spread across multiple tablets. Additionally, the number of matches
across the languages is similar, with certain Linear A words matching with words in all the
languages, suggesting that the result is coincidental rather than indicative of concrete links.
The limited number of matches could be due to the phonetic values used for the
comparison. The feature-based similarity measure, with the parameters utilized in this
paper, produced only 43 matches for comparing Linear A with other languages. In contrast, since Linear A and Linear B potentially share 92 similar signs, the phonetic grid based on Linear B naturally includes more signs. There are several reasons for
the derived phonetic grid being small. Firstly, it could simply indicate a lack of concrete
links between the scripts. Secondly, while the feature-based similarity measure allows
for an analysis of different writing systems, it is not without its limitations. The method
depends highly on the elementary feature set, and since we only had a few features, it
is plausible to assume that certain important features may have been missed during the
analysis. Additionally, a small feature set also increases the probability of finding multiple
matches for any symbol with the same similarity scores, and breaking the tie becomes a
challenging decision. In Revesz [8], for instance, the tie is broken by choosing the symbol
that is earlier in the standard ordering of symbols.
It is important to note, additionally, that the limited number of matches could simply
indicate a lack of connections between the languages. Most connection hypotheses, as
discussed previously, have been shown to be unsuccessful for sound reasons.
Considerations such as the temporal and spatial relations of the writing systems and
languages are undeniably important factors. For instance, a Minoan and Luwian or Hittite
(Indo-European) link could be considered unlikely due to temporal gaps, if the emergence
of the Minoan civilization is believed to predate the arrival of Indo-European speakers
in Anatolia.
The Combined Approach
Our approach aimed to leverage different characteristics of two computational methods of decipherment, in the effort to interpret Linear A. The feature-based similarity
measure, for instance, has been considered effective for visual comparisons of writing
systems. In [36], Barla et al. performed a feature analysis of Indus Valley- and Dravidian-connected scripts. They propose a novel elementary feature set consisting of six additional
features on top of the one employed in this paper and generate heat maps for the different
writing systems. In contrast to their approach, in this paper we chose to use the same feature set as Revesz [8]. However, this choice is somewhat arbitrary and evidently influences the results obtained from the analysis. Selecting a good elementary feature set is
not straightforward and requires experimentation and further analysis. This suggests that
although the approach seemingly aims to provide an objective way to compare writing systems, it is still subjective to an extent. For this paper, however, the approach has allowed us
to account for biases that arise while assigning Linear B phonetic values to Linear A, which
is also inherent in the so-called ‘consonantal approach’. It is important to note, however,
that visually similar symbols may not necessarily share the same phonetic values [37].
After the derivation of the phonetic values, the consonantal approach has enabled us
to attempt a statistical analysis of a new rendition of Linear A with other languages from
the region, resulting in a few matches. In [7], Colin Loh and Francesco Perono Cacciafoco
outlined preliminary results using this consonantal approach with Linear B phonetic values,
and they found matches across Hittite and Luwian. They posited that the Linear A cluster
“PA-RE”, from the document HT4 of GORILA’s volume 1, could be a possible match with
“PARI” from the Luwian dictionary, which represents “forth”, or “away”. Due to the limited
number of phonetic values derived in this paper, it is difficult to assess the effectiveness of
the consonantal approach for the comparative analysis of languages. A limitation inherent
in such an approach is the loss of information resulting from the removal of vowels. The
matches may just be loanwords or pure coincidences. Furthermore, filtering the large number of matches generated by the Python program cannot be done arbitrarily and requires further
consideration and study. The approach’s effectiveness in performing a ‘brute-force’ analysis,
however, is evident.
Overall, the combined approach is effective in a cross-language and cross-script
analysis, albeit with some limitations inherent in the two approaches that have been
combined. It is also worth noting that there are possible limitations with the combination
as well. Due to the dependence of the consonantal approach on the feature-based similarity
measure, it may be difficult to determine the plausibility of links between languages using
this approach. A low number of matches, for instance, is harder to interpret than when the consonantal approach is used alone or when the writing systems are only compared visually, since it may reflect either a genuine lack of connection between the languages or merely a small derived phonetic grid. Hence, the combined approach necessitates a stepwise assessment of the results. If the visual comparison of the writing systems yields few matches, one must first decide whether those writing systems are appropriate inputs for the consonantal approach.
7. Conclusions
Among the numerous attempts to decipher Linear A, some recent approaches involve
a computational component. This paper aimed to combine two such novel methods to
firstly account for the biases inherent in provisionally assigning Linear B phonetic values
to Linear A and, secondly, to shed more light on the possible connections between Linear A
and other writing systems and languages of the Mediterranean and the Black Sea areas. This
paper also aimed to highlight some limitations inherent to such approaches. The first step
in the combined approach involved a feature-based similarity measure to visually compare
writing systems and the second involved using a consonantal approach to compare different
languages. Although the writing system remains largely undeciphered, by employing
the combined approach some Linear A signs were found to be similar to signs from both
the Cypriot Syllabary and the Carian Alphabet. Applying the phonetic values of those
similar signs to Linear A and comparing it with Ancient Egyptian, Luwian, Hittite, Proto-Celtic, and Uralic resulted in a few word matches between the languages. Although these
could suggest possible connections, they are not conclusive, due to the limited
number of matches. Along with some limitations inherent to the combined approach, the
small corpus still poses a major challenge in deciphering Linear A. However, our approach
can be applied to compare any known writing system and language possibly
connected to Linear A, removing our dependence on assigning Linear B phonetic values to
Linear A and allowing for an unbiased analysis. Further research could investigate the use
of the combined approach with other scripts and languages.
Author Contributions: Conceptualization, A.N. and F.P.C.; methodology, F.P.C. and A.N.; software,
F.P.C. and A.N.; validation, F.P.C.; formal analysis, A.N.; investigation, F.P.C. and A.N.; resources,
F.P.C.; data curation, F.P.C.; writing—original draft preparation, A.N. and F.P.C.; writing—review and
editing, F.P.C. and A.N.; supervision, F.P.C.; project administration, F.P.C.; funding acquisition, F.P.C.
All authors have read and agreed to the published version of the manuscript.
Funding: The tools and research developed for this paper were funded by a Singapore Ministry of
Education (MOE) AcRF Tier 1 Research Grant (Grant Number 2017-T1-002-193—Principal Investigator: Dr Francesco Perono Cacciafoco), “Giving Voice to the Minoan People: The Decipherment of
Linear A”, 2018–2021.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data collected and used in this study are safely stored in physical
data discs and in an online (private) data repository. They can be made freely and promptly available
to any scholar interested in it upon request by email to the authors. The Python program mentioned
in our study can be found at the following GitHub page: https://github.com/L-Colin/Linear-A-decipherment-programme (accessed on 21 November 2023).
Acknowledgments: We would like to acknowledge Colin Loh (National University of Singapore—
NUS, Singapore) for all his work and help in the development of the software (https://github.com/L-Colin/Linear-A-decipherment-programme, accessed on 21 November 2023) used and implemented
in the project which is the origin of this paper. This project was supported by Nanyang Technological
University (NTU), Singapore, under the URECA Research Programme.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Davis, B. Introduction to the Aegean Pre-Alphabetic Scripts. Kubaba 2010, 43, 38–61.
2. Cadogan, G. Palaces of Minoan Crete; Barrie and Jenkins: London, UK, 1976; ISBN 978-0-214-20079-3.
3. Chadwick, J. The Decipherment of Linear B, 2nd ed.; Cambridge University Press: Cambridge, UK, 2014; ISBN 978-1-107-69176-6.
4. Robinson, A. Writing and Script: A Very Short Introduction; Very Short Introductions; Oxford University Press: Oxford, NY, USA, 2009; ISBN 978-0-19-956778-2.
5. Mycenean Artifacts Found in Bodrum. Available online: https://www.hurriyetdailynews.com/mycenean-artifacts-found-in-bodrum--74114 (accessed on 21 November 2023).
6. Perono Cacciafoco, F. Linear A and Minoan: The Riddle of Unknown Origins. In Proceedings of the LMS Fieldwork and Language Analysis Group (FLAG) Meeting, School of Humanities (SoH), Nanyang Technological University (NTU), Singapore, 10 June 2014.
7. Loh, J.S.C.; Perono Cacciafoco, F. A New Approach to the Decipherment of Linear A: Coding to Decipher Linear A, Stage 2 (Cryptanalysis and Language Deciphering: A ‘Brute Force Attack’ on an Undeciphered Writing System). In Proceedings of the Grapholinguistics in the 21st Century—2020, Part II; Fluxus: Brest, France, 2020; pp. 927–943. ISBN 978-2-9570549-7-8.
8. Revesz, P. Establishing the West-Ugric Language Family with Minoan, Hattic and Hungarian by a Decipherment of Linear A. WSEAS Trans. Inf. Sci. Appl. 2017, 14, 306–335.
9. Godart, L. Du Lineaire A Au Lineaire B. In Aux Origines de L’hellénisme: La Crète et la Grèce. Hommage à Henri van Effenterre; Histoire Ancienne et Médiévale; Publications de la Sorbonne: Paris, France, 1984; pp. 121–128.
10. Georgiev, V.I. Les Deux Langues des Inscriptions Crétoises En Linéaire A. Linguist. Balk. 1963, 7, 1–104.
11. Nagy, G. Greek-like Elements in Linear A. Greek Rom. Byzantine Stud. 1963, 4, 181–211.
12. Owens, G. The Structure of the Minoan Language. J. Indo Eur. Stud. 1999, 27, 15–56.
13. Palmer, L.R. Mycenaeans and Minoans: Aegean Prehistory in the Light of the Linear B Tablets, 1st ed.; Alfred A. Knopf: New York, NY, USA, 1962.
14. La Marle, H. Linéaire A: La Première Écriture Syllabique de Crète: Essai de Lecture; Librairie Orientaliste Paul Geuthner: Paris, France, 1996; ISBN 978-2-7053-3641-7.
15. La Marle, H. Linéaire A, la Première Écriture Syllabique de Crète: Éléments de Grammaire; Librairie Orientaliste Paul Geuthner: Paris, France, 1997; ISBN 978-2-7053-3642-4.
16. La Marle, H. Linéaire A, la Première Écriture Syllabique de Crète: L’histoire et la Vie de Crète Minoenne: Textes Commentés; Librairie Orientaliste Paul Geuthner: Paris, France, 1998; ISBN 978-2-7053-3643-1.
17. La Marle, H. Linéaire A: La Première Écriture Syllabique de Crète. Signes Rares, Textes Brefs, Substitutions 4; Librairie Orientaliste Paul Geuthner: Paris, France, 1999; ISBN 978-2-7053-3644-8.
18. Younger, J.G. Linear A: Critique of Decipherments by Hubert La Marle and Kjell Aartun. Available online: https://people.ku.edu/~jyounger/LinearA/LaMarleAartun.html (accessed on 7 December 2023).
19. Gordon, C.H. Forgotten Scripts: Their Ongoing Discovery and Decipherment, Revised and Enlarged ed.; Dorset Press: Dorchester, UK, 1987; ISBN 978-0-88029-170-5.
20. Eu Min, N.C.; Perono Cacciafoco, F.; Cavallaro, F.P. Linear A Libation Tables: A Semitic Connection Explored. Ann. Univ. Craiova Ser. Philol. Linguist. 2019, 41, 51–63.
21. Perono Cacciafoco, F. Linear A and Minoan: Some New Old Questions. Ann. Univ. Craiova Ser. Philol. Linguist. 2017, 39, 154–170.
22. Rix, H. Rätisch und Etruskisch; Institut für Sprachwissenschaft der Universität Innsbruck: Innsbruck, Austria, 1998; ISBN 978-3-85124-670-4.
23. Facchetti, G.M.; Negri, M. Creta Minoica: Sulle Tracce Delle più Antiche Scritture d’Europa; Archivum Romanicum, Biblioteca dell’Archivum Romanicum, Ser. 2, Linguistica; Olschki: Firenze, Italy, 2003; ISBN 978-88-222-5291-3.
24. Facchetti, G.M. Appunti di Morfologia Etrusca: Con Un’appendice Sulla Questione Delle Affinità Genetiche Dell’etrusco; Biblioteca dell’”Archivum Romanicum”; L.S. Olschki: Firenze, Italy, 2002; ISBN 978-88-222-5138-1.
25. Mellaart, J. The Neolithic of the Near East, 1st ed.; Scribner: New York, NY, USA, 1975; ISBN 978-0-684-14483-2.
26. Younger, J.G. Linear A Texts & Inscriptions in Phonetic Transcription & Commentary. Available online: https://people.ku.edu/~jyounger/LinearA/ (accessed on 7 December 2023).
27. Revesz, P.Z. Data Mining Autosomal Archaeogenetic Data to Determine Minoan Origins. In Proceedings of the 25th International Database Engineering & Applications Symposium, New York, NY, USA, 7 September 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 46–55.
28. Krantz, G.S. Geographical Development of European Languages; P. Lang: Berlin, Germany, 1988; ISBN 978-0-8204-0800-2.
29. Makkay, J. An Archeologist Speculates on the Origin of the Finno-Ugrians. Mank. Q. 2003, 43, 235–272. [CrossRef]
30. Wiik, K. The Uralic and Finno-Ugric Phonetic Substratum in Proto-Germanic. Linguist. Ural. 1997, 33, 258–280. [CrossRef]
31. Revesz, P. Inscription on a Naxian-Style Sphinx Statue From Potaissa Deciphered as a Poem in Dactylic Meter. Mediterr. Archaeol. Archaeom. 2023, 23, 1–15.
32. Revesz, P.Z. The Cretan Script Family Includes the Carian Alphabet. MATEC Web Conf. 2017, 125, 5019. [CrossRef]
33. Marinatos, N. The Indebtedness of Minoan Religion to Egyptian Solar Religion: Was Sir Arthur Evans Right? J. Anc. Egypt. Interconnect. 2010, 1, 22–28. [CrossRef]
34. Loh, J.S.C. Linear-A-Decipherment-Programme. Available online: https://github.com/L-Colin/Linear-A-decipherment-programme (accessed on 21 November 2023).
35. Adiego Lajara, I.-J. The Carian Language; Handbook of Oriental Studies, Section One, The Near and Middle East; Brill: Leiden, The Netherlands; Boston, MA, USA, 2007; ISBN 978-90-04-15281-6.
36. Barla, S.S.; Alamuru, S.S.S.; Revesz, P.Z. Feature Analysis of Indus Valley and Dravidian Language Scripts with Similarity Matrices. In Proceedings of the International Database Engineered Applications Symposium, Budapest, Hungary, 22 August 2022; ACM: Budapest, Hungary, 2022; pp. 63–69.
37. Yao, Y.; Perono Cacciafoco, F.; Cavallaro, F.P. The Minoan Challenge: An External Analysis Approach to the Linear A Decipherment. Ann. Univ. Craiova Ser. Philol. Linguist. 2022, 44, 456–475. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
A Proposed Translation of an Altai Mountain Inscription
Presumed to Be from the 7th Century BC
Peter Z. Revesz 1, * and Géza Varga 2
1 School of Computing, College of Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
2 Írástörténeti Kutató Intézet, 1121 Budapest, Hungary; [email protected]
* Correspondence: [email protected]; Tel.: +1-402-421-6990
Abstract: The purpose of this study is to examine an Old Hungarian inscription that was recently
found in the Altai mountain and was claimed to be over 2600 years old, which would make it the
oldest extant example of the Old Hungarian script. A careful observation of the Altai script and a
comparison with other Old Hungarian inscriptions was made, during which several errors were
discovered in the interpretation of the Old Hungarian signs. After correcting for these errors that
were apparently introduced by mixing up the inscription with underlying engravings of animal
images, a new sequence of Old Hungarian signs was obtained and translated into a new text. The
context of the text indicates that the inscription is considerably more recent and is unlikely to be
earlier than the 19th century.
Keywords: Altai inscription; decipherment; inscription; Old Hungarian script; Orkhon script; translation
1. Introduction
A puzzling, unique inscription from the Altai Mountain was recently presented by Karžaubaj Sartkožauly, a member of the Kazakhstan Academy of Sciences, in a monograph on the Orkhon script [1]. According to Sartkožauly, the inscription was made in the 7th century BC.
Sartkožauly [1] also noticed that the inscription has similarities with the Old Hungarian script (Hungarian: székely írás or rovásírás), which was used by Hungarians before the
adoption of the Latin alphabet in the Middle Ages [2,3]. Sartkožauly’s book [1] remained
unnoticed in Hungary until Lajos Máthé brought it to the attention of the second author.
Subsequently, the second author alerted the first author and asked for his help in the
translation of the inscription. The second author already correctly identified a few words,
and the first author identified the still-missing words and completed the translation. Both
authors were intrigued by the Altai inscription and the possibility that it may be the oldest
extant example of the Old Hungarian script.
Although Sartkožauly already presented a translation of the inscription, we show
that it has several errors. One of the problems is that the inscription is partly written over
the engraved images of several animals. As we show, there are several instances where
Sartkožauly mixed up the actual inscription and the engraving of the animals. Correcting
these mistakes gives us a different sequence of Old Hungarian signs. Moreover, this enables
us to give a better, alternative translation of the Altai inscription.
The rest of this paper is organized as follows. Section 2 describes the materials and
methods. Section 3 describes the main results of the paper, including our identification of a
new Old Hungarian signs sequence read off from the Altai inscription (Section 3.1) and
a transliteration and translation of the Altai inscription (Section 3.2). Section 4 discusses
the Altai inscription and finds a new date range for its creation. Section 5 presents an
alternative transliteration and translation of the inscription. Finally, Section 6 gives some
conclusions and directions for further work.
2. Materials and Methods
The main method of our research was a careful examination of the original photo of
the Altai inscription in [1]. It was discovered that the inscription was overlaid on the
engraved images of some animals. The engravings are usually fainter than the inscription,
but there are cases where the lines are indistinguishable. This causes several problems in
the precise identification of the Old Hungarian signs that were intended by the scribe. We
could correct several of the earlier mistakes made by Sartkožauly [1] and obtain a new
sequence of Old Hungarian signs.
Next, we transcribed the new sequence of Old Hungarian signs. The transcription
was complicated by the presence of ligatures, which are combinations of letters. We also
looked at Old Hungarian alphabets from various centuries to identify the one that
contained signs that have similar forms to the one in the Altai inscription.
Finally, we translated the inscription first into Hungarian and then into English. The
etymology of the Hungarian words was considered in finding an improved date range for
this Altai inscription.
3. Results
3.1. A Reexamination of the Old Hungarian Signs
Sartkožauly [1] gave a drawing of the inscription. Figure 1a is a modification of that
drawing by enhancing it with different colors for the inscription itself and the underlying
animal drawings. This distinction is important because it influences the interpretation
of the signs that are thought to belong to the inscription. In fact, we do not completely
agree with Sartkožauly’s identification of what belongs to the inscription versus the underlying drawings.
Indeed, our examination of the photo of the Altai inscription led us to a different
identification of the sequence of Old Hungarian signs as shown in Figure 1b. We believe
these differences are due to different interpretations of what little line segments belong to
the inscription itself and what line segments belong to the engravings of the animal figures
on the rock surface where the inscription was found.
In addition, there are also some cracks on the rock that may cause problems in the
correct discernment of the Old Hungarian signs that belong to the inscription. Below we
list the most important differences that we identified.
In the second row, we identified the fourth and the ninth signs from the right to be the
Old Hungarian sign denoting the vowel a. Here is an enlargement of the fourth sign from
the right in the second row of the photo in [1] next to the Old Hungarian a sign:
Similarly, let us consider now the ninth sign from the right in the second row next to
the Old Hungarian a sign:
Figure 1. Two drawings of the Altai inscription based on a photo from [1]. (a) The first author’s redrawing of Sartkožauly’s drawing in [1]; the improved drawing shows the inscription in red color and part of the animal drawings in the background in black color. (b) An alternative drawing of the same inscription by the first author; this alternative drawing follows closer the original inscription shown in the photo.
In both cases, there are line segments which look deliberate and belong to the Old Hungarian sign. In the second case, it is not clear why some of the lines have been blackened, but this feature also appears in some other signs of the Altai inscription. As can be seen in Figure 1a, Sartkožauly [1] left out some of the line segments that form the little triangle in these signs.
Sartkožauly [1] overlooked that the inscription contains some ligatures, which are
combinations of two or more signs. Ligatures often save some space and are common
in Old Hungarian inscriptions. In the Altai inscription, we also find a few examples of
ligatures. For example, in the third row of the inscription, we believe that the first sign on
the right is a ligature of the Old Hungarian n and a signs.
Below we show an enhanced image of the first sign from the right in the third row, our
drawing of it, and the Old Hungarian n sign written with a mirror symmetry and an a sign:
The comparison shows that this sign may be a ligature of the Old Hungarian signs. The Old Hungarian n sign is likely mirrored to make the combination with the a sign easier and to save more space. The ligature is read as na.
The next sign in the third row is not a straight vertical line as [1] assumes, because it
also has additional overlooked details, as can be seen in the following enhanced photo:
Here we need to be careful to ignore the engravings that depict part of the back and
the belly of a deer. The lines to be ignored are shown in black in our drawing.
The seventh sign from the right in the third row is an Old Hungarian t sign:
In the enhanced photo, the Old Hungarian t sign is clearly visible. In addition, there are two parallel lines that belong to the head of one of the engraved deer. These lines do not belong to the Old Hungarian inscription and should be ignored. Unfortunately, Sartkožauly considered these lines part of the inscription and obtained an Old Hungarian sign in this place.
In the fourth line, there are additional missing details in Sartkožauly’s drawing. The
first sign from the left is missing its top half, the diamond sign misses one side, and in the
second word, which is written with smaller signs, the third sign from the right misses a
small horizontal crossing line segment. These can also be verified by a careful observation
of the original photo of the Altai inscription. In addition, the following ligature was
also overlooked:
This ligature is a combination of the Old Hungarian k sign and the ő sign. Together they can be read as kő.
3.2. Transliteration and Translation of the Altai Inscription
We agree with Sartkožauly that the Altai inscription needs to be read from right-to-left.
Most Old Hungarian inscriptions known from Hungary and the Carpathian Basin are also
read from right-to-left. On the other hand, a left-to-right presentation would make the
translation hard to read. Hence Table 1 presents each row of the Altai inscription in red
based on our drawing, its Old Hungarian left-to-right transliteration in black, and below
the Old Hungarian signs a Latin alphabet transliteration of the Old Hungarian letters. The
Latin alphabet is extended by some accent marks.
Table 1. The Altai inscription and its row-by-row transliterations into Old Hungarian and Latin.
[For each row, the original table shows the Altai inscription (right-to-left), its Old Hungarian transliteration (left-to-right), and a Latin transliteration; the script glyphs could not be reproduced here. The Latin transliterations are:]
Row 1: k u n p é t e r
Row 2: m a gy a r o r sz á g
Row 3: n a gy sz e r e t l e k
Row 4: e n i k ő e n i k ő m
There are certain peculiarities in the Latin transliteration that we made in order to obtain meaningful words. We believe that the scribe was not using the standard Old Hungarian signs but mixed up some of the similar looking signs: in particular, the scribe interchanged the Old Hungarian letters for r and z and, similarly, the Old Hungarian letters for g and l. We had to assume these two interchanges to obtain meaningful Hungarian words. We give a row-by-row translation of the Altai inscription in items (1–4) below.
1. The first row of the inscription starts with the name Kun Peter. Interestingly, the
family name Kun is written first, and the given name Peter is written second. This
order agrees with the Hungarian word order. In addition, Peter is a common given
name in Hungary, and Kun, meaning ‘Cuman’, is also a common family name. In fact,
Hungarian Kunság is the name of a region of Hungary that was settled by Cumans in
the 13th century. Many people in that region consider themselves to be descendants
of the Cumans and took Kun as a family name in later centuries.
2.
The second row of the inscription contains the Hungarian word Magyarország, which
means ‘Hungary.’ The Hungarians’ neighbors apparently confused the Hungarians
with the Huns and the Onogurs, who occupied present day Hungary before the
Magyars and allied peoples arrived in the 9th century. For example, German speakers
in Austria and Germany call the country Ungarn.
3.
The third row of the inscription contains the Hungarian word nagy, which means ‘big’
or ‘much’, and the Hungarian word szeretlek, which means ‘I love you.’ Hence the
two words together express the sentence ‘I much love you.’
4.
The fourth row of the inscription contains the Hungarian word Enikő, which is a common woman's name, and its conjugation Enikőm, where the -m suffix is a first-person possessive marker. Hence, the meaning of Enikő, Enikőm is 'Enikő, my Enikő'. The name Enikő is said to derive from Hungarian enéh, meaning 'young hind (female deer)' [4]. It is perhaps for this reason that we see two deer drawn next to these words in the inscription.
In summary, the inscription can be translated into Hungarian as follows.
Kun Péter, Magyarország: Nagy szeretlek, Enikő, Enikőm.
This means the following in English:
Enikő, my Enikő, I much love you.—Peter Kun, Hungary
Therefore, the Altai inscription is a message of love from a gentleman named Peter Kun to Enikő, his beloved.
4. The Inscription’s Implications for the Development of the Old Hungarian Script
The Old Hungarian alphabet is thought to be a descendant of the Orkhon Turkic
alphabet [3]. An early example of an Old Hungarian inscription from the Altai Mountain
would support the theory of an Orkhon Turkic origin of the Old Hungarian alphabet.
On the other hand, the first author argued that the Old Hungarian alphabet may be
a descendant of the Carian alphabet, which in turn may be a descendant of the Minoan
Linear A script [5]. The second author has also proposed that the Old Hungarian script
had a pictogram or hieroglyph script-like origin in the Carpathian Basin even earlier [6].
These two views do not exclude each other because there is growing evidence based on
archaeogenetics [7,8] and art motif comparisons [9] that the Minoans came from the Danube
Basin to the Aegean islands in the early Bronze Age. Hence both authors were skeptical
about an Asian or in particular an Orkhon Turkic origin of the Old Hungarian script.
However, we were intrigued by the reported find and undertook the research described in
this paper.
During the translation, we noticed that the Old Hungarian signs of the Altai inscription reflect not the earliest known forms, as one would expect from a 2600-year-old inscription, but forms from later centuries.
Luckily, the date of the inscription can be narrowed down for a purely linguistic reason: the name Enikő was created by Mihály Vörösmarty (1800–1855), a Hungarian poet [4]. Hence, the Altai inscription was carved in the latter half of the 19th century or later.
Already in the 19th century, Hungarians had a strong interest in exploring the area because
of presumed cultural connections with some people living near the Altai Mountains. In
fact, a well-known Hungarian scientific expedition to the Altai Mountain was led by Count
Jenő Zichy in 1895 [10].
5. An Alternative Translation of the Inscription
Sartkožauly [1] has given a transliteration of the letters of the Old Hungarian inscription based on an interpretation of the drawing as shown in Figure 1a. His transliteration,
which is only the substitution of the Old Hungarian letters by Latin letters, is the following
from the topmost line to the bottom-most line:
Line 1: kunpétez
Line 2: magy sz zcz sz
Line 3: sz ksz sz eze gügek
Line 4: enü ? o en sz kom
As can be seen, the transliteration is different from ours because some letters are
faintly written over some underlying drawings of animals. Therefore, they have ambiguous
interpretations. In fact, Sartkožauly [1] has used a question mark at some point in the last
line to indicate that at that point he did not find a clearly readable letter that he could
transliterate with confidence. Sartkožauly [1] could not give an actual translation.
Sartkožauly's drawing and transliteration were the starting point of our translation of the inscription. Initially, we tried to make only minimal changes to both his drawing and transliteration, as shown in Figure 2. In particular, Figure 2 follows Figure 1a at the right end of the third line. In the third line, the three letters are supposedly the following from right to left: sz ksz.
Figure 2. An alternative drawing of the inscription. Here the red lines are those that seem extra to the letters that are apparently needed for a meaningful reading of the inscription.
It is possible to translate this as the word szex because, while the traditional Old Hungarian alphabet does not have an x letter, the convention is to render the letter x as a combination of Hungarian k and sz, which are equivalent to English k and s, respectively. Of course, the translation would then change to the following:
Enikő, my Enikő. I love sex [with you].—Peter Kun, Hungary
This alternative has some problems. First, the Hungarian word szex is a borrowed word that was first used only in 1958 according to Zaicz [11]. Hence, this would require
a late 20th century origin of the inscription. Second, the use of the two Old Hungarian k
letters would be inconsistent. Usually, the Old Hungarian diamond-shaped letter k is used
with front vowels, while the Old Hungarian Z-shaped letter k is used with back vowels.
The latter can also be used to express the frequent syllable ak because the vowel a can
be omitted.
The Altai inscription adheres to this custom because the Z-shaped letter is used in the
word Kun, which contains the back vowel u, while the diamond-shaped letter is used in the
words szeretlek, Enikő, and Enikőm, all of which contain front vowels. However, this custom
would be broken by the use of the Z-shaped letter in writing the word szex, which has a
front vowel. For these reasons and also the visual analysis that we presented in Section 3,
this alternative reading seems less plausible. We present this here mainly to show the
evolution of our thinking.
In Section 3, we mentioned that the scribe mixed up some letters because of misremembering some details. Instead, it is possible to imagine that the scribe remembered the letters correctly and at first wrote them correctly, as shown by the black lines in Figure 2. Then
he became embarrassed by the inscription and deliberately added the red lines shown in
Figure 2. The scribe may have thought that the addition of the red lines makes the original
inscription unreadable. The hypothesis of deliberately adding extra lines can be used with
either of our translations. Because it is only an explanation for the apparently mixed-up
letters, it can be accepted or rejected without changing the meaning of the translation. The
reason this hypothesis may be attractive is that whenever the scribe mixed up letters, the
intended letter, whether r or g, always has fewer lines than the actual written letter, whether
z or l.
6. Conclusions
We gave a new, correct transliteration and translation of the Old Hungarian inscription
from the Altai Mountain that was reported by Sartkožauly [1]. We also redated the inscription to the 19th century or later based on a linguistic argument. Although the inscription
did not prove to be as ancient as originally assumed, it still provides an amazing and
valuable cultural connection between the peoples near the Altai Mountain and Hungarians
in Central Europe.
Author Contributions: Conceptualization, methodology, and investigation, P.Z.R. and G.V.; draft
notes in Hungarian, drawing of Figure 2, G.V.; extension, writing in English, drawing of Figure 1a,b,
and dating using linguistic argument, P.Z.R. All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Sartkožauly, K. Complete Atlas of the Orkhon Monuments; Almaty Samga Press: Almaty, Kazakhstan, 2019; Volume 3.
2. Wikipedia. Old Hungarian Script. Available online: https://en.wikipedia.org/wiki/Old_Hungarian_script (accessed on 8 December 2021).
3. Benkő, E.; Sándor, K.; Vásáry, I. A Székely Írás Emlékei; Bölcsészettudományi Kutatóközpont: Budapest, Hungary, 2021.
4. Wikipedia. Enikő. Available online: https://en.wikipedia.org/wiki/Enikő (accessed on 13 January 2022).
5. Revesz, P.Z. Establishing the West-Ugric language family with Minoan, Hattic and Hungarian by a decipherment of Linear A. WSEAS Trans. Inf. Sci. Appl. 2017, 14, 306–335.
6. Varga, G. Magyar Hieroglif Írás; Írástörténeti Kutatóintézet: Budapest, Hungary, 2017.
7. Revesz, P.Z. Minoan archaeogenetic data mining reveals Danube Basin and western Black Sea littoral origin. Int. J. Biol. Biomed. Eng. 2019, 13, 108–120.
8. Revesz, P.Z. Data mining autosomal archaeogenetic data to determine Minoan origins. In Proceedings of the 25th International Database Engineering and Applications Symposium, Montreal, QC, Canada, 14–16 July 2021. [CrossRef]
9. Revesz, P.Z. Art motif similarity measure analysis: Fertile Crescent, Old European, Scythian and Hungarian elements in Minoan culture. WSEAS Trans. Math. 2019, 18, 264–287.
10. Maracskó, A. Hungarian Orientalism and the Zichy Expeditions. Master's Thesis, Central European University, Budapest, Hungary, 2014.
11. Zaicz, G. (Ed.) Etimológiai Szótár: A Magyar Szavak és Toldalékok Eredete; Tinta Press: Budapest, Hungary, 2006.
Decipherment Challenges Due to Tamga and Letter Mix-Ups in
an Old Hungarian Runic Inscription from the Altai Mountains
Peter Z. Revesz
School of Computing, College of Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA;
[email protected]; Tel.: +1-402-421-6990
Abstract: An Old Hungarian Runic inscription from the Altai Mountains with 40 signs has posed
some special challenges for decipherment due to several letter mix-ups and the use of a tamga sign,
which is the first reported use of a tamga within this type of script. This paper gives a complete and
correct translation and draws some lessons that can be learned about decipherment. It introduces sign
similarity matrices as a method of detecting accidental misspellings and shows that sign similarity
matrices can be efficiently computed. It also explains the importance of simultaneously achieving the
three criteria for a valid decipherment: correct signs, syntax, and semantics.
Keywords: decipherment; error correction; inscription; Old Hungarian Runic script; sign; similarity
matrix; tamga
1. Introduction
The history of paleography had never seen a case in which a scribe came forward and told the would-be decipherers that they were wrong. Embarrassingly, something like that
happened to us after we published [1] our decipherment of a puzzling Old Hungarian
Runic (Hungarian: székely írás [2], székely-magyar rovás or rovásírás [3]) inscription that was
previously described by Karžaubaj Sartkožauly, a member of the Kazakhstan Academy of
Sciences, in a three-volume monograph on the Orkhon script [4], where he presumed the
inscription to be from the seventh century BC.
The Hungarian name is alternatively translated as Székely-Hungarian Rovash [5] or
Old Hungarian [3]. The term ‘Old Hungarian’ may be confusing because it is used by
some scholars to refer to the Latin alphabet-based script that was used from the 10th to the
16th century in Hungary. The extended name ‘Old Hungarian Runic’ inscription is clearer
because ‘runic’ means ‘relating to runes (magic marks or letters, especially the letters of an
ancient alphabet cut into stone or wood in the past)’ according to the Cambridge Dictionary.
Hence, English ‘runes’ and Hungarian ‘rovás’ both refer to the same means of writing.
Our journal article generated much public interest in Hungary. It was also featured in
a popular YouTube video on Hungarian history. Eventually, one viewer left a comment,
which can be translated into English as follows: ‘I carved this inscription into the rock at
the Mongolian Altai Mountains in the Bayan-Ölgii Province, near the upper flow of the
Uygariin River in June 2000’.
Finding the scribe allowed a unique opportunity to check our translation and ask
some details about the circumstances of the inscription. This was important because
the inscription consists of 40 signs, and, out of those 40 signs, a sequence of three signs
remained uncertain. The goal of this paper is to describe the problem with that sequence of
signs in our earlier paper and to propose a complete and correct translation. As part of the
analysis, the paper introduces the use of similarity matrices to check for misspellings and
draws some general lessons for decipherers of ancient inscriptions.
This paper is organized as follows: Section 2 gives some background information on
the Old Hungarian Runic script; Section 3 describes the data source and data curation;
Section 4 gives a transliteration of the signs. A sign similarity matrix is used to show that
the inscription contains some common misspellings; Section 5 reviews earlier decipherment proposals and evaluates them according to the criteria of correct signs, syntax, and
semantics; Section 6 gives the correct identification of the disputed sign group as a tamga;
Section 7 presents some lessons learned about decipherment; lastly, Section 8 presents some
conclusions and future work.
2. Background on the Old Hungarian Script
The Old Hungarian Runic script (Hungarian: székely írás or rovásírás) has been the subject of many studies [2,3,5]. An early book about the subject by Sebestyén [6] popularized the idea that the Old Hungarian Runic script is a descendant of the Old Turkic Orkhon script. This origin theory developed even before the Minoan civilization and its scripts were discovered on the island of Crete by Sir Arthur Evans. During a cryptographic study of the Minoan Linear A script, the author discovered its relationship with the Old Hungarian Runic script. More precisely, it was shown that the Minoan Linear A script is an ancestor of the Carian script, which is the ancestor of the Old Hungarian Runic script [7].
As the above history suggests, the Old Hungarian Runic script has developed considerably from its earliest form to the present. Table 1 shows its current state, which is also part of the Unicode standard. Even the two-letter Hungarian transliterations denote single phonemes [8]. There is only one remarkable exception to the purely alphabetic nature of the script: K1 and K2 are used with front and back vowels, respectively. This feature may hark back to an era when these were syllabic signs denoting KE and KA, respectively.
Table 1. The Old Hungarian Runic script with its Hungarian transliteration.
3. Data Sources and Data Curation
Karžaubaj Sartkožauly’s drawing had some minor inaccuracies. He included a photograph in his work. A new drawing based on that photo is shown in Figure 1. The drawing
shows that some parts of the inscription are unclear because of the drawings of the deer
and some cracks in the rock.
Figure 1. The author’s redrawing of the inscription based on the photograph in Sartkožauly [4].
Figure 2 shows an enhanced drawing with red highlighting of those elements that clearly belong to the inscription and with labels for the various groups of signs.
Those who are familiar with the Old Hungarian Runic script can easily recognize
many of the signs. Hence, one can suspect that some more elements also belong to the Old
Hungarian signs in sign group (d) in the middle of the drawing, where unfortunately the
tail of the female deer on the left and the antler of the stag on the right interfere with the
Old Hungarian signs. This interference results in at least two different interpretations as
shown in Figure 3.
Figure 2. An enhanced drawing of the inscription with red highlighting of those elements that
undisputedly belong to the inscription. The six sign groups are also labeled (a–f).
Figure 3. Two interpretations of sign group (d) in the middle of the photograph.
The first interpretation of sign group (d) leads to the following sign sequence:
The second interpretation, which contains an Old Hungarian A and N ligature, leads
to the following sign sequence:
While the N sign normally looks as shown above, a scribe could reverse the direction
for the sake of a ligature. The scribe also used an Ő-K1 ligature in sign group (b). The
difference in these two interpretations is a subtle matter of interpreting a few faintly
scratched lines. What the first interpretation considers the Old Hungarian S, the second
interpretation considers part of the antler of the stag.
The most logical way to handle ambiguities is to proceed further in the decipherment
because the context of the other words can help to choose among the choices. Hence, for
now, let us simply refer to these two sign group options as (d1) and (d2), respectively.
4. Transliteration and Correction of the Signs
Since Old Hungarian inscriptions are written from right to left, we first convert the
sign groups into a left-to-right order as shown in Table 2. Next, we also attempted a
transliteration to find the meaning of the words.
Table 2. The Altai Mountain inscription with incorrect signs highlighted in brown. [The inscription column, which consists of sign images, is not reproducible here.]

Row | Transliteration | Meaning
a | E N I K1 Ő | Enikő
b | E N I K1 Ő M | my Enikő
c | SZ E Z E T G E K1 |
d1 | SZ K2 SZ |
d2 | N A GY | great
e | M A GY A Z O Z SZ Á L |
f | K2 U N P É T E Z | Kun
It is apparent to Hungarian language speakers that some words do not make sense,
although they are close to common Hungarian words. For example, in sign group (f), the
intended name PÉTER can be easily recognized instead of the nonsense string PÉTEZ. This
suggests that the scribe made a spelling mistake. In particular, the scribe wrote the Old
Hungarian Z sign instead of the Old Hungarian R sign.
These two signs look similar; hence, it is understandable that such a mistake can be
made by someone who is not completely familiar with the script. The Altai Mountain
inscription uses a form of Z that has two legs. In many texts, including this paper, the
following slightly different form of Z is used:
Apparently, the scribe also mixed up the Old Hungarian signs G and L in the words
MAGYARORSZÁG and SZERETLEK. These two signs also look similar.
The incorrect signs and transliterated letters are highlighted in brown in Table 2. Those
signs and letters can be corrected to their intended versions as shown in Table 3.
Table 3. The Altai Mountain inscription after replacing incorrect signs with intended ones. [The inscription column, which consists of sign images, is not reproducible here.]

Row | Transliteration | Meaning
a | E N I K1 Ő | Enikő
b | E N I K1 Ő M | my Enikő
c | SZ E R E T L E K1 | I love you
d1 | SZ K2 SZ |
d2 | N A GY | great
e | M A GY A R O R SZ Á G | Hungary
f | K2 U N P É T E R | Kun
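To make the correction step concrete, the following short Python sketch applies the sign-level replacements that take Table 2 to Table 3. It is only an illustration: the token lists and the helper name correct are hypothetical, since the paper describes the correction in prose rather than code.

# A sketch of the Z -> R and G -> L corrections on sign-level tokens.
# Digraphs such as SZ and GY are kept as single tokens so that only
# the stand-alone Z and G signs are replaced.
CONFUSIONS = {"Z": "R", "G": "L"}

def correct(signs):
    """Replace each mixed-up sign with its intended counterpart."""
    return [CONFUSIONS.get(s, s) for s in signs]

row_c = ["SZ", "E", "Z", "E", "T", "G", "E", "K1"]  # row (c) of Table 2
print(correct(row_c))  # ['SZ', 'E', 'R', 'E', 'T', 'L', 'E', 'K1'], i.e., SZERETLEK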
The mix-up of the above pairs of Old Hungarian signs is a natural consequence of
their similar look. Nevertheless, it is possible to ask why exactly these signs are mixed up
in the inscription. To answer that question, we can apply a mathematically based approach
to sign similarities. This approach was developed in an earlier paper that compared the
Minoan Linear A, the Carian, and the Old Hungarian script [7]. The approach starts by
identifying which sign has which of the following thirteen features:
1. The symbol contains some curved line.
2. The symbol encloses some region.
3. The symbol has a slanted straight line.
4. The symbol contains parallel lines.
5. The symbol contains crossing lines.
6. The symbol's top is a wedge ∧.
7. The symbol's bottom is a wedge ∨.
8. The symbol's right side is a wedge >.
9. The symbol contains a stem, a straight vertical line that runs across the middle.
10. The symbol's bottom has two legs, two single lines touching the bottom.
11. The symbol's bottom has three legs, three single lines touching the bottom.
12. The symbol contains a hair, a small line extending from an enclosed space.
13. The symbol contains two triangles.
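To illustrate the feature analysis, the following Python sketch encodes a few signs as vectors of 1 and -1 values over the thirteen features, in the manner of the table in Figure 4. The feature values below are illustrative placeholders, not the actual entries of the paper's 34-sign feature matrix.

import numpy as np

# Each row encodes one sign over the 13 features listed above:
# 1 = the sign has the feature, -1 = it lacks the feature.
# The values are illustrative placeholders only.
FEATURES = 13
signs = ["R", "Z", "G", "L"]
A = np.array([
    [-1, -1, 1, 1, -1, -1, -1, -1, 1, 1, -1, -1, -1],   # "R" (illustrative)
    [-1, -1, 1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1],  # "Z" (illustrative)
    [-1, -1, 1, 1, -1, -1, -1, -1, 1, -1, -1, -1, -1],  # "G" (illustrative)
    [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1], # "L" (illustrative)
])
assert A.shape == (len(signs), FEATURES)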
Figure 4 shows a matrix that results from a feature analysis of the Old Hungarian
Runic signs in terms of the above 13 features.
Figure 5 shows a similarity matrix of the Old Hungarian signs. Each entry shows the
number of features on which the row and the column signs agree. Two signs agree on a
feature if they both contain the feature or both lack the feature. This means that they both
have a value of 1 or they both have a value of −1 for the same feature in the feature table in
Figure 4. We can propose the theorem below.
Figure 4. A feature analysis of the Old Hungarian Runic signs: 1 indicates that the sign in the row
contains the feature in the column; −1 indicates that it does not contain the feature. This analysis
uses the Altai Mountain version of the Z sign.
Theorem 1. Let A be an n × m feature matrix with n signs and m features. Furthermore, let A^T be the transpose of A, and let M be the n × n similarity matrix for the n signs. Then, the following formula holds:

M = 0.5((A × A^T) + C),  (1)

where C is a matrix in which each entry is m.
Proof. Consider any entry M[i, j] of the similarity matrix. This entry has the value

M[i, j] = 0.5((A[i] · A[j]) + m),  (2)

where the dot indicates the dot product of the two vectors. The inner parenthesis in Equation (2) contains the number of times signs i and j either both contain or both lack a feature minus the number of times they disagree on a feature, as follows:

1 × 1 = 1 when i and j both contain a feature.  (3)
(−1) × (−1) = 1 when i and j both lack a feature.  (4)
(−1) × 1 = −1 when i lacks and j contains a feature.  (5)
1 × (−1) = −1 when i contains and j lacks a feature.  (6)

Let agree be the number of times that cases (3) and (4) occur. Let disagree be the number of times that cases (5) and (6) occur. Then, the following must hold for any number of features m because the two signs must either agree or disagree on each feature:

m = agree + disagree.  (7)

Hence, according to the above observation and Equation (7), the inner parenthesis has the following value:

agree − disagree = agree − (m − agree) = 2 · agree − m.  (8)

From Equation (8), it can also be seen that

M[i, j] = 0.5((2 · agree − m) + m) = agree.  (9)

Therefore, the value of M[i, j] is the total number of features on which signs i and j agree, as required for the similarity matrix. QED.
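As a concrete check of Equation (1), here is a minimal NumPy sketch that computes the similarity matrix from the toy feature matrix A of the previous sketch; it is an illustration, not the author's own code.

# Equation (1): M = 0.5 * ((A x A^T) + C), where every entry of C is m.
n, m = A.shape
C = np.full((n, n), m)
M = 0.5 * (A @ A.T + C)
# M[i, j] is the number of features on which signs i and j agree;
# the diagonal equals m because every sign agrees with itself on all features.
print(M.astype(int))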
Theorem 1 is useful for the fast calculation of the similarity matrix given any feature
matrix. Theorem 1 was used to calculate the similarity matrix shown in Figure 5 from the
feature matrix shown in Figure 4. After the similarity matrix was calculated, the entries
with a similarity value of 12 or 13 between two different signs were highlighted in pink as
shown in Figure 5.
The similarity matrix had 34 × 33 = 1122 nondiagonal entries. Out of those, 52 (4.63%)
were marked pink. Intuitively, these pairs were those most likely to be confused with each
other according to this mathematical model.
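The selection of the highlighted pairs can be sketched in a few lines as well: collect the off-diagonal entries at or above the similarity threshold of 12 used in the paper. The function name confusable_pairs is a hypothetical choice, and M and signs are the toy objects from the sketches above, so the output does not reproduce the paper's 52 highlighted pairs.

def confusable_pairs(M, signs, threshold=12):
    """Return sign pairs whose similarity meets the threshold, skipping the diagonal."""
    n = len(signs)
    return [(signs[i], signs[j], int(M[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if M[i, j] >= threshold]

print(confusable_pairs(M, signs))  # with the toy data: [('R', 'Z', 12), ('R', 'G', 12), ('G', 'L', 12)]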
At my request, Klara Friedrich, a prominent researcher and teacher of the Old Hungarian Runic script, verified that, in her decades of experience, it is common to mix up the
following letters:
Among the above, the G–L pair has a similarity of 12, the R–Z and the Z–CS pairs
have similarities 11 and 13, respectively, and the D–I pair has a similarity of 12. Hence,
these frequently mixed up pairs also have high similarity scores according to the similarity
matrix in Figure 5. Thus, the strong agreement between the mathematical model and the
teacher’s experience shows that the G–L and R–Z pair mix-ups in the Old Hungarian Runic
inscription in Figure 1 were likely due to an accident.
Figure 5. A similarity matrix of the Old Hungarian Runic signs. Entries that indicate a similarity of
12 or 13 between two different signs are highlighted in pink.
Not everyone agrees with the accidental nature of the letter mix-ups. G. Varga imagined that the inscription had some sexual message. Moreover, he claimed that a male scribe
wrote every sign originally correctly, but he later deliberately changed the inscription by
adding extra lines for the sake of a woman called Enikő, who was embarrassed and ‘obviously did not want to make public what happened’. According to Varga, these deliberately
added extra lines explain the mix-up of the letters as shown in his figure (Figure 2 in [1]).
However, this theory runs into a major problem in explaining the incorrect G in the word
SZ E Z E T G E K1 .
Since scratches and carvings cannot be erased from a rock surface as they can be from paper, one cannot turn a correct L into an incorrect G, because that would require the deletion rather than the addition of a line. Hence, it is an untenable hypothesis that all the spelling mistakes were deliberately introduced to destroy the meaning of the writing.
5. Decipherment Requires Correct Signs, Syntax, and Semantics
Valid decipherment requires correct signs, syntax, and semantics. These can be defined
as described below.
1. Signs: This means a combination of two things.
First, the shapes of the signs are visually recognized correctly. As in the case of the Altai
Mountain inscription, shape recognition can be hindered by deficiencies in the visual quality
of the object (cracks in the rock, weathering, overwriting the signs by other inscriptions
and drawings, etc.) and deficiencies in the photographs available to the investigator. An
onsite investigation is almost always preferable to even the best available photograph.
Second, the visually correctly identified sign needs to also be correctly transliterated.
It is of no use to correctly discern the shape of a sign, and then incorrectly look up its
transliteration. Obviously, that cannot lead to a valid decipherment.
2. Syntax: This means that the words fit together according to the accepted grammatical rules. Moreover, the grammar must match the period of the inscription. For example, one cannot use present-day Hungarian grammar for an inscription from the Middle Ages. Translations that add suffixes purely from the imagination of the decipherer cannot be considered valid, even if the root words look acceptable. Even ancient Sumerian pictographs and cuneiforms reflect a well-formed, complex grammar.
3. Semantics: This means that the sentences and the story are meaningful. The meaningfulness of the text needs to be evaluated in terms of the time and other circumstances of the writing. For example, there should not be any anachronisms, such as talking about dinosaurs in an ancient text, because those became extinct long before the first scripts were developed.
In the Altai Mountain inscription, all the sign groups have an unambiguous reading
except sign group (d). Now, let us evaluate the proposal (d2), which is equivalent to the
word NAGY. If we read the sign groups in order from bottom up as shown in Figure 2, then
we obtain the following Hungarian sentence:
E N I K1 Ő, E N I K1 Ő M, SZ E R E T L E K1 .
N A GY- M A GY A R O R SZ Á G, K2 U N P É T E R.
Here, the Hungarian compound word Nagy-Magyarország ‘Greater Hungary’ refers to
the historical Hungary, which includes present day Hungary and territories in neighboring
countries where Hungarians live as minorities. It is necessary to add as an explanation
that a literary reference to Nagy-Magyarország does not mean territorial aspirations but is
only a reference to the international Hungarian ethnic community to which many minority
Hungarians feel they belong. Hence, the inscription can be translated as a grammatically
and semantically correct message as follows:
I love you Enikő, my Enikő!
–Peter Kun, Greater Hungary.
Now, let us consider the proposal (d1), which was SZ K2 SZ. One can immediately see
that this proposal has a weakness because this is not a meaningful word. It lacks vowels.
In the older, mostly medieval examples of Old Hungarian Runic inscriptions, the vowels
were often omitted when they did not affect the readability of the text. However, this is
clearly not a medieval text. Some orthographic considerations regarding the form of the
Old Hungarian signs support this assertion, but we can skip those considerations because
there is a simpler explanation of recentness, i.e., that the name Enikő was created by the
poet Vörösmarty (1800–1855) [9]. That linguistic consideration alone helps date the text
to after the latter half of the 19th century. Hence, we need to consider a period when the
omission of vowels was no longer practiced. This period includes a considerable revival of
interest in the Old Hungarian Runic script in the past 30 years.
It is unlikely that the scribe wrote down each vowel in every other word except in SZ K2 SZ. However, let us entertain this idea by trying to find a word. Since K2 requires a back vowel, a word that may be found is SZaK2aSZ or szakasz (International Phonetic Alphabet notation: /sakas/), with the meaning 'segment'. However, this lacks correct semantics because the phrase SZeReTLeK1 SZaK2aSZ 'I love you segment' makes no sense.
Mr. Varga suggested the Hungarian word szex (International Phonetic Alphabet notation: /seks/) with the meaning 'sex'. Since the letter X does not occur in the Old Hungarian Runic script, words with X are written down by a K SZ combination. Hence, let us try to write down the word as SZeK2SZ. That would violate the second condition of sign correctness because one needs to transliterate K2 as a consonant that occurs with a back vowel, while e is a front vowel.
The argument can be made that the scribe forgot about the differences between K1 and K2. However, this is unlikely because everywhere else the scribe uses these two signs correctly, as can be easily checked.
Front-vowel words: E N I K1 Ő, E N I K1 Ő M, SZ E R E T L E K1 .
Back-vowel word: K2 U N.
Apparently, the scribe is consistent in the use of K1 and K2, and there is no real logic in supposing that they broke this usage convention, and also wrote down the vowel explicitly, just in this one word. Moreover, SZeK2SZ is grammatically incorrect. A grammatically correct phrase would be the following:
E N I K1 Ő, E N I K1 Ő M, SZeK2SZ-uálisan SZ E R E T L E K1,
which means
I love you sexually Enikő, my Enikő.
However, the suffix -uálisan is completely absent. Hence, the SZeK2SZ word proposal is semantically correct, but it is incorrect in signs and syntax. Despite the above concerns,
this proposal of my coauthor was kept as an alternative together with my NAGY word
proposal. Unfortunately, we omitted to mention that sign group (d) may be a personal sign
or tamga, although Varga added the following endnote to his blog entry of 16 March 2022.
The top shows a screenshot of the original Hungarian text, with an English translation in
italics below.
(3) The word’s reading as ‘sex’ is supported by the fact that it explains why Peter Kun
tried to destroy the readability of the inscription. If this were a tamga, as Peter Revesz once
mentioned, then this deliberate destruction would be unexplained.
6. Identification of Sign Group (d) as a Tamga
A tamga is an emblem of a family, clan, or tribe. Tamgas were widely used by Eurasian
nomads as a mark of personal property such as in branding livestock. For example, the
early Bulgarian ruling dynasty, the Dulo clan, used the tamga shown in Figure 6a. This tamga was found, for example, on the back of a seventh to ninth century bronze rosette at Pliska, Bulgaria [10] and on a ninth century clay pot fragment at Zalavár, Hungary [11].
The Kayi was one of the 21 Oghuz Turkic tribes. The Kayi tamga is shown in Figure 6b.
Figure 6. The Dulo clan’s tamga (a), Kayi tamga (b), and Peter Kun’s tamga (c). Picture credits:
Wikipedia https://en.wikipedia.org/wiki/Dulo (accessed on 16 May 2022) and https://en.wikipedia.
org/wiki/Tamga (accessed on 16 May 2022).
Thanks to the publicity of our publication [1], as a wonderful crowdsourcing effort,
many people sent me various tips about who Peter Kun may be. When I got a tip about his
phone number, I called him, and he verified that he was the scribe of the Altai Mountain
inscription. I followed up our conversation with an email in which I asked some detailed
questions. In his reply, which is shown in Figure 7, he explains that sign group (d) is
a tamga. The middle K2 sign stands for his family name, Kun, which has a back vowel.
Hence, K2 is used instead of K1 , which would be appropriate for a name with a front-vowel.
The two parallel signs on the left and right sides of the tamga are symbols of the
Cumans, an ancient steppe people, whose domain extended from Hungary to Mongolia
ca. 1200. Sometimes, the parallel lines are replaced by two arrows or spears. The three
tamgas of Figure 6 all have two vertical parallel lines on the left and right sides. They differ
only in the middle letter that is enclosed between those two parallel lines. These letters are
Y-shaped for the Dulo clan, V-shaped for the Kayi tribe, and Z-shaped for Peter Kun. These
three tamgas can be classified as members of the same subgroup of Turkic tamgas.
Peter Kun created this tamga for his own use in honor of his Cuman ancestors, who
settled in a part of Hungary that is named after them to this day. It is called Kunság in
Hungarian. The Cuman descendants in Hungary have their own organization, and Peter
Kun serves as a leader in that organization. Peter Kun is also a cattle rancher and uses the
tamga as a branding sign for his cattle.
Peter Kun verified that he did not make any deliberate alterations of the signs. He
also explained that he was longing for Enikő, his wife, who was left behind in Hungary,
while he was traveling in the Altai Mountains and doing research. He has a doctorate in
Turkic studies. He even published a book about his research travels in Asia during which
he studied the equestrian culture of the Steppe nomads [12].
Hence, the entire inscription can be seen as follows:
E N I K1 Ő, E N I K1 Ő M, SZ E R E T L E K1 .
, M A GY A R O R SZ Á G, K2 U N P É T E R.
The tamga is not transliterated because it is a personal property symbol or emblem
that can stand for ‘Kun Ranch’. Hence, the correct translation into English is the following:
I love you Enikő, my Enikő!
–Peter Kun, Kun Ranch, Hungary.
Figure 7. Dr. Peter Kun’s email that verifies that he wrote the inscription in June 2000. This original
email contains some minor misspellings. For example, the names of ethnic groups are written in
lowercase letters, which is the common way of writing ethnic names in Hungarian.
7. Lessons Learned about Decipherment
That sign group (d) is a tamga did not seem plausible because there are no other
instances of the use of tamga signs within Old Hungarian Runic inscriptions. Hence, this
sign triplet can be termed a hapax legomenon maximus because it is not only unique within
the corpus of Old Hungarian Runic inscriptions, but it is also unique in it being a tamga.
The Kun Ranch tamga is easily confusable with an SZ K2 SZ sequence of Old Hungarian Runic signs as shown in Figure 8.
Figure 8. Confusability of Peter Kun’s tamga (left) and Old Hungarian signs (right).
The presence of a hapax legomenon maximus, together with the confusability of its elements with a sequence of Old Hungarian Runic signs, made a complete decipherment of the Altai Mountain inscription nearly impossible. It was only by luck that the actual scribe could be found and the exact meaning of the tamga was revealed to us.
Decipherers of ancient inscriptions may learn some valuable lessons from this work.
As Figure 9 shows, only the tamga is the correct solution in this case. Unfortunately, it was
not pursued enough because the other proposals were not rejected earlier. In particular, the SZeK2SZ proposal should have been dropped as soon as its problems became clear. My advice is to always look for a solution that satisfies the three S's of correct signs, syntax, and semantics and not to get stuck with any solution that fails any of these three criteria.
Figure 9. A valid decipherment needs to get three things correct: signs, syntax, and semantics. The
above Venn diagram places four proposals for sign group (d) on the basis of correctness according to
these three criteria.
8. Conclusions and Further Work
The Old Hungarian Runic inscription from the Altai Mountains now has a complete
decipherment. The story of this inscription taught several valuable lessons that may be
useful in the decipherment of other inscriptions in any script. Similarity matrices, which can be efficiently calculated using the formula in Theorem 1, may become generally used in future decipherments. They may be considered together with other machine-aided translation methods that use some type of similarity metric [13,14]. This may aid in the continuing decipherment of the Indus Valley Script [15] and the Minoan scripts [16–18].
The work was also personally satisfying in that it led to contact with the scribe, who happened to be a generous and hardworking person, a cattle farmer from the Great Hungarian Plain, and an adventurer. He is a great cultural ambassador between the peoples near the Altai
Mountains and Hungarians in Central Europe. May this work also help to strengthen the
cultural ties between the two regions.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Revesz, P.Z.; Varga, G. A proposed translation of an Altai Mountain inscription presumed to be from the 7th century BC. Information 2022, 13, 243. [CrossRef]
2. Benkő, E.; Sándor, K.; Vásáry, I. A Székely Írás Emlékei; Bölcsészettudományi Kutatóközpont: Budapest, Hungary, 2021.
3. Wikipedia. Old Hungarian Script. Available online: https://en.wikipedia.org/wiki/Old_Hungarian_script (accessed on 8 December 2021).
4. Sartkožauly, K. Complete Atlas of the Orkhon Monuments; Almaty Samga Press: Almaty, Kazakhstan, 2019; Volume 3.
5. Hosszú, G. Scriptinformatics: Extended Phenetic Approach to Script Evolution; Nap Kiadó: Budapest, Hungary, 2021.
6. Sebestyén, G. A Magyar Rovásirás Hiteles Emlékei; Magyar Tudományos Akadémia: Budapest, Hungary, 1915.
7. Revesz, P.Z. Establishing the West-Ugric language family with Minoan, Hattic and Hungarian by a decipherment of Linear A. WSEAS Trans. Inf. Sci. Appl. 2017, 14, 306–335.
8. Wikipedia. Hungarian Phonology. Available online: https://en.wikipedia.org/wiki/Hungarian_phonology (accessed on 14 May 2022).
9. Wikipedia. Enikő. Available online: https://en.wikipedia.org/wiki/Enikő (accessed on 13 January 2022).
10. Wikipedia. Pliska Rosette. Available online: https://en.wikipedia.org/wiki/Pliska_rosette (accessed on 6 July 2022).
11. Szőke, B.M. A Karoling-kor a Kárpát-Medencében; Magyar Nemzeti Múzeum: Budapest, Hungary, 2014.
12. Kun, P. Szelek Szárnyán; Arcadas Press: Debrecen, Hungary, 2003.
13. Tóth, L.; Hosszú, G.; Kovács, F. Deciphering historical inscriptions using machine learning methods. In Proceedings of the 10th International Conference on Logistics, Informatics and Service Sciences; Liu, S., Bohács, G., Shi, X., Shang, X., Huang, A., Eds.; Springer: Singapore, 2020; pp. 419–435. [CrossRef]
14. Daggumati, S.; Revesz, P.Z. Data mining ancient scripts to investigate their relationships and origins. In Proceedings of the 23rd International Database Engineering and Applications Symposium, Athens, Greece, 10–12 June 2019; ACM Press: New York, NY, USA, 2019; pp. 209–218. [CrossRef]
15. Daggumati, S.; Revesz, P.Z. A method of identifying allographs in undeciphered scripts and its application to the Indus Valley Script. Humanit. Soc. Sci. Commun. 2021, 8, 50. [CrossRef]
16. Revesz, P.Z. Bioinformatics evolutionary tree algorithms reveal the history of the Cretan Script Family. Int. J. Appl. Math. Inform. 2016, 10, 67–76.
17. Revesz, P.Z. A translation of the Arkalochori Axe and the Malia Altar Stone. WSEAS Trans. Inf. Sci. Appl. 2017, 14, 124–133.
18. Revesz, P.Z. Experimental evidence for a left-to-right reading direction of the Phaistos Disk. Mediterr. Archaeol. Archaeom. 2022, 22, 79–96.
MDPI
St. Alban-Anlage 66
4052 Basel
Switzerland
www.mdpi.com
Information Editorial Office
E-mail: [email protected]
www.mdpi.com/journal/information
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are
solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s).
MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from
any ideas, methods, instructions or products referred to in the content.
Academic Open
Access Publishing
mdpi.com
ISBN 978-3-7258-1370-4