Literature Review
Introduction
An Automated Essay Scoring (AES) system automatically evaluates and scores essays written in response to a question prompt, based on computational techniques. The main aim is to mimic or improve upon the way human raters score with respect to the reliability, objectivity, and efficiency of scoring. These systems are broadly applied in educational and standardized-testing settings, where the volume of content to be evaluated, including essays, is growing rapidly. Over the decades, AES systems have developed from simple rule-based systems into complex machine learning and deep learning models. In this literature review, we analyse the methods researchers have used in applying AES and highlight key research contributions across the field, including its methodologies, open problems, and advancements. [8] [9] [11]
Traditional Approaches
Automated essay scoring has a long history. In the early 1960s, the Project Essay Grade (PEG) system [1], one of the first automated essay scoring systems, was developed. In the first attempts, four raters defined criteria (proxy variables) while assessing 276 essays in English written by 8th to 12th graders. The system uses standard multiple regression analysis, with the defined proxy variables (text features such as document length, grammar, or punctuation) as the independent variables and the human-rated essay score as the dependent variable. With a multiple correlation coefficient of 0.71 for both the training and test sets, and given that the weights derived from the training set predicted the human scores of the test set well, the computer reached results comparable to humans. However, with a correlation of 0.51 for average document length alone, PEG could easily be manipulated by writing longer texts. Score prediction of this kind therefore considers only surface aspects and ignores the semantics and content of the essays.
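To make this regression setup concrete, the following minimal sketch (Python with scikit-learn) fits human ratings from PEG-style proxy variables; the feature values and scores are invented for illustration and do not reproduce PEG's actual data or weights.

```python
# PEG-style scoring sketch: ordinary least-squares regression on surface
# "proxy" features. All numbers below are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical proxy variables per essay:
# [document length in words, grammar errors, punctuation marks]
X_train = np.array([
    [250, 4, 18],
    [410, 2, 31],
    [180, 7, 12],
    [520, 1, 40],
    [300, 5, 22],
])
y_train = np.array([3.0, 4.5, 2.5, 5.0, 3.5])  # human-rated scores

model = LinearRegression().fit(X_train, y_train)
print("learned feature weights:", model.coef_)
print("predicted score:", model.predict([[350, 3, 25]]))
```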
In subsequent studies, the n-gram model based on the bag-of-words (BOW) approach was commonly used for a number of decades. BOW models extract features from student essays; in the case of the n-gram model, by counting the occurrences of terms consisting of n words. They then consider the number of shared terms between essays of the same class and model their relationship. An often-cited AES system using BOW is the Electronic Essay Rater (e-rater for short). The e-rater predicts scores for essay responses and was originally applied to responses written by non-native English speakers taking the Graduate Management Admission Test (GMAT; 13 considered questions) and the Test of English as a Foreign Language (TOEFL; two considered questions). Two features used in the e-rater are extracted by content vector analysis programs named EssayContent and ArgContent. While the former uses a BOW approach on the full response, the latter (ArgContent) uses a weighted BOW approach for each required argument in a response. Using these features alone results in average accuracies of 0.69 and 0.82, respectively. Including all 57 predictive features of the e-rater, the accuracy ranges from 0.87 to 0.94 (the number of responses varies between 260 and 915 per question, with a mean of 638). For a general overview of the types of relevant features used in such feature-based approaches, we additionally recommend the paper by Ke and Ng. [2]
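The following sketch illustrates the n-gram BOW idea using scikit-learn's CountVectorizer; the essays, score classes, and the logistic-regression classifier are invented stand-ins and do not reproduce e-rater's actual 57-feature model.

```python
# BOW/n-gram feature extraction sketch: each column of X counts the
# occurrences of one unigram or bigram across an essay.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

essays = [
    "The economy grows when trade barriers fall.",
    "Trade barriers harm the economy and slow growth.",
    "My summer holiday was fun and relaxing.",
]
scores = [4, 4, 2]  # hypothetical discrete score classes

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(essays)

# Essays of the same class share many terms, which the classifier exploits.
clf = LogisticRegression(max_iter=1000).fit(X, scores)
print(clf.predict(vectorizer.transform(["Trade helps the economy grow."])))
```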
The above models are based on laborious, theory-driven feature extraction. Methods such as latent semantic analysis (LSA) were therefore established. LSA models are corpus-specific and are trained on texts related to the given essay topic. They provide a semantic representation of an essay, which is then compared to the semantic representations of other, similarly scored responses. In this way, a feature vector, or embedding vector, is constructed for an essay and then used to predict the score. Foltz, Laham, and Landauer used LSA in the Intelligent Essay Assessor system and compared its accuracy against two human raters. The analysis is based on 1205 essay answers on 12 diverse topics given in the GMAT. The system achieves correlations of 0.70-0.71 with the human raters.
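As a minimal illustration of the LSA approach, the sketch below builds TF-IDF vectors, reduces them with truncated SVD to obtain latent semantic representations, and predicts a score as a similarity-weighted average over graded essays; the corpus, dimensionality, and scoring rule are illustrative assumptions, not the Intelligent Essay Assessor's actual configuration.

```python
# LSA sketch: latent semantic space via truncated SVD over TF-IDF vectors,
# then similarity-based score prediction. Data is invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Photosynthesis converts light energy into chemical energy.",
    "Plants use sunlight to make sugar from carbon dioxide and water.",
    "Plants are green and grow in soil.",
]
scores = np.array([5.0, 4.0, 2.0])  # hypothetical human scores
new_essay = "Sunlight drives the chemical reactions of photosynthesis."

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts + [new_essay])

# Project all essays into a low-dimensional latent semantic space.
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

# Predict the new essay's score as a similarity-weighted average of the
# scores of similarly placed graded essays.
sims = np.clip(cosine_similarity(Z[-1:], Z[:-1]).ravel(), 0, None)
print(f"predicted score: {(sims * scores).sum() / sims.sum():.2f}")
```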
AES systems advanced significantly with the introduction of more sophisticated natural
language processing (NLP) techniques. Burstein et al. developed the e-rater system [2],
which used linguistic features and statistical models to evaluate essays. Similarly, the
Intelligent Essay Assessor (IEA) [3] employed Latent Semantic Analysis (LSA) to
compare student essays with reference texts, capturing more semantic and stylistic nuances
than earlier models.
With the rise of machine learning and deep learning, AES systems became more
effective. Taghipour & Ng (2016) [4] applied Recurrent Neural Networks (RNNs) to AES,
significantly improving performance over feature-based models. Dong et al. (2017) [5]
further extended this by using a multi-aspect RNN to jointly assess grammar, content, and
style. The most recent advancements have come from Transformer-based architectures [6],
with researchers like Zhao, Hou, and Du (2020) using self-attention mechanisms to capture
complex relationships between words in longer, more complex essays.
For the application of a transformer model, it is fundamental to understand the three basic architecture types of a transformer (encoder models, decoder models, and encoder-decoder models), the tasks each type is predominantly used for, and exemplary models that have been implemented and trained based on the respective type. [7]
In general, transformer models are neural networks based on the so-called attention mechanism and were originally introduced in the context of language translation. The attention mechanism was presented in 2014 by Bahdanau et al. [12]. They showed that, instead of encoding a text from the source language into a single vector representation and then decoding that vector representation into the text of the target language, the attention mechanism avoids this vector-representation bottleneck between the encoder and decoder by allowing the model to search directly for relevant tokens in the source text when predicting the next token of the target text.
While earlier encoder and decoder models for translation tasks were mainly based on RNNs, Vaswani et al. [13] showed that they can be implemented with an attention mechanism alone. Previously, the attention mechanism had only been used to let translation models weigh the relevance of single words in the source language when predicting a single word in the target language, which yields a more integral connection between the encoder and decoder models. To implement the encoder and decoder without RNNs, Vaswani et al. introduced a self-attention mechanism for the texts of the source and target language, respectively, in which different attention layers tell the model to focus on informative words and neglect irrelevant words in the texts. They showed that, this way, the model sets new performance records on several translation benchmarks at a fraction of the training cost of the best previously used models.
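The core of this self-attention mechanism can be written in a few lines. The NumPy sketch below computes scaled dot-product attention for a single head; the random matrices stand in for learned projection weights, and multi-head attention, masking, and positional encodings are omitted.

```python
# Scaled dot-product self-attention sketch (single head, no masking).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8            # 4 tokens, 8-dim embeddings

X = rng.normal(size=(seq_len, d_model))    # token embeddings
W_q = rng.normal(size=(d_model, d_k))      # stand-ins for learned projections
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
# Every token attends to every token; scores are scaled by sqrt(d_k).
attn = softmax(Q @ K.T / np.sqrt(d_k))
output = attn @ V                          # contextualised representations
print(attn.round(2))                       # each row sums to 1
```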
A major advantage of the transformer architecture is that it allows parallel processing of the input data and is not affected by the vanishing gradient problem. This makes it possible to train on ever larger datasets, resulting in ever better language models. Today, transformer-based architectures are applied to a large variety of tasks, not only in NLP but also in fields such as computer vision, predictions based on time series or tabular data, and multimodal tasks, and in the majority of these fields transformer-based models currently lead the benchmarks. The additional capabilities come at a price, however. While models pre-trained on enormous datasets increase performance, the computational processing of billions of parameters is costly and time consuming. This cost restricts the training of such models to large corporations or governmental institutions and could therefore prevent independent researchers or smaller organizations from gaining equivalent access to such models.
Automated Essay Scoring (AES) systems have thus evolved significantly from rule-based models to advanced machine learning and deep learning frameworks. Early systems like Project Essay Grade (PEG) relied on basic features such as word count, grammar, and punctuation, predicting scores through regression analysis, but they had limited ability to evaluate deeper aspects like content and semantics and struggled with problems such as overemphasis on document length.
The development of more sophisticated models, such as e-rater and the Intelligent Essay Assessor (IEA), introduced natural language processing (NLP) and Latent Semantic Analysis (LSA), improving AES performance by analysing the semantic similarity of essays. Despite these improvements, such models remained limited in handling complex essays and unconventional writing styles.
The next significant leap in AES came with Recurrent Neural Networks (RNNs) and Long
Short-Term Memory (LSTM) networks. These models could handle sequential text and
understand context better. Researchers like Taghipour & Ng (2016) demonstrated that RNN-
based models outperformed traditional feature-based models in essay scoring. Other
advances, such as multi-aspect RNNs, considered several dimensions like content, grammar,
and style jointly, further improving AES accuracy.
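As a sketch of this family of models, the following PyTorch module follows the spirit of Taghipour & Ng's architecture (embedding layer, LSTM, mean-over-time pooling, sigmoid output mapping to a normalised score); the vocabulary size and dimensions are illustrative choices, not the published hyperparameters.

```python
# RNN-based essay scorer sketch: embeddings -> LSTM -> mean-over-time
# pooling -> sigmoid score in [0, 1]. Dimensions are illustrative.
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=50, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_ids):               # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, hidden)
        pooled = h.mean(dim=1)                  # mean over time steps
        return torch.sigmoid(self.out(pooled))  # normalised essay score

model = LSTMScorer()
fake_batch = torch.randint(0, 5000, (2, 120))   # two essays of 120 tokens
print(model(fake_batch).shape)                  # torch.Size([2, 1])
```

Trained end-to-end against human scores, such a model learns its own features rather than relying on hand-crafted proxies.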
Recent breakthroughs have been driven by Transformer-based architectures, such as
BERT and GPT. These models, particularly the self-attention mechanism, enabled AES
systems to process essays more contextually and capture complex word relationships, greatly
enhancing the quality of scoring for long and intricate essays. For example, Zhao, Hou, and
Du (2020) showed how Transformer-based models outperform RNNs and LSTMs in tasks
involving complex sentence structures.
Proposed Model
The proposed Automated Essay Scoring (AES) system is designed specifically for the Indian Administrative Service (IAS) examination, addressing the specialized needs of civil service
essay evaluations. By incorporating domain-specific knowledge, the system is able to assess
critical skills like argument structure, logical coherence, and critical thinking, which are
crucial for candidates aiming for roles in public administration and governance. The system
also supports Optical Character Recognition (OCR) technology to handle both typed and
handwritten essays, ensuring a fair and inclusive assessment process for candidates
submitting essays in different formats.
In addition to its advanced scoring capabilities, the system uses Natural Language Processing
(NLP) techniques such as BERT embeddings and machine learning models to capture deep
contextual understanding and assess content relevance. The model is aligned with IAS
grading rubrics to ensure consistency with human evaluators' standards, and it includes a bias
detection layer to prevent unfair scoring based on demographic factors like language
proficiency or socio-economic background. Overall, the system offers a sophisticated and
equitable solution for essay evaluation in civil services exams.
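A minimal sketch of the embedding step described above, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and a simple ridge regressor on top of the [CLS] embedding; the model name, regressor, and example essays are illustrative assumptions rather than the system's actual configuration.

```python
# BERT-embedding scoring sketch: a frozen encoder turns each essay into a
# fixed vector; a linear model maps vectors to rubric scores.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(essays):
    batch = tokenizer(essays, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] embeddings

essays = ["Good governance requires transparency and accountability.",
          "Federalism distributes power between centre and states."]
human_scores = [7.5, 6.0]  # hypothetical rubric-aligned scores

regressor = Ridge().fit(embed(essays), human_scores)
print(regressor.predict(embed(["Reform depends on institutional capacity."])))
```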
Framework for Model Formulation
The model is developed in eight phases:
1. Data Collection: essay collection and criterion selection happen in this phase.
2. Data Preprocessing: cleaning the data.
3. Feature Extraction: transforming the text into numerical features.
4. Model Selection: choosing an appropriate machine learning model.
5. Evaluation: assessing model performance and scoring (see the evaluation sketch below).
6. Deployment: implementing the trained model in a web or mobile interface.
7. Fine Tuning: ensuring domain expertise in the scoring process.
8. Feedback Generation: using an LLM to provide personalised feedback.
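For the evaluation phase (step 5 above), agreement between model and human scores must be quantified. The sketch below uses quadratic weighted kappa via scikit-learn, a metric commonly reported for AES systems; the framework itself does not prescribe a specific metric, and the ratings are invented.

```python
# Evaluation sketch: quadratic weighted kappa between human ratings and
# model predictions. Values are invented for illustration.
from sklearn.metrics import cohen_kappa_score

human = [3, 4, 2, 5, 4, 3]      # hypothetical human ratings
predicted = [3, 4, 3, 5, 4, 2]  # hypothetical model predictions

qwk = cohen_kappa_score(human, predicted, weights="quadratic")
print(f"quadratic weighted kappa: {qwk:.3f}")
```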
References
[1] E. B. Page, "The imminence of... grading essays by computer," *Phi Delta Kappan*, vol. 47, no. 5, pp. 238-243, 1966.
[2] J. Burstein, "The E-rater® Scoring Engine," *ResearchGate*, 1999. [Online]. Available: https://www.researchgate.net/publication/274309476_Attali_Y_Burstein_J_2006_Automated_essay_scoring_with_e-raterR_V2_Journal_of_Technology_Learning_and_Assessment_43 [Accessed: Sept. 22, 2024].
[3] T. K. Landauer, "The Intelligent Essay Assessor (IEA)," *ResearchGate*, 2003. [Online]. Available: https://www.researchgate.net/publication/256980328_The_Intelligent_Essay_Assessor [Accessed: Sept. 22, 2024].
[4] K. Taghipour and H. T. Ng, "A neural approach to automated essay scoring," *ResearchGate*, 2016. [Online]. Available: https://www.researchgate.net/publication/305748202_A_Neural_Approach_to_Automated_Essay_Scoring [Accessed: Sept. 22, 2024].
[5] Y. Dong, Y. Zhang, and Z. Yang, "Attention-based recurrent convolutional neural network for automatic essay scoring," *ResearchGate*, 2017. [Online]. Available: https://www.researchgate.net/publication/318739534_Attention-based_Recurrent_Convolutional_Neural_Network_for_Automatic_Essay_Scoring [Accessed: Sept. 22, 2024].
[6] S. Ludwig, C. Mayer, C. Hansen, K. Eilers, and S. Brandt, "A transformer-based framework for automated essay scoring," *arXiv preprint*, 2021. [Online]. Available: https://arxiv.org/pdf/2110.06874 [Accessed: Sept. 22, 2024].
[7] X. Zhang, "Automated essay scoring using transformer models," *MDPI*, vol. 3, no. 4, 2023. [Online]. Available: https://www.mdpi.com/2624-8611/3/4/897 [Accessed: Sept. 22, 2024].
[8] K. Yang, M. Rakovic, Y. Li, Q. Guan, D. Gašević, and G. Chen, "Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability," *arXiv preprint*, 2023. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2401.05655 [Accessed: Sept. 22, 2024].
[9] Y. Ding, M. Bexte, and A. Horbach, "A multi-task learning study on automatic scoring of argumentative essays," *ACL Anthology*, 2023. [Online]. Available: https://aclanthology.org/2023.findings-acl.825.pdf [Accessed: Sept. 22, 2024].
[10] C. M. Ormerod, A. Malhotra, and A. Jafari, "Automated essay scoring using efficient transformer-based language models," *arXiv preprint*, 2023. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2102.13136 [Accessed: Sept. 22, 2024].
[11] C. Xiao, W. Ma, S. X. Xu, K. Zhang, Y. Wang, and Q. Fu, "From automation to augmentation: Large language models elevating essay scoring," *arXiv preprint*, 2023. [Online]. Available: https://ar5iv.labs.arxiv.org/html/2401.06431v2 [Accessed: Sept. 22, 2024].
[12] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," *arXiv preprint* arXiv:1409.0473, 2014. [Online]. Available: https://arxiv.org/abs/1409.0473 [Accessed: Sept. 22, 2024].
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," *arXiv preprint* arXiv:1706.03762, 2017. [Online]. Available: https://arxiv.org/abs/1706.03762 [Accessed: Sept. 22, 2024].