From LLMs to LLM-based Agents
Abstract—With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents, a novel technology with the potential for Artificial General Intelligence (AGI), combine LLMs as the core for decision-making and action-taking, addressing some of the inherent limitations of LLMs such as lack of autonomy and self-improvement. Despite numerous studies and surveys exploring the possibility of using LLMs in software engineering, there is still no clear distinction between LLMs and LLM-based agents, and it is still at an early stage for a unified standard and benchmark to qualify an LLM solution as an LLM-based agent in its domain. In this survey, we broadly investigate the current practice and solutions for LLMs and LLM-based agents for software engineering. In particular, we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We review and differentiate the work of LLMs and LLM-based agents across these six topics, examining their differences and similarities in tasks, benchmarks, and evaluation metrics. Finally, we discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering. We anticipate this work will shed some light on pushing the boundaries of LLM-based agents in software engineering for future research.

Index Terms—Large Language Models, LLM-based Agents, Software Engineering, Benchmark, Software Security, AI System Development

Haolin Jin, Linghan Huang and Huaming Chen are with the School of Electrical and Computer Engineering, The University of Sydney, Sydney, 2006, Australia (email: [email protected]). Haipeng Cai is with the School of Electrical Engineering and Computer Science at Washington State University, US. Jun Yan is with the School of Computing and Information Technology at University of Wollongong, Australia. Bo Li is with the Computer Science Department at the University of Chicago, US.

I. INTRODUCTION

SOFTWARE engineering (SE) has seen booming research and development with the aid of artificial intelligence techniques. Traditional approaches leveraging neural networks and machine learning have facilitated various SE topics such as bug detection, code synthesis, and requirements analysis [1] [2]. However, they often present limitations, including the need for extensive feature engineering, scalability issues, and limited adaptability across diverse codebases. The rise of Large Language Models (LLMs) has brought new solutions and findings to this landscape. LLMs, such as GPT [3] and Codex [4], have demonstrated remarkable capabilities in handling downstream tasks in SE, including code generation, debugging, and documentation. These models leverage vast amounts of training data to generate human-like text, offering unprecedented levels of fluency and coherence. Studies have shown that LLMs can enhance productivity in software projects by providing intelligent code suggestions, automating repetitive tasks, and even generating entire code snippets from natural language descriptions [5].

Despite their potential, there are significant challenges in applying LLMs to SE. One major issue is their limited context length [6], which restricts a model's ability to comprehend and manage extensive codebases, making it challenging to maintain coherence over prolonged interactions. Hallucination is another main concern, where the model generates code that appears plausible but is actually incorrect or nonsensical [7], potentially introducing bugs or vulnerabilities if not carefully reviewed by experienced developers. Additionally, the inability of LLMs to use external tools restricts their access to real-time data and prevents them from performing tasks outside their training scope, diminishing their effectiveness in dynamic environments. These limitations significantly impact the application of LLMs in SE and also highlight the need for expert developers to critically refine and validate LLM-generated code for accuracy and security [8]. In complex projects, the static nature of LLMs can hinder their ability to adapt to changing requirements or efficiently incorporate new information. Moreover, LLMs typically cannot interact with external tools or databases, which further limits their utility in dynamic and evolving SE contexts.

To address these challenges, LLM-based agents have emerged [9] [10], combining the strengths of LLMs with external tools and resources to enable more dynamic and autonomous operations. These agents leverage recent advancements in AI, such as Retrieval-Augmented Generation (RAG) and tool utilization, to perform more complex and contextually aware tasks [11]. For instance, OpenAI's Codex has been integrated into GitHub Copilot [12], enabling real-time code suggestions and completion within development environments. Unlike static LLMs, LLM-based agents can perform a wide range of tasks, such as autonomously debugging code by identifying and fixing errors, proactively refactoring code to enhance efficiency or readability, and generating adaptive test cases that evolve alongside the codebase. These features make LLM-based agents a powerful tool for SE, capable of handling more complex and dynamic workflows than traditional LLMs.
Fig. 2: Paper Distribution
Historically, AI agents focused on autonomous actions based on predefined rules or learning from interactions [13] [14]. The integration of LLMs has presented new opportunities in this area, providing the language understanding and generative capabilities needed for more sophisticated agent behaviors. [10] shows that LLM-based agents are capable of autonomous reasoning and decision-making, achieving the third and fourth levels of WS (World Scope) [15], which outlines the progression from natural language processing (NLP) to general AI. In software engineering, LLM-based agents show promise in areas such as autonomous debugging, code refactoring, and adaptive test generation, demonstrating capabilities that approach artificial general intelligence (AGI).

In this work, we present, to the best of our knowledge, the first survey outlining the integration and transformation of LLMs to LLM-based agents in the domain of SE. Our survey covers six key themes in SE:

1) Requirement Engineering and Documentation: Capturing, analyzing, and documenting software requirements, as well as generating user manuals and technical documentation.
2) Code Generation and Software Development: Automating code generation, assisting in the development lifecycle, refactoring code, and providing intelligent code recommendations.
3) Autonomous Learning and Decision Making: Highlighting the capabilities of LLM-based agents in autonomous learning, decision-making, and adaptive planning within SE contexts.
4) Software Design and Evaluation: Contributing to design processes, architecture validation, performance evaluation, and code quality assessment.
5) Software Test Generation: Generating, optimizing, and maintaining software tests, including unit tests, integration tests, and system tests.
6) Software Security & Maintenance: Enhancing security protocols, facilitating maintenance tasks, and aiding in vulnerability detection and patching.

In detail, we aim to address the following research questions:

• RQ1: What are the state-of-the-art techniques and practices in LLMs and LLM-based agents for SE? (Sections IV-IX)
• RQ2: What are the key differences in task performance between LLMs and LLM-based agents in SE applications? (Sections IV-IX)
• RQ3: Which benchmark datasets and evaluation metrics are most commonly used for assessing the performance of LLMs and LLM-based agents in SE tasks? (Sections IV-IX and Section X)
• RQ4: What are the predominant experimental models and methodologies employed when utilizing LLMs in SE? (Section X)

II. EXISTING WORKS AND THE SURVEY STRUCTURE

A. Existing works

In recent years, large language models have been primarily applied to help programmers generate code and fix bugs. These models understand and complete code or text based on the user's input, leveraging their training data and reasoning capabilities. In previous survey papers, such as Angela Fan's research [8], there has not been much elaboration on requirement engineering. As mentioned in that paper, software engineers are generally reluctant to rely on LLMs for higher-level design goals. However, with LLMs achieving remarkable improvements in contextual analysis and reasoning abilities through various methods like prompt engineering and Chain-of-Thought (CoT) [16], their applications in requirement engineering are gradually increasing. Table I summarizes and categorizes the tasks in requirement engineering. Many studies utilize models for requirement classification and generation. Since the collection primarily focuses on the latter half of 2023 and before April 2024, and some papers address multiple tasks, the table does not reflect the exact number of papers we have collected.

While other works have surveyed LLM applications in some SE tasks [17] [8] [18], they lack a wider coverage of the general SE area to incorporate recent research developments. More importantly, LLMs are the main focus and contribution of these works, and there is no clear distinction of the capabilities
between LLMs and LLM-based agents. We summarize the differences between our work and others in Table II; this survey addresses these limitations by distinctly analyzing LLM and LLM-based agent applications across six SE domains, providing a thorough and up-to-date review. From previous research, it is evident that the performance of LLMs in various applications and tasks heavily depends on the model's inherent capabilities [10]. More importantly, earlier surveys often present findings from papers spanning a wide range of publication dates, leading to significant content disparities for LLMs in different SE tasks. For instance, research in requirement engineering was relatively nascent, resulting in sparse content in this area in previous surveys. The recent rise of LLM-based agents, with their enhanced capabilities and autonomy, fills these gaps. By focusing on the latest research and clearly differentiating between LLMs and LLM-based agents, our survey provides a thorough and in-depth overview of how these technologies are applied and the new opportunities they bring to SE.

In summary, we have collected a total of 117 papers directly relevant to this topic, covering the six SE domains mentioned earlier, as shown in Figure 1. Our analysis distinguishes between LLM and LLM-based agent contributions, offering a comparative overview and addressing the limitations of previous surveys. Considering the novelty of the LLM-based agents field and the lack of standardized benchmarks, this work seeks to offer a detailed review that can guide future research and provide a clearer view of the potential of these technologies in SE.
Topics and their associated keywords:

Software Security & Maintenance: Software security, Vulnerability detection, Automated program repair, Self-debugging, Vulnerability reproduction

Code Generation and Software Development: Code generation, Automatic code synthesis, Code refactoring, Programming language translation, Software development automation, Code completion, AI-assisted coding, Development lifecycle automation

Requirement Engineering and Documentation: Requirement engineering, Software requirements analysis, Automated requirement documentation, Technical documentation generation, User manual generation, Documentation maintenance, Requirements modeling, Requirements elicitation

Software Design and Evaluation: Software design automation, Architectural validation, Design optimization, Performance evaluation, Code quality assessment, Software metrics, Design pattern recognition, Architectural analysis, Code structure analysis

Software Test Generation: Test case generation, Automated testing, Unit test generation, Integration test generation, System test generation, Test suite optimization, Fault localization, Test maintenance, Regression testing, Adaptive testing

Autonomous Learning and Decision Making: Autonomous learning systems, Decision making, Adaptive planning, Project management automation, Self-improving software, Autonomous software agents
construct the correct translated text. A recent example of this architecture is the CodeT5+ model, launched by Salesforce AI Research in 2023 [30]. This model is an enhancement of the original T5 architecture, designed to improve performance in code understanding and generation tasks. It incorporates a flexible architecture and diversified pre-training objectives to optimize its effectiveness in these specialized areas. This development highlights the competency of Encoder-Decoder architectures in tackling increasingly complex NLP challenges.

The Encoder-only architecture, as the name suggests, eliminates the decoder from the entire structure, making the data representation more compact. Unlike RNNs, this architecture is stateless and uses a masking mechanism that allows input processing without relying on hidden states, which also accelerates parallel processing and provides excellent contextual awareness. BERT (Bidirectional Encoder Representations from Transformers) is a representative model of this architecture, a large language model built solely on the encoder. BERT leverages the encoder's powerful feature extraction capabilities and pre-training techniques to learn bidirectional representations of text, achieving outstanding results in sentiment analysis and contextual analysis [31].

The Decoder-only architecture, in the transformer framework, primarily involves the decoder receiving processed word vectors and generating output. Utilizing the decoder to directly generate text accelerates tasks such as text generation and sequence prediction. This highly scalable characteristic is known as auto-regressiveness, which is why popular models like GPT use this architecture. In 2020, the exceptional performance of GPT-3 and its remarkable few-shot learning capabilities demonstrated the vast potential of the decoder-only architecture [32]. Given the enormous computational cost and time required to train a model from scratch, and the exponential increase in the number of parameters, many researchers now prefer to leverage pre-trained models for further research. The most popular open-source pre-trained language model, LLaMA, developed by Meta AI, also employs the decoder-only architecture [33]; as mentioned earlier, the autoregressiveness and simplicity of this structure make the model easier to train and fine-tune.
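The auto-regressive property described above can be made concrete with a small sketch. The following is a minimal illustration, assuming the Hugging Face transformers library and the public gpt2 checkpoint as a small stand-in for GPT-style decoder-only models; it is not tied to any framework discussed in this survey.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("def add(a, b):", return_tensors="pt").input_ids
for _ in range(10):                      # greedily extend by ten tokens
    logits = model(ids).logits           # scores for every vocabulary item at each position
    next_id = logits[0, -1].argmax()     # most likely next token given everything so far
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Each new token is conditioned only on the tokens generated so far, which is exactly the property that makes decoder-only models straightforward to train, fine-tune, and sample from.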
C. Large Language Model Based Agent

The concept of agents traces back to the 19th century and is often referred to as intelligent agents, envisioned to possess intelligence comparable to humans. Over the past few decades, as AI technology has evolved, the capabilities of AI agents have significantly advanced, particularly with reinforcement learning. This development has enabled AI agents to autonomously handle tasks and to learn and improve based on specified reward/punishment rules. Notable milestones include AlphaGo [34], which leveraged reinforcement learning to defeat the world champion in Go. The success of GPT has further propelled the field, with researchers exploring the use of large language models as the "brain" of AI agents, thanks to GPT's powerful text understanding and reasoning capabilities. In 2023, a research team from Fudan University [10] conducted a comprehensive survey on LLM-based agents, examining their perception, behavior, and cognition. Traditional LLMs typically generate responses based solely on given natural language descriptions, lacking the ability for independent thinking and judgment. LLM-based agents are able to employ multiple rounds of interaction and customized prompts to gather more information, which enables the model to think and make decisions autonomously. In 2023, Andrew Zhao proposed the ExpeL framework [35], which utilizes ReAct as the planning framework combined with an experience pool [36]. This allows the LLM to extract insights from past records to aid in subsequent related queries; by letting the LLM analyze why previous answers were incorrect, it learns from experience to identify the problems.

At the same time, the application of LLM-based embodied agents has also become a hot research area in recent years. LLM-based embodied agents are intelligent systems that integrate LLMs with embodied agents [37]. These systems can not only process natural language but also complete tasks through perception and actions in physical or virtual environments. By combining language understanding with actual actions, these agents can perform tasks in more complex environments. This integration often involves using visual domain technologies
to process and understand visual data, and reinforcement learning algorithms to train agents to take optimal actions in the environment. These algorithms guide the agent through reward mechanisms to learn how to make optimal decisions in different situations, while the LLM acts as the brain to understand user instructions and generate appropriate feedback. In 2023, Guanzhi Wang introduced VOYAGER, an open-ended embodied agent with large language models [38]. It uses GPT-4 combined with input prompts, an iterative prompting mechanism, and a skill library, enabling the LLM-based agent to autonomously learn and play the game Minecraft, becoming the first lifelong learning agent in the game.

Prompt engineering plays a central role here, ensuring the generation of high-quality outputs and achieving complex automated tasks. In 2023, Jules White introduced a set of methods and patterns to enhance prompt engineering, optimizing interactions with LLMs such as ChatGPT [40]. This study categorizes prompt engineering into five main areas: Input Semantics, Output Customization, Error Identification, Prompt Improvement, and Interaction, to address a wide range of problems and adapt to different fields.

One notable technique is Retrieval-Augmented Generation (RAG): the input question undergoes similarity matching with documents in an index library, attempting to find relevant results. If similar documents are retrieved, they are organized in conjunction with the input question to generate a new prompt, which is then fed into the large language model. Now that large language models possess long-text memory capabilities, numerous studies have tested Gemini v1.5 against the Needle In A Haystack (NIAH) evaluation to explore whether RAG has become obsolete [41]. However, from various perspectives such as cost and expense, RAG still holds significant advantages (RAG could be 99 percent cheaper than utilizing all tokens). Additionally, long texts can negatively impact response performance, causing LLMs to respond more slowly when the input text is too long. Therefore, the advancements in context length for LLMs will not fully replace the role of RAG; rather, the two should be treated as complements to each other.
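To make the retrieval step concrete, the sketch below implements the loop just described with a toy in-memory index and keyword-overlap scoring as a stand-in for embedding-based similarity search; the documents and the final query_llm call are illustrative placeholders, not part of any cited framework.

```python
# Minimal RAG sketch: retrieve the most similar documents for a question and
# splice them into a new prompt for the language model. Keyword overlap stands
# in for a real embedding-based similarity index.
documents = [
    "The payment service retries failed transactions up to three times.",
    "User passwords are hashed with bcrypt before storage.",
]

def similarity(question: str, doc: str) -> int:
    # Toy similarity: number of shared lowercase tokens.
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, k: int = 1) -> list[str]:
    return sorted(documents, key=lambda d: similarity(question, d), reverse=True)[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    # The retrieved passages are organized together with the input question.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt("How are user passwords protected?")
print(prompt)
# answer = query_llm(prompt)  # hypothetical call to the underlying LLM
```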
Multi-agent systems fully utilize the advantages of multiple models, with each model specializing in specific aspects of the task to reduce the overhead caused by multi-processing in single agents; the collaboration among agents allows for more sophisticated and robust problem-solving capabilities. Due to their exceptional capabilities, more researchers are beginning to explore the field of multi-LLM-based agents and to apply them in software engineering domains. In 2024, many researchers adopted multi-agent systems in practical experiments [43] [44]. Multi-agent systems address the limitations of single-agent systems in the following ways:

• Enhanced Context Management: Multiple agents can maintain and share context, generating more coherent and relevant responses over long interactions.
• Specialization and Division of Labor: Different agents can focus on specific tasks or aspects of a problem, improving efficiency and effectiveness.
• Robustness and Error Correction: Collaborative agents can cross-check and validate each other's outputs, reducing the likelihood of errors and improving overall reliability.
• Contextual Consistency: Multi-agent systems can better manage context over long dialogues. The collaboration of multiple agents improves the efficiency of incident mitigation.
• Scalability and Flexibility: These systems can integrate specialized agents to scale and handle more complex tasks. Through the division of labor among multiple agents, the quality of code generation is improved.
• Dynamic Problem Solving: By integrating agents with different expertise, multi-agent systems can adapt to a wider range of problems and provide more accurate solutions.

E. LLM in Software Engineering

Recently, there has been a shift towards applying general AI models to specific vertical domains such as medicine and finance. In software engineering, new AI agents are emerging that are more flexible and intelligent compared to previous applications of LLMs, although they utilize different data and experiments. This continuous innovation underscores the transformative potential of AI agents across various fields; these models excel in text understanding and generation, promoting innovative applications in software development and maintenance.

LLMs profoundly impact software engineering by facilitating tasks such as code generation, defect prediction, and automated documentation. Integrating these models into development workflows not only simplifies the coding process but also reduces human errors. LLM-based agents enhance basic LLM capabilities by integrating decision-making and interactive problem-solving functions. These agents can understand and generate results by interacting with other software tools, which optimizes workflows, and they make autonomous decisions to improve software development practices. In 2023, Yann Dubois introduced the AlpacaFarm framework [45], where LLMs are used to simulate the behavior of software agents in complex environments. Moreover, significant research has been conducted in the field of automated program repair (APR). In 2024, Islem Bouzenia introduced RepairAgent [46], another LLM-based tool designed for automatic software repair; this tool reduced the time developers spent on fixing issues. Additionally, in 2023, Emanuele Musumeci demonstrated a multi-agent LLM system [47], which involved a multi-agent architecture where each agent had a specific role in the generation of documents. This system significantly improved the handling of complex document structures without extensive human supervision. Besides these, LLMs have made outstanding contributions in software testing, software design, and emerging fields such as software security and maintenance.

Currently, there is no comprehensive and accurate definition of the capabilities an LLM must exhibit to be considered an LLM-based agent. Since the application of LLMs in software engineering is relatively broad and some frameworks already exhibit certain levels of autonomy and intelligence, this study defines the distinction between LLMs and LLM-based agents based on mainstream definitions and literature from the first half of 2024. In this survey, an LLM architecture can be called an agent when it satisfies the criteria in Table IV.

Criteria (Table IV):
1) The LLM serves as the brain (the center of information processing and generation of thought).
2) The framework not only relies on the language understanding and generation capabilities of LLMs but also possesses decision-making and planning abilities.
3) If tools are available, the model can autonomously decide when and which tools to use and integrate the results into its predictions to enhance task completion efficiency and accuracy.
4) The model can select the optimal solution from multiple homogeneous results (the ability to evaluate and choose among various possible solutions).
5) The model can handle multiple interactions and maintain contextual understanding.

IV. REQUIREMENT ENGINEERING AND DOCUMENTATION

Requirement Engineering is a critical field within software engineering and plays an essential role in the software development process; its primary task is to ensure that the software system meets the needs of all relevant stakeholders. Typically, requirement engineering in project development involves many steps, where developers need to fully understand the users' needs and expectations to ensure that the development direction of the software system aligns with actual requirements. The collected requirements are then organized and evaluated by the development group. Requirements specification is the process of formally documenting the analyzed requirements; the specification must be accurate and concise, and requirement verification must be conducted to ensure that developers are building what users need and that it aligns with the specifications. Requirement engineering also includes requirement management, a task that spans the entire software development life-cycle; developers need to continuously track, control, and respond to any changes occurring during development, ensuring that these changes do not negatively impact the project's progress and overall quality.

A. LLMs Tasks

In the field of requirement engineering, LLMs have demonstrated significant potential in automating and enhancing tasks such as requirement elicitation, classification, generation, specification generation, and quality assessment. Requirement classification and extraction is a crucial task in requirement engineering during the development process. It is common to encounter situations where clients present multiple requirements at once, necessitating manual classification by developers. By categorizing requirements into functional and
non-functional requirements, developers can better understand and manage them. Thanks to the strong performance of LLMs in classification tasks, many relevant frameworks have been developed. The PRCBERT framework, utilizing the BERT pre-trained language model, transforms classification problems into a series of binary classification tasks through flexible prompt templates, significantly improving classification performance [48]. Studies have shown that PRCBERT achieved an F1 score of 96.13% on the PROMISE dataset, outperforming the previous state-of-the-art NoRBERT [49] and BERT-MLM [31] models. Additionally, the application of ChatGPT in requirement information retrieval has shown promising results: by classifying and extracting information from requirement documents, ChatGPT achieved comparable or even better Fβ scores under zero-shot settings, particularly in feature extraction tasks, where its performance surpassed baseline models [50]. As seen in Table I, there is also substantial literature and research on using LLMs to automatically generate requirements and descriptions in requirement engineering.

By automating the generation and description of requirements, the efficiency and accuracy of requirement elicitation can be improved. Research indicates that LLMs hold significant potential in requirements generation tasks. For example, in a study using ChatGPT to generate and gather user requirements, it was found that participants with professional knowledge could use ChatGPT more effectively, indicating the influence of domain expertise on the effectiveness of LLM-assisted requirement elicitation [51]. The study employed qualitative assessments of the LLMs' output against predefined criteria for requirements matches, including full matches, partial matches, and the relevancy of the elicited requirements. Although success varied depending on the complexity of the task and the experience of the users, the results show that LLMs can effectively assist in eliciting requirements, and they are particularly useful in identifying and suggesting requirements based on the large corpus of training data they were provided. SRS (Software Requirement Specification) generation is an important task on which developers normally spend a lot of time for refinement and verification. In [52], researchers use both iterative prompting and a single comprehensive prompt to assess the performance of LLMs in generating SRS. The experiment was conducted on GPT-4 and CodeLlama-34b, one closed-source LLM and one open-source LLM, for a comprehensive evaluation; the generated SRS were compared with human-crafted SRS and finally scored on a Likert scale. The results indicate that the human-written SRS was overall superior, but CodeLlama often came close, sometimes outperforming in specific categories. CodeLlama scored higher in completeness and internal consistency than GPT-4 but was less concise, so this study demonstrated the potential of using fine-tuned LLMs to generate SRS and increase overall project productivity. Another paper also explores using LLMs for generating specifications. In [53], the authors introduce a framework called SpecGen for generating program specifications. The framework primarily uses GPT-3.5-turbo as the base model and employs prompt engineering combined with multi-turn dialogues to generate the specifications. SpecGen applies four mutation operators to modify these specifications and finally uses a heuristic selection strategy to choose the optimal variant. The results show that SpecGen can generate 70% of the program specifications, outperforming traditional tools like Houdini [54] and Daikon (https://github.com/codespecs/daikon).

Furthermore, designing prompt patterns can significantly enhance LLMs' capabilities in tasks such as requirement elicitation and system design. The paper [55] provides a catalog of 13 prompt patterns, each aimed at addressing specific challenges in software development. The experiments test the efficacy of these patterns in real-world scenarios to validate their usefulness. By applying different prompt patterns, the study found that these patterns could help generate more structured and modular results and reduce common errors. Automated requirement completeness enhancement is another important benefit brought by LLMs in requirement generation. The study [56] uses BERT's Masked Language Model (MLM) to detect and fill in missing parts in natural language requirements, significantly improving the completeness of requirements. BERT's MLM achieved a precision of 82%, indicating that 82% of the predicted missing terms were correct.
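As a concrete illustration of the masked-language-model idea (a minimal sketch, not the exact setup of [56]), the snippet below uses the Hugging Face transformers fill-mask pipeline with the public bert-base-uncased checkpoint to propose candidates for a missing term in a requirement sentence; the requirement text itself is invented for the example.

```python
from transformers import pipeline

# Fill-mask pipeline built on an encoder-only BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

requirement = "The system shall [MASK] all user data before transmission."
for candidate in fill_mask(requirement)[:3]:
    # Each candidate carries the predicted token and its probability score.
    print(candidate["token_str"], round(candidate["score"], 3))
```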
There is also the application of LLMs in ambiguity detection tasks, aimed at detecting ambiguities in natural language
requirement documents to improve clarity and reduce misunderstandings. This study primarily aims to address the issue of detecting term ambiguities within the same application domain (where the same term has different meanings in different domains). Although current models generally possess excellent contextual understanding capabilities, this was a common problem in machine learning at that time. This study provides an excellent paradigm for the subsequent application of LLMs in requirements engineering; it demonstrated that transformer-based machine learning models can effectively detect and identify ambiguities in requirement documents, thereby enhancing document clarity and consistency. The framework utilizes BERT and K-means clustering to identify terms used in different contexts within the same application domain or in interdisciplinary project requirements documents [57]. In the last two years, more and more researchers have used LLMs to help them evaluate requirement documentation; quality assessment tasks ensure that the generated requirements and code meet expected quality standards. The application of ChatGPT in user story quality evaluation has shown potential in identifying quality issues, but it requires further optimization and improvement [58]. A similar study used an LLM to automatically process requirement satisfaction assessment and evaluate whether design elements are fully covered by the given requirements, but the researchers indicated the necessity of further verification and optimization in practical applications [59].

B. LLM-based Agents Tasks

Currently, the application of LLM-based agents in requirement engineering is still quite nascent, but some useful studies help us see the potential. LLM-based agents bring both efficiency and accuracy to tasks like requirement elicitation, classification, generation, and verification. Compared to traditional LLMs, these systems exhibit higher levels of automation and precision through task division and collaboration. The application of multi-agent systems in semi-structured document generation has shown significant effectiveness. In [60], a multi-agent framework is introduced that combines semantic recognition, information retrieval, and content generation tasks to streamline the creation and management of semi-structured documents in the public administration domain. The proposed framework involves three main types of agents: Semantics Identification Agent, Information Retrieval Agent, and Content Generation Agent. By avoiding the overhead of a single model, each agent is assigned a specific task with minimal user intervention, following the designed framework and workflow.

Additionally, the AI-assisted software development framework (AISD) also showcases the autonomy brought by LLM-based agents in requirement engineering. [61] proposes the AISD framework, which continuously improves and optimizes generated use cases and code through ongoing user feedback and interaction. In the experiments, humans first give a fuzzy requirement definition; the LLM-based agent then improves the requirement case according to this information, designs the model, and generates the system according to the case, and the generated results let humans judge whether the requirements are met or not. The study results indicate that AISD significantly increased use case pass rates to 75.2%, compared to only 24.1% without human involvement. AISD demonstrates the agents' autonomous learning ability by allowing LLMs to generate all code files in a single session, continually refining and modifying based on user feedback. This also ensures code dependency and consistency, further proving the importance of human involvement in the requirement analysis and system testing stages.

Furthermore, in generating safety requirements for autonomous driving, LLM-based agents have shown unique advantages by introducing multimodal capabilities. The system employs LLMs as automated agents to generate and refine safety requirements with minimal human intervention until the verification stage, which is unattainable with LLMs alone. [62] describes an LLM prototype integrated into the existing Hazard Analysis and Risk Assessment (HARA) process, significantly enhancing efficiency by automatically generating specific safety-related requirements. Through three design iterations, the study progressively improved the LLM prototype's efficiency, completing within a day what takes months manually. In agile software development, the quality of user stories directly impacts the development cycle and the realization of customer expectations. [63] demonstrates the successful application of the ALAS system in six agile teams at the Austrian Post Group IT. The ALAS system significantly improved the clarity, comprehensibility, and alignment with business objectives of user stories through automated analysis and enhancement. The agent framework allows the model to perform specific roles in the agile development process, and the study results indicated that the ALAS-improved user stories received high satisfaction ratings from team members.

C. Analysis

The application of LLM-based agents in requirement engineering has demonstrated significant efficiency improvements and quality assurance. Through multi-agent collaboration and automated processing, these systems not only reduce manual intervention but also enhance the accuracy and consistency of requirement generation and verification. We can see that the tasks of LLM-based agents are no longer limited to simply generating requirements or filling in the gaps in descriptions. Instead, they involve the implementation of an automated process, with the generation of requirement documents being just one part of it; integrating LLMs into agents enhances the overall system's natural language processing and reasoning capabilities. In real-world applications, many tasks can no longer be accomplished by simple LLMs alone, especially high-level software design. The emergence of LLM-based agents addresses this issue through a multi-agent collaborative system centered around LLMs; these agents continuously analyze and refine the deficiencies in the requirement documents, and this is likely to be the main application trend of LLM-based agents in requirements engineering in the future.

The application of LLM-based agents in requirements engineering is still relatively limited, with most efforts focusing on
datasets that reflect the requirements in actual projects, providing a valuable testing benchmark.

From these papers, it can be seen that the selection and construction of datasets in LLM-based agents' research in requirement engineering often rely on practical projects and case studies, lacking standardization and large-scale datasets. Compared to the LLM literature, the datasets used are broader and at a higher level, such as an actual system's files, not limited to the classification of non-functional requirements and pure software requirement specifications. Researchers focus more on validating the model's effectiveness through practical application and iterative improvement to enhance model performance. While this approach is flexible and targeted, it also highlights the field's shortcomings in dataset standardization and scaling. In the future, with more public datasets being constructed and shared, the application of LLM-based agents in requirement engineering is expected to achieve broader and deeper development.

E. Evaluation Metrics

In the field of requirement engineering, LLMs and LLM-based agents are evaluated using various metrics. These metrics not only include traditional indicators such as precision, recall, and F1 score but also more specific indicators tailored to the unique nature of requirement engineering.
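For reference, the traditional indicators mentioned here follow their standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives:

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
\]

The Fβ score is the weighted generalization of F1, weighting recall β times as much as precision.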
Through these evaluations, we can see how these models are assessed and how they are changing the practice of requirement engineering. The specific evaluation metrics are detailed in Table V. In [51], while precision and recall are fundamental for evaluating the effectiveness of information retrieval, additional evaluations of clarity, consistency, and compliance are included, which are crucial quality indicators in requirement engineering. This multidimensional evaluation method not only measures the operational performance of LLMs but also examines their ability to maintain the quality of requirement specifications. Through this approach, LLMs have demonstrated their value in automating and optimizing the requirement elicitation process, enhancing both efficiency and the reliability of results. The paper [52] uses a Likert scale to measure the quality of the generated specifications; each specification is scored on attributes such as unambiguity, understandability, and conciseness, with agreement rated from 1 to 5. For agent-based LLMs, as demonstrated in [63], the evaluation extends to assessing the independence and negotiability of the agents, elevating their functionality to a new level. These agents provide technical solutions and also interact with users, autonomously adjusting to meet specific project
needs, thus resembling collaborative partners. This capability makes LLM-based agents valuable in requirement engineering for requirement management and decision optimization, and it also highlights that LLMs typically focus on improving the accuracy and efficiency of specific tasks, while LLM-based agents exhibit higher capabilities in autonomy and adaptability. In Table V, we can see that the application of LLMs in requirements engineering typically relies on common metrics such as F1 score to evaluate a model's performance. However, for LLM-based agents, the evaluation focus shifts from the performance of requirement document generation to the quality of the final product. Therefore, evaluation metrics emphasize user satisfaction, such as pass rate, feedback, etc. Essentially, LLM-based agents still leverage LLMs themselves to achieve higher-level development, and much depends on the nature of the task. In summary, we can conclude that the characteristics of agent models both reflect their complex decision-making and learning abilities and reveal their potential advantages in collaborating with humans or other tools to provide more scalable and flexible designs. This implies that the methods of requirement elicitation and processing in future software development will become more efficient, precise, and continuously refined to better align with stakeholders' needs by using LLM-based agents.

V. CODE GENERATION AND SOFTWARE DEVELOPMENT

Code generation and software development are core areas within software engineering and play a crucial role in the software development process. The primary objective of using LLMs in code generation is to enhance development efficiency and code quality through automated processes, thereby meeting the needs of both developers and users.

In recent years, the application of LLMs in code generation and software development has made significant progress; this has changed the way developers work and revealed a shift towards automated development processes. Compared to requirement engineering, research on the application of LLMs and LLM-based agents in code generation and software development is more extensive and in-depth. Using natural language processing and generation technologies, LLMs can understand and generate complex code snippets, assisting developers in automating various stages from code writing and debugging to software optimization. Decoder-based large language models such as GPT-4 have shown significant potential in code generation by providing accurate code suggestions and automated debugging, greatly improving development efficiency. Recently, the application of LLM-based agents in software development has also been gaining attention; these intelligent agents can not only perform complex code generation tasks but also engage in autonomous learning and continuous refinement, thereby offering flexible assistance in dynamic development environments. Tools like GitHub Copilot [12], which integrate LLMs, have already demonstrated their advantages in enhancing programming efficiency and code quality.

A. LLMs Tasks

Large language models have optimized various tasks in generation and reasoning, covering areas such as code generation, debugging, code comprehension, code completion, code transformation, and multi-turn interactive code generation. The primary method is generating executable code from natural language descriptions, where models utilize previously learnt code snippets or apply few-shot learning to better understand user requirements. Nowadays, AI tools integrate deeply with IDEs like Visual Studio Code (https://code.visualstudio.com/) and JetBrains to enhance code writing and translation tasks, such as OpenAI's Codex model [67]. Codex, fine-tuned on public code from GitHub, demonstrates the capability to generate Python functions from doc-strings and outperformed other similar models on the HumanEval benchmark.

In [68], researchers comprehensively evaluated the performance of multiple LLMs on L2C (language-to-code) tasks. The results showed that GPT-4 demonstrates strong capability in tasks such as semantic parsing, mathematical reasoning, and Python programming. With instruction tuning and support from large-scale training data, the model can understand and generate code that aligns with user intent, achieving high-precision code generation. Applying LLMs to text-to-database management and query optimization is also a novel research direction in natural-language-to-code generation. By converting natural language queries into SQL statements, LLMs help developers quickly generate efficient database query code. [69] proposed the SQL-PaLM framework, which significantly enhances the execution accuracy and exact match rate for text-to-SQL tasks through few-shot prompting and instruction fine-tuning, providing an effective solution for complex cross-domain SQL generation tasks. The improvements in accuracy and exact match achieved by SQL-PaLM are considered state-of-the-art (SOTA) on the tested benchmarks; SQL-PaLM showed promising results compared with existing methods such as T5-3B + PICARD, RASAT + PICARD, and even GPT-4, achieving the highest test accuracy of 77.3% and an execution accuracy of 82.7%.
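A schematic example of the few-shot prompting such text-to-SQL systems rely on is sketched below; the schema, example pairs, and question are made up for illustration and do not reproduce the SQL-PaLM prompt format.

```python
EXAMPLES = [
    ("List all customer names.", "SELECT name FROM customers;"),
    ("How many orders were placed in 2023?",
     "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2023';"),
]

def build_prompt(schema: str, question: str) -> str:
    # Few-shot examples precede the new question so the model can imitate the mapping.
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in EXAMPLES)
    return (f"Given the database schema:\n{schema}\n\n"
            f"{shots}\n\nQuestion: {question}\nSQL:")

schema = "customers(id, name); orders(id, customer_id, order_date, total)"
print(build_prompt(schema, "What is the total revenue per customer?"))
```

The generated SQL string would then be executed against the database, and execution accuracy and exact match are measured against reference queries.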
Multilingual code generation is another important application of LLMs, particularly suited to the transformer architecture. In [70], researchers introduced the CodeGeeX model, which was pre-trained on multiple programming languages and performed well in multilingual code generation and translation tasks. Experimental results showed that CodeGeeX outperformed other multilingual models on the HumanEval-X benchmark.

Although current LLMs possess excellent code generation capabilities, with accuracy and compile rates reaching usable levels, the quality of generated code often depends on the user's prompts. If the prompts are too vague or general, the LLM typically struggles to understand the user's true requirements, making it difficult to generate the desired code in a single attempt. In [71], researchers introduced a "print debugging" technique, using GPT-4 to track variable values and execution flows, which enhances efficiency and accuracy through in-context learning. This method is particularly suitable for medium-difficulty problems on Leetcode; compared to the rubber duck debugging method, print
debugging improved performance by 1.5% on simple Leetcode problems and by 17.9% on medium-difficulty problems.

Additionally, the application of LLMs to improving programming efficiency has garnered widespread attention; tools like GitHub Copilot, which integrates OpenAI's Codex model, provide real-time code completion and suggestions during coding. In [72], researchers present a controlled experiment with GitHub Copilot, with results showing that developers completed HTTP server tasks 55.8% faster when using Copilot. Another similar study also uses LLMs as programmer tools: in [73], researchers introduced the INCODER model, which is capable of both program synthesis and editing. By leveraging bidirectional context, the model performs well in both single-line and multi-line code infilling tasks, providing developers with smarter code editing tools. This real-time code generation and completion functionality not only improves programming efficiency but also reduces the burden on developers, allowing them to focus on higher-level design, addressing a common problem in software development where substantial workforce and time are wasted on tedious coding tasks.

Multi-turn program synthesis represents a significant breakthrough for LLMs in handling complex programming tasks. In [74], researchers introduced the CODEGEN model, which iteratively generates programs through multiple interactions, significantly improving program synthesis quality and making the development process more efficient and accurate. By gradually generating and continuously optimizing code at each interaction, LLMs can better understand user intent and generate more precise and optimized code. In the experiments, comparisons were made with the Codex model, which was considered state-of-the-art in code generation at the time. CODEGEN-MONO 2.7B outperformed the Codex model of comparable size on pass@k metrics for both k=1 and k=10. Furthermore, CODEGEN-MONO 16.1B exhibited performance comparable to or better than the best Codex model on certain metrics, further demonstrating its SOTA performance in code generation.
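For context, pass@k is commonly reported with the unbiased estimator popularized by the HumanEval/Codex evaluation: n samples are generated per problem, c of them pass the unit tests, and

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]

so that pass@1 reduces to the average fraction of passing samples per problem.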
By iteratively generating and optimizing code, LLMs can continuously improve their output quality. In [75], researchers proposed the Cycle framework, which enhances the self-improvement capability of code language models by learning from execution feedback, improving code generation performance by 63.5% on multiple benchmark datasets. Although Cycle has a certain degree of autonomy, its decision-making and planning capabilities are mainly limited to code generation and improvement tasks without overall planning, and its execution sequence completely follows a fixed pattern, so it is better classified as an advanced LLM application.

B. LLM-based Agents Tasks

LLM-based agents have shown significant potential and advantages by substantially improving task efficiency and effectiveness through multi-agent collaboration. Unlike traditional LLMs, LLM-based agents adopt a division-of-labor approach, breaking down complex tasks into multiple subtasks handled by specialized agents; this method can enhance task efficiency and improve the quality and accuracy of generated code, mitigating the hallucination of a single LLM. In [76], researchers proposed a self-collaboration framework where multiple ChatGPT (GPT-3.5-turbo) agents act as different roles to collaboratively handle complex code generation tasks. Specifically, the introduction of a Software Development Methodology (SDM) divides the development process into three stages: analysis, coding, and testing. Each stage is managed by specific roles, and after completing their tasks, each role provides feedback and collaborates with others to improve the quality of the generated code. Experiments show that this self-collaboration framework significantly improves performance on both the HumanEval and MBPP benchmarks, with the highest improvement reaching 29.9% on HumanEval compared to the SOTA model GPT-4. This result demonstrates the potential of collaborative teams in complex code generation tasks. Although it lacks external tool integration and dynamic adjustment capabilities, this framework exhibits common characteristics of LLM-based agents, such as role distribution, self-improvement ability, and excellent autonomous decision-making; these combined capabilities qualify it to be considered an LLM-based agent. Similarly, in [77], the LCG framework improved code generation quality also through multi-agent collaboration and chain-of-thought techniques, once again demonstrating the effectiveness of multi-agent collaboration in the software development process.
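The role division in such self-collaboration frameworks can be sketched as a simple loop. The snippet below is a schematic reconstruction, not the implementation from [76]; the chat function is a placeholder for whatever LLM client is available.

```python
def chat(role: str, message: str) -> str:
    # Placeholder: a real system would send `role` as the system prompt and
    # `message` as the user prompt to an LLM and return its reply.
    return f"[{role} reply to: {message[:40]}...]"

def self_collaborate(task: str, max_rounds: int = 3) -> str:
    plan = chat("analyst", f"Decompose this task into subtasks: {task}")
    code = chat("coder", f"Implement the plan in Python: {plan}")
    for _ in range(max_rounds):
        report = chat("tester", f"Review and test this code: {code}")
        if "no issues" in report.lower():      # stop once the tester role is satisfied
            break
        # Feedback from the tester role is handed back to the coder role.
        code = chat("coder", f"Revise the code given this report: {report}")
    return code

print(self_collaborate("Parse a CSV file and return the row count."))
```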
The limitation of context windows was not discussed in the previous studies; it was thoroughly explored in 2024 by a University of Cambridge team. In [78], researchers introduced the L2MAC framework, which dynamically manages memory and execution context through a multi-agent system to generate large codebases, and achieved SOTA performance in generating large codebases for system design tasks. The framework is primarily divided into the following components: the processor, which is responsible for the actual generation of task outputs; the Instruction Registry, which stores program prompts to solve user tasks; and the File Storage, which contains both final and intermediate outputs. The Control Unit periodically checks the outputs to ensure that the generated content is both syntactically and functionally correct. The researchers conducted multiple experiments and compared with many novel methods like GPT-4, Reflexion, and AutoGPT, achieving a Pass@1 score of 90.2% on the HumanEval benchmark, showcasing its superior performance in generating large-scale codebases.

Recently, many studies have begun to use LLM-based agents to simulate real software development processes. The paper [79] introduced the MetaGPT framework, which enhanced problem-solving capabilities through standard operating procedures (SOPs) encoded in multi-agent collaboration. The entire multi-agent collaboration framework simulates the waterfall life-cycle of software development, with each agent playing a different role and collaborating to achieve the goal of automating software development. LLM-based agents have also shown strong ability in automated software development: [80] proposed a multi-GPT agent framework that automates tasks such as project planning, requirement engineering, software design, and debugging, illustrating the
potential for automated software development. Similarly, [81] introduced a model called CodePori, a novel model designed to automate code generation for extensive and complex software projects based on natural language prompts. In [82], the AgentCoder framework coordinates programmer agents, test design agents, and test execution agents to generate and optimize code, outperforming existing methods: it achieved SOTA performance on the HumanEval-ET benchmark with a pass@1 of 77.4% compared to the previous state-of-the-art result of 69.5%, showcasing the advantages of multi-agent systems in code generation and testing.

For many frameworks, the purpose of integrating LLMs into agents is to enhance the self-feedback and reflection capabilities of the entire agent system. Because current open-source LLMs generally have much lower capabilities in this aspect compared to proprietary models, the emergence of LLM-based agents can help bridge the gap between open-source models and the advanced capabilities of proprietary systems like GPT-4. [83] introduced the OpenCodeInterpreter framework, which improved the accuracy of code generation models by integrating code generation, execution, and human feedback. Based on CodeLlama and DeepSeekCoder, this framework performed close to the GPT-4 Code Interpreter on the HumanEval and MBPP benchmarks. The ability to use external tools or APIs is another significant advantage of LLM-based agents. [84] proposed the Toolformer model, which significantly enhanced task performance by learning to call APIs through self-supervision. The framework, based on GPT-J (6.7B parameters), achieved significant performance improvements across multiple benchmark tasks, demonstrating the possibilities that external tools bring to LLM-based agents; the diverse choice of tools and architectures allows LLMs to continuously learn new things and improve themselves. Similarly, [85] enhanced LLMs' interaction with external APIs through the ToolLLM framework, outperforming Text-Davinci-003 and Claude-2 on the ToolBench and APIBench benchmarks and excelling in multi-tool instruction processing.
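The tool-calling pattern these frameworks build on can be reduced to a small loop: the model emits a structured action, the runtime executes the matching tool, and the observation is appended to the context for the next step. The registry, the action format, and the call_llm stub below are illustrative assumptions, not the Toolformer or ToolLLM APIs.

```python
import json

# A toy tool registry; real agents expose search engines, compilers, test runners, etc.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
    "search": lambda query: f"(stub) top result for '{query}'",
}

def call_llm(context: str) -> str:
    # Placeholder for the model call that decides which tool to invoke next.
    return json.dumps({"tool": "calculator", "args": "2 + 2 * 10"})

def agent_step(context: str) -> str:
    action = json.loads(call_llm(context))
    observation = TOOLS[action["tool"]](action["args"])
    # The observation is fed back into the context so later steps can build on it.
    return context + f"\nAction: {action}\nObservation: {observation}"

print(agent_step("Task: compute 2 + 2 * 10."))
```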
C. Analysis

The main differences between LLM-based agents and traditional LLMs in software development applications mainly concern efficiency and autonomy, particularly in task division and collaboration. Traditional LLMs typically use a single model to handle specific tasks, such as generating code from text and code completion. However, this approach has limitations when dealing with complex tasks, especially regarding context window restrictions and the need for continuous feedback. LLM-based agents handle different subtasks through collaboration with a clear division of labor, thereby enhancing task efficiency and quality. For example, in a code generation task, one agent generates the initial code, another designs test cases, and a third executes tests and provides feedback, thus achieving iterative optimization. Through task division, multi-agent systems, and tool integration, LLM-based agents can tackle more complex and broader tasks, improving the quality and efficiency of code generation. This approach overcomes the limitations of traditional LLMs and also provides new directions and ideas for future software development research and applications, freeing programmers from tedious test suite generation.

Fig. 5: Illustration of comparison framework between LLM-based agent and LLM in code generation and software development.

In software engineering task handling, there are subtle differences between LLMs and LLM-based agents in terms of task focus, approach, complexity and scalability, automation level, and task management. LLMs primarily focus on enhancing the code generation capabilities of a single LLM, including debugging, precision, and evaluation. These methods typically improve specific aspects of code generation or evaluation through a single model, concentrating on performance enhancement within existing constraints, such as context windows and single-task execution. In contrast, LLM-based agents emphasize handling more complex and broader tasks through the collaboration of multiple specialized LLMs or frameworks, integrating tool usage, iterative testing, and multi-agent coordination to optimize the whole development process, easily surpassing state-of-the-art models on common benchmarks. The emergence of multi-agent systems also brings more possibilities; such a system can imitate real software developers performing scrum development. Figure 5 uses studies [77] and [75] to showcase the differences between LLM-based agents and LLMs on the same code generation task. The LLM-based agent system is able to perform multi-agent collaboration and simulate a real scrum development team in industry. In contrast, the LLMs on the right normally use multiple LLMs to analyse mistakes from the test cases and refine the initially generated code, but they lack autonomy and efficiency, as the test cases are manually generated by humans.
VI. AUTONOMOUS LEARNING AND DECISION MAKING

Autonomous learning and decision making is a critical and evolving field in modern software engineering, especially under the influence of artificial intelligence and big data. Its core task is to achieve automated data analysis, model building, and decision optimization through machine learning algorithms and intelligent systems, thereby enhancing the autonomy and intelligence of software systems.
In this process, LLMs and LLM-based agents bring numerous possibilities. Following the development of NLP technology, many achievements have been made in applying LLMs to this field. These models can handle complex language tasks and also demonstrate powerful reasoning and decision-making abilities. Research on voting inference using multiple LLM calls has revealed new ways of optimizing performance; the most frequently used method is majority vote [89], which improves the accuracy of inference systems and helps select the best candidate answer. Additionally, the performance of LLMs in tasks such as automated debugging and self-correction has strengthened systems' autonomous learning capabilities, achieving efficient error identification and correction. At the same time, the application of LLM-based agents in autonomous learning and decision-making is a novel but popular topic: these agents can perform complex reasoning and decision-making tasks with the help of the LLM and improve their adaptability in dynamic environments through continuous learning and optimization. In this context, we have collected nineteen research papers on LLM-based agents in this field. This survey provides a general review of these studies, analyzing their specific applications and technical implementations in autonomous learning and decision making.
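Majority voting over repeated calls, the aggregation strategy referenced above, reduces to sampling the model several times and keeping the most frequent answer. The snippet below is a generic illustration of that idea rather than the exact procedure of [89] or [90]; the `call_llm()` sampling helper is a hypothetical placeholder.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder for one stochastic LLM call (temperature > 0)."""
    raise NotImplementedError

def majority_vote(prompt: str, n_calls: int = 11) -> str:
    """Query the model n_calls times and return the most common answer."""
    answers = [call_llm(prompt).strip() for _ in range(n_calls)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```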
A. LLMs Tasks

API calls to LLMs are one common application, often requiring repeated calls so that the model can make judgments and inferences; but does continuously increasing the number of calls always improve performance? In [90], researchers explored the impact of increasing LLM calls on the performance of compound inference systems. Analyzing voting-based inference designs, they found a non-linear relationship between the number of LLM calls and system performance: performance improves initially with more calls but declines after a certain threshold. This research provides a theoretical basis for optimizing LLM calls, helping to allocate resources reasonably in practical applications. However, the performance of voting inference systems shows a non-monotonic trend due to the diversity of query difficulties, and the continuously increasing cost also needs to be considered.
Autonomous learning is also applied to bug fixing, where researchers hope LLMs can continuously learn to fix bugs and eventually catch human oversights and common errors. In [91], the SELF-DEBUGGING method was proposed, enabling LLMs to debug code by analyzing execution results and natural language explanations. This method significantly improved the accuracy and sample efficiency of code generation, especially for complex problems. Experimental results on the Spider and TransCoder benchmarks showed that SELF-DEBUGGING increases model accuracy by 2-12%, demonstrating the potential of LLMs to learn autonomously to debug and correct errors. Another similar study introduced the AutoSD (Automated Scientific Debugging) technique [92], which simulates the scientific debugging process through LLMs and generates explainable patched code. The researchers evaluated AutoSD from six aspects: feasibility, debugger ablation, language model change, developer benefit, developer acceptance, and qualitative analysis. Results showed that AutoSD can generate effective patches and also improve developers' accuracy in evaluating patched code by providing explanations; this explainability makes it easier for developers to understand and accept automatically generated patches. Although both studies focus on automated debugging techniques, the frameworks they design automatically determine the optimal repair based on the debugging results once sufficient information has been collected, and they provide concrete code implementations, demonstrating a capability for autonomous decision-making and learning.
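The loop behind self-debugging approaches such as [91] can be sketched as: run the generated program, hand the execution trace back to the model together with a request for a natural-language explanation, and ask for a fix. The helper names below (`call_llm`, `execute`) are assumptions for illustration, not the paper's code.

```python
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any code-capable LLM

def execute(code: str) -> tuple[bool, str]:
    """Run the candidate program and capture its output or traceback."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_debug(task: str, max_iters: int = 5) -> str:
    code = call_llm(f"Write a Python program for: {task}")
    for _ in range(max_iters):
        ok, trace = execute(code)
        if ok:
            break
        # Ask the model to explain the failure before asking for a fix.
        explanation = call_llm(f"Explain why this code fails:\n{code}\nTrace:\n{trace}")
        code = call_llm(
            f"Task: {task}\nFaulty code:\n{code}\nDiagnosis:\n{explanation}\n"
            "Return a corrected program."
        )
    return code
```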
Since the rise of LLMs across various fields, one research direction has been the rational analysis of their creativity and the exploration of their potential for continuous learning; this creativity is also strongly determined by the decision-making capability of the models. [93] analyzed the outputs of LLMs from the perspective of creativity theory, exploring their ability to generate creative content. Using metrics such as value, novelty, and surprise, the study found that current LLMs have limitations in generating combinatorial, exploratory, and transformative creativity. Although LLMs can generate high-quality creative content, further research and improvement are needed to achieve true creative breakthroughs. Additionally, innovative responses generated by LLMs may come with hallucinations, a long-standing issue for large language models; despite many mitigation techniques, they still cannot be entirely prevented. There are many interesting experiments in decision making, such as having LLMs act as judges to determine whether a person has committed a crime [94]. A familiar approach is to have a primary LLM interact with other LLMs. [95] explored the effectiveness of using LLMs as judges to evaluate other LLM-driven chat assistants. The study validated the consistency of LLM judgments with human preferences on the MT-Bench and Chatbot Arena benchmarks, showing that GPT-4's judgments were highly consistent with human judgments across various tasks. This research demonstrates the potential of LLMs for simulating human evaluation, providing new ideas for automated evaluation and optimization.
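Pairwise LLM-as-a-judge evaluation of the kind studied in [95] reduces to prompting a strong model to pick the better of two answers and then measuring how often its verdicts agree with human ones. The sketch below is a simplified, hypothetical illustration (the judge prompt and answer parsing are deliberately minimal).

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the judge model, e.g. a GPT-4-class LLM

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model to choose the better of two answers."""
    verdict = call_llm(
        "You are an impartial judge. Given the question and two answers, "
        "reply with exactly 'A' or 'B' for the better answer.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"

def agreement_with_humans(samples: list[dict]) -> float:
    """samples: [{'question': ..., 'a': ..., 'b': ..., 'human': 'A' or 'B'}, ...]"""
    hits = sum(judge_pair(s["question"], s["a"], s["b"]) == s["human"] for s in samples)
    return hits / len(samples)
```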
B. LLM-based Agents Tasks

Multi-agent collaboration and dialogue frameworks have also demonstrated strong capabilities in both decision making and autonomous learning. [96] explores whether multi-agent discussions can enhance the reasoning abilities of LLMs. The proposed CMD framework simulates human group discussion processes, showing that multi-agent discussions can improve performance in commonsense knowledge and mathematical reasoning tasks without task-specific examples. The study also found that multi-agent discussions correct common errors of single agents, such as judgment errors and the propagation of incorrect answers, thereby improving overall reasoning accuracy. In [97], researchers explored the potential of multi-modal large language models (MLLMs) such as GPT4-Vision to enhance agents' autonomous decision-making. The paper introduces the PCA-EVAL benchmark and evaluates multi-modal decision-making capabilities in areas such as autonomous driving, home assistants, and gaming. The results showed that GPT4-Vision exhibits outstanding performance across the dimensions of perception, cognition, and action.
[98] proposes the Reflexion framework, a novel approach that strengthens learning through language feedback rather than traditional weight updates, avoiding expensive retraining costs. The framework uses self-reflection and language feedback to help language agents learn from mistakes, significantly improving performance on decision-making, reasoning, and programming tasks. Reflexion's first-pass success rate on the HumanEval Python programming task increased from 80.1% to 91.0%, success rates in the ALFWorld decision-making task improved by 22%, and performance on the HotPotQA reasoning task increased by 14%. These results indicate that Reflexion achieves state-of-the-art performance across various tasks through self-reflection and language feedback.
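Reflexion's core idea, verbal feedback kept in an episodic memory instead of weight updates, can be illustrated with a small loop: attempt the task, evaluate, ask the model to reflect on the failure, and prepend the reflection to the next attempt. The sketch assumes hypothetical `call_llm()` and `evaluate()` helpers and is not the released Reflexion code.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # actor / self-reflection model

def evaluate(task: str, attempt: str) -> tuple[bool, str]:
    """Task-specific check, e.g. running unit tests; returns (success, feedback)."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 4) -> str:
    reflections: list[str] = []  # long-term verbal memory; no weight updates
    attempt = ""
    for _ in range(max_trials):
        context = "\n".join(reflections)
        attempt = call_llm(f"Past reflections:\n{context}\nSolve the task:\n{task}")
        success, feedback = evaluate(task, attempt)
        if success:
            break
        reflections.append(
            call_llm(f"Attempt:\n{attempt}\nFeedback:\n{feedback}\n"
                     "In two sentences, explain what went wrong and what to try next.")
        )
    return attempt
```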
Another agent framework, ExpeL [35], enhances decision-making by autonomously collecting experiences and extracting knowledge from a series of training tasks using natural language; this experience-collection process is similar to how humans gain insights through practice and then apply them in exams. By accessing internal databases, ExpeL also reduces hallucinations, employing the RAG technique discussed in Section III. The ExpeL framework requires no parameter updates: it enhances decision-making by recalling past successes and failures, fully leveraging the advantages of the ReAct framework [36]. Experiments showed that ExpeL improves continuously across tasks in multiple domains and exhibits cross-task transfer learning. Combining ExpeL with Reflexion further improves performance over iterative task attempts, highlighting the importance of autonomous learning and experience accumulation in developing intelligent agents. ExpeL demonstrates its potential as a state-of-the-art (SOTA) LLM-based agent in several respects, particularly cross-task learning, self-improvement, and memory mechanisms; compared with existing SOTA agents such as Reflexion [98], it outperforms baseline methods in various task environments. These studies collectively indicate the importance of autonomous learning and improvement in LLM-based agents: agent systems continuously optimize their decision-making processes through self-feedback, self-reflection, and accumulated experience, showing higher autonomy and flexibility in dynamic and complex tasks than traditional LLMs. Unlike traditional LLMs, which mainly rely on pre-training data and parameter updates, LLM-based agents adapt and improve in real time through continuous self-learning and feedback mechanisms, thus performing strongly across a variety of tasks.
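ExpeL's experience pool and insight pool can be approximated with two lists: trajectories collected from past trials, and natural-language insights distilled from comparing successes with failures, which are then retrieved and prepended to new tasks. The class below is a rough sketch under those assumptions; the `call_llm()` helper and method names are hypothetical, not ExpeL's published API.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    raise NotImplementedError

@dataclass
class ExperienceMemory:
    trajectories: list[str] = field(default_factory=list)  # raw past attempts + outcomes
    insights: list[str] = field(default_factory=list)       # distilled, reusable rules

    def add_trial(self, trajectory: str, succeeded: bool) -> None:
        self.trajectories.append(f"[{'success' if succeeded else 'failure'}] {trajectory}")

    def distil(self) -> None:
        """Compare successes and failures and extract reusable insights."""
        summary = call_llm(
            "From these trials, list general rules that led to success:\n"
            + "\n".join(self.trajectories[-20:])
        )
        self.insights.extend(line for line in summary.splitlines() if line.strip())

    def solve(self, task: str) -> str:
        hints = "\n".join(self.insights[-10:])  # recall past knowledge before acting
        return call_llm(f"Known insights:\n{hints}\nNew task:\n{task}")
```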
[99] proposes the AGENTVERSE multi-agent framework, designed to improve task-completion efficiency and effectiveness through collaboration. The framework draws on human group dynamics by designing a collaborative system of expert agents that performs strongly in tasks such as text understanding, reasoning, coding, and tool usage. Experiments showed that AGENTVERSE performs well not only in independent task completion but also gains significantly from group collaboration, especially in coding tasks where GPT-4 serves as the brain of the agent groups. The framework also observed emergent behaviors among agents during collaboration, such as voluntary actions, conformity, and destructive behaviors, providing valuable insights for understanding and optimizing multi-agent systems.
Another multi-agent study [100] introduces the well-known CAMEL framework, which explores scalable techniques to facilitate autonomous collaboration among agents. The study proposes a role-playing collaborative agent framework that guides dialogue agents to complete tasks through embedded prompts while maintaining alignment with human intentions. CAMEL generates dialogue data to study behaviors and capabilities within the agent society; the study further improved agent performance by fine-tuning the LLaMA-7B model, validating the effectiveness of the generated datasets in enhancing LLM capabilities. [101] presents a comprehensive comparison of LLM-augmented autonomous agents and proposes BOLAA, a new multi-agent coordination strategy for solving complex tasks through efficient communication and coordination. Experiments showed that BOLAA outperforms other agent architectures in the WebShop environment, especially with high-performance LLMs. The above three studies focus on achieving multi-agent collaboration architectures by increasing the number of agents, a trend indicating that more frameworks are beginning to explore the potential of multi-agent systems. [44] explores methods to enhance LLM performance by increasing the number of agents. Using sampling-and-voting methods, the study showed that as the number of agents increases, LLM performance in arithmetic reasoning, general reasoning, and code generation improves significantly, proving the effectiveness of multi-agent collaboration in enhancing model performance. These studies collectively indicate the importance of multi-agent collaboration and dialogue frameworks for autonomous learning and decision-making tasks. Compared with traditional LLMs, these multi-agent frameworks improve reasoning accuracy under zero-shot settings and demonstrate higher autonomy and flexibility, reducing the burden on developers.
LLM-based agents not only perform complex data analysis tasks but also show potential in simulating and understanding human trust behaviors. [102] introduces SELF, a framework designed to achieve self-evolution of LLMs through language feedback, using RLHF to train agent behavior toward human alignment. The framework enhances model capabilities through iterative self-feedback and self-improvement without human intervention. In experiments, test accuracy on the GSM8K and SVAMP datasets increased by 6.82% and 4.9%, respectively, and overall task win rates on the Vicuna and Evol-Instruct test sets increased by 10% and 6.9%. Another study explores the potential of LLM-based agents to simulate human trust behaviors: [103] examines whether LLM-based agents exhibit trust behaviors similar to humans and whether these behaviors can align with human trust. Through a series of trust-game variants, such as initial fund allocation and return trust games, the research analyzes agents' trust decisions and behaviors in different contexts. Results show that, particularly for GPT-4, LLM-based agents exhibit trust behaviors consistent with human expectations in these games, validating their potential for simulating human trust behaviors. The efficient and accurate handling of diverse datasets highlights broad application prospects in fields such as software engineering. In simulating trust behaviors, LLM-based agents display human-like behavior patterns through complex trust decisions and behavior analysis, providing an important theoretical foundation for future human-machine collaboration and human behavior simulation.
Integrating LLMs into agents also allows more complex task processing. [104] proposes AgentLite, a lightweight, user-friendly library designed to simplify the development, prototyping, and evaluation of task-oriented LLM-based agent systems. The main goal is to enhance the capability and flexibility of LLM-based agents in various applications by providing a flexible framework. AgentLite strengthens task-solving through task decomposition and multi-agent coordination, using a hierarchical approach in which a managing agent supervises the task execution of each worker agent. [105] introduces GPTSwarm, a framework that represents LLM-based agents as computational graphs in order to unify existing prompt engineering techniques, and introduces methods for optimizing these graphs to improve agent performance. The study verifies the framework's effectiveness on benchmarks such as MMLU, Mini Crosswords, and HumanEval, and reports significant improvements on the GAIA benchmark, with a margin of up to 90.2% over the best existing methods. Additionally, agents have shown strong capabilities in autonomous learning and decision-making in software engineering and security, which will be introduced in the subsequent Software Security section [106] [107] [108] [109].
C. Analysis

Overall, LLMs and LLM-based agents both exhibit strong capability in autonomous learning and decision making, but from slightly different perspectives. The differences are reflected in the focus of task execution, as well as in autonomy, interactivity, learning and adaptation mechanisms, and integration with other systems and modalities. From the perspective of task execution focus, LLMs primarily concentrate on enhancing specific functions in software engineering, such as debugging, problem-solving, and automated reasoning. The tasks they perform are usually static and well-defined: automatic debugging, improving debugging capabilities to autonomously identify and correct errors, evaluating creativity, or judging responses from other chatbots. In contrast, LLM-based agents not only address specific tasks but also manage multiple tasks simultaneously, often involving dynamic decision-making and interaction with other agents or systems. Examples include enhancing reasoning through multi-agent discussions, continuous learning from experience, real-time dynamic decision-making, and handling multimodal tasks in visual environments.
We can conclude that the application of LLM-based agents to autonomous learning and decision-making primarily involves exploring their performance in specific tasks through various framework designs. These studies evaluate the agents' autonomy and decision-making capabilities to determine whether they align with human behavior and decision processes. Looking at the specific task designs, in terms of autonomy and interactivity LLMs are usually designed to perform highly specific tasks without needing to adapt to external input or environmental changes; they mainly operate as single models focusing on processing and responding within predefined boundaries, which applies to all LLM applications. LLM-based agents, on the other hand, exhibit higher autonomy: they are typically designed to interact with or adapt to the environment in real time, and they are often part of multi-agent systems in which collaboration and communication are key components, for example using extra models or tools to assist with the planning phases. In terms of integration with other systems and modalities, LLMs typically operate in text input-output scenarios; even in multi-modal settings their role is usually limited to processing and generating text-based content. LLM-based agents are more likely to integrate with other systems and modalities, such as visual input or real-world perception data, enabling them to perform more complex and context-based decision-making tasks.

Fig. 6: The ExpeL [35] framework with Reflexion [98] in experience gathering.

Regarding learning and adaptation mechanisms, LLMs' adaptation and learning are usually confined to the model's training data and parameters; although they can adapt through new data updates, they lack the ability to learn continuously from real-time feedback and are more focused on using existing knowledge to solve problems and generate responses. In contrast, LLM-based agents are often equipped with experiential learning and real-time feedback adaptation mechanisms, allowing them to optimize strategies and responses through continuous interaction. A good example of an LLM-based agent framework is ExpeL [35], which builds on the earlier ReAct [36] and Reflexion [98] work, as shown in Figure 6. The framework uses a memory pool and an insights pool to let the LLM learn from past knowledge, thereby aiding subsequent decision-making. This autonomous decision-making capability is something traditional LLM frameworks cannot achieve.
D. Benchmarks

In the field of autonomous learning and decision-making, the benchmark datasets used by LLMs and LLM-based agents are quite similar in task handling and application requirements. Comparing them gives a deeper understanding of the strengths and weaknesses of both approaches across tasks and application contexts. For the specific dataset references, please see Table VII.
In the research on LLMs, the main datasets include Defects4J, MMLU, TransCoder, and MBPP, used primarily to evaluate model performance in specific domains and tasks. Defects4J is a software defect dataset widely used in software engineering, containing 525 real defects from 17 Java projects. It is designed to test the effectiveness of automated program repair and defect detection tools by providing a standardized benchmark that allows researchers to compare different methods. MMLU (Massive Multitask Language Understanding) is a large-scale benchmark covering 57 subjects, testing models on a broad spectrum of knowledge and reasoning abilities in multitask language understanding. Its questions range from elementary education to professional level, such as College Mathematics, Business Ethics, and College Chemistry, challenging the models' diverse knowledge base and reasoning capabilities. The TransCoder dataset focuses on code translation across programming languages, evaluating a model's ability to automatically translate code from one language to another; this is crucial for multilingual software development and maintenance and can greatly enhance development efficiency. MBPP (Mostly Basic Python Programming), introduced in a previous section, contains 427 Python programming problems covering basic concepts and standard library functions; it is widely used to test model performance in different programming scenarios and to evaluate the ability to generate correct and efficient code.
In contrast, LLM-based agents use datasets that emphasize multitasking and decision-making in complex scenarios. The main datasets include HotpotQA, ALFWorld, FEVER, WebShop, and MGSM. HotpotQA is a multi-hop question-answering dataset that requires models to reference content from multiple documents, evaluating information synthesis and reasoning in complex reasoning tasks. ALFWorld is a text-based environment simulation dataset requiring multi-step decision-making, where the model completes tasks in a virtual home environment; it combines natural language processing and decision-making, evaluating performance in dynamic, interactive tasks. The FEVER (Fact Extraction and VERification) dataset is used for fact verification: the model must verify the truthfulness of given statements and provide evidence, assessing capabilities in information retrieval and logical reasoning. WebShop is an online shopping environment simulation containing 1.18 million real-world products and human instructions; it tests performance in complex decision-making tasks such as completing shopping tasks and attribute matching. MGSM (Multimodal Generalized Sequence Modeling) is a multimodal dataset containing dialogue, creative writing, mathematical reasoning, and logical reasoning tasks, evaluating a model's comprehensive abilities in multimodal settings.
Comparatively, LLM datasets typically focus on single, static tasks such as code generation, mathematical reasoning, and creative writing, suitable for models working within predefined task scopes; datasets like Defects4J, MMLU, and MBPP help evaluate model capabilities in specific domains. LLM-based agents are better suited to complex, multitasking, dynamic environments whose datasets require handling multimodal inputs and real-time decision-making, showcasing their advantages in complex interactions and multitasking scenarios. Datasets like HotpotQA, ALFWorld, FEVER, and WebShop challenge models in information synthesis, dynamic decision-making and interaction, and multimodal tasks. This difference arises from the distinct design goals of the two: LLMs aim to optimize performance on single tasks, while LLM-based agents are designed to handle complex or multimodal tasks requiring higher autonomy and adaptability. It also reflects modern applications' demand for highly interactive, adaptive, and multifunctional AI systems, driving the development from single LLM models to multi-agent systems. Through these analyses, we can identify the different applications of LLMs and LLM-based agents in autonomous learning and decision-making; it is important to choose the appropriate framework to meet different task requirements in real-world applications.

E. Evaluation Metrics

Various evaluation metrics are used in research on LLMs and LLM-based agents to evaluate model performance on specific tasks and to analyze application effectiveness in this domain. Below, we discuss several representative studies, analyzing the evaluation metrics they employ and exploring the differences between LLMs and LLM-based agents in this field.
In research on LLMs, evaluation metrics primarily focus on model accuracy and task completion. In [90], researchers used the accuracy of a voting inference system, measured by the expected 0/1 loss (the proportion of correct responses), to assess model performance. This metric evaluates accuracy across multiple calls, reflecting the ability of LLMs to improve result accuracy via repeated inference. Common evaluation metrics in the literature include accuracy and sample efficiency: accuracy is the proportion of correct predictions, while sample efficiency measures the number of samples required to reach a given accuracy level. Together they assess both the predictive and decision-making ability of the model and its data-utilization efficiency during training. In [92], the evaluation metrics include possible patches, correct patches, precision, and developer accuracy. Possible patches are patches that pass all tests, while correct patches are semantically equivalent to the original developer patches; precision measures the proportion of correct patches among the possible patches, and developer accuracy assesses the correctness of patches, with and without explanations, through human evaluation. These metrics emphasize the model's explanatory capability and practical effectiveness in automated code repair, with increasing reliance on human evaluation. To assess model creativity, value, novelty, and surprise are used as creativity dimensions; quality, social acceptability, and similarity of generated works, as well as the ability to generate creative products, are also included in the evaluation. [110] used the success rate in the Game of 24 and the coherence of generated paragraphs in creative writing as evaluation metrics, assessing performance in problem-solving and text generation and showcasing LLMs' potential for solving complex problems and generating coherent text. In [95], consistency and success rate were used as evaluation metrics: consistency is the probability of agreement between two judges on randomly selected questions, which measures the alignment of LLM judges with human preferences, while success rate is used for specific tasks (such as the Game of 24) to measure the correct response rate.
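The quantities above are straightforward to compute once predictions are collected: accuracy is the complement of the expected 0/1 loss, and the consistency used in [95] is simply the rate at which two judges pick the same answer. The helper below is a small illustration of those definitions, not code taken from any of the cited artefacts.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """1 - expected 0/1 loss: fraction of predictions that exactly match the reference."""
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def consistency(judge_a: list[str], judge_b: list[str]) -> float:
    """Probability that two judges agree on randomly selected questions."""
    assert len(judge_a) == len(judge_b)
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

# Example: three verdicts from an LLM judge versus a human judge.
print(consistency(["A", "B", "A"], ["A", "B", "B"]))  # -> 0.666...
```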
In contrast, LLM-based agents use more diverse evaluation metrics that reflect their multi-agent, collaborative character. In [97], the metrics include a Perception Score (P-Score), a Cognition Score (C-Score), and an Action Score (A-Score), which together assess the model's perception, cognition, and action capabilities and demonstrate the overall performance of LLM-based agents on multimodal tasks. In multimodal applications, success rate (SR) is often the primary metric, evaluated through tasks such as HotpotQA and FEVER with exact-match success; such metrics focus on task completion and accuracy, showcasing the practical execution capability of LLM-based agents in different task environments. In [111], the evaluation metrics include practitioner feedback, efficiency, and accuracy. Practitioner feedback uses the Likert scale to collect satisfaction and performance feedback; the Likert scale is a commonly used psychometric tool for measuring an individual's attitude or opinion toward a particular statement, typically with five options: Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree. Efficiency and accuracy are measured through the effectiveness of model-executed qualitative data analysis as validated by practitioners. These metrics assess the agents' performance in qualitative data analysis, demonstrating their utility and accuracy in practical applications.
Comparing these metrics, we find that LLM studies use traditional metrics such as accuracy and sample efficiency to assess capability. In contrast, LLM-based agents handle more complex pipelines through multiple agents, which requires more comprehensive and diverse metrics to evaluate performance from multiple directions. LLM-based agents in multimodal and self-evolution tasks emphasize the integrated performance of perception, cognition, and action. This difference reflects LLMs' strengths in single-task optimization and LLM-based agents' potential in collaboratively handling complex tasks with a higher capacity for autonomous learning. Additionally, practical application metrics for LLM-based agents, such as practitioner feedback, efficiency, and accuracy, demonstrate their utility and user satisfaction in real-world scenarios. This evaluation approach assesses not only task completion but also the overall user experience, which can in turn gauge the human alignment of their decision-making capabilities.
VII. SOFTWARE DESIGN AND EVALUATION

The application of LLMs to software design and evaluation overlaps strongly with the previous topics. Software design is an early phase of software development, and the quality of the design directly impacts the quality of later development. Modern software engineering methodologies emphasize the integration of design and development to ensure that decisions made during the design phase translate seamlessly into high-quality code. Consequently, research on software design often explores aspects related to code generation and development, using LLMs for software development within a particular framework and architecture design. Software design frameworks often involve multiple stages of continuous refinement to achieve optimal results, which can be considered part of LLM applications in software development [83]. Similarly, [85] and [84] highlight the frequent use of tools or API interfaces when using LLMs to assist development and design, demonstrating an overlap with the topic of code generation and software development.
LLMs in software design and evaluation also intersect extensively with autonomous learning and decision making; the two topics are interrelated. Software design needs to consider system adaptability and learning capabilities to handle dynamic environments, so design evaluations involving autonomous learning and decision making naturally become a focal point of intersection. Many LLM techniques and methods find similar applications in both fields; for example, LLMs based on reinforcement learning can be used for automated design decisions and evaluations as well as for self-learning and optimization. Common applications of LLMs in software engineering involve fine-tuning models and prompt engineering techniques to continuously improve performance, particularly in software design and evaluation, where more sample learning is often required to ensure that model outputs align with user expectations [93] [102] [44] [111] [105] [96]. Additionally, requirement elicitation and specification in requirement engineering can also be considered part of software design and evaluation [51] [112]. This section reviews the main research achievements of LLMs in software design and evaluation in recent years, discussing their application scenarios and practical effects.

A. LLMs Tasks

In recent years, there has been extensive research on the use of LLMs in tasks such as automation, optimization, and code understanding. ChatGPT has been widely applied to software engineering tasks and has demonstrated excellent performance in log summarization, pronoun resolution, and code summarization, achieving a 100% success rate in both log summarization and pronoun resolution [113]. However, its performance on tasks such as code review and vulnerability detection is relatively poor, showing that it needs further improvement for more complex tasks. Another framework, EvaluLLM, addresses the limitations of traditional reference-based evaluation metrics (such as BLEU and ROUGE) by using LLMs to assess the quality of natural language generation (NLG) outputs [114]. EvaluLLM introduces an evaluation method that compares generated outputs in pairs and uses win-rate metrics to measure model performance; this approach simplifies evaluation while remaining consistent with human assessments, showcasing the broad application prospects of LLMs in generative tasks. Similarly, in the LLM evaluation domain, LLM-based NLG Evaluation provides a review and classification of current LLMs used for NLG evaluation, summarizing four main evaluation methods: LLM-derived metrics, prompt-based LLMs, fine-tuned LLMs, and human-LLM collaborative evaluations [115]. These methods demonstrate the potential of LLMs for evaluating generative outputs while noting challenges such as the need for improved evaluation metrics and further exploration of human-LLM collaboration.
There are also many novel application designs with LLMs in engineering design. One study explores software/hardware co-design strategies to optimize LLMs and applies them to design verification [116]. Through quantization, pruning, and operation-level optimization, this research demonstrates applications in high-level synthesis (HLS) design functionality verification: GPT-4 was used to generate HLS designs containing predefined errors to create a dataset called Chrysalis, which provides a valuable resource for evaluating and optimizing LLM-based HLS debugging assistants. The optimized LLM significantly improves inference performance, providing new possibilities for error detection and correction in the electronic design automation (EDA) field. In [117], the researchers introduce RaWi, a data-driven GUI prototyping approach. The framework allows users to retrieve GUIs from a repository, edit them, and quickly create new high-fidelity prototypes. An experiment comparing RaWi with a traditional GUI prototyping tool (Mockplus) measured how quickly and effectively users can create prototypes; RaWi outperformed on multiple benchmarks, with a 40% improvement on the precision@k metric. This study shows the potential of LLMs to improve efficiency during the prototyping phase of software design, allowing designers to iterate quickly on GUI designs and facilitating early detection of design flaws. With the new possibilities brought by LLMs, there has also been much discussion in the education field, with researchers exploring the implications of the prevalence of large language models for education [118]. One study indicates that ChatGPT shows significant potential but also limitations in answering questions from software testing courses [119]: ChatGPT was able to answer about 77.5% of the questions and provided correct or partially correct answers 55.6% of the time, but the correctness of its explanations was only 53.0%, indicating the need for further improvement in educational applications.
B. LLM-based Agents Tasks

The application of LLM-based agents in software design and evaluation improves development efficiency and code quality, and showcases the broad applicability and potential of LLM-based agents in practical software engineering tasks. [120] explores the current capabilities, challenges, and opportunities of autonomous agents in software engineering. The study evaluates Auto-GPT's performance across different stages of the software development lifecycle (SDLC), including software design, testing, and integration with GitHub. It finds that detailed contextual prompts significantly enhance agent performance in complex software engineering tasks, reducing errors and improving efficiency, and underscores the potential of LLM-based agents to automate and optimize various SDLC tasks. The paper also evaluates Auto-GPT's limitations, including task or goal skipping, generating unnecessary code or files (hallucinations), repetitive or looping responses, and a lack of task-completion verification mechanisms. These limitations can lead to incomplete workflows, inaccurate outputs, and unstable performance in practical applications.
[121] introduces ChatDev, the first virtual chat-driven software development company, a concept that uses LLMs not just for specific tasks but as central coordinators in a chat-based, multi-agent framework. This approach allows for more structured, efficient, and collaborative software development, exploring how chat-driven multi-agent systems can achieve efficient software design and evaluation, reduce code vulnerabilities, and improve development efficiency and quality. Experiments show that ChatDev can design and generate software in an average of 409.84 seconds at a cost of only $0.2967 while significantly reducing code vulnerabilities, indicating that chat-based multi-agent frameworks can improve software development efficiency and quality. A similar collaboration framework was introduced by a Microsoft research team: [122] demonstrates the effectiveness of using LLMs, particularly ChatGPT, as agent controllers to manage and execute various AI tasks. The HuggingGPT system uses ChatGPT to orchestrate the execution of tasks by the various AI models available on Hugging Face; the purpose is to test how effectively the system can handle complex AI tasks, including language, vision, and speech, by executing appropriate models based on user requests. The innovation lies in using LLMs not merely as tools for direct task execution but as central orchestrators that leverage existing AI models to fulfill complex tasks, expanding the practical applicability of LLMs beyond typical language tasks.
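The controller pattern HuggingGPT describes, where a chat model plans subtasks and dispatches each one to a specialised model, can be outlined as below. The registry and the `call_llm()` helper are hypothetical placeholders used for illustration; the real system resolves models on the Hugging Face hub rather than from a hard-coded table.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # the ChatGPT-style controller model

# Hypothetical registry mapping task types to specialised executors.
MODEL_REGISTRY = {
    "image-captioning": lambda x: f"[caption of {x}]",
    "translation":      lambda x: f"[translation of {x}]",
    "summarization":    lambda x: f"[summary of {x}]",
}

def orchestrate(user_request: str) -> list[str]:
    # Controller decomposes the request into typed subtasks.
    plan = call_llm(
        "Decompose the request into subtasks. Reply as JSON: "
        '[{"task": "<registry key>", "input": "<payload>"}]\n'
        f"Request: {user_request}"
    )
    results = []
    for step in json.loads(plan):
        # Controller delegates each subtask to the matching specialised model.
        executor = MODEL_REGISTRY.get(step["task"])
        results.append(executor(step["input"]) if executor else f"no model for {step['task']}")
    return results
```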
[123] proposes the LLMARENA benchmark framework to evaluate LLMs' capabilities in dynamic multi-agent environments. The idea is similar to ChatDev but innovates by shifting the focus from single-agent static tasks to dynamic, interactive multi-agent environments, providing a more realistic and challenging setting to assess the practical utility of LLMs; this mirrors real-world conditions where multiple agents (AI or human) interact and collaborate. Experiments show that the framework can test LLMs' spatial designing, strategic planning, and teamwork abilities in gaming environments, offering new possibilities and tools for designing and evaluating LLMs in multi-agent systems.
[124] introduces the "Flows" conceptual framework for structuring interactions between AI models and humans to improve reasoning and collaboration capabilities. The study conceptualizes processes as independent, goal-driven entities that interact through standardized message-based interfaces, enabling a modular and extensible design. This approach is inherently concurrency-friendly and supports complex nested AI interactions without having to manage complex dependencies. Experiments in competitive coding tasks show that Flows increases the AI model's problem-solving rate by 21 percentage points and the human-AI collaboration rate by 54 percentage points, demonstrating how modular design can enhance AI and human collaboration and thereby improve the software design and evaluation process.
[125] presents a new taxonomy for structurally understanding and analyzing LLM-integrated applications, providing new theories and methods for software design and evaluation. The taxonomy helps in understanding the integration of LLM components in software systems, laying a theoretical foundation for developing more effective and efficient LLM-integrated applications. Similarly, [126] explores the application of LLM-based agents to software maintenance tasks, improving code quality and reliability through a collaborative framework. This study would normally be categorized under the software maintenance domain, but it exhibits the iterative manner of the design structure. The framework uses task decomposition and multi-agent strategies to tackle complex engineering tasks that traditional one-shot methods cannot handle effectively; multiple agents can learn from each other, leading to improved software maintenance outcomes. Experiments show that multi-agent systems outperform single-agent systems in complex debugging tasks, indicating that this framework can be applied in software design to provide safer architectures.

C. Analysis

Overall, LLM applications in software design and evaluation typically focus on automating specific tasks, such as code generation and log summarization, with a tendency towards evaluating capability rather than producing implementations during the design phases. The process of software design is largely intertwined with software development and requirements engineering. As previously mentioned, the use of LLMs to assist software development often includes aspects of the software design process, particularly generating related design documentation; there is therefore relatively little research focused on using LLMs for higher-level software design tasks.
LLM-based agents expand the capabilities of LLMs by handling more complex workflows through intelligent decision-making and task execution: these agents can collaborate, dynamically adjust tasks, and gather and utilize external information. In software design and evaluation, a single model often cannot comprehensively consider both design and evaluation aspects, which is why many software developers are reluctant to entrust high-level tasks to AI. LLM-based agents, through collaborative work and a more refined division of roles, can complete design tasks efficiently and adapt to various application scenarios. However, the application of LLM-based agents in software design is commonly subsumed under software development; as previously discussed, self-reflection and reasoning before action occur during the software design phases. The ChatDev [121] framework uses role distribution to create a separate software design phase, which significantly increases flexibility and accuracy in the later development phases. In terms of efficiency and cost, LLMs are still slightly superior to LLM-based agents in text generation and vulnerability detection. However, handling tasks similar to software maintenance and root cause analysis requires more complex architectures, such as multi-turn dialogues, knowledge graphs, and RAG techniques, which can further benefit the design and evaluation phases.
D. Benchmarks

The benchmarks include public datasets and datasets crafted by the authors themselves, and the application scenarios also differ considerably, as shown in Table VIII. BigCloneBench is a benchmark dataset for code clone detection containing a large number of Java function pairs; the pairs are labelled as clones or non-clones and are used for training and evaluating clone detection models, with the main evaluation metric being the correct identification rate. The Chrysalis dataset, created by [116], contains over 1000 function-level designs from 11 open-source synthesizable HLS suites; it is primarily used to evaluate the effectiveness of LLM debugging tools in detecting and correcting injected errors in HLS designs, with the main evaluation metric being the effectiveness of error detection and correction. The CodexGLUE dataset is a comprehensive benchmark covering various code generation and understanding tasks such as code completion, code repair, and code translation, used to evaluate the performance of code generation models in practical programming tasks. In addition to these public datasets, some artificially simulated datasets are used, such as a simulated job fair environment dataset, which models a virtual job fair containing multiple task scenarios such as interviews, recruitment, and team project coordination; it is used to evaluate the coordination capabilities of generative agents in complex social tasks, with task coordination success rate and role matching accuracy as the main evaluation metrics.
Comparatively, LLM research tends to use specific, publicly available datasets such as BigCloneBench, which provide standardized evaluation benchmarks and aid the reproducibility and comparability of results. Research on LLM-based agents tends to use customized experimental settings or unspecified datasets, such as requirement documentation, emphasizing for instance that the experiments involve 70 user requirements without naming a particular dataset. This is usually because the research needs to evaluate performance from multiple angles, and general datasets are difficult to adapt perfectly to vertical application scenarios. Both LLMs and LLM-based agents use a variety of datasets covering tasks from code generation and code understanding to natural language generation and task management, since software design and evaluation is closely interrelated with the other topics. However, because LLM-based agents can be extended to application scenarios involving videos and pictures, agents such as Auto-GPT and HuggingGPT also use multimodal datasets that contain not only code and text but also images and speech. Moreover, compared with a single LLM framework, LLM-based agents need to be evaluated in more areas, so benchmarks also need to be considered separately; for example, LLMARENA is specially designed to test LLM performance in dynamic multi-agent environments, covering complex tasks such as spatial reasoning, strategic planning, and risk assessment.

E. Evaluation Metrics

In software design and evaluation, various studies have employed different evaluation metrics to measure the performance of LLMs and LLM-based agents across a range of tasks. Both LLM and LLM-based agent research use more than one metric to comprehensively assess model performance. LLM research tends to focus on traditional metrics such as accuracy, win rate, and consistency, while LLM-based agent research still considers those fundamental metrics but further introduces more complex measures, such as task coordination success rate and role matching accuracy. However, it cannot be definitively stated that future LLM-based agent research will always use more flexible, multi-dimensional evaluation metrics; this depends on the specific task and dataset. The reason for this phenomenon, as observed in this survey, is primarily that tasks in LLM research are relatively single-purpose, mainly static tasks such as log summarization evaluated with traditional methods, whereas LLM-based agent research involves more general multi-agent tasks whose evaluation emphasizes interactivity and dynamics. LLM-based agent research focuses more on the model's collaboration and decision-making capabilities, using multi-dimensional metrics to comprehensively assess potential in practical applications rather than accuracy alone. This explains why, despite the similarity of metrics such as accuracy and completion time, LLM-based agents use flexible evaluation metrics, including measures like mutual exclusiveness and appropriateness.
VIII. SOFTWARE TEST GENERATION

A crucial component of software development is software testing, which needs to be conducted continuously from initial system development to final deployment. In industry, agile development is commonly used, testing the system continuously at every stage to ensure the robustness of the entire system: whenever new code is committed to GitHub, tests are run to ensure the usability of the updated version. A common approach is to use Jenkins (https://www.jenkins.io/) to achieve continuous integration and continuous deployment; Jenkins automatically hooks into the developer's push to GitHub and runs a test suite against the new version. Although this process leans towards automated development, creating and refining test cases still requires substantial human effort.
Typical development roles involve software testing, such as writing unit tests, integration tests, and fuzz tests. Researchers have been attempting to use AI to help generate test cases since before 2000. Initial implementations typically involved simpler forms of AI and machine learning to automate parts of the test case generation process. Over time, more sophisticated methods such as natural language processing and machine learning models have been applied to improve the precision and scope of test case generation. Online tools such as Sofy (https://sofy.ai/), which uses machine learning to generate context-based paths through applications, also exist to aid in generating test suites. Using large language models to generate test cases is a relatively new attempt but has been developing rapidly. In 2020, researchers fine-tuned pre-trained language models on labeled data to generate test cases, developing a sequence-to-sequence transformer-based model called
"ATHENATEST", and compared its generated results with EvoSuite and GPT-3, demonstrating better test coverage [131]. More research and models are being dedicated to test suite generation experiments; for instance, the Codex model [67], mentioned earlier in the code generation section, combined with chain-of-thought prompting achieved high-quality test suite generation with CodeCoT, even in zero-shot scenarios. The introduction of LLMs aims to automate and streamline the testing process, making it more rigorous and capable of addressing aspects that humans might easily overlook.

A. LLMs Tasks

The application of LLMs in software test generation is extensive and encompasses more than test suite generation. The papers reviewed in this survey cover several aspects, including security test generation, bug reproduction, general bug reproduction, fuzz testing, and coverage-driven test generation. These tasks are achieved through various models and techniques, significantly improving software quality and reducing developers' workload. [132] evaluates the effectiveness of using GPT-4 to generate security tests, demonstrating how supply chain attacks can be conducted by exploiting dependency vulnerabilities. The study experimented with different prompt styles and templates to explore how varying information inputs affect test generation quality; tests generated by ChatGPT successfully discovered 24 proof-of-concept vulnerabilities in 55 applications, outperforming the existing tools TRANSFER [133] and SIEGE (https://siegecyber.com.au/services/penetration-testing/). This research introduces a new method for generating security tests using LLMs and provides empirical evidence of their potential in the security testing domain, offering developers a novel approach to handling library vulnerabilities in applications.
Another application is bug reproduction, which allows testers to locate and fix bugs more quickly and efficiently. [134] addresses the limitations of current bug reproduction methods, which are constrained by the quality and clarity of handcrafted patterns and predefined vocabularies. The paper proposes and evaluates AdbGPT, a framework that uses a large language model to automatically reproduce errors from Android bug reports; it is described as outperforming current SOTA approaches in automated bug replay for Android. Experimental results show that AdbGPT achieved accuracies of 90.4% and 90.8% in S2R entity extraction and a success rate of 81.3% in error reproduction, significantly outperforming the baseline ReCDroid and the ablation variants. By introducing prompt engineering, few-shot learning, and chain-of-thought reasoning, AdbGPT demonstrates the powerful capabilities of LLMs in automated error reproduction. It also uses GUI encoding to convert the GUI view hierarchy into HTML-like syntax, giving the LLM a clear understanding of the current GUI state. While AdbGPT is specialized for Android systems, [135] proposes the LIBRO framework, which uses LLMs to generate bug reproduction tests from bug reports. The experimental results show that LIBRO successfully reproduced 33.5% of bugs in the Defects4J dataset and 32.2% in the GHRB dataset. By combining advanced prompt engineering with post-processing techniques, LIBRO demonstrates the effectiveness and efficiency of LLMs in generating bug reproduction tests. Although LIBRO has lower absolute effectiveness than AdbGPT, it was tested across a more diverse set of Java applications and is not limited to Android; while AdbGPT excels at specialized bug replay for Android, LIBRO provides wider bug reproduction coverage for Java applications. The extensive application of LLMs to test generation tasks such as security test generation, bug reproduction, fuzz testing, program repair, and coverage-driven test generation highlights their significant potential to improve software quality and reduce the burden on developers; through various models and techniques, these tasks demonstrate how LLMs can automate and enhance the software testing process, addressing aspects that are often overlooked by humans.
Similarly, in fuzz testing, LLMs have shown promising potential. [136] developed Fuzz4All, a universal fuzzing tool that uses LLMs to generate and mutate inputs for various software systems. The tool addresses the problems of traditional fuzzers being tightly coupled to specific languages or systems and lacking support for evolving language features. The study conducted various experiments to test the tool's capabilities, including coverage comparison, bug finding, and targeted fuzzing. The results showed that Fuzz4All achieved the highest code coverage in all tested languages, with an average increase of 36.8%, and discovered 98 bugs across nine systems, considered state of the art in universal fuzzing with LLMs at the time. Through self-prompting and an LLM-driven fuzzing loop, Fuzz4All demonstrated the effectiveness of LLMs in fuzz testing and showcased their capability across multiple languages and systems under test (SUTs) through comprehensive evaluations.
ciently. [134] addresses the limitations of current bug repro- [137] introduced SymPrompt, a new code-aware prompting
duction methods, which are constrained by the quality and strategy aimed at addressing the limitations of existing Search-
clarity of handcrafted patterns and predened vocabularies. Based Software Testing (SBST) methods and traditional LLM
The paper proposes and evaluates a new method framework prompting strategies in generating high-coverage test cases. By
called AdbGPT, which uses a large language model to auto- decomposing the original test generation process into a multi-
matically reproduce errors from Android bug reports. AdbGPT stage sequence aligned with the execution paths of the method
is described as outperforming current SOTA approaches in the under test, SymPrompt generated high-coverage test cases.
context of automated bug replay for only Android system. The Experimental results indicated that SymPrompt increased cov-
experimental results show that AdbGPT achieved accuracies of erage on CodeGen2 and GPT-4 by 26% and 105% respectively.
90.4% and 90.8% in S2R entity extraction and a success rate Through path constraint prompting and context construction
of 81.3% in error reproduction, signicantly outperforming the techniques, SymPrompt demonstrated the potential of LLMs
baseline ReCDroid and ablation study versions. By introducing in generating high-coverage test cases. [138] also focused
prompt engineering, few-shot learning, and chain-of-thought on test suite coverage, this study introduced the COVERUP
reasoning, AdbGPT demonstrates the powerful capabilities system which generates high-coverage Python regression tests
of LLMs in automated error reproduction. It also uses GUI through coverage analysis and interaction with LLMs. The
encoding to convert the GUI view hierarchy into HTML-like experimental results showed that COVERUP increased code
syntax, providing LLMs with a clear understanding of the coverage from 62% to 81% and branch coverage from 35% to
current GUI state. While AdbGPT is specialized for Android 53% through iterative prompting and coverage-driven meth-
systems, [135] proposes the LIBRO framework, which uses ods. [139] proposed the AID method, which combines LLMs
LLMs to generate bug reproduction tests from bug reports. with differential testing to improve fault detection in ”plausibly
correct” software. By comparing the effectiveness of AID in
6 https://siegecyber.com.au/services/penetration-testing/ generating fault-revealing test inputs and oracles, the experi-
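Coverage-driven approaches such as COVERUP and CODAMOSA share a simple outer loop: measure coverage, ask the model for tests that exercise the still-uncovered code, keep only the candidates that raise coverage, and repeat. The Python sketch below illustrates that general loop; the llm and measure callables and the prompt wording are illustrative placeholders, not the interfaces of any surveyed tool.

    from typing import Callable, List, Set, Tuple

    # Hypothetical callables supplied by the caller:
    #   llm(prompt)    -> str, a block of pytest code proposed by the model
    #   measure(tests) -> (line coverage fraction, set of uncovered line numbers)
    def generate_tests_iteratively(
        source_code: str,
        llm: Callable[[str], str],
        measure: Callable[[List[str]], Tuple[float, Set[int]]],
        max_rounds: int = 5,
        target: float = 0.9,
    ) -> List[str]:
        tests: List[str] = []
        covered, uncovered = measure(tests)
        for _ in range(max_rounds):
            if covered >= target or not uncovered:
                break
            prompt = (
                "Module under test:\n" + source_code
                + "\nUncovered lines: " + ", ".join(map(str, sorted(uncovered)))
                + "\nWrite pytest functions that execute these lines."
            )
            candidate = llm(prompt)
            new_cov, new_uncov = measure(tests + [candidate])
            if new_cov > covered:  # keep a candidate only if it adds coverage
                tests.append(candidate)
                covered, uncovered = new_cov, new_uncov
        return tests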
[139] proposed the AID method, which combines LLMs with differential testing to improve fault detection in "plausibly correct" software. By comparing the effectiveness of AID in generating fault-revealing test inputs and oracles, the experiments showed that AID improved recall and precision by 1.80 times and 2.65 times respectively, and increased the F1 score by 1.66 times. By integrating LLMs with differential testing, AID showcased the powerful capability of LLMs in detecting complex bugs.

B. LLM-based Agents Tasks

In the field of software test generation, the application of LLM-based agents demonstrates their potential in automated test generation. While relying on LLM-based agents solely for software test generation might seem excessive, more research is directed towards vulnerability detection and system maintenance. LLM-based agents can enhance test reliability and quality by distributing tasks such as test generation, execution, and optimization across a multi-agent collaborative system. These multi-agent systems offer clear improvements in error detection and repair as well as coverage testing. An example of such a system is AgentCoder's multi-agent framework, discussed in the code generation and software development section [82]. The primary goal of this system is to leverage multiple specialized agents to iteratively optimize code generation, overcoming the limitations of a single-agent model in generating effective code and test cases. The paper introduces a test design agent, which creates diverse and comprehensive test cases, and a test execution agent, which executes the tests and provides feedback; the framework reached an 89.9% pass rate on the MBPP dataset. Similarly, the SocraTest framework falls under the Autonomous Learning and Decision Making topic [106]. This framework automates the testing process through conversational interactions; the paper presents detailed examples of generating and optimizing test cases using GPT-4, emphasizing how multi-step interactions enhance testing methods and generate test code. Experimental results show that, through conversational LLMs, SocraTest can effectively generate and optimize test cases and utilize middleware to facilitate interactions between the LLM and various testing tools, achieving more advanced automated testing capabilities.

The papers collected for the software test generation topic are mostly multi-agent systems. The study [140] evaluates the effectiveness of LLMs in generating high-quality test cases, identifies their limitations, and proposes a novel multi-agent framework called TestChain. The paper evaluates StarChat, CodeLlama, GPT-3.5, and GPT-4 on the HumanEval and LeetCode-hard datasets. Experimental results show that the TestChain framework, using GPT-4, achieved 71.79% accuracy on the LeetCode-hard dataset, an improvement of 13.84% over baseline methods; on the HumanEval dataset, TestChain with GPT-4 achieved 90.24% accuracy. The TestChain framework designs agents to generate diverse test inputs, maps inputs to outputs using ReAct-format dialogue chains, and interacts with the Python interpreter to obtain accurate test outputs.

LLM-based agents can also be applied in user acceptance testing (UAT). [141] aims to enhance the automation of the WeChat Pay UAT process by proposing a multi-agent collaborative system named XUAT-Copilot, which uses LLMs to automatically generate test scripts. The study evaluates XUAT-Copilot's performance on 450 test cases from the WeChat Pay UAT system, comparing it to a single-agent system and a variant without the reflection component. Experimental results show that XUAT-Copilot achieved a Pass@1 rate of 88.55%, compared to 22.65% for the single-agent system and 81.96% for the variant without the reflection component, with a Complete@1 rate of 93.03%. XUAT-Copilot employs a multi-agent collaborative framework, including action planning, state checking, and parameter selection agents, and uses advanced prompting techniques, demonstrating the potential and feasibility of LLMs in automating UAT test script generation.

C. Analysis

Fig. 7: Illustration of comparison framework between LLM-based agent [141] and LLM [136] in software test generation.

In comparison, LLMs perform well in single-task implementations, generating high-quality test cases through techniques like prompt engineering and few-shot learning, and the number of related studies is increasing as the capabilities of LLMs improve. LLM-based agents, on the other hand, decompose tasks for specialized processing through multi-agent collaborative systems, significantly enhancing the effectiveness and efficiency of test generation and execution through iterative optimization and feedback. Considering cost, using an LLM alone is often sufficient and cheaper for test generation than deploying LLM-based agents; however, if a specific model performs poorly, it can affect the entire system's performance.

A single LLM may struggle with complex, multi-step tasks. For example, in high-coverage test generation, LLMs may require more elaborate prompts and post-processing steps to achieve the desired results. Additionally, the quality of the generated results depends heavily on the prompt design. For tasks requiring fine-grained control and continuous optimization, a single LLM may find it difficult to cope.
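Several of the systems above are compared using Pass@1 or Complete@1 rates. For reference, the unbiased pass@k estimator popularized by the HumanEval evaluation, given n sampled solutions of which c pass all checks, can be computed as below; this is the generic formulation, not code taken from any surveyed paper.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k from n samples with c correct ones."""
        if n - c < k:  # every size-k subset must contain a correct sample
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 10 generated test scripts per task, 4 of which pass execution.
    print(round(pass_at_k(10, 4, 1), 3))  # Pass@1 estimate of 0.4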
As shown in Figure 7, the LLM framework uses [136] as an example to demonstrate the use of LLMs in fuzz testing: the prompt is refined with the given code snippets (fuzz inputs) and re-selected by the LLM to choose the best prompt for future generation. This framework lacks autonomy; the LLM-based agent framework [141] on the left fills that gap, and it is also able to perceive the UI and interact with a skill library for its operations. The operation agent receives any error reported by the inspection agent and performs self-reflection to refine the process autonomously. However, as previously discussed, building an LLM-based agent framework solely for software test generation can be "overkill", so the collected papers on LLM-based agent systems generally focus on program repair via generated test cases or on bug replay systems; as shown in Figure 7, the LLM-based agent framework is actually used to automatically test the WeChat Pay system.

D. Benchmarks

For LLM tasks in software test generation, the Defects4J dataset is used to evaluate bug reproduction and program repair techniques. Other public datasets such as ReCDroid, ANDROR2+, and Themis are primarily used to evaluate mobile application bug reproduction and security test generation, particularly for the Android platform. GCC, Clang, the Go toolchain, the Java compiler (javac), and Qiskit serve as fuzz-testing targets for various programming languages and toolchains, aimed at assessing the effectiveness of fuzz testing in multi-language environments. TrickyBugs and EvalPlus are datasets containing complex bug scenarios, used to evaluate the precision and recall of generated test cases, and the benchmark applications evaluated by CODAMOSA are used to assess the effectiveness of coverage-based test generation tools.

The datasets used in LLM-based agents research are also quite common: HumanEval, MBPP, and LeetCode-hard are mainly used to evaluate the accuracy and coverage of code generation and test generation, involving the programming problems and challenges that appeared frequently in previous sections. Datasets like Codeflaws, QuixBugs, and ConDefects are collected to expose LLMs to erroneous code and programs, containing multiple program errors and defects, and are used to evaluate the effectiveness of automated debugging and bug repair. A unique dataset is the WeChat Pay UAT system, which includes user acceptance test cases from real applications and is used to evaluate the performance of multi-agent systems in user acceptance testing, focusing specifically on WeChat's security system.

Overall, the datasets used in LLM-based agents research are broader, covering a wide range of programming problems and challenges, while LLM research is more focused on the actual generation tasks, such as bug reproduction on the Android platform and fuzz testing in multi-language environments. This is because LLM-based agents not only focus on the quality of generated test cases and code but also evaluate the collaborative effects and iterative optimization capabilities of multi-agent systems, so the benchmarks also include datasets used to evaluate the performance of the framework itself. For instance, AgentCoder [82] improves the efficiency and accuracy of test generation and execution through multi-agent collaboration, considering both qualitative and quantitative evaluations and using MBPP and HumanEval for the evaluations; research on LLM-based agents places more emphasis on verifying the effectiveness of the system through qualitative evaluation and user feedback.

E. Evaluation Metrics

As seen in Table IX, LLM research predominantly utilizes traditional quantitative metrics such as bug reproduction rate, code coverage, precision, and recall; these metrics directly reflect the effectiveness and quality of test generation. In contrast, LLM-based agents research not only focuses on quantitative metrics but also introduces qualitative evaluations, such as improvements through conversational interactions and the collaborative effects of multi-agent systems. This diversified evaluation approach provides a more comprehensive reflection of the system's practical effects. From the task perspective, LLMs lean toward single-task processing, such as generating test sets and considering their coverage, whereas, given the broader scope of agent frameworks, LLM-based agents often tend to use the generated test sets to evaluate whether vulnerabilities can be found, achieving greater practicality. From a design perspective, LLM systems rely on prompt engineering and the generative capabilities of the models themselves, so their evaluation metrics mainly focus on the quality and effectiveness of the model outputs; agent evaluation metrics additionally cover the collaborative effects and efficiency within the system, such as improving Pass@1 and Complete@1 rates through multi-agent collaboration. Overall, LLMs are better suited for rapid test generation and evaluation on specific tasks, with evaluation metrics directly reflecting the generation's effectiveness and quality, while LLM-based agents excel in handling complex and diversified tasks, achieving higher system efficiency and effectiveness through multi-agent collaboration and iterative optimization.

IX. SOFTWARE SECURITY AND MAINTENANCE

In software engineering, software security and maintenance is a popular area for the application of LLMs, primarily aimed at enhancing the security and stability of software systems to meet the needs of users and developers. These models provide promising methods for vulnerability detection and repair, while also enabling automated security testing and innovative maintenance processes. The application of LLMs in software security and maintenance encompasses several aspects, including vulnerability detection, automatic repair, penetration testing, and system robustness evaluation. Compared to traditional methods, LLMs leverage natural language processing and generation technologies to understand and generate complex code and security policies, thereby automating detection and repair tasks. For example, LLMs can identify potential vulnerabilities by analyzing code structures and contextual information and can generate corresponding repair suggestions, which improves the efficiency and accuracy of vulnerability remediation.
Moreover, LLMs not only exhibit strong capabilities in vulnerability detection but also play a role in tasks like penetration testing and security evaluation, for example in automated penetration testing tools such as PENTESTGPT [143]. LLMs also demonstrate significant advantages in evaluating system robustness by simulating various attack scenarios to assess system performance under different conditions, helping developers better identify and address potential security issues. Research on LLM-based agents in software security and maintenance also keeps growing; these intelligent agents can execute complex code generation and vulnerability repair tasks and possess self-learning and optimization capabilities to handle issues encountered in dynamic development environments. Tools like RITFIS [144] and NAVRepair [145] have shown potential for improving the precision and efficiency of program repairs by using LLM-based agents.

A. LLMs Tasks

In the field of software security and maintenance, research on LLMs can be categorized into three main areas: vulnerability detection, automatic repair, and penetration testing, along with some evaluation studies. The collected papers on LLMs in this domain illustrate their diverse applications and potential.

1) Program Vulnerability: In the domain of vulnerability detection, researchers have fine-tuned LLMs to enhance the accuracy of source code vulnerability detection. [146] investigates the potential of applying LLMs to the task of vulnerability detection in source code and aims to determine whether the performance limits of CodeBERT-like models are due to their limited capacity and code understanding ability. The study fine-tuned the WizardCoder model (an improved version of StarCoder) and compared its performance with the ContraBERT model on balanced and unbalanced datasets. The experimental results showed that WizardCoder outperformed ContraBERT in both ROC AUC and F1 scores, significantly improving Java function vulnerability detection performance and achieving the state-of-the-art performance at that time by improving ROC AUC from 0.66 (CodeBERT) to 0.69.

Some studies have explored the application of pure LLMs, without any framework architecture, in vulnerability detection, uncovering current challenges. [147] evaluated only the performance of ChatGPT and GPT-3 models in detecting vulnerabilities in Java code; the study compared text-davinci-003 (GPT-3) and gpt-3.5-turbo against a baseline virtual classifier in binary and multi-label classification tasks. The experimental results showed that while text-davinci-003 and gpt-3.5-turbo had high accuracy and recall rates in binary classification tasks, their AUC (Area Under Curve) scores were only 0.51, indicating performance equivalent to random guessing. In multi-label classification tasks, gpt-3.5-turbo and text-davinci-003 did not significantly outperform the baseline virtual classifier in overall accuracy and F1 scores. These findings indicate that earlier models like GPT-3 have limited capabilities in practical vulnerability detection tasks, suggesting the need for further research and model optimization to improve their performance in real-world applications. Fine-tuning and optimizing LLMs can significantly enhance their performance in source code vulnerability detection; however, these models still face many challenges in practical applications, requiring further research and technological improvements to enhance their real-world effectiveness and reliability.
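The AUC and F1 figures quoted above are standard binary-classification metrics. As a point of reference, the snippet below shows how such scores are typically computed with scikit-learn; the labels and model scores are fabricated purely for illustration and do not reproduce any study's data.

    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    # Hypothetical ground truth (1 = vulnerable) and model confidences for 8 functions.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.62, 0.48, 0.55, 0.51, 0.47, 0.52, 0.58, 0.44]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded labels

    print("ROC AUC :", round(roc_auc_score(y_true, y_score), 2))  # near 0.5 means random guessing
    print("F1      :", round(f1_score(y_true, y_pred), 2))
    print("Accuracy:", round(accuracy_score(y_true, y_pred), 2))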
In later work, [148] introduced a method to incorporate complex code structures directly into the model learning process. The GRACE framework combines graph structure information and in-context learning, using Code Property Graphs (CPGs) to represent code structure. By integrating the semantic, syntactic, and lexical similarities of code, GRACE addresses the limitations of text-based LLM analysis and improves the precision and recall of vulnerability detection; the study utilized three vulnerability datasets and showed a 28.65% improvement in F1 score over baseline models. Another important aspect of vulnerability detection is enhancing LLM performance on code security tasks. [149] fine-tuned LLMs for specific tasks and evaluated their performance against existing models such as ContraBERT. The researchers conducted numerous experiments to determine the optimal model architecture, training hyperparameters, and loss functions for vulnerability detection tasks. The study primarily focused on WizardCoder and ContraBERT, validating their performance through comparisons on balanced and unbalanced datasets and developing an efficient batch packing strategy that improved training speed. Results indicated that with appropriate fine-tuning and optimization, LLMs could surpass state-of-the-art models, contributing to more robust and secure software development practices.

Despite the development of numerous models, it is still necessary to investigate their practical effectiveness. [150] explored the effectiveness of code language models (code LMs) in detecting software vulnerabilities and identified significant flaws in existing vulnerability datasets and benchmarks. The researchers developed a new dataset called PRIMEVUL and conducted experiments with it, comparing PRIMEVUL with existing benchmarks such as BigVul to evaluate several code LMs, including state-of-the-art base models like GPT-3.5 and GPT-4, using various training techniques and evaluation metrics. The results revealed that existing benchmarks significantly overestimate the performance of code LMs. For example, a state-of-the-art 7B model scored an F1 of 68.26% on BigVul but only 3.09% on PRIMEVUL, highlighting the gap between current code language models' performance and the actual requirements of vulnerability detection.

2) Automating Program Repair: In the domain of software security and maintenance, LLMs have not only been applied to vulnerability detection but have also been used extensively for automating program repair. One study proposed using Round-Trip Translation (RTT) for automated program repair: researchers translated defective code into another language and then back into the original language to generate potential patches. The study used various language models and benchmarks to evaluate RTT's performance in APR, exploring how RTT performs when using programming languages as an intermediate representation, how it performs when using natural language (English) as an intermediate representation, and what qualitative trends can be observed in the patches generated by RTT. Three measurement standards and eight models were used in the experiments; the results showed that the RTT method achieved significant repair effects on multiple benchmarks, particularly excelling in terms of compilation and feasibility [151]. Similarly, in automated program repair, [145] introduced several innovative methods. For example, NAVRepair specifically targets C/C++ code vulnerabilities by combining node type information and error types; due to its unique pointer operations and memory management issues, C/C++ poses particular complexities. The framework uses Abstract Syntax Trees (ASTs) to extract node type information and combines it with CWE-derived vulnerability templates to generate targeted repair suggestions. The study evaluated NAVRepair on several popular LLMs (ChatGPT, DeepSeek Coder, and Magicoder) to demonstrate its effectiveness in improving code vulnerability repair; the results showed that NAVRepair achieved state-of-the-art performance on the C/C++ program repair task, improving repair accuracy by 26% compared to existing methods.

To address the two main limitations of existing fine-tuning methods for LLM-based program repair, namely the lack of reasoning about the logic behind code changes and the high computational costs of fine-tuning on large datasets, [152] introduced the MOREPAIR framework. This framework improves the performance of LLMs in automated program repair (APR) by simultaneously optimizing syntactic code transformations and the logical reasoning behind code changes. The study used techniques to enhance fine-tuning efficiency, such as QLoRA (Quantized Low-Rank Adaptation) [153] to reduce memory requirements and NEFTune (Noisy Embedding Fine-Tuning) [154] to prevent overfitting during fine-tuning. The experiments evaluated MOREPAIR on four open-source LLMs of different sizes and architectures (CodeLlama-13B, CodeLlama-7B, StarChat-alpha, and Mistral-7B) using two benchmarks, EvalRepair-C++ and EvalRepair-Java. The results indicated that CodeLlama improved by 11% and 8% on the first 10 repair suggestions for EvalRepair-C++ and EvalRepair-Java respectively. Another study introduced the PyDex system, which automatically repairs syntax and semantic errors in introductory Python programming assignments using LLMs; the system combines multimodal prompts and iterative querying to generate repair candidates and uses few-shot learning to improve repair accuracy. PyDex was evaluated on 286 real student programs from an introductory Python programming course and compared against three baselines; the results showed that PyDex significantly improved the repair rate and effectiveness compared to existing baselines [155].
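MOREPAIR's use of QLoRA corresponds to a configuration that is now common in the Hugging Face ecosystem: the base model is loaded in 4-bit precision and only low-rank adapters are trained. The sketch below shows such a setup; the model name, adapter hyperparameters, and target modules are illustrative assumptions rather than the settings reported in [152], and NEFTune-style embedding noise would additionally be enabled in the training loop.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    MODEL = "codellama/CodeLlama-7b-hf"  # illustrative choice of code LLM

    bnb = BitsAndBytesConfig(            # 4-bit quantization: the "Q" in QLoRA
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(                   # low-rank adapters keep the trainable share small
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # typically well under 1% of all weights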
[156] introduced a new system named RING that leverages large language models (LLMCs) to perform multilingual program repair across six programming languages. RING employs a prompt strategy that minimizes customization effort and comprises three stages: fault localization, code transformation, and candidate ranking. The results showed that RING was particularly effective in Python, successfully repairing 94% of errors on the first attempt. The study also introduced a new PowerShell command repair dataset, providing a valuable resource for the research community, and demonstrated that AI-driven automation makes program repair more efficient and scalable. Another study, [157], conducted a comprehensive investigation into function-level automated program repair, introducing a new LLM-based APR technique called SRepair. SRepair utilizes a dual-LLM framework to enhance repair performance, combining a repair suggestion model and a patch generation model: it uses chain-of-thought reasoning to generate natural language repair suggestions from auxiliary repair-related information and then uses these suggestions to generate the repaired function. The results showed that SRepair outperformed existing APR techniques on the Defects4J dataset, repairing 300 single-function errors, an improvement of at least 85% over previous techniques. This study demonstrated the effectiveness of the dual-LLM framework in function-level repair and, for the first time, achieved multi-function error repair, highlighting the significant potential of LLMs in program repair. By extending the scope of APR, SRepair paves the way for applying LLMs in practical software development and evaluation.

3) Penetration Testing: LLMs can also be applied in the field of penetration testing, where they are used to enhance the efficiency and effectiveness of automated penetration testing. Although not as frequently studied as vulnerability detection and automated repair, this review includes two relevant papers. [143] investigates the development and evaluation of an LLM-driven automatic penetration testing tool, PENTESTGPT. The main purpose of this study is to evaluate the performance of LLMs on practical penetration testing tasks and to address the issue of context loss during the penetration testing process. The paper introduces the three self-interaction modules of PENTESTGPT (reasoning, generation, and parsing) and provides empirical research based on benchmarks involving 13 targets and 182 sub-tasks, comparing the penetration testing performance of GPT-3.5, GPT-4, and Bard. The experimental results show that PENTESTGPT's task completion rate is 228.6% higher than GPT-3.5 and 58.6% higher than GPT-4. This study demonstrates the potential of LLMs in automated penetration testing, helping to identify and resolve security vulnerabilities and thereby enhancing the security and robustness of software systems.

A similar research paper explores the application of generative AI in penetration testing. [158] evaluates the effectiveness, challenges, and potential consequences of using generative AI tools (specifically ChatGPT 3.5) in penetration testing through practical application experiments. The research conducts a five-stage penetration test (reconnaissance, scanning, vulnerability assessment, exploitation, and reporting) on a vulnerable machine from VulnHub, integrating Shell GPT (sgpt) with ChatGPT's API to automate guidance throughout the penetration testing process. The experimental results demonstrate that generative AI tools can significantly speed up the penetration testing process and provide accurate and useful commands, enhancing testing efficiency and effectiveness. The study also indicates the need to consider potential risks and unintended consequences, emphasizing the importance of responsible use and human oversight. Assessing the robustness of systems is also a crucial part of development, and LLMs are used to develop and evaluate new testing frameworks that detect and improve the robustness of intelligent software systems. [144] introduces a robust input testing framework named RITFIS, designed to evaluate the robustness of LLM-based intelligent software against natural language inputs. The study adapts 17 existing DNN testing methods to LLM scenarios and empirically validates them on multiple datasets to highlight the current robustness deficiencies and limitations of LLM software. The study indicates that RITFIS effectively assesses the robustness of LLM software and reveals its vulnerabilities in handling complex natural language inputs. This research underscores the importance of robustness testing for LLM-based intelligent software and provides directions for improving testing methods to enhance reliability and security in practical applications.

B. LLM-based Agents Tasks

LLM-based agents are primarily applied in areas such as autonomous decision-making, task-specific optimization, and multi-agent collaboration, and these frameworks showcase strong potential in proactive defense. [159] aims to address the limitations of existing debugging methods that treat the generated program as an indivisible entity. By segmenting the program into basic blocks and verifying the correctness of each block against the task description, the proposed method LDB (Large Language Model Debugger) provides a more detailed and effective debugging tool that closely reflects human debugging practices. The study's experiments covered testing LDB on several benchmarks and comparing it with baseline models without a debugger and with those using traditional debugging methods (self-debugging with explanations and traces). LDB's accuracy increased from a baseline of 73.8% to 82.9% on the HumanEval benchmark, an improvement of 9.1%. In the domain of vulnerability detection, researchers have enhanced detection accuracy by combining Role-Based Access Control (RBAC) practices with deep learning of complex code structures.

[160] addresses the challenge of automatically and appropriately repairing access control (AC) vulnerabilities in smart contracts. The innovation of this paper lies in combining mined RBAC practices with LLMs to create a context-aware repair framework for AC vulnerabilities. The model primarily uses GPT-4, enhanced by a new method called ACFIX, which mines common RBAC practices from existing smart contracts and employs a Multi-Agent Debate (MAD) mechanism to verify the generated patches through debates between generator and verifier agents, ensuring correctness. Experimental results show that ACFIX successfully repaired 94.92% of access control vulnerabilities, significantly outperforming the baseline GPT-4's 52.54%. Another application in smart contracts is [161]; this paper introduces a two-stage adversarial framework, GPTLENS, which improves vulnerability detection accuracy through generation and discrimination phases. GPTLENS achieved a 76.9% success rate in detecting smart contract vulnerabilities, better than the 38.5% success rate of traditional methods.
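GPTLENS's generation and discrimination phases follow a generic adversarial pattern: one prompt role proposes candidate vulnerabilities and another scores and filters them. A minimal sketch of that pattern is shown below; the chat callable, the prompts, and the score threshold are placeholders rather than the actual GPTLENS implementation.

    from typing import Callable, List

    def detect_vulnerabilities(contract_src: str, chat: Callable[[str], str],
                               n_auditors: int = 3, threshold: float = 7.0) -> List[str]:
        # Generation phase: several "auditor" prompts each propose one candidate finding.
        candidates = [
            chat(f"You are smart-contract auditor #{i}. Report the single most likely "
                 f"vulnerability in the following Solidity code, naming the affected "
                 f"function:\n{contract_src}")
            for i in range(n_auditors)
        ]
        # Discrimination phase: a "critic" prompt scores each candidate from 0 to 10.
        kept = []
        for cand in candidates:
            reply = chat("Score this vulnerability report from 0 to 10 for correctness "
                         "and severity. Answer with a number only:\n" + cand)
            try:
                score = float(reply.strip().split()[0])
            except (ValueError, IndexError):
                continue  # discard critic output that cannot be parsed
            if score >= threshold:
                kept.append(cand)
        return kept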
Another study, [109], investigates the use of GPT-4 to automatically exploit disclosed but unpatched vulnerabilities; the experiments showed that the LLM-based agent achieved an 87% success rate in exploiting vulnerabilities when provided with CVE descriptions. Finally, in another LLM-based agent application to penetration testing, [107] employs GPT-3.5 to assist penetration testers by automating high-level task planning and low-level vulnerability discovery, thereby enhancing penetration testing capabilities. The experiments demonstrated successful automation of multiple stages of penetration testing, including high-level strategy formulation and low-level vulnerability discovery, showcasing the effectiveness of LLMs in penetration testing.

In the field of software repair through multi-agent collaboration, [162] proposes a dual-agent framework that enhances the automation and accuracy of repairing declarative specifications through iterative prompt optimization and multi-agent collaboration. The researchers compare the effectiveness of the LLM-based repair pipeline with several state-of-the-art Alloy APR techniques (ARepair, ICEBAR, BeAFix, and ATR); in the results, the framework repaired 231 defects in the Alloy4Fun benchmark, surpassing the 278 defects repaired by traditional tools. [142] developed and evaluated an automated debugging framework named FixAgent, which improves fault localization, repair generation, and error analysis through an LLM-based multi-agent system. Although this research primarily focuses on automated debugging, incorporating elements like fault localization and automated program repair (APR), it intersects with test generation, particularly in the validation phase for testing bug fixes. The study evaluates FixAgent's performance on the Codeflaws, QuixBugs, and ConDefects datasets, comparing it to 16 baseline methods, including state-of-the-art APR tools and LLMs. Experimental results show that FixAgent fixed 78 out of 79 bugs in the QuixBugs dataset, including 9 never-before-fixed bugs; in the Codeflaws dataset, FixAgent fixed 3982 out of 2780 defects, with a correctness rate of 96.5%. The framework includes specialized agents responsible for localization, repair, and analysis tasks and uses the rubber duck debugging principle. FixAgent demonstrates the powerful capabilities of LLMs in automated debugging, improving on existing APR tools and LLMs, and can be considered a state-of-the-art framework for LLM-based agent APR.

[46] introduces an automated program repair agent named RepairAgent; this agent dynamically generates prompts and integrates tools to automatically fix software bugs. The researchers also address the limitations of current LLM-based repair techniques, which typically involve fixed prompts or feedback loops that do not allow the model to gather comprehensive information about the bug or code. RepairAgent is an LLM-based agent designed to alternately collect information about the bug, gather repair ingredients, and validate the repairs, similar to how human developers fix bugs. RepairAgent achieved impressive results, repairing 186 bugs overall in the Defects4J benchmark, with 164 being correctly repaired, outperforming existing repair techniques and achieving state-of-the-art performance.

In the realm of software security, researchers have combined LLMs and security engineering models to improve security analysis and design processes. [163] proposes a complex hybrid strategy to ensure the reliability and security of software systems; this involves a concept-guided approach where LLM-based agents interact with system model diagrams to perform tasks related to safety analysis. [108] introduces the TrustLLM framework, which increases the accuracy and interpretability of smart contract auditing by customizing LLM capabilities to the specific requirements of smart contract code. This paper conducts experiments on a balanced dataset comprising 1,734 positive samples and 1,810 negative samples, comparing TrustLLM with other models such as CodeBERT, GraphCodeBERT, and several versions of GPT and CodeLlama. TrustLLM achieves an F1 score of 91.21% and an accuracy of 91.11%, outperforming the other models. Beyond software-level security design, LLMs can also be integrated into autonomous driving systems, as in [164], which has already been discussed in Section IV.

C. Analysis

Overall, the direction of LLM-based agents represents significant innovative advancement in software security and maintenance, demonstrating improvements across all areas. LLM-based agents use multi-agent collaboration and runtime information tracking to help with debugging tasks, whereas traditional LLM approaches often rely on fixed prompts or feedback loops to debug a given code snippet or program. In vulnerability detection, LLM-based agents combine RBAC practices and in-depth learning of complex code structures to improve the accuracy and efficiency of detecting vulnerabilities, while traditional LLM methods normally depend on extensive manual intervention and detailed guidance when handling tasks. LLM-based agents also demonstrate effectiveness in penetration testing by automating high-level task planning and low-level vulnerability exploration, thereby enhancing penetration testing capabilities. In contrast, traditional LLM methods are more suited to passive detection and analysis, lacking proactive testing and defense capabilities.

From the perspective of automation, LLM-based agents automate the detection and repair of software errors through multi-agent frameworks and dynamic analysis tools, improving the automation and accuracy of the repair process. Traditional LLM methods also perform well on various maintenance and debugging tasks, but they often lack autonomous decision-making and dynamic adjustment capabilities during the repair process. In terms of software security, intelligent agents become more flexible by combining LLMs with security engineering models to improve security analysis and design processes, thereby enhancing the reliability and security of software systems; approaches that handle security tasks with LLMs alone often rely on static analysis and lack adaptability and optimization capabilities. As shown in Figure 8, the comparison uses MOREPAIR [152] for LLMs and RepairAgent [46] for LLM-based agents: the LLM framework applies optimization techniques (QLoRA, NEFTune) to generate repair advice, whereas RepairAgent utilizes multiple tools during inspection, which improves the precision and accuracy of the analysis before the repair is produced.
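RepairAgent-style autonomy essentially lets the model choose its next tool (inspect code, run tests, apply a patch) instead of following a fixed prompt-and-feedback script. The loop below is an illustrative skeleton of that idea; the tool set, the JSON action format, and the chat helper are simplifying assumptions, not RepairAgent's actual interface.

    import json
    from typing import Callable, Dict

    def repair_loop(bug_report: str, tools: Dict[str, Callable[[str], str]],
                    chat: Callable[[str], str], max_steps: int = 10) -> str:
        """Alternate between gathering information and patching until tests pass."""
        history = "Bug report:\n" + bug_report
        for _ in range(max_steps):
            decision = chat(
                history + "\nAvailable tools: " + ", ".join(tools)
                + '\nReply with JSON: {"tool": <name>, "argument": <string>}.'
            )
            try:
                action = json.loads(decision)
                observation = tools[action["tool"]](action["argument"])
            except (json.JSONDecodeError, KeyError, TypeError):
                observation = "Invalid action; reply with the JSON format only."
            history += "\nAction: " + decision + "\nObservation: " + observation
            if "ALL TESTS PASSED" in observation:  # success marker from the test-runner tool
                break
        return history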
Time overhead and query number are used to assess the efficiency and resource consumption of models when performing specific tasks. Additionally, ROC AUC, F1 score, and accuracy are important for evaluating a model's ability to identify vulnerabilities, especially in binary classification tasks. In code repair tasks, metrics such as compilability and plausibility are very common; these metrics ensure that the generated solutions are correct and deployable. Common standards like BLEU and CodeBLEU are used to evaluate the quality and human-likeness of generated code, which helps determine whether the model's capabilities and performance are comparable to human performance. Furthermore, domain-specific metrics like tree edit distance and test pass rate are used to evaluate the effectiveness of LLM applications in specialized fields of software engineering; these metrics address the particular demands of software security and maintenance. In contrast, while LLM-based agents use evaluation metrics similar to those used for LLMs, such as success rate, they also incorporate more subjective metrics, including appropriateness, relevance, and adequacy, which are human-judged standards. Overall, the evaluation metrics used by agents tend to be simpler and easier to understand than those used for LLMs. This is likely because agents handle high-level tasks, such as the success rate of generating potential vulnerabilities and the frequency of agents calling external tools, so they also need to account for the computational and time overheads of the overall architecture.

Comparing these metrics, LLM studies emphasise the success rate of individual testing methods, while LLM-based agent studies focus more on overall task completion time, cost, and effectiveness. LLMs typically use binary classification metrics like ROC AUC and F1 score, whereas agents tend to emphasize success rate and accuracy during both the generation and validation phases, providing a more comprehensive evaluation. For time cost and performance, LLM studies mainly report the execution time of testing methods and the number of queries to assess efficiency; in contrast, LLM-based agent studies focus more on the completion time of repair tasks and the number of API calls, ensuring the efficiency and practicality of the overall architecture.

X. DISCUSSION

A. Experiment Models

In Sections III to VIII, we reviewed the research on LLM and LLM-based agent applications in software engineering in recent years. These studies have different research directions, and we divided them into six subtopics for classification and discussion. With the advancement of large language models, thousands of models have entered the public eye; in order to understand more intuitively the application of large language models in various fields and their use as the core of intelligent agents, we summarized a total of 117 papers, mainly to discuss the frequency of use of LLMs in the field of software engineering.

Based on the review of the 117 papers, our primary focus is on the models or frameworks utilized by the authors in their experiments. This is because these papers often include tests of model performance in specific domains, such as evaluating LLaMA's performance in code generation. Therefore, during our data collection process, we also included models used for comparison purposes, as these models often represent the state-of-the-art capabilities in their respective fields at the time of the study. In summary, across the 117 papers, we identified a total of 79 unique large language models. We visualized the frequency of these model names in a word cloud for a more intuitive representation, as shown in Figure 9. From the figure, we can observe that models such as GPT-3.5, GPT-4, LLaMA2, and Codex are frequently used. Although closed-source LLMs cannot be locally deployed or further trained, their exceptional capabilities make them a popular choice for comparison in experiments or for data augmentation, where GPT-4 is used to generate additional data to support the research model frameworks. For instance, researchers might use OpenAI's API to generate initial text and then employ locally deployed models for further processing and optimization [76] [122] [119] [113].

It is therefore not difficult to see that the use of general-purpose large models with superior performance to assist development, or to serve as a measurement standard, has become increasingly common in the vertical field of software engineering over the past two years. In addition, for fields that had never been touched by LLMs before, many researchers first refer to ChatGPT and conduct various performance experiments on the newer GPT-4 [55] [58] [64]. These models can be integrated into larger systems and combined with other machine learning models and tools; for example, they can generate natural language responses while another model handles intent recognition and dialogue management.

Fig. 9: Experiment Models Usage Word Cloud.

Although the word cloud provides a rough overview of model usage frequency, it lacks detailed information. To gain deeper insights, we combined a grouped bar chart and a stacked bar chart to further analyze the usage of models in studies across different subtopics. The corresponding bar charts are presented in Figure 10. During the analysis, we found that a large number of models appeared only once; including these in the bar chart would have made the overall representation cluttered. Therefore, we excluded models that appeared only once and focused on the versatility of the remaining models. On the left side of each subtopic, we depict the models used in LLM-related studies, with the models used in LLM-based agent-related studies highlighted in red-bordered bars.
From the figure, it is evident that in the Autonomous Learning and Decision Making subtopic, the number of models used in LLM-based agent-related studies is quite high. Specifically, GPT-4 and GPT-3.5 were used in 10 out of 18 papers and 15 out of 18 papers, respectively. In this subtopic, studies commonly utilized GPT-3.5/4 and LLaMA-2 for research and evaluation. During our analysis, we found that many studies on LLM-based agents evaluated the agents' ability to mimic human behavior and decision-making or to perform reasoning tasks [103] [111] [108]. Since these studies do not require local deployment, they mainly assess the performance of state-of-the-art models in specific directions, leading to the frequent use of GPT-family models. Frameworks like [98] [36] constructed LLM-based agents by calling the GPT-4 API, using verbal reinforcement to help language agents learn from their mistakes. Due to the limitations of GPT models, many studies also used LLaMA as the core LLM for agents, fine-tuning it on generated datasets to evaluate the emergence of knowledge and capabilities. Overall, we found that in the Autonomous Learning and Decision Making subtopic, LLM-based agents often use multiple models for testing and performance evaluation within a single task, which results in a significantly higher model usage frequency in this topic compared to others.

Not only in the Autonomous Learning and Decision Making subtopic, but also across other themes, we observe that the variety of models (represented by the number of colors) used by LLM-based agents is relatively limited. For instance, in the requirement engineering and documentation subtopic, only GPT-3.5 and GPT-4 were involved in the experiments.
Fig. 10: Experiment Models Usage in Different Subtopics (REQ denotes "Requirement Engineering and Documentation", CODE denotes "Code Generation and Software Development", AUTO denotes "Autonomous Learning and Decision Making", DES denotes "Software Design and Decision Making", SEC denotes "Software Security and Maintenance").

To analyze the reasons behind this phenomenon, we need to set aside the facts that models appearing only once were not counted and that there are inherently fewer studies on intelligent agents. We believe the pattern primarily reflects the integration relationship between agents and large language models. The combination of these two technologies aims to address the limitations of large language models in specific tasks or aspects; intelligent agents allow researchers to design a more flexible framework and incorporate the large language model into it. These models, having been trained on vast amounts of data, possess strong generalizability, making them suitable for a wide range of tasks and domains and allowing for further reasoning, planning, and task execution.

Therefore, researchers and developers can use the same model to address multiple issues, reducing the need for a variety of models. In code generation [83] [79], test case generation [140] [142], and software security [167] [159], there are instances of using CodeLlama. This model is fine-tuned and optimized on the LLaMA architecture; at its release, it was considered one of the state-of-the-art models for code generation and understanding tasks, showing strong performance and potential compared to other models like Codex. Another potential reason is that previous successful applications and research outcomes have demonstrated these models' effectiveness, further enhancing researchers' trust in and reliance on them. Compared to models that perform well only in specific domains, intelligent agent development shows a preference for general-purpose large models, to ensure that the core of the agent possesses excellent text comprehension abilities. From Figure 10, we can also observe that research in the code generation and software development fields adopts a wide variety of models, further indicating the extensive attention this area receives and the strong performance of models on code generation tasks.

B. Topics Overlapping

Figure 11 shows the distribution of all collected literature across the six themes. For LLM-type literature, the theme of software security and maintenance accounts for nearly 30%, whereas test case generation accounts for less than 10%. This trend is similarly reflected in the LLM-based agent literature. Research on using LLM-based agents to address requirements engineering and test case generation is relatively sparse: requirements engineering is a new endeavor for LLM-based agents, and using an entire agent framework to generate test cases might be considered excessive. Therefore, more research tends to evaluate and explore the changes LLMs bring within the agent framework, such as autonomous decision-making abilities and capabilities in software maintenance and repair.
TABLE XI: Overlap of Papers Among Different Topics (REQ denotes "Requirement Engineering and Documentation", CODE denotes "Code Generation and Software Development", AUTO denotes "Autonomous Learning and Decision Making", DES denotes "Software Design and Decision Making", SEC denotes "Software Security and Maintenance").

Table XI presents the number of papers spanning multiple themes. For instance, five papers can be classified under both software security and maintenance and autonomous learning and decision making. These two themes also overlap the most with other themes, indicating that LLM and LLM-based agent research is broad and that these tasks often require integrating knowledge and techniques from various fields such as code generation, design, and testing. The significant overlap reflects the close interrelation between these themes and other areas. For example, autonomous learning and decision-making often involve the model's ability to autonomously learn and optimize decision trees, techniques that are applied in many specific software engineering tasks. Similarly, software security and maintenance typically require a combination of multiple techniques to enhance security, such as automatic code generation tools and automated testing frameworks [71] [80] [83] [102]. The overlap in the literature highlights the increasing need to integrate methods and techniques from different research areas within software engineering. For instance, ensuring software security relies not only on security measures but also on leveraging code generation, automated testing, and design optimization technologies. Similarly, autonomous learning and decision-making require a comprehensive consideration of requirements engineering, code generation, and system design. Moreover, it suggests that certain technologies and methods possess strong commonality: LLM-based agents, for instance, enhance capabilities in code generation, test automation, and security analysis through autonomous learning and decision-making. This sharing of technologies promotes knowledge exchange and technological dissemination across the various fields within software engineering.

C. Benchmarks and Metrics

Figure 12 shows the distribution of common benchmarks across the six topics. In reality, the number of benchmark datasets used is far greater than what is shown in the figure; different software engineering tasks use various benchmark datasets for evaluation and testing. For instance, in requirements engineering, researchers often collect user stories or requirement specifications as datasets [55] [63], and these are not well-known public datasets, so they were not included in the statistics. Alternatively, some studies specify their datasets as "customized GitHub datasets" [168]. Therefore, the benchmark datasets shown in the figure represent commonly used public datasets. For example, MBPP and HumanEval, which have been introduced in previous sections, are frequently used. We can also observe that, apart from the common public datasets, the datasets used in LLM and LLM-based agent tasks are different.
For instance, the FEVER dataset (https://fever.ai/dataset/fever.html) is often used in agent-related research. In [35], the FEVER dataset is used to test the ExpeL agent's performance on fact verification tasks. Similarly, the HotpotQA dataset (https://hotpotqa.github.io/) is frequently used in agent-related research for knowledge-intensive reasoning and question-answering tasks. When handling vulnerability repair tasks, LLMs often use the Defects4J benchmark (https://github.com/rjust/defects4j). This dataset contains 835 real-world defects from multiple open-source Java projects, categorized into buggy and repaired versions, and is typically used to evaluate the effectiveness of automated program repair techniques. Despite its extensive use in LLM research, Defects4J is relatively less used in LLM-based agents research. We speculate that this may be because Defects4J primarily evaluates single code repair tasks, which do not fully align with the multi-task and real-time requirements of LLM-based agents. Additionally, new datasets like ConDefects have been introduced [142], focusing on addressing data leakage issues and providing more comprehensive defect localization and repair evaluations.

Figure 13 shows the top ten evaluation metrics for LLMs and LLM-based agents. The analysis reveals that the evaluation methods used by both are almost identical. In previous sections, we also discussed that for agents it is necessary to consider time and computational resource consumption, which is evident from the pie chart. Meanwhile, many studies focus on the code generation capabilities of LLMs, so more evaluation metrics pertain to the correctness and exact match of the generated code [73] [69] [30]; overall, however, the evaluation metrics for LLMs and LLM-based agents in software engineering applications are quite similar.

XI. CONCLUSION

In this paper, we conducted a comprehensive literature review on the application of LLMs and LLM-based agents in software engineering. We categorized software engineering into six topics: requirement engineering and documentation, code generation and software development, autonomous learning and decision making, software design and evaluation, software test generation, and software security and maintenance. For each topic, we analyzed the tasks, benchmarks, and evaluation metrics, distinguishing between LLMs and LLM-based agents and discussing the differences and impacts they bring. We further analyzed and discussed the models used in the experiments of the 117 collected papers. Additionally, we provided statistics on and distinctions between LLMs and LLM-based agents regarding datasets and evaluation metrics. The analysis revealed that the emergence of LLM-based agents has led to extensive research and applications across various software engineering topics, demonstrating different emphases compared to traditional LLMs in terms of tasks, benchmarks, and evaluation metrics.

REFERENCES

[1] S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan, "Bugram: bug detection with n-gram language models," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 724-735, 2016.
[2] A. Vogelsang and M. Borg, "Requirements engineering for machine learning: Perspectives from data scientists," in 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), (Jeju, Korea (South)), pp. 245-251, 2019.
[3] "ChatGPT: Optimizing language models for dialogue," 11 2022. [Online; accessed 17-July-2024].
[4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[5] N. Jain, S. Vaidyanath, A. Iyer, N. Natarajan, S. Parthasarathy, S. Rajamani, and R. Sharma, "Jigsaw: large language models meet program synthesis," in Proceedings of the 44th International Conference on Software Engineering, ICSE '22, (New York, NY, USA), pp. 1219-1231, Association for Computing Machinery, 2022.
[6] T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen, "Long-context LLMs struggle with long in-context learning," 2024.
[7] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, "Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond," ACM Trans. Knowl. Discov. Data, vol. 18, Apr. 2024.
[8] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, "Large language models for software engineering: Survey and open problems," in 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pp. 31-53, 2023.
[9] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J. Wen, "A survey on large language model based autonomous agents," Frontiers of Computer Science, vol. 18, no. 6, pp. 186345-, 2024.
[10] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui, "The rise and potential of large language model based agents: A survey," 2023.
[11] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 9459-9474, Curran Associates, Inc., 2020.
[12] GitHub, Inc., "GitHub Copilot: Your AI pair programmer." https://github.com/features/copilot, 2024. [Online; accessed 17-July-2024].
[13] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[14] N. R. Jennings, "A survey of agent-oriented software engineering," Knowledge Engineering Review, vol. 15, no. 4, pp. 215-249, 2000.
[15] Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian, "Experience grounds language," 2020.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824-24837, 2022.
[17] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," 2024.
[18] Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, and W. Wang, "Towards an understanding of large language models in software engineering tasks," 2023.
[19] A. Nguyen-Duc, B. Cabrero-Daniel, A. Przybylek, C. Arora, D. Khanna, T. Herda, U. Rafiq, J. Melegati, E. Guerra, K.-K. Kemell, M. Saari, Z. Zhang, H. Le, T. Quan, and P. Abrahamsson, "Generative artificial intelligence for software engineering - a research agenda," 2023.
[20] W. Ma, S. Liu, Z. Lin, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, L. Li, and Y. Liu, "Lms: Understanding code syntax and semantics for code analysis," 2024.
[21] Z. Yang, Z. Sun, T. Z. Yue, P. Devanbu, and D. Lo, "Robustness, security, privacy, explainability, efficiency, and usability of large language models for code," 2024.
[22] Y. Huang, Y. Chen, X. Chen, J. Chen, R. Peng, Z. Tang, J. Huang, F. Xu, and Z. Zheng, "Generative software engineering," 2024.
[23] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[27] L. Floridi and M. Chiriatti, "Gpt-3: Its nature, scope, limits, and consequences," Minds and Machines, vol. 30, pp. 681–694, 2020.
[28] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., "Palm: Scaling language modeling with pathways," Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[29] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, "Opt: Open pre-trained transformer language models," 2022.
[30] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, "Codet5+: Open code large language models for code understanding and generation," arXiv preprint arXiv:2305.07922, 2023.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[32] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[33] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[34] J. X. Chen, "The evolution of computing: Alphago," Computing in Science & Engineering, vol. 18, no. 4, pp. 4–7, 2016.
[35] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, "Expel: Llm agents are experiential learners," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19632–19642, 2024.
[36] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "React: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[37] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents," in International Conference on Machine Learning, pp. 9118–9147, PMLR, 2022.
[38] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," arXiv preprint arXiv:2305.16291, 2023.
[39] C. Whitehouse, M. Choudhury, and A. F. Aji, "Llm-powered data augmentation for enhanced cross-lingual performance," 2023.
[40] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A prompt pattern catalog to enhance prompt engineering with chatgpt," arXiv preprint arXiv:2302.11382, 2023.
[41] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.
[42] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[43] K. An, F. Yang, L. Li, Z. Ren, H. Huang, L. Wang, P. Zhao, Y. Kang, H. Ding, Q. Lin, et al., "Nissist: An incident mitigation copilot based on troubleshooting guides," arXiv preprint arXiv:2402.17531, 2024.
[44] J. Li, Q. Zhang, Y. Yu, Q. Fu, and D. Ye, "More agents is all you need," 2024.
[45] Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto, "Alpacafarm: A simulation framework for methods that learn from human feedback," Advances in Neural Information Processing Systems, vol. 36, 2024.
[46] I. Bouzenia, P. Devanbu, and M. Pradel, "Repairagent: An autonomous, llm-based agent for program repair," arXiv preprint arXiv:2403.17134, 2024.
[47] E. Musumeci, M. Brienza, V. Suriani, D. Nardi, and D. D. Bloisi, "Llm based multi-agent generation of semi-structured documents from semantic templates in the public administration domain," in International Conference on Human-Computer Interaction, pp. 98–117, Springer, 2024.
[48] X. Luo, Y. Xue, Z. Xing, and J. Sun, "Prcbert: Prompt learning for requirement classification using bert-based pretrained language models," in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–13, 2022.
[49] T. Hey, J. Keim, A. Koziolek, and W. F. Tichy, "Norbert: Transfer learning for requirements classification," in 2020 IEEE 28th International Requirements Engineering Conference (RE), pp. 169–179, 2020.
[50] J. Zhang, Y. Chen, N. Niu, and C. Liu, "Evaluation of chatgpt on requirements information retrieval under zero-shot setting," Available at SSRN 4450322, 2023.
[51] C. Arora, J. Grundy, and M. Abdelrazek, "Advancing requirements engineering through generative ai: Assessing the role of llms," in Generative AI for Effective Software Development, pp. 129–148, Springer, 2024.
[52] M. Krishna, B. Gaur, A. Verma, and P. Jalote, "Using llms in software requirements specifications: An empirical evaluation," 2024.
[53] L. Ma, S. Liu, Y. Li, X. Xie, and L. Bu, "Specgen: Automated generation of formal program specifications via large language models," 2024.
[54] C. Flanagan and K. R. M. Leino, "Houdini, an annotation assistant for esc/java," in FME 2001: Formal Methods for Increasing Software Productivity (J. N. Oliveira and P. Zave, eds.), (Berlin, Heidelberg), pp. 500–517, Springer Berlin Heidelberg, 2001.
[55] J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design, pp. 71–108. Cham: Springer Nature Switzerland, 2024.
[56] D. Luitel, S. Hassani, and M. Sabetzadeh, "Improving requirements completeness: Automated assistance through large language models," Requirements Engineering, vol. 29, no. 1, pp. 73–95, 2024.
[57] A. Moharil and A. Sharma, "Identification of intra-domain ambiguity using transformer-based machine learning," in Proceedings of the 1st International Workshop on Natural Language-Based Software Engineering, NLBSE '22, (New York, NY, USA), pp. 51–58, Association for Computing Machinery, 2023.
[58] K. Ronanki, B. Cabrero-Daniel, and C. Berger, "Chatgpt as a tool for user story quality evaluation: Trustworthy out of the box?," in Agile Processes in Software Engineering and Extreme Programming – Workshops (P. Kruchten and P. Gregory, eds.), (Cham), pp. 173–181, Springer Nature Switzerland, 2024.
[59] A. Poudel, J. Lin, and J. Cleland-Huang, "Leveraging transformer-based language models to automate requirements satisfaction assessment," 2023.
[60] E. Musumeci, M. Brienza, V. Suriani, D. Nardi, and D. D. Bloisi, "Llm based multi-agent generation of semi-structured documents from semantic templates in the public administration domain," in Artificial Intelligence in HCI (H. Degen and S. Ntoa, eds.), (Cham), pp. 98–117, Springer Nature Switzerland, 2024.
[61] S. Zhang, J. Wang, G. Dong, J. Sun, Y. Zhang, and G. Pu, "Experimenting a new programming practice with llms," 2024.
[62] A. Nouri, B. Cabrero-Daniel, F. Törner, H. Sivencrona, and C. Berger, "Engineering safety requirements for autonomous driving with large language models," 2024.
[63] Z. Zhang, M. Rayhan, T. Herda, M. Goisauf, and P. Abrahamsson, "Llm-based agents for automating the enhancement of user story quality: An early report," in Agile Processes in Software Engineering and Extreme Programming (D. Šmite, E. Guerra, X. Wang, M. Marchesi, and P. Gregory, eds.), (Cham), pp. 117–126, Springer Nature Switzerland, 2024.
[64] K. Ronanki, C. Berger, and J. Horkoff, "Investigating chatgpt's potential to assist in requirements elicitation processes," in 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 354–361, 2023.
[65] D. Xie, B. Yoo, N. Jiang, M. Kim, L. Tan, X. Zhang, and J. S. Lee, "Impact of large language models on generating software specifications," 2023.
[66] A. Moharil and A. Sharma, "Tabasco: A transformer based contextualization toolkit," Science of Computer Programming, vol. 230, p. 102994, 2023.
[67] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," 2021.
[68] A. Ni, P. Yin, Y. Zhao, M. Riddell, T. Feng, R. Shen, S. Yin, Y. Liu, S. Yavuz, C. Xiong, S. Joty, Y. Zhou, D. Radev, and A. Cohan, "L2ceval: Evaluating language-to-code generation capabilities of large language models," 2023.
[69] R. Sun, S. Ö. Arik, A. Muzio, L. Miculicich, S. Gundabathula, P. Yin, H. Dai, H. Nakhost, R. Sinha, Z. Wang, et al., "Sql-palm: Improved large language model adaptation for text-to-sql (extended)," arXiv preprint arXiv:2306.00739, 2023.
[70] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang, "Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x," 2024.
[71] X. Hu, K. Kuang, J. Sun, H. Yang, and F. Wu, "Leveraging print debugging to improve code generation in large language models," 2024.
[72] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, "The impact of ai on developer productivity: Evidence from github copilot," 2023.
[73] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, "Incoder: A generative model for code infilling and synthesis," 2023.
[74] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," 2023.
[75] Y. Ding, M. J. Min, G. Kaiser, and B. Ray, "Cycle: Learning to self-refine the code generation," Proc. ACM Program. Lang., vol. 8, Apr. 2024.
[76] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration code generation via chatgpt," 2024.
[77] F. Lin, D. J. Kim, et al., "When llm-based code generation meets the software development process," arXiv preprint arXiv:2403.15852, 2024.
[78] S. Holt, M. R. Luyten, and M. van der Schaar, "L2MAC: Large language model automatic computer for extensive code generation," in The Twelfth International Conference on Learning Representations, 2024.
[79] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, "Metagpt: Meta programming for a multi-agent collaborative framework," 2023.
[80] Z. Rasheed, M. Waseem, K.-K. Kemell, W. Xiaofeng, A. N. Duc, K. Systä, and P. Abrahamsson, "Autonomous agents in software development: A vision paper," arXiv preprint arXiv:2311.18440, 2023.
[81] Z. Rasheed, M. Waseem, M. Saari, K. Systä, and P. Abrahamsson, "Codepori: Large scale model for autonomous software development by using multi-agents," arXiv preprint arXiv:2402.01411, 2024.
[82] D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, "Agentcoder: Multi-agent-based code generation with iterative testing and optimisation," arXiv preprint arXiv:2312.13010, 2023.
[83] T. Zheng, G. Zhang, T. Shen, X. Liu, B. Y. Lin, J. Fu, W. Chen, and X. Yue, "Opencodeinterpreter: Integrating code generation with execution and refinement," arXiv preprint arXiv:2402.14658, 2024.
[84] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, 2024.
[85] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al., "Toolllm: Facilitating large language models to master 16000+ real-world apis," arXiv preprint arXiv:2307.16789, 2023.
[86] X. Jiang, Y. Dong, L. Wang, F. Zheng, Q. Shang, G. Li, Z. Jin, and W. Jiao, "Self-planning code generation with large language models," ACM Trans. Softw. Eng. Methodol., Jun. 2024. Just Accepted.
[87] S. Zhang, J. Wang, G. Dong, J. Sun, Y. Zhang, and G. Pu, "Experimenting a new programming practice with llms," arXiv preprint arXiv:2401.01062, 2024.
[88] V. Murali, C. Maddila, I. Ahmad, M. Bolin, D. Cheng, N. Ghorbani, R. Fernandez, and N. Nagappan, "Codecompose: A large-scale industrial deployment of ai-assisted code authoring," arXiv preprint arXiv:2305.12050, 2023.
[89] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, "Large language models can self-improve," arXiv preprint arXiv:2210.11610, 2022.
[90] L. Chen, J. Q. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou, "Are more llm calls all you need? towards scaling laws of compound inference systems," arXiv preprint arXiv:2403.02419, 2024.
[91] X. Chen, M. Lin, N. Schärli, and D. Zhou, "Teaching large language models to self-debug," arXiv preprint arXiv:2304.05128, 2023.
[92] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, "Explainable automated debugging via large language model-driven scientific debugging," arXiv preprint arXiv:2304.02195, 2023.
[93] G. Franceschelli and M. Musolesi, "On the creativity of large language models," arXiv preprint arXiv:2304.00008, 2023.
[94] J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu, "Large language models in law: A survey," 2023.
[95] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., "Judging llm-as-a-judge with mt-bench and chatbot arena," Advances in Neural Information Processing Systems, vol. 36, 2024.
[96] Q. Wang, Z. Wang, Y. Su, H. Tong, and Y. Song, "Rethinking the bounds of llm reasoning: Are multi-agent discussions the key?," arXiv preprint arXiv:2402.18272, 2024.
[97] L. Chen, Y. Zhang, S. Ren, H. Zhao, Z. Cai, Y. Wang, P. Wang, T. Liu, and B. Chang, "Towards end-to-end embodied decision making via multi-modal large language model: Explorations with gpt4-vision and beyond," arXiv preprint arXiv:2310.02071, 2023.
[98] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[99] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian, C.-M. Chan, Y. Qin, Y. Lu, R. Xie, et al., "Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents," arXiv preprint arXiv:2308.10848, 2023.
[100] G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, "Camel: Communicative agents for "mind" exploration of large language model society," Advances in Neural Information Processing Systems, vol. 36, pp. 51991–52008, 2023.
[101] Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke, R. Murthy, Y. Feng, Z. Chen, J. C. Niebles, D. Arpit, et al., "Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents," arXiv preprint arXiv:2308.05960, 2023.
[102] J. Lu, W. Zhong, W. Huang, Y. Wang, Q. Zhu, F. Mi, B. Wang, W. Wang, X. Zeng, L. Shang, X. Jiang, and Q. Liu, "Self: Self-evolution with language feedback," 2024.
[103] C. Xie, C. Chen, F. Jia, Z. Ye, K. Shu, A. Bibi, Z. Hu, P. Torr, B. Ghanem, and G. Li, "Can large language model agents simulate human trust behaviors?," arXiv preprint arXiv:2402.04559, 2024.
[104] Z. Liu, W. Yao, J. Zhang, L. Yang, Z. Liu, J. Tan, P. K. Choubey, T. Lan, J. Wu, H. Wang, et al., "Agentlite: A lightweight library for building and advancing task-oriented llm agent system," arXiv preprint arXiv:2402.15538, 2024.
[105] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber, "Language agents as optimizable graphs," arXiv preprint arXiv:2402.16823, 2024.
[106] R. Feldt, S. Kang, J. Yoon, and S. Yoo, "Towards autonomous testing agents via conversational large language models," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1688–1693, IEEE, 2023.
[107] A. Happe and J. Cito, "Getting pwn'd by ai: Penetration testing with large language models," in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 2082–2086, 2023.
[108] W. Ma, D. Wu, Y. Sun, T. Wang, S. Liu, J. Zhang, Y. Xue, and Y. Liu, "Combining fine-tuning and llm-based agents for intuitive smart contract auditing with justifications," arXiv preprint arXiv:2403.16073, 2024.
[109] R. Fang, R. Bindu, A. Gupta, and D. Kang, "Llm agents can autonomously exploit one-day vulnerabilities," arXiv preprint arXiv:2404.08144, 2024.
[110] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[111] Z. Rasheed, M. Waseem, A. Ahmad, K.-K. Kemell, W. Xiaofeng, A. N. Duc, and P. Abrahamsson, "Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis," arXiv preprint arXiv:2402.01386, 2024.
[112] M. Ataei, H. Cheong, D. Grandi, Y. Wang, N. Morris, and A. Tessier, "Elicitron: An llm agent-based simulation framework for design requirements elicitation," arXiv preprint arXiv:2404.16045, 2024.
[113] G. Sridhara, S. Mazumdar, et al., "Chatgpt: A study on its utility for ubiquitous software engineering tasks," arXiv preprint arXiv:2305.16837, 2023.
[114] M. Desmond, Z. Ashktorab, Q. Pan, C. Dugan, and J. M. Johnson, "Evalullm: Llm assisted evaluation of generative outputs," in Companion Proceedings of the 29th International Conference on Intelligent User Interfaces, pp. 30–32, 2024.
[115] M. Gao, X. Hu, J. Ruan, X. Pu, and X. Wan, "Llm-based nlg evaluation: Current status and challenges," arXiv preprint arXiv:2402.01383, 2024.
[116] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang, and D. Chen, "Software/hardware co-design for llm and its application for design verification," in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 435–441, IEEE, 2024.
[117] K. Kolthoff, C. Bartelt, and S. P. Ponzetto, "Data-driven prototyping via natural-language-based gui retrieval," Automated Software Engineering, vol. 30, no. 1, p. 13, 2023.
[118] V. D. Kirova, C. S. Ku, J. R. Laracy, and T. J. Marlowe, "Software engineering education must adapt and evolve for an llm environment," in Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pp. 666–672, 2024.
[119] S. Jalil, S. Rafi, T. D. LaToza, K. Moran, and W. Lam, "Chatgpt and software testing education: promises & perils (2023)," arXiv preprint arXiv:2302.03287, 2023.
[120] S. Suri, S. N. Das, K. Singi, K. Dey, V. S. Sharma, and V. Kaulgud, "Software engineering using autonomous agents: Are we there yet?," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1855–1857, IEEE, 2023.
[121] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun, "Communicative agents for software development," arXiv preprint arXiv:2307.07924, 2023.
[122] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, "Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face," Advances in Neural Information Processing Systems, vol. 36, 2024.
[123] J. Chen, X. Hu, S. Liu, S. Huang, W.-W. Tu, Z. He, and L. Wen, "Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments," arXiv preprint arXiv:2402.16499, 2024.
[124] M. Josifoski, L. Klein, M. Peyrard, Y. Li, S. Geng, J. P. Schnitzler, Y. Yao, J. Wei, D. Paul, and R. West, "Flows: Building blocks of reasoning and collaborating ai," arXiv preprint arXiv:2308.01285, 2023.
[125] I. Weber, "Large language models as software components: A taxonomy for llm-integrated applications," arXiv preprint arXiv:2406.10300, 2024.
[126] F. Vallecillos Ruiz, "Agent-driven automatic software improvement," in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pp. 470–475, 2024.
[127] Z. Cheng, J. Kasai, and T. Yu, "Batch prompting: Efficient inference with large language model apis," arXiv preprint arXiv:2301.08721, 2023.
[128] S. Shankar, J. Zamfirescu-Pereira, B. Hartmann, A. G. Parameswaran, and I. Arawjo, "Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences," arXiv preprint arXiv:2404.12272, 2024.
[129] D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, "Exploring llm-based agents for root cause analysis," in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 208–219, 2024.
[130] Y. Li, Y. Zhang, and L. Sun, "Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents," arXiv preprint arXiv:2310.06500, 2023.
[131] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, "Unit test case generation with transformers and focal context," arXiv preprint arXiv:2009.05617, 2020.
[132] Y. Zhang, W. Song, Z. Ji, N. Meng, et al., "How well does llm generate security tests?," arXiv preprint arXiv:2310.00710, 2023.
[133] H. J. Kang, T. G. Nguyen, B. Le, C. S. Păsăreanu, and D. Lo, "Test mimicry to assess the exploitability of library vulnerabilities," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 276–288, 2022.
[134] S. Feng and C. Chen, "Prompting is all you need: Automated android bug replay with large language models," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–13, 2024.
[135] S. Kang, J. Yoon, and S. Yoo, "Large language models are few-shot testers: Exploring llm-based general bug reproduction," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2312–2323, IEEE, 2023.
[136] C. S. Xia, M. Paltenghi, J. Le Tian, M. Pradel, and L. Zhang, "Fuzz4all: Universal fuzzing with large language models," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
[137] G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, "Code-aware prompting: A study of coverage guided test generation in regression setting using llm," arXiv preprint arXiv:2402.00097, 2024.
[138] J. A. Pizzorno and E. D. Berger, "Coverup: Coverage-guided llm-based test generation," arXiv preprint arXiv:2403.16218, 2024.
[139] K. Liu, Y. Liu, Z. Chen, J. M. Zhang, Y. Han, Y. Ma, G. Li, and G. Huang, "Llm-powered test case generation for detecting tricky bugs," arXiv preprint arXiv:2404.10304, 2024.
[140] K. Li and Y. Yuan, "Large language models as test case generators: Performance evaluation and enhancement," arXiv preprint arXiv:2404.13340, 2024.
[141] Z. Wang, W. Wang, Z. Li, L. Wang, C. Yi, X. Xu, L. Cao, H. Su, S. Chen, and J. Zhou, "Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model," arXiv preprint arXiv:2401.02705, 2024.
[142] C. Lee, C. S. Xia, J.-t. Huang, Z. Zhu, L. Zhang, and M. R. Lyu, "A unified debugging approach via llm-based multi-agent synergy," arXiv preprint arXiv:2404.17153, 2024.
[143] G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, and S. Rass, "Pentestgpt: An llm-empowered automatic penetration testing tool," arXiv preprint arXiv:2308.06782, 2023.
[144] M. Xiao, Y. Xiao, H. Dong, S. Ji, and P. Zhang, "Rits: Robust input testing framework for llms-based intelligent software," arXiv preprint arXiv:2402.13518, 2024.
[145] R. Wang, Z. Li, C. Wang, Y. Xiao, and C. Gao, "Navrepair: Node-type aware c/c++ code vulnerability repair," arXiv preprint arXiv:2405.04994, 2024.
[146] A. Shestov, A. Cheshkov, R. Levichev, R. Mussabayev, P. Zadorozhny, E. Maslov, C. Vadim, and E. Bulychev, "Finetuning large language models for vulnerability detection," arXiv preprint arXiv:2401.17010, 2024.
[147] A. Cheshkov, P. Zadorozhny, and R. Levichev, "Evaluation of chatgpt model for vulnerability detection," arXiv preprint arXiv:2304.07232, 2023.
[148] G. Lu, X. Ju, X. Chen, W. Pei, and Z. Cai, "Grace: Empowering llm-based software vulnerability detection with graph structure and in-context learning," Journal of Systems and Software, vol. 212, p. 112031, 2024.
[149] H. Li, Y. Hao, Y. Zhai, and Z. Qian, "The hitchhiker's guide to program analysis: A journey with large language models," arXiv preprint arXiv:2308.00245, 2023.
[150] Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, and Y. Chen, "Vulnerability detection with code language models: How far are we?," arXiv preprint arXiv:2403.18624, 2024.
[151] F. V. Ruiz, A. Grishina, M. Hort, and L. Moonen, "A novel approach for automatic program repair using round-trip translation with large language models," arXiv preprint arXiv:2401.07994, 2024.
[152] B. Yang, H. Tian, J. Ren, H. Zhang, J. Klein, T. F. Bissyandé, C. L. Goues, and S. Jin, "Multi-objective fine-tuning for enhanced program repair with llms," arXiv preprint arXiv:2404.12636, 2024.
[153] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "Qlora: Efficient finetuning of quantized llms," 2023.
[154] N. Jain, P. yeh Chiang, Y. Wen, J. Kirchenbauer, H.-M. Chu, G. Somepalli, B. R. Bartoldson, B. Kailkhura, A. Schwarzschild, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein, "Neftune: Noisy embeddings improve instruction finetuning," 2023.
[155] J. Zhang, J. P. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares, and G. Verbruggen, "Pydex: Repairing bugs in introductory python assignments using llms," 2024.
APPENDIX
A. Benchmarks