From LLMs to LLM-based Agents
Abstract—With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation and vulnerability detection. However, they also exhibit numerous limitations and shortcomings. LLM-based agents, a novel technology with the potential for Artificial General Intelligence (AGI), combine LLMs as the core for decision-making and action-taking, addressing some of the inherent limitations of LLMs such as lack of autonomy and self-improvement. Despite numerous studies and surveys exploring the possibility of using LLMs in software engineering, there is still no clear distinction between LLMs and LLM-based agents, and it is still at an early stage for a unified standard and benchmark to qualify an LLM solution as an LLM-based agent in its domain. In this survey, we broadly investigate the current practice and solutions for LLMs and LLM-based agents for software engineering. In particular, we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance. We review and differentiate the work of LLMs and LLM-based agents across these six topics, examining their differences and similarities in tasks, benchmarks, and evaluation metrics. Finally, we discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering. We anticipate this work will shed some light on pushing the boundaries of LLM-based agents in software engineering for future research.

Index Terms—Large Language Models, LLM-based Agents, Software Engineering, Benchmark, Software Security, AI System Development

Haolin Jin, Linghan Huang and Huaming Chen are with the School of Electrical and Computer Engineering, The University of Sydney, Sydney, 2006, Australia (email: [email protected]). Haipeng Cai is with the School of Electrical Engineering and Computer Science at Washington State University, US. Jun Yan is with the School of Computing and Information Technology at University of Wollongong, Australia. Bo Li is with the Computer Science Department at the University of Chicago, US.

I. INTRODUCTION

SOFTWARE engineering (SE) has seen booming research and development with the aid of artificial intelligence techniques. Traditional approaches leveraging neural networks and machine learning have facilitated various SE topics such as bug detection, code synthesis, and requirements analysis [1] [2]. However, they often present limitations, including the need for extensive feature engineering, scalability issues, and limited adaptability across diverse codebases. The rise of Large Language Models (LLMs) has brought new solutions and findings to this landscape. LLMs, such as GPT [3] and Codex [4], have demonstrated remarkable capabilities in handling downstream tasks in SE, including code generation, debugging, and documentation. These models leverage vast amounts of training data to generate human-like text, offering unprecedented levels of fluency and coherence. Studies have shown that LLMs can enhance productivity in software projects by providing intelligent code suggestions, automating repetitive tasks, and even generating entire code snippets from natural language descriptions [5].

Despite their potential, there are significant challenges in applying LLMs to SE. One major issue is their limited context length [6], which restricts a model's ability to comprehend and manage extensive codebases, making it challenging to maintain coherence over prolonged interactions. Hallucination is another main concern, where the model generates code that appears plausible but is actually incorrect or nonsensical [7], potentially introducing bugs or vulnerabilities if not carefully reviewed by experienced developers. Additionally, the inability of LLMs to use external tools restricts their access to real-time data and prevents them from performing tasks outside their training scope, diminishing their effectiveness in dynamic environments. These limitations significantly impact the application of LLMs in SE and also highlight the need for expert developers to critically refine and validate LLM-generated code for accuracy and security [8]. In complex projects, the static nature of LLMs can hinder their ability to adapt to changing requirements or efficiently incorporate new information. Moreover, LLMs typically cannot interact with external tools or databases, which further limits their utility in dynamic and evolving SE contexts.

To address these challenges, LLM-based agents have emerged [9] [10], combining the strengths of LLMs with external tools and resources to enable more dynamic and autonomous operations. These agents leverage recent advancements in AI, such as Retrieval-Augmented Generation (RAG) and tool utilization, to perform more complex and contextually aware tasks [11]. For instance, OpenAI's Codex has been integrated into GitHub Copilot [12], enabling real-time code suggestions and completion within development environments. Unlike static LLMs, LLM-based agents can perform a wide range of tasks, such as autonomously debugging code by identifying and fixing errors, proactively refactoring code to enhance efficiency or readability, and generating adaptive test cases that evolve alongside the codebase. These features make LLM-based agents a powerful tool for SE, capable of handling more complex and dynamic workflows than traditional LLMs.
Fig. 2: Paper Distribution
Historically, AI agents focused on autonomous actions based on predefined rules or learning from interactions [13] [14]. The integration of LLMs has presented new opportunities in this area, providing the language understanding and generative capabilities needed for more sophisticated agent behaviors. [10] shows that LLM-based agents are capable of autonomous reasoning and decision-making, achieving the third and fourth levels of WS (World Scope) [15], which outlines the progression from natural language processing (NLP) to general AI. In software engineering, LLM-based agents show promise in areas such as autonomous debugging, code refactoring, and adaptive test generation, demonstrating capabilities that approach artificial general intelligence (AGI).

In this work, we present, to the best of our knowledge, the first survey outlining the integration and transformation of LLMs to LLM-based agents in the domain of SE. Our survey covers six key themes in SE:

1) Requirement Engineering and Documentation: Capturing, analyzing, and documenting software requirements, as well as generating user manuals and technical documentation.
2) Code Generation and Software Development: Automating code generation, assisting in the development lifecycle, refactoring code, and providing intelligent code recommendations.
3) Autonomous Learning and Decision Making: Highlighting the capabilities of LLM-based agents in autonomous learning, decision-making, and adaptive planning within SE contexts.
4) Software Design and Evaluation: Contributing to design processes, architecture validation, performance evaluation, and code quality assessment.
5) Software Test Generation: Generating, optimizing, and maintaining software tests, including unit tests, integration tests, and system tests.
6) Software Security & Maintenance: Enhancing security protocols, facilitating maintenance tasks, and aiding in vulnerability detection and patching.

In detail, we aim to address the following research questions:

• RQ1: What are the state-of-the-art techniques and practices in LLMs and LLM-based agents for SE? (Sections IV-IX)
• RQ2: What are the key differences in task performance between LLMs and LLM-based agents in SE applications? (Sections IV-IX)
• RQ3: Which benchmark datasets and evaluation metrics are most commonly used for assessing the performance of LLMs and LLM-based agents in SE tasks? (Sections IV-IX and Section X)
• RQ4: What are the predominant experimental models and methodologies employed when utilizing LLMs in SE? (Section X)

II. EXISTING WORKS AND THE SURVEY STRUCTURE

A. Existing works

In recent years, large language models have been primarily applied to help programmers generate code and fix bugs. These models understand and complete code or text based on the user's input, leveraging their training data and reasoning capabilities. In previous survey papers, such as Angela Fan's research [8], there has not been much elaboration on requirement engineering. As mentioned in that paper, software engineers are generally reluctant to rely on LLMs for higher-level design goals. However, with LLMs achieving remarkable improvements in contextual analysis and reasoning abilities through various methods like prompt engineering and Chain-of-Thought (CoT) [16], their applications in requirement engineering are gradually increasing. Table I summarizes and categorizes the tasks in requirement engineering. Many studies utilize models for requirement classification and generation. Since the collection primarily focuses on the latter half of 2023 and before April 2024, and some papers address multiple tasks, the table does not reflect the exact number of papers we have collected.

While other works have surveyed LLM applications in some SE tasks [17] [8] [18], they lack a wider coverage of the general SE area to incorporate recent research developments. More importantly, LLMs are the main focus and contribution of these works, and there is no clear distinction of the capabilities
between LLMs and LLM-based agents. We summarize the differences between our work and others in Table II; this survey addresses these limitations by distinctly analyzing LLM and LLM-based agent applications across six SE domains, providing a thorough and up-to-date review. From previous research, it is evident that the performance of LLMs in various applications and tasks heavily depends on the model's inherent capabilities [10]. More importantly, earlier surveys often present findings from papers spanning a wide range of publication dates, leading to significant content disparities for LLMs in different SE tasks. For instance, research in requirement engineering was relatively nascent, resulting in sparse content in this area in previous surveys. The recent rise of LLM-based agents, with their enhanced capabilities and autonomy, fills these gaps. By focusing on the latest research and clearly differentiating between LLMs and LLM-based agents, our survey provides a thorough and in-depth overview of how these technologies are applied and the new opportunities they bring to SE.

In summary, we have collected a total of 117 papers directly relevant to this topic, covering the six SE domains mentioned earlier, as shown in Figure 1. Our analysis distinguishes between LLM and LLM-based agent contributions, offering a comparative overview and addressing the limitations of previous surveys. Considering the novelty of the LLM-based agents field and the lack of standardized benchmarks, this work seeks to offer a detailed review that can guide future research and provide a clearer view of the potential of these technologies in SE.
Topics and their associated keywords:

Software Security & Maintenance: Software security, Vulnerability detection, Automated program repair, Self-debugging, Vulnerability reproduction

Code Generation and Software Development: Code generation, Automatic code synthesis, Code refactoring, Programming language translation, Software development automation, Code completion, AI-assisted coding, Development lifecycle automation

Requirement Engineering and Documentation: Requirement engineering, Software requirements analysis, Automated requirement documentation, Technical documentation generation, User manual generation, Documentation maintenance, Requirements modeling, Requirements elicitation

Software Design and Evaluation: Software design automation, Architectural validation, Design optimization, Performance evaluation, Code quality assessment, Software metrics, Design pattern recognition, Architectural analysis, Code structure analysis

Software Test Generation: Test case generation, Automated testing, Unit test generation, Integration test generation, System test generation, Test suite optimization, Fault localization, Test maintenance, Regression testing, Adaptive testing

Autonomous Learning and Decision Making: Autonomous learning systems, Decision making, Adaptive planning, Project management automation, Self-improving software, Autonomous software agents
construct the correct translated text. A recent example of this architecture is the CodeT5+ model, launched by Salesforce AI Research in 2023 [30]. This model is an enhancement of the original T5 architecture, designed to improve performance in code understanding and generation tasks. It incorporates a flexible architecture and diversified pre-training objectives to optimize its effectiveness in these specialized areas. This development highlights the competency of Encoder-Decoder architectures in tackling increasingly complex NLP challenges.

The Encoder-only architecture, as the name suggests, eliminates the decoder from the entire structure, making the data representation more compact. Unlike RNNs, this architecture is stateless and uses a masking mechanism that allows input processing without relying on hidden states, which also accelerates parallel processing and provides excellent contextual awareness. BERT (Bidirectional Encoder Representations from Transformers) is a representative model of this architecture, a large language model built solely on the encoder. BERT leverages the encoder's powerful feature extraction capabilities and pre-training techniques to learn bidirectional representations of text, achieving outstanding results in sentiment analysis and contextual analysis [31].

The Decoder-only architecture, in the transformer framework, primarily involves the decoder receiving processed word vectors and generating output. Utilizing the decoder to directly generate text accelerates tasks such as text generation and sequence prediction. This highly scalable characteristic is known as auto-regressiveness, which is why popular models like GPT use this architecture. In 2020, the exceptional performance of GPT-3 and its remarkable few-shot learning capabilities demonstrated the vast potential of the decoder-only architecture [32]. Given the enormous computational cost and time required to train a model from scratch, and the exponential increase in the number of parameters, many researchers now prefer to leverage pre-trained models for further research. The most popular open-source pre-trained language model, LLaMA, developed by Meta AI, also employs the decoder-only architecture [33]; as mentioned earlier, the autoregressiveness and simplicity of this structure make the model easier to train and fine-tune.
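The auto-regressive property described above can be made concrete with a small sketch. The following is a minimal illustration, assuming the Hugging Face transformers library and the public gpt2 checkpoint as a small stand-in for GPT-style decoder-only models; it is not tied to any framework discussed in this survey.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("def add(a, b):", return_tensors="pt").input_ids
for _ in range(10):                      # greedily extend by ten tokens
    logits = model(ids).logits           # scores for every vocabulary item at each position
    next_id = logits[0, -1].argmax()     # most likely next token given everything so far
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Each new token is conditioned only on the tokens generated so far, which is exactly the property that makes decoder-only models straightforward to train, fine-tune, and sample from.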
C. Large Language Model Based Agent

The concept of agents traces back to the 19th century and is often referred to as intelligent agents, envisioned to possess intelligence comparable to humans. Over the past few decades, as AI technology has evolved, the capabilities of AI agents have significantly advanced, particularly with reinforcement learning. This development has enabled AI agents to autonomously handle tasks and to learn and improve based on specified reward/punishment rules. Notable milestones include AlphaGo [34], which leveraged reinforcement learning to defeat the world champion in Go. The success of GPT has further propelled the field, with researchers exploring the use of large language models as the "brain" of AI agents, thanks to GPT's powerful text understanding and reasoning capabilities. In 2023, a research team from Fudan University [10] conducted a comprehensive survey on LLM-based agents, examining their perception, behavior, and cognition. Traditional LLMs typically generate responses based solely on given natural language descriptions, lacking the ability for independent thinking and judgment. LLM-based agents are able to employ multiple rounds of interaction and customized prompts to gather more information, which enables the model to think and make decisions autonomously. In 2023, Andrew Zhao proposed the ExpeL framework [35], which utilizes ReAct as the planning framework combined with an experience pool [36]. This allows the LLM to extract insights from past records to aid in subsequent related queries; by letting the LLM analyze why previous answers were incorrect, it learns from experience to identify the problems.

At the same time, the application of LLM-based embodied agents has also become a hot research area in recent years. LLM-based embodied agents are intelligent systems that integrate LLMs with embodied agents [37]. These systems can not only process natural language but also complete tasks through perception and actions in physical or virtual environments. By combining language understanding with actual actions, these agents can perform tasks in more complex environments. This integration often involves using visual domain technologies
to process and understand visual data, and reinforcement learning algorithms to train agents to take optimal actions in the environment. These algorithms guide the agent through reward mechanisms to learn how to make optimal decisions in different situations, while the LLM acts as the brain to understand user instructions and generate appropriate feedback. In 2023, Guanzhi Wang introduced VOYAGER, an open-ended embodied agent with large language models [38]. It uses GPT-4 combined with input prompts, an iterative prompting mechanism, and a skill library, enabling the LLM-based agent to autonomously learn and play the game Minecraft, becoming the first lifelong learning agent in the game.

Prompt engineering plays a central role here, ensuring the generation of high-quality outputs and achieving complex automated tasks. In 2023, Jules White introduced a set of methods and patterns to enhance prompt engineering, optimizing interactions with LLMs such as ChatGPT [40]. This study categorizes prompt engineering into five main areas: Input Semantics, Output Customization, Error Identification, Prompt Improvement, and Interaction, to address a wide range of problems and adapt to different fields.

One notable technique is Retrieval-Augmented Generation (RAG): the input question undergoes similarity matching with documents in an index library, attempting to find relevant results. If similar documents are retrieved, they are organized in conjunction with the input question to generate a new prompt, which is then fed into the large language model. Now that large language models possess long-text memory capabilities, numerous studies have tested Gemini v1.5 against the Needle In A Haystack (NIAH) evaluation to explore whether RAG has become obsolete [41]. However, from various perspectives such as cost and expense, RAG still holds significant advantages (RAG could be 99 percent cheaper than utilizing all tokens). Additionally, long texts can negatively impact response performance, causing LLMs to respond more slowly when the input text is too long. Therefore, the advancements in context length for LLMs will not fully replace the role of RAG; rather, the two should be treated as complements to each other.
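To make the retrieval step concrete, the sketch below implements the loop just described with a toy in-memory index and keyword-overlap scoring as a stand-in for embedding-based similarity search; the documents and the final query_llm call are illustrative placeholders, not part of any cited framework.

```python
# Minimal RAG sketch: retrieve the most similar documents for a question and
# splice them into a new prompt for the language model. Keyword overlap stands
# in for a real embedding-based similarity index.
documents = [
    "The payment service retries failed transactions up to three times.",
    "User passwords are hashed with bcrypt before storage.",
]

def similarity(question: str, doc: str) -> int:
    # Toy similarity: number of shared lowercase tokens.
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, k: int = 1) -> list[str]:
    return sorted(documents, key=lambda d: similarity(question, d), reverse=True)[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    # The retrieved passages are organized together with the input question.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_rag_prompt("How are user passwords protected?")
print(prompt)
# answer = query_llm(prompt)  # hypothetical call to the underlying LLM
```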
Multi-agent systems fully utilize the advantages of multiple models, with each model specializing in specific aspects of the task to reduce the overhead caused by multi-processing in single agents; the collaboration among agents allows for more sophisticated and robust problem-solving capabilities. Due to their exceptional capabilities, more researchers are beginning to explore the field of multi-LLM-based agents and to apply them in software engineering domains. In 2024, many researchers adopted multi-agent systems in practical experiments [43] [44]. Multi-agent systems address the limitations of single-agent systems in the following ways:

• Enhanced Context Management: Multiple agents can maintain and share context, generating more coherent and relevant responses over long interactions.
• Specialization and Division of Labor: Different agents can focus on specific tasks or aspects of a problem, improving efficiency and effectiveness.
• Robustness and Error Correction: Collaborative agents can cross-check and validate each other's outputs, reducing the likelihood of errors and improving overall reliability.
• Contextual Consistency: Multi-agent systems can better manage context over long dialogues. The collaboration of multiple agents improves the efficiency of incident mitigation.
• Scalability and Flexibility: These systems can integrate specialized agents to scale and handle more complex tasks. Through the division of labor among multiple agents, the quality of code generation is improved.
• Dynamic Problem Solving: By integrating agents with different expertise, multi-agent systems can adapt to a wider range of problems and provide more accurate solutions.

E. LLM in Software Engineering

Recently, there has been a shift towards applying general AI models to specific vertical domains such as medicine and finance. In software engineering, new AI agents are emerging that are more flexible and intelligent compared to previous applications of LLMs, although they utilize different data and experiments. This continuous innovation underscores the transformative potential of AI agents across various fields; these models excel in text understanding and generation, promoting innovative applications in software development and maintenance.

LLMs profoundly impact software engineering by facilitating tasks such as code generation, defect prediction, and automated documentation. Integrating these models into development workflows not only simplifies the coding process but also reduces human errors. LLM-based agents enhance basic LLM capabilities by integrating decision-making and interactive problem-solving functions. These agents can understand and generate results by interacting with other software tools, which optimizes workflows, and they make autonomous decisions to improve software development practices. In 2023, Yann Dubois introduced the AlpacaFarm framework [45], where LLMs are used to simulate the behavior of software agents in complex environments. Moreover, significant research has been conducted in the field of automated program repair (APR). In 2024, Islem Bouzenia introduced RepairAgent [46], another LLM-based tool designed for automatic software repair; this tool reduced the time developers spent on fixing issues. Additionally, in 2023, Emanuele Musumeci demonstrated a multi-agent LLM system [47], which involved a multi-agent architecture where each agent had a specific role in the generation of documents. This system significantly improved the handling of complex document structures without extensive human supervision. Besides these, LLMs have made outstanding contributions in software testing, software design, and emerging fields such as software security and maintenance.

Currently, there is no comprehensive and accurate definition of the capabilities an LLM must exhibit to be considered an LLM-based agent. Since the application of LLMs in software engineering is relatively broad and some frameworks already exhibit certain levels of autonomy and intelligence, this study defines the distinction between LLMs and LLM-based agents based on mainstream definitions and literature from the first half of 2024. In this survey, an LLM architecture can be called an agent when it satisfies the criteria in Table IV.

Criteria (Table IV):
1) The LLM serves as the brain (the center of information processing and generation of thought).
2) The framework not only relies on the language understanding and generation capabilities of LLMs but also possesses decision-making and planning abilities.
3) If tools are available, the model can autonomously decide when and which tools to use and integrate the results into its predictions to enhance task completion efficiency and accuracy.
4) The model can select the optimal solution from multiple homogeneous results (the ability to evaluate and choose among various possible solutions).
5) The model can handle multiple interactions and maintain contextual understanding.

IV. REQUIREMENT ENGINEERING AND DOCUMENTATION

Requirement Engineering is a critical field within software engineering and plays an essential role in the software development process; its primary task is to ensure that the software system meets the needs of all relevant stakeholders. Typically, requirement engineering in project development involves many steps, where developers need to fully understand the users' needs and expectations to ensure that the development direction of the software system aligns with actual requirements. The collected requirements are then organized and evaluated by the development group. Requirements specification is the process of formally documenting the analyzed requirements; the specification must be accurate and concise, and requirement verification must be conducted to ensure that developers are building what users need and that it aligns with the specifications. Requirement engineering also includes requirement management, a task that spans the entire software development life-cycle; developers need to continuously track, control, and respond to any changes occurring during development, ensuring that these changes do not negatively impact the project's progress and overall quality.

A. LLMs Tasks

In the field of requirement engineering, LLMs have demonstrated significant potential in automating and enhancing tasks such as requirement elicitation, classification, generation, specification generation, and quality assessment. Requirement classification and extraction is a crucial task in requirement engineering during the development process. It is common to encounter situations where clients present multiple requirements at once, necessitating manual classification by developers. By categorizing requirements into functional and
non-functional requirements, developers can better understand and manage them. Thanks to the strong performance of LLMs in classification tasks, many relevant frameworks have been developed. The PRCBERT framework, utilizing the BERT pre-trained language model, transforms classification problems into a series of binary classification tasks through flexible prompt templates, significantly improving classification performance [48]. Studies have shown that PRCBERT achieved an F1 score of 96.13% on the PROMISE dataset, outperforming the previous state-of-the-art NoRBERT [49] and BERT-MLM [31] models. Additionally, the application of ChatGPT in requirement information retrieval has shown promising results: by classifying and extracting information from requirement documents, ChatGPT achieved comparable or even better Fβ scores under zero-shot settings, particularly in feature extraction tasks, where its performance surpassed baseline models [50]. As seen in Table I, there is also substantial literature and research on using LLMs to automatically generate requirements and descriptions in requirement engineering.

By automating the generation and description of requirements, the efficiency and accuracy of requirement elicitation can be improved. Research indicates that LLMs hold significant potential in requirements generation tasks. For example, in a study using ChatGPT to generate and gather user requirements, it was found that participants with professional knowledge could use ChatGPT more effectively, indicating the influence of domain expertise on the effectiveness of LLM-assisted requirement elicitation [51]. The study employed qualitative assessments of the LLMs' output against predefined criteria for requirements matches, including full matches, partial matches, and the relevancy of the elicited requirements. Although success varied depending on the complexity of the task and the experience of the users, the results show that LLMs can effectively assist in eliciting requirements, and they are particularly useful in identifying and suggesting requirements based on the large corpus of training data they were provided. SRS (Software Requirement Specification) generation is an important task on which developers normally spend a lot of time for refinement and verification. In [52], researchers use both iterative prompting and a single comprehensive prompt to assess the performance of LLMs in generating SRS. The experiment was conducted on GPT-4 and CodeLlama-34b, one closed-source LLM and one open-source LLM, for a comprehensive evaluation; the generated SRS were compared with human-crafted SRS and finally scored on a Likert scale. The results indicate that the human-written SRS was overall superior, but CodeLlama often came close, sometimes outperforming in specific categories. CodeLlama scored higher in completeness and internal consistency than GPT-4 but was less concise, so this study demonstrated the potential of using fine-tuned LLMs to generate SRS and increase overall project productivity. Another paper also explores using LLMs for generating specifications. In [53], the authors introduce a framework called SpecGen for generating program specifications. The framework primarily uses GPT-3.5-turbo as the base model and employs prompt engineering combined with multi-turn dialogues to generate the specifications. SpecGen applies four mutation operators to modify these specifications and finally uses a heuristic selection strategy to choose the optimal variant. The results show that SpecGen can generate 70% of the program specifications, outperforming traditional tools like Houdini [54] and Daikon (https://github.com/codespecs/daikon).

Furthermore, designing prompt patterns can significantly enhance LLMs' capabilities in tasks such as requirement elicitation and system design. The paper [55] provides a catalog of 13 prompt patterns, each aimed at addressing specific challenges in software development. The experiments test the efficacy of these patterns in real-world scenarios to validate their usefulness. By applying different prompt patterns, the study found that these patterns could help generate more structured and modular results and reduce common errors. Automated requirement completeness enhancement is another important benefit brought by LLMs in requirement generation. The study [56] uses BERT's Masked Language Model (MLM) to detect and fill in missing parts in natural language requirements, significantly improving the completeness of requirements. BERT's MLM achieved a precision of 82%, indicating that 82% of the predicted missing terms were correct.
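As a concrete illustration of the masked-language-model idea (a minimal sketch, not the exact setup of [56]), the snippet below uses the Hugging Face transformers fill-mask pipeline with the public bert-base-uncased checkpoint to propose candidates for a missing term in a requirement sentence; the requirement text itself is invented for the example.

```python
from transformers import pipeline

# Fill-mask pipeline built on an encoder-only BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

requirement = "The system shall [MASK] all user data before transmission."
for candidate in fill_mask(requirement)[:3]:
    # Each candidate carries the predicted token and its probability score.
    print(candidate["token_str"], round(candidate["score"], 3))
```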
There is also the application of LLMs in ambiguity detection tasks, aimed at detecting ambiguities in natural language
requirement documents to improve clarity and reduce misunderstandings. This study primarily aims to address the issue of detecting term ambiguities within the same application domain (where the same term has different meanings in different domains). Although current models generally possess excellent contextual understanding capabilities, this was a common problem in machine learning at that time. This study provides an excellent paradigm for the subsequent application of LLMs in requirements engineering; it demonstrated that transformer-based machine learning models can effectively detect and identify ambiguities in requirement documents, thereby enhancing document clarity and consistency. The framework utilizes BERT and K-means clustering to identify terms used in different contexts within the same application domain or in interdisciplinary project requirements documents [57]. In the last two years, more and more researchers have used LLMs to help them evaluate requirement documentation; quality assessment tasks ensure that the generated requirements and code meet expected quality standards. The application of ChatGPT in user story quality evaluation has shown potential in identifying quality issues, but it requires further optimization and improvement [58]. A similar study used an LLM to automatically process requirement satisfaction assessment and evaluate whether design elements are fully covered by the given requirements, but the researchers indicated the necessity of further verification and optimization in practical applications [59].

B. LLM-based Agents Tasks

Currently, the application of LLM-based agents in requirement engineering is still quite nascent, but some useful studies help us see the potential. LLM-based agents bring both efficiency and accuracy to tasks like requirement elicitation, classification, generation, and verification. Compared to traditional LLMs, these systems exhibit higher levels of automation and precision through task division and collaboration. The application of multi-agent systems in semi-structured document generation has shown significant effectiveness. In [60], a multi-agent framework is introduced that combines semantic recognition, information retrieval, and content generation tasks to streamline the creation and management of semi-structured documents in the public administration domain. The proposed framework involves three main types of agents: Semantics Identification Agent, Information Retrieval Agent, and Content Generation Agent. By avoiding the overhead of a single model, each agent is assigned a specific task with minimal user intervention, following the designed framework and workflow.

Additionally, the AI-assisted software development framework (AISD) also showcases the autonomy brought by LLM-based agents in requirement engineering. [61] proposes the AISD framework, which continuously improves and optimizes generated use cases and code through ongoing user feedback and interaction. In the experiments, humans first give a fuzzy requirement definition; the LLM-based agent then improves the requirement case according to this information, designs the model, and generates the system according to the case, and the generated results let humans judge whether the requirements are met or not. The study results indicate that AISD significantly increased use case pass rates to 75.2%, compared to only 24.1% without human involvement. AISD demonstrates the agents' autonomous learning ability by allowing LLMs to generate all code files in a single session, continually refining and modifying based on user feedback. This also ensures code dependency and consistency, further proving the importance of human involvement in the requirement analysis and system testing stages.

Furthermore, in generating safety requirements for autonomous driving, LLM-based agents have shown unique advantages by introducing multimodal capabilities. The system employs LLMs as automated agents to generate and refine safety requirements with minimal human intervention until the verification stage, which is unattainable with LLMs alone. [62] describes an LLM prototype integrated into the existing Hazard Analysis and Risk Assessment (HARA) process, significantly enhancing efficiency by automatically generating specific safety-related requirements. Through three design iterations, the study progressively improved the LLM prototype's efficiency, completing within a day what takes months manually. In agile software development, the quality of user stories directly impacts the development cycle and the realization of customer expectations. [63] demonstrates the successful application of the ALAS system in six agile teams at the Austrian Post Group IT. The ALAS system significantly improved the clarity, comprehensibility, and alignment with business objectives of user stories through automated analysis and enhancement. The agent framework allows the model to perform specific roles in the agile development process, and the study results indicated that the ALAS-improved user stories received high satisfaction ratings from team members.

C. Analysis

The application of LLM-based agents in requirement engineering has demonstrated significant efficiency improvements and quality assurance. Through multi-agent collaboration and automated processing, these systems not only reduce manual intervention but also enhance the accuracy and consistency of requirement generation and verification. We can see that the tasks of LLM-based agents are no longer limited to simply generating requirements or filling in the gaps in descriptions. Instead, they involve the implementation of an automated process, with the generation of requirement documents being just one part of it; integrating LLMs into agents enhances the overall system's natural language processing and reasoning capabilities. In real-world applications, many tasks can no longer be accomplished by simple LLMs alone, especially high-level software design. The emergence of LLM-based agents addresses this issue through a multi-agent collaborative system centered around LLMs; these agents continuously analyze and refine the deficiencies in the requirement documents, and this is likely to be the main application trend of LLM-based agents in requirements engineering in the future.

The application of LLM-based agents in requirements engineering is still relatively limited, with most efforts focusing on
datasets that reflect the requirements in actual projects, providing a valuable testing benchmark.

From these papers, it can be seen that the selection and construction of datasets in LLM-based agents' research in requirement engineering often rely on practical projects and case studies, lacking standardization and large-scale datasets. Compared to the LLM literature, the datasets used are broader and at a higher level, such as an actual system's files, not limited to the classification of non-functional requirements and pure software requirement specifications. Researchers focus more on validating the model's effectiveness through practical application and iterative improvement to enhance model performance. While this approach is flexible and targeted, it also highlights the field's shortcomings in dataset standardization and scaling. In the future, with more public datasets being constructed and shared, the application of LLM-based agents in requirement engineering is expected to achieve broader and deeper development.

E. Evaluation Metrics

In the field of requirement engineering, LLMs and LLM-based agents are evaluated using various metrics. These metrics not only include traditional indicators such as precision, recall, and F1 score but also more specific indicators tailored to the unique nature of requirement engineering.
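For reference, the traditional indicators mentioned here follow their standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives:

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
F_\beta = \frac{(1+\beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
\]

The Fβ score is the weighted generalization of F1, weighting recall β times as much as precision.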
Through these evaluations, we can see how these models are assessed and how they are changing the practice of requirement engineering. The specific evaluation metrics are detailed in Table V. In [51], while precision and recall are fundamental for evaluating the effectiveness of information retrieval, additional evaluations of clarity, consistency, and compliance are included, which are crucial quality indicators in requirement engineering. This multidimensional evaluation method not only measures the operational performance of LLMs but also examines their ability to maintain the quality of requirement specifications. Through this approach, LLMs have demonstrated their value in automating and optimizing the requirement elicitation process, enhancing both efficiency and the reliability of results. The paper [52] uses a Likert scale to measure the quality of the generated specifications; each specification is scored on attributes such as unambiguity, understandability, and conciseness, with agreement rated from 1 to 5. For agent-based LLMs, as demonstrated in [63], the evaluation extends to assessing the independence and negotiability of the agents, elevating their functionality to a new level. These agents provide technical solutions and also interact with users, autonomously adjusting to meet specific project
needs, thus resembling collaborative partners. This capability makes LLM-based agents valuable in requirement engineering for requirement management and decision optimization, and it also highlights that LLMs typically focus on improving the accuracy and efficiency of specific tasks, while LLM-based agents exhibit higher capabilities in autonomy and adaptability. In Table V, we can see that the application of LLMs in requirements engineering typically relies on common metrics such as F1 score to evaluate a model's performance. However, for LLM-based agents, the evaluation focus shifts from the performance of requirement document generation to the quality of the final product. Therefore, evaluation metrics emphasize user satisfaction, such as pass rate, feedback, etc. Essentially, LLM-based agents still leverage LLMs themselves to achieve higher-level development, and much depends on the nature of the task. In summary, we can conclude that the characteristics of agent models both reflect their complex decision-making and learning abilities and reveal their potential advantages in collaborating with humans or other tools to provide more scalable and flexible designs. This implies that the methods of requirement elicitation and processing in future software development will become more efficient, precise, and continuously refined to better align with stakeholders' needs by using LLM-based agents.

V. CODE GENERATION AND SOFTWARE DEVELOPMENT

Code generation and software development are core areas within software engineering and play a crucial role in the software development process. The primary objective of using LLMs in code generation is to enhance development efficiency and code quality through automated processes, thereby meeting the needs of both developers and users.

In recent years, the application of LLMs in code generation and software development has made significant progress; this has changed the way developers work and revealed a shift towards automated development processes. Compared to requirement engineering, research on the application of LLMs and LLM-based agents in code generation and software development is more extensive and in-depth. Using natural language processing and generation technologies, LLMs can understand and generate complex code snippets, assisting developers in automating various stages from code writing and debugging to software optimization. Decoder-based large language models such as GPT-4 have shown significant potential in code generation by providing accurate code suggestions and automated debugging, greatly improving development efficiency. Recently, the application of LLM-based agents in software development has also been gaining attention; these intelligent agents can not only perform complex code generation tasks but also engage in autonomous learning and continuous refinement, thereby offering flexible assistance in dynamic development environments. Tools like GitHub Copilot [12], which integrate LLMs, have already demonstrated their advantages in enhancing programming efficiency and code quality.

A. LLMs Tasks

Large language models have optimized various tasks in generation and reasoning, covering areas such as code generation, debugging, code comprehension, code completion, code transformation, and multi-turn interactive code generation. The primary method is generating executable code from natural language descriptions, where models utilize previously learnt code snippets or apply few-shot learning to better understand user requirements. Nowadays, AI tools integrate deeply with IDEs like Visual Studio Code (https://code.visualstudio.com/) and JetBrains to enhance code writing and translation tasks, such as OpenAI's Codex model [67]. Codex, fine-tuned on public code from GitHub, demonstrates the capability to generate Python functions from doc-strings and outperformed other similar models on the HumanEval benchmark.

In [68], researchers comprehensively evaluated the performance of multiple LLMs on L2C (language-to-code) tasks. The results showed that GPT-4 demonstrates strong capability in tasks such as semantic parsing, mathematical reasoning, and Python programming. With instruction tuning and support from large-scale training data, the model can understand and generate code that aligns with user intent, achieving high-precision code generation. Applying LLMs to text-to-database management and query optimization is also a novel research direction in natural-language-to-code generation. By converting natural language queries into SQL statements, LLMs help developers quickly generate efficient database query code. [69] proposed the SQL-PaLM framework, which significantly enhances the execution accuracy and exact match rate for text-to-SQL tasks through few-shot prompting and instruction fine-tuning, providing an effective solution for complex cross-domain SQL generation tasks. The improvements in accuracy and exact match achieved by SQL-PaLM are considered state-of-the-art (SOTA) on the tested benchmarks; SQL-PaLM showed promising results compared with existing methods such as T5-3B + PICARD, RASAT + PICARD, and even GPT-4, achieving the highest test accuracy of 77.3% and an execution accuracy of 82.7%.
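A schematic example of the few-shot prompting such text-to-SQL systems rely on is sketched below; the schema, example pairs, and question are made up for illustration and do not reproduce the SQL-PaLM prompt format.

```python
EXAMPLES = [
    ("List all customer names.", "SELECT name FROM customers;"),
    ("How many orders were placed in 2023?",
     "SELECT COUNT(*) FROM orders WHERE strftime('%Y', order_date) = '2023';"),
]

def build_prompt(schema: str, question: str) -> str:
    # Few-shot examples precede the new question so the model can imitate the mapping.
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in EXAMPLES)
    return (f"Given the database schema:\n{schema}\n\n"
            f"{shots}\n\nQuestion: {question}\nSQL:")

schema = "customers(id, name); orders(id, customer_id, order_date, total)"
print(build_prompt(schema, "What is the total revenue per customer?"))
```

The generated SQL string would then be executed against the database, and execution accuracy and exact match are measured against reference queries.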
Multilingual code generation is another important application of LLMs, particularly suited to the transformer architecture. In [70], researchers introduced the CodeGeeX model, which was pre-trained on multiple programming languages and performed well in multilingual code generation and translation tasks. Experimental results showed that CodeGeeX outperformed other multilingual models on the HumanEval-X benchmark.

Although current LLMs possess excellent code generation capabilities, with accuracy and compile rates reaching usable levels, the quality of generated code often depends on the user's prompts. If the prompts are too vague or general, the LLM typically struggles to understand the user's true requirements, making it difficult to generate the desired code in a single attempt. In [71], researchers introduced a "print debugging" technique, using GPT-4 to track variable values and execution flows, which enhances efficiency and accuracy through in-context learning. This method is particularly suitable for medium-difficulty problems on Leetcode; compared to the rubber duck debugging method, print
debugging improved performance by 1.5% on simple Leetcode problems and by 17.9% on medium-difficulty problems.

Additionally, the application of LLMs to improving programming efficiency has garnered widespread attention; tools like GitHub Copilot, which integrates OpenAI's Codex model, provide real-time code completion and suggestions during coding. In [72], researchers present a controlled experiment with GitHub Copilot, with results showing that developers completed HTTP server tasks 55.8% faster when using Copilot. Another similar study also uses LLMs as programmer tools: in [73], researchers introduced the INCODER model, which is capable of both program synthesis and editing. By leveraging bidirectional context, the model performs well in both single-line and multi-line code infilling tasks, providing developers with smarter code editing tools. This real-time code generation and completion functionality not only improves programming efficiency but also reduces the burden on developers, allowing them to focus on higher-level design, addressing a common problem in software development where substantial workforce and time are wasted on tedious coding tasks.

Multi-turn program synthesis represents a significant breakthrough for LLMs in handling complex programming tasks. In [74], researchers introduced the CODEGEN model, which iteratively generates programs through multiple interactions, significantly improving program synthesis quality and making the development process more efficient and accurate. By gradually generating and continuously optimizing code at each interaction, LLMs can better understand user intent and generate more precise and optimized code. In the experiments, comparisons were made with the Codex model, which was considered state-of-the-art in code generation at the time. CODEGEN-MONO 2.7B outperformed the Codex model of comparable size on pass@k metrics for both k=1 and k=10. Furthermore, CODEGEN-MONO 16.1B exhibited performance comparable to or better than the best Codex model on certain metrics, further demonstrating its SOTA performance in code generation.
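For context, pass@k is commonly reported with the unbiased estimator popularized by the HumanEval/Codex evaluation: n samples are generated per problem, c of them pass the unit tests, and

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
\]

so that pass@1 reduces to the average fraction of passing samples per problem.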
By iteratively generating and optimizing code, LLMs can continuously improve their output quality. In [75], researchers proposed the Cycle framework, which enhances the self-improvement capability of code language models by learning from execution feedback, improving code generation performance by 63.5% on multiple benchmark datasets. Although Cycle has a certain degree of autonomy, its decision-making and planning capabilities are mainly limited to code generation and improvement tasks without overall planning, and its execution sequence completely follows a fixed pattern, so it is better classified as an advanced LLM application.

B. LLM-based Agents Tasks

LLM-based agents have shown significant potential and advantages by substantially improving task efficiency and effectiveness through multi-agent collaboration. Unlike traditional LLMs, LLM-based agents adopt a division-of-labor approach, breaking down complex tasks into multiple subtasks handled by specialized agents; this method can enhance task efficiency and improve the quality and accuracy of generated code, mitigating the hallucination of a single LLM. In [76], researchers proposed a self-collaboration framework where multiple ChatGPT (GPT-3.5-turbo) agents act as different roles to collaboratively handle complex code generation tasks. Specifically, the introduction of a Software Development Methodology (SDM) divides the development process into three stages: analysis, coding, and testing. Each stage is managed by specific roles, and after completing their tasks, each role provides feedback and collaborates with others to improve the quality of the generated code. Experiments show that this self-collaboration framework significantly improves performance on both the HumanEval and MBPP benchmarks, with the highest improvement reaching 29.9% on HumanEval compared to the SOTA model GPT-4. This result demonstrates the potential of collaborative teams in complex code generation tasks. Although it lacks external tool integration and dynamic adjustment capabilities, this framework exhibits common characteristics of LLM-based agents, such as role distribution, self-improvement ability, and excellent autonomous decision-making; these combined capabilities qualify it to be considered an LLM-based agent. Similarly, in [77], the LCG framework improved code generation quality also through multi-agent collaboration and chain-of-thought techniques, once again demonstrating the effectiveness of multi-agent collaboration in the software development process.
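The role division in such self-collaboration frameworks can be sketched as a simple loop. The snippet below is a schematic reconstruction, not the implementation from [76]; the chat function is a placeholder for whatever LLM client is available.

```python
def chat(role: str, message: str) -> str:
    # Placeholder: a real system would send `role` as the system prompt and
    # `message` as the user prompt to an LLM and return its reply.
    return f"[{role} reply to: {message[:40]}...]"

def self_collaborate(task: str, max_rounds: int = 3) -> str:
    plan = chat("analyst", f"Decompose this task into subtasks: {task}")
    code = chat("coder", f"Implement the plan in Python: {plan}")
    for _ in range(max_rounds):
        report = chat("tester", f"Review and test this code: {code}")
        if "no issues" in report.lower():      # stop once the tester role is satisfied
            break
        # Feedback from the tester role is handed back to the coder role.
        code = chat("coder", f"Revise the code given this report: {report}")
    return code

print(self_collaborate("Parse a CSV file and return the row count."))
```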
The limitation of context windows was not discussed in the previous studies; it was thoroughly explored in 2024 by a University of Cambridge team. In [78], researchers introduced the L2MAC framework, which dynamically manages memory and execution context through a multi-agent system to generate large codebases, and achieved SOTA performance in generating large codebases for system design tasks. The framework is primarily divided into the following components: the processor, which is responsible for the actual generation of task outputs; the Instruction Registry, which stores program prompts to solve user tasks; and the File Storage, which contains both final and intermediate outputs. The Control Unit periodically checks the outputs to ensure that the generated content is both syntactically and functionally correct. The researchers conducted multiple experiments and compared with many novel methods like GPT-4, Reflexion, and AutoGPT, achieving a Pass@1 score of 90.2% on the HumanEval benchmark, showcasing its superior performance in generating large-scale codebases.

Recently, many studies have begun to use LLM-based agents to simulate real software development processes. The paper [79] introduced the MetaGPT framework, which enhanced problem-solving capabilities through standard operating procedures (SOPs) encoded in multi-agent collaboration. The entire multi-agent collaboration framework simulates the waterfall life-cycle of software development, with each agent playing a different role and collaborating to achieve the goal of automating software development. LLM-based agents have also shown strong ability in automated software development: [80] proposed a multi-GPT agent framework that automates tasks such as project planning, requirement engineering, software design, and debugging, illustrating the
potential for automated software development. Similarly, [81] introduced a model called CodePori, a novel model designed to automate code generation for extensive and complex software projects based on natural language prompts. In [82], the AgentCoder framework coordinates programmer agents, test design agents, and test execution agents to generate and optimize code, outperforming existing methods: it achieved SOTA performance on the HumanEval-ET benchmark with a pass@1 of 77.4% compared to the previous state-of-the-art result of 69.5%, showcasing the advantages of multi-agent systems in code generation and testing.

For many frameworks, the purpose of integrating LLMs into agents is to enhance the self-feedback and reflection capabilities of the entire agent system. Because current open-source LLMs generally have much lower capabilities in this aspect compared to proprietary models, the emergence of LLM-based agents can help bridge the gap between open-source models and the advanced capabilities of proprietary systems like GPT-4. [83] introduced the OpenCodeInterpreter framework, which improved the accuracy of code generation models by integrating code generation, execution, and human feedback. Based on CodeLlama and DeepSeekCoder, this framework performed close to the GPT-4 Code Interpreter on the HumanEval and MBPP benchmarks. The ability to use external tools or APIs is another significant advantage of LLM-based agents. [84] proposed the Toolformer model, which significantly enhanced task performance by learning to call APIs through self-supervision. The framework, based on GPT-J (6.7B parameters), achieved significant performance improvements across multiple benchmark tasks, demonstrating the possibilities that external tools bring to LLM-based agents; the diverse choice of tools and architectures allows LLMs to continuously learn new things and improve themselves. Similarly, [85] enhanced LLMs' interaction with external APIs through the ToolLLM framework, outperforming Text-Davinci-003 and Claude-2 on the ToolBench and APIBench benchmarks and excelling in multi-tool instruction processing.
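The tool-calling pattern these frameworks build on can be reduced to a small loop: the model emits a structured action, the runtime executes the matching tool, and the observation is appended to the context for the next step. The registry, the action format, and the call_llm stub below are illustrative assumptions, not the Toolformer or ToolLLM APIs.

```python
import json

# A toy tool registry; real agents expose search engines, compilers, test runners, etc.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
    "search": lambda query: f"(stub) top result for '{query}'",
}

def call_llm(context: str) -> str:
    # Placeholder for the model call that decides which tool to invoke next.
    return json.dumps({"tool": "calculator", "args": "2 + 2 * 10"})

def agent_step(context: str) -> str:
    action = json.loads(call_llm(context))
    observation = TOOLS[action["tool"]](action["args"])
    # The observation is fed back into the context so later steps can build on it.
    return context + f"\nAction: {action}\nObservation: {observation}"

print(agent_step("Task: compute 2 + 2 * 10."))
```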
C. Analysis

The main differences between LLM-based agents and traditional LLMs in software development applications mainly concern efficiency and autonomy, particularly in task division and collaboration. Traditional LLMs typically use a single model to handle specific tasks, such as generating code from text and code completion. However, this approach has limitations when dealing with complex tasks, especially regarding context window restrictions and the need for continuous feedback. LLM-based agents handle different subtasks through collaboration with a clear division of labor, thereby enhancing task efficiency and quality. For example, in a code generation task, one agent generates the initial code, another designs test cases, and a third executes tests and provides feedback, thus achieving iterative optimization. Through task division, multi-agent systems, and tool integration, LLM-based agents can tackle more complex and broader tasks, improving the quality and efficiency of code generation. This approach overcomes the limitations of traditional LLMs and also provides new directions and ideas for future software development research and applications, freeing programmers from tedious test suite generation.

Fig. 5: Illustration of comparison framework between LLM-based agent and LLM in code generation and software development.

In software engineering task handling, there are subtle differences between LLMs and LLM-based agents in terms of task focus, approach, complexity and scalability, automation level, and task management. LLMs primarily focus on enhancing the code generation capabilities of a single LLM, including debugging, precision, and evaluation. These methods typically improve specific aspects of code generation or evaluation through a single model, concentrating on performance enhancement within existing constraints, such as context windows and single-task execution. In contrast, LLM-based agents emphasize handling more complex and broader tasks through the collaboration of multiple specialized LLMs or frameworks, integrating tool usage, iterative testing, and multi-agent coordination to optimize the whole development process, easily surpassing state-of-the-art models on common benchmarks. The emergence of multi-agent systems also brings more possibilities; such a system can imitate real software developers performing scrum development. Figure 5 uses studies [77] and [75] to showcase the differences between LLM-based agents and LLMs on the same code generation task. The LLM-based agent system is able to perform multi-agent collaboration and simulate a real scrum development team in industry. In contrast, the LLMs on the right normally use multiple LLMs to analyse mistakes from the test cases and refine the initially generated code, but they lack autonomy and efficiency, as the test cases are manually generated by humans.
VI. AUTONOMOUS LEARNING AND DECISION MAKING

Autonomous learning and decision making is a critical and evolving field in modern software engineering, especially under the influence of artificial intelligence and big data. Its core task is to achieve automated data analysis, model building, and decision optimization through machine learning algorithms and intelligent systems, thereby enhancing the autonomy and intelligence of software systems.
In this process, LLMs and LLM-based agents bring numerous possibilities. Following the development of NLP technology, many achievements have been made in applying LLMs to this field. These models can handle complex language tasks and also demonstrate powerful reasoning and decision-making abilities. Research on voting inference using multiple LLM calls has revealed new ways of optimizing performance; the most frequently used method is majority vote [89], which improves the accuracy of inference systems and helps select the best candidate answer. Additionally, the performance of LLMs in tasks such as automated debugging and self-correction has strengthened systems' autonomous learning capabilities, achieving efficient error identification and correction. At the same time, the application of LLM-based agents in autonomous learning and decision-making is a novel but popular topic: these agents can perform complex reasoning and decision-making tasks with the help of the LLM and improve their adaptability in dynamic environments through continuous learning and optimization. In this context, we have collected nineteen research papers on LLM-based agents in this field. This survey provides a general review of these studies, analyzing their specific applications and technical implementations in autonomous learning and decision making.
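Majority voting over repeated calls, the aggregation strategy referenced above, reduces to sampling the model several times and keeping the most frequent answer. The snippet below is a generic illustration of that idea rather than the exact procedure of [89] or [90]; the `call_llm()` sampling helper is a hypothetical placeholder.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder for one stochastic LLM call (temperature > 0)."""
    raise NotImplementedError

def majority_vote(prompt: str, n_calls: int = 11) -> str:
    """Query the model n_calls times and return the most common answer."""
    answers = [call_llm(prompt).strip() for _ in range(n_calls)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```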
A. LLMs Tasks

API calls to LLMs are one common application, often requiring repeated calls so that the model can make judgments and inferences; but does continuously increasing the number of calls always improve performance? In [90], researchers explored the impact of increasing LLM calls on the performance of compound inference systems. Analyzing voting-based inference designs, they found a non-linear relationship between the number of LLM calls and system performance: performance improves initially with more calls but declines after a certain threshold. This research provides a theoretical basis for optimizing LLM calls, helping to allocate resources reasonably in practical applications. However, the performance of voting inference systems shows a non-monotonic trend due to the diversity of query difficulties, and the continuously increasing cost also needs to be considered.
Autonomous learning is also applied to bug fixing, where researchers hope LLMs can continuously learn to fix bugs and eventually catch human oversights and common errors. In [91], the SELF-DEBUGGING method was proposed, enabling LLMs to debug code by analyzing execution results and natural language explanations. This method significantly improved the accuracy and sample efficiency of code generation, especially for complex problems. Experimental results on the Spider and TransCoder benchmarks showed that SELF-DEBUGGING increases model accuracy by 2-12%, demonstrating the potential of LLMs to learn autonomously to debug and correct errors. Another similar study introduced the AutoSD (Automated Scientific Debugging) technique [92], which simulates the scientific debugging process through LLMs and generates explainable patched code. The researchers evaluated AutoSD from six aspects: feasibility, debugger ablation, language model change, developer benefit, developer acceptance, and qualitative analysis. Results showed that AutoSD can generate effective patches and also improve developers' accuracy in evaluating patched code by providing explanations; this explainability makes it easier for developers to understand and accept automatically generated patches. Although both studies focus on automated debugging techniques, the frameworks they design automatically determine the optimal repair based on the debugging results once sufficient information has been collected, and they provide concrete code implementations, demonstrating a capability for autonomous decision-making and learning.
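The loop behind self-debugging approaches such as [91] can be sketched as: run the generated program, hand the execution trace back to the model together with a request for a natural-language explanation, and ask for a fix. The helper names below (`call_llm`, `execute`) are assumptions for illustration, not the paper's code.

```python
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for any code-capable LLM

def execute(code: str) -> tuple[bool, str]:
    """Run the candidate program and capture its output or traceback."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def self_debug(task: str, max_iters: int = 5) -> str:
    code = call_llm(f"Write a Python program for: {task}")
    for _ in range(max_iters):
        ok, trace = execute(code)
        if ok:
            break
        # Ask the model to explain the failure before asking for a fix.
        explanation = call_llm(f"Explain why this code fails:\n{code}\nTrace:\n{trace}")
        code = call_llm(
            f"Task: {task}\nFaulty code:\n{code}\nDiagnosis:\n{explanation}\n"
            "Return a corrected program."
        )
    return code
```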
Since the rise of LLMs across various fields, one research direction has been the rational analysis of their creativity and the exploration of their potential for continuous learning; this creativity is also strongly determined by the decision-making capability of the models. [93] analyzed the outputs of LLMs from the perspective of creativity theory, exploring their ability to generate creative content. Using metrics such as value, novelty, and surprise, the study found that current LLMs have limitations in generating combinatorial, exploratory, and transformative creativity. Although LLMs can generate high-quality creative content, further research and improvement are needed to achieve true creative breakthroughs. Additionally, innovative responses generated by LLMs may come with hallucinations, a long-standing issue for large language models; despite many mitigation techniques, they still cannot be entirely prevented. There are many interesting experiments in decision making, such as having LLMs act as judges to determine whether a person has committed a crime [94]. A familiar approach is to have a primary LLM interact with other LLMs. [95] explored the effectiveness of using LLMs as judges to evaluate other LLM-driven chat assistants. The study validated the consistency of LLM judgments with human preferences on the MT-Bench and Chatbot Arena benchmarks, showing that GPT-4's judgments were highly consistent with human judgments across various tasks. This research demonstrates the potential of LLMs for simulating human evaluation, providing new ideas for automated evaluation and optimization.
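Pairwise LLM-as-a-judge evaluation of the kind studied in [95] reduces to prompting a strong model to pick the better of two answers and then measuring how often its verdicts agree with human ones. The sketch below is a simplified, hypothetical illustration (the judge prompt and answer parsing are deliberately minimal).

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the judge model, e.g. a GPT-4-class LLM

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model to choose the better of two answers."""
    verdict = call_llm(
        "You are an impartial judge. Given the question and two answers, "
        "reply with exactly 'A' or 'B' for the better answer.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"

def agreement_with_humans(samples: list[dict]) -> float:
    """samples: [{'question': ..., 'a': ..., 'b': ..., 'human': 'A' or 'B'}, ...]"""
    hits = sum(judge_pair(s["question"], s["a"], s["b"]) == s["human"] for s in samples)
    return hits / len(samples)
```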
B. LLM-based Agents Tasks

Multi-agent collaboration and dialogue frameworks have also demonstrated strong capabilities in both decision making and autonomous learning. [96] explores whether multi-agent discussions can enhance the reasoning abilities of LLMs. The proposed CMD framework simulates human group discussion processes, showing that multi-agent discussions can improve performance in commonsense knowledge and mathematical reasoning tasks without task-specific examples. The study also found that multi-agent discussions correct common errors of single agents, such as judgment errors and the propagation of incorrect answers, thereby improving overall reasoning accuracy. In [97], researchers explored the potential of multi-modal large language models (MLLMs) such as GPT4-Vision to enhance agents' autonomous decision-making. The paper introduces the PCA-EVAL benchmark and evaluates multi-modal decision-making capabilities in areas such as autonomous driving, home assistants, and gaming. The results showed that GPT4-Vision exhibits outstanding performance across the dimensions of perception, cognition, and action.
[98] proposes the Reflexion framework, a novel approach that strengthens learning through language feedback rather than traditional weight updates, avoiding expensive retraining costs. The framework uses self-reflection and language feedback to help language agents learn from mistakes, significantly improving performance on decision-making, reasoning, and programming tasks. Reflexion's first-pass success rate on the HumanEval Python programming task increased from 80.1% to 91.0%, success rates in the ALFWorld decision-making task improved by 22%, and performance on the HotPotQA reasoning task increased by 14%. These results indicate that Reflexion achieves state-of-the-art performance across various tasks through self-reflection and language feedback.
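Reflexion's core idea, verbal feedback kept in an episodic memory instead of weight updates, can be illustrated with a small loop: attempt the task, evaluate, ask the model to reflect on the failure, and prepend the reflection to the next attempt. The sketch assumes hypothetical `call_llm()` and `evaluate()` helpers and is not the released Reflexion code.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # actor / self-reflection model

def evaluate(task: str, attempt: str) -> tuple[bool, str]:
    """Task-specific check, e.g. running unit tests; returns (success, feedback)."""
    raise NotImplementedError

def reflexion(task: str, max_trials: int = 4) -> str:
    reflections: list[str] = []  # long-term verbal memory; no weight updates
    attempt = ""
    for _ in range(max_trials):
        context = "\n".join(reflections)
        attempt = call_llm(f"Past reflections:\n{context}\nSolve the task:\n{task}")
        success, feedback = evaluate(task, attempt)
        if success:
            break
        reflections.append(
            call_llm(f"Attempt:\n{attempt}\nFeedback:\n{feedback}\n"
                     "In two sentences, explain what went wrong and what to try next.")
        )
    return attempt
```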
Another agent framework, ExpeL [35], enhances decision-making by autonomously collecting experiences and extracting knowledge from a series of training tasks using natural language; this experience-collection process is similar to how humans gain insights through practice and then apply them in exams. By accessing internal databases, ExpeL also reduces hallucinations, employing the RAG technique discussed in Section III. The ExpeL framework requires no parameter updates: it enhances decision-making by recalling past successes and failures, fully leveraging the advantages of the ReAct framework [36]. Experiments showed that ExpeL improves continuously across tasks in multiple domains and exhibits cross-task transfer learning. Combining ExpeL with Reflexion further improves performance over iterative task attempts, highlighting the importance of autonomous learning and experience accumulation in developing intelligent agents. ExpeL demonstrates its potential as a state-of-the-art (SOTA) LLM-based agent in several respects, particularly cross-task learning, self-improvement, and memory mechanisms; compared with existing SOTA agents such as Reflexion [98], it outperforms baseline methods in various task environments. These studies collectively indicate the importance of autonomous learning and improvement in LLM-based agents: agent systems continuously optimize their decision-making processes through self-feedback, self-reflection, and accumulated experience, showing higher autonomy and flexibility in dynamic and complex tasks than traditional LLMs. Unlike traditional LLMs, which mainly rely on pre-training data and parameter updates, LLM-based agents adapt and improve in real time through continuous self-learning and feedback mechanisms, thus performing strongly across a variety of tasks.
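ExpeL's experience pool and insight pool can be approximated with two lists: trajectories collected from past trials, and natural-language insights distilled from comparing successes with failures, which are then retrieved and prepended to new tasks. The class below is a rough sketch under those assumptions; the `call_llm()` helper and method names are hypothetical, not ExpeL's published API.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    raise NotImplementedError

@dataclass
class ExperienceMemory:
    trajectories: list[str] = field(default_factory=list)  # raw past attempts + outcomes
    insights: list[str] = field(default_factory=list)       # distilled, reusable rules

    def add_trial(self, trajectory: str, succeeded: bool) -> None:
        self.trajectories.append(f"[{'success' if succeeded else 'failure'}] {trajectory}")

    def distil(self) -> None:
        """Compare successes and failures and extract reusable insights."""
        summary = call_llm(
            "From these trials, list general rules that led to success:\n"
            + "\n".join(self.trajectories[-20:])
        )
        self.insights.extend(line for line in summary.splitlines() if line.strip())

    def solve(self, task: str) -> str:
        hints = "\n".join(self.insights[-10:])  # recall past knowledge before acting
        return call_llm(f"Known insights:\n{hints}\nNew task:\n{task}")
```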
[99] proposes the AGENTVERSE multi-agent framework, designed to improve task-completion efficiency and effectiveness through collaboration. The framework draws on human group dynamics by designing a collaborative system of expert agents that performs strongly in tasks such as text understanding, reasoning, coding, and tool usage. Experiments showed that AGENTVERSE performs well not only in independent task completion but also gains significantly from group collaboration, especially in coding tasks where GPT-4 serves as the brain of the agent groups. The framework also observed emergent behaviors among agents during collaboration, such as voluntary actions, conformity, and destructive behaviors, providing valuable insights for understanding and optimizing multi-agent systems.
Another multi-agent study [100] introduces the well-known CAMEL framework, which explores scalable techniques to facilitate autonomous collaboration among agents. The study proposes a role-playing collaborative agent framework that guides dialogue agents to complete tasks through embedded prompts while maintaining alignment with human intentions. CAMEL generates dialogue data to study behaviors and capabilities within the agent society; the study further improved agent performance by fine-tuning the LLaMA-7B model, validating the effectiveness of the generated datasets in enhancing LLM capabilities. [101] presents a comprehensive comparison of LLM-augmented autonomous agents and proposes BOLAA, a new multi-agent coordination strategy for solving complex tasks through efficient communication and coordination. Experiments showed that BOLAA outperforms other agent architectures in the WebShop environment, especially with high-performance LLMs. The above three studies focus on achieving multi-agent collaboration architectures by increasing the number of agents, a trend indicating that more frameworks are beginning to explore the potential of multi-agent systems. [44] explores methods to enhance LLM performance by increasing the number of agents. Using sampling-and-voting methods, the study showed that as the number of agents increases, LLM performance in arithmetic reasoning, general reasoning, and code generation improves significantly, proving the effectiveness of multi-agent collaboration in enhancing model performance. These studies collectively indicate the importance of multi-agent collaboration and dialogue frameworks for autonomous learning and decision-making tasks. Compared with traditional LLMs, these multi-agent frameworks improve reasoning accuracy under zero-shot settings and demonstrate higher autonomy and flexibility, reducing the burden on developers.
LLM-based agents not only perform complex data analysis tasks but also show potential in simulating and understanding human trust behaviors. [102] introduces SELF, a framework designed to achieve self-evolution of LLMs through language feedback, using RLHF to train agent behavior toward human alignment. The framework enhances model capabilities through iterative self-feedback and self-improvement without human intervention. In experiments, test accuracy on the GSM8K and SVAMP datasets increased by 6.82% and 4.9%, respectively, and overall task win rates on the Vicuna and Evol-Instruct test sets increased by 10% and 6.9%. Another study explores the potential of LLM-based agents to simulate human trust behaviors: [103] examines whether LLM-based agents exhibit trust behaviors similar to humans and whether these behaviors can align with human trust. Through a series of trust-game variants, such as initial fund allocation and return trust games, the research analyzes agents' trust decisions and behaviors in different contexts. Results show that, particularly for GPT-4, LLM-based agents exhibit trust behaviors consistent with human expectations in these games, validating their potential for simulating human trust behaviors. The efficient and accurate handling of diverse datasets highlights broad application prospects in fields such as software engineering. In simulating trust behaviors, LLM-based agents display human-like behavior patterns through complex trust decisions and behavior analysis, providing an important theoretical foundation for future human-machine collaboration and human behavior simulation.
Integrating LLMs into agents also allows more complex task processing. [104] proposes AgentLite, a lightweight, user-friendly library designed to simplify the development, prototyping, and evaluation of task-oriented LLM-based agent systems. The main goal is to enhance the capability and flexibility of LLM-based agents in various applications by providing a flexible framework. AgentLite strengthens task-solving through task decomposition and multi-agent coordination, using a hierarchical approach in which a managing agent supervises the task execution of each worker agent. [105] introduces GPTSwarm, a framework that represents LLM-based agents as computational graphs in order to unify existing prompt engineering techniques, and introduces methods for optimizing these graphs to improve agent performance. The study verifies the framework's effectiveness on benchmarks such as MMLU, Mini Crosswords, and HumanEval, and reports significant improvements on the GAIA benchmark, with a margin of up to 90.2% over the best existing methods. Additionally, agents have shown strong capabilities in autonomous learning and decision-making in software engineering and security, which will be introduced in the subsequent Software Security section [106] [107] [108] [109].
C. Analysis

Overall, LLMs and LLM-based agents both exhibit strong capability in autonomous learning and decision making, but from slightly different perspectives. The differences are reflected in the focus of task execution, as well as in autonomy, interactivity, learning and adaptation mechanisms, and integration with other systems and modalities. From the perspective of task execution focus, LLMs primarily concentrate on enhancing specific functions in software engineering, such as debugging, problem-solving, and automated reasoning. The tasks they perform are usually static and well-defined: automatic debugging, improving debugging capabilities to autonomously identify and correct errors, evaluating creativity, or judging responses from other chatbots. In contrast, LLM-based agents not only address specific tasks but also manage multiple tasks simultaneously, often involving dynamic decision-making and interaction with other agents or systems. Examples include enhancing reasoning through multi-agent discussions, continuous learning from experience, real-time dynamic decision-making, and handling multimodal tasks in visual environments.
We can conclude that the application of LLM-based agents to autonomous learning and decision-making primarily involves exploring their performance in specific tasks through various framework designs. These studies evaluate the agents' autonomy and decision-making capabilities to determine whether they align with human behavior and decision processes. Looking at the specific task designs, in terms of autonomy and interactivity LLMs are usually designed to perform highly specific tasks without needing to adapt to external input or environmental changes; they mainly operate as single models focusing on processing and responding within predefined boundaries, which applies to all LLM applications. LLM-based agents, on the other hand, exhibit higher autonomy: they are typically designed to interact with or adapt to the environment in real time, and they are often part of multi-agent systems in which collaboration and communication are key components, for example using extra models or tools to assist with the planning phases. In terms of integration with other systems and modalities, LLMs typically operate in text input-output scenarios; even in multi-modal settings their role is usually limited to processing and generating text-based content. LLM-based agents are more likely to integrate with other systems and modalities, such as visual input or real-world perception data, enabling them to perform more complex and context-based decision-making tasks.

Fig. 6: The ExpeL [35] framework with Reflexion [98] in experience gathering.

Regarding learning and adaptation mechanisms, LLMs' adaptation and learning are usually confined to the model's training data and parameters; although they can adapt through new data updates, they lack the ability to learn continuously from real-time feedback and are more focused on using existing knowledge to solve problems and generate responses. In contrast, LLM-based agents are often equipped with experiential learning and real-time feedback adaptation mechanisms, allowing them to optimize strategies and responses through continuous interaction. A good example of an LLM-based agent framework is ExpeL [35], which builds on the earlier ReAct [36] and Reflexion [98] work, as shown in Figure 6. The framework uses a memory pool and an insights pool to let the LLM learn from past knowledge, thereby aiding subsequent decision-making. This autonomous decision-making capability is something traditional LLM frameworks cannot achieve.
D. Benchmarks

In the field of autonomous learning and decision-making, the benchmark datasets used by LLMs and LLM-based agents are quite similar in task handling and application requirements. Comparing them gives a deeper understanding of the strengths and weaknesses of both approaches across tasks and application contexts. For the specific dataset references, please see Table VII.
In the research on LLMs, the main datasets include Defects4J, MMLU, TransCoder, and MBPP, used primarily to evaluate model performance in specific domains and tasks. Defects4J is a software defect dataset widely used in software engineering, containing 525 real defects from 17 Java projects. It is designed to test the effectiveness of automated program repair and defect detection tools by providing a standardized benchmark that allows researchers to compare different methods. MMLU (Massive Multitask Language Understanding) is a large-scale benchmark covering 57 subjects, testing models on a broad spectrum of knowledge and reasoning abilities in multitask language understanding. Its questions range from elementary education to professional level, such as College Mathematics, Business Ethics, and College Chemistry, challenging the models' diverse knowledge base and reasoning capabilities. The TransCoder dataset focuses on code translation across programming languages, evaluating a model's ability to automatically translate code from one language to another; this is crucial for multilingual software development and maintenance and can greatly enhance development efficiency. MBPP (Mostly Basic Python Programming), introduced in a previous section, contains 427 Python programming problems covering basic concepts and standard library functions; it is widely used to test model performance in different programming scenarios and to evaluate the ability to generate correct and efficient code.
In contrast, LLM-based agents use datasets that emphasize multitasking and decision-making in complex scenarios. The main datasets include HotpotQA, ALFWorld, FEVER, WebShop, and MGSM. HotpotQA is a multi-hop question-answering dataset that requires models to reference content from multiple documents, evaluating information synthesis and reasoning in complex reasoning tasks. ALFWorld is a text-based environment simulation dataset requiring multi-step decision-making, where the model completes tasks in a virtual home environment; it combines natural language processing and decision-making, evaluating performance in dynamic, interactive tasks. The FEVER (Fact Extraction and VERification) dataset is used for fact verification: the model must verify the truthfulness of given statements and provide evidence, assessing capabilities in information retrieval and logical reasoning. WebShop is an online shopping environment simulation containing 1.18 million real-world products and human instructions; it tests performance in complex decision-making tasks such as completing shopping tasks and attribute matching. MGSM (Multimodal Generalized Sequence Modeling) is a multimodal dataset containing dialogue, creative writing, mathematical reasoning, and logical reasoning tasks, evaluating a model's comprehensive abilities in multimodal settings.
Comparatively, LLM datasets typically focus on single, static tasks such as code generation, mathematical reasoning, and creative writing, suitable for models working within predefined task scopes; datasets like Defects4J, MMLU, and MBPP help evaluate model capabilities in specific domains. LLM-based agents are better suited to complex, multitasking, dynamic environments whose datasets require handling multimodal inputs and real-time decision-making, showcasing their advantages in complex interactions and multitasking scenarios. Datasets like HotpotQA, ALFWorld, FEVER, and WebShop challenge models in information synthesis, dynamic decision-making and interaction, and multimodal tasks. This difference arises from the distinct design goals of the two: LLMs aim to optimize performance on single tasks, while LLM-based agents are designed to handle complex or multimodal tasks requiring higher autonomy and adaptability. It also reflects modern applications' demand for highly interactive, adaptive, and multifunctional AI systems, driving the development from single LLM models to multi-agent systems. Through these analyses, we can identify the different applications of LLMs and LLM-based agents in autonomous learning and decision-making; it is important to choose the appropriate framework to meet different task requirements in real-world applications.

E. Evaluation Metrics

Various evaluation metrics are used in research on LLMs and LLM-based agents to evaluate model performance on specific tasks and to analyze application effectiveness in this domain. Below, we discuss several representative studies, analyzing the evaluation metrics they employ and exploring the differences between LLMs and LLM-based agents in this field.
In research on LLMs, evaluation metrics primarily focus on model accuracy and task completion. In [90], researchers used the accuracy of a voting inference system, measured by the expected 0/1 loss (the proportion of correct responses), to assess model performance. This metric evaluates accuracy across multiple calls, reflecting the ability of LLMs to improve result accuracy via repeated inference. Common evaluation metrics in the literature include accuracy and sample efficiency: accuracy is the proportion of correct predictions, while sample efficiency measures the number of samples required to reach a given accuracy level. Together they assess both the predictive and decision-making ability of the model and its data-utilization efficiency during training. In [92], the evaluation metrics include possible patches, correct patches, precision, and developer accuracy. Possible patches are patches that pass all tests, while correct patches are semantically equivalent to the original developer patches; precision measures the proportion of correct patches among the possible patches, and developer accuracy assesses the correctness of patches, with and without explanations, through human evaluation. These metrics emphasize the model's explanatory capability and practical effectiveness in automated code repair, with increasing reliance on human evaluation. To assess model creativity, value, novelty, and surprise are used as creativity dimensions; quality, social acceptability, and similarity of generated works, as well as the ability to generate creative products, are also included in the evaluation. [110] used the success rate in the Game of 24 and the coherence of generated paragraphs in creative writing as evaluation metrics, assessing performance in problem-solving and text generation and showcasing LLMs' potential for solving complex problems and generating coherent text. In [95], consistency and success rate were used as evaluation metrics: consistency is the probability of agreement between two judges on randomly selected questions, which measures the alignment of LLM judges with human preferences, while success rate is used for specific tasks (such as the Game of 24) to measure the correct response rate.
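The quantities above are straightforward to compute once predictions are collected: accuracy is the complement of the expected 0/1 loss, and the consistency used in [95] is simply the rate at which two judges pick the same answer. The helper below is a small illustration of those definitions, not code taken from any of the cited artefacts.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """1 - expected 0/1 loss: fraction of predictions that exactly match the reference."""
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def consistency(judge_a: list[str], judge_b: list[str]) -> float:
    """Probability that two judges agree on randomly selected questions."""
    assert len(judge_a) == len(judge_b)
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

# Example: three verdicts from an LLM judge versus a human judge.
print(consistency(["A", "B", "A"], ["A", "B", "B"]))  # -> 0.666...
```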
In contrast, LLM-based agents use more diverse evaluation metrics that reflect their multi-agent, collaborative character. In [97], the metrics include a Perception Score (P-Score), a Cognition Score (C-Score), and an Action Score (A-Score), which together assess the model's perception, cognition, and action capabilities and demonstrate the overall performance of LLM-based agents on multimodal tasks. In multimodal applications, success rate (SR) is often the primary metric, evaluated through tasks such as HotpotQA and FEVER with exact-match success; such metrics focus on task completion and accuracy, showcasing the practical execution capability of LLM-based agents in different task environments. In [111], the evaluation metrics include practitioner feedback, efficiency, and accuracy. Practitioner feedback uses the Likert scale to collect satisfaction and performance feedback; the Likert scale is a commonly used psychometric tool for measuring an individual's attitude or opinion toward a particular statement, typically with five options: Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree. Efficiency and accuracy are measured through the effectiveness of model-executed qualitative data analysis as validated by practitioners. These metrics assess the agents' performance in qualitative data analysis, demonstrating their utility and accuracy in practical applications.
Comparing these metrics, we find that LLM studies use traditional metrics such as accuracy and sample efficiency to assess capability. In contrast, LLM-based agents handle more complex pipelines through multiple agents, which requires more comprehensive and diverse metrics to evaluate performance from multiple directions. LLM-based agents in multimodal and self-evolution tasks emphasize the integrated performance of perception, cognition, and action. This difference reflects LLMs' strengths in single-task optimization and LLM-based agents' potential in collaboratively handling complex tasks with a higher capacity for autonomous learning. Additionally, practical application metrics for LLM-based agents, such as practitioner feedback, efficiency, and accuracy, demonstrate their utility and user satisfaction in real-world scenarios. This evaluation approach assesses not only task completion but also the overall user experience, which can in turn gauge the human alignment of their decision-making capabilities.
VII. SOFTWARE DESIGN AND EVALUATION

The application of LLMs to software design and evaluation overlaps strongly with the previous topics. Software design is an early phase of software development, and the quality of the design directly impacts the quality of later development. Modern software engineering methodologies emphasize the integration of design and development to ensure that decisions made during the design phase translate seamlessly into high-quality code. Consequently, research on software design often explores aspects related to code generation and development, using LLMs for software development within a particular framework and architecture design. Software design frameworks often involve multiple stages of continuous refinement to achieve optimal results, which can be considered part of LLM applications in software development [83]. Similarly, [85] and [84] highlight the frequent use of tools or API interfaces when using LLMs to assist development and design, demonstrating an overlap with the topic of code generation and software development.
LLMs in software design and evaluation also intersect extensively with autonomous learning and decision making; the two topics are interrelated. Software design needs to consider system adaptability and learning capabilities to handle dynamic environments, so design evaluations involving autonomous learning and decision making naturally become a focal point of intersection. Many LLM techniques and methods find similar applications in both fields; for example, LLMs based on reinforcement learning can be used for automated design decisions and evaluations as well as for self-learning and optimization. Common applications of LLMs in software engineering involve fine-tuning models and prompt engineering techniques to continuously improve performance, particularly in software design and evaluation, where more sample learning is often required to ensure that model outputs align with user expectations [93] [102] [44] [111] [105] [96]. Additionally, requirement elicitation and specification in requirement engineering can also be considered part of software design and evaluation [51] [112]. This section reviews the main research achievements of LLMs in software design and evaluation in recent years, discussing their application scenarios and practical effects.

A. LLMs Tasks

In recent years, there has been extensive research on the use of LLMs in tasks such as automation, optimization, and code understanding. ChatGPT has been widely applied to software engineering tasks and has demonstrated excellent performance in log summarization, pronoun resolution, and code summarization, achieving a 100% success rate in both log summarization and pronoun resolution [113]. However, its performance on tasks such as code review and vulnerability detection is relatively poor, showing that it needs further improvement for more complex tasks. Another framework, EvaluLLM, addresses the limitations of traditional reference-based evaluation metrics (such as BLEU and ROUGE) by using LLMs to assess the quality of natural language generation (NLG) outputs [114]. EvaluLLM introduces an evaluation method that compares generated outputs in pairs and uses win-rate metrics to measure model performance; this approach simplifies evaluation while remaining consistent with human assessments, showcasing the broad application prospects of LLMs in generative tasks. Similarly, in the LLM evaluation domain, LLM-based NLG Evaluation provides a review and classification of current LLMs used for NLG evaluation, summarizing four main evaluation methods: LLM-derived metrics, prompt-based LLMs, fine-tuned LLMs, and human-LLM collaborative evaluations [115]. These methods demonstrate the potential of LLMs for evaluating generative outputs while noting challenges such as the need for improved evaluation metrics and further exploration of human-LLM collaboration.
There are also many novel application designs with LLMs in engineering design. One study explores software/hardware co-design strategies to optimize LLMs and applies them to design verification [116]. Through quantization, pruning, and operation-level optimization, this research demonstrates applications in high-level synthesis (HLS) design functionality verification: GPT-4 was used to generate HLS designs containing predefined errors to create a dataset called Chrysalis, which provides a valuable resource for evaluating and optimizing LLM-based HLS debugging assistants. The optimized LLM significantly improves inference performance, providing new possibilities for error detection and correction in the electronic design automation (EDA) field. In [117], the researchers introduce RaWi, a data-driven GUI prototyping approach. The framework allows users to retrieve GUIs from a repository, edit them, and quickly create new high-fidelity prototypes. An experiment comparing RaWi with a traditional GUI prototyping tool (Mockplus) measured how quickly and effectively users can create prototypes; RaWi outperformed on multiple benchmarks, with a 40% improvement on the precision@k metric. This study shows the potential of LLMs to improve efficiency during the prototyping phase of software design, allowing designers to iterate quickly on GUI designs and facilitating early detection of design flaws. With the new possibilities brought by LLMs, there has also been much discussion in the education field, with researchers exploring the implications of the prevalence of large language models for education [118]. One study indicates that ChatGPT shows significant potential but also limitations in answering questions from software testing courses [119]: ChatGPT was able to answer about 77.5% of the questions and provided correct or partially correct answers 55.6% of the time, but the correctness of its explanations was only 53.0%, indicating the need for further improvement in educational applications.
B. LLM-based Agents Tasks

The application of LLM-based agents in software design and evaluation improves development efficiency and code quality, and showcases the broad applicability and potential of LLM-based agents in practical software engineering tasks. [120] explores the current capabilities, challenges, and opportunities of autonomous agents in software engineering. The study evaluates Auto-GPT's performance across different stages of the software development lifecycle (SDLC), including software design, testing, and integration with GitHub. It finds that detailed contextual prompts significantly enhance agent performance in complex software engineering tasks, reducing errors and improving efficiency, and underscores the potential of LLM-based agents to automate and optimize various SDLC tasks. The paper also evaluates Auto-GPT's limitations, including task or goal skipping, generating unnecessary code or files (hallucinations), repetitive or looping responses, and a lack of task-completion verification mechanisms. These limitations can lead to incomplete workflows, inaccurate outputs, and unstable performance in practical applications.
[121] introduces ChatDev, the first virtual chat-driven software development company, a concept that uses LLMs not just for specific tasks but as central coordinators in a chat-based, multi-agent framework. This approach allows for more structured, efficient, and collaborative software development, exploring how chat-driven multi-agent systems can achieve efficient software design and evaluation, reduce code vulnerabilities, and improve development efficiency and quality. Experiments show that ChatDev can design and generate software in an average of 409.84 seconds at a cost of only $0.2967 while significantly reducing code vulnerabilities, indicating that chat-based multi-agent frameworks can improve software development efficiency and quality. A similar collaboration framework was introduced by a Microsoft research team: [122] demonstrates the effectiveness of using LLMs, particularly ChatGPT, as agent controllers to manage and execute various AI tasks. The HuggingGPT system uses ChatGPT to orchestrate the execution of tasks by the various AI models available on Hugging Face; the purpose is to test how effectively the system can handle complex AI tasks, including language, vision, and speech, by executing appropriate models based on user requests. The innovation lies in using LLMs not merely as tools for direct task execution but as central orchestrators that leverage existing AI models to fulfill complex tasks, expanding the practical applicability of LLMs beyond typical language tasks.
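The controller pattern HuggingGPT describes, where a chat model plans subtasks and dispatches each one to a specialised model, can be outlined as below. The registry and the `call_llm()` helper are hypothetical placeholders used for illustration; the real system resolves models on the Hugging Face hub rather than from a hard-coded table.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # the ChatGPT-style controller model

# Hypothetical registry mapping task types to specialised executors.
MODEL_REGISTRY = {
    "image-captioning": lambda x: f"[caption of {x}]",
    "translation":      lambda x: f"[translation of {x}]",
    "summarization":    lambda x: f"[summary of {x}]",
}

def orchestrate(user_request: str) -> list[str]:
    # Controller decomposes the request into typed subtasks.
    plan = call_llm(
        "Decompose the request into subtasks. Reply as JSON: "
        '[{"task": "<registry key>", "input": "<payload>"}]\n'
        f"Request: {user_request}"
    )
    results = []
    for step in json.loads(plan):
        # Controller delegates each subtask to the matching specialised model.
        executor = MODEL_REGISTRY.get(step["task"])
        results.append(executor(step["input"]) if executor else f"no model for {step['task']}")
    return results
```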
[123] proposes the LLMARENA benchmark framework to evaluate LLMs' capabilities in dynamic multi-agent environments. The idea is similar to ChatDev but innovates by shifting the focus from single-agent static tasks to dynamic, interactive multi-agent environments, providing a more realistic and challenging setting to assess the practical utility of LLMs; this mirrors real-world conditions where multiple agents (AI or human) interact and collaborate. Experiments show that the framework can test LLMs' spatial designing, strategic planning, and teamwork abilities in gaming environments, offering new possibilities and tools for designing and evaluating LLMs in multi-agent systems.
[124] introduces the "Flows" conceptual framework for structuring interactions between AI models and humans to improve reasoning and collaboration capabilities. The study conceptualizes processes as independent, goal-driven entities that interact through standardized message-based interfaces, enabling a modular and extensible design. This approach is inherently concurrency-friendly and supports complex nested AI interactions without having to manage complex dependencies. Experiments in competitive coding tasks show that Flows increases the AI model's problem-solving rate by 21 percentage points and the human-AI collaboration rate by 54 percentage points, demonstrating how modular design can enhance AI and human collaboration and thereby improve the software design and evaluation process.
[125] presents a new taxonomy for structurally understanding and analyzing LLM-integrated applications, providing new theories and methods for software design and evaluation. The taxonomy helps in understanding the integration of LLM components in software systems, laying a theoretical foundation for developing more effective and efficient LLM-integrated applications. Similarly, [126] explores the application of LLM-based agents to software maintenance tasks, improving code quality and reliability through a collaborative framework. This study would normally be categorized under the software maintenance domain, but it exhibits the iterative manner of the design structure. The framework uses task decomposition and multi-agent strategies to tackle complex engineering tasks that traditional one-shot methods cannot handle effectively; multiple agents can learn from each other, leading to improved software maintenance outcomes. Experiments show that multi-agent systems outperform single-agent systems in complex debugging tasks, indicating that this framework can be applied in software design to provide safer architectures.

C. Analysis

Overall, LLM applications in software design and evaluation typically focus on automating specific tasks, such as code generation and log summarization, with a tendency towards evaluating capability rather than producing implementations during the design phases. The process of software design is largely intertwined with software development and requirements engineering. As previously mentioned, the use of LLMs to assist software development often includes aspects of the software design process, particularly generating related design documentation; there is therefore relatively little research focused on using LLMs for higher-level software design tasks.
LLM-based agents expand the capabilities of LLMs by handling more complex workflows through intelligent decision-making and task execution: these agents can collaborate, dynamically adjust tasks, and gather and utilize external information. In software design and evaluation, a single model often cannot comprehensively consider both design and evaluation aspects, which is why many software developers are reluctant to entrust high-level tasks to AI. LLM-based agents, through collaborative work and a more refined division of roles, can complete design tasks efficiently and adapt to various application scenarios. However, the application of LLM-based agents in software design is commonly subsumed under software development; as previously discussed, self-reflection and reasoning before action occur during the software design phases. The ChatDev [121] framework uses role distribution to create a separate software design phase, which significantly increases flexibility and accuracy in the later development phases. In terms of efficiency and cost, LLMs are still slightly superior to LLM-based agents in text generation and vulnerability detection. However, handling tasks similar to software maintenance and root cause analysis requires more complex architectures, such as multi-turn dialogues, knowledge graphs, and RAG techniques, which can further benefit the design and evaluation phases.
D. Benchmarks

The benchmarks include public datasets and datasets crafted by the authors themselves, and the application scenarios also differ considerably, as shown in Table VIII. BigCloneBench is a benchmark dataset for code clone detection containing a large number of Java function pairs; the pairs are labelled as clones or non-clones and are used for training and evaluating clone detection models, with the main evaluation metric being the correct identification rate. The Chrysalis dataset, created by [116], contains over 1000 function-level designs from 11 open-source synthesizable HLS suites; it is primarily used to evaluate the effectiveness of LLM debugging tools in detecting and correcting injected errors in HLS designs, with the main evaluation metric being the effectiveness of error detection and correction. The CodexGLUE dataset is a comprehensive benchmark covering various code generation and understanding tasks such as code completion, code repair, and code translation, used to evaluate the performance of code generation models in practical programming tasks. In addition to these public datasets, some artificially simulated datasets are used, such as a simulated job fair environment dataset, which models a virtual job fair containing multiple task scenarios such as interviews, recruitment, and team project coordination; it is used to evaluate the coordination capabilities of generative agents in complex social tasks, with task coordination success rate and role matching accuracy as the main evaluation metrics.
Comparatively, LLM research tends to use specific, publicly available datasets such as BigCloneBench, which provide standardized evaluation benchmarks and aid the reproducibility and comparability of results. Research on LLM-based agents tends to use customized experimental settings or unspecified datasets, such as requirement documentation, emphasizing for instance that the experiments involve 70 user requirements without naming a particular dataset. This is usually because the research needs to evaluate performance from multiple angles, and general datasets are difficult to adapt perfectly to vertical application scenarios. Both LLMs and LLM-based agents use a variety of datasets covering tasks from code generation and code understanding to natural language generation and task management, since software design and evaluation is closely interrelated with the other topics. However, because LLM-based agents can be extended to application scenarios involving videos and pictures, agents such as Auto-GPT and HuggingGPT also use multimodal datasets that contain not only code and text but also images and speech. Moreover, compared with a single LLM framework, LLM-based agents need to be evaluated in more areas, so benchmarks also need to be considered separately; for example, LLMARENA is specially designed to test LLM performance in dynamic multi-agent environments, covering complex tasks such as spatial reasoning, strategic planning, and risk assessment.

E. Evaluation Metrics

In software design and evaluation, various studies have employed different evaluation metrics to measure the performance of LLMs and LLM-based agents across a range of tasks. Both LLM and LLM-based agent research use more than one metric to comprehensively assess model performance. LLM research tends to focus on traditional metrics such as accuracy, win rate, and consistency, while LLM-based agent research still considers those fundamental metrics but further introduces more complex measures, such as task coordination success rate and role matching accuracy. However, it cannot be definitively stated that future LLM-based agent research will always use more flexible, multi-dimensional evaluation metrics; this depends on the specific task and dataset. The reason for this phenomenon, as observed in this survey, is primarily that tasks in LLM research are relatively single-purpose, mainly static tasks such as log summarization evaluated with traditional methods, whereas LLM-based agent research involves more general multi-agent tasks whose evaluation emphasizes interactivity and dynamics. LLM-based agent research focuses more on the model's collaboration and decision-making capabilities, using multi-dimensional metrics to comprehensively assess potential in practical applications rather than accuracy alone. This explains why, despite the similarity of metrics such as accuracy and completion time, LLM-based agents use flexible evaluation metrics, including measures like mutual exclusiveness and appropriateness.
VIII. SOFTWARE TEST GENERATION

A crucial component of software development is software testing, which needs to be conducted continuously from initial system development to final deployment. In industry, agile development is commonly used, testing the system continuously at every stage to ensure the robustness of the entire system: whenever new code is committed to GitHub, tests are run to ensure the usability of the updated version. A common approach is to use Jenkins (https://www.jenkins.io/) to achieve continuous integration and continuous deployment; Jenkins automatically hooks into the developer's push to GitHub and runs a test suite against the new version. Although this process leans towards automated development, creating and refining test cases still requires substantial human effort.
Typical development roles involve software testing, such as writing unit tests, integration tests, and fuzz tests. Researchers have been attempting to use AI to help generate test cases since before 2000. Initial implementations typically involved simpler forms of AI and machine learning to automate parts of the test case generation process. Over time, more sophisticated methods such as natural language processing and machine learning models have been applied to improve the precision and scope of test case generation. Online tools such as Sofy (https://sofy.ai/), which uses machine learning to generate context-based paths through applications, also exist to aid in generating test suites. Using large language models to generate test cases is a relatively new attempt but has been developing rapidly. In 2020, researchers fine-tuned pre-trained language models on labeled data to generate test cases, developing a sequence-to-sequence transformer-based model called
"ATHENATEST", and compared its generated results with EvoSuite and GPT-3, demonstrating better test coverage [131]. More research and models are being dedicated to test suite generation experiments; for instance, the Codex model [67], mentioned earlier in the code generation section, combined with chain-of-thought prompting achieved high-quality test suite generation with CodeCoT, even in zero-shot scenarios. The introduction of LLMs aims to automate and streamline the testing process, making it more rigorous and capable of addressing aspects that humans might easily overlook.

A. LLMs Tasks

The application of LLMs in software test generation is extensive and encompasses more than test suite generation. The papers reviewed in this survey cover several aspects, including security test generation, bug reproduction, general bug reproduction, fuzz testing, and coverage-driven test generation. These tasks are achieved through various models and techniques, significantly improving software quality and reducing developers' workload. [132] evaluates the effectiveness of using GPT-4 to generate security tests, demonstrating how supply chain attacks can be conducted by exploiting dependency vulnerabilities. The study experimented with different prompt styles and templates to explore how varying information inputs affect test generation quality; tests generated by ChatGPT successfully discovered 24 proof-of-concept vulnerabilities in 55 applications, outperforming the existing tools TRANSFER [133] and SIEGE (https://siegecyber.com.au/services/penetration-testing/). This research introduces a new method for generating security tests using LLMs and provides empirical evidence of their potential in the security testing domain, offering developers a novel approach to handling library vulnerabilities in applications.
Another application is bug reproduction, which allows testers to locate and fix bugs more quickly and efficiently. [134] addresses the limitations of current bug reproduction methods, which are constrained by the quality and clarity of handcrafted patterns and predefined vocabularies. The paper proposes and evaluates AdbGPT, a framework that uses a large language model to automatically reproduce errors from Android bug reports; it is described as outperforming current SOTA approaches in automated bug replay for Android. Experimental results show that AdbGPT achieved accuracies of 90.4% and 90.8% in S2R entity extraction and a success rate of 81.3% in error reproduction, significantly outperforming the baseline ReCDroid and the ablation variants. By introducing prompt engineering, few-shot learning, and chain-of-thought reasoning, AdbGPT demonstrates the powerful capabilities of LLMs in automated error reproduction. It also uses GUI encoding to convert the GUI view hierarchy into HTML-like syntax, giving the LLM a clear understanding of the current GUI state. While AdbGPT is specialized for Android systems, [135] proposes the LIBRO framework, which uses LLMs to generate bug reproduction tests from bug reports. The experimental results show that LIBRO successfully reproduced 33.5% of bugs in the Defects4J dataset and 32.2% in the GHRB dataset. By combining advanced prompt engineering with post-processing techniques, LIBRO demonstrates the effectiveness and efficiency of LLMs in generating bug reproduction tests. Although LIBRO has lower absolute effectiveness than AdbGPT, it was tested across a more diverse set of Java applications and is not limited to Android; while AdbGPT excels at specialized bug replay for Android, LIBRO provides wider bug reproduction coverage for Java applications. The extensive application of LLMs to test generation tasks such as security test generation, bug reproduction, fuzz testing, program repair, and coverage-driven test generation highlights their significant potential to improve software quality and reduce the burden on developers; through various models and techniques, these tasks demonstrate how LLMs can automate and enhance the software testing process, addressing aspects that are often overlooked by humans.
Similarly, in fuzz testing, LLMs have shown promising potential. [136] developed Fuzz4All, a universal fuzzing tool that uses LLMs to generate and mutate inputs for various software systems. The tool addresses the problems of traditional fuzzers being tightly coupled to specific languages or systems and lacking support for evolving language features. The study conducted various experiments to test the tool's capabilities, including coverage comparison, bug finding, and targeted fuzzing. The results showed that Fuzz4All achieved the highest code coverage in all tested languages, with an average increase of 36.8%, and discovered 98 bugs across nine systems, considered state of the art in universal fuzzing with LLMs at the time. Through self-prompting and an LLM-driven fuzzing loop, Fuzz4All demonstrated the effectiveness of LLMs in fuzz testing and showcased their capability across multiple languages and systems under test (SUTs) through comprehensive evaluations.
ciently. [134] addresses the limitations of current bug repro- [137] introduced SymPrompt, a new code-aware prompting
duction methods, which are constrained by the quality and strategy aimed at addressing the limitations of existing Search-
clarity of handcrafted patterns and predened vocabularies. Based Software Testing (SBST) methods and traditional LLM
The paper proposes and evaluates a new method framework prompting strategies in generating high-coverage test cases. By
called AdbGPT, which uses a large language model to auto- decomposing the original test generation process into a multi-
matically reproduce errors from Android bug reports. AdbGPT stage sequence aligned with the execution paths of the method
is described as outperforming current SOTA approaches in the under test, SymPrompt generated high-coverage test cases.
context of automated bug replay for only Android system. The Experimental results indicated that SymPrompt increased cov-
experimental results show that AdbGPT achieved accuracies of erage on CodeGen2 and GPT-4 by 26% and 105% respectively.
90.4% and 90.8% in S2R entity extraction and a success rate Through path constraint prompting and context construction
of 81.3% in error reproduction, signicantly outperforming the techniques, SymPrompt demonstrated the potential of LLMs
baseline ReCDroid and ablation study versions. By introducing in generating high-coverage test cases. [138] also focused
prompt engineering, few-shot learning, and chain-of-thought on test suite coverage, this study introduced the COVERUP
reasoning, AdbGPT demonstrates the powerful capabilities system which generates high-coverage Python regression tests
of LLMs in automated error reproduction. It also uses GUI through coverage analysis and interaction with LLMs. The
encoding to convert the GUI view hierarchy into HTML-like experimental results showed that COVERUP increased code
syntax, providing LLMs with a clear understanding of the coverage from 62% to 81% and branch coverage from 35% to
current GUI state. While AdbGPT is specialized for Android 53% through iterative prompting and coverage-driven meth-
systems, [135] proposes the LIBRO framework, which uses ods. [139] proposed the AID method, which combines LLMs
LLMs to generate bug reproduction tests from bug reports. with differential testing to improve fault detection in ”plausibly
correct” software. By comparing the effectiveness of AID in
6 https://siegecyber.com.au/services/penetration-testing/ generating fault-revealing test inputs and oracles, the experi-
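Coverage-driven approaches such as COVERUP and CODAMOSA share a simple outer loop: measure coverage, ask the model for tests that exercise the still-uncovered code, keep only the candidates that raise coverage, and repeat. The Python sketch below illustrates that general loop; the llm and measure callables and the prompt wording are illustrative placeholders, not the interfaces of any surveyed tool.

    from typing import Callable, List, Set, Tuple

    # Hypothetical callables supplied by the caller:
    #   llm(prompt)    -> str, a block of pytest code proposed by the model
    #   measure(tests) -> (line coverage fraction, set of uncovered line numbers)
    def generate_tests_iteratively(
        source_code: str,
        llm: Callable[[str], str],
        measure: Callable[[List[str]], Tuple[float, Set[int]]],
        max_rounds: int = 5,
        target: float = 0.9,
    ) -> List[str]:
        tests: List[str] = []
        covered, uncovered = measure(tests)
        for _ in range(max_rounds):
            if covered >= target or not uncovered:
                break
            prompt = (
                "Module under test:\n" + source_code
                + "\nUncovered lines: " + ", ".join(map(str, sorted(uncovered)))
                + "\nWrite pytest functions that execute these lines."
            )
            candidate = llm(prompt)
            new_cov, new_uncov = measure(tests + [candidate])
            if new_cov > covered:  # keep a candidate only if it adds coverage
                tests.append(candidate)
                covered, uncovered = new_cov, new_uncov
        return tests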
[139] proposed the AID method, which combines LLMs with differential testing to improve fault detection in "plausibly correct" software. By comparing the effectiveness of AID in generating fault-revealing test inputs and oracles, the experiments showed that AID improved recall and precision by 1.80 times and 2.65 times respectively, and increased the F1 score by 1.66 times. By integrating LLMs with differential testing, AID showcased the powerful capability of LLMs in detecting complex bugs.

B. LLM-based Agents Tasks

In the field of software test generation, the application of LLM-based agents demonstrates their potential in automated test generation. While relying on LLM-based agents solely for software test generation might seem excessive, more research is directed towards vulnerability detection and system maintenance. LLM-based agents can enhance test reliability and quality by distributing tasks such as test generation, execution, and optimization across a multi-agent collaborative system. These multi-agent systems offer clear improvements in error detection and repair as well as coverage testing. An example of such a system is AgentCoder's multi-agent framework, discussed in the code generation and software development section [82]. The primary goal of this system is to leverage multiple specialized agents to iteratively optimize code generation, overcoming the limitations of a single-agent model in generating effective code and test cases. The paper introduces a test design agent, which creates diverse and comprehensive test cases, and a test execution agent, which executes the tests and provides feedback; the framework reached an 89.9% pass rate on the MBPP dataset. Similarly, the SocraTest framework falls under the Autonomous Learning and Decision Making topic [106]. This framework automates the testing process through conversational interactions; the paper presents detailed examples of generating and optimizing test cases using GPT-4, emphasizing how multi-step interactions enhance testing methods and generate test code. Experimental results show that, through conversational LLMs, SocraTest can effectively generate and optimize test cases and utilize middleware to facilitate interactions between the LLM and various testing tools, achieving more advanced automated testing capabilities.

The papers collected for the software test generation topic are mostly multi-agent systems. The study [140] evaluates the effectiveness of LLMs in generating high-quality test cases, identifies their limitations, and proposes a novel multi-agent framework called TestChain. The paper evaluates StarChat, CodeLlama, GPT-3.5, and GPT-4 on the HumanEval and LeetCode-hard datasets. Experimental results show that the TestChain framework, using GPT-4, achieved 71.79% accuracy on the LeetCode-hard dataset, an improvement of 13.84% over baseline methods; on the HumanEval dataset, TestChain with GPT-4 achieved 90.24% accuracy. The TestChain framework designs agents to generate diverse test inputs, maps inputs to outputs using ReAct-format dialogue chains, and interacts with the Python interpreter to obtain accurate test outputs.

LLM-based agents can also be applied in user acceptance testing (UAT). [141] aims to enhance the automation of the WeChat Pay UAT process by proposing a multi-agent collaborative system named XUAT-Copilot, which uses LLMs to automatically generate test scripts. The study evaluates XUAT-Copilot's performance on 450 test cases from the WeChat Pay UAT system, comparing it to a single-agent system and a variant without the reflection component. Experimental results show that XUAT-Copilot achieved a Pass@1 rate of 88.55%, compared to 22.65% for the single-agent system and 81.96% for the variant without the reflection component, with a Complete@1 rate of 93.03%. XUAT-Copilot employs a multi-agent collaborative framework, including action planning, state checking, and parameter selection agents, and uses advanced prompting techniques, demonstrating the potential and feasibility of LLMs in automating UAT test script generation.

C. Analysis

Fig. 7: Illustration of comparison framework between LLM-based agent [141] and LLM [136] in software test generation.

In comparison, LLMs perform well in single-task implementations, generating high-quality test cases through techniques like prompt engineering and few-shot learning, and the number of related studies is increasing as the capabilities of LLMs improve. LLM-based agents, on the other hand, decompose tasks for specialized processing through multi-agent collaborative systems, significantly enhancing the effectiveness and efficiency of test generation and execution through iterative optimization and feedback. Considering cost, using an LLM alone is often sufficient and cheaper for test generation than deploying LLM-based agents; however, if a specific model performs poorly, it can affect the entire system's performance.

A single LLM may struggle with complex, multi-step tasks. For example, in high-coverage test generation, LLMs may require more elaborate prompts and post-processing steps to achieve the desired results. Additionally, the quality of the generated results depends heavily on the prompt design. For tasks requiring fine-grained control and continuous optimization, a single LLM may find it difficult to cope.
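Several of the systems above are compared using Pass@1 or Complete@1 rates. For reference, the unbiased pass@k estimator popularized by the HumanEval evaluation, given n sampled solutions of which c pass all checks, can be computed as below; this is the generic formulation, not code taken from any surveyed paper.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k from n samples with c correct ones."""
        if n - c < k:  # every size-k subset must contain a correct sample
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Example: 10 generated test scripts per task, 4 of which pass execution.
    print(round(pass_at_k(10, 4, 1), 3))  # Pass@1 estimate of 0.4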
As shown in Figure 7, the LLM framework uses [136] as an example to demonstrate the use of LLMs in fuzz testing: the prompt is refined with the given code snippets (fuzz inputs) and re-selected by the LLM to choose the best prompt for future generation. This framework lacks autonomy; the LLM-based agent framework [141] on the left fills that gap, and it is also able to perceive the UI and interact with a skill library for its operations. The operation agent receives any error reported by the inspection agent and performs self-reflection to refine the process autonomously. However, as previously discussed, building an LLM-based agent framework solely for software test generation can be "overkill", so the collected papers on LLM-based agent systems generally focus on program repair via generated test cases or on bug replay systems; as shown in Figure 7, the LLM-based agent framework is actually used to automatically test the WeChat Pay system.

D. Benchmarks

For LLM tasks in software test generation, the Defects4J dataset is used to evaluate bug reproduction and program repair techniques. Other public datasets such as ReCDroid, ANDROR2+, and Themis are primarily used to evaluate mobile application bug reproduction and security test generation, particularly for the Android platform. GCC, Clang, the Go toolchain, the Java compiler (javac), and Qiskit serve as fuzz-testing targets for various programming languages and toolchains, aimed at assessing the effectiveness of fuzz testing in multi-language environments. TrickyBugs and EvalPlus are datasets containing complex bug scenarios, used to evaluate the precision and recall of generated test cases, and the benchmark applications evaluated by CODAMOSA are used to assess the effectiveness of coverage-based test generation tools.

The datasets used in LLM-based agents research are also quite common: HumanEval, MBPP, and LeetCode-hard are mainly used to evaluate the accuracy and coverage of code generation and test generation, involving the programming problems and challenges that appeared frequently in previous sections. Datasets like Codeflaws, QuixBugs, and ConDefects are collected to expose LLMs to erroneous code and programs, containing multiple program errors and defects, and are used to evaluate the effectiveness of automated debugging and bug repair. A unique dataset is the WeChat Pay UAT system, which includes user acceptance test cases from real applications and is used to evaluate the performance of multi-agent systems in user acceptance testing, focusing specifically on WeChat's security system.

Overall, the datasets used in LLM-based agents research are broader, covering a wide range of programming problems and challenges, while LLM research is more focused on the actual generation tasks, such as bug reproduction on the Android platform and fuzz testing in multi-language environments. This is because LLM-based agents not only focus on the quality of generated test cases and code but also evaluate the collaborative effects and iterative optimization capabilities of multi-agent systems, so the benchmarks also include datasets used to evaluate the performance of the framework itself. For instance, AgentCoder [82] improves the efficiency and accuracy of test generation and execution through multi-agent collaboration, considering both qualitative and quantitative evaluations and using MBPP and HumanEval for the evaluations; research on LLM-based agents places more emphasis on verifying the effectiveness of the system through qualitative evaluation and user feedback.

E. Evaluation Metrics

As seen in Table IX, LLM research predominantly utilizes traditional quantitative metrics such as bug reproduction rate, code coverage, precision, and recall; these metrics directly reflect the effectiveness and quality of test generation. In contrast, LLM-based agents research not only focuses on quantitative metrics but also introduces qualitative evaluations, such as improvements through conversational interactions and the collaborative effects of multi-agent systems. This diversified evaluation approach provides a more comprehensive reflection of the system's practical effects. From the task perspective, LLMs lean toward single-task processing, such as generating test sets and considering their coverage, whereas, given the broader scope of agent frameworks, LLM-based agents often tend to use the generated test sets to evaluate whether vulnerabilities can be found, achieving greater practicality. From a design perspective, LLM systems rely on prompt engineering and the generative capabilities of the models themselves, so their evaluation metrics mainly focus on the quality and effectiveness of the model outputs; agent evaluation metrics additionally cover the collaborative effects and efficiency within the system, such as improving Pass@1 and Complete@1 rates through multi-agent collaboration. Overall, LLMs are better suited for rapid test generation and evaluation on specific tasks, with evaluation metrics directly reflecting the generation's effectiveness and quality, while LLM-based agents excel in handling complex and diversified tasks, achieving higher system efficiency and effectiveness through multi-agent collaboration and iterative optimization.

IX. SOFTWARE SECURITY AND MAINTENANCE

In software engineering, software security and maintenance is a popular area for the application of LLMs, primarily aimed at enhancing the security and stability of software systems to meet the needs of users and developers. These models provide promising methods for vulnerability detection and repair, while also enabling automated security testing and innovative maintenance processes. The application of LLMs in software security and maintenance encompasses several aspects, including vulnerability detection, automatic repair, penetration testing, and system robustness evaluation. Compared to traditional methods, LLMs leverage natural language processing and generation technologies to understand and generate complex code and security policies, thereby automating detection and repair tasks. For example, LLMs can identify potential vulnerabilities by analyzing code structures and contextual information and can generate corresponding repair suggestions, which improves the efficiency and accuracy of vulnerability remediation.
Moreover, LLMs not only exhibit strong capabilities in vulnerability detection but also play a role in tasks like penetration testing and security evaluation, for example in automated penetration testing tools such as PENTESTGPT [143]. LLMs also demonstrate significant advantages in evaluating system robustness by simulating various attack scenarios to assess system performance under different conditions, helping developers better identify and address potential security issues. Research on LLM-based agents in software security and maintenance also keeps growing; these intelligent agents can execute complex code generation and vulnerability repair tasks and possess self-learning and optimization capabilities to handle issues encountered in dynamic development environments. Tools like RITFIS [144] and NAVRepair [145] have shown potential for improving the precision and efficiency of program repairs by using LLM-based agents.

A. LLMs Tasks

In the field of software security and maintenance, research on LLMs can be categorized into three main areas: vulnerability detection, automatic repair, and penetration testing, along with some evaluation studies. The collected papers on LLMs in this domain illustrate their diverse applications and potential.

1) Program Vulnerability: In the domain of vulnerability detection, researchers have fine-tuned LLMs to enhance the accuracy of source code vulnerability detection. [146] investigates the potential of applying LLMs to the task of vulnerability detection in source code and aims to determine whether the performance limits of CodeBERT-like models are due to their limited capacity and code understanding ability. The study fine-tuned the WizardCoder model (an improved version of StarCoder) and compared its performance with the ContraBERT model on balanced and unbalanced datasets. The experimental results showed that WizardCoder outperformed ContraBERT in both ROC AUC and F1 scores, significantly improving Java function vulnerability detection performance and achieving the state-of-the-art performance at that time by improving ROC AUC from 0.66 (CodeBERT) to 0.69.

Some studies have explored the application of pure LLMs, without any framework architecture, in vulnerability detection, uncovering current challenges. [147] evaluated only the performance of ChatGPT and GPT-3 models in detecting vulnerabilities in Java code; the study compared text-davinci-003 (GPT-3) and gpt-3.5-turbo against a baseline virtual classifier in binary and multi-label classification tasks. The experimental results showed that while text-davinci-003 and gpt-3.5-turbo had high accuracy and recall rates in binary classification tasks, their AUC (Area Under Curve) scores were only 0.51, indicating performance equivalent to random guessing. In multi-label classification tasks, gpt-3.5-turbo and text-davinci-003 did not significantly outperform the baseline virtual classifier in overall accuracy and F1 scores. These findings indicate that earlier models like GPT-3 have limited capabilities in practical vulnerability detection tasks, suggesting the need for further research and model optimization to improve their performance in real-world applications. Fine-tuning and optimizing LLMs can significantly enhance their performance in source code vulnerability detection; however, these models still face many challenges in practical applications, requiring further research and technological improvements to enhance their real-world effectiveness and reliability.
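The AUC and F1 figures quoted above are standard binary-classification metrics. As a point of reference, the snippet below shows how such scores are typically computed with scikit-learn; the labels and model scores are fabricated purely for illustration and do not reproduce any study's data.

    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

    # Hypothetical ground truth (1 = vulnerable) and model confidences for 8 functions.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.62, 0.48, 0.55, 0.51, 0.47, 0.52, 0.58, 0.44]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # thresholded labels

    print("ROC AUC :", round(roc_auc_score(y_true, y_score), 2))  # near 0.5 means random guessing
    print("F1      :", round(f1_score(y_true, y_pred), 2))
    print("Accuracy:", round(accuracy_score(y_true, y_pred), 2))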
In later work, [148] introduced a method to incorporate complex code structures directly into the model learning process. The GRACE framework combines graph structure information and in-context learning, using Code Property Graphs (CPGs) to represent code structure. By integrating the semantic, syntactic, and lexical similarities of code, GRACE addresses the limitations of text-based LLM analysis and improves the precision and recall of vulnerability detection; the study utilized three vulnerability datasets and showed a 28.65% improvement in F1 score over baseline models. Another important aspect of vulnerability detection is enhancing LLM performance on code security tasks. [149] fine-tuned LLMs for specific tasks and evaluated their performance against existing models such as ContraBERT. The researchers conducted numerous experiments to determine the optimal model architecture, training hyperparameters, and loss functions for vulnerability detection tasks. The study primarily focused on WizardCoder and ContraBERT, validating their performance through comparisons on balanced and unbalanced datasets and developing an efficient batch packing strategy that improved training speed. Results indicated that with appropriate fine-tuning and optimization, LLMs could surpass state-of-the-art models, contributing to more robust and secure software development practices.

Despite the development of numerous models, it is still necessary to investigate their practical effectiveness. [150] explored the effectiveness of code language models (code LMs) in detecting software vulnerabilities and identified significant flaws in existing vulnerability datasets and benchmarks. The researchers developed a new dataset called PRIMEVUL and conducted experiments with it, comparing PRIMEVUL with existing benchmarks such as BigVul to evaluate several code LMs, including state-of-the-art base models like GPT-3.5 and GPT-4, using various training techniques and evaluation metrics. The results revealed that existing benchmarks significantly overestimate the performance of code LMs. For example, a state-of-the-art 7B model scored an F1 of 68.26% on BigVul but only 3.09% on PRIMEVUL, highlighting the gap between current code language models' performance and the actual requirements of vulnerability detection.

2) Automating Program Repair: In the domain of software security and maintenance, LLMs have not only been applied to vulnerability detection but have also been used extensively for automating program repair. One study proposed using Round-Trip Translation (RTT) for automated program repair: researchers translated defective code into another language and then back into the original language to generate potential patches. The study used various language models and benchmarks to evaluate RTT's performance in APR, exploring how RTT performs when using programming languages as an intermediate representation, how it performs when using natural language (English) as an intermediate representation, and what qualitative trends can be observed in the patches generated by RTT. Three measurement standards and eight models were used in the experiments; the results showed that the RTT method achieved significant repair effects on multiple benchmarks, particularly excelling in terms of compilation and feasibility [151]. Similarly, in automated program repair, [145] introduced several innovative methods. For example, NAVRepair specifically targets C/C++ code vulnerabilities by combining node type information and error types; due to its unique pointer operations and memory management issues, C/C++ poses particular complexities. The framework uses Abstract Syntax Trees (ASTs) to extract node type information and combines it with CWE-derived vulnerability templates to generate targeted repair suggestions. The study evaluated NAVRepair on several popular LLMs (ChatGPT, DeepSeek Coder, and Magicoder) to demonstrate its effectiveness in improving code vulnerability repair; the results showed that NAVRepair achieved state-of-the-art performance on the C/C++ program repair task, improving repair accuracy by 26% compared to existing methods.

To address the two main limitations of existing fine-tuning methods for LLM-based program repair, namely the lack of reasoning about the logic behind code changes and the high computational costs of fine-tuning on large datasets, [152] introduced the MOREPAIR framework. This framework improves the performance of LLMs in automated program repair (APR) by simultaneously optimizing syntactic code transformations and the logical reasoning behind code changes. The study used techniques to enhance fine-tuning efficiency, such as QLoRA (Quantized Low-Rank Adaptation) [153] to reduce memory requirements and NEFTune (Noisy Embedding Fine-Tuning) [154] to prevent overfitting during fine-tuning. The experiments evaluated MOREPAIR on four open-source LLMs of different sizes and architectures (CodeLlama-13B, CodeLlama-7B, StarChat-alpha, and Mistral-7B) using two benchmarks, EvalRepair-C++ and EvalRepair-Java. The results indicated that CodeLlama improved by 11% and 8% on the first 10 repair suggestions for EvalRepair-C++ and EvalRepair-Java respectively. Another study introduced the PyDex system, which automatically repairs syntax and semantic errors in introductory Python programming assignments using LLMs; the system combines multimodal prompts and iterative querying to generate repair candidates and uses few-shot learning to improve repair accuracy. PyDex was evaluated on 286 real student programs from an introductory Python programming course and compared against three baselines; the results showed that PyDex significantly improved the repair rate and effectiveness compared to existing baselines [155].
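MOREPAIR's use of QLoRA corresponds to a configuration that is now common in the Hugging Face ecosystem: the base model is loaded in 4-bit precision and only low-rank adapters are trained. The sketch below shows such a setup; the model name, adapter hyperparameters, and target modules are illustrative assumptions rather than the settings reported in [152], and NEFTune-style embedding noise would additionally be enabled in the training loop.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    MODEL = "codellama/CodeLlama-7b-hf"  # illustrative choice of code LLM

    bnb = BitsAndBytesConfig(            # 4-bit quantization: the "Q" in QLoRA
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(                   # low-rank adapters keep the trainable share small
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()   # typically well under 1% of all weights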
[156] introduced a new system named RING that leverages large language models (LLMCs) to perform multilingual program repair across six programming languages. RING employs a prompt strategy that minimizes customization effort and comprises three stages: fault localization, code transformation, and candidate ranking. The results showed that RING was particularly effective in Python, successfully repairing 94% of errors on the first attempt. The study also introduced a new PowerShell command repair dataset, providing a valuable resource for the research community, and demonstrated that AI-driven automation makes program repair more efficient and scalable. Another study, [157], conducted a comprehensive investigation into function-level automated program repair, introducing a new LLM-based APR technique called SRepair. SRepair utilizes a dual-LLM framework to enhance repair performance, combining a repair suggestion model and a patch generation model: it uses chain-of-thought reasoning to generate natural language repair suggestions from auxiliary repair-related information and then uses these suggestions to generate the repaired function. The results showed that SRepair outperformed existing APR techniques on the Defects4J dataset, repairing 300 single-function errors, an improvement of at least 85% over previous techniques. This study demonstrated the effectiveness of the dual-LLM framework in function-level repair and, for the first time, achieved multi-function error repair, highlighting the significant potential of LLMs in program repair. By extending the scope of APR, SRepair paves the way for applying LLMs in practical software development and evaluation.

3) Penetration Testing: LLMs can also be applied in the field of penetration testing, where they are used to enhance the efficiency and effectiveness of automated penetration testing. Although not as frequently studied as vulnerability detection and automated repair, this review includes two relevant papers. [143] investigates the development and evaluation of an LLM-driven automatic penetration testing tool, PENTESTGPT. The main purpose of this study is to evaluate the performance of LLMs on practical penetration testing tasks and to address the issue of context loss during the penetration testing process. The paper introduces the three self-interaction modules of PENTESTGPT (reasoning, generation, and parsing) and provides empirical research based on benchmarks involving 13 targets and 182 sub-tasks, comparing the penetration testing performance of GPT-3.5, GPT-4, and Bard. The experimental results show that PENTESTGPT's task completion rate is 228.6% higher than GPT-3.5 and 58.6% higher than GPT-4. This study demonstrates the potential of LLMs in automated penetration testing, helping to identify and resolve security vulnerabilities and thereby enhancing the security and robustness of software systems.

A similar research paper explores the application of generative AI in penetration testing. [158] evaluates the effectiveness, challenges, and potential consequences of using generative AI tools (specifically ChatGPT 3.5) in penetration testing through practical application experiments. The research conducts a five-stage penetration test (reconnaissance, scanning, vulnerability assessment, exploitation, and reporting) on a vulnerable machine from VulnHub, integrating Shell GPT (sgpt) with ChatGPT's API to automate guidance throughout the penetration testing process. The experimental results demonstrate that generative AI tools can significantly speed up the penetration testing process and provide accurate and useful commands, enhancing testing efficiency and effectiveness. The study also indicates the need to consider potential risks and unintended consequences, emphasizing the importance of responsible use and human oversight. Assessing the robustness of systems is also a crucial part of development, and LLMs are used to develop and evaluate new testing frameworks that detect and improve the robustness of intelligent software systems. [144] introduces a robust input testing framework named RITFIS, designed to evaluate the robustness of LLM-based intelligent software against natural language inputs. The study adapts 17 existing DNN testing methods to LLM scenarios and empirically validates them on multiple datasets to highlight the current robustness deficiencies and limitations of LLM software. The study indicates that RITFIS effectively assesses the robustness of LLM software and reveals its vulnerabilities in handling complex natural language inputs. This research underscores the importance of robustness testing for LLM-based intelligent software and provides directions for improving testing methods to enhance reliability and security in practical applications.

B. LLM-based Agents Tasks

LLM-based agents are primarily applied in areas such as autonomous decision-making, task-specific optimization, and multi-agent collaboration, and these frameworks showcase strong potential in proactive defense. [159] aims to address the limitations of existing debugging methods that treat the generated program as an indivisible entity. By segmenting the program into basic blocks and verifying the correctness of each block against the task description, the proposed method LDB (Large Language Model Debugger) provides a more detailed and effective debugging tool that closely reflects human debugging practices. The study's experiments covered testing LDB on several benchmarks and comparing it with baseline models without a debugger and with those using traditional debugging methods (self-debugging with explanations and traces). LDB's accuracy increased from a baseline of 73.8% to 82.9% on the HumanEval benchmark, an improvement of 9.1%. In the domain of vulnerability detection, researchers have enhanced detection accuracy by combining Role-Based Access Control (RBAC) practices with deep learning of complex code structures.

[160] addresses the challenge of automatically and appropriately repairing access control (AC) vulnerabilities in smart contracts. The innovation of this paper lies in combining mined RBAC practices with LLMs to create a context-aware repair framework for AC vulnerabilities. The model primarily uses GPT-4, enhanced by a new method called ACFIX, which mines common RBAC practices from existing smart contracts and employs a Multi-Agent Debate (MAD) mechanism to verify the generated patches through debates between generator and verifier agents, ensuring correctness. Experimental results show that ACFIX successfully repaired 94.92% of access control vulnerabilities, significantly outperforming the baseline GPT-4's 52.54%. Another application in smart contracts is [161]; this paper introduces a two-stage adversarial framework, GPTLENS, which improves vulnerability detection accuracy through generation and discrimination phases. GPTLENS achieved a 76.9% success rate in detecting smart contract vulnerabilities, better than the 38.5% success rate of traditional methods.
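GPTLENS's generation and discrimination phases follow a generic adversarial pattern: one prompt role proposes candidate vulnerabilities and another scores and filters them. A minimal sketch of that pattern is shown below; the chat callable, the prompts, and the score threshold are placeholders rather than the actual GPTLENS implementation.

    from typing import Callable, List

    def detect_vulnerabilities(contract_src: str, chat: Callable[[str], str],
                               n_auditors: int = 3, threshold: float = 7.0) -> List[str]:
        # Generation phase: several "auditor" prompts each propose one candidate finding.
        candidates = [
            chat(f"You are smart-contract auditor #{i}. Report the single most likely "
                 f"vulnerability in the following Solidity code, naming the affected "
                 f"function:\n{contract_src}")
            for i in range(n_auditors)
        ]
        # Discrimination phase: a "critic" prompt scores each candidate from 0 to 10.
        kept = []
        for cand in candidates:
            reply = chat("Score this vulnerability report from 0 to 10 for correctness "
                         "and severity. Answer with a number only:\n" + cand)
            try:
                score = float(reply.strip().split()[0])
            except (ValueError, IndexError):
                continue  # discard critic output that cannot be parsed
            if score >= threshold:
                kept.append(cand)
        return kept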
Another study, [109], investigates the use of GPT-4 to automatically exploit disclosed but unpatched vulnerabilities; the experiments showed that the LLM-based agent achieved an 87% success rate in exploiting vulnerabilities when provided with CVE descriptions. Finally, in another LLM-based agent application to penetration testing, [107] employs GPT-3.5 to assist penetration testers by automating high-level task planning and low-level vulnerability discovery, thereby enhancing penetration testing capabilities. The experiments demonstrated successful automation of multiple stages of penetration testing, including high-level strategy formulation and low-level vulnerability discovery, showcasing the effectiveness of LLMs in penetration testing.

In the field of software repair through multi-agent collaboration, [162] proposes a dual-agent framework that enhances the automation and accuracy of repairing declarative specifications through iterative prompt optimization and multi-agent collaboration. The researchers compare the effectiveness of the LLM-based repair pipeline with several state-of-the-art Alloy APR techniques (ARepair, ICEBAR, BeAFix, and ATR); in the results, the framework repaired 231 defects in the Alloy4Fun benchmark, surpassing the 278 defects repaired by traditional tools. [142] developed and evaluated an automated debugging framework named FixAgent, which improves fault localization, repair generation, and error analysis through an LLM-based multi-agent system. Although this research primarily focuses on automated debugging, incorporating elements like fault localization and automated program repair (APR), it intersects with test generation, particularly in the validation phase for testing bug fixes. The study evaluates FixAgent's performance on the Codeflaws, QuixBugs, and ConDefects datasets, comparing it to 16 baseline methods, including state-of-the-art APR tools and LLMs. Experimental results show that FixAgent fixed 78 out of 79 bugs in the QuixBugs dataset, including 9 never-before-fixed bugs; in the Codeflaws dataset, FixAgent fixed 3982 out of 2780 defects, with a correctness rate of 96.5%. The framework includes specialized agents responsible for localization, repair, and analysis tasks and uses the rubber duck debugging principle. FixAgent demonstrates the powerful capabilities of LLMs in automated debugging, improving on existing APR tools and LLMs, and can be considered a state-of-the-art framework for LLM-based agent APR.

[46] introduces an automated program repair agent named RepairAgent; this agent dynamically generates prompts and integrates tools to automatically fix software bugs. The researchers also address the limitations of current LLM-based repair techniques, which typically involve fixed prompts or feedback loops that do not allow the model to gather comprehensive information about the bug or code. RepairAgent is an LLM-based agent designed to alternately collect information about the bug, gather repair ingredients, and validate the repairs, similar to how human developers fix bugs. RepairAgent achieved impressive results, repairing 186 bugs overall in the Defects4J benchmark, with 164 being correctly repaired, outperforming existing repair techniques and achieving state-of-the-art performance.

In the realm of software security, researchers have combined LLMs and security engineering models to improve security analysis and design processes. [163] proposes a complex hybrid strategy to ensure the reliability and security of software systems; this involves a concept-guided approach where LLM-based agents interact with system model diagrams to perform tasks related to safety analysis. [108] introduces the TrustLLM framework, which increases the accuracy and interpretability of smart contract auditing by customizing LLM capabilities to the specific requirements of smart contract code. This paper conducts experiments on a balanced dataset comprising 1,734 positive samples and 1,810 negative samples, comparing TrustLLM with other models such as CodeBERT, GraphCodeBERT, and several versions of GPT and CodeLlama. TrustLLM achieves an F1 score of 91.21% and an accuracy of 91.11%, outperforming the other models. Beyond software-level security design, LLMs can also be integrated into autonomous driving systems, as in [164], which has already been discussed in Section IV.

C. Analysis

Overall, the direction of LLM-based agents represents significant innovative advancement in software security and maintenance, demonstrating improvements across all areas. LLM-based agents use multi-agent collaboration and runtime information tracking to help with debugging tasks, whereas traditional LLM approaches often rely on fixed prompts or feedback loops to debug a given code snippet or program. In vulnerability detection, LLM-based agents combine RBAC practices and in-depth learning of complex code structures to improve the accuracy and efficiency of detecting vulnerabilities, while traditional LLM methods normally depend on extensive manual intervention and detailed guidance when handling tasks. LLM-based agents also demonstrate effectiveness in penetration testing by automating high-level task planning and low-level vulnerability exploration, thereby enhancing penetration testing capabilities. In contrast, traditional LLM methods are more suited to passive detection and analysis, lacking proactive testing and defense capabilities.

From the perspective of automation, LLM-based agents automate the detection and repair of software errors through multi-agent frameworks and dynamic analysis tools, improving the automation and accuracy of the repair process. Traditional LLM methods also perform well on various maintenance and debugging tasks, but they often lack autonomous decision-making and dynamic adjustment capabilities during the repair process. In terms of software security, intelligent agents become more flexible by combining LLMs with security engineering models to improve security analysis and design processes, thereby enhancing the reliability and security of software systems; approaches that handle security tasks with LLMs alone often rely on static analysis and lack adaptability and optimization capabilities. As shown in Figure 8, the comparison uses MOREPAIR [152] for LLMs and RepairAgent [46] for LLM-based agents: the LLM framework applies optimization techniques (QLoRA, NEFTune) to generate repair advice, whereas RepairAgent utilizes multiple tools during inspection, which improves the precision and accuracy of the analysis before the repair is produced.
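RepairAgent-style autonomy essentially lets the model choose its next tool (inspect code, run tests, apply a patch) instead of following a fixed prompt-and-feedback script. The loop below is an illustrative skeleton of that idea; the tool set, the JSON action format, and the chat helper are simplifying assumptions, not RepairAgent's actual interface.

    import json
    from typing import Callable, Dict

    def repair_loop(bug_report: str, tools: Dict[str, Callable[[str], str]],
                    chat: Callable[[str], str], max_steps: int = 10) -> str:
        """Alternate between gathering information and patching until tests pass."""
        history = "Bug report:\n" + bug_report
        for _ in range(max_steps):
            decision = chat(
                history + "\nAvailable tools: " + ", ".join(tools)
                + '\nReply with JSON: {"tool": <name>, "argument": <string>}.'
            )
            try:
                action = json.loads(decision)
                observation = tools[action["tool"]](action["argument"])
            except (json.JSONDecodeError, KeyError, TypeError):
                observation = "Invalid action; reply with the JSON format only."
            history += "\nAction: " + decision + "\nObservation: " + observation
            if "ALL TESTS PASSED" in observation:  # success marker from the test-runner tool
                break
        return history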
Time overhead and query number are used to assess the efficiency and resource consumption of models when performing specific tasks. Additionally, ROC AUC, F1 score, and accuracy are important for evaluating a model's ability to identify vulnerabilities, especially in binary classification tasks. In code repair tasks, metrics such as compilability and plausibility are very common; these metrics ensure that the generated solutions are correct and deployable. Common standards like BLEU and CodeBLEU are used to evaluate the quality and human-likeness of generated code, which helps determine whether the model's capabilities and performance are comparable to human performance. Furthermore, domain-specific metrics like tree edit distance and test pass rate are used to evaluate the effectiveness of LLM applications in specialized fields of software engineering; these metrics address the particular demands of software security and maintenance. In contrast, while LLM-based agents use evaluation metrics similar to those used for LLMs, such as success rate, they also incorporate more subjective metrics, including appropriateness, relevance, and adequacy, which are human-judged standards. Overall, the evaluation metrics used by agents tend to be simpler and easier to understand than those used for LLMs. This is likely because agents handle high-level tasks, such as the success rate of generating potential vulnerabilities and the frequency of agents calling external tools, so they also need to account for the computational and time overheads of the overall architecture.

Comparing these metrics, LLM studies emphasise the success rate of individual testing methods, while LLM-based agent studies focus more on overall task completion time, cost, and effectiveness. LLMs typically use binary classification metrics like ROC AUC and F1 score, whereas agents tend to emphasize success rate and accuracy during both the generation and validation phases, providing a more comprehensive evaluation. For time cost and performance, LLM studies mainly report the execution time of testing methods and the number of queries to assess efficiency; in contrast, LLM-based agent studies focus more on the completion time of repair tasks and the number of API calls, ensuring the efficiency and practicality of the overall architecture.

X. DISCUSSION

A. Experiment Models

In Sections III to VIII, we reviewed the research on LLM and LLM-based agent applications in software engineering in recent years. These studies have different research directions, and we divided them into six subtopics for classification and discussion. With the advancement of large language models, thousands of models have entered the public eye; in order to understand more intuitively the application of large language models in various fields and their use as the core of intelligent agents, we summarized a total of 117 papers, mainly to discuss the frequency of use of LLMs in the field of software engineering.

Based on the review of the 117 papers, our primary focus is on the models or frameworks utilized by the authors in their experiments. This is because these papers often include tests of model performance in specific domains, such as evaluating LLaMA's performance in code generation. Therefore, during our data collection process, we also included models used for comparison purposes, as these models often represent the state-of-the-art capabilities in their respective fields at the time of the study. In summary, across the 117 papers, we identified a total of 79 unique large language models. We visualized the frequency of these model names in a word cloud for a more intuitive representation, as shown in Figure 9. From the figure, we can observe that models such as GPT-3.5, GPT-4, LLaMA2, and Codex are frequently used. Although closed-source LLMs cannot be locally deployed or further trained, their exceptional capabilities make them a popular choice for comparison in experiments or for data augmentation, where GPT-4 is used to generate additional data to support the research model frameworks. For instance, researchers might use OpenAI's API to generate initial text and then employ locally deployed models for further processing and optimization [76] [122] [119] [113].

It is therefore not difficult to see that the use of general-purpose large models with superior performance to assist development, or to serve as a measurement standard, has become increasingly common in the vertical field of software engineering over the past two years. In addition, for fields that had never been touched by LLMs before, many researchers first refer to ChatGPT and conduct various performance experiments on the newer GPT-4 [55] [58] [64]. These models can be integrated into larger systems and combined with other machine learning models and tools; for example, they can generate natural language responses while another model handles intent recognition and dialogue management.

Fig. 9: Experiment Models Usage Word Cloud.

Although the word cloud provides a rough overview of model usage frequency, it lacks detailed information. To gain deeper insights, we combined a grouped bar chart and a stacked bar chart to further analyze the usage of models in studies across different subtopics. The corresponding bar charts are presented in Figure 10. During the analysis, we found that a large number of models appeared only once; including these in the bar chart would have made the overall representation cluttered. Therefore, we excluded models that appeared only once and focused on the versatility of the remaining models. On the left side of each subtopic, we depict the models used in LLM-related studies, with the models used in LLM-based agent-related studies highlighted in red-bordered bars.
From the figure, it is evident that in the Autonomous Learning and Decision Making subtopic, the number of models used in LLM-based agent-related studies is quite high. Specifically, GPT-4 and GPT-3.5 were used in 10 out of 18 papers and 15 out of 18 papers, respectively. In this subtopic, studies commonly utilized GPT-3.5/4 and LLaMA-2 for research and evaluation. During our analysis, we found that many studies on LLM-based agents evaluated the agents' ability to mimic human behavior and decision-making or to perform reasoning tasks [103] [111] [108]. Since these studies do not require local deployment, they mainly assess the performance of state-of-the-art models in specific directions, leading to the frequent use of GPT-family models. Frameworks like [98] [36] constructed LLM-based agents by calling the GPT-4 API, using verbal reinforcement to help language agents learn from their mistakes. Due to the limitations of GPT models, many studies also used LLaMA as the core LLM for agents, fine-tuning it on generated datasets to evaluate the emergence of knowledge and capabilities. Overall, we found that in the Autonomous Learning and Decision Making subtopic, LLM-based agents often use multiple models for testing and performance evaluation within a single task, which results in a significantly higher model usage frequency in this topic compared to others.

Not only in the Autonomous Learning and Decision Making subtopic, but also across other themes, we observe that the variety of models (represented by the number of colors) used by LLM-based agents is relatively limited. For instance, in the requirement engineering and documentation subtopic, only GPT-3.5 and GPT-4 were involved in the experiments.
Fig. 10: Experiment Models Usage in Different Subtopics (REQ denotes "Requirement Engineering and Documentation", CODE denotes "Code Generation and Software Development", AUTO denotes "Autonomous Learning and Decision Making", DES denotes "Software Design and Decision Making", SEC denotes "Software Security and Maintenance").

To analyze the reasons behind this phenomenon, we need to set aside the facts that models appearing only once were not counted and that there are inherently fewer studies on intelligent agents. We believe the pattern primarily reflects the integration relationship between agents and large language models. The combination of these two technologies aims to address the limitations of large language models in specific tasks or aspects; intelligent agents allow researchers to design a more flexible framework and incorporate the large language model into it. These models, having been trained on vast amounts of data, possess strong generalizability, making them suitable for a wide range of tasks and domains and allowing for further reasoning, planning, and task execution.

Therefore, researchers and developers can use the same model to address multiple issues, reducing the need for a variety of models. In code generation [83] [79], test case generation [140] [142], and software security [167] [159], there are instances of using CodeLlama. This model is fine-tuned and optimized on the LLaMA architecture; at its release, it was considered one of the state-of-the-art models for code generation and understanding tasks, showing strong performance and potential compared to other models like Codex. Another potential reason is that previous successful applications and research outcomes have demonstrated these models' effectiveness, further enhancing researchers' trust in and reliance on them. Compared to models that perform well only in specific domains, intelligent agent development shows a preference for general-purpose large models, to ensure that the core of the agent possesses excellent text comprehension abilities. From Figure 10, we can also observe that research in the code generation and software development fields adopts a wide variety of models, further indicating the extensive attention this area receives and the strong performance of models on code generation tasks.

B. Topics Overlapping

Figure 11 shows the distribution of all collected literature across the six themes. For LLM-type literature, the theme of software security and maintenance accounts for nearly 30%, whereas test case generation accounts for less than 10%. This trend is similarly reflected in the LLM-based agent literature. Research on using LLM-based agents to address requirements engineering and test case generation is relatively sparse: requirements engineering is a new endeavor for LLM-based agents, and using an entire agent framework to generate test cases might be considered excessive. Therefore, more research tends to evaluate and explore the changes LLMs bring within the agent framework, such as autonomous decision-making abilities and capabilities in software maintenance and repair.
TABLE XI: Overlap of Papers Among Different Topics (REQ denotes "Requirement Engineering and Documentation", CODE denotes "Code Generation and Software Development", AUTO denotes "Autonomous Learning and Decision Making", DES denotes "Software Design and Decision Making", SEC denotes "Software Security and Maintenance").

Table XI presents the number of papers spanning multiple themes. For instance, five papers can be classified under both software security and maintenance and autonomous learning and decision making. These two themes also overlap the most with other themes, indicating that LLM and LLM-based agent research is broad and that these tasks often require integrating knowledge and techniques from various fields such as code generation, design, and testing. The significant overlap reflects the close interrelation between these themes and other areas. For example, autonomous learning and decision-making often involve the model's ability to autonomously learn and optimize decision trees, techniques that are applied in many specific software engineering tasks. Similarly, software security and maintenance typically require a combination of multiple techniques to enhance security, such as automatic code generation tools and automated testing frameworks [71] [80] [83] [102]. The overlap in the literature highlights the increasing need to integrate methods and techniques from different research areas within software engineering. For instance, ensuring software security relies not only on security measures but also on leveraging code generation, automated testing, and design optimization technologies. Similarly, autonomous learning and decision-making require a comprehensive consideration of requirements engineering, code generation, and system design. Moreover, it suggests that certain technologies and methods possess strong commonality: LLM-based agents, for instance, enhance capabilities in code generation, test automation, and security analysis through autonomous learning and decision-making. This sharing of technologies promotes knowledge exchange and technological dissemination across the various fields within software engineering.

C. Benchmarks and Metrics

Figure 12 shows the distribution of common benchmarks across the six topics. In reality, the number of benchmark datasets used is far greater than what is shown in the figure; different software engineering tasks use various benchmark datasets for evaluation and testing. For instance, in requirements engineering, researchers often collect user stories or requirement specifications as datasets [55] [63], and these are not well-known public datasets, so they were not included in the statistics. Alternatively, some studies specify their datasets as "customized GitHub datasets" [168]. Therefore, the benchmark datasets shown in the figure represent commonly used public datasets. For example, MBPP and HumanEval, which have been introduced in previous sections, are frequently used. We can also observe that, apart from the common public datasets, the datasets used in LLM and LLM-based agent tasks are different.
For instance, the FEVER dataset (https://fever.ai/dataset/fever.html) is often used in agent-related research. In [35], the FEVER dataset is used to test the ExpeL agent's performance on fact verification tasks. Similarly, the HotpotQA dataset (https://hotpotqa.github.io/) is frequently used in agent-related research for knowledge-intensive reasoning and question-answering tasks. When handling vulnerability repair tasks, LLMs often use the Defects4J benchmark (https://github.com/rjust/defects4j). This dataset contains 835 real-world defects from multiple open-source Java projects, categorized into buggy and repaired versions, and is typically used to evaluate the effectiveness of automated program repair techniques. Despite its extensive use in LLM research, Defects4J is relatively less used in LLM-based agents research. We speculate that this may be because Defects4J primarily evaluates single code repair tasks, which do not fully align with the multi-task and real-time requirements of LLM-based agents. Additionally, new datasets like ConDefects have been introduced [142], focusing on addressing data leakage issues and providing more comprehensive defect localization and repair evaluations.

Figure 13 shows the top ten evaluation metrics for LLMs and LLM-based agents. The analysis reveals that the evaluation methods used by both are almost identical. In previous sections, we also discussed that for agents it is necessary to consider time and computational resource consumption, which is evident from the pie chart. Meanwhile, many studies focus on the code generation capabilities of LLMs, so more evaluation metrics pertain to the correctness and exact match of the generated code [73] [69] [30]; overall, however, the evaluation metrics for LLMs and LLM-based agents in software engineering applications are quite similar.

XI. CONCLUSION

In this paper, we conducted a comprehensive literature review on the application of LLMs and LLM-based agents in software engineering. We categorized software engineering into six topics: requirement engineering and documentation, code generation and software development, autonomous learning and decision making, software design and evaluation, software test generation, and software security and maintenance. For each topic, we analyzed the tasks, benchmarks, and evaluation metrics, distinguishing between LLMs and LLM-based agents and discussing the differences and impacts they bring. We further analyzed and discussed the models used in the experiments of the 117 collected papers. Additionally, we provided statistics on and distinctions between LLMs and LLM-based agents regarding datasets and evaluation metrics. The analysis revealed that the emergence of LLM-based agents has led to extensive research and applications across various software engineering topics, demonstrating different emphases compared to traditional LLMs in terms of tasks, benchmarks, and evaluation metrics.

REFERENCES

[1] S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan, "Bugram: bug detection with n-gram language models," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp. 724-735, 2016.
[2] A. Vogelsang and M. Borg, "Requirements engineering for machine learning: Perspectives from data scientists," in 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), (Jeju, Korea (South)), pp. 245-251, 2019.
[3] "ChatGPT: Optimizing language models for dialogue," 11 2022. [Online; accessed 17-July-2024].
[4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[5] N. Jain, S. Vaidyanath, A. Iyer, N. Natarajan, S. Parthasarathy, S. Rajamani, and R. Sharma, "Jigsaw: large language models meet program synthesis," in Proceedings of the 44th International Conference on Software Engineering, ICSE '22, (New York, NY, USA), pp. 1219-1231, Association for Computing Machinery, 2022.
[6] T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen, "Long-context LLMs struggle with long in-context learning," 2024.
[7] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, "Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond," ACM Trans. Knowl. Discov. Data, vol. 18, Apr. 2024.
[8] A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, "Large language models for software engineering: Survey and open problems," in 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pp. 31-53, 2023.
[9] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. Zhao, Z. Wei, and J. Wen, "A survey on large language model based autonomous agents," Frontiers of Computer Science, vol. 18, no. 6, pp. 186345-, 2024.
[10] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui, "The rise and potential of large language model based agents: A survey," 2023.
[11] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 9459-9474, Curran Associates, Inc., 2020.
[12] GitHub, Inc., "GitHub Copilot: Your AI pair programmer." https://github.com/features/copilot, 2024. [Online; accessed 17-July-2024].
[13] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[14] N. R. Jennings, "A survey of agent-oriented software engineering," Knowledge Engineering Review, vol. 15, no. 4, pp. 215-249, 2000.
[15] Y. Bisk, A. Holtzman, J. Thomason, J. Andreas, Y. Bengio, J. Chai, M. Lapata, A. Lazaridou, J. May, A. Nisnevich, N. Pinto, and J. Turian, "Experience grounds language," 2020.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824-24837, 2022.
[17] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, "Large language models for software engineering: A systematic literature review," 2024.
[18] Z. Zheng, K. Ning, J. Chen, Y. Wang, W. Chen, L. Guo, and W. Wang, "Towards an understanding of large language models in software engineering tasks," 2023.
[19] A. Nguyen-Duc, B. Cabrero-Daniel, A. Przybylek, C. Arora, D. Khanna, T. Herda, U. Rafiq, J. Melegati, E. Guerra, K.-K. Kemell, M. Saari, Z. Zhang, H. Le, T. Quan, and P. Abrahamsson, "Generative artificial intelligence for software engineering - a research agenda," 2023.
[20] W. Ma, S. Liu, Z. Lin, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, L. Li, and Y. Liu, "Lms: Understanding code syntax and semantics for code analysis," 2024.
[21] Z. Yang, Z. Sun, T. Z. Yue, P. Devanbu, and D. Lo, "Robustness, security, privacy, explainability, efficiency, and usability of large language models for code," 2024.
[22] Y. Huang, Y. Chen, X. Chen, J. Chen, R. Peng, Z. Tang, J. Huang, F. Xu, and Z. Zheng, "Generative software engineering," 2024.
[23] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[24] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[25] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[27] L. Floridi and M. Chiriatti, "Gpt-3: Its nature, scope, limits, and consequences," Minds and Machines, vol. 30, pp. 681–694, 2020.
[28] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., "Palm: Scaling language modeling with pathways," Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[29] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, "Opt: Open pre-trained transformer language models," 2022.
[30] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, "Codet5+: Open code large language models for code understanding and generation," arXiv preprint arXiv:2305.07922, 2023.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[32] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[33] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., "Llama: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[34] J. X. Chen, "The evolution of computing: Alphago," Computing in Science & Engineering, vol. 18, no. 4, pp. 4–7, 2016.
[35] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, "Expel: Llm agents are experiential learners," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19632–19642, 2024.
[36] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "React: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.
[37] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents," in International Conference on Machine Learning, pp. 9118–9147, PMLR, 2022.
[38] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," arXiv preprint arXiv:2305.16291, 2023.
[39] C. Whitehouse, M. Choudhury, and A. F. Aji, "Llm-powered data augmentation for enhanced cross-lingual performance," 2023.
[40] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, "A prompt pattern catalog to enhance prompt engineering with chatgpt," arXiv preprint arXiv:2302.11382, 2023.
[41] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context," arXiv preprint arXiv:2403.05530, 2024.
[42] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[43] K. An, F. Yang, L. Li, Z. Ren, H. Huang, L. Wang, P. Zhao, Y. Kang, H. Ding, Q. Lin, et al., "Nissist: An incident mitigation copilot based on troubleshooting guides," arXiv preprint arXiv:2402.17531, 2024.
[44] J. Li, Q. Zhang, Y. Yu, Q. Fu, and D. Ye, "More agents is all you need," 2024.
[45] Y. Dubois, C. X. Li, R. Taori, T. Zhang, I. Gulrajani, J. Ba, C. Guestrin, P. S. Liang, and T. B. Hashimoto, "Alpacafarm: A simulation framework for methods that learn from human feedback," Advances in Neural Information Processing Systems, vol. 36, 2024.
[46] I. Bouzenia, P. Devanbu, and M. Pradel, "Repairagent: An autonomous, llm-based agent for program repair," arXiv preprint arXiv:2403.17134, 2024.
[47] E. Musumeci, M. Brienza, V. Suriani, D. Nardi, and D. D. Bloisi, "Llm based multi-agent generation of semi-structured documents from semantic templates in the public administration domain," in International Conference on Human-Computer Interaction, pp. 98–117, Springer, 2024.
[48] X. Luo, Y. Xue, Z. Xing, and J. Sun, "Prcbert: Prompt learning for requirement classification using bert-based pretrained language models," in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–13, 2022.
[49] T. Hey, J. Keim, A. Koziolek, and W. F. Tichy, "Norbert: Transfer learning for requirements classification," in 2020 IEEE 28th International Requirements Engineering Conference (RE), pp. 169–179, 2020.
[50] J. Zhang, Y. Chen, N. Niu, and C. Liu, "Evaluation of chatgpt on requirements information retrieval under zero-shot setting," Available at SSRN 4450322, 2023.
[51] C. Arora, J. Grundy, and M. Abdelrazek, "Advancing requirements engineering through generative ai: Assessing the role of llms," in Generative AI for Effective Software Development, pp. 129–148, Springer, 2024.
[52] M. Krishna, B. Gaur, A. Verma, and P. Jalote, "Using llms in software requirements specifications: An empirical evaluation," 2024.
[53] L. Ma, S. Liu, Y. Li, X. Xie, and L. Bu, "Specgen: Automated generation of formal program specifications via large language models," 2024.
[54] C. Flanagan and K. R. M. Leino, "Houdini, an annotation assistant for esc/java," in FME 2001: Formal Methods for Increasing Software Productivity (J. N. Oliveira and P. Zave, eds.), (Berlin, Heidelberg), pp. 500–517, Springer Berlin Heidelberg, 2001.
[55] J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design, pp. 71–108. Cham: Springer Nature Switzerland, 2024.
[56] D. Luitel, S. Hassani, and M. Sabetzadeh, "Improving requirements completeness: Automated assistance through large language models," Requirements Engineering, vol. 29, no. 1, pp. 73–95, 2024.
[57] A. Moharil and A. Sharma, "Identification of intra-domain ambiguity using transformer-based machine learning," in Proceedings of the 1st International Workshop on Natural Language-Based Software Engineering, NLBSE '22, (New York, NY, USA), pp. 51–58, Association for Computing Machinery, 2023.
[58] K. Ronanki, B. Cabrero-Daniel, and C. Berger, "Chatgpt as a tool for user story quality evaluation: Trustworthy out of the box?," in Agile Processes in Software Engineering and Extreme Programming – Workshops (P. Kruchten and P. Gregory, eds.), (Cham), pp. 173–181, Springer Nature Switzerland, 2024.
[59] A. Poudel, J. Lin, and J. Cleland-Huang, "Leveraging transformer-based language models to automate requirements satisfaction assessment," 2023.
[60] E. Musumeci, M. Brienza, V. Suriani, D. Nardi, and D. D. Bloisi, "Llm based multi-agent generation of semi-structured documents from semantic templates in the public administration domain," in Artificial Intelligence in HCI (H. Degen and S. Ntoa, eds.), (Cham), pp. 98–117, Springer Nature Switzerland, 2024.
[61] S. Zhang, J. Wang, G. Dong, J. Sun, Y. Zhang, and G. Pu, "Experimenting a new programming practice with llms," 2024.
[62] A. Nouri, B. Cabrero-Daniel, F. Törner, H. Sivencrona, and C. Berger, "Engineering safety requirements for autonomous driving with large language models," 2024.
[63] Z. Zhang, M. Rayhan, T. Herda, M. Goisauf, and P. Abrahamsson, "Llm-based agents for automating the enhancement of user story quality: An early report," in Agile Processes in Software Engineering and Extreme Programming (D. Šmite, E. Guerra, X. Wang, M. Marchesi, and P. Gregory, eds.), (Cham), pp. 117–126, Springer Nature Switzerland, 2024.
[64] K. Ronanki, C. Berger, and J. Horkoff, "Investigating chatgpt's potential to assist in requirements elicitation processes," in 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 354–361, 2023.
[65] D. Xie, B. Yoo, N. Jiang, M. Kim, L. Tan, X. Zhang, and J. S. Lee, "Impact of large language models on generating software specifications," 2023.
[66] A. Moharil and A. Sharma, "Tabasco: A transformer based contextualization toolkit," Science of Computer Programming, vol. 230, p. 102994, 2023.
[67] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," 2021.
[68] A. Ni, P. Yin, Y. Zhao, M. Riddell, T. Feng, R. Shen, S. Yin, Y. Liu, S. Yavuz, C. Xiong, S. Joty, Y. Zhou, D. Radev, and A. Cohan, "L2ceval: Evaluating language-to-code generation capabilities of large language models," 2023.
[69] R. Sun, S. Ö. Arik, A. Muzio, L. Miculicich, S. Gundabathula, P. Yin, H. Dai, H. Nakhost, R. Sinha, Z. Wang, et al., "Sql-palm: Improved large language model adaptation for text-to-sql (extended)," arXiv preprint arXiv:2306.00739, 2023.
[70] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang, "Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x," 2024.
[71] X. Hu, K. Kuang, J. Sun, H. Yang, and F. Wu, "Leveraging print debugging to improve code generation in large language models," 2024.
[72] S. Peng, E. Kalliamvakou, P. Cihon, and M. Demirer, "The impact of ai on developer productivity: Evidence from github copilot," 2023.
[73] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis, "Incoder: A generative model for code infilling and synthesis," 2023.
[74] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," 2023.
[75] Y. Ding, M. J. Min, G. Kaiser, and B. Ray, "Cycle: Learning to self-refine the code generation," Proc. ACM Program. Lang., vol. 8, Apr. 2024.
[76] Y. Dong, X. Jiang, Z. Jin, and G. Li, "Self-collaboration code generation via chatgpt," 2024.
[77] F. Lin, D. J. Kim, et al., "When llm-based code generation meets the software development process," arXiv preprint arXiv:2403.15852, 2024.
[78] S. Holt, M. R. Luyten, and M. van der Schaar, "L2MAC: Large language model automatic computer for extensive code generation," in The Twelfth International Conference on Learning Representations, 2024.
[79] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, "Metagpt: Meta programming for a multi-agent collaborative framework," 2023.
[80] Z. Rasheed, M. Waseem, K.-K. Kemell, W. Xiaofeng, A. N. Duc, K. Systä, and P. Abrahamsson, "Autonomous agents in software development: A vision paper," arXiv preprint arXiv:2311.18440, 2023.
[81] Z. Rasheed, M. Waseem, M. Saari, K. Systä, and P. Abrahamsson, "Codepori: Large scale model for autonomous software development by using multi-agents," arXiv preprint arXiv:2402.01411, 2024.
[82] D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, "Agentcoder: Multi-agent-based code generation with iterative testing and optimisation," arXiv preprint arXiv:2312.13010, 2023.
[83] T. Zheng, G. Zhang, T. Shen, X. Liu, B. Y. Lin, J. Fu, W. Chen, and X. Yue, "Opencodeinterpreter: Integrating code generation with execution and refinement," arXiv preprint arXiv:2402.14658, 2024.
[84] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," Advances in Neural Information Processing Systems, vol. 36, 2024.
[85] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al., "Toolllm: Facilitating large language models to master 16000+ real-world apis," arXiv preprint arXiv:2307.16789, 2023.
[86] X. Jiang, Y. Dong, L. Wang, F. Zheng, Q. Shang, G. Li, Z. Jin, and W. Jiao, "Self-planning code generation with large language models," ACM Trans. Softw. Eng. Methodol., Jun. 2024. Just Accepted.
[87] S. Zhang, J. Wang, G. Dong, J. Sun, Y. Zhang, and G. Pu, "Experimenting a new programming practice with llms," arXiv preprint arXiv:2401.01062, 2024.
[88] V. Murali, C. Maddila, I. Ahmad, M. Bolin, D. Cheng, N. Ghorbani, R. Fernandez, and N. Nagappan, "Codecompose: A large-scale industrial deployment of ai-assisted code authoring," arXiv preprint arXiv:2305.12050, 2023.
[89] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, "Large language models can self-improve," arXiv preprint arXiv:2210.11610, 2022.
[90] L. Chen, J. Q. Davis, B. Hanin, P. Bailis, I. Stoica, M. Zaharia, and J. Zou, "Are more llm calls all you need? towards scaling laws of compound inference systems," arXiv preprint arXiv:2403.02419, 2024.
[91] X. Chen, M. Lin, N. Schärli, and D. Zhou, "Teaching large language models to self-debug," arXiv preprint arXiv:2304.05128, 2023.
[92] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, "Explainable automated debugging via large language model-driven scientific debugging," arXiv preprint arXiv:2304.02195, 2023.
[93] G. Franceschelli and M. Musolesi, "On the creativity of large language models," arXiv preprint arXiv:2304.00008, 2023.
[94] J. Lai, W. Gan, J. Wu, Z. Qi, and P. S. Yu, "Large language models in law: A survey," 2023.
[95] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., "Judging llm-as-a-judge with mt-bench and chatbot arena," Advances in Neural Information Processing Systems, vol. 36, 2024.
[96] Q. Wang, Z. Wang, Y. Su, H. Tong, and Y. Song, "Rethinking the bounds of llm reasoning: Are multi-agent discussions the key?," arXiv preprint arXiv:2402.18272, 2024.
[97] L. Chen, Y. Zhang, S. Ren, H. Zhao, Z. Cai, Y. Wang, P. Wang, T. Liu, and B. Chang, "Towards end-to-end embodied decision making via multi-modal large language model: Explorations with gpt4-vision and beyond," arXiv preprint arXiv:2310.02071, 2023.
[98] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[99] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian, C.-M. Chan, Y. Qin, Y. Lu, R. Xie, et al., "Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents," arXiv preprint arXiv:2308.10848, 2023.
[100] G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, "Camel: Communicative agents for "mind" exploration of large language model society," Advances in Neural Information Processing Systems, vol. 36, pp. 51991–52008, 2023.
[101] Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke, R. Murthy, Y. Feng, Z. Chen, J. C. Niebles, D. Arpit, et al., "Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents," arXiv preprint arXiv:2308.05960, 2023.
[102] J. Lu, W. Zhong, W. Huang, Y. Wang, Q. Zhu, F. Mi, B. Wang, W. Wang, X. Zeng, L. Shang, X. Jiang, and Q. Liu, "Self: Self-evolution with language feedback," 2024.
[103] C. Xie, C. Chen, F. Jia, Z. Ye, K. Shu, A. Bibi, Z. Hu, P. Torr, B. Ghanem, and G. Li, "Can large language model agents simulate human trust behaviors?," arXiv preprint arXiv:2402.04559, 2024.
[104] Z. Liu, W. Yao, J. Zhang, L. Yang, Z. Liu, J. Tan, P. K. Choubey, T. Lan, J. Wu, H. Wang, et al., "Agentlite: A lightweight library for building and advancing task-oriented llm agent system," arXiv preprint arXiv:2402.15538, 2024.
[105] M. Zhuge, W. Wang, L. Kirsch, F. Faccio, D. Khizbullin, and J. Schmidhuber, "Language agents as optimizable graphs," arXiv preprint arXiv:2402.16823, 2024.
[106] R. Feldt, S. Kang, J. Yoon, and S. Yoo, "Towards autonomous testing agents via conversational large language models," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1688–1693, IEEE, 2023.
[107] A. Happe and J. Cito, "Getting pwn'd by ai: Penetration testing with large language models," in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 2082–2086, 2023.
[108] W. Ma, D. Wu, Y. Sun, T. Wang, S. Liu, J. Zhang, Y. Xue, and Y. Liu, "Combining fine-tuning and llm-based agents for intuitive smart contract auditing with justifications," arXiv preprint arXiv:2403.16073, 2024.
[109] R. Fang, R. Bindu, A. Gupta, and D. Kang, "Llm agents can autonomously exploit one-day vulnerabilities," arXiv preprint arXiv:2404.08144, 2024.
[110] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[111] Z. Rasheed, M. Waseem, A. Ahmad, K.-K. Kemell, W. Xiaofeng, A. N. Duc, and P. Abrahamsson, "Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis," arXiv preprint arXiv:2402.01386, 2024.
[112] M. Ataei, H. Cheong, D. Grandi, Y. Wang, N. Morris, and A. Tessier, "Elicitron: An llm agent-based simulation framework for design requirements elicitation," arXiv preprint arXiv:2404.16045, 2024.
[113] G. Sridhara, S. Mazumdar, et al., "Chatgpt: A study on its utility for ubiquitous software engineering tasks," arXiv preprint arXiv:2305.16837, 2023.
[114] M. Desmond, Z. Ashktorab, Q. Pan, C. Dugan, and J. M. Johnson, "Evalullm: Llm assisted evaluation of generative outputs," in Companion Proceedings of the 29th International Conference on Intelligent User Interfaces, pp. 30–32, 2024.
[115] M. Gao, X. Hu, J. Ruan, X. Pu, and X. Wan, "Llm-based nlg evaluation: Current status and challenges," arXiv preprint arXiv:2402.01383, 2024.
[116] L. J. Wan, Y. Huang, Y. Li, H. Ye, J. Wang, X. Zhang, and D. Chen, "Software/hardware co-design for llm and its application for design verification," in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 435–441, IEEE, 2024.
[117] K. Kolthoff, C. Bartelt, and S. P. Ponzetto, "Data-driven prototyping via natural-language-based gui retrieval," Automated Software Engineering, vol. 30, no. 1, p. 13, 2023.
[118] V. D. Kirova, C. S. Ku, J. R. Laracy, and T. J. Marlowe, "Software engineering education must adapt and evolve for an llm environment," in Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1, pp. 666–672, 2024.
[119] S. Jalil, S. Rafi, T. D. LaToza, K. Moran, and W. Lam, "Chatgpt and software testing education: promises & perils (2023)," arXiv preprint arXiv:2302.03287, 2023.
[120] S. Suri, S. N. Das, K. Singi, K. Dey, V. S. Sharma, and V. Kaulgud, "Software engineering using autonomous agents: Are we there yet?," in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1855–1857, IEEE, 2023.
[121] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun, "Communicative agents for software development," arXiv preprint arXiv:2307.07924, 2023.
[122] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, "Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face," Advances in Neural Information Processing Systems, vol. 36, 2024.
[123] J. Chen, X. Hu, S. Liu, S. Huang, W.-W. Tu, Z. He, and L. Wen, "Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments," arXiv preprint arXiv:2402.16499, 2024.
[124] M. Josifoski, L. Klein, M. Peyrard, Y. Li, S. Geng, J. P. Schnitzler, Y. Yao, J. Wei, D. Paul, and R. West, "Flows: Building blocks of reasoning and collaborating ai," arXiv preprint arXiv:2308.01285, 2023.
[125] I. Weber, "Large language models as software components: A taxonomy for llm-integrated applications," arXiv preprint arXiv:2406.10300, 2024.
[126] F. Vallecillos Ruiz, "Agent-driven automatic software improvement," in Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pp. 470–475, 2024.
[127] Z. Cheng, J. Kasai, and T. Yu, "Batch prompting: Efficient inference with large language model apis," arXiv preprint arXiv:2301.08721, 2023.
[128] S. Shankar, J. Zamfirescu-Pereira, B. Hartmann, A. G. Parameswaran, and I. Arawjo, "Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences," arXiv preprint arXiv:2404.12272, 2024.
[129] D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, "Exploring llm-based agents for root cause analysis," in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 208–219, 2024.
[130] Y. Li, Y. Zhang, and L. Sun, "Metaagents: Simulating interactions of human behaviors for llm-based task-oriented coordination via collaborative generative agents," arXiv preprint arXiv:2310.06500, 2023.
[131] M. Tufano, D. Drain, A. Svyatkovskiy, S. K. Deng, and N. Sundaresan, "Unit test case generation with transformers and focal context," arXiv preprint arXiv:2009.05617, 2020.
[132] Y. Zhang, W. Song, Z. Ji, N. Meng, et al., "How well does llm generate security tests?," arXiv preprint arXiv:2310.00710, 2023.
[133] H. J. Kang, T. G. Nguyen, B. Le, C. S. Păsăreanu, and D. Lo, "Test mimicry to assess the exploitability of library vulnerabilities," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 276–288, 2022.
[134] S. Feng and C. Chen, "Prompting is all you need: Automated android bug replay with large language models," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, pp. 1–13, 2024.
[135] S. Kang, J. Yoon, and S. Yoo, "Large language models are few-shot testers: Exploring llm-based general bug reproduction," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2312–2323, IEEE, 2023.
[136] C. S. Xia, M. Paltenghi, J. Le Tian, M. Pradel, and L. Zhang, "Fuzz4all: Universal fuzzing with large language models," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–13, 2024.
[137] G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, "Code-aware prompting: A study of coverage guided test generation in regression setting using llm," arXiv preprint arXiv:2402.00097, 2024.
[138] J. A. Pizzorno and E. D. Berger, "Coverup: Coverage-guided llm-based test generation," arXiv preprint arXiv:2403.16218, 2024.
[139] K. Liu, Y. Liu, Z. Chen, J. M. Zhang, Y. Han, Y. Ma, G. Li, and G. Huang, "Llm-powered test case generation for detecting tricky bugs," arXiv preprint arXiv:2404.10304, 2024.
[140] K. Li and Y. Yuan, "Large language models as test case generators: Performance evaluation and enhancement," arXiv preprint arXiv:2404.13340, 2024.
[141] Z. Wang, W. Wang, Z. Li, L. Wang, C. Yi, X. Xu, L. Cao, H. Su, S. Chen, and J. Zhou, "Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model," arXiv preprint arXiv:2401.02705, 2024.
[142] C. Lee, C. S. Xia, J.-t. Huang, Z. Zhu, L. Zhang, and M. R. Lyu, "A unified debugging approach via llm-based multi-agent synergy," arXiv preprint arXiv:2404.17153, 2024.
[143] G. Deng, Y. Liu, V. Mayoral-Vilches, P. Liu, Y. Li, Y. Xu, T. Zhang, Y. Liu, M. Pinzger, and S. Rass, "Pentestgpt: An llm-empowered automatic penetration testing tool," arXiv preprint arXiv:2308.06782, 2023.
[144] M. Xiao, Y. Xiao, H. Dong, S. Ji, and P. Zhang, "Rits: Robust input testing framework for llms-based intelligent software," arXiv preprint arXiv:2402.13518, 2024.
[145] R. Wang, Z. Li, C. Wang, Y. Xiao, and C. Gao, "Navrepair: Node-type aware c/c++ code vulnerability repair," arXiv preprint arXiv:2405.04994, 2024.
[146] A. Shestov, A. Cheshkov, R. Levichev, R. Mussabayev, P. Zadorozhny, E. Maslov, C. Vadim, and E. Bulychev, "Finetuning large language models for vulnerability detection," arXiv preprint arXiv:2401.17010, 2024.
[147] A. Cheshkov, P. Zadorozhny, and R. Levichev, "Evaluation of chatgpt model for vulnerability detection," arXiv preprint arXiv:2304.07232, 2023.
[148] G. Lu, X. Ju, X. Chen, W. Pei, and Z. Cai, "Grace: Empowering llm-based software vulnerability detection with graph structure and in-context learning," Journal of Systems and Software, vol. 212, p. 112031, 2024.
[149] H. Li, Y. Hao, Y. Zhai, and Z. Qian, "The hitchhiker's guide to program analysis: A journey with large language models," arXiv preprint arXiv:2308.00245, 2023.
[150] Y. Ding, Y. Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, and Y. Chen, "Vulnerability detection with code language models: How far are we?," arXiv preprint arXiv:2403.18624, 2024.
[151] F. V. Ruiz, A. Grishina, M. Hort, and L. Moonen, "A novel approach for automatic program repair using round-trip translation with large language models," arXiv preprint arXiv:2401.07994, 2024.
[152] B. Yang, H. Tian, J. Ren, H. Zhang, J. Klein, T. F. Bissyandé, C. L. Goues, and S. Jin, "Multi-objective fine-tuning for enhanced program repair with llms," arXiv preprint arXiv:2404.12636, 2024.
[153] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "Qlora: Efficient finetuning of quantized llms," 2023.
[154] N. Jain, P. yeh Chiang, Y. Wen, J. Kirchenbauer, H.-M. Chu, G. Somepalli, B. R. Bartoldson, B. Kailkhura, A. Schwarzschild, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein, "Neftune: Noisy embeddings improve instruction finetuning," 2023.
[155] J. Zhang, J. P. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares, and G. Verbruggen, "Pydex: Repairing bugs in introductory python assignments using llms," 2024.
APPENDIX
A. Benchmarks