TPTU: Task Planning and Tool Usage of Large Language Model-Based AI Agents
Rui Zhao
[email protected]
Abstract
1 Introduction
Large Language Models (LLMs) [1] are a recent breakthrough in natural language processing (NLP)
research. These models are trained on massive amounts of text data and can solve a wide range of
tasks, even those that were not included in their training dataset. This ability is especially evident
in few-shot [2] and zero-shot [3] learning, where LLMs can perform well with minimal or no
task-specific training.
† These authors contributed equally to this work.
‡ These authors work as research interns at SenseTime Research.
∗ The corresponding author.
However, the application of LLMs in real-world settings presents unique challenges. On the one hand,
LLMs have proven to be poor at solving logical problems such as mathematics, and their training data
also quickly becomes outdated. Teaching LLMs to use tools such as calculators or search engines can help prevent
them from hallucinating. On the other hand, despite their impressive problem-solving abilities,
the successful integration of these models into complex systems often requires more than just task
understanding - it necessitates the capacity to manipulate various tools and interact effectively with
users. This is exemplified in systems like AutoGPT¹, BabyAGI², and ChatGPT-plugins³, which
leverage LLMs’ capabilities beyond merely generating well-written texts and programs. In these
systems, LLMs operate as the central controller, manipulating different tools and interacting with
humans, thus taking on the role of Artificial Intelligence Agents (AI Agents). In addition to being
central planners, LLMs are often used as intermediaries between macro plans and low-level tool calls,
or as specific tools. As such, LLMs are seen as a crucial approximation of the linguistic world model
in real-world systems.
In this paper, we propose a structured framework and discuss the necessary abilities of such LLM-
based AI Agents. Furthermore, we instantiate the framework with different LLMs and evaluate
their task planning and tool usage (TPTU) abilities on several tasks. Our main contributions are
summarized as follows:
1. We propose a structured framework tailored for LLM-based AI Agents to evaluate the TPTU
abilities of the existing open-source LLMs.
2. We design two distinct types of agents (i.e., one-step agent and sequential agent) to execute
the inference process and provide detailed empirical results and analysis.
3. Our study reveals significant potential in utilizing LLMs for complex tasks while highlighting
areas for further investigation and improvement.
2 Method
To the best of our knowledge, the study of “Agent”, “Autonomous Agent”, “AI Agent”, and “Multi-
Agent” has been a central part of AI research for decades [4–9], aimed at understanding and building
intelligent and autonomous systems, but there is currently no standardized definition for AI Agents,
particularly those that are based on LLMs.
In this paper, the Artificial Intelligence Agent (AI Agent) is defined as a program that employs
artificial intelligence techniques to perform tasks that typically require human-like intelligence.
AI Agents can take many forms, from simple chatbots to complex autonomous systems that interact
with their environment and make decisions in real time. They can be trained using a variety of
machine learning techniques, including supervised, unsupervised, and reinforcement learning, and
can be programmed to perform specific tasks or learn from their experiences in order to improve their
performance over time.
We are particularly interested in the AI Agent that employs the LLM techniques (i.e., LLM-based AI
Agent), due to its high efficiency and flexibility in various tasks and domains. Specifically, we design
our AI Agent framework with six components as shown in Figure 1:
1. Task Instruction. This is the explicit input of the agent. In practical systems, the task
instruction comes from human users of the systems. For example, in an HR system, the
user may give a task instruction: How much budget is required to provide a $100 incentive
for each colleague who has worked for five years? In contrast, in a criminal investigation
system, the user may give a task instruction: Deploy surveillance on a group of suspects.
2. Designed Prompt. This is an additional form of input for the agent, derived from tasks
that the human users anticipate the AI Agent will complete. Humans can craft specific
instructions or demonstrations to steer the LLM-based AI Agents toward generating suitable
responses. These guiding inputs could encompass system instructions, tool descriptions,
few-shot demonstrations, chat history, or even error output.
¹ https://github.com/Significant-Gravitas/Auto-GPT
² https://github.com/yoheinakajima/babyagi
³ https://openai.com/blog/chatgpt-plugins
[Figure 1: The framework of the LLM-based AI Agent, showing its inputs (a task instruction such as
“Deploy surveillance on a group of suspects.”, history/memory, and a tool set including Calculator(),
Database(), and PythonREPL()), the LLM’s decomposition of the task into subtasks, the intermediate
outputs (correct results or exception errors), the final answer, and the necessary abilities of the agent
(perception, learning, reflection, memory, and summarization).]
3. Tool Set. It is another input for the agent, which refers to the set of external resources,
services, or subsystems that the AI Agent can utilize to aid in its tasks. This could include
databases for information retrieval, APIs for interacting with external systems, other AI
models specialized for tasks such as image recognition or sentiment analysis, or even non-AI
tools and resources such as web scraping tools or data visualization libraries. The toolset
expands the capabilities of the AI Agent, enabling it to access and process information
beyond its internal knowledge, interact with other systems, or perform specialized tasks
that it may not be capable of on its own. For example, an AI Agent might use a weather
API to fetch current weather information, or a Python interpreter to solve the mathematical
question.
4. LLM. This is the core component of the system that interprets the task instructions and
prompts, interacts with the toolset, and generates the intermediate outputs and final answers.
In this context, we utilize publicly available large language models such as ChatGPT, GPT-4,
and others.
5. Intermediate Output. This represents the output generated by the LLM-based AI Agent
after it processes the task instructions and prompts, and interacts with the toolset. There
are three typical intermediate outputs: (1) the high-level plans to fulfill the original user
instruction, (2) selected and created tools to fulfill each subtask in the plans, and (3) the
results or errors produced after tool execution. The output can be reviewed and refined,
either by the AI Agent itself or with human oversight, to ensure it is accurate and meets the
requirements of the task instruction.
6. Final Answer. This is the output that the AI Agent summarizes and provides to the user
after all processing (including task planning, tool usage, and possibly error feedback) has
been completed.
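To make the interplay of these six components concrete, the following minimal Python sketch (our own
illustration under stated assumptions, not the paper’s implementation; the llm callable, the plan format,
and the summarization prompt are hypothetical) wires them into a single inference pass.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class AgentInputs:
        task_instruction: str                      # 1. Task Instruction from the human user
        designed_prompt: str                       # 2. Designed Prompt (system text, tool docs, demos)
        tool_set: Dict[str, Callable[[str], str]]  # 3. Tool Set: tool name -> executable callable

    def run_agent(inputs: AgentInputs, llm: Callable[[str], str]) -> str:
        # 4. The LLM interprets the instruction and prompt and drafts a plan (an intermediate output).
        prompt = (f"{inputs.designed_prompt}\n"
                  f"Task: {inputs.task_instruction}\n"
                  f"Available tools: {list(inputs.tool_set)}\n"
                  "Answer with one 'tool: subtask' per line.")
        plan = llm(prompt)
        intermediate: List[str] = [plan]           # 5. Intermediate Output: plan, tool results, errors
        for line in plan.splitlines():
            tool, _, subtask = line.partition(":")
            if tool.strip() in inputs.tool_set:
                intermediate.append(inputs.tool_set[tool.strip()](subtask.strip()))
        # 6. Final Answer: the LLM summarizes everything produced so far.
        return llm("Summarize the following into a final answer:\n" + "\n".join(intermediate))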
Task planning and tool usage represent the cornerstones of an LLM-based agent’s abilities. Other abilities,
such as perception, learning/reflection/memory (from feedback), and summarization, are indeed critical, but they primarily
serve to enhance and support these two core competencies. Therefore, concentrating on these two
key competencies - Task Planning and Tool Usage (TPTU for short) - we have devised two distinct
types of AI agents, as depicted in Figure 2:
Figure 2: The workflows of the One-step Agent and the Sequential Agent are specifically designed to
assess the Task Planning and Tool Usage abilities of LLMs.
• The first one, named the One-step Agent (TPTU-OA), adopts a global perspective to
interpret the original problem, effectively breaking it down into a sequence of sub-tasks in
a single instance. This strategy fully harnesses the model’s comprehensive understanding
capabilities to map out the problem-solving steps for all sub-tasks at once. This method
underscores the significance of a holistic understanding and planning of the overall task,
albeit it might lack flexibility when dealing with individual sub-tasks.
• The second type, referred to as the Sequential Agent (TPTU-SA), emphasizes tackling the
current sub-task at hand. Upon successful resolution of the ongoing sub-task, this agent
requests the LLMs to provide the succeeding sub-task. This approach enables the model to
maintain a clear and concentrated focus throughout the problem-solving journey, tackling
issues incrementally. Such a methodology allows for continuous feedback and progress
within the confines of addressing a broader problem.
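The contrast between the two agents can be paraphrased as two planning loops. The sketch below is a
simplified rendering under our own assumptions (the llm and execute callables and the prompt wording
are placeholders, not the actual prompts in Appendix B): TPTU-OA requests the entire tool-subtask
sequence in one call, while TPTU-SA requests one step at a time and feeds each intermediate result back.

    from typing import Callable, List, Tuple

    def plan_one_step(llm: Callable[[str], str], question: str,
                      tools: List[str]) -> List[Tuple[str, str]]:
        # TPTU-OA: ask for the full sequence of (tool, subtask) pairs in a single call.
        reply = llm(f"Question: {question}\nTools: {tools}\n"
                    "List every step as 'tool: subtask', one per line.")
        return [tuple(part.strip() for part in line.split(":", 1))
                for line in reply.splitlines() if ":" in line]

    def plan_sequential(llm: Callable[[str], str], question: str, tools: List[str],
                        execute: Callable[[str, str], str],
                        max_steps: int = 8) -> List[Tuple[str, str]]:
        # TPTU-SA: ask only for the next subtask, execute it, and feed the result back.
        history, plan = "", []
        for _ in range(max_steps):
            reply = llm(f"Question: {question}\nTools: {tools}\nHistory:\n{history}\n"
                        "Reply 'DONE' if finished, otherwise give the next step as 'tool: subtask'.")
            if reply.strip().upper().startswith("DONE"):
                break
            tool, _, subtask = reply.partition(":")
            result = execute(tool.strip(), subtask.strip())
            plan.append((tool.strip(), subtask.strip()))
            history += f"{tool.strip()}: {subtask.strip()} -> {result}\n"
        return plan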
These two distinct agent models represent two disparate problem-solving strategies - the sequential
and one-step resolution⁴. In our subsequent experiments, we aim to understand their respective
strengths and weaknesses, and how they can be best utilized to leverage the capabilities of LLMs in
real-world problem-solving scenarios.
⁴ One can also combine the two strategies to design a hierarchical agent, but this is beyond the scope of this paper.
3 Evaluation
We instantiate the proposed LLM-based AI Agent framework (TPTU-OA and TPTU-SA) with
different LLMs and evaluate their performance on typical tasks.
3.1 Preparations
Before beginning our evaluation, we first outline the preparations. We will give detailed descriptions
of the datasets, available tools, and popular large language models.
3.1.1 Datasets
We first clarify the motivations behind our choice of tools for evaluation. The selection was guided
by two primary factors: the number of tools to be evaluated and the specific tools to be included.
Firstly, regarding the number of tools, it is important to state that our proposed evaluation framework
is extensible. It can incorporate any number of tools as pluggable components to be managed by
the LLM-based AI agents. Moreover, looking at current work on tool-augmented LLMs, such as
T-Bench [10] and ToolBench [11], we see that only a handful of tools are launched and executed in a
single scenario. Meanwhile, API-Bank [12], in a single scenario, typically dispatches only one API
tool and awaits its response. APIBench [13] and ToolAlpaca [14] do not even execute a tool response.
Hence, for the sake of simplicity and focus in our evaluation, we have decided to primarily assess
two tools (which can be called multiple times) within a single scenario.
Secondly, we also need to decide which specific tools should be used for evaluation. Consider a
real-world scenario where we pose the question: “How much budget is required to offer a $100
incentive to each employee who has been with the company for over five years?” To answer this, we
first need to retrieve the relevant data from a database, typically using SQL, to find the number of
eligible employees. Then, we need to perform a mathematical calculation to estimate the total budget.
Such scenarios are quite common in daily life where the formulation and resolution of a question
often involve SQL and mathematical tools.
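As a concrete sketch of this SQL-plus-math pattern (a toy example of our own: the employee table, its
schema, and the rows are invented for illustration and are not part of the benchmark data), the budget
question decomposes into one database query followed by one calculation.

    import sqlite3

    # Step 1: SQL tool -- count employees with more than five years of service (assumed schema).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, years_of_service INTEGER)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)",
                     [("Alice", 7), ("Bob", 3), ("Carol", 6)])
    (eligible_count,) = conn.execute(
        "SELECT COUNT(*) FROM employee WHERE years_of_service > 5").fetchone()

    # Step 2: math tool -- multiply the eligible head count by the $100 incentive.
    total_budget = 100 * eligible_count
    print(total_budget)  # 200 for this toy data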
Recognizing the importance of these tools, we have chosen to focus our evaluation on SQL and Python
generators, which represent the capabilities of database querying and mathematical computation,
respectively. To this end, we have prepared 120 question-answer pairs that vary in complexity. These
pairs provide a rigorous assessment of the LLM-based AI agents in understanding, generating, and
utilizing these essential tools. For further information on these queries and their corresponding
demonstrations, please refer to Appendix A.
3.1.2 Tools
We have defined a total of 12 available tools from which the LLM-based AI agents can select during
evaluation. They are defined as follows:
• SQL generator: Given an input question and a database, create a syntactically correct SQLite
query statement.
• Python generator: Given an input question and some information, generate a syntactically
correct Python code.
• Weather query tool: Given a location, output the current real-time weather at that location.
• Image generator: Given a text description, generate a related image.
• Text extractor: Given a link to an image, extract the corresponding text and its position
coordinates.
• Translator: Given a piece of text, translate it into other languages.
• Bing Searcher: Given a piece of text, conduct a search with the Bing search engine and return
the content.
• Shell generator: Given an input question and some information, generate a syntactically
correct Shell code.
• Java generator: Given an input question and some information, generate a syntactically
correct Java code.
• Wikipedia searcher: Given a piece of text, conduct a search on Wikipedia and return content.
• Office software: Given a text description, automatically generate corresponding long docu-
ments, spreadsheets, or PPTs.
• Movie player: Given a movie name, automatically play the corresponding movie resources.
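In practice, such a tool set can be exposed to the agent as a registry that maps each tool name to a
description and an executable backend; the snippet below is a hypothetical sketch (the callables are
stubs, not implementations of the real tools) of how the descriptions above could be rendered into the
designed prompt.

    from typing import Callable, Dict, NamedTuple

    class Tool(NamedTuple):
        description: str
        run: Callable[[str], str]

    # A subset of the 12 tools; each callable is a stub standing in for the real backend.
    TOOLS: Dict[str, Tool] = {
        "SQL generator": Tool("Given a question and a database, create a SQLite query.",
                              lambda q: "SELECT ..."),
        "Python generator": Tool("Given a question and some information, generate Python code.",
                                 lambda q: "print(...)"),
        "Weather query tool": Tool("Given a location, output the current weather.",
                                   lambda q: "sunny"),
    }

    def tool_descriptions() -> str:
        # Rendered into the designed prompt so the LLM knows what it may call.
        return "\n".join(f"- {name}: {tool.description}" for name, tool in TOOLS.items())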
3.1.3 LLMs
The LLMs evaluated in this paper are listed in Table 2, elaborated as follows:
• The GPT series developed by OpenAI comprises powerful language models with vast numbers of
parameters, enabling them to tackle intricate problems efficiently. This paper evaluates
the performance of ChatGPT, which balances performance with cost (the number of
OpenAI API calls).
• Claude, developed by Anthropic, is committed to maintaining honesty and ensuring user
safety. With its impressive size, Claude ranks among the largest language models
globally and poses a formidable challenge to ChatGPT as a strong competitor.
• InternLM, a sophisticated language model developed by Shanghai AI Lab, boasts a multi-
round dialogue capability and an impressive ability to comprehend super-long text. This
language model is meticulously designed to cater to the nuances of the Chinese language,
enabling it to comprehensively understand and effectively process Chinese text. Here, we
adopted the version with 120 billion parameters.
• Ziya is an expansive and robust pre-trained model developed by IDEA, derived from
LLaMA with 13 billion parameters. This comprehensive model exhibits a wide range of
capabilities, including translation, programming, and mathematical calculations. Notably, it
stands out as a bilingual LLM, highlighting its ability to effectively process and comprehend
text in Chinese.
• ChatGLM, developed by Tsinghua University, is an open-source dialogue language model
that supports bilingual Q&A in Chinese and English, with a particular focus on Chinese
optimization. Built on the General Language Model (GLM) architecture and utilizing model
quantization technology, the ChatGLM can be easily deployed on consumer-grade graphics
cards, enabling local implementation by users.
• Chinese-Alpaca-Plus is obtained by extending the existing vocabulary of LLaMA (from
Meta AI, formerly known as Facebook AI Research) with an additional 20,000 Chinese
tokens. In this version, we use a model with 33 billion parameters. The training text
has been expanded to 120GB, and the fine-tuning instruction data has been increased to
4.3M.
3.2 Evaluation on Task Planning Ability
In this section, to evaluate the planning capabilities of the LLM-based AI agents, we have structured
the evaluations as follows.
For TPTU-OA, we begin by examining the agents’ ability to plan the order of tool use. This is
followed by an evaluation of the agents’ capacity to not only plan the sequence of tools but also the
corresponding subtask descriptions. Subsequently, we conduct a specialized planning evaluation
where the agents must generate multiple sequences of key-value pairs of the form {tool: subtask
description} when decomposing complex problems. Moreover, we expand the toolset with additional,
unrelated tools to further challenge and reassess the planning ability of the LLM-based AI agents.
For TPTU-SA, we follow the regime that the agent should generate multiple sequences of key-value
pairs of the form {tool: subtask description} for evaluation.
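Concretely, an agent is expected to emit a sequence of {tool: subtask description} pairs; the sketch
below shows one way such output might be parsed and checked against a reference tool order (the
one-JSON-object-per-line format and the exact-order scoring rule are our own simplifications, not
necessarily the grading procedure used in the paper).

    import json
    from typing import Dict, List

    def parse_plan(llm_output: str) -> List[Dict[str, str]]:
        # Each line is expected to hold one JSON object of the form {"tool": "subtask description"}.
        plan = []
        for line in llm_output.splitlines():
            line = line.strip()
            if line.startswith("{") and line.endswith("}"):
                plan.append(json.loads(line))
        return plan

    def correct_tool_order(plan: List[Dict[str, str]], reference_tools: List[str]) -> bool:
        # Count a plan as correct if its tools appear in exactly the reference order.
        return [next(iter(step)) for step in plan] == reference_tools

    example = ('{"SQL Generator": "count colleagues with five years of service"}\n'
               '{"PythonREPL": "multiply the count by 100"}')
    print(correct_tool_order(parse_plan(example), ["SQL Generator", "PythonREPL"]))  # True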
Table 3: The evaluation results for the planning of tool order generation.
Model     ChatGPT   Claude   Ziya   ChatGLM   Chinese-Alpaca-Plus   InternLM
Accuracy  100%      100%     45%    45%       20%                   80%
The results of our experiments indicate that models, notably Ziya and ChatGLM, frequently grapple
with the generation of lists in the correct format. For other models, the predominant challenges lie in
generating tools in the correct sequence or in the occasional omission of necessary tools. Nonetheless,
the issue of parsing list formats is generally negligible.
These findings suggest that the majority of LLM-based AI agents possess a fundamental capability to
analyze the tool needs of a given problem and understand its task requirements. To further explore
whether these LLM-based AI agents can effectively break down the original problem into sub-tasks,
we proceed to the following section.
Table 4: The evaluation results for the planning of tool order and subtask description generation.
Model     ChatGPT   Claude   Ziya   ChatGLM   Chinese-Alpaca-Plus   InternLM
Accuracy  55%       15%      10%    10%       0%                    45%
Although the generation of tool sequences and their corresponding subtask descriptions might be an
effective way to problem-solving, there is a significant decrease in accuracy for all LLMs as can be
seen. We hypothesize that there are a few potential drawbacks to this method:
1. Difficulty in Error Tracking and Debugging. Generating the complete tool and subtask
sequences may make it more challenging to track and debug errors. If an error arises within
the sequence, it might require a total regeneration instead of a simple modification or repair
to the erroneous part.
2. Tool-Subtask Pairing Issue. If all tool sequences and subtask descriptions are generated
independently, there’s an inherent risk of misalignment between the tools and their corre-
sponding subtasks. This could potentially lead to an improper pairing, which, in turn, could
result in a flawed or ineffective solution that fails to appropriately resolve the given problem.
3. Lack of Flexibility. The approach may lack flexibility when facing complex problems
requiring adjustments to the tool or subtask sequence.
4. Dependency on Global Information. Generating the entire tool and subtask sequences
requires a global understanding and planning of the entire problem. However, in some
instances, certain parts of the problem might not be clear at the early stages of problem-
solving, which could pose challenges within this framework.
After feeding the prompt to these LLM-based AI agents, we get results shown in Table 5.
Analyzing the results from Tables 4 and 5, we observe a marked improvement of 52.9% when the
tool-subtask pairs are generated in a unified format compared to separate generation of tools and
subtasks.
This significant performance enhancement can likely be attributed to the close coupling between
tools and their associated subtasks in our unified generation strategy. When tools and subtasks are
generated separately, there is a potential disconnect or lack of coherence between the two, which
could lead to less accurate or efficient solutions. In contrast, by generating tool-subtask pairs together,
we ensure that each tool is directly tied to its relevant subtask, leading to a more coordinated and
effective problem-solving approach. This might explain the observed increase in overall performance.
Table 6: The evaluation results for the planning of Tool-Subtask pair with unrelated tools.
Model     ChatGPT   Claude   Ziya   ChatGLM   Chinese-Alpaca-Plus   InternLM
Accuracy  70%       90%      10%    0%        5%                    50%
After feeding the prompt to these LLM-based AI agents, we get results shown in Table 6. The results
from our expanded evaluation demonstrate that even when presented with irrelevant or similar tools
and descriptions, LLM-based AI agents consistently avoid selecting these unrelated tools (i.e., the
accuracy has remained unchanged or exhibited only a marginal decrease compared with Table 5).
This outcome indicates the effectiveness of our designed prompt, which successfully guides the
LLM-based agents to understand the appropriate tool sequence for complex problem decomposition.
This observation reinforces the notion that a well-structured and informative prompt can efficiently
guide AI agents to understand the core essence of the problem, thereby enabling them to sift through
irrelevant information and focus on key tasks. This successful discrimination against unrelated tools
also points towards the models’ ability to understand the specific context of a problem and select the
appropriate tools, thereby enhancing the overall problem-solving process.
Table 7: The evaluation results for the planning of Tool-Subtask with the sequential agent.
Model     ChatGPT   Claude   Ziya   ChatGLM   Chinese-Alpaca-Plus   InternLM
Accuracy  80%       100%     10%    0%        0%                    65%
The evaluation results are shown in Table 7. Compared with results shown in Table 5, TPTU-SA
generally performs better than TPTU-OA, especially for high-performing LLMs (e.g., ChatGPT,
Claude and InternLM). We propose the following potential reasons for this observation:
2. Richer Contextual Understanding: Sequential agents are exposed to the outcome of each
previous subtask before moving on to the next one. This iterative process could facilitate a
richer understanding of the problem context, enabling more accurate task planning and tool
usage.
3. Flexibility in Task Management: In comparison to one-step agents, sequential agents
might have more flexibility in managing tasks. They have the opportunity to correct errors
or adjust their strategy after each step, which can lead to improved overall performance.
4. Improved Learning From History: The sequential process provides a history of actions
and results which can be beneficial in learning. The agent can use this history to make better
predictions about what tool to use next or what subtask to tackle, leading to more accurate
and efficient problem-solving.
These points of analysis suggest that the structure and operation of sequential agents inherently confer
certain advantages in complex problem-solving scenarios, leading to their superior performance.
3.3 Evaluation on Tool Usage Ability
Before evaluating the end-to-end multi-tool usage ability of LLM-based AI agents, we first evaluate
the effectiveness of single-tool usage for SQL generation and mathematical code generation.
Subsequently, to assess the end-to-end performance of LLMs across various tools, two types of agents
(TPTU-OA and TPTU-SA) were developed and several LLMs were subjected to testing under these
agents. The role of the agents is to break down complex questions into simpler sub-questions and plan
corresponding tools to solve them, based on the available toolset and corresponding tool descriptions.
The Effectiveness of Simple SQL Creation Using the schemas provided in Table 12 and Table 13,
we construct questions similar to those in Table 14, and refer readers to Appendix A. These questions
are posed to various LLMs using our specifically designed prompts in Appendix B.
Following the tailored prompts, the LLMs are evaluated based on their responses to the presented
queries. The results of this comprehensive assessment are compiled and exhibited in Table 8. This
verifies the capability of each LLM in handling simple single-table SQL queries of varying difficulty,
thus providing a basis for comparison and analysis.
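One natural way to score a generated statement is to execute it against a reference SQLite database and
compare the returned rows with those of a gold query; the sketch below illustrates this execution-based
check under our own assumptions (the paper does not spell out its exact matching rule), reusing the
Person schema from Appendix A with an invented toy row.

    import sqlite3

    def execute(db: sqlite3.Connection, query: str):
        try:
            return sorted(db.execute(query).fetchall())
        except sqlite3.Error:
            return None  # syntactically invalid SQL counts as a failure

    def sql_matches(db: sqlite3.Connection, generated: str, gold: str) -> bool:
        result = execute(db, generated)
        return result is not None and result == execute(db, gold)

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Person (id TEXT, name TEXT, age INTEGER, sex TEXT, school TEXT,"
               " phone TEXT, qualifications TEXT, ability TEXT)")
    db.execute("INSERT INTO Person VALUES ('1', 'Li Lei', 28, 'M', 'A', '123', 'BSc', 'SQL')")
    print(sql_matches(db, "SELECT name FROM Person WHERE age > 20",
                          "SELECT name FROM Person WHERE age >= 21"))  # True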
The Effectiveness of Complex Nested SQL Creation Using the schemas provided in Ta-
ble 15, 16, 17, and 18, we construct questions similar to those in Table 19, and refer readers to
Appendix A. For complex nested SQL questions, to further verify the SQL tool creation capability of
LLMs, we have designed two types of prompts. One is the direct-guidance type, which explicitly
informs the model that it needs to generate nested SQL query statements, as shown in Figure 13 in
Appendix B.
The other is based on the Chain-of-Thought (CoT) [20] approach, which leverages the model’s ability
to reason step by step to comprehend and craft SQL tools, and the prompt is shown in Figure 14 in
Appendix B. This method guides the model to sequentially generate SQL query clauses based on the
problem context, thus breaking down the complex query generation task into smaller and manageable
subtasks. This approach provides the model with a structured way to handle complex SQL tasks and
showcases its capacity to engage in incremental reasoning and problem-solving.
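As an illustration of this clause-by-clause decomposition (our own example, built on the Journal and
CoverPersonality schemas of Appendix A; it paraphrases the idea rather than reproducing the actual
CoT prompt in Figure 14), a nested query can be assembled in stages.

    # Target question: list the names and languages of journals that have no cover personality.
    # Step 1: inner query -- journals that DO appear in CoverPersonality.
    inner = "SELECT Journal_ID FROM CoverPersonality"
    # Step 2: outer filter -- keep journals whose ID is not in that set.
    where_clause = f"Journal_ID NOT IN ({inner})"
    # Step 3: final projection over the Journal table.
    final_query = f"SELECT Name, Language FROM Journal WHERE {where_clause}"
    print(final_query)
    # SELECT Name, Language FROM Journal WHERE Journal_ID NOT IN (SELECT Journal_ID FROM CoverPersonality)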
The design of these two types of prompts serves as the backbone of our evaluation for complex nested
SQL questions. While the direct-guidance approach focuses on testing the model’s raw ability to
generate SQL queries when explicitly instructed, the CoT-based approach evaluates a more nuanced
capability: the model’s reasoning and problem-solving skills in a step-by-step manner. Both these
methods present unique challenges and offer valuable insights into the strengths and potential areas
of improvement for the large language model’s SQL tool generation ability. Subsequently, we will
explore these two dimensions based on our experimental evaluations shown in Table 9.
From the above results in Table 9, it is clear that different models possess varying levels of proficiency
in handling complex nested SQL tasks. Some models, like Claude, exhibit a robust capability in SQL
generation, no matter whether the approach is direct or CoT-based. Most of these models demonstrate
the capability to use the SQL tool.
Specifically, some models such as ChatGLM show a distinct preference for the CoT-based approach,
their performance improves when problems are broken down into smaller, manageable sub-tasks.
This suggests that these models may have a stronger ability in sequential problem-solving and benefit
more from step-by-step guidance. Conversely, models like Ziya and InternLM show a drop in
performance when tasks are guided in the CoT-based format. This might indicate challenges in
managing dependencies between sub-tasks or handling the continuity in sequential problem-solving.
Lastly, Chinese-Alpaca-Plus shows significant room for improvement in complex SQL generation
tasks. This shows that not all models are equally suited to handle advanced problem-solving involving
nested SQL queries.
Overall, these findings underscore the importance of tailoring evaluation and training methodologies
to the individual strengths and weaknesses of each model. By adopting this approach, we can better
understand the performance variations across different models and provide targeted improvements
to enhance their problem-solving abilities. Furthermore, this analysis highlights the potential of
LLM-based agents in real-world applications, and the need to push their boundaries through continued
research and development.
The Effectiveness of Mathematical Code Creation Following our evaluation of the LLM’s profi-
ciency in creating complex SQL queries, we now shift our focus to another tool creation: the creation
of mathematical code. While large language models possess significant general capabilities, they
often fall short of providing highly accurate solutions to mathematical problems.
Guiding these LLMs to generate mathematical code, and subsequently leveraging external tools to
execute and derive the solutions, could significantly enhance their ability to tackle mathematical
challenges.
In the upcoming section, we will conduct a detailed evaluation of guiding these LLMs to generate
mathematical code. We aim to shed light on the true capability of these models in generating
mathematical code and to elucidate the extent to which they can be utilized to aid in mathematical
problem-solving. The prompt about how to guide LLMs is shown in Figure 15 in Appendix B.
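A minimal sketch of this generate-then-execute pattern is given below (our own illustration: the llm
callable is a placeholder, and the in-process exec call and numeric tolerance are assumptions rather
than the paper’s setup). The LLM is asked for Python code, an external interpreter runs it, and the
printed value is compared with the expected answer.

    import contextlib
    import io
    import math
    from typing import Callable

    def solve_with_code(llm: Callable[[str], str], question: str) -> str:
        code = llm(f"Write Python code that prints the numeric answer to: {question}")
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):   # capture whatever the generated code prints
            exec(code, {"math": math})             # note: a real system should sandbox this call
        return buffer.getvalue().strip()

    def is_correct(predicted: str, expected: float, tol: float = 1e-3) -> bool:
        try:
            return abs(float(predicted) - expected) < tol
        except ValueError:
            return False

    # Toy check with a fake "LLM" that returns the right code for sqrt(24).
    fake_llm = lambda prompt: "print(math.sqrt(24))"
    print(is_correct(solve_with_code(fake_llm, "square root of 24"), 4.898979485566356))  # True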
Table 10: The evaluation results for mathematical questions.
Model     ChatGPT   Claude   Ziya   ChatGLM   Chinese-Alpaca-Plus   InternLM
Accuracy  90%       85%      50%    0%        55%                   95%
The results shown in Table 10 indicate that the capabilities of LLM-based agents to generate
mathematical code vary considerably. High-performing models like ChatGPT, Claude, and InternLM
display excellent proficiency, suggesting their potent ability to solve complex mathematical tasks.
Middle-tier models, such as Ziya, show moderate success, indicating the potential for improvement
and adaptability with the right training and optimization. Surprisingly, Chinese-Alpaca-Plus
demonstrated a notable proficiency in mathematical tasks, despite its poor performance in SQL
generation, suggesting a possible inclination towards mathematical problems. In contrast, ChatGLM
struggles significantly with mathematical code generation, underlining a potential weak spot in its
capabilities and the need for focused improvement in this area.
Overall, these results underscore the task-dependent nature of LLMs’ capabilities and highlight the
importance of recognizing their individual strengths and weaknesses for optimal model guidance and
enhanced problem-solving.
We now aim to utilize the one-step agent and sequential agent, which we designed, to conduct an
evaluation involving multiple tools. Corresponding prompts for each agent type have been crafted
and are presented in Figure 16 and Figure 17 of Appendix B, respectively.
In this phase of the evaluation, we need to automatically invoke the respective tools through code
and produce the results. Given that user interface-based LLMs lack the capability to call external
tools, we will only utilize the following four API-based LLMs (ChatGPT, Ziya, Chinese-Alpaca-Plus, and
InternLM) for this comprehensive evaluation of external tool usage ability.
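Conceptually, this automatic invocation reduces to a small dispatcher that routes each planned
tool-subtask pair to an executable backend and threads the intermediate results forward; the sketch
below uses stub backends and is our own simplification rather than the evaluation harness used in the
paper.

    from typing import Callable, Dict, List, Tuple

    def run_tools(plan: List[Tuple[str, str]],
                  backends: Dict[str, Callable[[str, str], str]]) -> List[str]:
        # Execute each (tool, subtask) pair in order, passing the previous result forward
        # so later subtasks (e.g., "multiply X by 100") can refer to earlier answers.
        results: List[str] = []
        context = ""
        for tool, subtask in plan:
            if tool not in backends:
                results.append(f"ERROR: unknown tool '{tool}'")
                continue
            output = backends[tool](subtask, context)
            results.append(output)
            context = output
        return results

    # Stub backends standing in for the real SQL Generator and PythonREPL tools.
    backends = {
        "SQL Generator": lambda subtask, ctx: "3",            # pretend the query returned 3
        "PythonREPL":    lambda subtask, ctx: str(100 * int(ctx)),
    }
    plan = [("SQL Generator", "count colleagues with five years of service"),
            ("PythonREPL", "multiply the count by 100")]
    print(run_tools(plan, backends))  # ['3', '300']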
Table 11: The evaluation results for end-to-end ability of multiple tools.
Model     ChatGPT   Ziya   Chinese-Alpaca-Plus   InternLM
TPTU-OA   50%       0%     0%                    15%
TPTU-SA   55%       0%     0%                    20%
With agents mentioned above, the final results are presented in Table 11. The evaluation results
demonstrate varying levels of task planning and tool usage capabilities among the four API-based
LLMs. In the TPTU-OA evaluation, ChatGPT achieved a performance rate of 50%, significantly
outperforming the other models, with InternLM at 15%, while both Ziya and Chinese-Alpaca-Plus did not
manage to complete any tasks successfully, resulting in a score of 0%. In the TPTU-SA evaluation,
an overall slight improvement was observed. ChatGPT maintained its leading position, with a slightly
improved performance rate of 55%. InternLM also exhibited better performance, achieving a score of
20%, whereas Ziya and Chinese-Alpaca-Plus again failed to register any successful task completion.
These results reflect a notable discrepancy in the performance of LLMs when it comes to using
external tools. ChatGPT and InternLM have demonstrated some ability to navigate these tasks,
but their performance rates suggest there is significant room for improvement. The performance of
Ziya and Chinese-Alpaca-Plus indicates that they struggle to effectively utilize external tools in their
current state.
The differential performance between the TPTU-OA and TPTU-SA evaluation hints at the possible
impact of the agent design on the LLMs’ task execution ability. In particular, the performance increase
under the sequential agent framework suggests that breaking down tasks into sequential steps might
help LLM-based AI agents better utilize external tools. This insight could prove valuable in future
improvements and developments of LLM-based AI agents. However, even with this approach, it is
clear that LLM-based AI agents are far from perfect when it comes to effectively using external tools
for complex tasks. This finding underlines the importance of further investigation and improvement
in this domain.
3.4 Insightful Observations
Upon closer observation of our experimental results, we have identified several phenomena that
deserve further exploration. These findings serve to broaden our understanding of LLM-based
agents’ behavior and capabilities and provide essential insights that could shape future research in
this field. In the following, we will dissect these phenomena, as shown in Figures 3-6, casting light on
the weaknesses of LLM-based agents in the context of task planning and tool usage.
(One of the example queries shown in these figures: “How many more concerts has Jay Chou held
than Li Ronghao? Is this number bigger than the square root of 10?”)
3. Endless Extensions: LLMs tend to overutilize a particular tool, even in instances where
a single use would suffice for the correct result. This issue can lead to extended and
nonsensical planning, where the same subtask is repeatedly solved.
4. Lack of Summary Skills: LLMs do not take into account the responses to subproblems,
relying instead on their internalized knowledge to generate the final answer. This may lead
to a scenario where the final response only addresses a portion of the original query.
By identifying and addressing these common issues, we stand a better chance at improving and
refining LLMs, thereby unlocking their full potential.
(The figures give concrete examples of these failure cases. In one case of endless extensions, for the
query “Exclude the two birthplaces with the most singers, provide the number of singers from other
birthplaces, and calculate the factorial of this number”, the agent issues repeated Tool_Queries to the
SQL Generator with progressively expanded subtask descriptions, starting from {{"SQL Generator":
"Not the two birthplaces with the most singers"}}, instead of stopping after a single correct call.
Another example query is: “Please use SQL language to query who are the singers who have not been
nominated in the Golden Melody Awards? Give their names.”)
4 Related Work
The remarkable human capacity for tool usage and creation has facilitated the transcendence of our
innate physical and cognitive constraints, thereby profoundly advancing the progress and prosperity
of human civilization and society. The swift advancement of LLMs has made it feasible for them to use
and create tools like humans. The integration of specialized tools with LLMs has unlocked substantial
potential in addressing intricate tasks. In this section, we offer a concise synopsis of the relevant
research pertaining to tool learning based on LLMs.
The initial advancements in tool learning were constrained by the capabilities of artificial
intelligence (AI) models [21]. Traditional deep learning approaches exhibit limitations in their
comprehension of tool functionality and user intentions, as well as in common sense reasoning abilities.
Consequently, these limitations directly result in a notable decline in the stability and precision of tool
learning methodologies. Recently, the advent of LLM has marked a pivotal juncture in the realm of
tool learning. LLMs encompass a broad spectrum of common sense cognitive capabilities and exhibit
remarkable proficiencies in natural language processing, reasoning, and interactive decision-making
[22–26]. These attributes furnish indispensable prerequisites for LLMs to comprehend user intentions
and effectively employ tools in tackling intricate tasks [27]. Simultaneously, the advancement of
fine-tuning [28–32] and in-context learning [33, 34] technology has offered robust support to LLM
in addressing increasingly intricate challenges. In addition, tool usage can mitigate the inherent
limitations of LLMs, encompassing the acquisition of up-to-date information from real-world events,
refined mathematical computational abilities, and the mitigation of potential hallucinatory phenomena [35].
Within the realm of embodied intelligence [36–38], LLMs engage in direct interactions with tangible
tools like robots in order to enhance their cognitive abilities, optimize work productivity, and expand
functional capacities. LLMs possess the capability to automatically devise action steps based on
user intentions, enabling the guidance of robots in the completion of tasks [39–47], or alternatively,
to directly generate underlying code that can be executed by robots [48–52]. Palm-E [44] introduced
a multimodal language model which seamlessly integrates sensor data into its framework, enabling
efficient planning of robot actions and task completion. Code as Policies (CaP) [52] facilitates the
transformation of natural language instructions into code fragments that can be directly compiled and
executed on robots. As for Inner Monologue [42], LLM incorporates diverse environmental feedback
to construct inner monologues, thereby formulating effective robot control strategies. Furthermore,
LP-SLAM [39] proposes a simultaneous localization and mapping (SLAM) system empowered
with language perception capabilities, exploiting the potential of ChatGPT. PromptCraft [51], on the
other hand, devises a function library tailored to ChatGPT on the robot platform, streamlining the
conversion of user intentions into executable tasks via the underlying backend API.
In addition to directly changing the real environment through interaction with tools in the physical
world, LLM can also utilize software tools such as search engines [53–61], mobile [62, 63], Microsoft
Office [64, 65], calculators [66–68], deep models [13, 69–76] and other versatile APIs [77–80, 14, 81]
to enhance model performance or complete complex workflows through flexible control of the
software. Toolformer [78] employs a self-supervised methodology to fine-tune the language model,
enabling it to acquire the ability to automatically invoke APIs. ART [82] leverages CoT [20] and
In-context Learning [76, 35] techniques to automatically generate multi-step reasoning processes
for new tasks, while also selecting and utilizing the most appropriate available tool at each step.
ASH [56] utilizes LLM for sequence hierarchical decision-making to achieve web navigation tasks.
WebGPT [60] and WebCPM [58] use network search to assist in implementing Question Answering
tasks. In addition, RCI [83] recursively criticizes and improves itself to execute computer tasks guided
by natural language according to the prompting scheme. To achieve the analysis and processing of
tables, TableGPT [65] employs a table encoder to transform tabular data into vector representations,
which are then fed into an LLM for inference in combination with user queries.
The usage of tools is contingent upon the accessibility of external tools. Recently, efforts have been
made to employ LLM as a tool creator in order to generate tools that can be utilized for diverse
requests [84–91]. This development has consequently raised the demands placed on LLMs, and
these created tools are typically implemented as Python or SQL functions. LATM [84], for example,
leverages the prowess of GPT-4 to create tools, and the usage of more cost-effective models has
shown potential in exhibiting performance on par with larger models for these tool applications.
EVAPORATE [90] involves the synthesis of multiple functions, which are subsequently utilized at a
large scale to efficiently process documents and generate structured views.
5 Conclusion
In this paper, we have introduced a structured framework specially designed for LLM-based AI
Agents, with an emphasis on their abilities in task planning and tool usage. This framework, coupled
with our design of two distinct types of agents assigned for the inference process, allows for a
comprehensive evaluation of the capabilities of current open-source LLMs, thereby yielding critical
insights into their effectiveness. Furthermore, our research highlights the significant potential of
LLMs in managing complex tasks, revealing the exciting prospects they hold for future research
and development. As we continue to explore and improve upon these models, we move closer to
unlocking their full potential in a wide range of real-world applications.
Acknowledgements
the survey of API or GUI-based Tool Scheduling by LLMs. Bin Zhang summarized these papers and
synthesized an overarching summary.
As for the evaluation phase, Yihong Chen, Tianpeng Bao, Jingqing Ruan, Guoqing Du, Zhiwei Xu,
Shiwei Shi, and Bin Zhang performed the experiments and analyzed the data. Hangyu Mao assisted
in the analysis of the experimental phenomena and offered constructive suggestions for improvements.
Xingyu Zeng and Rui Zhao provided invaluable feedback and contributed to the direction of the research.
All authors participated in the discussion.
Regarding the manuscript phase, Hangyu Mao organized the overall chapters of the manuscript and
mainly wrote the methodology part, and provided assistance in other parts. Jingqing Ruan and Yihong
Chen wrote the evaluation section. Bin Zhang wrote the summary of the literature review. Each
author read and approved the final manuscript.
The authors would like to thank Feng Zhu, Ziyue Li, Kun Wang, Yuhang Ran, Mengying Xu, Pengfei
Jia, and Shaobo Lin for their valuable feedback, discussion, and participation in this project.
References
[1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong
et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural
information processing systems, vol. 33, pp. 1877–1901, 2020.
[3] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le,
“Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
[4] N. R. Jennings, K. Sycara, and M. Wooldridge, “A roadmap of agent research and development,”
Autonomous agents and multi-agent systems, vol. 1, pp. 7–38, 1998.
[5] N. R. Jennings and M. Wooldridge, “Applying agent technology,” Applied Artificial Intelligence
an International Journal, vol. 9, no. 4, pp. 357–369, 1995.
[6] S. Franklin and A. Graesser, “Is it an agent, or just a program?: A taxonomy for autonomous
agents,” in International workshop on agent theories, architectures, and languages. Springer,
1996, pp. 21–35.
[7] C. Castelfranchi, “Modelling social action for ai agents,” Artificial intelligence, vol. 103, no.
1-2, pp. 157–182, 1998.
[8] J. Ferber and G. Weiss, Multi-agent systems: an introduction to distributed artificial intelligence.
Addison-wesley Reading, 1999, vol. 1.
[9] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous
agents and multi-agent systems, vol. 11, pp. 387–434, 2005.
[10] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang, “On the tool manipulation capability of
open-source large language models,” arXiv preprint arXiv:2305.16504, 2023.
[11] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al.,
“Toolllm: Facilitating large language models to master 16000+ real-world apis,” arXiv preprint
arXiv:2307.16789, 2023.
[12] M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li, “Api-bank: A benchmark for
tool-augmented llms,” arXiv preprint arXiv:2304.08244, 2023.
[13] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected
with massive apis,” arXiv preprint arXiv:2305.15334, 2023.
[14] Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and L. Sun, “Toolalpaca: Generalized tool learning
for language models with 3000 simulated cases,” arXiv preprint arXiv:2306.05301, 2023.
[15] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,”
Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[16] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho-
seini, C. McKinnon et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint
arXiv:2212.08073, 2022.
[17] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia et al.,
“Glm-130b: An open bilingual pre-trained model,” arXiv preprint arXiv:2210.02414, 2022.
[18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv
preprint arXiv:2302.13971, 2023.
[19] Y. Cui, Z. Yang, and X. Yao, “Efficient and effective text encoding for chinese llama and alpaca,”
arXiv preprint arXiv:2304.08177, 2023.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou,
“Chain-of-thought prompting elicits reasoning in large language models,” Neural Information
Processing Systems, 2022.
[21] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein,
J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,”
arXiv preprint arXiv:2108.07258, 2021.
[22] M. Mosbach, T. Pimentel, S. Ravfogel, D. Klakow, and Y. Elazar, “Few-shot fine-tuning vs.
in-context learning: A fair comparison and evaluation,” arXiv preprint arXiv:2305.16938, 2023.
[23] J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, B. Yin, and X. Hu, “Harnessing the power
of llms in practice: A survey on chatgpt and beyond,” arXiv preprint arXiv:2304.13712, 2023.
[24] C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K. Dam, M. Zhang, J. U. Kim, S. T. Kim,
J. Choi et al., “One small step for generative ai, one giant leap for agi: A complete survey on
chatgpt in aigc era,” arXiv preprint arXiv:2304.06488, 2023.
[25] F. Yu, H. Zhang, and B. Wang, “Nature language reasoning, a survey,” arXiv preprint
arXiv:2303.14725, 2023.
[26] Z. Wang, G. Zhang, K. Yang, N. Shi, W. Zhou, S. Hao, G. Xiong, Y. Li, M. Y. Sim, X. Chen
et al., “Interactive natural language processing,” arXiv preprint arXiv:2305.13246, 2023.
[27] Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, Y. Huang, C. Xiao, C. Han et al.,
“Tool learning with foundation models,” arXiv preprint arXiv:2304.08354, 2023.
[28] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, and M. Jiang, “A survey of knowledge-enhanced
text generation,” ACM Computing Surveys, vol. 54, no. 11s, pp. 1–38, 2022.
[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora:
Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[30] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. At-
tariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International Conference
on Machine Learning. PMLR, 2019, pp. 2790–2799.
[31] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv
preprint arXiv:2101.00190, 2021.
[32] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “Gpt understands, too,” arXiv
preprint arXiv:2103.10385, 2021.
[33] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing
reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022.
[34] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal,
“Decomposed prompting: A modular approach for solving complex tasks,” arXiv preprint
arXiv:2210.02406, 2022.
[35] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick,
J. Dwivedi-Yu, A. Celikyilmaz et al., “Augmented language models: a survey,” arXiv preprint
arXiv:2302.07842, 2023.
[36] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied ai: From simulators to
research tasks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6,
no. 2, pp. 230–244, 2022.
[37] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun,
J. Malik et al., “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE/CVF
international conference on computer vision, 2019, pp. 9339–9347.
[38] S. Franklin, “Autonomous agents as embodied ai,” Cybernetics & Systems, vol. 28, no. 6, pp.
499–520, 1997.
[39] W. Zhang, Y. Guo, L. Niu, P. Li, C. Zhang, Z. Wan, J. Yan, F. U. D. Farrukh, and D. Zhang,
“Lp-slam: Language-perceptive rgb-d slam system based on large language model,” arXiv
preprint arXiv:2303.10089, 2023.
[40] D. Shah, B. Osiński, S. Levine et al., “Lm-nav: Robotic navigation with large pre-trained
models of language, vision, and action,” in Conference on Robot Learning. PMLR, 2023, pp.
492–504.
[41] A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang,
R. Julian et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in
Conference on Robot Learning. PMLR, 2023, pp. 287–318.
[42] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch,
Y. Chebotar et al., “Inner monologue: Embodied reasoning through planning with language
models,” arXiv preprint arXiv:2207.05608, 2022.
[43] B. Chen, F. Xia, B. Ichter, K. Rao, K. Gopalakrishnan, M. S. Ryoo, A. Stone, and D. Kappler,
“Open-vocabulary queryable scene representations for real world planning,” in 2023 IEEE
International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 11509–11522.
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson,
Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint
arXiv:2303.03378, 2023.
[45] N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “Chatgpt empow-
ered long-step robot control in various environments: A case application,” arXiv preprint
arXiv:2304.03893, 2023.
[46] K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Ground-
ing large language models using 3d scene graphs for scalable task planning,” arXiv preprint
arXiv:2307.06135, 2023.
[47] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, and Y. Su, “Llm-planner: Few-
shot grounded planning for embodied agents with large language models,” arXiv preprint
arXiv:2212.04088, 2022.
[48] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus-
man, A. Herzog, J. Hsu et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv
preprint arXiv:2212.06817, 2022.
[49] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, B. Zitkovich,
F. Xia, C. Finn et al., “Open-world object manipulation using pre-trained vision-language
models,” arXiv preprint arXiv:2303.00905, 2023.
[50] S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron,
M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg et al., “A generalist agent,” arXiv preprint
arXiv:2205.06175, 2022.
[51] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chatgpt for robotics: Design principles
and model abilities,” Microsoft Auton. Syst. Robot. Res, vol. 2, p. 20, 2023.
[52] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code
as policies: Language model programs for embodied control,” in 2023 IEEE International
Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9493–9500.
[53] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval augmented language model
pre-training,” in International conference on machine learning. PMLR, 2020, pp. 3929–3938.
[54] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t.
Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,”
Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[55] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driess-
che, J.-B. Lespiau, B. Damoc, A. Clark et al., “Improving language models by retrieving
from trillions of tokens,” in International conference on machine learning. PMLR, 2022, pp.
2206–2240.
[56] A. Sridhar, R. Lo, F. F. Xu, H. Zhu, and S. Zhou, “Hierarchical prompting assists large language
model on web navigation,” arXiv preprint arXiv:2305.14257, 2023.
[57] H. Furuta, O. Nachum, K.-H. Lee, Y. Matsuo, S. S. Gu, and I. Gur, “Multimodal web navigation
with instruction-finetuned foundation models,” arXiv preprint arXiv:2305.11854, 2023.
[58] Y. Qin, Z. Cai, D. Jin, L. Yan, S. Liang, K. Zhu, Y. Lin, X. Han, N. Ding, H. Wang et al.,
“Webcpm: Interactive web search for chinese long-form question answering,” arXiv preprint
arXiv:2305.06849, 2023.
[59] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web in-
teraction with grounded language agents,” Advances in Neural Information Processing Systems,
vol. 35, pp. 20744–20757, 2022.
[60] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju,
W. Saunders et al., “Webgpt: Browser-assisted question-answering with human feedback,” arXiv
preprint arXiv:2112.09332, 2021.
[61] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning,
“Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” arXiv preprint
arXiv:1809.09600, 2018.
[62] B. Wang, G. Li, and Y. Li, “Enabling conversational interaction with mobile ui using large
language models,” in Proceedings of the 2023 CHI Conference on Human Factors in Computing
Systems, 2023, pp. 1–17.
[63] D. Zhang, L. Chen, and K. Yu, “Mobile-env: A universal platform for training and evaluation of
mobile interaction,” arXiv preprint arXiv:2305.08144, 2023.
[64] H. Li, J. Su, Y. Chen, Q. Li, and Z. Zhang, “Sheetcopilot: Bringing software productivity to the
next level through large language models,” arXiv preprint arXiv:2305.19308, 2023.
[65] L. Zha, J. Zhou, L. Li, R. Wang, Q. Huang, S. Yang, J. Yuan, C. Su, X. Li, A. Su et al.,
“Tablegpt: Towards unifying tables, nature language and commands into one gpt,” arXiv preprint
arXiv:2307.08674, 2023.
[66] Z. Chen, K. Zhou, B. Zhang, Z. Gong, W. X. Zhao, and J.-R. Wen, “Chatcot: Tool-
augmented chain-of-thought reasoning on chat-based large language models,” arXiv preprint
arXiv:2305.14323, 2023.
[67] A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented language models,” arXiv preprint
arXiv:2205.12255, 2022.
[68] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek,
J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint
arXiv:2110.14168, 2021.
[69] Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and
L. Wang, “Mm-react: Prompting chatgpt for multimodal reasoning and action,” arXiv preprint
arXiv:2303.11381, 2023.
[70] Z. Liu, Y. He, W. Wang, W. Wang, Y. Wang, S. Chen, Q. Zhang, Y. Yang, Q. Li, J. Yu et al.,
“Internchat: Solving vision-centric tasks by interacting with chatbots beyond language,” arXiv
preprint arXiv:2305.05662, 2023.
[71] Y. Ge, W. Hua, J. Ji, J. Tan, S. Xu, and Y. Zhang, “Openagi: When llm meets domain experts,”
arXiv preprint arXiv:2304.04370, 2023.
[72] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with
chatgpt and its friends in huggingface,” arXiv preprint arXiv:2303.17580, 2023.
[73] D. Surís, S. Menon, and C. Vondrick, “Vipergpt: Visual inference via python execution for
reasoning,” arXiv preprint arXiv:2303.08128, 2023.
[74] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual chatgpt: Talking, drawing and
editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023.
[75] T. Gupta and A. Kembhavi, “Visual programming: Compositional visual reasoning without train-
ing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2023, pp. 14953–14962.
[76] L. Chen, B. Li, S. Shen, J. Yang, C. Li, K. Keutzer, T. Darrell, and Z. Liu, “Language models
are visual reasoning coordinators,” in ICLR 2023 Workshop on Mathematical and Empirical
Understanding of Foundation Models, 2023.
[77] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao,
“Chameleon: Plug-and-play compositional reasoning with large language models,” arXiv
preprint arXiv:2304.09842, 2023.
[78] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and
T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint
arXiv:2302.04761, 2023.
[79] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “Critic: Large language
models can self-correct with tool-interactive critiquing,” arXiv preprint arXiv:2305.11738,
2023.
[80] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou, S. Lu, L. Ji, S. Mao et al., “Taskmatrix.ai:
Completing tasks by connecting foundation models with millions of apis,” arXiv preprint
arXiv:2303.16434, 2023.
[81] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augmenting frozen language models with
massive tools via tool embeddings,” arXiv preprint arXiv:2305.11554, 2023.
[82] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro, “Art:
Automatic multi-step reasoning and tool-use for large language models,” arXiv preprint
arXiv:2303.09014, 2023.
[83] G. Kim, P. Baldi, and S. McAleer, “Language models can solve computer tasks,” arXiv preprint
arXiv:2303.17491, 2023.
[84] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou, “Large language models as tool makers,” arXiv
preprint arXiv:2305.17126, 2023.
[85] R. H. Lewis and J. Jiao, “Computegpt: A computational chat model for numerical problems,”
arXiv preprint arXiv:2305.06223, 2023.
[86] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-
aided language models,” in International Conference on Machine Learning. PMLR, 2023, pp.
10764–10799.
[87] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager:
An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291,
2023.
[88] C. Qian, C. Han, Y. R. Fung, Y. Qin, Z. Liu, and H. Ji, “Creator: Disentangling abstract
and concrete reasonings of large language models through tool creation,” arXiv preprint
arXiv:2305.14318, 2023.
[89] Y. Cai, S. Mao, W. Wu, Z. Wang, Y. Liang, T. Ge, C. Wu, W. You, T. Song, Y. Xia et al.,
“Low-code llm: Visual programming over llms,” arXiv preprint arXiv:2304.08103, 2023.
[90] S. Arora, B. Yang, S. Eyuboglu, A. Narayan, A. Hojel, I. Trummer, and C. Ré, “Language
models enable simple systems for generating structured views of heterogeneous data lakes,”
arXiv preprint arXiv:2304.09433, 2023.
[91] W. Zhang, Y. Shen, W. Lu, and Y. Zhuang, “Data-copilot: Bridging billions of data and humans
with autonomous workflow,” arXiv preprint arXiv:2306.07209, 2023.
A Detailed Dataset Description
Simple SQL queries: These queries typically involve basic operations such as SELECT, FROM,
WHERE, GROUP BY, etc. They are used to retrieve, filter, group, and sort data from a single table.
We give the schemas of the two tables in the SQL database in Tables 12 and 13 and list several examples
in Table 14.
Table 12: Schema of the Person table
Column Name     Type
id              TEXT
name            TEXT
age             INTEGER
sex             TEXT
school          TEXT
phone           TEXT
qualifications  TEXT
ability         TEXT

Table 13: Schema of the School table
Column Name  Type
id           TEXT
name         TEXT
info_985     TEXT
info_211     TEXT
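For illustration (a made-up query with toy rows in the spirit of these simple questions, not one of the
120 benchmark pairs), a typical single-table query over the Person schema filters and groups as follows.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Person (id TEXT, name TEXT, age INTEGER, sex TEXT, school TEXT,"
               " phone TEXT, qualifications TEXT, ability TEXT)")
    db.executemany("INSERT INTO Person VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
                   [("1", "Han Meimei", 24, "F", "S1", "130", "MSc", "Python"),
                    ("2", "Li Lei", 30, "M", "S2", "131", "BSc", "SQL")])
    # A typical simple query: filter on age and group by school.
    rows = db.execute("SELECT school, COUNT(*) FROM Person WHERE age > 25 GROUP BY school").fetchall()
    print(rows)  # [('S2', 1)]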
Complex nested SQL queries: These queries contain subqueries, which are SQL queries nested
inside a larger query. Nested queries can be used in various clauses such as SELECT, FROM,
WHERE, and HAVING. They provide a way to perform multiple operations or calculations across
multiple tables. We give the schemas of the tables in the SQL database in Tables 15, 16, 17, and 18
and list several examples in Table 19.
Table 17: Schema of the Singers table
Complex nested queries utilizing multiple tools: These are advanced queries that involve multiple
tools, such as SQL queries, Python code generation, user-defined functions, etc. We give the schemas
of two tables in the SQL database in Tables 20 and 21, and list several examples in Table 22. For
verifying the planning ability of the LLM-based AI agents, we select this type of query.
Table 20: Schema of the Journal table
Column Name           Type
Name                  TEXT
First_Issue_Date      TIME
Journal_ID            INTEGER
Category              TEXT
Sponsor_Organization  TEXT
Country               TEXT
Language              TEXT
Publication_Count     INTEGER

Table 21: Schema of the CoverPersonality table
Column Name  Type
Journal_ID   INTEGER
Person_ID    INTEGER
Count        INTEGER
Table 22: Demonstrations of complex nested queries utilizing multiple tools.

Example 1
Table ID: Journal & CoverPersonality
Question: Calculate the exponential of 3 and list the names and languages of journals with no cover personality.
Answer: [20.08, "The Economist: Chinese, Reader’s Digest: English."]
Planning Tools: ["PythonREPL", "SQL Generator"]
SQL reference: select Name, Language from Journal where Journal_ID not in ( select Journal_ID from CoverPersonality )
Code reference: import math; return math.exp(3)

Example 2
Table ID: CoverPersonality & Journal
Question: Compute 4’s factorial, compare with GCD of 212 and list the names and languages of journals with no cover personality.
Answer: [4, "The Economist: Chinese, Reader’s Digest: English."]
Planning Tools: ["PythonREPL", "SQL Generator"]
SQL reference: select Name, Language from Journal where Journal_ID not in ( select Journal_ID from CoverPersonality )
Code reference: import math; math.gcd(math.factorial(4), 212)

Example 3
Table ID: Journal
Question: Calculate the square root of 24, and query for the language whose average number of published issues exceeds the overall average.
Answer: [4.8989795, "English"]
Planning Tools: ["PythonREPL", "SQL Generator"]
SQL reference: select Language from Journal group by Language having avg ( Publication_Count ) > ( select avg ( Publication_Count ) from Journal )
Code reference: import math; math.sqrt(24)

Example 4
Table ID: CoverPersonality & Journal
Question: Compute the log base 10 of 5, then identify cover figures appearing less than the overall max frequency across journals.
Answer: [0.69897, "Qing Hai, Xiaoming Huang, Cristiano Ronaldo, Kobe Bryant"]
Planning Tools: ["PythonREPL", "SQL Generator"]
SQL reference: select Person_ID from CoverPersonality where Count < ( select max ( Count ) from CoverPersonality )
Code reference: import math; math.log10(5)
B Prompts Design
Figure 8: The evaluation prompt for tool order and subtask description planning.
Figure 9: The evaluation prompt for one-step tool-subtask pair planning.
Figure 10: The prompt added to Figure 9 for tool-subtask pair planning with other unrelated tools.
Figure 11: The prompt for the tool-subtask pair generation with TPTU-SA.
Figure 12: The evaluation prompt for simple SQL questions.
Figure 13: The evaluation prompt for complex nested SQL questions.
Figure 14: The evaluation CoT-based prompt for complex nested SQL questions.
Figure 15: The evaluation prompt for mathematical questions.
Figure 16: The system prompt for one-step agent.
Figure 17: The system prompt for the sequential agent.