…bility of each VFM and specifies the input-output formats; 2) converts different visual information, for instance, PNG images, depth images, and mask matrices, into a language format to help ChatGPT understand; and 3) handles the histories, priorities, and conflicts of different Visual Foundation Models. With the help of the Prompt Manager, ChatGPT can leverage these VFMs and receive their feedback in an iterative manner until it meets the requirements of the users or reaches the ending condition.

As shown in Fig. 1, a user uploads an image of a yellow flower and enters a complex language instruction: "please generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon, step by step". With the help of the Prompt Manager, Visual ChatGPT starts a chain of execution of related Visual Foundation Models. In this case, it first applies the depth estimation model to detect the depth information, then utilizes the depth-to-image model to generate a figure of a red flower with the depth information, and finally leverages the style transfer VFM based on the Stable Diffusion model to change the style of this image into a cartoon. During the above pipeline, the Prompt Manager serves as a dispatcher for ChatGPT by providing the types of visual formats and recording the process of information transformation. Finally, when Visual ChatGPT obtains the hint of "cartoon" from the Prompt Manager, it ends the execution pipeline and shows the final result.
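To make this chain of execution concrete, the following minimal Python sketch writes out the three VFM calls for this query by hand. The function names and the returned file paths are illustrative placeholders only: in Visual ChatGPT the chain is assembled by ChatGPT itself from the models listed in Table 1, not hard-coded.

```python
# Illustrative stand-ins for three VFMs; in Visual ChatGPT the real models
# (see Table 1) are chosen and chained by ChatGPT itself, not hard-coded.
def depth_estimation(image_path: str) -> str:
    """Predict a depth map and return the path of the saved depth image."""
    return "image/depth_of_" + image_path.split("/")[-1]

def depth_to_image(depth_path: str, prompt: str) -> str:
    """Generate a new image conditioned on a depth map and a text prompt."""
    return "image/generated_" + depth_path.split("/")[-1]

def style_transfer(image_path: str, style: str) -> str:
    """Re-render the image in the requested style, e.g., cartoon."""
    return f"image/{style}_" + image_path.split("/")[-1]

# The chain executed for the query in Fig. 1, written out step by step:
depth = depth_estimation("image/yellow_flower.png")   # 1) depth estimation
red_flower = depth_to_image(depth, "a red flower")    # 2) depth-to-image
cartoon = style_transfer(red_flower, "cartoon")       # 3) style transfer
print(cartoon)
```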
In summary, our contributions are as follows:

• We propose Visual ChatGPT, which opens the door to combining ChatGPT and Visual Foundation Models and enables ChatGPT to handle complex visual tasks;

• We design a Prompt Manager, in which we involve 22 different VFMs and define the internal correlations among them for better interaction and combination;

• Massive zero-shot experiments are conducted and abundant cases are shown to verify the understanding and generation abilities of Visual ChatGPT.

2. Related Works

2.1. Natural Language and Vision

Surrounded by various modalities (sound, vision, video, etc.), language and vision are the two main mediums for transmitting information in our lives. There is a natural link between natural language and vision, and most questions require joint modeling of both streams to produce satisfactory results [15, 26, 48]; e.g., visual question answering (VQA) [2] takes an image and one corresponding question as input and requires generating an answer according to the information in the given image. Owing to the success of large language models (LLMs) like InstructGPT [29], one can easily interact with the model or obtain feedback in the natural language format, but those LLMs are incapable of processing visual information. To fuse the vision processing ability into such LLMs, several challenges lie ahead, since it is hard to train either large language models or vision models, and well-designed instructions [4, 55, 21] and cumbersome conversions [30, 52] are required to connect different modalities. Although several works have explored leveraging pre-trained LLMs to improve performance on vision-language (VL) tasks, those methods only support specific VL tasks (from language to vision or from vision to language) and require labeled data for training [38, 1, 22].

2.2. Pre-trained Models for VL tasks

To better extract visual features, frozen pre-trained image encoders are adopted in early works [9, 25, 54], and the recent LiT [52] applies CLIP pre-training [30] with a frozen ViT model [51]. From another perspective, exploiting the knowledge from LLMs also counts. Following the introduction of the Transformer [39], pre-trained LLMs demonstrate powerful text understanding and generation capabilities [31, 19, 37, 5], and such breakthroughs also benefit VL modelling [13, 14, 3, 49], where these works add extra adapter modules [17] to the pre-trained LLMs to align visual features to the text space. With the increased number of model parameters, it is hard to train those pre-trained LLMs, so more effort has been devoted to directly leveraging off-the-shelf frozen pre-trained LLMs for VL tasks [12, 38, 8, 46, 50].

2.3. Guidance of Pre-trained LLMs for VL tasks

To deal with complex tasks, e.g., commonsense reasoning [11], Chain-of-Thought (CoT) was proposed to elicit the multi-step reasoning abilities of LLMs [42]. More concretely, CoT asks the LLMs to generate intermediate answers for the final results. An existing study [57] divides this technique into two categories: Few-Shot-CoT [56] and Zero-Shot-CoT [20]. In the few-shot setting, the LLMs perform CoT reasoning with several demonstrations [58, 41], and it turns out that the LLMs can acquire better abilities to solve complex problems. Further, recent studies [20, 47] have shown that LLMs can be self-improved by leveraging self-generated rationales under the zero-shot setting. The above studies mainly focus on a single modality, i.e., language. Recently, Multimodal-CoT [57] was proposed to incorporate the language and vision modalities into a two-stage framework that separates rationale generation and answer inference. However, this method only shows its superiority in specific scenarios, i.e., the ScienceQA benchmark [28]. In a nutshell, our work extends the potential of CoT to massive tasks, including but not limited to text-to-image generation [27], image-to-image translation [18], image-to-text generation [40], etc.
Figure 2. Overview of Visual ChatGPT. The left side shows a three-round dialogue, the middle shows the flowchart of how Visual ChatGPT iteratively invokes Visual Foundation Models and provides answers, and the right side shows the detailed process of the second QA.
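Read linearly, the flow in Fig. 2 is a loop: ChatGPT repeatedly determines whether a VFM is needed, the chosen VFM is executed, and the intermediate answer is fed back until no further tool is required. Below is a minimal sketch of such a loop, under the assumption of a `chat` function standing in for a ChatGPT call and a `tools` dictionary of VFM wrappers; both names are hypothetical, not the paper's API.

```python
from typing import Callable, Dict

def run_visual_dialogue(chat: Callable[[str], str],
                        tools: Dict[str, Callable[[str], str]],
                        prompt: str, max_steps: int = 10) -> str:
    """Iteratively ask the LLM whether a VFM is needed ("Determine"), execute the
    chosen tool ("Execute"), feed the intermediate answer back, and stop once the
    LLM answers without requesting a tool."""
    history = prompt
    for _ in range(max_steps):
        reply = chat(history)
        if "Action:" not in reply or "Action Input:" not in reply:
            return reply                                   # no VFM needed: final output
        tool_name = reply.split("Action:", 1)[1].splitlines()[0].strip()
        tool_input = reply.split("Action Input:", 1)[1].splitlines()[0].strip()
        intermediate = tools[tool_name](tool_input)        # execute the chosen VFM
        history += f"{reply}\nIntermediate Answer: {intermediate}\n"
    return "Stopped after reaching the maximum number of reasoning steps."
```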
Figure 3. Overview of Prompt Manager. It converts all non-language signals into language so that ChatGPT can understand.
3.1. Prompt Managing of System Principles M(P)

Visual ChatGPT is a system that integrates different VFMs to understand visual information and generate corresponding answers. To accomplish this, some system principles need to be customized and then transferred into prompts that ChatGPT can understand. These prompts serve several purposes, including:

• Role of Visual ChatGPT: Visual ChatGPT is designed to assist with a range of text- and visual-related tasks, such as VQA, image generation, and editing.

• VFMs Accessibility: Visual ChatGPT has access to a list of VFMs to solve various VL tasks. The decision of which foundation model to use is made entirely by the ChatGPT model itself, which makes it easy to support new VFMs and VL tasks.

• Filename Sensitivity: Visual ChatGPT accesses image files according to the filename, so it is crucial to use precise filenames to avoid ambiguity: one round of conversation may contain multiple images and their different updated versions, and misusing filenames would cause confusion about which image is currently being discussed. Therefore, Visual ChatGPT is designed to be strict about filename usage, ensuring that it retrieves and manipulates the correct image files.

• Chain-of-Thought: As shown in Fig. 1, coping with one seemingly simple command may require multiple VFMs; e.g., the query "generate a red flower conditioned on the predicted depth of this image and then make it like a cartoon" requires the depth estimation, depth-to-image, and style transfer VFMs. To tackle such challenging queries by decomposing them into subproblems, CoT is introduced in Visual ChatGPT to help it decide, leverage, and dispatch multiple VFMs.

• Reasoning Format Strictness: Visual ChatGPT must follow strict reasoning formats. Thus, we parse the intermediate reasoning results with elaborate regex matching algorithms and construct a rational input format for the ChatGPT model to help it determine the next execution, e.g., triggering a new VFM or returning the final response (a parsing sketch follows this list).

• Reliability: As a language model, Visual ChatGPT may fabricate fake image filenames or facts, which can make the system unreliable. To handle such issues, we design prompts that require Visual ChatGPT to be loyal to the outputs of the vision foundation models and not to fabricate image content or filenames. Besides, the collaboration of multiple VFMs can increase system reliability, so the prompts we construct guide ChatGPT to leverage VFMs preferentially instead of generating results based on the conversation history.
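To illustrate the Reasoning Format Strictness principle, here is a minimal Python sketch of how such strict-format parsing might look. The regular expressions and the function name are illustrative assumptions rather than the implementation used in the paper; they only rely on the three-line format shown in Fig. 3 ("Thought: Do I need to use a tool?", "Action: ...", "Action Input: ...").

```python
import re

# Hypothetical patterns for the three-line reasoning format shown in Fig. 3.
THOUGHT_RE = re.compile(r"Thought:\s*Do I need to use a tool\?\s*(Yes|No)", re.IGNORECASE)
ACTION_RE = re.compile(r"^Action:\s*(.+)$", re.MULTILINE)
ACTION_INPUT_RE = re.compile(r"^Action Input:\s*(.+)$", re.MULTILINE)

def parse_reasoning(llm_output: str) -> dict:
    """Return either a VFM call (tool name + input) or the final response."""
    thought = THOUGHT_RE.search(llm_output)
    action = ACTION_RE.search(llm_output)
    action_input = ACTION_INPUT_RE.search(llm_output)
    if thought and thought.group(1).lower() == "yes":
        if not (action and action_input):
            # The strict format is violated: better to re-prompt than to guess.
            raise ValueError("Malformed reasoning step: missing Action or Action Input")
        return {"use_vfm": True,
                "tool": action.group(1).strip(),
                "tool_input": action_input.group(1).strip()}
    # "No" (or no tool request found): treat the text as the final answer.
    return {"use_vfm": False, "response": llm_output.strip()}

# Example, using the first step of the dialogue in Fig. 2:
step = parse_reasoning("Thought: Do I need to use a tool? Yes\n"
                       "Action: Replace Something From The Photo\n"
                       "Action Input: image/2db9a50a.png, sofa, desk")
```

Forcing the output into this machine-checkable shape is what lets the system decide deterministically whether to trigger another VFM or return the final response.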
Table 1. Foundation models supported by Visual ChatGPT.
Remove Objects from Image [10, 35]
Replace Objects from Image [10, 35]
Change Image by the Text [35]
Image Question Answering [23]
Image-to-Text [23]              Text-to-Image [35]
Image-to-Edge [45]              Edge-to-Image [53]
Image-to-Line [16]              Line-to-Image [53]
Image-to-Hed [44]               Hed-to-Image [53]
Image-to-Seg [24]               Seg-to-Image [53]
Image-to-Depth [34, 33]         Depth-to-Image [53]
Image-to-NormalMap [34, 33]     NormalMap-to-Image [53]
Image-to-Sketch [44]            Sketch-to-Image [53]
Image-to-Pose [6]               Pose-to-Image [53]

3.2. Prompt Managing of Foundation Models M(F)

Visual ChatGPT is equipped with multiple VFMs to handle various VL tasks. Since these different VFMs may share some similarities, e.g., the replacement of objects in an image can be regarded as generating a new image, and both the Image-to-Text (I2T) task and the Image Question Answering (VQA) task can be understood as giving a response according to the provided image, it is critical to distinguish among them. As shown in Fig. 3, the Prompt Manager specifically defines the following aspects to help Visual ChatGPT accurately understand and handle the VL tasks:

• Name: The name prompt provides an abstract of the overall function of each VFM, e.g., "answer question about the image". It not only helps Visual ChatGPT understand the purpose of the VFM in a concise manner but also serves as the entry to the VFM.

• Usage: The usage prompt describes the specific scenario in which the VFM should be used. For example, the Pix2Pix model [35] is suitable for changing the style of an image. Providing this information helps Visual ChatGPT make informed decisions about which VFM to use for a particular task.

• Inputs/Outputs: The inputs/outputs prompt outlines the format of the inputs and outputs required by each VFM; since the format can vary significantly, it is crucial to provide clear guidelines for Visual ChatGPT to execute the VFMs correctly.

• Example (Optional): The example prompt is optional, but it can help Visual ChatGPT better understand how to use a particular VFM under the specific input template and deal with more complex queries (one such tool description is sketched below).
3.3. Prompt Managing of User Queries M(Qi)

Visual ChatGPT supports a variety of user queries, including language or image inputs, simple or complex ones, and references to multiple images. The Prompt Manager handles user queries in the following two aspects:

• Generate Unique Filename: Visual ChatGPT can handle two types of image-related queries: those that involve newly uploaded images and those that involve references to existing images. For newly uploaded images, Visual ChatGPT generates a unique filename with a universally unique identifier (UUID) and adds the prefix string "image" representing the relative directory, e.g., "image/{uuid}.png". Although the newly uploaded image will not be fed into ChatGPT, a fake dialogue history is generated with a question stating the image's filename and an answer indicating that the image has been received; this fake dialogue history assists the following dialogues. For queries that involve references to existing images, Visual ChatGPT ignores the filename check. This approach has proven beneficial since ChatGPT can understand fuzzy matching in user queries as long as it does not lead to ambiguity, e.g., with UUID names.

… folder, which hints that the following strings represent an image name. Then, the image is named as "{Name}_{Operation}_{Prev_Name}_{Org_Name}", where {Name} is the UUID name mentioned above, {Operation} is the operation name, {Prev_Name} is the unique identifier of the input image, and {Org_Name} is the original name of the image uploaded by users or generated by VFMs. For instance, "image/ui3c_edge-of_o0ec_nji9dcgf.png" is a canny-edge image named "ui3c" of input "o0ec", and the original name of this image is "nji9dcgf". With such a naming rule, ChatGPT is hinted about the attributes of the intermediate result, i.e., that it is an image, and how it was generated from a series of operations (both naming rules are sketched below).

• Call for More VFMs: One core design of Visual ChatGPT is that it can automatically call for more VFMs to finish the user's command. More concretely, we make ChatGPT keep asking itself whether it needs VFMs to solve the current problem by appending the suffix "Thought: " at the end of each generation.
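The two naming conventions, plus the "Thought: " suffix, can be summarized in a short Python sketch. The helper names and the lengths to which the UUIDs are shortened are assumptions inferred from the filename examples in Figs. 2 and 4 (e.g., "483d_replace-something_2db9a50a_2db9a50a.png"), not the exact implementation described in the paper.

```python
import os
import uuid

IMAGE_DIR = "image"  # relative directory prefix described in Sec. 3.3

def unique_filename() -> str:
    """Name a newly uploaded image as 'image/{uuid}.png' (UUID shortened here)."""
    return os.path.join(IMAGE_DIR, f"{uuid.uuid4().hex[:8]}.png")

def chained_filename(operation: str, prev_path: str) -> str:
    """Name a VFM output as 'image/{Name}_{Operation}_{Prev_Name}_{Org_Name}.png'."""
    prev = os.path.splitext(os.path.basename(prev_path))[0]
    parts = prev.split("_")
    prev_name = parts[0]             # unique identifier of the direct input image
    org_name = parts[-1]             # name of the most original image in the chain
    new_name = uuid.uuid4().hex[:4]  # a fresh (shortened) UUID for this output
    return os.path.join(IMAGE_DIR, f"{new_name}_{operation}_{prev_name}_{org_name}.png")

# "Call for More VFMs": every generation is suffixed with "Thought: " so that
# ChatGPT keeps asking itself whether another VFM is needed.
SELF_ASK_SUFFIX = "Thought: "

# e.g., starting from the uploaded image "image/2db9a50a.png" as in Fig. 2:
first = chained_filename("replace-something", "image/2db9a50a.png")
second = chained_filename("pix2pix", first)
print(first, second)
```

Encoding the provenance chain directly in the filename is what hints ChatGPT about an intermediate result and how it was produced without the model ever seeing the pixels.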
Figure 4. Multiple rounds of dialogue between humans and Visual ChatGPT. In the dialogues, Visual ChatGPT can understand human intents, support language and image inputs, and accomplish complex visual tasks such as generation, question answering, and editing.
[Figure: qualitative case studies of the Prompt Manager designs: Filename Sensitivity, Reasoning Format Strictness, Inputs/Outputs, and Example.]