LLMs’ Classification Performance is Overclaimed

Hanzi Xu*  Renze Lou§  Jiangshu Du†  Vahid Mahzoon*  Elmira Talebianaraki*
Zhuoan Zhou*  Elizabeth Garrison*  Slobodan Vucetic*  Wenpeng Yin§
* Temple University   † University of Illinois at Chicago   § Penn State University
{hanzi.xu, slobodan.vucetic}@temple.edu   [email protected]

Abstract

In many classification tasks designed for AI or human to solve, gold labels are typically included within the label space by default, often posed as “which of the following is correct?” This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit the expected comprehension of the task. This paper makes a threefold contribution: i) To our knowledge, this is the first work to identify the limitations of LLMs in classification tasks when gold labels are absent. We define this task as CLASSIFY-W/O-GOLD and propose it as a new testbed for LLMs. ii) We introduce a benchmark, KNOW-NO, comprising two existing classification tasks and one new task, to evaluate CLASSIFY-W/O-GOLD. iii) This work defines and advocates for a new evaluation metric, OMNIACCURACY, which assesses LLMs’ performance in classification tasks both when gold labels are present and absent.¹

¹ Our code, data, and raw results will be publicly available at https://github.com/xhz0809/Know-No

[Figure 1: GPT-4o vs. Human when the gold label is present or absent. The example story reads: “One day, the mouse, Rudd, got a splinter in his paw. His turtle friend Dig came up to Rudd after he heard him yelling, ‘Ouch, I’ve got a splinter in my paw! Can someone help?’” Question: What was the name of the mouse? With options A. Mouse, B. Dig, C. Splinter, D. Rudd, both GPT-4o and the human answer “D. Rudd”; when option D is removed, the human replies “None of them is correct” while GPT-4o still answers “A. Mouse”.]

1 Introduction

Let’s begin with the example in Figure 1, which illustrates the use of GPT-4o for a straightforward classification problem (date: June 10, 2024). In this simple scenario, GPT-4o, the latest large language model (LLM), performs comparably to humans when the correct answer is included among the options. However, interestingly, even this advanced LLM does not show uncertainty by indicating “no correct answer” or “all options seem incorrect”, a behavior consistently demonstrated by humans when the correct answer is not provided.

Why should we be concerned about this particular phenomenon, especially in the era of LLMs? There are two primary reasons: i) Given the versatility of LLMs, they can process inputs with any set of labels by following natural language instructions, even when the correctness of the labels is unknown. The expected behavior from LLMs should mirror that of humans in the previous example: identifying correct labels when present or indicating the absence of correct ones without risking users accepting false responses. In contrast, traditional classifiers, trained on fixed label sets, are limited to predicting within those specific labels and lack the flexibility to handle open sets of labels.
ii) LLMs are predominantly designed as generative models, prioritizing the enhancement of their generative capabilities, often at the expense of discriminative capabilities (Sun et al., 2023b). Many researchers argue that classification tasks are perceived as easy for LLMs, as evidenced by their consistently high performance (Sun et al., 2023b,a; Zhang et al., 2024). However, the above example raises the question of whether the performance of LLMs in classification tasks has been overstated because current evaluation benchmarks and metrics capture only an incomplete picture of human behavior.

To investigate this question, we present three standard classification tasks as benchmarks: BANK-77 (intent classification, Casanueva et al., 2020), MC-TEST (multiple-choice question answering, Richardson et al., 2013), and EQUINFER, a newly assembled task where the objective is to infer the correct equation from four candidates given the surrounding paragraphs in scientific papers. This benchmark, termed KNOW-NO, encompasses classification tasks with inputs of varying lengths, label sizes, and label scopes, including instance-level and task-level label spaces.

We define a novel evaluation metric, OMNIACCURACY, designed to accurately assess the human-level discrimination intelligence of LLMs in classification tasks. This metric integrates the performance of LLMs across two dimensions within the KNOW-NO framework: i) ACCURACY-W/-GOLD, the conventional accuracy when the correct label is provided; ii) ACCURACY-W/O-GOLD, the accuracy when the correct label is not provided. We argue that OMNIACCURACY offers a more comprehensive reflection of LLMs’ classification performance.

In summary, our contributions can be outlined as follows:

• To the best of our knowledge, this is the first study to uncover the limitations of LLMs in classification tasks when gold labels are absent. We designate this task as CLASSIFY-W/O-GOLD and propose it as a novel evaluation framework for LLMs.

• We introduce a benchmark, KNOW-NO, which encompasses two established classification tasks alongside a newly devised one, aimed at evaluating CLASSIFY-W/O-GOLD.

• This study introduces and advocates for a new evaluation metric, OMNIACCURACY, tailored to assess LLMs in classification tasks. This metric combines performance when gold labels are present with performance when they are absent, offering a more comprehensive evaluation of LLMs’ capabilities.

2 Related Work

LLMs’ high performance on classification tasks. It has become a trend to use LLM generation to solve classification problems, either as standalone classification tasks (Sun et al., 2023b,a; Zhang et al., 2024) or mixed with other NLP tasks in multi-task learning (Longpre et al., 2023; Wang et al., 2022; Mishra et al., 2022). Remarkable performance metrics from GPT have been observed, including an average accuracy above 90% on five well-known NLP benchmark text classification datasets (SST-2, AG-News, R8, R52, MR) reported by Sun et al. (2023b) with zero-shot prompting. Additionally, Sun et al. (2023a) demonstrate 95-98% accuracy in sentiment analysis (SST-2/IMDB/Yelp), over 93% in semantic role labeling (CoNLL2009), and 92-98% in part-of-speech identification (Penn, WSJ, Tweets) with few-shot demonstrations. As the latest LLMs are seen as reliable solutions for NLP classification, their true understanding of the essence of the classification task has not been properly evaluated.

LLMs’ challenges in task comprehension. It is essential to investigate the rationale behind a model’s predictions and determine the extent to which we can trust its output (Gunning et al., 2019; Rudin, 2019). Recent studies have raised concerns about whether LLMs truly understand the tasks they perform despite their good performance. Many studies have shown that LLMs can achieve promising performance when they are asked to provide step-by-step reasoning in their answers (Wei et al., 2022; Zelikman et al., 2022; Li et al., 2022). However, LLM-generated reasoning has been found to be unfaithful despite its apparent effectiveness (Turpin et al., 2023; Lanham et al., 2023). The performance boost might be attributed to the extra computation provided by the explanation tokens (Wei et al., 2022; Lanham et al., 2023). An inspection of the reasoning rationales generated by the model reveals that they often fail to make logical sense (Zelikman et al., 2022). It is common to see rationales that simply repeat content from the question without providing a reasonable explanation. Many of the rationales fail to effectively support the claims or address the reasoning
required, indicating that the model often does not truly understand the content and reasoning behind the question, even if it arrives at the correct answer. Similarly, recent works have evaluated the cognition-inspired intelligence of LLMs by testing the latest LLMs on NLP generation tasks across multiple dimensions, including reading comprehension, commonsense reasoning, discourse comprehension, and paragraph/document-level understanding (Wang et al., 2024; Mahowald et al., 2023). When the problem-solving process is decomposed into three sub-steps (knowledge recall, knowledge utilization, and answer generation), the results reveal that, despite high scores in answer generation, the score for knowledge utilization is significantly lower, by up to 34%. Additionally, they point out that LLMs’ proficiency in language processing does not necessarily translate to a similar level of cognitive capability when looking at the correlation between different capabilities, revealing the current shortcomings of LLMs in true understanding.

3 Approach

3.1 KNOW-NO Benchmark

In this work, we collect representative classification datasets to cover (i) multiple task types and difficulty levels, and (ii) various label sizes and label scopes (task-level label space or instance-level label space). Specifically, we build this benchmark, KNOW-NO, from two existing datasets, BANK-77 (Casanueva et al., 2020) and MC-TEST (Richardson et al., 2013), along with a new dataset proposed by us, named EQUINFER. An overview of the KNOW-NO statistics is provided in Table 1. We first briefly introduce BANK-77 and MC-TEST, followed by a detailed explanation of the construction process for EQUINFER.

                                     #input   |input|   label format        #label   label scope      challenge
BANK-77 (Casanueva et al., 2020)      1,000       12    phrase                  77   task-level       moderate
MC-TEST (Richardson et al., 2013)     1,000      220    phrase & sentence        4   instance-level   low
EQUINFER                              1,049    1,925    LaTeX equation           4   instance-level   high

Table 1: Statistics of KNOW-NO. “Task-level label scope” refers to a dataset where all instances share the same set of labels. In contrast, “instance-level label scope” entails instances having varying label sizes and options, as illustrated in the example in the Introduction. The “challenge” categorization depends on whether domain expertise or common knowledge suffices for the task.

BANK-77 (Casanueva et al., 2020). Inputs are simple sentences (customer service queries) in the banking and financial domain, all sharing the same label space of 77 intents (“task-level” labels). The original dataset comprises 13,083 inputs; however, due to budget limitations, we randomly selected 1,000 instances for this study, ensuring coverage of all 77 labels. We chose this dataset because of its moderate difficulty level and large label size.

MC-TEST (Richardson et al., 2013). MC-TEST is a pioneering multiple-choice reading comprehension benchmark. It includes 660 elementary-school-level stories, each accompanied by four multiple-choice questions with four unique answer options per question (“instance-level” labels). We randomly selected 250 stories (1,000 questions) as our test set. We chose MC-TEST because of its simplicity (most questions can be answered by keyword matching rather than deep reasoning), which may help us uncover surprising behaviors of the latest LLMs.
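The BANK-77 subset above (1,000 of 13,083 inputs covering all 77 intents) and the MC-TEST subset are obtained by random subsampling. As a rough illustration of such a label-coverage-preserving draw (an assumption-laden sketch, not the authors’ released code; the function name and data layout are ours), one could proceed as follows:

```python
import random
from collections import defaultdict

def subsample_with_label_coverage(instances, n_total, seed=0):
    """Randomly keep `n_total` examples while guaranteeing every label occurs
    at least once (cf. the BANK-77 subsampling above: 13,083 inputs reduced to
    1,000 covering all 77 intents). `instances` is a list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(instances):
        by_label[label].append(idx)

    # One randomly chosen index per label guarantees full label coverage.
    picked = {rng.choice(indices) for indices in by_label.values()}
    assert len(picked) <= n_total, "budget smaller than the number of labels"

    # Fill the remaining budget uniformly from the not-yet-picked indices.
    rest = [i for i in range(len(instances)) if i not in picked]
    picked.update(rng.sample(rest, n_total - len(picked)))
    return [instances[i] for i in sorted(picked)]
```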
EQUINFER. We design this task to mimic the paper reviewing process, where reviewers must determine whether an equation is valid based on its context. This task requires intensive domain expertise.

i) Data Crawling. We crawled a total of 4,951 papers’ LaTeX source packages from arXiv, focusing on papers accepted by top-tier NLP conferences.² We excluded papers that were unsuitable for this task, including 1) papers without any LaTeX equations and 2) papers with overly complicated equations (e.g., equations with nested structures or custom commands). This filtering process resulted in 1,449 papers. From each paper, we randomly sampled up to 3 equations, leading to a final set of 3,877 equations.

ii) Task Formulation. We formulate this task as multiple-choice classification, where each instance includes a 1,000-word context before the equation and 1,000 words after it, all in the original LaTeX format (details on scaling the optimal context length can be found in Appendix A.1). The model must select the correct LaTeX equation from one positive option (the gold equation from the original paper) and three negative options.

iii) Label Space Construction. To craft high-quality negative options, we mask out the target gold equation in a paper and prompt GPT-4 to generate the masked equation based on the context before and after the equation (100 words on each side).³

iv) Quality Control (Automatic and Manual). We filtered out negative equations if a) they were identical to the gold equation, or b) GPT-4 could easily recognize their flaws. For the latter, we provided GPT-4-Turbo with the negative equation and asked whether the equation had significant flaws. The remaining negative equations are thus “hard” options that can easily deceive the LLMs.⁴ Among the original 3,877 instances, an instance is abandoned if it cannot gather 3 qualified negative equations. This process results in a total of 1,449 instances. Since all the above filtering steps are based on LLMs and may still leave some false negative equations, we asked humans to further filter out classification instances with any suspicious false negative equations (i.e., LLM-crafted negative equations that are logically correct). After human filtering, we have a total of 1,049 classification instances.

An example instance of EQUINFER is shown in Figure 2.

² ACL, EMNLP, NAACL, TACL, EACL, etc.
³ We also provide GPT-4 with the left-hand side of the “=” sign from the gold equation to make the LLM-crafted negative equations more similar to the gold equation, thereby increasing the challenge.
⁴ In practice, for each classification instance, if any of the three negative equations can fool GPT-4, we keep all three negative equations in our dataset. This ensures there is at least one “hard” negative equation in each classification instance.

Context before:
“[· · · ] Therefore, the positive sample is the corresponding augmented sentence, while the negative samples are the augmented versions of other original source sentences from the same mini-batch. e_{x_i} and e_{z_i} are the average representations along the sequence dimension from the encoder outputs. Apart from the contrastive loss, the standard cross-entropy loss is calculated as:”

Context after:
“We combine both losses as the final loss:

L = L_{ce} + \lambda L_{ctr}

where λ is an interpolation factor. We incorporate the augmented source inputs z to ensure that the model can still generate correct translations with noisy input. [· · · ]”

Equation options:
A: L_{ce} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})
B: L_{ce} = -\sum_{i=1}^{N} \big( \log P_\theta(y^i \mid x^i) + \log P_\theta(y^i \mid z^i) \big)
C: L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c^i \log(p_c^i)
D: L_{ce} = -\sum_{i=1}^{N} \sum_{k=1}^{C} y_k^i \log(\hat{y}_k^i)

Figure 2: An example of EQUINFER, where the equation labeled with “B” is correct.
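To make the EQUINFER construction pipeline in Section 3.1 concrete, the sketch below mirrors the mask-generate-filter steps in code form. It is an illustration only: `call_llm(model, prompt)` is a hypothetical helper, the prompt wording is paraphrased rather than quoted from the paper, and the instance-level rule of footnote 4 is simplified away.

```python
def last_n_words(text, n):
    """Return the last n whitespace-delimited words of `text`."""
    return " ".join(text.split()[-n:])


def first_n_words(text, n):
    """Return the first n whitespace-delimited words of `text`."""
    return " ".join(text.split()[:n])


def craft_negative_equations(context_before, context_after, gold_equation,
                             call_llm, n_candidates=3, max_attempts=10):
    """Illustrative EQUINFER label-space construction: mask the gold equation,
    ask GPT-4 to regenerate it from 100 words of context on each side, then keep
    only candidates that (a) differ from the gold equation and (b) are not
    flagged by GPT-4-Turbo as having significant flaws."""
    lhs = gold_equation.split("=")[0] + "="   # left-hand side given as a hint (footnote 3)

    gen_prompt = (
        "Fill in the masked LaTeX equation based on the surrounding context.\n"
        f"Context before: {last_n_words(context_before, 100)}\n"
        f"Context after: {first_n_words(context_after, 100)}\n"
        f"The equation starts with: {lhs}"
    )

    negatives = []
    for _ in range(max_attempts):
        if len(negatives) == n_candidates:
            break
        candidate = call_llm("gpt-4", gen_prompt).strip()
        if candidate == gold_equation:            # (a) identical to the gold equation
            continue
        verdict = call_llm(
            "gpt-4-turbo",
            "Does the following equation have significant flaws? "
            f"Answer yes or no.\n{candidate}",
        )
        if verdict.strip().lower().startswith("yes"):   # (b) too easy to reject
            continue
        negatives.append(candidate)
    return negatives
```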
3.2 Prompting LLMs

Here we elaborate on our prompts for the scenarios where gold labels are present and where gold labels are absent, respectively.

Prompt when the gold label is present (w/ G). The prompt used can be seen in the first block of Table 2.

Block 1 (w/ G):
  Instruction: For the following input and options, please return the correct option(s).
  Input: · · ·
  Options: ID_1. [option_1]; ID_2. [option_2]; · · · ; ID_G. [option_G]; · · ·

Block 2 (w/o G, hint as option):
  Instruction: For the following input and options, please return the correct option(s).
  Input: · · ·
  Options: ID_1. [option_1]; ID_2. [option_2]; · · · ; ID_n. none-of-them

Block 3 (w/o G, hint in instruction):
  Instruction: For the following input and options, please return the correct option(s), or return “none-of-them” if you believe none of the options is correct.
  Input: · · ·
  Options: ID_1. [option_1]; ID_2. [option_2]; · · ·

Block 4 (w/o G, no hint):
  Instruction: For the following input and options, please return the correct option(s).
  Input: · · ·
  Options: ID_1. [option_1]; ID_2. [option_2]; · · ·

Table 2: Prompting LLMs in KNOW-NO. G denotes the gold label (i.e., “ID_G. [option_G]”); blocks 2-4 correspond to the three w/o G hint styles, where the none-of-them hint appears either as an option (block 2) or in the instruction (block 3). In addition to the content above, we append “Your answer:” as the suffix of each prompt to ensure that the model correctly understands the task. A sample prompt for each dataset can be found in Appendix A.2.

Prompt when the gold label is absent (w/o G). In this case, the gold option is deleted. The key question is whether we should provide hints for the LLMs on how to handle situations where all options appear incorrect, and how to implement these hints. In this work, based on real-world scenarios, we design three types of hints (blocks 2-4 in Table 2; a minimal sketch of how these prompts can be assembled programmatically follows the list below):

• HINT-AS-OPTION: Even though no gold label is available, we provide “none-of-them” as one of the options. This mirrors common human behavior when no valid options are found, leading to the selection of this choice. An answer is only counted as correct when the LLM chooses “none-of-them”.

• HINT-IN-INSTRU: In contrast to the above hint type, here we do not include “none-of-them” as an option. Instead, in the instruction, we explicitly request the LLM to output “none-of-them” if no correct option is found. An answer is only counted as correct when the LLM returns “none-of-them”.

• NO-HINT: No hint at all. The instruction is the same as in “w/ G” except for the absence of the gold label.
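As referenced above, the following is a minimal sketch of how the four prompt variants in Table 2 can be assembled. It illustrates the templates only and is not the authors’ exact formatting code; `build_prompt`, the mode constants, and the letter-style option IDs are illustrative choices (the BANK-77 prompts in Appendix A.2 list intent names directly rather than lettered options).

```python
import string

WITH_GOLD = "w/ G"
HINT_AS_OPTION = "hint-as-option"
HINT_IN_INSTRU = "hint-in-instru"
NO_HINT = "no-hint"

def build_prompt(task_input, options, mode):
    """Assemble one KNOW-NO prompt (cf. Table 2). `options` already excludes
    the gold label for the three w/o G modes and includes it for WITH_GOLD."""
    instruction = ("For the following input and options, "
                   "please return the correct option(s).")
    if mode == HINT_IN_INSTRU:
        instruction += (" Or return \"none-of-them\" if you believe none of "
                        "the options is correct.")

    option_texts = list(options)
    if mode == HINT_AS_OPTION:
        option_texts.append("none-of-them")   # the hint is offered as an extra option

    ids = string.ascii_uppercase[:len(option_texts)]
    option_block = "; ".join(f"{i}. {opt}" for i, opt in zip(ids, option_texts))

    # "Your answer:" is appended as a suffix so the model responds directly.
    return (f"Instruction: {instruction}\n"
            f"Input: {task_input}\n"
            f"Options: {option_block}\n"
            f"Your answer:")
```

For example, `build_prompt(story_and_question, ["Mouse", "Dig", "Splinter"], HINT_IN_INSTRU)` reproduces the HINT-IN-INSTRU variant of the Figure 1 example with the gold option removed.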
The evaluation of NO-HINT is more complicated: based on our observations, some LLMs, especially top-performing ones, tend to generate a new option with an explanation if they believe no correct options are provided (this may also indicate data leakage of our datasets into LLM pretraining, which we analyze in Q4 of Section 4.2). This type of LLM response creates two challenges: i) it lacks a fixed format, making automatic parsing for system evaluation infeasible; ii) in reality, deciding whether an LLM response contains a reasonable label for the input requires understanding the task and conducting some reasoning, which is beyond the scope of automatic processing. We also want to avoid using an LLM for evaluation, which might introduce even more errors. Therefore, for NO-HINT, we always obtain performance through human evaluation, i.e., by manually reviewing LLM responses.

We believe that any of the hint types mentioned above would work for human users. Using diverse prompts, we aim to comprehensively evaluate the model’s performance without the gold option and avoid behavior specific to a particular prompt.

3.3 OMNIACCURACY: A new evaluation metric

Our goal with OMNIACCURACY is to have it reflect model performance both when the gold label is present and when it is absent. Therefore, we define OMNIACCURACY in the following straightforward form:

    OMNIACCURACY = (A_{w/} + E[A_{w/o}]) / 2        (1)

where A_{w/} represents the accuracy when the gold label is present, and E[A_{w/o}] indicates the expected accuracy when the gold label is absent. In practice, E[A_{w/o}] can be obtained by combining multiple prompting techniques, such as the three styles described in Section 3.2. We also encourage researchers to explore the most appropriate form for their particular research. In this work, we adopt the following form:

    E[A_{w/o}] = (1/n) \sum_{i=1}^{n} A^{i}_{w/o}    (2)

i.e., we take the average performance across all conditions where no correct answer is presented, including HINT-AS-OPTION, HINT-IN-INSTRU, and NO-HINT, as a comprehensive and robust assessment of LLMs when the gold label is missing.

4 Experiments

LLM Models. We evaluate several popular open-source and closed-source LLMs in this study.

• Closed-source LLMs: GPT-4 (OpenAI, 2023) and Claude 3 (Anthropic, 2024).

• Open-source LLMs: Llama 3 (Meta, 2024), Gemma (Mesnard et al., 2024), and Mistral (Jiang et al., 2023).

Experimental setting. We run each set of experiments 3 times with options shuffled by different random seeds and report the averaged results. For more details, including model versions, hyperparameters, and costs, please see Appendix A.3.
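To make the protocol concrete, here is a minimal sketch of the evaluation loop under the stated settings (three runs with differently shuffled options at temperature 0, accuracy averaged over runs). It assumes a generic `query_model(model, prompt)` helper and the `build_prompt` sketch from Section 3.2; it is not the authors’ released evaluation harness, and the answer matching is deliberately simplified.

```python
import random

def accuracy_over_seeds(model, dataset, mode, query_model, seeds=(0, 1, 2)):
    """Average accuracy of `model` on `dataset` for one prompt style `mode`,
    re-shuffling the candidate options with a different seed per run.
    Each dataset item is a dict with 'input', 'options', and 'gold' keys;
    for the w/o G modes the gold option is removed before prompting."""
    per_seed = []
    for seed in seeds:
        rng = random.Random(seed)
        correct = 0
        for item in dataset:
            options = [o for o in item["options"]
                       if mode == WITH_GOLD or o != item["gold"]]
            rng.shuffle(options)
            prompt = build_prompt(item["input"], options, mode)
            answer = query_model(model, prompt)   # temperature fixed to 0
            if mode == WITH_GOLD:
                # Simplified matching: the gold text appears in the response.
                correct += item["gold"].lower() in answer.lower()
            else:
                # Automatic check only makes sense for the two hinted styles;
                # NO-HINT responses are reviewed manually in the paper.
                correct += "none-of-them" in answer.lower()
        per_seed.append(correct / len(dataset))
    return sum(per_seed) / len(per_seed)
```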
                                      Closed-source           Open-source              Human
                                      GPT-4    Claude 3   Llama 3   Gemma   Mistral
MC-TEST
  w/ G (i.e., A_w/)                   98.67    98.26      94.23     39.00   87.53      100.00
  w/o G: Hint as option               80.17    49.83      43.10      4.33   39.97       96.00
  w/o G: Hint in instru.              80.40    62.17       3.83     15.26   30.27       97.00
  w/o G: No hint                      41.30    60.30      50.10     15.90   33.60       93.00
  E[A_w/o]                            67.29    57.43      32.34     11.83   34.61       95.33
  OMNIACCURACY                        82.98    77.85      63.29     25.41   61.07       97.67
BANK-77
  w/ G (i.e., A_w/)                   69.40    65.75      42.53     39.03   45.10         –
  w/o G: Hint as option                1.83     0.90       2.13      1.43    1.13         –
  w/o G: Hint in instru.               6.17     5.30       2.90      0.87    1.60         –
  w/o G: No hint                       1.60     2.00       8.50      2.20    9.30         –
  E[A_w/o]                             3.00     2.73       4.51      1.50    4.01         –
  OMNIACCURACY                        36.30    34.24      23.52     20.27   24.56         –
EQUINFER
  w/ G (i.e., A_w/)                   44.71    55.39      30.31     20.40   29.90         –
  w/o G: Hint as option                1.91     8.67      38.90      9.06    5.79         –
  w/o G: Hint in instru.               2.86     9.06       0.67      0.19    0.29         –
  w/o G: No hint                       0.00     0.00       0.00      0.00    0.00         –
  E[A_w/o]                             1.59     5.91      13.19      3.08    2.03         –
  OMNIACCURACY                        23.15    30.65      21.75     11.74   15.96         –

Table 3: OMNIACCURACY of closed-source LLMs, open-source LLMs, and humans. We report human performance only on MC-TEST, whose more manageable label size allows human annotation within a reasonable timeframe. Gemma scores much lower than the other LLMs on MC-TEST because it cannot follow the instruction and always fails to return an option.
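As a quick check of how the entries in Table 3 combine under Eqs. (1)-(2), the snippet below recomputes E[A_w/o] and OMNIACCURACY for GPT-4 on MC-TEST. The helper is a direct transcription of the two formulas, not code released with the paper.

```python
def omni_accuracy(acc_with_gold, accs_without_gold):
    """Eq. (1)-(2): average the w/o G accuracies across prompt styles,
    then average that expectation with the w/ G accuracy."""
    expected_without = sum(accs_without_gold) / len(accs_without_gold)
    return 0.5 * (acc_with_gold + expected_without), expected_without

# GPT-4 on MC-TEST (Table 3): w/ G = 98.67; w/o G = 80.17, 80.40, 41.30.
omni, e_without = omni_accuracy(98.67, [80.17, 80.40, 41.30])
print(round(e_without, 2), round(omni, 2))   # 67.29 and 82.98, matching Table 3
```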

4.1 Main results

The main results are presented in Table 3.

G is present (w/ G). First, all LLMs, and especially the closed-source LLMs, demonstrate exceptionally high accuracy, achieving around 98% on MC-TEST and over 65% on BANK-77. Second, the performance of LLMs decreases progressively from MC-TEST to BANK-77 to EQUINFER, which aligns with our expectations given the increasing difficulty of these tasks.

G is absent (w/o G). For all three prompt styles in w/o G, LLM performance decreases notably, as all models still tend to return one of the incorrect options offered. When comparing the hint styles HINT-AS-OPTION and HINT-IN-INSTRU, we find that LLMs’ preferences vary. Such variability indicates the difficulty of consistently evaluating A_w/o, as it is somewhat dependent on prompt design. We encourage researchers to design multiple prompts and use the expected value E[A_w/o] rather than any individual A^i_w/o.

Llama 3 exhibits unexpected behavior on the EQUINFER dataset, as HINT-AS-OPTION performs even better than w/ G. We suspect this may be due to data bias during model pretraining. Please see Q5 in Section 4.2 for more analysis.

Observations about OMNIACCURACY. In the last column of Table 3, we report the human performance on MC-TEST (more details in Q2 of Section 4.2). Although both GPT-4 and Claude 3 perform on par with humans when the gold label is present (98%+ vs. 100%), OMNIACCURACY clearly shows they are still behind humans, both overall (around 80% for the LLMs vs. 97.67% for humans) and in the E[A_w/o] category.

When using the standard evaluation approach (with the gold label as an option), some LLMs appear to reach human-like performance. However, our evaluation reveals that LLMs still lag considerably behind humans because they cannot recognize the absence of the true answer as effectively as humans can. Therefore, OMNIACCURACY offers a more comprehensive measure of LLMs’ understanding of the classification task and their ability to perform human-level discrimination.

4.2 Analysis

Q1: Which prompt is most effective when the gold label is missing: HINT-AS-OPTION, HINT-IN-INSTRU, or NO-HINT? NO-HINT is clearly the worst of the three. Based on Table 3, for the strong closed-source LLMs, HINT-IN-INSTRU consistently results in higher performance than HINT-AS-OPTION, which can be attributed to their superior ability to follow instructions. Additionally, considering the poorer performance of NO-HINT, when working with widely recognized top-performing LLMs and only one type of hint can be tried (perhaps due to budget constraints), HINT-IN-INSTRU is the better option.

Among open-source LLMs, HINT-AS-OPTION more frequently outperforms HINT-IN-INSTRU, similar to our observation about Llama 3 on EQUINFER. We suspect this is because open-source LLMs are generally less adept at following instructions and may be relying on specific classification patterns seen during pretraining; placing none-of-them in the instruction is therefore less clear to them than offering it as a separate option.

In addition, we notice that these open-source LLMs achieve their highest performance under NO-HINT among the three w/o G prompts on both MC-TEST and BANK-77. In fact, open-source LLMs tend to generate a new option whenever the gold option is absent, regardless of whether there are hints about none-of-them. Therefore, even with HINT-AS-OPTION and HINT-IN-INSTRU, these models often ignore the hints and propose self-generated answers without returning none-of-them, which leads to poorer performance under HINT-AS-OPTION and HINT-IN-INSTRU.

We also notice interesting differences in the accuracy ranking of LLMs between when the gold label is available (i.e., A_w/) and when it is deleted (i.e., E[A_w/o]). More details and analysis are provided in Appendix A.4.

Q2: Human performance analysis when the gold label is absent. In the last column of Table 3, we report the average human performance on MC-TEST. We randomly selected 25 stories, resulting in 100 questions, each with four answer candidates. We randomly divided the 100 questions into four groups of 25, with each group corresponding to one of the four prompts (w/ G, HINT-AS-OPTION, HINT-IN-INSTRU, or NO-HINT). We invited four human participants to work separately, with each person responsible for all four groups. To ensure unbiased results, we did not allow the same person to annotate the same question under different prompts, so their annotations of w/o G questions would not be influenced by the w/ G questions.

Figure 3 depicts the statistics of human performance versus the maximum performance of LLMs across the four prompts. We observe two key dimensions. First, across prompts, human performance is barely affected by the presence of the gold option. In HINT-IN-INSTRU and HINT-AS-OPTION, when the gold option is deleted and the “none-of-them” option is provided either among the options or in the instruction, human performance differed by less than 4% compared to when the gold option was present. Even in NO-HINT, without any gold option or “none-of-them” hint, humans were only slightly confused, with the difference being at most 8%. This indicates that CLASSIFY-W/O-GOLD is not a very challenging task for humans, whereas it is for the models. Second, across humans, there is a 4-8% difference between Human Min and Human Max when the gold option is absent. Despite this, even the Human Min performance is significantly higher than the LLM Max performance. Particularly under NO-HINT, human performance did not show a significant decline compared to the w/ G prompt, while all LLMs experienced dramatic drops.

[Figure 3: Humans vs. LLMs on MC-TEST. Accuracy (60-100%) of Human Min, Human Max, and Human Mean versus LLM Max under the w/ G, Hint-as-Option, Hint-in-Instru, and No-Hint prompts.]

[Figure 4: LLMs’ output pattern distribution in NO-HINT on MC-TEST, broken down into Wrong Option, Generated Right, and Generated Wrong responses.]
Q3: What different behaviors do LLMs exhibit when the gold option is absent in NO-HINT? In NO-HINT, there are two types of patterns in the models’ responses. The first is to declare that none of the provided options is correct. The second is to generate a new answer that the model believes to be correct. Each model behaves differently, and their response patterns vary. We identified three behaviors:

• GPT-4 usually combines both patterns, while other models tend to generate a new answer directly.

• Figure 4 illustrates the distribution of LLM responses in NO-HINT on MC-TEST. GPT-4 and Mistral make the most mistakes on incorrect options, while Claude 3 and Llama 3 are most likely to generate correct new labels. Gemma performs the worst, with mostly wrong labels generated.

• When generating new answers, a significant difference lies in the letter the LLMs assign to this new option. We observed that when only options A-C exist in NO-HINT, GPT-4 consistently labels its self-generated answer as “D”, while the other LLMs tend to choose letters from A to C.

Sample outputs and further analysis of behavior patterns are available in Appendix A.5.

Q4: If LLMs can generate a correct label in NO-HINT, is it because they have seen the dataset during pretraining? When eyeballing the answers suggested by LLMs in NO-HINT for MC-TEST and BANK-77, we noticed that a model sometimes generates a new option identical to the gold option. This is not surprising for MC-TEST, where options can be direct terms from the story. However, for BANK-77, with its fixed and specific label space, this raises the question of whether LLMs hold this dataset in their parameterized knowledge.

To answer this question, we employed a small trick with the BANK-77 dataset. Given that the tokens in BANK-77’s options are delimited by “_”, such as “lost_or_stolen_card”, we replaced “_” with “-” and reran the experiments for the NO-HINT scenario. Among our five LLMs, Llama 3 was the only model that still generated “_” in the output. Therefore, we highly suspect that Llama 3 was exposed to BANK-77 during its pretraining, and consequently its performance may be biased, especially in the manually evaluated settings. Even though all other LLMs consistently follow the new delimiter “-”, this can be attributed to their strong instruction-following capabilities. Therefore, our proposed format-aware trick cannot conclusively determine whether these models were exposed to this dataset during pretraining.

                GPT-4   Claude 3   Llama 3   Gemma   Mistral
MC   w/ G       98.67   98.26      94.23     39.00   87.53
     + None     99.60   97.60      91.70     40.00   87.60
EQ   w/ G       44.71   55.39      30.31     20.40   29.90
     + None     45.60   58.80      15.30     10.10   22.40

Table 4: Ablation study: w/ G vs. w/ G + None. MC and EQ refer to MC-TEST and EQUINFER, respectively.

Q5: Would the model be misled when we add none-of-them in w/ G? We have observed that the models exhibit different behaviors when encountering none-of-them hints. We hypothesize that this behavior might stem from data bias introduced during model pretraining or instruction tuning. To investigate this, we conducted an ablation study by introducing a none-of-them option in the w/ G prompts on a subset (250 instances) of MC-TEST and EQUINFER.

To ensure fairness, we randomly replaced one incorrect answer with none-of-them, ensuring the models always select from options A-D in both scenarios. The results, presented in Table 4, show that for the simpler MC-TEST dataset, model performance remains nearly identical between the w/ G and w/ G + None settings. For the more challenging EQUINFER dataset, the closed-source models maintain their robustness, while the open-source models, particularly Llama 3 and Gemma, experience significant performance declines. This decrease is attributed to the confusion caused by the none-of-them option, which might also explain why Llama 3 performs exceptionally well in HINT-AS-OPTION on EQUINFER.

5 Conclusion

Our study reveals critical insights into the limitations of LLMs in classification tasks under CLASSIFY-W/O-GOLD, where gold labels can be present or absent. The KNOW-NO benchmark and the OMNIACCURACY metric provide a comprehensive evaluation by combining performance when gold labels are present with performance when they are absent. This work establishes a new testbed for assessing LLMs’ human-level discrimination intelligence, offering a framework for future research aimed at improving the robustness and reliability of LLMs.
Limitations

There are several limitations in KNOW-NO as presented. First, we tested only three different prompts in w/o G. There are many other possible prompts that could assess a model’s ability without the gold label, but it is impossible to address them all in this paper. Second, the EQUINFER dataset requires a large number of input tokens due to its problem setting, making it expensive for proprietary models. However, researchers are welcome to test KNOW-NO on any open-source models or adapt OMNIACCURACY to any dataset, not limited to the models or datasets used in this work.

References

Anthropic. 2024. Introducing the next generation of Claude.

Iñigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. 2020. Efficient intent detection with dual sentence encoders. CoRR, abs/2003.04807.

David Gunning, Mark Stefik, Jaesik Choi, Timothy Miller, Simone Stumpf, and Guang-Zhong Yang. 2019. XAI - explainable artificial intelligence. Sci. Robotics, 4(37).

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. CoRR, abs/2310.06825.

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamile Lukosiute, Karina Nguyen, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Robin Larson, Sam McCandlish, Sandipan Kundu, Saurav Kadavath, Shannon Yang, Thomas Henighan, Timothy Maxwell, Timothy Telleen-Lawton, Tristan Hume, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, and Ethan Perez. 2023. Measuring faithfulness in chain-of-thought reasoning. CoRR, abs/2307.13702.

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, Wenhu Chen, and Xifeng Yan. 2022. Explanations from large language models make small reasoners better. CoRR, abs/2210.06726.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: Designing data and methods for effective instruction tuning. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 22631–22648. PMLR.

Kyle Mahowald, Anna A. Ivanova, Idan Asher Blank, Nancy Kanwisher, Joshua B. Tenenbaum, and Evelina Fedorenko. 2023. Dissociating language and thought in large language models: a cognitive perspective. CoRR, abs/2301.06627.

Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Cristian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, et al. 2024. Gemma: Open models based on Gemini research and technology. CoRR, abs/2403.08295.

Meta. 2024. Introducing Meta Llama 3: The most capable openly available LLM to date.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3470–3487. Association for Computational Linguistics.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 193–203. ACL.

Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell., 1(5):206–215.

Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, and Guoyin Wang. 2023a. Pushing the limits of ChatGPT on NLP tasks. CoRR, abs/2306.09719.
Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. 2023b. Text classification via large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 8990–9005. Association for Computational Linguistics.

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023.

Xiaoqiang Wang, Bang Liu, and Lingfei Wu. 2024. FAC2E: Better understanding large language model capabilities by dissociating language and cognition. CoRR, abs/2403.00126.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 5085–5109. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, and Jing Qin. 2024. Pushing the limit of LLM capacity for text classification. CoRR, abs/2402.07470.

A Appendix

A.1 Scaling EQUINFER

To determine an optimal context length for both sides of the equation, we tested ten different context lengths ranging from 100 to 1,500 words on GPT-4. The results are displayed in Figure 5. We found that starting from 1,000 words, the model’s performance did not show significant further improvement. Therefore, we decided to retain 1,000 words on either side. Additionally, when truncating the context, we ensure that complete sentences are presented: if the 1,000-word limit would cut off a sentence, we extend the truncation to preserve sentence completeness.

[Figure 5: Scaling the length of context around the equation in EQUINFER. GPT-4-Turbo classification accuracy plotted against input context length (in words, 100 to 1,500), rising from roughly 25% at the shortest contexts to about 41% beyond 1,000 words.]
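The sentence-completeness rule described in A.1 can be captured in a few lines. The sketch below is an illustrative implementation under a naive sentence-splitting assumption (periods, question marks, and exclamation marks followed by whitespace); it is not the authors’ exact preprocessing.

```python
import re

def truncate_to_sentences(text, word_limit=1000, keep="tail"):
    """Keep roughly `word_limit` words from `text`, but never cut a sentence
    in half: if the limit falls inside a sentence, the whole sentence is kept.
    keep="tail" is used for the context before the equation, keep="head" for
    the context after it."""
    # Naive sentence segmentation on ., ?, ! followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if keep == "tail":
        sentences = sentences[::-1]   # walk backwards from the equation

    selected, n_words = [], 0
    for sent in sentences:
        selected.append(sent)
        n_words += len(sent.split())
        if n_words >= word_limit:     # stop only at a sentence boundary
            break

    if keep == "tail":
        selected = selected[::-1]
    return " ".join(selected)
```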
A.2 Prompt Illustrations

Below, we present sample inputs from the MC-TEST, BANK-77, and EQUINFER datasets. These examples represent w/ G prompts, where the gold option is present. For w/o G prompts, the gold option is removed and the prompt is adjusted accordingly: a none-of-them hint is added to the options or the instruction for HINT-AS-OPTION and HINT-IN-INSTRU, respectively, or no hint at all is given for NO-HINT.

MC-TEST

Task:
Given the story and an associated question, please return the correct option for the question without explanation.

Story:
[· · · ] “I’m going to play for the Yankees ma!” Tom said. Tom’s mom was so excited that she took Tom and the whole family out for dinner. Grandpa, Grandma, Mom and Dad were all there, and bought Tom a big cake! [· · · ]

Question:
What did Tom’s family buy him to celebrate?

Options:
A: A cake
B: A car
C: New clothes
D: A baseball

Your answer:

BANK-77

Task:
You are going to perform an intent classification task. Given an utterance and multiple intent class options, return the correct intent class name only.

Utterance:
An unauthorized payment is in my app

Intent Class Options:
refund_not_showing_up, apple_pay_or_google_pay, pending_card_payment, card_payment_not_recognised, [· · · ] balance_not_updated_after_cheque_or_cash_deposit

Your answer:

EQUINFER

Task:
You are given the LaTeX source code of the context before and after an equation in an NLP paper and multiple options for the equation. Only return the correct option letter as the answer without explanation.

Context before:
“[· · · ] e_{x_i} and e_{z_i} are the average representations along the sequence dimension from the encoder outputs. Apart from the contrastive loss, the standard cross-entropy loss is calculated as:”

Context after:
“We combine both losses as the final loss:

L = L_{ce} + \lambda L_{ctr}

where λ is an interpolation factor. [· · · ]”

Equation options:
A: L_{ce} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})
B: L_{ce} = -\sum_{i=1}^{N} \big( \log P_\theta(y^i \mid x^i) + \log P_\theta(y^i \mid z^i) \big)
C: L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_c^i \log(p_c^i)
D: L_{ce} = -\sum_{i=1}^{N} \sum_{k=1}^{C} y_k^i \log(\hat{y}_k^i)

Your answer:

A.3 Experimental Setting Details

We adopt the most recently released instruction-tuned version of each LLM as follows: gpt-4o-2024-05-13, claude-3-opus-20240229, Meta-Llama-3-8B-Instruct, gemma-7b-it, and Mistral-7B-Instruct-v0.2. During inference, the temperature is always set to 0 for reproducibility, top_p is set to 0.9, and the max_length for generation is 1000.

The cost of using the two closed-source LLMs is detailed below. For BANK-77, MC-TEST, and EQUINFER, it takes 1 million, 0.4 million, and 5 million input tokens, respectively, to run one prompt over all instances. Given that one set of experiments includes 4 prompts, the costs for GPT-4 (including both input and output tokens) are approximately $15, $8, and $100 for BANK-77, MC-TEST, and EQUINFER, respectively. For Claude 3, the costs are roughly $60, $24, and $300 for BANK-77, MC-TEST, and EQUINFER, respectively.

We want to highlight that KNOW-NO and OMNIACCURACY represent a novel evaluation approach under CLASSIFY-W/O-GOLD. They can be applied to any model in any classification setting, extending beyond the three datasets and five models reported in this study.

A.4 Ranking Differences of LLMs between A_w/ and A_w/o

Are there any differences in the accuracy ranking of LLMs between when the gold label is available (i.e., A_w/) and when it is deleted (i.e., E[A_w/o])? We notice that closed-source models consistently achieve impressive accuracy when the gold option is available. However, when the gold option is removed, they resist acknowledging the absence of a correct label, especially in more challenging tasks. In MC-TEST, the simplest of the three tasks, GPT-4 and Claude 3 confidently suggest “none” or generate a new answer, outperforming other models by a large margin. Conversely, for BANK-77 and EQUINFER, they tend to select from the available incorrect options, performing as poorly as the open-source models.

A.5 Model Behaviour under NO-HINT

In this section, we provide a detailed analysis of each model’s responses in NO-HINT. Since MC-TEST is the simplest of the three tasks, it is the scenario in which a model is most likely to correctly identify the NO-HINT situation. In contrast, for EQUINFER and BANK-77, it is very rare for a model to identify the situation. Hence, we focus on MC-TEST for our primary discussion.

Even with MC-TEST, models fail to recognize the absence of the gold label approximately half of the time. When a model does react to the missing gold label, the typical response patterns are as follows:

Gold option: Rick
Options: A. Bob, B. James, C. Stephanie
Response type 1: “None of the above”
Response type 2: “None of the options provided are correct. The correct answer is Rick.”
Response type 3: “D. Rick”
Response type 4: “C. Rick”

Each model displays these patterns differently, with unique characteristics specific to each model.

GPT-4  Surprisingly, GPT-4 is among the least likely models to acknowledge the absence of a gold label. However, when it does recognize this situation and adopts an “out-of-the-box” approach, its accuracy is notably the highest. The most common responses for GPT-4 are Response types 1, 2, and 3. For the new answers it proposes, GPT-4 always labels them as option D.

Claude 3 / Llama 3 / Mistral  Claude 3, Llama 3, and Mistral exhibit consistent and standardized response patterns in the format “Letter. Option”, without giving any additional discussion. The option might be one of the incorrect options provided or a new answer generated by the model. Interestingly, unlike GPT-4, which always assigns its new answer as option D, Claude 3, Mistral, and Llama 3 tend to assign a letter from A to C, as in Response type 4.

Gemma  Gemma’s responses are the most chaotic and illogical. It often forces an explanation onto an incorrect option, or provides an incorrect option but includes the correct answer in the reasoning. This inconsistency aligns with its poor performance across all datasets and most prompts. One example of such an error is as follows:

Question: What color was the animal’s stripe?
Gold option: green
Options: A. playground, B. sand, C. frosting
Gemma output:
“Answer: C
The animal had a green stripe across its back, therefore the answer is C.”

As a result, it is very rare for Gemma to generate correct output, and none of the response types above fit its behavior.

A.6 Human Behaviour under NO-HINT

Gold option: Rick
Options: A. Bob, B. James, C. Stephanie
Response type 1: “None of the options is correct / something is wrong with the problem setting”
Response type 2: “Rick”

When faced with NO-HINT, humans respond more straightforwardly and directly than the models. Humans often either point out that there might be flaws in the question design or provide the correct answer directly, as in the two response types above. Very interestingly, we notice that humans do not assign a letter to a self-generated answer or treat it as one of the provided options. This behavior seems to be unique to models, likely because they are trained to follow the provided pattern.

A.7 ACL ethics code discussion

• Time/Memory Cost. We only collect inferences from LLMs. For the closed-source models (GPT-4 and Claude 3), accessed through APIs, there is no significant memory usage. For the open-source models (Llama 3, Gemma, and Mistral), accessed through Hugging Face, the memory usage is the same as the model parameter size. It takes about an hour on average to run all experiments on one dataset.

• Scientific artifacts usage. The existing scientific artifacts included in this work are five models (GPT-4, Claude 3, Llama 3, Gemma, and Mistral; please refer to Section 4, LLM Models) and three NLP classification datasets (MC-TEST, BANK-77, and EQUINFER; please refer to Section 3.1). The models and datasets used in this work are publicly available for research purposes and do not contain any sensitive information. Our use of existing scientific artifacts is consistent with their intended usage. The dataset EQUINFER proposed by us does not contain any personal information.

The license, copyright information, the assets we propose, and terms-of-use information regarding KNOW-NO will be specified once the code is released.