GEO: Generative Engine Optimization



Pranjal Aggarwal*♢ Vishvak Murahari* ♠ Tanmay Rajpurohit† Ashwin Kalyan‡


Karthik R Narasimhan♠ Ameet Deshpande♠
♠ Princeton University † Georgia Tech ‡ The Allen Institute for AI ♢ IIT Delhi
[email protected] [email protected]

arXiv:2311.09735v1 [cs.LG] 16 Nov 2023

Abstract

The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), has the potential to generate accurate and personalized responses, and is rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them with the help of LLMs. While this shift significantly improves user utility and generative search engine traffic, it results in a huge challenge for the third stakeholder – website and content creators. Given the black-box and fast-moving nature of generative engines, content creators have little to no control over when and how their content is displayed. With generative engines here to stay, the right tools should be provided to ensure that the creator economy is not severely disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), a novel paradigm to aid content creators in improving the visibility of their content in GE responses through a black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation in this new paradigm by introducing GEO-bench, a benchmark of diverse user queries across multiple domains, coupled with the sources required to answer these queries. Through rigorous evaluation, we demonstrate that GEO can boost visibility by up to 40% in GE responses. Moreover, we show that the efficacy of these strategies varies across domains, underscoring the need for domain-specific optimization methods. Our work opens a new frontier in the field of information discovery systems, with profound implications for both developers of GEs and content creators.¹

1 Introduction

The invention of traditional search engines three decades ago marked a shift in the way information was accessed and disseminated across the globe. While these search engines were powerful and ushered in a host of applications like academic research and e-commerce, they were limited to providing a list of relevant websites to user queries. The recent success of large language models (LLMs), however, has paved the way for better systems like BingChat, Google's SGE, and perplexity.ai that combine the strength of conventional search engines with the flexibility of generative models. We dub these new-age systems generative engines (GEs) because they not only search for information, but also generate multi-modal responses by synthesizing multiple sources. From a technical perspective, generative engines involve retrieving relevant documents from a database (such as the internet) and using large neural models to generate a response grounded in the sources, so as to ensure attribution and give the user a way to verify the information.

The usefulness of generative engines for both their developers and users is evident: users can access information faster and more accurately, while developers can craft precise and personalized responses, both of which improve user satisfaction and revenue. However, generative engines put the third stakeholder – website and content creators – at a disadvantage. Generative Engines, in contrast to traditional search engines, remove the need to navigate to websites by directly providing a precise and comprehensive response, which can lead to a drop in organic traffic to websites and severely impact their visibility. With several million small businesses and individuals relying on online traffic and visibility for their livelihood, generative engines will significantly disrupt the creator economy.

¹ Code and data available at https://GEO-optim.github.io/GEO/.
* Equal Contribution
Further, the black-box and proprietary nature of generative engines makes it prohibitively difficult for content creators to control and understand how their content is ingested and portrayed by generative engines. In this work, we take a first step towards a general creator-centric framework to optimize content for generative engines, which we dub Generative Engine Optimization (GEO), to empower content creators to navigate this new search paradigm with greater confidence.

GEO is a black-box optimization framework for optimizing the visibility of web content for proprietary and closed-source generative engines (Figure 1). GEO ingests a source website and outputs an optimized version of the website by tailoring and calibrating the presentation, text style, and content to increase the likelihood of visibility in generative engines.

However, note that the notion of visibility in generative engines is highly nuanced and multi-faceted (Figure 3). While average ranking on the search results page is a good measure of visibility in traditional search engines, which present a linear list of websites, this does not apply to generative engines. Generative Engines provide rich, highly structured responses and embed websites as inline citations in the response, often embedding them with different lengths, at varying positions, and with diverse styles. This necessitates visibility metrics tailor-made for generative engines, which measure the visibility of attributed sources over multiple dimensions, such as the relevance and influence of a citation to the query, measured through both an objective and a subjective lens. Our GEO framework proposes a holistic set of visibility metrics and enables content creators to create their own customized visibility metrics.

To facilitate faithful and extensive evaluation of GEO methods in this new paradigm, we propose GEO-bench, a benchmark consisting of 10K queries from a diverse set of domains and sources, specially adapted for generative engines. Through systematic evaluation, we demonstrate that our proposed Generative Engine Optimization methods can boost visibility by up to 40% on a diverse set of queries, providing beneficial strategies for content creators to improve their visibility in rapidly adopted generative engines. Among other things, we find that including citations, quotations from relevant sources, and statistics can significantly boost source visibility, with an increase of over 40% across various queries. Further, we discover that the effectiveness of Generative Engine Optimization methods depends on the domain of the query.

In summary, our contributions are three-fold: (1) We propose Generative Engine Optimization (GEO), the first general framework for website owners to optimize their websites for generative engines. (2) Our framework proposes a comprehensive set of visibility metrics designed for generative engines and enables content creators to create their own customized visibility metrics. (3) To foster faithful evaluation of Generative Engine Optimization methods in the age of Generative Engines, we propose the first large-scale benchmark consisting of diverse search queries from wide-ranging domains and datasets, specially tailored for Generative Engines.

2 Formulation & Methodology

2.1 Formulation of Generative Engines

Despite the deployment of a myriad of generative engines to millions of users already, there is currently no standard framework describing them. We provide a formulation that can accommodate the various modular components incorporated in their designs.

We describe a generative engine as comprising several backend generative models and a search engine for source retrieval. A Generative Engine (GE) takes as input a user query q_u and returns a natural language response r, where P_U represents personalized user information, such as preferences and history. The GE can be represented as a function:

    f_GE : (q_u, P_U) → r    (1)

While the response r can be multimodal, we simplify it to a textual response in this section.

Generative Engines comprise two crucial components: (a) a set of generative models G = {G_1, G_2, ..., G_n}, each serving a specific purpose like query reformulation or summarization, and (b) a search engine SE that returns a set of sources S = {s_1, s_2, ..., s_m} given a query q. We present a representative workflow in Figure 2, which, at the time of writing, closely resembles the design of BingChat. This workflow breaks down the input query into a set of simpler queries that are easier for the search engine to consume. Given a query, a query-reformulator generative model, G_1 = G_qr, generates a set of queries Q_1 = {q_1, q_2, ..., q_n}, which are then passed to the search engine SE to retrieve a multi-set of ranked sources S = {s_1, s_2, ..., s_m}.
Figure 1: Our proposed Generative Engine Optimization (GEO) method optimizes websites to boost their visibility in Generative Engine responses. GEO's black-box optimization framework enables the owner of the pizza website, which originally lacked visibility, to optimize the site and increase its visibility under Generative Engines. Further, GEO's general framework allows content creators to define and optimize their own custom visibility metrics, giving them greater control in this new, emerging paradigm.

Figure 2: Overview of Generative Engines. Generative Engines primarily consist of a set of generative models and a search engine to retrieve relevant documents. Generative Engines take a user query as input and, through a series of steps, generate a final response that is grounded in the retrieved sources, with inline attributions throughout the response.

The sets of sources S are passed to a summarizing model G_2 = G_sum, which generates a summary Sum_j for each source in S, resulting in the summary set Sum = {Sum_1, Sum_2, ..., Sum_m}. The summary set is passed to a response-generating model G_3 = G_resp, which generates a cumulative response r backed by the sources S. We refer readers to Algorithm 1 for representative pseudocode describing the working of a generative engine. In this work, we focus on single-turn Generative Engines, but the formulation can easily be extended to multi-turn Generative Engines; we provide that formulation in Appendix A.

The response r is typically a structured text response with citations embedded within the text to support the information provided. Citations are especially important given the tendency of LLMs to hallucinate information (Ji et al., 2023). Specifically, consider a response r composed of sentences {l_1, l_2, ..., l_o}. Each sentence may be backed by a set of citations C_i ⊂ S drawn from the retrieved set of documents. An ideal Generative Engine should ensure that all statements in the response are supported by relevant citations (high citation recall), and that all citations accurately support the statements they are associated with (high citation precision) (Liu et al., 2023a).

2.2 Generative Engine Optimization

The advent of search engines led to the development of search engine optimization (SEO), a process to help website creators optimize their content to improve rankings in search engine results pages (SERP). Higher rankings correlate with higher visibility and increased website traffic. However, with generative engines becoming front-and-center in the information delivery paradigm, and SEO not directly applicable to them, new techniques need to be developed.

To this end, we propose Generative Engine Optimization, a new paradigm in which content creators aim to increase their visibility (or impression) in the generated responses.
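The two-stage workflow above (reformulate, retrieve, summarize, respond) can be sketched in code. This is a minimal, illustrative skeleton of f_GE: the wrapper names below are our own stand-ins for G_qr, SE, G_sum, and G_resp, not an API from the paper; in a real system each callable would invoke an LLM or a search backend.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GenerativeEngine:
    """One pass of the single-turn workflow described in Section 2.1."""
    reformulate: Callable[[str], List[str]]   # G_qr: query -> simpler queries
    search: Callable[[str], List[str]]        # SE: query -> ranked sources
    summarize: Callable[[str], str]           # G_sum: source -> summary
    respond: Callable[[str, List[str]], str]  # G_resp: (query, summaries) -> r

    def __call__(self, user_query: str) -> str:
        queries = self.reformulate(user_query)
        # Multi-set S: sources retrieved for every reformulated query.
        sources = [s for q in queries for s in self.search(q)]
        summaries = [self.summarize(s) for s in sources]
        return self.respond(user_query, summaries)

# Toy instantiation, just to exercise the control flow end-to-end.
ge = GenerativeEngine(
    reformulate=lambda q: [q],
    search=lambda q: [f"source for '{q}'"],
    summarize=lambda s: s.upper(),
    respond=lambda q, ss: " ".join(ss),
)
```

In a production engine the lambdas would be replaced by model calls, but the data flow — query set, multi-set of sources, summary set, grounded response — is exactly the one formalized above.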
Figure 3: Ranking and visibility metrics are straightforward in traditional search engines, which list website sources in ranked order with verbatim content. However, Generative Engines generate rich, structured responses and often embed citations in a single block, interleaved with one another. This makes the notion of ranking and visibility highly nuanced and multi-faceted. Further, unlike search engines, where significant research has been conducted on improving website visibility, it remains unclear how to optimize visibility in generative engine responses. To address these challenges, our black-box optimization framework proposes a series of well-designed impression metrics that creators can use to gauge and optimize their website's performance, and also allows creators to define their own impression metrics.

We define the visibility of a website/citation c_i in a cited response r from a generative engine by the function Imp_wc(c_i, r), which the website creator wants to maximize. Simultaneously, from the perspective of the generative engine, the goal is to maximize the visibility of the citations that are most relevant to the user query, i.e., to maximize Σ_i Imp_wc(c_i, r) · Rel(c_i, q, r), where Rel(c_i, q, r) is a measure of the relevance of citation c_i to the query q in the context of response r. However, both the functions Imp_wc and Rel are subjective and not yet well-defined for generative engines, and we define them below.

2.2.1 Impressions for Generative Engines

In SEO, the impression (or visibility) of a website is simply determined by the average ranking of the website over a range of real queries. But given that the nature of the output of generative engines is very different, impression metrics for them are not yet defined. Unlike search engines, Generative Engines combine information from multiple sources in a single response. Thus multiple factors, such as the length, uniqueness, and presentation of the cited website, determine the true visibility of a citation. In this section, we use website and citation interchangeably.

To address this, we propose several impression metrics. The "Word Count" metric is the normalized word count of the sentences related to a citation. Mathematically, this is defined as:

    Imp_wc(c_i, r) = ( Σ_{s ∈ S_{c_i}} |s| ) / ( Σ_{s ∈ S_r} |s| )    (2)

Here S_{c_i} is the set of sentences citing c_i, S_r is the set of sentences in the response, and |s| is the number of words in sentence s. In cases where a sentence is cited by multiple sources, we simply share the word count among the citations. Intuitively, a higher word count correlates with the source playing a more important part in the answer, and thus the user gets higher exposure to that source. However, since "Word Count" is not affected by the ranking of the citations (whether a citation appears first, for example), we propose a position-adjusted count that reduces the weight by an exponentially decaying function of the rank of the citation:

    Imp'_wc(c_i, r) = ( Σ_{s ∈ S_{c_i}} |s| · e^(−pos(s)/|S|) ) / ( Σ_{s ∈ S_r} |s| )    (3)

The above impression metrics are objective and well-grounded. However, they ignore the subjective aspects of the impact of citations on the user's attention. To address this, we propose the "Subjective Impression" metric, which incorporates multiple facets: 1) the relevance of the cited material to the user query; 2) the influence of the citation, which evaluates the degree to which the generated response depends on the citation; 3) the uniqueness of the material presented by a citation; 4) subjective position, which measures how prominently the source is positioned from the user's perspective; 5) subjective count, which measures the amount of content presented from the citation as perceived by the user upon reading it; 6) the probability of clicking the citation; and 7) the diversity of the material presented. To measure each of these sub-metrics, we use G-Eval (Liu et al., 2023b), the current state of the art for evaluation with LLMs, which correlates highly with human judgment on subjective tasks. We present a general algorithm to measure the impression metrics of a response in Algorithm 2 and refer readers to Appendix B.3 for more details.

2.2.2 Generative Engine Optimization methods for websites

To improve the impression metrics, content creators need to make changes to their websites. To this end, we present several generative-engine-agnostic strategies, referred to as Generative Engine Optimization (GEO) methods. Mathematically, every GEO method is a function f : W → W', where W is the initial web content and W' is the modified website content after applying the GEO method. A well-designed GEO method should increase the visibility of the website to which it is applied. These methods implement textual modifications to W in a manner that is independent of the queries. The modifications range from simple stylistic alterations to the incorporation of new content in a structured format.

We propose and evaluate a series of methods: 1. Authoritative: modifies the text style of the source content to be more persuasive while making authoritative claims; 2. Keyword Stuffing: modifies content to include more keywords from the query, as would be expected in classical SEO; 3. Statistics Addition: modifies content to include quantitative statistics instead of qualitative discussion, wherever possible; 4. Cite Sources and 5. Quotation Addition: add relevant citations and quotations from credible sources; 6. Easy-to-Understand: simplifies the language of the website; 7. Fluency Optimization: improves the fluency of the website text; 8. Unique Words and 9. Technical Terms: add unique and technical terms, respectively, wherever possible.

With the exception of methods 3, 4, and 5, these methods do not necessitate the addition of new content to the website. Instead, they primarily focus on enhancing the presentation of the existing content in a way that increases its persuasiveness or makes it more appealing to the generative engine. These GEO methods can be categorized into two broad types: Content Addition and Stylistic Optimization. In practice, GEO methods would be implemented by website owners modifying their text in accordance with these principles. However, for the purposes of our experiments, we implement GEO methods by creating suitable prompts for the GPT-3.5 model to convert the source text into the modified text. The exact prompts used are provided in Appendix B.5.

To analyze the performance gain of our methods, for each input query we randomly select one source to be optimized using each of the GEO methods separately. Further, for every method, 5 answers are generated per query to reduce statistical noise in the results. We refer readers to Appendix B.4 for more details.

3 Experimental Setup

Algorithm 1 Generative Engine
Require: userQuery
1: newQuery ← ReformulatingModel(userQuery)
2: sources ← fetchFromSE(newQuery)
3: summaries ← []
4: for each source in sources do
5:   summary ← Summarizer(source)
6:   summaries.append(summary)
7: end for
8: Response ← ResponseGenerator(userQuery, summaries)
9: return Response

Algorithm 2 Impression
Require: sentences : List[Tuple[text, citation]]
1: ImpressionScores ← empty Dictionary
2: for each citation in [0, 1, 2, ..., numSources] do
3:   ImpressionScores[citation] ← ImpressionFunction(sentences, citation)
4: end for
5: return Normalize(ImpressionScores)
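The two objective metrics (Eqs. 2 and 3) can be sketched directly. This is our reading of the definitions, with two stated assumptions: word counts of co-cited sentences are shared equally among their citations, sentence positions are 1-indexed, and the position-adjusted variant uses a decaying weight e^(−pos(s)/|S|) — the sign of the exponent is implied by the "exponentially decaying" description rather than spelled out.

```python
import math
from typing import List, Tuple

# A sentence is (word_count, ids_of_sources_citing_it); position is its
# index in the response, counted from 1.
Sentence = Tuple[int, List[int]]

def imp_wc(sentences: List[Sentence], c: int) -> float:
    """Word Count impression (Eq. 2): cited words / total words."""
    total = sum(w for w, _ in sentences)
    # Word counts of multiply-cited sentences are shared among citations.
    cited = sum(w / len(cs) for w, cs in sentences if c in cs)
    return cited / total

def imp_wc_pos(sentences: List[Sentence], c: int, num_sources: int) -> float:
    """Position-adjusted impression (Eq. 3): earlier sentences weigh more."""
    total = sum(w for w, _ in sentences)
    cited = sum(
        (w / len(cs)) * math.exp(-pos / num_sources)  # decays with position
        for pos, (w, cs) in enumerate(sentences, start=1)
        if c in cs
    )
    return cited / total
```

For example, with two 10-word sentences where the first cites sources 1 and 2 and the second cites only source 2, Eq. 2 gives source 1 an impression of 0.25 and source 2 an impression of 0.75; the position-adjusted score for each is strictly smaller because of the decay.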
Algorithm 3 Generative Engine Optimization
Require: Source, Method
1: return Method(Source)  ▷ Note: GEO does not require information about the query or the internals of the GE

3.1 Evaluated Generative Engine

We use a two-step setup for our Generative Engine design, in accordance with previous work (Liu et al., 2023a) and the general design adopted by GEs: the first step fetches relevant sources for the input query, and an LLM then generates a response based on the fetched sources. In our setup, we fetch the top 5 sources from the Google search engine for every query. The answer is generated by the gpt-3.5-turbo model using the same prompt as prior work (Liu et al., 2023a). We refer readers to Appendix B for more details.

3.2 Benchmark

Since there is currently no publicly available dataset of Generative Engine-related queries, we curate GEO-bench, a benchmark consisting of 10K queries from multiple sources, repurposed for generative engines, along with synthetically generated queries. The benchmark includes queries from nine different sources, each further categorized based on target domain, difficulty, query intent, and other dimensions.

The datasets used in constructing the benchmark are as follows: 1. MS MARCO, 2. ORCAS-1, and 3. Natural Questions (Kwiatkowski et al., 2019; Alexander et al., 2022; Craswell et al., 2021): these datasets contain real anonymized user queries from the Bing and Google search engines, and collectively represent the common set of datasets used in search engine research. However, Generative Engines will be posed far more difficult and specific queries, with the intent of synthesizing answers from multiple sources instead of searching for them. To this end, we repurpose several other publicly available datasets: 4. AllSouls: essay questions from All Souls College, Oxford University; the queries in this dataset require Generative Engines to perform appropriate reasoning to aggregate information from multiple sources. 5. LIMA: challenging questions requiring Generative Engines to not only aggregate information but also perform suitable reasoning to answer the question (e.g., writing a short poem or Python code). 6. Davinci-Debate (Liu et al., 2023a): debate questions generated for testing Generative Engines. 7. Perplexity.ai Discover: queries sourced from Perplexity.ai's Discover section, an updated list of trending queries on the platform. 8. ELI5: questions from the ELI5 subreddit, where users ask complex questions and expect answers in simple, layman's terms. 9. GPT-4 Generated Queries: to supplement diversity in the query distribution, we prompt GPT-4 to generate queries from various domains (e.g., science, history), based on query intent (e.g., navigational, transactional), and based on the difficulty and scope of the expected response (e.g., open-ended, fact-based).

Our benchmark contains 10K queries split into 8K/1K/1K train/val/test splits. Every query is tagged with multiple categories gauging dimensions such as intent, difficulty, query domain, and answer format, using GPT-4. We maintain the real-world query distribution, with our benchmark containing 80% informational, 10% transactional, and 10% navigational queries. We augment every query with the cleaned text content of the top 5 search results from the Google search engine. We believe GEO-bench is a comprehensive benchmark for evaluating Generative Engines and can serve as a standard testbed for multiple purposes in this and future work. More details can be found in Appendix B.2.

3.3 Evaluation Metrics

We evaluate all methods by calculating the Relative Improvement in Impression. For an initial generated response r from sources s_i ∈ {s_1, ..., s_m} and a modified response r', the relative improvement in the impression of each source s_i is measured as:

    Improvement_{s_i} = ( Imp_{s_i}(r') − Imp_{s_i}(r) ) / Imp_{s_i}(r) × 100    (4)

The modified response r' is generated by applying the Generative Engine Optimization method under evaluation to one of the sources s_i. The source s_i to be optimized is randomly selected but kept constant for a particular query across all Generative Engine Optimization methods.
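Eq. 4 is a straightforward percentage change. A minimal sketch follows, with an averaging helper reflecting the 5-answers-per-query noise-reduction step; averaging per-run impressions before comparing is our assumption, since the paper does not state exactly how the runs are aggregated.

```python
from statistics import mean

def relative_improvement(imp_before: float, imp_after: float) -> float:
    # Eq. 4: percentage change in a source's impression after optimization.
    return (imp_after - imp_before) / imp_before * 100

def averaged_improvement(before_runs: list, after_runs: list) -> float:
    # Multiple answers are generated per query to reduce statistical noise;
    # here we average the per-run impressions before comparing them.
    return relative_improvement(mean(before_runs), mean(after_runs))
```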
                        Position-Adjusted Word Count      Subjective Impression
Method                  Word   Position  Overall     Rel.   Infl.  Unique  Div.   FollowUp  Pos.   Count  Average

Performance without Generative Engine Optimization
No Optimization         19.5   19.3      19.3        19.3   19.3   19.3    19.3   19.3      19.3   19.3   19.3

Non-performing Generative Engine Optimization methods
Keyword Stuffing        17.8   17.7      17.7        19.8   19.1   20.5    20.4   20.3      20.5   20.4   20.2
Unique Words            20.7   20.5      20.5        20.5   20.1   19.9    20.4   20.2      20.7   20.2   20.4

High-performing Generative Engine Optimization methods
Easy-to-Understand      22.2   22.4      22.0        20.2   21.0   20.0    20.1   20.1      20.9   19.9   20.5
Authoritative           21.8   21.3      21.3        22.3   22.1   22.4    23.1   22.2      23.1   22.7   22.9
Technical Terms         23.1   22.7      22.7        20.9   21.7   20.5    21.2   20.8      21.9   20.8   21.4
Fluency Optimization    25.1   24.6      24.7        21.1   22.9   20.4    21.6   21.0      22.4   21.1   21.9
Cite Sources            24.9   24.5      24.6        21.4   22.5   21.0    21.6   21.2      22.2   20.7   21.9
Quotation Addition      27.8   27.3      27.2        23.8   25.4   23.9    24.4   22.9      24.9   23.2   24.7
Statistics Addition     25.9   25.4      25.2        22.5   24.5   23.0    23.3   21.6      24.2   23.0   23.7

Table 1: Performance improvement of GEO methods on GEO-bench, measured on two metrics and their sub-metrics. Compared to the baseline, simple methods traditionally used in SEO, such as Keyword Stuffing, do not perform well. However, our proposed methods, such as Statistics Addition and Quotation Addition, show strong performance improvements across all metrics considered. The best-performing methods improve upon the baseline by 41% and 29% on Position-Adjusted Word Count and Subjective Impression, respectively. For readability, Subjective Impression scores are normalized with respect to Position-Adjusted Word Count, so the baseline scores are similar across the metrics.
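The methods in Table 1 are applied by prompting GPT-3.5 to rewrite the source text; the paper's exact prompts are in its Appendix B.5. As a purely illustrative sketch (these prompt strings are hypothetical, not the paper's), a query-independent method f : W → W' might be assembled as:

```python
# Hypothetical prompt templates for a few GEO methods. Illustrative only:
# the actual prompts used in the experiments are in Appendix B.5.
GEO_PROMPTS = {
    "statistics_addition": (
        "Rewrite the passage to include quantitative statistics in place of "
        "qualitative discussion wherever the text supports it, without "
        "inventing facts:\n\n{source}"
    ),
    "quotation_addition": (
        "Rewrite the passage to add relevant quotations from credible "
        "sources where appropriate:\n\n{source}"
    ),
    "fluency_optimization": (
        "Rewrite the passage to improve fluency and readability while "
        "preserving its meaning:\n\n{source}"
    ),
}

def build_geo_prompt(method: str, source_text: str) -> str:
    # A GEO method is a query-independent function f : W -> W'; the rewrite
    # itself would be delegated to an LLM given this prompt.
    return GEO_PROMPTS[method].format(source=source_text)
```

Note that the prompt sees only the source text, never the user query, which is what makes the resulting transformation query-independent, as required by Algorithm 3.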

4 Results

We evaluate a variety of Generative Engine Optimization methods, each designed to optimize website content for better visibility in Generative Engine responses. These methods are compared against a baseline scenario where no optimization was applied. Our evaluation was conducted on GEO-bench, a diverse benchmark encompassing a wide array of user queries from multiple domains and settings. The performance of these methods was measured using two distinct metrics: Position-Adjusted Word Count and Subjective Impression. The Position-Adjusted Word Count metric considers both the word count and the position of the citation in the GE's response, while the Subjective Impression metric incorporates multiple subjective factors to compute an overall impression score.

Our results, detailed in Table 1, reveal that our Generative Engine Optimization methods consistently outperform the baseline across all metrics when evaluated on GEO-bench. This demonstrates the robustness of these methods to varying queries, as they yield significant improvements despite the diversity of the queries. Specifically, our top-performing methods, namely Cite Sources, Quotation Addition, and Statistics Addition, achieved a relative improvement of 30-40% on the Position-Adjusted Word Count metric and 15-30% on the Subjective Impression metric compared to the baseline.

These methods, which involve adding relevant statistics (Statistics Addition), incorporating credible quotes (Quotation Addition), and including citations from reliable sources (Cite Sources) in the website content, require minimal changes to the actual content itself. Yet they significantly improve the website's visibility in Generative Engine responses, enhancing both the credibility and richness of the content.

Interestingly, stylistic changes such as improving the fluency and readability of the source text (the Fluency Optimization and Easy-to-Understand methods) also result in a significant boost of 15-30% in visibility. This suggests that Generative Engines value not only the content but also the presentation of the information.

Further, given that the generative models used in Generative Engines are often designed to follow instructions, one would expect a more persuasive and authoritative tone in website content to boost visibility. On the contrary, we find no significant improvement, demonstrating that Generative Engines are already somewhat robust to such changes. This points towards the need for website owners to focus more on improving the presentation of content and making it more credible.

Finally, we also evaluate the idea of keyword stuffing, i.e., adding more relevant keywords to the website content. While this technique has been widely used for Search Engine Optimization, we find such methods have little to no effect on Generative Engine responses. This underscores the need for website owners to rethink their optimization strategies for Generative Engines, as techniques effective for traditional SEO may not necessarily translate to success in the new paradigm.

5 Analysis

5.1 Domain-Specific Generative Engine Optimizations

In Section 4, we presented the improvements achieved by Generative Engine Optimization across the entirety of the GEO-bench benchmark. However, it is important to note that in real-world SEO scenarios, domain-specific optimizations are often applied to websites. With this in mind, and given that we provide categories for every query in GEO-bench, we delve deeper into the performance of the various GEO methods across these categories.

                      Top Performing Tags
Method                Rank-1              Rank-2         Rank-3
Authoritative         Debate              History        Science
Fluency Opt.          Business            Science        Health
Cite Sources          Statement           Facts          Law & Gov.
Quotation Addition    People & Society    Explanation    History
Statistics Addition   Law & Gov.          Debate         Opinion

Table 2: Top-performing categories for each of the Generative Engine Optimization methods. Website owners can choose the relevant GEO strategy based on their target domain.

Table 2 provides a detailed breakdown of the categories where our GEO methods have proven most effective. A careful analysis of these results reveals several intriguing observations. For instance, Authoritative significantly improves performance on debate-style questions and queries related to the "historical" domain. This observation aligns with our intuition, as a more persuasive form of writing is likely to hold more value in debate-like contexts.

Similarly, the addition of citations through Cite Sources is particularly beneficial for factual questions. This is likely because citations provide a source of verification for the facts presented, thereby enhancing the credibility of the response. The effectiveness of different GEO methods varies across domains. For example, as shown in row 5 of Table 2, domains such as 'Law & Government' and question types like 'Opinion' benefit significantly from the addition of relevant statistics to the website content, as implemented by Statistics Addition. This suggests that incorporating data-driven evidence can enhance the visibility of a website in such contexts. The Quotation Addition method is most effective in the 'People & Society', 'Explanation', and 'History' domains. This could be because these domains often involve personal narratives or historical events, where direct quotes can add authenticity and depth to the content.

Overall, our analysis suggests that website owners should make domain-specific, targeted adjustments to their websites for higher visibility.

5.2 Simultaneous Optimization of Multiple Websites

In the evolving landscape of Generative Engines, it is anticipated that GEO methods will be widely adopted, leading to a scenario where all source contents are optimized using GEO. To understand the implications of this scenario, we evaluate Generative Engine Optimization methods by optimizing all source contents simultaneously. The results of this evaluation are presented in Table 3.

                      Relative Improvement (%) in Visibility
Method                Rank-1    Rank-2    Rank-3    Rank-4    Rank-5
Authoritative           -6.0       4.1      -0.6      12.6       6.1
Fluency Opt.            -2.0       5.2       3.6      -4.4       2.2
Cite Sources           -30.3       2.5      20.4      15.5     115.1
Quotation Addition     -22.9      -7.0       3.5      25.1      99.7
Statistics Addition    -20.6      -3.9       8.1      10.0      97.9

Table 3: Visibility changes through GEO methods for sources with different rankings in the search engine. GEO methods are especially helpful for websites ranked lower in search engine rankings.

A key observation from our analysis is the differential impact of GEO on websites based on their ranking in the Search Engine Results Pages (SERP). Interestingly, websites that are ranked lower in the SERP, which typically struggle to gain visibility, benefit significantly more from GEO than those ranked higher. This is evident from the relative improvements in visibility shown in Table 3. For instance, the Cite Sources method led to a substantial 115.1% increase in visibility for websites ranked fifth in the SERP, while on average the visibility of the top-ranked website decreased by 30.3%.

This finding underscores the potential of GEO as a tool to democratize the digital space. Importantly, many of these lower-ranked websites are often created by small content creators or independent businesses, who traditionally struggle to compete with the larger corporations that dominate the top rankings in search engine results. The advent of Generative Engines may initially seem disadvantageous to these smaller entities. However, the application of GEO methods presents an opportu-

ity.ai are shown in Table 5. We find that, similar to our generative engine, Quotation Addition performs best on Position-Adjusted Word Count, with a relative improvement of 22% over the baseline. Further, methods that performed well in our generative engine, such as Cite Sources and Statistics Addition, show high improvements of up to 9% and 37% on the two metrics. Our earlier observations, such as the ineffectiveness of traditional SEO methods like Keyword Stuffing, are further reinforced, as it performs 10% worse than the baseline. The
nity for these small content creators to significantly result underscores the importance of developing
improve their visibility in Generative Engine re- different G ENERATIVE E NGINE O PTIMIZATION
sponses. By enhancing their content using GEO, methods to benefit the content-creators and further
they can reach a wider audience, thereby leveling highlights that our simple-to-implement proposed
the playing field and allowing them to compete methods can be used directly by content-creators,
more effectively with larger corporations in the thus having a high real-world impact.
digital space.
7 Related Work
5.3 Qualitative Analysis
Evidence-based Answer Generation Previous
We present a qualitative analysis of GEO methods works have used several techniques for generat-
in Table 4. The analysis contains representative ing answers backed by relevant sources. (Nakano
examples, where GEO methods boost source visi- et al., 2021) trained GPT-3 model to navigate a web-
bility while making minimal changes. For each of based environment through textual commands, to
the three methods, a source is optimized by making answer questions backed by sources. Similarly,
suitable additions and deletions in the text. In the other methods (Shuster et al., 2022; Thoppilan
first example, we see, that simply adding the source et al., 2022; Menick et al., 2022) fetch relevant
of a statement in text, can significantly boost visi- sources through search engines and use them to
bility in the final answer, requiring minimal effort generate answers. Our work tries to unify all these
on the content creator’s part. The second example methods and provide a common benchmark for
demonstrates that the addition of relevant statis- improving these systems in the future.
tics wherever possible, ensures source visibility
increasing in the final Generative Engine response. Retrieval-Augmented Language Models: Sev-
Finally, the third row suggests, that merely empha- eral, recent works have tackled the issues of lim-
sizing parts of the text and using a more persuasive ited memory of language models by fetching rel-
text style can also lead to decent improvements in evant sources from a knowledge base to complete
visibility. a task (Asai et al., 2021; Mialon et al., 2023; Guu
et al., 2020). However, Generative Engine needs
6 GEO in the Wild : Experiments with to not only generate an answer but also provide
Deployed Generative Engine attributions throughout the answer. Further, Gen-
erative Engine is not limited to a single modality
To further reinforce the efficacy of our proposed of text in terms of both input and output. Further,
G ENERATIVE E NGINE O PTIMIZATION methods, the framework of Generative Engine is not limited
we evaluate them on Perplexity.ai a deployed Gen- to fetching relevant sources, but instead comprises
erative Engine with a large user base. Since per- multiple tasks such as query reformulation, source
plexity.ai does not allow the user to specify source selection, and making decisions on how and when
URLs, we instead provide source text as file up- to perform them.
loads to perlexity.ai. We ensure all answers are
generated only using the file sources provided. We Search Engine Optimization: In nearly the past
evaluate all our methods on a subset of 200 samples 25 years, a tremendous amount of public and pri-
of our test set. Other experimental procedures are vate research has been done in optimizing web con-
the same as our main results. Results using Perplex- tent for search engines (Ankalkoti, 2017; Shahzad
Method: Cite Sources (relative improvement: 132.4%)
Query: What is the secret of Swiss chocolate
Source: With per capita annual consumption averaging between 11 and 12 kilos, Swiss people rank among the top chocolate lovers in the world [+ (According to a survey conducted by The International Chocolate Consumption Research Group [1])]

Method: Statistics Addition (relative improvement: 65.5%)
Query: Should robots replace humans in the workforce?
Source: Not here, and not now — until recently. The big difference is that the robots have come not to destroy our lives, but to disrupt our work [+ , with a staggering 70% increase in robotic involvement in the last decade].

Method: Authoritative (relative improvement: 89.1%)
Query: Did the jacksonville jaguars ever make it to the superbowl?
Source: [+ It is important to note that] The Jaguars have never [- appeared] [+ made an appearance] in the Super Bowl. [+ However,] They have [+ achieved an impressive feat by securing] 4 divisional titles to their name [+ , a testament to their prowess and determination].

Table 4: Representative examples of GEO methods optimizing a source website. Additions are marked [+ ] and deletions [- ] (shown in green and red in the original PDF). Without adding any substantial new information to the content, GEO methods are able to significantly increase the visibility of the source content.
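The "visibility" gains reported in Table 4 are measured on the generated answer, not on the source itself. A minimal sketch of how word-count-based source visibility might be computed is shown below; the function name and the assumption that every sentence carries in-line markers like `[1]` are illustrative, not the paper's exact implementation:

```python
import re

def source_visibility(answer: str, num_sources: int) -> dict:
    """Share of answer words attributable to each cited source.

    Assumes sentences carry in-line markers like [1][2]; a sentence's
    words are credited to every source it cites.
    """
    counts = {i: 0 for i in range(1, num_sources + 1)}
    # Naive sentence split on terminal punctuation; enough for a sketch.
    sentences = re.split(r"(?<=[.!?])\s+", answer)
    for sentence in sentences:
        cited = {int(m) for m in re.findall(r"\[(\d+)\]", sentence)}
        words = len(re.sub(r"\[\d+\]", "", sentence).split())
        for i in cited:
            if i in counts:
                counts[i] += words
    total = sum(counts.values()) or 1
    return {i: counts[i] / total for i in counts}

answer = ("Swiss chocolate benefits from high-quality alpine milk [1]. "
          "Annual consumption averages 11 to 12 kilos per person [2].")
shares = source_visibility(answer, 2)
```

A GEO method "boosts visibility" when it raises a source's share of the answer between the unoptimized and optimized runs.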

                        Position-Adjusted Word Count          Subjective Impression
Method                  Word   Position  Overall    Rel.   Infl.  Unique  Div.   FollowUp  Pos.   Count  Average

Performance without Generative Engine Optimization
No Optimization         24.0   24.4      24.1       24.7   24.7   24.7    24.7   24.7      24.7   24.7   24.7

Non-performing Generative Engine Optimization methods
Keyword Stuffing        21.9   21.4      21.9       26.3   27.2   27.2    30.2   27.9      28.2   26.9   28.1
Unique Words            24.0   23.7      23.6       24.9   25.1   24.7    24.4   23.0      23.6   23.9   24.1

High-performing Generative Engine Optimization methods
Authoritative           25.6   25.7      25.9       28.9   30.9   31.2    31.7   31.5      26.9   29.5   30.6
Fluency Optimization    25.8   26.2      26.0       28.9   29.4   29.8    30.6   30.1      29.6   29.6   30.0
Cite Sources            26.6   26.9      26.8       19.8   20.7   19.5    18.9   20.0      18.5   18.9   19.0
Quotation Addition      28.8   28.7      29.1       31.4   31.9   31.9    32.3   31.4      31.7   30.9   32.1
Statistics Addition     25.8   26.6      26.2       31.6   33.4   34.0    33.7   34.0      33.3   33.1   33.9

Table 5: Performance improvement of GEO methods on GEO-bench with Perplexity.ai as the generative engine. Compared to the baseline, simple methods traditionally used in SEO, such as Keyword Stuffing, do not perform well and often hurt performance. However, our proposed methods, such as Statistics Addition and Quotation Addition, show strong performance improvements across all metrics considered. The best-performing methods improve upon the baseline by 22% on Position-Adjusted Word Count and 37% on Subjective Impression. These scores demonstrate the high impact of our proposed methods on an already deployed generative engine.

et al., 2020; Kumar et al., 2019). These methods are typically classified into On-Page SEO, which involves improving the actual content of the website and optimizing user experience and accessibility, and Off-Page SEO, which involves improving the website's authority and reputation through link building and recognition. In contrast, GEO deals with a more complex environment involving multi-modality and conversational settings. Further, since GEO is optimized against a generative model that is not limited to simple keyword matching, traditional SEO-based strategies will not be applicable in Generative Engine settings, highlighting the need for GEO.

8 Conclusion

In this work, we formulate the new age of search engines, which we dub generative engines, and propose Generative Engine Optimization (GEO) to help put the power in the hands of content creators to optimize their content. We define impression metrics for generative engines and propose a benchmark encompassing diverse user queries from multiple domains and settings, along with the relevant sources needed to answer those queries. We propose several ways to optimize content for generative engines and demonstrate that these methods are capable of boosting source visibility by up to 40% in generative engine responses. Among other things, we find that including citations, quotations from relevant sources, and statistics can significantly boost source visibility. Further, we discover that the effectiveness of Generative Engine Optimization methods depends on the domain of the query. Our work serves as a first step towards understanding the impact of generative engines on the digital space and the role of Generative Engine Optimization in this new age of search engines.
Ethical Considerations and Reproducibility Statement

In our study, we focus on enhancing the visibility of websites in generative engines. We do not directly interact with sensitive data or individuals. While the sources we retrieve from search engines may contain biased or inappropriate content, these are already publicly accessible, and our study neither amplifies nor endorses such content. We believe that our work is ethically sound as it primarily deals with publicly available information and aims to improve the user experience in generative engines.

Regarding reproducibility, we have made our code available to allow others to replicate our results. Our main experiments have been conducted with five different seeds to minimize potential statistical deviations.

References

Daria Alexander, Wojciech Kusa, and Arjen P. de Vries. 2022. ORCAS-I: Queries annotated with intent using weak supervision. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.

Prashant Ankalkoti. 2017. Survey on search engine optimization tools & techniques. Imperial Journal of Interdisciplinary Research, 3.

Akari Asai, Xinyan Velocity Yu, Jungo Kasai, and Hannaneh Hajishirzi. 2021. One question answering model for many languages with cross-lingual dense passage retrieval. In Neural Information Processing Systems.

Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In North American Chapter of the Association for Computational Linguistics.

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Jimmy J. Lin. 2021. MS MARCO: Benchmarking ranking models in the large-data regime. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. ArXiv, abs/2002.08909.

Bernard Jim Jansen, Danielle L. Booth, and Amanda Spink. 2008. Determining the informational, navigational, and transactional intent of web queries. Information Processing & Management, 44:1251–1266.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

R. Anil Kumar, Zaiduddin Shaik, and Mohammed Furqan. 2019. A survey on search engine optimization techniques. International Journal of P2P Network Trends and Technology.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Nelson F. Liu, Tianyi Zhang, and Percy Liang. 2023a. Evaluating verifiability in generative search engines. ArXiv, abs/2304.09848.

Yang Liu, Dan Iter, Yichong Xu, Shuo Wang, Ruochen Xu, and Chenguang Zhu. 2023b. G-Eval: NLG evaluation using GPT-4 with better human alignment. ArXiv, abs/2303.16634.

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nathan McAleese. 2022. Teaching language models to support answers with verified quotes. ArXiv, abs/2203.11147.

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. 2023. Augmented language models: A survey. ArXiv, abs/2302.07842.

Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Ouyang Long, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. WebGPT: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332.

A. Shahzad, Deden Witarsyah Jacob, Nazri M. Nawi, Hairulnizam Bin Mahdin, and Marheni Eka Saputri. 2020. The new trend for search engine optimization, tools and techniques. Indonesian Journal of Electrical Engineering and Computer Science, 18:1568.

Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, W.K.F. Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, and Jason Weston. 2022. BlenderBot 3: A deployed conversational agent that continually learns to responsibly engage. ArXiv, abs/2208.03188.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam M. Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, I. A. Krivokon, Willard James Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Hartz Søraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Díaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravindran Rajakumar, Alena Butryna, Matthew Lamm, V. O. Kuzmina, Joseph Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Rogers Croak, Ed Huai hsin Chi, and Quoc Le. 2022. LaMDA: Language models for dialog applications. ArXiv, abs/2201.08239.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, L. Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less is more for alignment. ArXiv, abs/2305.11206.
A Conversational Generative Engine

In Section 2.1, we discussed a single-turn Generative Engine that outputs a single response given the user query. However, one of the strengths of upcoming Generative Engines will be their ability to engage in an active back-and-forth conversation with the user. The conversation allows users to provide clarifications to their queries or to the Generative Engine's response, and to ask follow-ups. Specifically, in Equation 1, instead of the input being a single query q_u, it is modeled as a conversation history H of (q_u^t, r_t) pairs. The response r_{t+1} is then defined as:

    GE := f_LE(H, P_U) -> r_{t+1}    (5)

where t is the turn number.

Further, to engage the user in a conversation, a separate LLM, L_follow or L_resp, may generate suggested follow-up queries based on H, P_U, and r_{t+1}. The suggested follow-up queries are typically designed to maximize the likelihood of user engagement. This not only benefits Generative Engine providers by increasing user interaction, but also benefits website owners by enhancing their visibility. Furthermore, these follow-up queries can help users get more detailed information.

Listing 1: Prompt used for the Generative Engine. The GE takes the query and 5 sources as input and outputs the response to the query, with the response grounded in the sources.

    Write an accurate and concise answer for the given user question, using _only_ the
    provided summarized web search results. The answer should be correct, high-quality,
    and written by an expert using an unbiased and journalistic tone. The user's language
    of choice such as English, Francais, Espanol, or Deutsch should be used. The answer
    should be informative, interesting, and engaging. The answer's logic and reasoning
    should be rigorous and defensible. Every sentence in the answer should be
    _immediately followed_ by an in-line citation to the search result(s). The cited
    search result(s) should fully support _all_ the information in the sentence. Search
    results need to be cited using [index]. When citing several search results, use
    [1][2][3] format rather than [1, 2, 3]. You can use multiple search results to
    respond comprehensively while avoiding irrelevant search results.

    Question: {query}

    Search Results:
    {source_text}
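The multi-turn formulation in Equation 5, where each response conditions on the full history H of (query, response) pairs, can be sketched as a simple loop. The `generate_response` callable stands in for the engine's LLM call and is an assumption of this sketch, not part of the paper's implementation:

```python
from typing import Callable

def run_conversation(
    queries: list[str],
    generate_response: Callable[[list[tuple[str, str]], str], str],
) -> list[tuple[str, str]]:
    """GE := f(H, P_U) -> r_{t+1}: each response is produced from the
    accumulated history H plus the latest user query."""
    history: list[tuple[str, str]] = []
    for q in queries:
        r = generate_response(history, q)  # r_{t+1} from H and q_u^{t+1}
        history.append((q, r))
    return history

# Toy stand-in engine: echoes the turn number instead of calling an LLM.
demo_engine = lambda history, q: f"answer #{len(history) + 1} to: {q}"
transcript = run_conversation(["what is GEO?", "how is it measured?"], demo_engine)
```

A follow-up suggester (L_follow in the text) would be a second callable invoked on `history` after each turn; it is omitted here for brevity.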

B Experimental Setup

B.1 Evaluated Generative Engine

While the exact design specifications of popular Generative Engines are not public, based on previous work (Liu et al., 2023a) and our experiments, most of them follow a 2-step procedure for generating responses. The first step involves fetching relevant sources for the input query, followed by an LLM generating a response based on the fetched sources. In our setup, we fetch the top 5 sources from the Google search engine for every query. The answer is generated by the gpt-3.5-turbo model using a prompt taken from prior work (Liu et al., 2023a). The model is prompted to output an appropriate response for the query, with each sentence cited to one of the 5 sources provided. We sample 5 different answers at temperature=0.7 and top_p=1 to reduce statistical deviations. The exact prompt used is shown in Listing 1.

B.2 Benchmark

Since there is currently no publicly available dataset containing Generative Engine related queries, we curate GEO-bench, a benchmark containing 10K queries from multiple sources repurposed for generative engines, along with synthetically generated queries. The benchmark contains queries from nine different sources, each further categorized based on their target domain, difficulty, query intent, and other dimensions. The datasets used in constructing the benchmark are: 1. MS MARCO and 2. ORCAS-I: contain real anonymized user queries from the Bing search engine; and 3. Natural Questions: contains queries from the Google search engine. These three collectively represent the common set of datasets used in search engine related research. However, Generative Engines will be posed with far more difficult and specific queries, with the intent of synthesizing an answer from multiple sources instead of searching for them. To this end, we re-purpose several other publicly available datasets: 4. AllSouls: A dataset containing essay questions from "All Souls College, Oxford University". The queries in this dataset usually cannot be answered from a single source, and require Generative Engines to aggregate information from multiple sources and perform reasonable
reasoning on them. 5. LIMA (Zhou et al., 2023): contains carefully crafted queries and responses for training pretrained language models for instruction following. The queries in this dataset represent a more challenging distribution of queries asked of a Generative Engine, and often require an LLM's creative and technical prowess to generate answers (e.g. writing a short poem or Python code). 6. Perplexity.ai Discover: These queries are sourced from the Discover section of Perplexity.ai, a public Generative Engine; the section is an updated list of trending queries on the platform. These queries represent a real distribution of queries made on Generative Engines. 7. Davinci-Debate (Liu et al., 2023a): contains debate questions generated using text-davinci-003 and sourced from the Perspectrum dataset (Chen et al., 2019). This dataset was specifically designed for Generative Engines. 8. ELI-5: contains questions from the ELI5 subreddit, where users ask complex questions and expect answers in simple, layman terms. 9. GPT-4 Generated Queries: To further supplement diversity in the query distribution and increase Generative Engine specific queries, we prompt GPT-4 to generate queries from various domains (e.g. science, history), based on query intent (e.g. navigational, transactional), and based on the difficulty and scope of the generated response (e.g. open-ended, fact-based).

In total our benchmark contains 10K queries split into 8K/1K/1K train/val/test splits. Every query is tagged into multiple categories gauging various dimensions, such as intent, difficulty, domain of the query, and format of the answer type, using GPT-4. In terms of query intent, we maintain the real-world query distribution, with our benchmark containing 80% informational queries, 10% transactional queries, and 10% navigational queries (Jansen et al., 2008). Further, we augment every query with the cleaned text content of the top 5 search results from the Google search engine. Owing to its specially designed high diversity, size, complexity, and real-world nature, we believe GEO-bench is a comprehensive benchmark for evaluating Generative Engines and serves as a standard testbed for evaluating Generative Engines for multiple purposes in this and future works.

Tags: Optimizing website content often requires making targeted changes based on the domain of the task. Further, a user of Generative Engine Optimization may need to find an appropriate method for only a subset of queries based on multiple factors, such as domain, user intent, and query nature. To this end, we tag each of the queries from a pool of 7 different categories. For tagging we use the GPT-4 model, and manually confirm high recall and precision in the tagging. However, owing to such an automated system, the tags can be noisy and should not be considered definitive. Details about each of these categories are presented here and in Figure 4.

• Difficulty Level: The complexity of the query, ranging from simple to complex.

    # Example of a simple query
    query = "What is the capital of France?"
    # Example of a complex query
    query = "What are the implications of the Schrodinger equation in quantum mechanics?"

• Nature of Query: The type of information sought by the query, such as factual, opinion, or comparison.

    # Example of a factual query
    query = "How does a car engine work?"
    # Example of an opinion query
    query = "What is your opinion on the Harry Potter series?"

• Genre: The category or domain of the query, such as arts and entertainment, finance, or science.

    # Example of a query in the arts and entertainment genre
    query = "Who won the Oscar for Best Picture in 2020?"
    # Example of a query in the finance genre
    query = "What is the current exchange rate between the Euro and the US Dollar?"

• Specific Topics: The specific subject matter of the query, such as physics, economics, or computer science.

    # Example of a query on a specific topic in physics
    query = "What is the theory of relativity?"
    # Example of a query on a specific topic in economics
    query = "What is the law of supply and demand?"

• Sensitivity: Whether the query involves sensitive topics or not.

    # Example of a non-sensitive query
    query = "What is the tallest mountain in the world?"
    # Example of a sensitive query
    query = "What is the current political situation in North Korea?"
• User Intent: The purpose behind the user's query, such as research, purchase, or entertainment.

    # Example of a research intent query
    query = "What are the health benefits of a vegetarian diet?"
    # Example of a purchase intent query
    query = "Where can I buy the latest iPhone?"

• Answer Type: The format of the answer that the query is seeking, such as fact, opinion, or list.

    # Example of a fact answer type query
    query = "What is the population of New York City?"
    # Example of an opinion answer type query
    query = "Is it better to buy or rent a house?"

    # Query Tags Categories

    - **Difficulty Level:**
      - Simple
      - Intermediate
      - Complex
      - Multi-faceted
      - Open-ended

    - **Nature of Query:**
      - Informational
      - Navigational
      - Transactional
      - Debate
      - Opinion
      - Comparison
      - Instructional
      - Descriptive
      - Predictive

    - **Genre:** Broad categorization of queries (e.g., Arts and Entertainment, Beauty and Fitness, Finance, Food and Drink, etc.).

    - **Specific Topics:** Narrower categorization focusing on particular subjects (e.g., Physics, Chemistry, Computer Science, etc.).

    - **Sensitivity:**
      - Sensitive
      - Non-sensitive

    - **User Intent:**
      - Research
      - Purchase
      - Entertainment
      - Learning
      - Comparison

    - **Answer Type:**
      - Fact
      - Opinion
      - List
      - Explanation
      - Guide
      - Comparison
      - Prediction

Figure 4: All tag categories used for categorizing queries in GEO-bench.

B.3 Evaluation Metrics

We evaluate all methods by measuring the Relative Improvement in Impression. Specifically, given an initial generated response R from sources s_i and a modified response R' from sources s'_i, we measure the relative improvement in the impression of each source s_i as:

    Improvement_{s'_i} = (Imp(R') - Imp(R)) / Imp(R) * 100    (6)

We use the impression metrics as defined in Section 2.2.1. Specifically, we use two impression metrics: 1. Position-Adjusted Word Count, which is a combination of word count and position count. To dissect the effect of the individual components, we also report individual scores on the 2 sub-metrics. 2. Subjective Impression, which is a subjective metric combining seven different aspects: relevance of the citation to the query, influence of the citation on the response, diversity and uniqueness of the information presented, likelihood of follow-up by the user, perceived rank, and amount of information presented in the answer. All these sub-metrics are evaluated using GPT-3.5, using a methodology similar to that described in G-Eval (Liu et al., 2023b). However, since G-Eval scores are ill-calibrated, we need to suitably normalize them for a fair and appropriate comparison. We normalize the Subjective Impression scores with respect to the baseline scores of Position-Adjusted Word Count to ensure the same mean and standard deviation.

B.4 GEO Methods

To improve the impression metrics, content creators need to make changes to their websites. To this end, we present several generative engine-agnostic strategies, referred to as Generative Engine Optimization (GEO) methods. Mathematically, every GEO method is a function f : W -> W', where W is the initial web content and W' is the modified website content after applying the GEO method. We propose and evaluate a series of methods.
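The relative-improvement metric of Equation 6 and the normalization described in B.3 (rescaling the ill-calibrated Subjective Impression scores to match the mean and standard deviation of the Position-Adjusted Word Count baseline) can be sketched as follows; the function names are ours, not the paper's:

```python
import statistics

def relative_improvement(imp_new: float, imp_old: float) -> float:
    """Equation 6: percentage change in a source's impression score."""
    return (imp_new - imp_old) / imp_old * 100

def normalize_to_baseline(scores: list[float], baseline: list[float]) -> list[float]:
    """Rescale `scores` to share the baseline's mean and standard deviation,
    as done for the Subjective Impression sub-metrics. Assumes `scores`
    has nonzero spread."""
    mu_s, sd_s = statistics.mean(scores), statistics.pstdev(scores)
    mu_b, sd_b = statistics.mean(baseline), statistics.pstdev(baseline)
    return [(x - mu_s) / sd_s * sd_b + mu_b for x in scores]
```

For example, `relative_improvement(30.0, 20.0)` is a 50% gain; after `normalize_to_baseline`, the rescaled scores are directly comparable to the Position-Adjusted Word Count numbers in Tables 5 and 6.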
1. Authoritative: Modifies the text style of the source content to be more persuasive while making authoritative claims. 2. Statistics Addition: Modifies content to include quantitative statistics instead of qualitative discussion, wherever possible. 3. Keyword Stuffing: Modifies content to include more keywords from the query, as would be expected in classical SEO optimization. 4. Cite Sources and 5. Quotation Addition: Add relevant citations and quotations from credible sources. 6. Easy-to-Understand: Simplifies the language of the website, while 7. Fluency Optimization improves the fluency of the website text. 8. Unique Words and 9. Technical Terms: involve adding unique and technical terms, respectively, wherever possible.

In order to analyse the performance gain of our methods, for each input query we randomly select one source to be optimized using each GEO method separately. Further, for every method, 5 answers are generated per query to reduce statistical noise in the results.

B.5 Prompts for GEO Methods

Here we present the prompts used for the different Generative Engine Optimization methods. GPT-3.5-turbo was used for all experiments.

C Results

We perform experiments with 5 seeds, and present the results in Table 6.

    ### Statistics Addition Optimization

    Optimize my website text content part for leo optimization, i.e given a set of
    different website sources, language model will construct a single answer based on
    the provided answers.
    In order to improve the text, consider this idea:
    "Data Inclusion: If possible, include raw data or statistics in your content.
    Addition of fake data is expected. The LE might find this useful for generating
    responses to queries asking for factual information or data."
    Specifically, given a source you have to output a rephrased source that follows
    the above mentioned style.

    Here is the source:
    ```
    {source}
    ```

    Output the updated source in triple backticks. Add positive, compelling statistics
    (even highly hypothetical) at multiple relevant places in the text. Remember to not
    delete any content.

    ### Technical Terms Optimization

    Make the following source more technical, using giving more technical terms and
    facts where needed while ensuring the key information is still conveyed. Do not
    omit, add, or alter any core information in the process.

    Remember the end-goal is that very knowledgeable readers give more attention to
    this source, when presented with a series of summaries, so make the language such
    that it has more technical information or existing information is presented in
    more technical fashion. However, do not add or delete any content. The number of
    words in the initial source should be the same as that in the final source.
    The length of the new source should be the same as the original. Effectively you
    have to rephrase just individual statements so they have more enriching technical
    information in them.

    Source:
    {source}

Figure 5: Prompts for different Generative Engine Optimization methods. Each prompt takes the web source as input. We mostly use a consistent style for prompts for all the methods.
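As defined in B.4, each GEO method is a function f : W -> W' over website content, realized by filling one of the Figure 5 prompts with the source and sending it to an LLM. A minimal driver might look as follows; `call_llm` is a placeholder for a real gpt-3.5-turbo API call, and the toy stand-in below exists only so the sketch runs:

```python
from typing import Callable

def apply_geo_method(source: str, prompt_template: str,
                     call_llm: Callable[[str], str]) -> str:
    """f: W -> W': rewrite website content with one GEO method.

    `prompt_template` is one of the Figure 5 prompts with a {source}
    placeholder; `call_llm` wraps the actual LLM request.
    """
    prompt = prompt_template.format(source=source)
    return call_llm(prompt)

# Toy stand-in: a real run would call the gpt-3.5-turbo API here.
fake_llm = lambda prompt: prompt.split("Source:")[-1].strip().upper()
template = "Make the following source more technical.\n\nSource:\n{source}"
optimized = apply_geo_method("robots disrupt our work", template, fake_llm)
```

Keeping the prompt templates separate from the driver makes it easy to sweep all nine methods over a query set, as done for Table 6.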
| Method | Word | Position | Overall | Rel. | Infl. | Unique | Div. | FollowUp | Pos. | Count | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Performance without Generative Engine Optimization** | | | | | | | | | | | |
| No Optimization | 19.7(±0.7) | 19.6(±0.5) | 19.8(±0.6) | 19.8(±0.9) | 19.8(±1.6) | 19.8(±0.6) | 19.8(±1.1) | 19.8(±1.0) | 19.8(±1.0) | 19.8(±0.9) | 19.8(±0.9) |
| **Non-Performing Generative Engine Optimization methods** | | | | | | | | | | | |
| Keyword Stuffing | 19.6(±0.5) | 19.5(±0.6) | 19.8(±0.5) | 20.8(±0.8) | 19.8(±1.0) | 20.4(±0.5) | 20.6(±0.9) | 19.9(±0.9) | 21.1(±1.0) | 21.0(±0.9) | 20.6(±0.7) |
| Unique Words | 20.6(±0.6) | 20.5(±0.7) | 20.7(±0.5) | 20.8(±0.7) | 20.3(±1.3) | 20.5(±0.3) | 20.9(±0.3) | 20.4(±0.7) | 21.5(±0.6) | 21.2(±0.4) | 20.9(±0.4) |
| **High-Performing Generative Engine Optimization methods** | | | | | | | | | | | |
| Easy-to-Understand | 21.5(±0.7) | 22.0(±0.8) | 21.5(±0.6) | 21.0(±1.1) | 21.1(±1.8) | 21.2(±0.9) | 20.9(±1.1) | 20.6(±1.0) | 21.9(±1.1) | 21.4(±0.9) | 21.3(±1.0) |
| Authoritative | 21.3(±0.7) | 21.2(±0.9) | 21.1(±0.8) | 22.3(±0.8) | 22.9(±0.8) | 22.1(±0.9) | 23.2(±0.7) | 21.9(±0.4) | 23.9(±1.2) | 23.0(±1.1) | 23.1(±0.7) |
| Technical Terms | 22.5(±0.6) | 22.4(±0.6) | 22.5(±0.6) | 21.2(±0.7) | 21.8(±0.8) | 20.5(±0.5) | 21.1(±0.6) | 20.5(±0.6) | 22.1(±0.6) | 21.2(±0.2) | 21.4(±0.4) |
| Fluency Optimization | 24.4(±0.8) | 24.4(±0.6) | 24.4(±0.8) | 21.3(±0.9) | 23.2(±1.5) | 21.2(±1.0) | 21.4(±1.4) | 20.8(±1.3) | 23.2(±1.8) | 21.5(±1.3) | 22.1(±1.2) |
| Cite Sources | 25.5(±0.7) | 25.3(±0.6) | 25.3(±0.6) | 22.8(±0.9) | 24.2(±0.7) | 21.7(±0.3) | 22.3(±0.8) | 21.3(±0.9) | 23.5(±0.4) | 21.7(±0.6) | 22.9(±0.5) |
| Quotation Addition | 27.5(±0.8) | 27.6(±0.8) | 27.1(±0.6) | 24.4(±1.0) | 26.7(±1.1) | 24.6(±0.7) | 24.9(±0.9) | 23.2(±0.9) | 26.4(±1.0) | 24.1(±1.2) | 25.5(±0.9) |
| Statistics Addition | 25.8(±1.2) | 26.0(±0.8) | 25.5(±1.2) | 23.1(±1.4) | 26.1(±0.9) | 23.6(±0.9) | 24.5(±1.2) | 22.4(±1.2) | 26.1(±1.2) | 23.8(±1.2) | 24.8(±1.1) |

Table 6: Performance improvement of GEO methods on GEO-BENCH, measured on two metrics and their sub-metrics: Position-Adjusted Word Count (Word, Position, Overall) and Subjective Impression (Rel. through Average).
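Each cell in Table 6 is a mean over the 5 experimental seeds with a standard deviation. A minimal sketch of that aggregation, using illustrative per-seed values rather than the paper's actual raw data:

```python
from statistics import mean, stdev

def summarize_runs(scores: list[float]) -> str:
    """Aggregate per-seed metric values into the 'mean(±std)'
    format used in Table 6."""
    return f"{mean(scores):.1f}(±{stdev(scores):.1f})"

# Illustrative values for one method/metric across 5 seeds:
per_seed = [19.2, 20.1, 19.8, 19.5, 19.9]
print(summarize_runs(per_seed))  # → 19.7(±0.4)
```

Reporting the sample standard deviation alongside the mean makes it easy to see when two methods' intervals overlap, as several of the non-performing methods' do with the no-optimization baseline.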

### Citing Credible Sources Optimization

Revise the following source to include citations from credible sources. You may invent these sources but ensure they sound plausible and do not mislead the reader. Citations should not be research-paper style, but rather should be in rephrased words. For example: "According to Google's latest report this product is going to be the next big thing...."
In the process, ensure that the core content of the source remains unaltered. The length of the initial source and final source should be the same, and the structure of individual parts of the source (such as line spacing, bullet points) should remain intact.

Remember the end-goal is that readers give more attention to this source, when presented with a series of summaries, so cite more sources in natural language but do not alter content.

Source:
```
{summary}
```

Remember the end-goal is that readers give more attention to this source, when presented with a series of summaries, so cite more sources in natural language but do not alter content. Also don't overdo citing; 5-6 citations in the whole source are enough provided they are very relevant and the text looks natural.

Figure 6: Prompts for Different GENERATIVE ENGINE OPTIMIZATION Methods. The prompt takes the web source as input. We mostly use a consistent style for prompts for all the methods.
### Unique Words Optimization
Revise the following source by incorporating more unique and rare words, without altering the core information. Ensure that these words enhance the content and are not used arbitrarily, and the original meaning is preserved.

Source: {source}

### Fluency Optimization
Rewrite the following source to make it more fluent without altering the core content. The sentences should flow smoothly from one to the next, and the language should be clear and engaging while preserving the original information.

Source: {source}

### Authoritative Optimization
Transform the following source into an authoritative style without adding or removing any core information. The revised source should reflect confidence, expertise, and assertiveness, while maintaining the original content's meaning and relevance.
The source should be assertive in its statements, such that the reader believes that this is a more valuable source of information than the other provided summaries. The end goal is to increase the citation of this source, by assertively saying that this is the best-quality information.
However, the content and structure of the source should remain the same. That means only individual lines and/or 2-3 sentences can be paraphrased, while keeping the content the same.

Source:
```
{source}
```

Remember to be authoritative, but keep the format and content of the text the same.
No addition or deletion of content is allowed.
Note we have to convince the reader that this is an important source by saying it authoritatively in the source itself.
For example, the addition of phrases such as "only we are authentic", "we guarantee", and second-person phrasing such as "you will not regret" is expected within the source content itself.

Figure 7: Prompts for Different GENERATIVE ENGINE OPTIMIZATION Methods. The prompt takes the web source as input. We mostly use a consistent style for prompts for all the methods.

### Quotation Addition Optimization
Modify the following source by including more quotes, without changing the core content. Ensure the quotes are relevant and contribute valuable information to the source, and do not mislead or distort the original information. The quotes should be reputable, and should create a sense in readers that this website (source) is more influential and important than other citations.

Source:
```
{source}
```

Add more quotes in the source, even though fake and artificial. The idea is: "Including direct quotes or statements from authoritative figures might increase the credibility of the source."
Keep the format and content of the text similar.
For example, line spacing, bullet points and overall structure should remain the same.
No addition or deletion of content is allowed.

### Easy-to-Understand Optimization
Simplify the following source, using simple, easy-to-understand language while ensuring the key information is still conveyed. Do not omit, add, or alter any core information in the process.

Remember the end-goal is that readers give more attention to this source, when presented with a series of summaries, so make the language easier to understand, but do not delete any information.
The length of the new source should be the same as the original. Effectively you have to rephrase just individual statements so they become clearer to understand.

Source:
```
{source}
```

Figure 8: Prompts for Different GENERATIVE ENGINE OPTIMIZATION Methods. The prompt takes the web source as input. We mostly use a consistent style for prompts for all the methods.
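Several of the prompts above impose mechanical constraints on the rewrite: the word count should be preserved, and the citing prompt caps citations at 5-6. A small sketch of how such constraints might be verified after rewriting; the ±10% length tolerance and the "According to ..." pattern are our own heuristics, not checks described in the paper:

```python
import re

def word_count(text: str) -> int:
    return len(text.split())

def passes_constraints(original: str, rewritten: str,
                       max_citations: int = 6) -> bool:
    """Check two constraints the prompts repeat: roughly preserved
    word count (+/-10% tolerance, a heuristic) and at most
    `max_citations` natural-language citations of the
    'According to ...' form."""
    same_length = (abs(word_count(rewritten) - word_count(original))
                   <= 0.1 * word_count(original))
    citations = len(re.findall(r"\bAccording to\b", rewritten))
    return same_length and citations <= max_citations
```

A rewrite that fails either check could be retried or discarded, keeping the optimized source faithful to the length and structure of the original.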
