
The Infinite Index:
Information Retrieval on Generative Text-To-Image Models

Niklas Deckers, Leipzig University and ScaDS.AI
Maik Fröbe, Friedrich-Schiller-Universität Jena
Johannes Kiesel, Bauhaus-Universität Weimar
Gianluca Pandolfo, Bauhaus-Universität Weimar
Christopher Schröder, Leipzig University
Benno Stein, Bauhaus-Universität Weimar
Martin Potthast, Leipzig University and ScaDS.AI

ABSTRACT
Conditional generative models such as DALL-E and Stable Diffusion generate images based on a user-defined text, the prompt. Finding and refining prompts that produce a desired image has become the art of prompt engineering. Generative models do not provide a built-in retrieval model for a user's information need expressed through prompts. In light of an extensive literature review, we reframe prompt engineering for generative models as interactive text-based retrieval on a novel kind of "infinite index". We apply these insights for the first time in a case study on image generation for game design with an expert. Finally, we envision how active learning may help to guide the retrieval of generated images.

CCS CONCEPTS
• Information systems → Search engine indexing; Users and interactive retrieval; Image search; Novelty in information retrieval; Search engine architectures and scalability.

KEYWORDS
case study, evaluation, generative models, image retrieval

ACM Reference Format:
Niklas Deckers, Maik Fröbe, Johannes Kiesel, Gianluca Pandolfo, Christopher Schröder, Benno Stein, and Martin Potthast. 2023. The Infinite Index: Information Retrieval on Generative Text-To-Image Models. In ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '23), March 19–23, 2023, Austin, TX, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3576840.3578327

1 INTRODUCTION
Conditional generative models allow the generation of a desired output based on a user-specified condition. For generative text-to-image models such as DALL-E [68] or Stable Diffusion [72], this means that the model generates images conditional on a text description known as a prompt. For a user, the prompt is the primary means of controlling the generated image. If an ad hoc prompt does not produce a satisfactory result, the user usually interacts with the model by adjusting the prompt until they get one, or they give up after a few tries. Since such systematic refinement of prompts is often necessary to achieve a satisfactory result, writing prompts has evolved into the art of prompt engineering [53, 63, 71], for which users exchange best practices in new communities. But even using examples from others, it is often not obvious how to change a prompt to steer image generation in a particular direction.

As a new perspective on the use of conditional generative models in general, we interpret them as a search engine index. Under this interpretation, the prompt is a request that represents a user's need for information. Prompt engineering can then be considered a form of interactive text-based retrieval, in which a user interacts with the model by modifying their prompt as if to refine their query to find a result that meets their needs. This raises a number of new challenges: When using a generative model, the initiative currently lies solely with the user, without support from the model as a "retrieval system". There is no intermediary retrieval model to help users produce satisfactory images fast(er), if not ad hoc. The manual refinement of prompts is not supported by system-side log analysis and query expansion. There is no operationalization of the concept of image relevance, which is needed for ranking images and is thus essential when many images are generated.

A striking difference from traditional retrieval is that when generative models are used as an index, new results are generated rather than existing ones retrieved.1 A non-empty result is returned for every conceivable query prompt. This includes query prompts for which a traditional retrieval system would return no results. Also, the number of different results that can be generated per query prompt is not conceptually limited, but only by the available computational capacity for model inference. Thus, a generative model is effectively an "infinite index".

1 Generative models occasionally reproduce parts of their training data [46, 84].


Our contribution is to explore this perspective on generative models as indexes in four ways, focusing on text-to-image generation: (1) Section 2 presents a literature survey on image generation, text-based image retrieval, retrieval for creative tasks, and interactive retrieval. (2) Section 3 conceptualizes generative text-to-image models as an index integrated into a retrieval system: from the user perspective, the query language and interaction methods are presented, and from the system perspective, retrieval technologies capable of supporting retrieval are examined. Requirements for the evaluation of retrieval systems based on generative models are also presented. (3) Based on these findings, Section 4 presents a case study of image generation. For creative tasks in game design, we observe an expert and highlight several issues related to currently available technology. (4) Finally, based on the insights gained, Section 5 discusses an active learning approach to interactive retrieval to guide image generation using generative models.

2 BACKGROUND AND SURVEY
We review the relevant literature to place retrieval on generative models in the context of established concepts.

2.1 Image Generation
In image synthesis, Brock et al. [10] and Goodfellow et al. [28] have achieved promising results with generative adversarial networks (GANs) that allow images to be generated from the distribution of given training images. Autoregressive transformer models as per Razavi et al. [69] and Ramesh et al. [68] have proven to be effective for high-resolution image synthesis. Dhariwal and Nichol [21] have recently shown that diffusion models [83] are capable of outperforming traditional models such as GANs in image synthesis. In addition, Rombach et al. [72] have shown how to condition the generated images on text. This forms the basis for text-to-image models, which are often trained on datasets of text–image pairs [81].

Table 1 provides an overview of relevant text-to-image models, starting with diffusion models such as DALL·E by Ramesh et al. [68] and Imagen by Saharia et al. [74]. Most models are only accessible via a web interface; their code and model weights are not publicly available. Stable Diffusion by Rombach et al. [72] achieved great impact not only because of its impressive results, but also because the model itself was made publicly available. As a result, it was rapidly adapted and now serves as the basis for numerous new applications. More recent approaches pursue other research goals: eDiff-I by Balaji et al. [4] introduces an ensemble of expert denoising networks that allow different behavior at different noise levels. This increases the number of parameters, but also improves the results. Muse by Chang et al. [13] uses a discrete token space instead of a pixel space to increase efficiency.

2.2 Image Retrieval
While text-to-image models are relatively new, image retrieval has a long history of research. Two cases are distinguished in the literature: in content-based image retrieval, the user enters an image as a query, while in text-based image retrieval, the user makes a textual query. Content-based image retrieval systems aim to bridge the gap between the semantic meaning of images and their quantified visual features through sophisticated image representations [50]. Once a collection of images is represented and indexed, the representation of the query image is used for similarity-based search and ranking. Text-based image retrieval has in the past often focused on retrieval based on image metadata and tags, which is why it is sometimes referred to as annotation-based, concept-based, or keyword-based image retrieval. Some approaches also generate textual representations for unannotated images, e.g., using optical character recognition [90], clustering images with and without annotations [52], or using image captioning methods [33].

Some studies have examined users' search interactions with a text-based image retrieval system. Choi [16] analyzed the search logs of 29 students and found that participants changed their textual queries more frequently to refine their results. Hollink et al. [31] studied the image search behavior of news professionals and showed that they often modified their queries by following semantic relationships of query terms, e.g., searching first for images about a person and then for images about their spouse.

Cho et al. [15] took a closer look at why people search for images. In their study of 69 papers, they identified seven information need categories: (1) entertainment, (2) illustrations (explanation or clarification of details, e.g., creating presentation slides or preparing study material), (3) images for aesthetic appreciation (e.g., for desktop backgrounds), (4) knowledge construction (four sub-categories: information processing, information dissemination, learning, and ideation), (5) eye-catchers (e.g., to grab audiences' attention), (6) inspiring images, and (7) images for social interactions (e.g., images to trigger emotions). They also found seven categories of problems that could affect a user's ability to find the images they were looking for: (a) semantic issues, i.e., related to employed terminology, (b) content-based issues, i.e., related to describing the content of images, (c) technical limitations of retrieval systems, (d) lacking aboutness or relevance of retrieved images, (e) lacking inclusivity with regard to cultural or linguistic aspects of the user, (f) lacking skills in handling search technology, and (g) cognitive overload. As we discuss in Section 3, most of these requirements and issues are also relevant to retrieval from text-to-image models.

2.3 User Feedback for Image Generation
Based on GANs, Ukkonen et al. [89] have proposed and implemented systems for relevance feedback, and Liu et al. [55] for exploratory search. This was to overcome the lack of prompts in GANs to condition image generation, leaving users with little control over the generated images. Similar techniques to incorporate relevance feedback could be considered for text-to-image models.

2.4 Retrieval for Creative Tasks
Text-to-image models are particularly suited to artistic and creative applications, raising the question of whether there are parallels between such applications and the literature on creative task search. Interestingly, text-to-image models have quickly led to the formation of communities dedicated not only to the use of these tools, but also to prompt engineering and the sharing of successful image generation techniques.2 This development is consistent with the formation of creative communities by artists in other art genres [29]. On the other hand, such strong community building is somewhat surprising, since artisans generally rely less on human sources [47].

2 The Midjourney Discord server has more than 8 million members (as of January 2023).


Table 1: Overview of the most relevant text-to-image models (∗ replicated; † includes the text encoder). Open source columns: Code / Data / Model.

Name              Parameters  Training data (size; source)                    Code/Data/Model  Reference            Month/Year
DALL·E            12 B        n/a; custom web crawl                           ∗ / – / ∗        Ramesh et al. [68]   01/2021
DALL·E 2          3.5 B       n/a; custom web crawl, licensed sources         ∗ / – / ∗        Ramesh et al. [67]   04/2022
Imagen            4.6 B       860 M; 400 M [81] from Common Crawl             ∗ / – / ∗        Saharia et al. [74]  05/2022
Midjourney        n/a         n/a; n/a                                        – / – / –        Salkowitz [77]       07/2022
Stable Diffusion  0.9 B       400 M; Common Crawl, cf. Schuhmann et al. [81]  ✓ / ✓ / ✓        Rombach et al. [72]  08/2022
eDiff-I           9.1 B†      n/a; n/a                                        – / – / –        Balaji et al. [4]    11/2022
Muse              3 B         460 M; n/a, cf. Saharia et al. [74]             – / – / –        Chang et al. [13]    01/2023

Several studies have already specifically analyzed user behavior and goals in creative tasks. Chavula et al. [14] investigated the information behavior of 15 graduate students in creative web search tasks using questionnaires and the think-aloud method. They identified four creative thinking processes that participants switched back and forth between: planning creative search tasks (i.e., deciding on a vague idea), searching for new ideas, synthesizing search results, and organizing ideas. Palani et al. [64] use log analyses and self-reports in a study of 34 design students. They observed three main goals of the students: to get an overview of the information space, to discover design patterns and criteria, and to get inspired and develop ideas. In the study, special attention was paid to the fact that participants initially had difficulty finding appropriate terms to describe their information needs, but then arrived at appropriate terms by quickly querying and reformulating queries. They also note that participants typically go through a divergent exploration phase before a convergent synthesis phase. Based on a previous online survey and study [103, 104], Li et al. [51] examine the information behavior of 11 university students on self-selected creative tasks in a diary study. They use Sawyer's eight-step creativity framework [78] and focus specifically on the use of information resources (search, images, Q&A, social sites, videos). They grouped the uses into five categories: searching for specific information, supporting creative processes, learning definitional domains, learning procedural knowledge, and managing (organizing) found information. Especially for images, they distinguish specific uses (e.g., on Pinterest, Instagram, Tumblr, Flickr, and image search): supporting ideation and other creative processes, seeing finished examples, finding out what one likes or dislikes, and managing and getting an overview of found information. They found that image search engines were primarily used to search for a wide range of images, while image sites like Pinterest and Instagram were often used to search for high-quality images by specific artists or professionals. In summary, we identify three common topics when searching for creative tasks: searching to learn, to get inspired, and to get an overview. We also observe these behaviors in our case study (Section 4).

2.5 Interactive Retrieval
Interactive retrieval explores users' information behavior during and beyond search, as well as the development of new interaction methods to assist them [73]. In relation to our work, we review relevant research on query understanding based on query logs as a source of user interaction data.

Query Log Analysis. Joachims and Radlinski [39] introduced query log analysis for web search, which has since become a valuable tool, e.g., for improving retrieval effectiveness and studying user behavior [11, 35–37]. Broder [11], for example, established a taxonomy for web search queries, showing that web search queries divide into informational, navigational, and transactional queries, which is still the case today [1]. A further categorization derives from Jansen et al.'s [35] work on query reformulation: queries are either generalizations (subset of words), specializations (superset of words), synonyms, or other topics. Today, query logs are used for creating large training datasets for transformer-based retrieval models [60, 70] and remain an important asset.

Query Reformulation. Query reformulation approaches aim to improve the effectiveness of retrieval by replacing the original query with substituted or extended reformulations [20]. Here, the reformulation of a query can be either precision-oriented (when a term is replaced by a more specific one) or recall-oriented (when the query is expanded). Jansen et al. [37] show that searchers do not start with perfect queries but reformulate them instead: more than 50% of searchers reformulate at least one query during a search. Approaches to automatic query expansion, such as RM3 [34], can use (pseudo) relevance feedback to add new (weighted) terms to the original query, thus addressing the vocabulary mismatch problem that occurs in text retrieval (see the sketch at the end of this section). However, it is not yet clear which reformulations are helpful in which situations when working creatively with generative text-to-image models (i.e., precision-oriented or recall-oriented reformulations).

Query Suggestion. Search engines assist their users by offering a list of suggested queries for an input query [7], which is called query auto-completion [12] if the query is incomplete. Query suggestions are important; according to Feuer et al. [22], 30% of queries in a commercial query log are suggested to users beforehand. Likewise, Cucerzan and Brill [19] note that spelling corrections are required for 10–15% of queries with spelling errors. In addition, query suggestions often aim to assist users by displaying related terms [32], where Jansen et al.'s [36] analysis shows that suggested related terms are also heavily used. However, it is important not to overwhelm users, and to rather show few alternative suggestions than many [96]. Overall, users value the interaction methods used in "traditional" search engines, and we believe that offering similar ones in retrieval interfaces built on generative text-to-image models will provide benefits to users with creative tasks.

Figure 1: Overview of indexing approaches in information retrieval. The top row shows the classic term-to-identifier indexing approach (inverted index), the middle row the recent query-to-identifier indexing approach (neural index), and the bottom row the new query-to-document indexing approach introduced in this paper: the "infinite index", a generative text-to-document model that returns generated documents for a query.

3 TEXT-TO-IMAGE GENERATION AS SEARCH
Considering a text-to-image model as a virtually infinite index, a prompt as a query, and prompt engineering as a form of user-driven query refinement yields a rudimentary retrieval system (Section 3.1). In the following, the interaction methods (Section 3.2) that are (potentially) available to users and the retrieval technologies (Section 3.3) that are (potentially) applicable to such a retrieval system are examined in detail. Subsequently, requirements for the evaluation of such a system are formulated (Section 3.4).

3.1 Classification of the "Infinite Index" in IR
Figure 1 shows how we place the concept of an infinite index in the context of known information retrieval concepts. The basic and most widely used concept of an (inverted) index was defined by Anderson [2] as "a systematic guide designed to indicate topics or features of documents or parts of documents." The topics or features of documents are represented by (index) terms. In modern information retrieval, these index terms correspond to the vocabulary of an indexed document collection. Anderson [2] further explains that "[t]he function of an index is to provide users with an effective and systematic means for locating documentary units (complete documents or parts of documents) that are relevant to information needs or requests." Specifically, the documents that can be looked up in an index are stored elsewhere, with an index lookup providing the necessary information that identifies the storage location of the matching documents within the filing system.

This concept of indexing, invented long before the days of computers, is still used today in the form of data structures that fulfill the definition and function of an index in the above sense. Most importantly, the inverted index data structure implements a mapping of index terms to so-called postlists, where each postlist is a list of "postings" containing, among other things, a document identifier for locating the document within a file system or document store. Recently, index data structures have been revisited in the context of research on neural information retrieval [58, 87]: the neural index (the authors call it "transformer-based generative indexing") [6, 86, 95] has been proposed as a new type of index that mimics the function of a classical index by mapping queries directly to document identifiers. This mapping is trained based on a given document collection. Using an approach to predict queries that users might make to retrieve a given document, such as Doc2Query [61], it is straightforward to generate training examples consisting of triples of query, document, and the document's identifier, or even just tuples of identifiers and synthetic queries [105]. The goal of the model is to predict the identifiers of the relevant documents given a query.

In this paper, we propose a different way of indexing by using generative text-to-document models as indexes. Although we focus on images as documents, this type of indexing is in principle applicable to all types of documents. In this scenario, the "index" is trained using documents and texts describing the documents as training examples. Unlike the indexing approaches mentioned above, the resulting model does not necessarily retrieve the documents that were part of the document collection used to train the generative model, but rather generates new documents. Thus, this indexing approach is different from the other two, while it can be considered a kind of independent neural indexing approach.

Altogether, we classify the three indexing approaches as follows:
• Term-to-identifier indexing: building a lookup table that maps index terms to document identifiers.
• Query-to-identifier indexing: training a model to predict identifiers of relevant documents for a query.
• Query-to-document indexing: training a model to generate relevant documents for a query.
As technical names for these indexes, "generative index" or "neural index" are suitable, or, more precisely, "query-to-identifier index" and "query-to-document index", respectively.
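To make the distinction concrete, the following sketch contrasts the three approaches as minimal Python interfaces. The model objects and their predict_identifiers and generate methods are hypothetical placeholders, not the API of any particular system.

class TermToIdentifierIndex:
    """Inverted index: maps index terms to postlists of document identifiers."""
    def __init__(self, collection):                    # collection: {doc_id: text}
        self.postlists = {}
        for doc_id, text in collection.items():
            for term in set(text.lower().split()):
                self.postlists.setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        return self.postlists.get(term, set())         # identifiers, not documents


class QueryToIdentifierIndex:
    """Neural index: a trained model maps a query directly to identifiers."""
    def __init__(self, model):                         # e.g., trained on (query, doc_id) pairs
        self.model = model

    def lookup(self, query):
        return self.model.predict_identifiers(query)   # hypothetical call


class QueryToDocumentIndex:
    """Infinite index: a generative model produces new documents for a query."""
    def __init__(self, generator):                     # e.g., a text-to-image model
        self.generator = generator

    def lookup(self, query, n=4):
        # Every query yields results; the result set is bounded only by compute.
        return [self.generator.generate(query) for _ in range(n)]  # hypothetical call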


3.2 Interaction Methods
Although the characteristic way of interacting with generative text-to-image models is the text prompt, other features have been rapidly added to the interfaces to support the process of image generation. To illustrate the possibilities, we give here a brief snapshot of interaction methods based on the most common models (as of October 2022). Related generative text-to-video or text-to-3D models are not considered [30, 65].

Prompting. For generative text-to-image models, prompting the model is the primary interaction method. It serves as the initial point of contact with the model during image generation, much like a query in a standard web search. The interaction method is identical for both: the user sends a short text and receives images in response. Some interfaces of generative models allow images to be included in the prompt to steer the generated images in a particular direction, much in the same way that content-based image retrieval is used to find similar images. Unlike content-based image retrieval, model interfaces typically require that the prompt also contains text. Another aspect of prompting in some interfaces is the specification of model parameters along with the prompt, e.g., the size of the image to be generated or whether to generate tiled images, which is similar to filters (e.g., by size) in regular web search. Moreover, the negation operator (e.g., Midjourney's "--no" parameter) allows excluding certain terms from the generated image. The widely used Stable Diffusion model provides only a command line interface, but the community has implemented several graphical interfaces for it, for example one maintained by AUTOMATIC1111 (cf. Figure 2a).3

In addition, several services have emerged in the larger text-to-image generation ecosystem to assist users with prompt engineering. Specialized search engines allow users to search for images created with generative text-to-image models. The search engines then reveal the prompts used to generate the images they find, allowing prompts to be reused. Images are indexed either by their prompt or by the image content (e.g., with CLIP [66]). Examples of such search engines include the "community feed" of the Midjourney web app or the independent search engine Lexica, which indexes images from the Stable Diffusion Discord server (cf. Figure 2b). According to the developer, 1.4 million queries were made in a week, the index contained 12 million images in September 2022, and 5 million USD was earned, which clearly indicates the need for such systems. Other services enable (social) prompt engineering in a click interface4 or even sell prompts that supposedly provide consistent results.5 Other projects carefully analyze how the prompt affects the result and create extensive lists of examples.6 Although these services have a similar goal as query suggestions in web search, namely to help with prompt engineering, their interaction pattern is different. We discuss the implications in Section 5.

Variations. When generating an image, the variations interaction method allows changing parts of the image composition. This is useful when a generated image is broadly satisfactory but needs improvement in certain aspects. We distinguish three ways of generating variations: (1) the user does not change the prompt, which causes the composition to change only slightly and randomly (cf. Figure 2d); (2) the user changes the prompt and gives the model a new target as it continues from a generation checkpoint of the original image; (3) the user specifies semantic processing of the image, changing elements of the original image while preserving its original characteristics [40]. This interaction method, especially in the case of (1), is similar to the "show similar results" button in regular image retrieval. However, (2) and especially (3) allow a clearer specification of the need.

In- and Out-Painting. When generating an image, in- and out-painting allow limiting the generation of variations to user-defined areas of the image. This is useful when the user wants to change a certain area of the generated image (in-painting; cf. Figure 2c) or expand an image (out-painting), where the model tries to fill the region to match both the prompt and the parts of the original image at the edge of the region. This interaction method goes beyond the capabilities of regular search interfaces, and in most cases one would expect finite indexes to contain no matching results. For an infinite index, this interaction method can be extremely useful for finding images that satisfy multiple requirements.

Quality Enhancements. If the user is satisfied with the composition of an image, quality enhancement allows improving the image quality in one or more ways without changing the composition. The most common way to improve quality is to upscale the image to a higher resolution. There are often various upscaling methods that create new versions from a source image that look sharp or soft, realistic or artistic, without losing the original composition. Choosing a specific upscaling algorithm is useful to generate different images that should look similar in terms of their composition. Another type of enhancement is the use of image-to-image models trained specifically for correcting faces [94]. We anticipate that other image-to-image models specializing in specific operations will be integrated in the future. As with the variations tool, the closest counterpart to this method in regular search is the "show similar results" function, which can be quite effective for finding higher resolution images. However, quality enhancements allow a much clearer specification of what is needed by comparison.

Image-to-Text. If the user wants to rephrase the prompt but also use parts of the generated images, image-to-text models can be used to obtain a textual description of the image that reads like a prompt. We are not yet aware of any regular image search engine that integrates image-to-text models on the user side, although we believe that major image search engines such as Google Images will use them to index images.

3.3 Relevance in Text-To-Image Generation
As the above overview of interaction methods shows, the text-to-image generation community develops support for a variety of common search problems, but has so far used information retrieval concepts only as search facets supported by external tools. This section reviews relevance as a core information retrieval concept that needs to be operationalized to steer the generation.

As with regular image retrieval, also for generative models the concept of result relevance depends on the information needs of users, of which seven different categories have been identified in the literature (cf. Section 2). Generating images rather than finding them can, at least in theory, satisfy most of these needs, and is particularly useful for the needs of entertainment, illustration, aesthetic appreciation, engaging others, inspiration, and social interaction.

3 https://github.com/AUTOMATIC1111/stable-diffusion-webui
4 E.g., https://phraser.tech
5 https://promptbase.com
6 E.g., https://github.com/willwulfken/MidJourney-Styles-and-Keywords-Reference


Figure 2: Screenshots illustrating the interfaces and interaction methods discussed in Section 3.2: (a) prompting in a community-maintained Stable Diffusion web interface; (b) Lexica search engine for generated images along with their prompts; (c) in-painting in DALL·E 2 on an image originally created for the "Wizard with staff" prompt: the staff was manually masked (shown in white) to produce a modified prompt; (d) upscaling and variation generation in Midjourney.


In social interactions, for example, it is very useful for generative models to take into account the general moods mentioned in the prompt, providing a clear path to generating images that evoke specific emotions. The one information need category for which image generation is unsuitable is the need for knowledge construction, since generated images are not tied to real-world knowledge.

When generating images, a distinction must be made between two different intentions. First, the user may already have a clear idea of the target image, for example, for an illustration. A user with this intention iteratively refines their prompt until the system generates an image that approximates their idea, which we call a descriptive approach. Second, they may not have a clear vision or goal, just a set of constraints. With this intent, the user iteratively refines their prompt in a feedback loop with random elements introduced by the system, loosely steering the system toward an image that they like and that meets the constraints, which we call the creative approach. Although the two approaches are very different from the user's point of view, they are more or less indistinguishable for the system in terms of query log analysis: a general prompt is extended with details to become more specific.

With respect to retrieval based on text-to-image models, research in interactive information retrieval is highly related (cf. Section 2). Query log analysis will be important to identify keywords in prompts that generally produce satisfactory results, to model user intent at a finer level, and to identify search queries and early abandonments that may indicate problems in the model. We assume that query suggestion methods will be very helpful, especially to assist inexperienced users. However, automatic query reformulation for prompts is more challenging because such changes generally have a more unpredictable impact on the generated images. In our case study (Section 4), the creative professional therefore refrained from optimizing the prompt and instead tried completely new ones. We see here a clear lack of user support in terms of retrieval in the current interfaces. External tools such as prompt search engines attempt to compensate for this shortcoming, but cannot match the effectiveness of the integrated solutions that are widely used in search engines today (see Section 5 for a discussion of possible remedies).

With these considerations in mind, the notion of relevance, and thus retrieval methods such as query suggestions, can be transferred from information retrieval to text-to-image generation, and retrieval evaluation measures can be adopted as well.

3.4 Evaluating Retrieval on Text-To-Image Models
Framing text-to-image generation as a retrieval problem implies measuring the effectiveness of rankings of generated images according to standard experimentation practices in information retrieval. However, we show that the infinite index in the form of a text-to-image model has far-reaching consequences for the design and evaluation of experiments, since the set of relevant documents is not closed and thus cannot form the basis for calculating recall. We also discuss the challenges this poses for creating reusable benchmark collections and speculate on approaches to overcome these challenges. We focus on measuring ranking effectiveness because other aspects, such as user interface design and layout, are not considered in Cranfield-style evaluations.

Impact of the Infinite Index on IR Evaluation Measures. Effectiveness measures can be divided into utility-oriented (based on a ranking only) and recall-oriented (normalized by a "best possible" ranking) evaluation measures [56], so that an appropriate measure can be used depending on the nature of the information need. However, the virtually unending stream of alternative images that can be generated leads to problems with recall-oriented evaluation measures. Since an infinite number of images can be generated, the subset of highly relevant images can also be infinite. For recall-oriented measures like nDCG [38], this means that their normalization term defaults to an ideal ranking that is completely filled with highly relevant images (with the common gain formulation, iDCG@k = sum_{i=1..k} (2^r_max - 1) / log2(i + 1) for maximum relevance grade r_max). In practice, a human will still only inspect a ranking up to a certain rank k, so an nDCG@k can still be computed in this way, since a specific retrieval model requesting a text-to-image model may still deviate more or less from actually providing only highly relevant images. Utility-based measures (such as Precision@k, MRR, RBP [59], etc.) are not affected by this problem because they measure the effectiveness of a ranking based only on the images available in the ranking.

Another problem is that an infinite number of near-duplicate images of high relevance can be generated. Retrieval models could therefore rank many or even exclusively (near-)duplicate images highly. If evaluated in isolation, each one would be considered highly relevant. Evaluation measures that operate on rankings with (near-)duplicates overestimate their effectiveness [5, 24], and learning-to-rank approaches learn suboptimal ranking models as well when trained on redundant data [23]. Therefore, it is important to deduplicate the rankings before evaluation. For the development of retrieval models, this means that ensuring diversity of images in the top ranks can be instrumental for users.

Overall, utility-based measures (such as RBP) on deduplicated rankings with judgments for the top-k images allow theoretically grounded evaluations when using text-to-image models as an index.

Evaluations with Active Judgment Rounds. Experimental evaluation of retrieval systems usually follows the Cranfield paradigm [17, 18], which assumes that all documents are judged for all information needs. The original Cranfield experiments [17, 18] were conducted on a collection of 1,400 documents and complete relevance judgments for 225 topics. However, complete judgments became impracticable almost immediately thereafter as the size of collections increased significantly. The current best practice for shared tasks in IR is to create pools of the top-ranked documents from the submitted systems for each topic and then judge each topic's pool [92], assuming that unjudged documents are not relevant. However, the assumption that judgment pools are "essentially complete" is likely incorrect when text-to-image models are used as an index, especially if query expansion approaches are involved. As a result, rigorous evaluations must include manual rounds of judgments of unjudged images to reestablish "completeness" (e.g., for the top-k results), at least for utility-oriented measures, which hinders fully automated evaluations.
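To make the evaluation recipe above concrete, the following minimal sketch deduplicates a ranking of generated images via perceptual hashing before computing RBP. It assumes the third-party imagehash and Pillow packages, a list of image file paths as the ranking, and judgments given as a dict mapping paths to graded relevance in [0, 1]; it is an illustration, not the authors' implementation.

import imagehash                 # third-party: pip install imagehash pillow
from PIL import Image

def deduplicate(ranking, max_hamming=8):
    """Keep only images whose perceptual hash differs sufficiently from all kept ones."""
    kept, seen_hashes = [], []
    for path in ranking:
        h = imagehash.phash(Image.open(path))
        if all(h - seen > max_hamming for seen in seen_hashes):  # '-' is Hamming distance
            kept.append(path)
            seen_hashes.append(h)
    return kept

def rbp(ranking, judgments, p=0.8, k=10):
    """Rank-biased precision with persistence p; unjudged images count as non-relevant."""
    return (1 - p) * sum(judgments.get(path, 0.0) * p ** i
                         for i, path in enumerate(ranking[:k]))

# Usage: score = rbp(deduplicate(generated_ranking), judgments)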


Evaluations without Active Judgment Rounds. IR research has benefited largely from the availability of robust and reusable test collections created during shared tasks [91]. However, these collections are robust only if most of the unjudged documents are irrelevant, which is not the case for text-to-image models. Consequently, creating robust and reusable test collections is a major challenge that requires experience from several different shared tasks and subsequent post hoc experimentation (e.g., some robustness checks for traditional test collections are not performed until years after their creation [93]). Therefore, any post hoc experiments based on an infinite index would need to include appropriate handling of unjudged images. Traditionally, unjudged documents are either simply removed (where a system's result lists are condensed to the included judged documents in their relative order) [75], classified as not relevant (default setting) or highly relevant (lower/upper bound) [56], or their relevance labels can be predicted [3]. While these approaches are well studied for conventional retrieval experiments (e.g., condensed lists often overestimate the effectiveness values [76], and the gap between lower and upper bounds can be very large [56]), it is not yet clear whether they are suitable for an infinite index. As a result, it is not yet clear how to construct robust and reusable test collections, but we speculate that techniques from machine translation evaluation (e.g., measuring the similarity of an unjudged image to judged reference images via phash [100]) or relevance prediction may be appropriate.

3.5 The First Index for The Library of Babel?
At the beginning of the 20th century, Kurd Laßwitz, a German writer, scientist, and philosopher who became the first German science fiction author, introduced "The Universal Library" [45] as part of a series of short stories published in a newspaper around that time. The Universal Library contains every conceivable book with a length of 1 million characters. Assuming an alphabet of 100 Latin letters, numerals, and punctuation marks, combining these characters in books of 1 million characters yields 100^1,000,000 = 10^2,000,000 books, virtually everything that can be written in every language (assuming an appropriate transliteration). The only problem with such a library is that it is extremely unlikely to find a book by chance that contains a plausible sentence. This idea was taken up by Jorge Luis Borges, a well-known Argentine author, and made widely known under the name "The Library of Babel" [9]. He imagines this library as a universe of its own and invents stories about various tribes of humanity that might develop in such a place, always looking for scraps of knowledge among the many books of incomprehensible gibberish. In an earlier work called "The Total Library" [8], Borges traces the history of this concept back to Laßwitz and even to Aristotle and Cicero, who formulated what is now known as the "Infinite Monkey Theorem" [98]: a monkey hitting a typewriter at random will eventually type every text, including the complete works of William Shakespeare.

Given this fictional concept, generative text-to-document models can be understood as an index and a search engine for the Library of Babel: by entering a short phrase as a query, the model is prompted to search the library for a document that matches the query. This completely circumvents the problem outlined by Laßwitz and Borges, since a document returned by a generative text-to-document model is very likely to be related to the query, and as long as the query itself is not gibberish, the retrieved documents will not be gibberish either.

4 CASE STUDY: GAME ARTWORK SEARCH
To illustrate retrieval using a query-to-document index for images, we report on an observational case study in which a text-to-image model is used for a creative task. First, we describe the study setup and the exemplary creative task, generating graphics for an online card game (Section 4.1). Subsequently, the main observations of the study are summarized (Section 4.2). A full report on the study is available as supplementary material.7

4.1 Setup of the Case Study
For the case study, we recruited a creative professional through personal contacts who allowed us to observe him as he explored the use of generative text-to-image models in his creative process. The professional described himself as a game designer and developer with the experience of five major game releases, and as a lecturer in game development at a university. Prior to the case study, he described himself as very intrigued by generative text-to-image models he had come across in his Twitter feed, and he had also seen some online videos on this technology ("2 minute papers"). Moreover, he had already generated about 50 images with DALL·E 2, about 20 with Midjourney, and fewer than 10 with Stable Diffusion on his own hardware, but none of them as part of a project. He anticipated, however, that generative text-to-image models will become very useful for the video game industry.8

Based on his experience, the professional decided to investigate the use of generative text-to-image models in the creation of graphics for an online card game for the study. Specifically, he was interested in developing a "deck-building online card game like Magic the Gathering set in a fantasy universe." In this game, each playing card has its own artwork that visually links it to the fantasy universe. Moreover, the cards belong to different "factions" that must be visually distinguishable. The professional opted for a "concept art-like style" from the outset. In the five hours we provided for the study, the professional expected to first create a "mood board" of images to capture the artistic style of the desired artwork [48], and then create the artwork itself for some cards. Based on his own testing, he decided to use Midjourney for this task. This choice reflects Midjourney's concept, which emphasizes "painterly aesthetics" and aims to help creatives "converge on the idea they want much more quickly" [77], especially at the beginning of a project.

The case study was conducted using the think-aloud method, asking additional questions while the professional waited for the images to be generated. Since the study did not focus on search interface design, one of the authors used Midjourney extensively to prepare for the study and provide technical support to the professional. To record observations, we took extensive notes as well as video and audio recordings, and used the logging capabilities of the Midjourney web app. Following Li et al. [51], we used forms to structure our notes for various events, in our case for queries, problems, and shifts in design goals. A report on the study with all generated images is available as supplementary material.

7 Case study report: https://doi.org/10.5281/zenodo.7221434
8 Video games account for about 57% of digital media market revenue in 2022, or US$197 billion [85]. Meanwhile, other game developers have also published reports on their experiments with text-to-image models, e.g., https://www.traffickinggame.com/ai-assisted-graphics/


Initial prompt: an ancient golden dagger lying on moss, illuminated by godrays, close up, digital painting, matte painting, midjourney, concept art, detailed art, scifiart cinematic painting, magic the Gathering, volumetric light, masterpiece, volumetric realistic render, epic scene, 8k, post-production detailed art, scifiart cinematic painting --q 2

Reformulated prompt: a medieval dagger lying on moss, lit by god rays, art by Adrian Smith + Paul bonner, magic gathering style, warcraft, blizzard style, hearthstone, fantasy concept art, medieval, masterpiece, mystical, witchcraft

Figure 3: Exemplified search for a generated image from the case study, consisting of 14 steps in 22 minutes. Gray prompt text is copied from the prompt of another image in the mood board. For the reformulated prompt—after the first series of images had been abandoned as "leading nowhere"—it is copied from the image that is part of the prompt. The interactions are: text-to-image generation of four images, generation of four variations of one image (same prompt), and upscaling of one image. The "beta" upscaling method is used in Step 12, the "light" method in Step 13, and the default method ("detailed") in all other cases. The professional kept the two images with the yellow border. Although the image generated in Step 9 did not show a dagger as intended, he found it intriguing and said that it evoked a story, especially in combination with the kept image.

4.2 Main Observations from the Case Study
This section summarizes the insights from the case study into three main observations. We found that the mood board is a key tool for professionals and analyze its use based on the five reasons for using information resources [51]. To analyze the mental state of the professional, we use Kuhlthau's [44] model of the information search process. And based on the professional's comments during the study, we identified the lack of control he mentioned as the main problem that needs to be addressed by future tools.

The mood board as prompt library. Lemarchand [48] defines a mood board as "a single page or screen of pictures arranged around a certain idea or theme" that serves two main purposes: first, to inspire new ideas by juxtaposing images (supporting creative processes), and second, to communicate a concept quickly and effectively (managing found information). After creating the mood board from images in Midjourney's community feed, however, the professional immediately began using the mood board as a source for his prompts as well. When creating a new image, he selected from the mood board the image that came closest to his ideas in terms of artistic style, and then copied the "style part" of that image's prompt for his own creation (cf. the gray text in Figure 3). Thus, he additionally used the mood board to learn domain knowledge (style names, rendering engines, etc.) and procedural knowledge (parameters such as "--q 2" to increase image quality). Only once did the professional search for the artists of the "Magic the Gathering" cards using an external search engine, and he was pleased to find that they were already included in the prompts he copied. Learning happened only on a superficial level, copying entire style sections of a prompt and using them like atomic units. This behavior is so widespread in the text-to-image generation community at the moment that commercial services have emerged for it.9

9 E.g., https://promptbase.com


Uncertainty never fully ceases. In Kuhlthau’s [44] model of the 5.1 Limitations of Text-To-Image Generation
information-seeking process, the seeker moves from uncertainty While the functionality of text-to-image models is already of suf-
to understanding as the search progresses. During the case study, ficient quality to be used in real-world applications [62, 77], we
we were able to identify clear parallels to this model and its phases, identified the following two main limitations related to the work-
particularly the selection, exploration, formulation, and collection flow or capabilities of the current methods—the same workflow
phases. In the selection phase, the professional uses the mood board that was used in the case study.
as inspiration to choose content and style for a new image. In the
exploration phase, he created and modified the prompt: he men- Prompt Engineering. Although prompt engineering has been
tioned that he was very unsure about the results he would get and successfully applied to other generative tasks such as co-writing
how he could modify the prompt to achieve what he envisioned. screenplays and theater scripts with a large language model [57],
Once he found something he thought was promising, he moved the need to engineer the prompt compromises the intuitiveness of
into the formulation phase, focusing on generating variations over the prompt interface. Users quickly realized that iteratively adding
and over again and figuring out certain aspects that the final image modifiers to the prompt (as in Section 4), causing the model to
should have. With a clear sense of direction, he would then upscale apply the desired result styles to the generated image, is the most
matching images in the acquisition phase and test the various up- effective way to control the image generation process [53, 63]. This
scaling algorithms as necessary. As accounted for in the model, the has given rise to a whole new subfield of text-to-image prompt
professional also regressed to earlier stages, especially when he engineering [53], where prompts are increasingly becoming long
saw an impasse (cf. Figure 3). Kuhlthau, however, mentions two strings of keywords instead of text descriptions. These manipulated
“types of uncertainty,” and although uncertainty about the concept prompts resemble highly optimized search engine queries where
(what he is looking for) decreases as described above, uncertainty users select and fill in keywords—so users have learned to adapt to
about the technical process (how to get there) remains high, with the algorithm rather than the other way around.
the AI remaining largely unpredictable to him. Influence of the Training Data. A fundamental limitation of cur-
Sense of direction, but lack of control. Although in some situa- rent models is that both text encoders and diffusion models generate
tions the professional noted that the unpredictability inherent in new data by merging concepts learned from large datasets and are
the process was appealing (“I also wanted to be surprised”), he thus limited to those concepts. Writing a prompt that contains a
also mentioned that the process was very exhausting, which we concept that does not appear in either the text corpora or the image
related to the fact that he often went back in the history of his gen- datasets is likely to result in sub-par generation of images. One
erated images to keep checking which interactions yielded good possible remedy is that unknown terms can be described as para-
results and which image he should continue with. An interface that phrases. If the training data does not contain images of a centaur, a
supports the user in organizing generated images therefore seems prompt such as “a mythical creature with the body of a horse
necessary. The professional noted that he was developing a sense and the torso of a human” might still produce the desired result.
of the direction the image variations would take, but also felt he
had no control. He decided whether to continue down one path or 5.2 Active Learning for Text-To-Image Generation
try another, but did not feel he could change direction. After the From the case study, it appears that targeted text-to-image genera-
case study, Midjourney introduced the ability to modify a prompt tion is already surprisingly effective. As described in Section 3.2, the
when generating variations, but the professional says this does not current way of working amounts to iterative prompt engineering,
solve the problem of choosing the right words. Uncertainty about which in turn is a fundamental limitation, as stated in Section 5.1.
how to change the prompt to achieve the desired results therefore We propose active learning as a solution to this problem and out-
has a major negative impact on the user’s sense of control. line how it can be integrated as a feedback mechanism in an image
generation workflow that uses text-to-image models.
Indeed, the case study showed clear parallels between text-to-image Active learning [49, 102] is an iterative approach to classification
generation and image search. In particular, we found that existing that involves a feedback loop involving a user and a (semi-)super-
theoretical models of the (creative) search process are broadly ap- vised machine learning model. It is intended for scenarios where
plicable. The main difference lies in the never-ending uncertainty training data is not available to minimize the effort required to
about how to get to a particular result—although the user must obtain a suitable labeled training dataset while maximizing model
assume, because of the index being virtually infinite, that there is a quality. According to Schohn and Cohn [79], an active learning
path that leads to the goal. Based on our observations, we believe setting consists of (1) a model that is trained for a specific task,
that tools that provide the user with more intuitive ways to control (2) a query strategy that selects data from an existing resource
the generation process are needed to bridge this gap. or generates new data to be labeled, and (3) a stopping criterion
5 DISCUSSION

Based on our conceptualization of text-to-image models as search indexes and on the case study, we next explore the limitations of text-to-image generation (Section 5.1). Then we discuss how active learning might help (Section 5.2), and address ethical concerns (Section 5.3).

5.1 Limitations of Text-To-Image Generation

A concept that does not appear in either the text corpora or the image datasets is likely to result in sub-par generation of images. One possible remedy is that unknown terms can be described as paraphrases. If the training data does not contain images of a centaur, a prompt such as "a mythical creature with the body of a horse and the torso of a human" might still produce the desired result.

5.2 Active Learning for Text-To-Image Generation

From the case study, it appears that targeted text-to-image generation is already surprisingly effective. As described in Section 3.2, the current way of working amounts to iterative prompt engineering, which in turn is a fundamental limitation, as stated in Section 5.1. We propose active learning as a solution to this problem and outline how it can be integrated as a feedback mechanism in an image generation workflow that uses text-to-image models.

Active learning [49, 102] is an iterative approach to classification that establishes a feedback loop between a user and a (semi-)supervised machine learning model. It is intended for scenarios where labeled training data is not readily available: it minimizes the effort required to obtain a suitable labeled training dataset while maximizing model quality. According to Schohn and Cohn [79], an active learning setting consists of (1) a model that is trained for a specific task, (2) a query strategy that selects data from an existing resource or generates new data to be labeled, and (3) a stopping criterion that indicates at what point continuing the process is unlikely to improve the result any further. At each iteration, the query strategy selects the examples it deems most informative for the model, for example, based on the prediction uncertainty of the model [80]. These examples are then annotated by the user according to the task at hand. A new model is then trained on all previously labeled data, and the loop is repeated until an objective stopping criterion is met or the user stops.
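To make these three components concrete, the following minimal sketch (ours, not part of any published system) implements a pool-based active learning loop with uncertainty sampling; the classifier, the unlabeled pool, and the labeling function are generic placeholders for task-specific choices:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learning_loop(pool_X, label_fn, budget=50, batch_size=5):
        # pool_X: unlabeled examples; label_fn stands in for the human annotator.
        # Seed the loop with a few random labels (assumed to cover both classes).
        idx = list(np.random.choice(len(pool_X), size=batch_size, replace=False))
        y = [label_fn(pool_X[i]) for i in idx]
        model = LogisticRegression()
        while len(idx) < budget:                          # (3) stopping criterion
            model.fit(pool_X[idx], y)                     # (1) train the model
            proba = model.predict_proba(pool_X)[:, 1]
            uncertainty = -np.abs(proba - 0.5)            # closest to the decision boundary
            uncertainty[idx] = -np.inf                    # never re-query labeled examples
            query = np.argsort(uncertainty)[-batch_size:] # (2) query strategy
            for i in query:                               # the user annotates the queries
                idx.append(int(i))
                y.append(label_fn(pool_X[i]))
        return model

In practice, the budget-based stopping criterion would be replaced by a measure of whether further iterations still improve the model, or by the user simply stopping.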

Figure 4: A conceptual overview of the active learning loop for the guided text-to-image generation use case. (The diagram shows the machine-side loop of prompt reformulation, text-to-image model, and query strategy, driven by the user's (changed) prompt and feedback.)
For text-to-image generation, the whole structure of active learning is shown in Figure 4. The process begins with the user and an initial prompt. The active learning model learns to reformulate prompts, which in turn are passed to the text-to-image model. The model is trained with user feedback as target values, so that the resulting images should become increasingly appealing to the user. Subsequently, the query strategy decides which images are displayed to the user. It strikes a balance between exploration and exploitation, a well-known trade-off in information retrieval: exploration selects images that are different from the current best candidates, and exploitation selects images that are close to the current best solutions. Finally, the stopping criterion is the user, who stops the process as soon as their information need is satisfied.
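As one possible reading of Figure 4, the following runnable toy sketch simulates the loop end to end; images are represented as embedding vectors, the text-to-image model and the user are simulated stand-ins, and the weighted prompt update is a crude placeholder of ours for a learned reformulation model:

    import numpy as np

    rng = np.random.default_rng(0)
    target = rng.normal(size=8)                       # the user's hidden "desired image"

    def t2i_generate(prompt_vec, n=32):               # stands in for the text-to-image model
        return prompt_vec + rng.normal(scale=0.5, size=(n, 8))

    def user_rating(images):                          # graded relevance feedback (0..5)
        return np.clip(5 - np.linalg.norm(images - target, axis=1), 0, 5)

    prompt = rng.normal(size=8)                       # initial prompt (as an embedding)
    for iteration in range(20):
        images = t2i_generate(prompt)
        ratings = user_rating(images)
        exploit = images[np.argsort(ratings)[-4:]]    # exploitation: near the best images
        explore = images[rng.choice(len(images), 4)]  # exploration: sampled more broadly
        shown = np.vstack([exploit, explore])
        weights = user_rating(shown)
        weights = weights / (weights.sum() + 1e-9)
        prompt = 0.5 * prompt + 0.5 * (weights @ shown)   # "reformulate" the prompt
        if user_rating(prompt[None])[0] > 4.5:        # the user stops when satisfied
            break

The fixed 50/50 split between exploitation and exploration is an arbitrary choice here; in a real system, the query strategy would adapt this balance over the course of the session.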
In this setup, active learning uses relevance feedback [88, 99, 102]. Information retrieval systems can let users explicitly specify relevant documents (explicit relevance feedback) or learn from passive observations (implicit relevance feedback) [97], though this discussion focuses on explicit feedback to guide active learning for image retrieval. There are different types of explicit relevance feedback for the user: (1) binary relevance feedback [27], where the user rates each image as "unappealing" or "appealing" with respect to the target concept; (2) graded relevance feedback [27], in which the user rates each image from "unappealing" to "appealing" on a multilevel scale (e.g., from 0 to 5); and (3) ranking, where the user orders the images (possibly including images from previous iterations) from unappealing to appealing. Users can provide feedback on the entire image, on individual parts (e.g., the background), or on aspects (e.g., the color scheme). Similar to query reformulation during a regular search, the user can change the prompt in each iteration.
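For illustration, the three feedback types could be captured in data structures like the following; the field names are our own and not an established API:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class BinaryFeedback:                  # (1) "appealing" vs. "unappealing"
        image_id: str
        appealing: bool
        aspect: Optional[str] = None       # e.g., "background" or "color scheme"

    @dataclass
    class GradedFeedback:                  # (2) multilevel scale, e.g., 0 to 5
        image_id: str
        grade: int
        aspect: Optional[str] = None

    @dataclass
    class RankingFeedback:                 # (3) ordering from most to least appealing
        ranked_image_ids: List[str]
        includes_previous_iterations: bool = False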
The main challenge for this feedback mechanism is to convert the images into a textual representation that preserves the specifics of each image, which can then be used to learn how to reformulate the prompt. For example, a prompt like "wizard with staff" could generate images with different poses and backgrounds. To learn reformulations from relevance feedback, it is necessary to obtain a textual representation that includes these differences. One could, of course, try to learn to reformulate based only on latent image representations and relevance feedback, but this would solve the problem exclusively in the image space and largely ignore the text embedding space. This could also be a useful approach, but it is outside the realm of natural language processing and information retrieval. Although the reverse step of image-to-text generation required for this has recently attracted increasing attention [25, 26], it remains a challenge; moreover, multiple images are required to generate one text [25]. Once this reverse direction is improved, the full spectrum of natural language processing and information retrieval can be applied to effectively process user feedback to improve prompts during the reformulation step.
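One pragmatic approximation of such a textual representation today is off-the-shelf image captioning. The sketch below uses the BLIP captioning model from the Hugging Face transformers library as a stand-in for image-to-text; the model choice and decoding parameters are our assumptions, not a component proposed here:

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe(image_path):
        # Produce a caption that (partially) preserves the specifics of the image.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(out[0], skip_special_tokens=True)

Contrasting the captions of highly and poorly rated images (e.g., "wizard with staff in a forest" vs. "wizard with staff in a cave") could then inform the reformulation step.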
When text-to-image generation is viewed as a retrieval problem (as in Section 3), the process of trying different prompts until a satisfactory image is generated is similar to traditional image retrieval, and thus the inclusion of active learning as a relevance feedback mechanism is an obvious choice of a well-established method. We anticipate that active prompt generation will be a strong interface competitor for generative text-to-image models once image-to-text models are sufficiently mature (apart from editing options such as in-painting or out-painting, which are orthogonal to this approach).

5.3 Ethical Concerns

A computational approach powerful enough to generate documents such as images, text, and other media types at a quality that is at times difficult to distinguish from human-made illustrations naturally raises ethical concerns. We discuss the most important ones below.

Will algorithms replace artists? We begin with the obvious question: will generative text-to-image models threaten artists' jobs? First, based on our experience in the case study, it is currently difficult to get text-to-image models to generate a desired result. The decision whether the generated images represent the desired scene with sufficient quality still has to be made by the user. Therefore, we believe that these new models will be a powerful tool, but will not replace the human illustrator in the foreseeable future—even if the image quality should eventually reach human levels. This is corroborated by others such as Liu et al. [54], who developed and evaluated a system that assists users in generating images for news articles, noting that artistic knowledge is still beneficial to the generated result, explicitly saying "generative AI deployment should [...] augment rather than [...] replace human creative expertise". We support this view: instead of an autonomous AI that acts on its own, we want to emphasize the benefits of a "supportive AI" that inquires about and incorporates the decisions of its users.

Who is the author of a generated image? And who owns the rights? This is currently an unresolved situation that leads to uncertainties regarding the use of AI-generated images. For this reason, major platforms such as the well-known image provider Getty Images have recently banned all AI-generated content.10 Stakeholders may include the user, the creators of the model, and the artists who created the images used for model training. Ultimately, this decision must be made by policy makers and by the courts, where many legal precedents have been set in the past through copyright litigation.

Text-to-image models for generating misinformation? Generated misinformation is already a pervasive problem and is widely discussed in the context of so-called "deep fakes" and AI-generated text [43, 82, 101]. To mitigate this problem in text-to-image models such as Stable Diffusion, an image is watermarked to identify it as artificially generated.11 Although watermarks are not easy to remove, this may not be enough if they are not checked on virtually all devices. However, this requires that policymakers legally oblige device manufacturers to detect fakes and warn users. In addition, watermarking images itself raises privacy concerns. As for text-to-text models, fully generated documents can be useful, provided they are not used to generate factual knowledge, for which such models are currently woefully inadequate. Therefore, the use of such models as an infinite index must at least be subjected to post-processing in the form of fact checking or the like. This is exactly what is happening at present, after OpenAI recently introduced ChatGPT12 with much publicity: the search engines You13 and Neeva14 have already integrated facsimiles of ChatGPT into their search interfaces and check the generated documents against traditional search results. Whether this proves to be a good idea remains to be seen.

10 https://voicebot.ai/2022/09/23/getty-images-removes-and-bans-ai-generated-art/
11 https://github.com/CompVis/stable-diffusion
12 https://openai.com/blog/chatgpt/
13 https://blog.you.com/a9e05080c8ea
14 https://neeva.com/blog/introducing-neevaai
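To illustrate the mechanism, the Stable Diffusion reference implementation (footnote 11) embeds an invisible watermark in the frequency domain using the open-source invisible-watermark package. The sketch below uses that package; the payload string and method are example values of ours, not the exact production settings:

    import cv2
    from imwatermark import WatermarkEncoder, WatermarkDecoder

    payload = b"GENERATED"                             # example payload (9 bytes)
    bgr = cv2.imread("generated.png")                  # image as a BGR array

    encoder = WatermarkEncoder()
    encoder.set_watermark('bytes', payload)
    cv2.imwrite("generated_wm.png", encoder.encode(bgr, 'dwtDct'))

    decoder = WatermarkDecoder('bytes', 8 * len(payload))   # payload length in bits
    recovered = decoder.decode(cv2.imread("generated_wm.png"), 'dwtDct')
    print(recovered == payload)                        # True if the watermark survived

Such a check only helps if it is actually performed at the point of consumption, which is precisely the device-level enforcement problem discussed above.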
Do these models express or even amplify bias? Bias in training data is a known problem for both image data [41] and language models [42]. Therefore, text-to-image models must also be systematically screened for social and other types of bias. In information retrieval, for example, fair ranking is now a widely studied problem. A retrieval process built on generative models could be designed to mitigate their inherent biases. Image search engines based on generative models must post-process and re-rank their results to compensate for bias, just like their traditional counterparts. However, the technologies developed for traditional search engines can also be applied to search engines based on generative models.15

15 For example, compare the result from lexica.art (https://lexica.art/?q=nurse) with that from Google Images (https://www.google.com/search?tbm=isch&q=nurse).

Figure 5: Results for the prompt "wizard with a staff" in Midjourney: (left) version 3, default at the time of our case study; (right) version 4, the default three months later.
6 CONCLUSION

Supporting systems and services are needed for the use of generative text-to-image models. Their integration into existing systems is already in full swing, as has been seen for years in generative models for writing assistance and translation systems, but now also in more creative areas. However, integration with end-user software to create slide presentations or artwork will not meet all the needs of those looking for inspirational images. Given the recent moves by You and Neeva, specialized search engines based on generative text-to-image models as indexes, with user interfaces for formulating information needs and customized retrieval models, are probably already being developed. However, the development of a search engine is not trivial, and the information retrieval community faces the renewed challenge of developing an understanding of, and a technological foundation for, such search engines. This includes the development of new retrieval models and relevance scores as well as the adaptation of evaluation methods for benchmarking search engines based on generative models. Moreover, because results can vary widely from one day to the next (cf. Figure 5), users cannot rely on things like remembering specific queries to search for known items. Therefore, to effectively use generative image models as a search index, it may be necessary to maintain a history of search results with the appropriate model parameters. Finally, what is true for generative text-to-image models is likely to be true for other kinds of text-to-document models as well, opening up a whole new world of exciting research directions promising high impact.
of exciting new research directions and promising high impact. [11] Andrei Z. Broder. 2002. A taxonomy of web search. SIGIR Forum 36, 2 (2002),
12 https://openai.com/blog/chatgpt/ 3–10. https://doi.org/10.1145/792550.792552
13 https://blog.you.com/a9e05080c8ea
[12] Fei Cai and Maarten de Rijke. 2016. A Survey of Query Auto Completion in
Information Retrieval. Found. Trends Inf. Retr. 10, 4 (2016), 273–363.
14 https://neeva.com/blog/introducing-neevaai
https://doi.org/10.1561/1500000055
15 For example, compare the result from lexica.art (https://lexica.art/?q=nurse) with [13] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu
that from Google Images (https://www.google.com/search?tbm=isch&q=nurse). Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael
Rubinstein, et al. 2023. Muse: Text-To-Image Generation via Masked
Generative Transformers. arXiv preprint arXiv:2301.00704 (2023).


[14] Catherine Chavula, Yujin Choi, and Soo Young Rieh. 2022. Understanding Creative Thinking Processes in Searching for New Ideas. In CHIIR '22. ACM, 321–326. https://doi.org/10.1145/3498366.3505783
[15] Hyerim Cho, Minh T. N. Pham, Katherine N. Leonard, and Alex C. Urban. 2021. A systematic literature review on image information needs and behaviors. Journal of Documentation 78, 2 (2021), 207–227.
[16] Youngok Choi. 2013. Analysis of image search queries on the web: Query modification patterns and semantic attributes. Journal of the American Society for Information Science and Technology 64, 7 (2013), 1423–1441.
[17] Cyril W. Cleverdon. 1967. The Cranfield tests on index language devices. In Aslib Proceedings. MCB UP Ltd. Reprinted in Readings in Information Retrieval, Karen Sparck-Jones and Peter Willett (Eds.), Morgan Kaufmann, 1997, 173–192.
[18] Cyril W. Cleverdon. 1991. The Significance of the Cranfield Tests on Index Languages. In SIGIR '91. ACM, 3–12.
[19] Silviu Cucerzan and Eric Brill. 2004. Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users. In EMNLP 2004. ACL, 293–300. https://aclanthology.org/W04-3238/
[20] Van Dang and W. Bruce Croft. 2010. Query reformulation using anchor text. In WSDM 2010. ACM, 41–50. https://doi.org/10.1145/1718487.1718493
[21] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion Models Beat GANs on Image Synthesis. Advances in Neural Information Processing Systems 34 (2021), 8780–8794.
[22] Alan Feuer, Stefan Savev, and Javed A. Aslam. 2007. Evaluation of phrasal query suggestions. In CIKM 2007. ACM, 841–848. https://doi.org/10.1145/1321440.1321556
[23] Maik Fröbe, Janek Bevendorff, Jan Heinrich Reimer, Martin Potthast, and Matthias Hagen. 2020. Sampling Bias Due to Near-Duplicates in Learning to Rank. In SIGIR 2020. ACM, 1997–2000. https://doi.org/10.1145/3397271.3401212
[24] Maik Fröbe, Jan Philipp Bittner, Martin Potthast, and Matthias Hagen. 2020. The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines. In ECIR 2020 (LNCS 12036). Springer, 12–19. https://doi.org/10.1007/978-3-030-45442-5_2
[25] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. CoRR abs/2208.01618 (2022). https://doi.org/10.48550/arXiv.2208.01618
[26] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. CoRR abs/2208.01618 (2022). https://doi.org/10.48550/arXiv.2208.01618
[27] Gregory Gay, Sonia Haiduc, Andrian Marcus, and Tim Menzies. 2009. On the use of relevance feedback in IR-based concept location. In ICSM 2009. IEEE Computer Society, 351–360. https://doi.org/10.1109/ICSM.2009.5306315
[28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[29] William S. Hemmig. 2008. The information-seeking behavior of visual artists: a literature review. Journal of Documentation 64, 3 (2008), 343–362. https://doi.org/10.1108/00220410810867579
[30] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey A. Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. 2022. Imagen Video: High Definition Video Generation with Diffusion Models. CoRR abs/2210.02303 (2022). https://doi.org/10.48550/arXiv.2210.02303
[31] Vera Hollink, Theodora Tsikrika, and Arjen P. de Vries. 2011. Semantic search log analysis: A method and a study on professional image search. J. Assoc. Inf. Sci. Technol. 62 (2011), 691–713.
[32] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. 2003. Relevant term suggestion in interactive web search based on contextual information in query session logs. J. Assoc. Inf. Sci. Technol. 54, 7 (2003), 638–649. https://doi.org/10.1002/asi.10256
[33] Sethurathienam Iyer, Shubham Chaturvedi, and Tirtharaj Dash. 2017. Image Captioning-Based Image Search Engine: An Alternative to Retrieval by Metadata. In SocProS 2017 (AISC 817). Springer, 181–191. https://doi.org/10.1007/978-981-13-1595-4_14
[34] Nasreen Abdul Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah S. Larkey, Xiaoyan Li, Mark D. Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. In TREC 2004 (NIST Special Publication 500-261). NIST. http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf
[35] Bernard Jansen, D. Booth, and A. Spink. 2009. Patterns of Query Reformulation During Web Searching. J. Assoc. Inf. Sci. Technol. 60, 7 (2009), 1358–1371. https://doi.org/10.1002/asi.21071
[36] Bernard Jansen, Amanda Spink, and Sherry Koshman. 2007. Web searcher interaction with the Dogpile.com metasearch engine. J. Assoc. Inf. Sci. Technol. 58, 5 (2007), 744–755. https://doi.org/10.1002/asi.20555
[37] Bernard Jansen, Amanda Spink, and Jan Pedersen. 2005. A temporal comparison of AltaVista Web searching. J. Assoc. Inf. Sci. Technol. 56, 6 (2005), 559–570. https://doi.org/10.1002/asi.20145
[38] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446.
[39] Thorsten Joachims and Filip Radlinski. 2007. Search Engines that Learn from Implicit Feedback. Computer 40, 8 (2007), 34–40. https://doi.org/10.1109/MC.2007.289
[40] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. 2022. Imagic: Text-Based Real Image Editing with Diffusion Models. CoRR abs/2210.09276 (2022). http://arxiv.org/abs/2210.09276
[41] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A. Efros, and Antonio Torralba. 2012. Undoing the Damage of Dataset Bias. In ECCV 2012 (LNCS 7572). Springer, 158–171. https://doi.org/10.1007/978-3-642-33718-5_12
[42] Hannah Rose Kirk, Yennie Jun, Filippo Volpin, Haider Iqbal, Elias Benussi, Frederic Dreyer, Aleksandar Shtedritski, and Yuki Asano. 2021. Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 2611–2624. https://proceedings.neurips.cc/paper/2021/file/1531beb762df4029513ebf9295e0d34f-Paper.pdf
[43] Sarah Kreps, R. Miles McCain, and Miles Brundage. 2022. All the News That's Fit to Fabricate: AI-Generated Text as a Tool of Media Misinformation. Journal of Experimental Political Science 9, 1 (2022), 104–117. https://doi.org/10.1017/XPS.2020.37
[44] Carol Collier Kuhlthau. 1993. A Principle of Uncertainty for Information Seeking. J. Documentation 49, 4 (1993), 339–355. https://doi.org/10.1108/eb026918
[45] Kurd Laßwitz. 1897. Bis zum Nullpunkt des Seins und andere Science-Fiction-Erzählungen (Kapitel 10: Die Universalbibliothek). Schlesische Zeitung; new edition on Projekt Gutenberg, 2017. Published between 1871 and 1908. https://www.projekt-gutenberg.org/lasswitz/nullpunk/titlepage.html
[46] Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. 2022. Do Language Models Plagiarize? CoRR abs/2203.07618 (2022). https://doi.org/10.48550/arXiv.2203.07618
[47] Lo Lee, Melissa G. Ocepek, Stephann Makri, George Buchanan, and Dana McKay. 2019. Getting creative in everyday life: Investigating arts and crafts hobbyists' information behavior. Proceedings of the Association for Information Science and Technology 56, 1 (2019), 703–705. https://doi.org/10.1002/pra2.141
[48] Richard Lemarchand. 2021. A Playful Production Process: For Game Designers (and Everyone). MIT Press, Cambridge, MA.
[49] David D. Lewis and William A. Gale. 1994. A Sequential Algorithm for Training Text Classifiers. In SIGIR '94. ACM/Springer, 3–12. https://doi.org/10.1007/978-1-4471-2099-5_1


[50] Xiaoqing Li, Jiansheng Yang, and Jinwen Ma. 2021. Recent developments of content-based image retrieval (CBIR). Neurocomputing 452 (2021), 675–689.
[51] Yuan Li, Yinglong Zhang, and Robert Capra. 2022. Analyzing Information Resources That Support the Creative Process. In CHIIR '22. ACM, 180–190. https://doi.org/10.1145/3498366.3505817
[52] Wen-Cheng Lin, Yih-Chen Chang, and Hsin-Hsi Chen. 2004. From Text to Image: Generating Visual Query for Image Retrieval. In CLEF 2004 (LNCS 3491). Springer, 664–675. https://doi.org/10.1007/11519645_65
[53] Vivian Liu and Lydia B. Chilton. 2022. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In CHI '22. ACM, Article 384, 23 pages. https://doi.org/10.1145/3491102.3501825
[54] Vivian Liu, Han Qiao, and Lydia Chilton. 2022. Opal: Multimodal Image Generation for News Illustration. arXiv preprint arXiv:2204.09007 (2022).
[55] Yang Liu, Alan Medlar, and Dorota Glowacka. 2022. ROGUE: A System for Exploratory Search of GANs. In SIGIR '22. ACM, 3278–3282. https://doi.org/10.1145/3477495.3531675
[56] Xiaolu Lu, Alistair Moffat, and J. Shane Culpepper. 2016. The effect of pooling and evaluation depth on IR metrics. Inf. Retr. J. 19, 4 (2016), 416–445.
[57] Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. 2022. Co-Writing Screenplays and Theatre Scripts with Language Models: An Evaluation by Industry Professionals. arXiv preprint arXiv:2209.14958 (2022).
[58] Bhaskar Mitra and Nick Craswell. 2018. An Introduction to Neural Information Retrieval. Found. Trends Inf. Retr. 13, 1 (2018), 1–126. https://doi.org/10.1561/1500000061
[59] Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27, 1 (2008), 2:1–2:27.
[60] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCoNIPS 2016 (CEUR Workshop Proceedings, Vol. 1773). CEUR-WS.org. http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
[61] Rodrigo Frassetto Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. CoRR abs/1904.08375 (2019). http://arxiv.org/abs/1904.08375
[62] OpenAI. 2022. DALL·E: Creating Images from Text. https://openai.com/blog/dall-e/
[63] Jonas Oppenlaender. 2022. Prompt Engineering for Text-Based Generative Art. arXiv preprint arXiv:2204.13988 (2022).
[64] Srishti Palani, Zijian Ding, Stephen MacNeil, and Steven P. Dow. 2021. The "Active Search" Hypothesis: How Search Strategies Relate to Creative Learning. In CHIIR '21. ACM, 325–329. https://doi.org/10.1145/3406522.3446046
[65] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. CoRR abs/2209.14988 (2022). https://doi.org/10.48550/arXiv.2209.14988
[66] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML 2021 (PMLR 139). 8748–8763. http://proceedings.mlr.press/v139/radford21a.html
[67] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv preprint arXiv:2204.06125 (2022).
[68] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In ICML 2021 (PMLR 139). 8821–8831. https://proceedings.mlr.press/v139/ramesh21a.html
[69] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems 32 (2019).
[70] Navid Rekabsaz, Oleg Lesota, Markus Schedl, Jon Brassey, and Carsten Eickhoff. 2021. TripClick: The Log Files of a Large Health Web Search Engine. In SIGIR '21. ACM, 2507–2513. https://doi.org/10.1145/3404835.3463242
[71] Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In CHI EA '21. ACM, 314:1–314:7. https://doi.org/10.1145/3411763.3451760
[72] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR 2022. 10684–10695.
[73] Ian Ruthven. 2008. Interactive information retrieval. Annu. Rev. Inf. Sci. Technol. 42, 1 (2008), 43–91. https://doi.org/10.1002/aris.2008.1440420109
[74] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487 (2022).
[75] Tetsuya Sakai. 2007. Alternatives to Bpref. In SIGIR 2007. ACM, 71–78.
[76] Tetsuya Sakai. 2008. Comparing metrics across TREC and NTCIR: The robustness to system bias. In CIKM 2008. ACM, 581–590.
[77] Rob Salkowitz. 2022. Midjourney Founder David Holz On The Impact Of AI On Art, Imagination And The Creative Economy. Forbes (Sept. 2022). https://www.forbes.com/sites/robsalkowitz/2022/09/16/midjourney-founder-david-holz-on-the-impact-of-ai-on-art-imagination-and-the-creative-economy/
[78] R. Keith Sawyer. 2012. Explaining Creativity: The Science of Human Innovation. Oxford University Press, New York, NY, USA.
[79] Greg Schohn and David Cohn. 2000. Less is More: Active Learning with Support Vector Machines. In ICML 2000. Morgan Kaufmann, 839–846.
[80] Christopher Schröder, Andreas Niekler, and Martin Potthast. 2022. Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers. In Findings of the Association for Computational Linguistics: ACL 2022. ACL, 2194–2203. https://doi.org/10.18653/v1/2022.findings-acl.172
[81] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. CoRR abs/2111.02114 (2021). https://arxiv.org/abs/2111.02114
[82] Tal Schuster, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020. The Limitations of Stylometry for Detecting Machine-Generated Fake News. Comput. Linguist. 46, 2 (2020), 499–510. https://doi.org/10.1162/coli_a_00380
[83] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML 2015. PMLR, 2256–2265.
[84] Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2022. Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models. CoRR abs/2212.03860 (2022). https://doi.org/10.48550/arXiv.2212.03860
[85] Statista Inc. 2022. Digital Media Report - Video Games. https://www.statista.com/study/39310/video-games/
[86] Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Prakash Gupta, Tal Schuster, William W. Cohen, and Donald Metzler. 2022. Transformer Memory as a Differentiable Search Index. CoRR abs/2202.06991 (2022). https://arxiv.org/abs/2202.06991


[87] Nicola Tonellotto. 2022. Lecture Notes on Neural Information Retrieval. CoRR abs/2207.13443 (2022). https://doi.org/10.48550/arXiv.2207.13443
[88] Simon Tong and Edward Y. Chang. 2001. Support vector machine active learning for image retrieval. In ACM Multimedia 2001. ACM, 107–118. https://doi.org/10.1145/500141.500159
[89] Antti Ukkonen, Pyry Joona, and Tuukka Ruotsalo. 2020. Generating Images Instead of Retrieving Them: Relevance Feedback on Generative Adversarial Networks. In SIGIR 2020. ACM, 1329–1338. https://doi.org/10.1145/3397271.3401129
[90] Salahuddin Unar, Xingyuan Wang, Chuan Zhang, and Chunpeng Wang. 2019. Detected text-based image retrieval approach for textual images. IET Image Process. 13, 3 (2019), 515–521. https://doi.org/10.1049/iet-ipr.2018.5277
[91] Ellen M. Voorhees. 2001. The Philosophy of Information Retrieval Evaluation. In CLEF 2001 (LNCS 2406). Springer, 355–370.
[92] Ellen M. Voorhees. 2019. The Evolution of Cranfield. In Information Retrieval Evaluation in a Changing World: Lessons Learned from 20 Years of CLEF. The Information Retrieval Series, Vol. 41. Springer, 45–69.
[93] Ellen M. Voorhees, Ian Soboroff, and Jimmy Lin. 2022. Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models? CoRR abs/2201.11086 (2022). arXiv:2201.11086
[94] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. 2021. Towards Real-World Blind Face Restoration With Generative Facial Prior. In CVPR 2021. Computer Vision Foundation / IEEE, 9168–9178. https://doi.org/10.1109/CVPR46437.2021.00905
[95] Yujing Wang, Yingyan Hou, Haonan Wang, Ziming Miao, Shibin Wu, Hao Sun, Qi Chen, Yuqing Xia, Chengmin Chi, Guoshuai Zhao, Zheng Liu, Xing Xie, Hao Allen Sun, Weiwei Deng, Qi Zhang, and Mao Yang. 2022. A Neural Corpus Indexer for Document Retrieval. CoRR abs/2206.02743 (2022). https://doi.org/10.48550/arXiv.2206.02743
[96] Ryen W. White, Mikhail Bilenko, and Silviu Cucerzan. 2007. Studying the use of popular destinations to enhance web search interaction. In SIGIR 2007. ACM, 159–166. https://doi.org/10.1145/1277741.1277771
[97] Ryen W. White, Ian Ruthven, and Joemon M. Jose. 2005. A study of factors affecting the utility of implicit relevance feedback. In SIGIR 2005. ACM, 35–42. https://doi.org/10.1145/1076034.1076044
[98] Wikipedia contributors. 2022. Infinite monkey theorem — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Infinite_monkey_theorem&oldid=1122059899 [Online; accessed 10-January-2023].
[99] Zuobing Xu, Ram Akella, and Yi Zhang. 2007. Incorporating Diversity and Density in Active Learning for Relevance Feedback. In ECIR 2007 (LNCS 4425). Springer, 246–257. https://doi.org/10.1007/978-3-540-71496-5_24
[100] Christoph Zauner. 2010. Implementation and Benchmarking of Perceptual Image Hash Functions. Master's thesis. Upper Austria University of Applied Sciences, Hagenberg Campus.
[101] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending Against Neural Fake News. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf
[102] Cha Zhang and Tsuhan Chen. 2002. An active learning framework for content-based information retrieval. IEEE Transactions on Multimedia 4, 2 (2002), 260–268. https://doi.org/10.1109/TMM.2002.1017738
[103] Yinglong Zhang and Robert Capra. 2019. Understanding How People Use Search to Support Their Everyday Creative Tasks. In CHIIR '19. ACM, 153–162. https://doi.org/10.1145/3295750.3298936
[104] Yinglong Zhang, Rob Capra, and Yuan Li. 2020. An In-Situ Study of Information Needs in Design-Related Creative Projects. In CHIIR '20. ACM, 113–123. https://doi.org/10.1145/3343413.3377973
[105] Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon, and Daxin Jiang. 2022. Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation. CoRR abs/2206.10128 (2022). https://doi.org/10.48550/arXiv.2206.10128
