1 Introduction

Recently, impressive breakthroughs have been made in text-to-image generation, as several large models (Yu et al., 2022; Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022) trained on large-scale datasets (Schuhmann et al., 2021; Ramesh et al., 2021; Changpinyo et al., 2021) have achieved remarkable performance.

However, compositional generalization of (large) generative models, i.e., the ability to compose different concepts during generation, is far from a solved problem. It can be divided into two categories: entity composition and attribute composition. Entity composition means that a generative model integrates several entities into a complex scene, whereas attribute composition refers to the combination of different attributes on individual entities. Popular attribute compositions are not difficult for current text-to-image generation models. However, generating an image conditioned on a prompt with underrepresented attribute compositions remains a great challenge.

In this paper, we aim to improve attribute compositional generalization, for which there are two main challenges: underrepresented attribute compositions and overrepresented attribute compositions. Underrepresented attribute compositions have only a few or no samples in the training data, while overrepresented attribute compositions have an excessive number of instances. This imbalance results in generative models that fail to capture the data distribution of underrepresented attribute compositions and over-memorize the features of popular compositions. For example, in the CelebA-HQ dataset (Lee et al., 2020), “she” and “wearing earrings” form a very frequent composition (see Fig. 1), while the composition of “he” and “wearing earrings” is underrepresented. The imbalanced distribution causes generative models to synthesize an image of a woman with earrings, instead of a man, when given the input “he is wearing earrings”. Previous works (Li et al., 2022; Nie et al., 2021) have attempted to solve this problem by manipulating attribute-directed codes in the latent space of a pre-trained generator. However, using latent codes that do not follow the learned distribution carries the risk of generating low-quality images. StyleT2I (Li et al., 2022) uses spatial constraints to disentangle attributes, but it requires an external segmentation model, which makes it difficult to use and extend.

Fig. 1

Our model can generate high-fidelity, text-matched images, even if the attribute compositions in the input text have not been seen during training

A natural idea to handle underrepresented attribute compositions is to augment the training samples. However, augmenting images with such compositions is difficult in practice. Fortunately, creating text with underrepresented attribute compositions is straightforward. Inspired by these observations, in this paper we propose ACTIG, a novel attribute-centric compositional text-to-image generation framework. Specifically, we introduce an attribute-centric feature augmentation and a new image-free training paradigm to compensate for the data distribution. We compute the augmented text features using the CLIP text encoder and obtain the corresponding augmented image features via our text-to-image mapping network. Our image-free training encourages the model to generate images with underrepresented attribute compositions. We further propose an attribute-centric contrastive loss to disentangle the feature distribution of attributes, which avoids overfitting to overrepresented attribute compositions. The main contributions of this paper are summarized as follows:

  • We present an attribute-centric compositional text-to-image generation framework, named ACTIG, which excels in image quality and text-image consistency.

  • We alternate between a fully supervised paradigm and a novel image-free paradigm in training, so that the model learns the feature distributions of the real data and underrepresented attribute compositions from attribute-centric feature augmentation simultaneously.

  • We propose an attribute-centric contrastive loss to disentangle the data distribution of attributes to prevent the model from over-memorizing the overrepresented attribute compositions.

  • We conduct comprehensive experiments on CelebA-HQ dataset (Lee et al., 2020) and CUB dataset (Wah et al., 2011), and ACTIG achieves state-of-the-art results.

2 Related Work

Text-to-Image Generation. Significant progress has been made in text-to-image generation over the years and a variety of models have emerged (Lee et al., 2022; Wu et al., 2022b, a; Zhou et al., 2022). Recently, diffusion models (Saharia et al., 2022; Nichol et al., 2021; Gu et al., 2022) trained on large-scale datasets have demonstrated tremendous promise. DALLE2 (Ramesh et al., 2022) proposes a diffusion decoder that generates an image conditioned on the CLIP embedding (Radford et al., 2021). Technically, unCLIP (Ramesh et al., 2022) implements a text-to-image mapping to estimate the latent image representation. In contrast, we design our text-to-image mapping for image-free training, learning underrepresented attribute compositions by combining the mapping with attribute-centric feature augmentation; our mapping is used only during training, whereas unCLIP uses the mapping during image generation/sampling. Stable Diffusion (Rombach et al., 2022) integrates cross-attention modules into the model structure to build powerful diffusion generators. Auto-regressive models (Ramesh et al., 2021; Gafni et al., 2022; Ding et al., 2021; Yu et al., 2022) are also showing their potential. Meanwhile, GAN-based methods (Zhang & Schomaker, 2022; Wang et al., 2021; Liu et al., 2021a; Crowson et al., 2022; Qiao et al., 2019; Yin et al., 2019; Li et al., 2019b; Cheng et al., 2020; Zhang et al., 2018; Zhu & Ngo, 2020) have motivated many advances in text-to-image generation. AttnGAN (Xu et al., 2018) introduces an attentional multimodal similarity model to calculate a fine-grained image-text correspondence loss, which is widely used in many GAN models (Tao et al., 2022; Liao et al., 2022; Zhu et al., 2019; Li et al., 2019a; Wu et al., 2022c). XMC-GAN (Zhang et al., 2021) improves text-image matching through cross-modal contrastive learning. DAE-GAN (Ruan et al., 2021) considers not only sentence-level information but also the information of attributes extracted from the text. As a successful image generation framework, StyleGAN (Karras et al., 2019, 2020) has also been extended for text-to-image generation. TediGAN (Xia et al., 2021a, b) minimizes the embedding distances between the image and the corresponding text in the latent space and employs a pre-trained StyleGAN generator to synthesize images. The above GANs strongly depend on the data distribution of the training set. To improve zero-shot text-to-image generation, Lafite (Zhou et al., 2021) proposes a language-free training framework that generates text features from image features. However, this method is still limited by the feature distribution of entire images; for rare attribute compositions in natural images, Lafite is unable to capture the features effectively. Our framework, ACTIG, goes in the opposite direction: we generate image features from attribute-centric augmented text to perform image-free training.

Compositional Image Generation. Park et al. (2021) propose a benchmark for compositional text-to-image generation, which studies previous text-to-image generation models on attribute compositions. LACE (Nie et al., 2021) proposes an energy-based model formulating attribute labels in the latent space of a pre-trained generator. However, since the formulated latent codes do not exactly follow the learned distribution of the pre-trained generator, there is a risk of reduced image quality. Liu et al. (2022) interpret diffusion models as energy-based models that compose several prompts into an image. A spatial constraint loss is introduced in StyleT2I (Li et al., 2022) to disentangle attribute features by limiting the spatial variation according to the input attributes. However, these works have more or less ignored the imbalanced distribution of attribute compositions in the dataset. Our framework, ACTIG, improves attribute compositional generalization by focusing on underrepresented and overrepresented attribute compositions.

Multi-modal Representation Learning. High-quality text representations are essential for text-to-image generation. AttnGAN (Xu et al., 2018) introduces a fine-grained learning framework that uses an attention mechanism to connect words and sub-regions, while XMC-GAN obtains text embeddings from a pre-trained BERT (Devlin et al., 2018). CLIP (Radford et al., 2021) has been introduced to the task of text-to-image generation, and many previous works (Zhou et al., 2021; Li et al., 2022; Ramesh et al., 2022; Wang et al., 2022; Liu et al., 2021b) have demonstrated the strength of its joint language-and-vision feature space. Inspired by this, we pre-train a specific CLIP model connecting attributes and images to guide our attribute-centric contrastive loss, which is used to capture independent attribute distributions during adversarial training.

3 Method

In this section, we present our framework ACTIG for attribute-centric compositional text-to-image generation. For underrepresented attribute compositions, we propose an attribute-centric feature augmentation and an image-free training paradigm to compensate for the data distribution. For overrepresented attribute compositions, we introduce an attribute-centric contrastive loss to capture the independent attribute distributions.

3.1 GAN Structure

Our generative model is built upon StyleGAN2 (Karras et al., 2020) with two modifications. First, we replace the unconditional generator with a conditional one. Second, to facilitate image-free training, we keep the original StyleGAN2 discriminator to estimate whether images are real or fake, and introduce an additional matching discriminator based on CLIP encodings to estimate text-image consistency.

Generator. The original StyleGAN2 generator consists of a mapping network and a synthesis network. The non-linear mapping network projects the input latent code \(\varvec{z}\) into a latent space \(\mathcal {W}\), while the synthesis network generates images based on the output of the mapping network. To make the unconditional generator conditional, we normalize and concatenate the latent code \(\varvec{z}\) and text encoding \(\varvec{t}\) provided by the CLIP encoder as input to the mapping network, while the generator architecture remains unchanged. The architecture of our generator G is shown in Fig. 2. Therefore, an image \(\hat{I}\) generated by the generator G can be formulated as: \(\hat{I}=G(\varvec{z}, \varvec{t})\).
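To make the conditioning concrete, here is a minimal PyTorch sketch (not the released implementation) of how the latent code and CLIP text encoding could be normalized and concatenated before entering the mapping network; the 512-dimensional inputs are assumptions consistent with the 1024-dimensional generator input reported in Sect. 4.3.

```python
import torch
import torch.nn.functional as F

def conditional_mapping_input(z, t):
    """Normalize and concatenate the latent code z and the CLIP text
    encoding t as input to the StyleGAN2 mapping network.
    z: (batch, 512) latent code, t: (batch, 512) CLIP text encoding;
    the result has shape (batch, 1024), matching the generator input
    dimension reported in Sect. 4.3."""
    z = F.normalize(z, dim=-1)
    t = F.normalize(t, dim=-1)
    return torch.cat([z, t], dim=-1)
```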

Discriminators. Different from previous works (Ruan et al., 2021; Zhou et al., 2021; Liao et al., 2022; Tao et al., 2022) that use a shared discriminator backbone to estimate photo-fidelity and text-image consistency simultaneously, we adopt a fidelity discriminator \(D_f\) and a matching discriminator \(D_m\) that are independent of each other. This allows us to update them flexibly for image-free training. For the fidelity discriminator \(D_f\), we directly use the StyleGAN2 discriminator. For the matching discriminator \(D_m\), we utilize the CLIP text encoder \(E_{\text {txt}}\) and image encoder \(E_{\text {img}}\) to compute the encodings of the input text and image. Two fully-connected layers transform the text and image encodings respectively, and the cosine similarity between the transformed embeddings represents the text-image consistency. The architecture of the matching discriminator \(D_m\) is shown in Fig. 3.
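As a rough sketch of this design (layer sizes are assumptions, not the released implementation), \(D_m\) can be written to operate directly on precomputed CLIP encodings:

```python
import torch.nn as nn
import torch.nn.functional as F

class MatchingDiscriminator(nn.Module):
    """Matching discriminator D_m: one fully-connected layer per modality
    transforms the CLIP text/image encodings, and the cosine similarity of
    the transformed embeddings is the text-image consistency score."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj_txt = nn.Linear(dim, dim)  # transforms the text encoding
        self.proj_img = nn.Linear(dim, dim)  # transforms the image encoding

    def forward(self, t, i):
        # t, i: CLIP text / image encodings of shape (batch, dim)
        return F.cosine_similarity(self.proj_txt(t), self.proj_img(i), dim=-1)
```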

3.2 Text-to-Image Mapping

We propose a text-to-image mapping network that projects a CLIP text encoding \(\varvec{t}\) into the CLIP image space to obtain an approximate CLIP image encoding \(\tilde{\varvec{i}}\), which is later used for image-free training (see Fig. 4). The mapping network consists of multiple fully-connected layers with residual connections and batch normalization; its structure is shown in Fig. 5. The input and output dimensions are the same as the CLIP encoding dimension. The mapping network is pre-trained before optimizing the GAN. We combine three loss functions to enforce \(\tilde{\varvec{i}}\) to be close to the real image encoding \(\varvec{i}\) from different perspectives. Mean squared error and a cosine similarity loss align \(\tilde{\varvec{i}}\) and \(\varvec{i}\) in Euclidean space and cosine space, respectively. We also use a contrastive loss to make \(\tilde{\varvec{i}}\) most similar to the corresponding \(\varvec{i}\) in the batch. Given a batch of N text-image pairs, the complete objective function for the mapping network is,

$$\begin{aligned} \begin{aligned}&L_{mapping} = L_{MSE}+L_{similarity}+L_{contrast}, \\&L_{MSE}=\frac{1}{N} \sum _{k=1}^N (\tilde{\varvec{i}}_k-\varvec{i}_k)^2,\\&L_{similarity}=\frac{1}{N} \sum _{k=1}^N \left( 1-\frac{\tilde{\varvec{i}}_k\cdot \varvec{i}_k}{\left\| \tilde{\varvec{i}}_k\right\| \left\| \varvec{i}_k\right\| }\right) ,\\&L_{contrast}=-\frac{1}{N} \sum _{k=1}^N \log \frac{\exp (\text {sim}(\tilde{\varvec{i}}_k, \varvec{i}_k))}{\sum _{j=1}^N \exp (\text {sim}(\tilde{\varvec{i}}_k, \varvec{i}_j))}, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{i}\) indicates the image encoding inferred from the real image and \(\tilde{\varvec{i}}\) is the mapping network output. \(\text {sim}(.,.)\) denotes the cosine similarity in our paper.
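A minimal PyTorch sketch of Eq. (1) is given below; it assumes batched encodings of shape (N, d) and expresses the contrastive term as a cross-entropy, which matches the negative log-softmax in Eq. (1) up to the batch average.

```python
import torch
import torch.nn.functional as F

def mapping_loss(i_tilde, i):
    """Eq. (1): align the mapped encodings i_tilde with the real CLIP image
    encodings i via MSE, cosine-similarity, and contrastive terms.
    Both tensors have shape (N, d)."""
    mse = F.mse_loss(i_tilde, i)  # averaged over all elements (constant factor w.r.t. Eq. 1)
    sim = (1 - F.cosine_similarity(i_tilde, i, dim=-1)).mean()
    logits = F.normalize(i_tilde, dim=-1) @ F.normalize(i, dim=-1).t()  # sim(i_tilde_k, i_j)
    labels = torch.arange(i.size(0), device=i.device)
    contrast = F.cross_entropy(logits, labels)  # -log softmax over the batch, averaged
    return mse + sim + contrast
```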

Fig. 2

Architecture of our generator. \( \bigotimes \) denotes the concatenation operation

Fig. 3

Architecture of our matching discriminator

Fig. 4

We pre-train a text-to-image mapping network to convert the CLIP text encoding \(\varvec{t}\) into an approximate CLIP image encoding \(\tilde{\varvec{i}}\) in the CLIP shared latent space

Fig. 5

Architecture of our text-to-image mapping network

Fig. 6

Overview of our attribute-centric feature augmentation (for CUB (Wah et al., 2011)). We construct an attribute library and replace the attributes in the text with those randomly sampled from the library

Fig. 7

Update paradigm for discriminators in (1) fully supervised training and (2) image-free training. The prompt with blue attributes is ground truth text in the training set, while the prompt with red attributes is produced by replacing the attributes in the left prompt with other random attributes. Map indicates our pre-trained text-to-image mapping network. We highlight the activated modules in green (Color figure online)

3.3 Attribute-Centric Feature Augmentation

To improve the quality and text correspondence of images generated from underrepresented attribute compositions, an intuitive idea is to add more training samples for these compositions. Although image augmentation is almost impossible, it is feasible to augment the text. We generate text with underrepresented attribute compositions by randomly selecting attributes to form prompts (for CelebA-HQ (Lee et al., 2020)) or by replacing attributes in the training prompts with other randomly sampled attributes (for CUB (Wah et al., 2011)). The sampling weights are the same for all attributes, e.g., the probability of a woman/man wearing lipstick is 50% each in our sampling, which breaks the strong compositions in the real data. The generated prompts are encoded by the CLIP text encoder, and our text-to-image mapping network transforms the text encodings into approximate image encodings. These augmented feature pairs are utilized in the image-free training. The attribute-centric feature augmentation pipeline for CUB is shown in Fig. 6.
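The following sketch (with illustrative attribute pools; the full CelebA-HQ keyword groups are listed in Sect. 4.3) shows how an augmented prompt could be sampled and turned into a feature pair using the CLIP text encoder and the pre-trained text-to-image mapping network, here called t2i_mapper, which is an assumed name.

```python
import random
import torch
import clip  # OpenAI CLIP package

GENDER = ["he", "she"]                                    # illustrative subset
APPEARANCE = ["wearing earrings", "with black hair",
              "with a goatee", "smiling"]                 # illustrative subset

def augment_prompt(num_attrs=2):
    """Uniformly sample a gender keyword and appearance attributes,
    breaking the attribute co-occurrence statistics of the real data."""
    subject = random.choice(GENDER)
    attrs = random.sample(APPEARANCE, k=num_attrs)
    return f"{subject} is {' and '.join(attrs)}"

@torch.no_grad()
def augmented_feature_pair(prompt, clip_model, t2i_mapper, device="cpu"):
    """Return (t, i_tilde): the CLIP text encoding of the augmented prompt
    and the approximate image encoding from the mapping network."""
    tokens = clip.tokenize([prompt]).to(device)
    t = clip_model.encode_text(tokens).float()
    i_tilde = t2i_mapper(t)
    return t, i_tilde
```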

3.4 Attribute-Centric Contrastive Loss

To prevent overfitting to overrepresented attribute compositions and to disentangle their distributions, we propose an attribute-centric contrastive loss that leverages the generality and transferability of CLIP. We first finetune the pre-trained CLIP in the conventional manner, except that the training data are not text-image pairs: in each training iteration, we randomly sample an attribute from the text and pair it with the corresponding image, replacing the text-image pair. To distinguish this CLIP, finetuned with attribute-image pairs, from the text-image CLIP, we call it CLIP-A. Given a batch of N text-image pairs, we randomly extract an attribute A from each text T. The attribute-centric contrastive loss for the k-th attribute-image pair is formulated as,

$$\begin{aligned} \begin{aligned} L_{attr}=&- \log \frac{\exp ( \text {sim}(E_{\text {CLIP-A}}^{\text {img}}(I_k), E_{\text {CLIP-A}}^{\text {txt}}(A_k)))}{\sum _{j=1}^N \exp ( \text {sim}(E_{\text {CLIP-A}}^{\text {img}}(I_k), E_{\text {CLIP-A}}^{\text {txt}}(A_j)))} \\&- \log \frac{\exp ( \text {sim}(E_{\text {CLIP-A}}^{\text {img}}(I_k), E_{\text {CLIP-A}}^{\text {txt}}(A_k)))}{\sum _{j=1}^N \exp ( \text {sim}(E_{\text {CLIP-A}}^{\text {img}}(I_j), E_{\text {CLIP-A}}^{\text {txt}}(A_k)))}, \end{aligned} \end{aligned}$$
(2)

where I indicates the image, and \(E_{\text {CLIP-A}}^{\text {img}}\) and \(E_{\text {CLIP-A}}^{\text {txt}}\) are the image encoder and text encoder of CLIP-A. This loss function is used in the adversarial training to capture the independent attribute distributions and prevent the generative model from over-memorizing the popular attribute compositions.
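A batched PyTorch sketch of Eq. (2) could look as follows; the embeddings are assumed to be precomputed with CLIP-A, and the loss is averaged over the batch rather than written per sample.

```python
import torch
import torch.nn.functional as F

def attribute_contrastive_loss(img_emb, attr_emb):
    """Eq. (2), averaged over the batch: symmetric contrastive loss between
    CLIP-A embeddings of the generated images (N, d) and of the sampled
    attributes (N, d), with cosine similarity as sim(.,.)."""
    img_emb = F.normalize(img_emb, dim=-1)
    attr_emb = F.normalize(attr_emb, dim=-1)
    logits = img_emb @ attr_emb.t()                 # sim(I_k, A_j)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    # image-to-attribute and attribute-to-image directions
    return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)
```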

Fig. 8

\(\ell ^2\) norms of the text encoding and image encoding from the pre-trained CLIP are essentially identical but increase at different rates during finetuning. We introduce a norm penalty to keep the norms at the same level, which facilitates our text-to-image mapping

3.5 Training Schemes

Since it is extremely difficult to directly augment text-image pairs of underrepresented attribute compositions as required for standard adversarial training, we train our generative model by alternating between two paradigms: (1) fully supervised training, which uses text-image pairs in the training set, and (2) image-free training, which uses the feature pairs resulting from the attribute-centric feature augmentation. The model thus learns the distributions of both the real data and the underrepresented attribute compositions. The training schedule is summarized in Algorithm 1.

For both training paradigms, the generator is updated in the same way. Given a batch of N texts, the standard unconditional loss for the generator is,

$$\begin{aligned} \begin{aligned} L_{G}^f= \frac{1}{N} \sum _k^N \zeta (-D_f(\hat{I_k})), \end{aligned} \end{aligned}$$
(3)

where \(\zeta \) denotes the softplus function and \(\hat{I}\) denotes the generator output. To match the generated images to the input text, we introduce a matching loss using the matching discriminator \(D_m\),

$$\begin{aligned} \begin{aligned} L_{G}^m= \frac{1}{N} \sum _k^N(1 - D_m(\varvec{t}_k, \hat{\varvec{i}}_k)), \end{aligned} \end{aligned}$$
(4)

where \(\varvec{t}\) and \(\hat{\varvec{i}}\) are the text and generated image embeddings respectively provided by the CLIP text encoder \(E_{\text {txt}}\) and image encoder \(E_{\text {img}}\). Furthermore, we adopt the CLIP-guided contrastive loss from Li et al. (2022); Zhou et al. (2021),

$$\begin{aligned} \begin{aligned} L_{G}^{const}&=-\frac{1}{N} \sum _{k=1}^N \left( \log \frac{\exp ( \text {sim}(\hat{\varvec{i}}_k, \varvec{t}_k))}{\sum _{j=1}^N \exp ( \text {sim}(\hat{\varvec{i}}_k, \varvec{t}_j))}\right. \\&\quad \left. + \log \frac{\exp ( \text {sim}(\hat{\varvec{i}}_k, \varvec{t}_k))}{\sum _{j=1}^N \exp ( \text {sim}(\hat{\varvec{i}}_j, \varvec{t}_k))}\right) . \end{aligned} \end{aligned}$$
(5)

In order to align the generated image features with the attribute features in CLIP-A feature space, we integrate the attribute-centric contrastive loss \(L_{attr}\). The complete objective function for updating the generator is,

$$\begin{aligned} \begin{aligned} L_{G}= L_G^f+L_G^m+L_G^{const}+L_{attr}. \end{aligned} \end{aligned}$$
(6)
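Putting Eqs. (3)–(6) together, the generator update could be sketched as follows; all names are placeholders, and the discriminator and CLIP outputs are assumed to be precomputed for the batch.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_f_fake, d_m_score, i_hat, t, l_attr):
    """Eqs. (3)-(6): d_f_fake = D_f(G(z, t)) logits of shape (N,),
    d_m_score = D_m(t, i_hat) of shape (N,), i_hat / t = CLIP embeddings of
    the generated images and input texts of shape (N, d), and l_attr = the
    attribute-centric contrastive loss (scalar)."""
    l_f = F.softplus(-d_f_fake).mean()                                    # Eq. (3)
    l_m = (1 - d_m_score).mean()                                          # Eq. (4)
    sims = F.normalize(i_hat, dim=-1) @ F.normalize(t, dim=-1).t()
    labels = torch.arange(sims.size(0), device=sims.device)
    l_const = F.cross_entropy(sims, labels) + F.cross_entropy(sims.t(), labels)  # Eq. (5)
    return l_f + l_m + l_const + l_attr                                   # Eq. (6)
```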

In fully supervised training and image-free training, the discriminators are updated in different ways (see Fig. 7). For fully supervised training, given a batch of N text-image pairs, the loss function for updating the fidelity discriminator is calculated as,

$$\begin{aligned} \begin{aligned} L_{D}^f= \frac{1}{N} \sum _{k=1}^N (\zeta (-D_f(I_k)) + \zeta (D_f(\hat{I_k}))), \end{aligned} \end{aligned}$$
(7)

while for the matching discriminator,

$$\begin{aligned} \begin{aligned} L_{D}^m= \frac{1}{N} \sum _{k=1}^N (1 - D_m(\varvec{t}_k, \varvec{i}_k) + D_m(\varvec{t}_{k*}, \hat{\varvec{i}}_{k})), \end{aligned} \end{aligned}$$
(8)

where \(\varvec{t}_{k}\) is the CLIP text encoding of the k-th prompt, while \(\varvec{t}_{k*}\) is the CLIP text encoding of a mis-matched prompt in the batch. Therefore, the complete objective function for updating the discriminators in fully supervised training is computed as,

$$\begin{aligned} \begin{aligned} L_{D}= L_{D}^f+L_{D}^m. \end{aligned} \end{aligned}$$
(9)

For image-free training, only attribute-centric augmented text is available. Therefore, we skip the first term in Eq. 7 and update the fidelity discriminator only on the generated images,

$$\begin{aligned} \begin{aligned} \tilde{L_{D}^f}= \frac{1}{N} \sum _{k=1}^N \zeta (D_f(\hat{I_k})). \end{aligned} \end{aligned}$$
(10)

For updating the matching discriminator, we use the pre-trained text-to-image mapping network to transform the text encodings \(\varvec{t}\) of augmented text to approximate image encodings \(\tilde{\varvec{i}}\). The matching discriminator loss is formulated as,

$$\begin{aligned} \begin{aligned} \tilde{L_{D}^m}= \frac{1}{N} \sum _{k=1}^N (1 - D_m(\varvec{t}_k, \tilde{\varvec{i}_k}) + D_m(\varvec{t}_{k*}, \hat{\varvec{i}}_{k})). \end{aligned} \end{aligned}$$
(11)

The complete objective function for updating the discriminators in image-free training is presented as,

$$\begin{aligned} \begin{aligned} \tilde{L_{D}}= \tilde{L_{D}^f}+\tilde{L_{D}^m}. \end{aligned} \end{aligned}$$
(12)
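The two discriminator updates differ only in whether the real-image term of the fidelity loss is present and in which image encodings feed the matching discriminator. A unified sketch of Eqs. (7)–(12) is given below; drawing the mis-matched prompts by rolling the batch is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_f, d_m, t, i_hat, fake_img,
                       real_img=None, i_real=None, i_tilde=None):
    """Eqs. (7)-(12). Fully supervised mode: pass real_img and i_real.
    Image-free mode: pass i_tilde (mapped encodings) and leave real_img as
    None, which skips the real-image term of the fidelity loss.
    fake_img is assumed to be detached from the generator graph."""
    t_mis = torch.roll(t, shifts=1, dims=0)              # mis-matched prompts in the batch
    l_f = F.softplus(d_f(fake_img)).mean()               # shared term of Eqs. (7) / (10)
    if real_img is not None:
        l_f = l_f + F.softplus(-d_f(real_img)).mean()    # completes Eq. (7)
    i_pos = i_real if i_real is not None else i_tilde    # Eq. (8) vs. Eq. (11)
    l_m = (1 - d_m(t, i_pos) + d_m(t_mis, i_hat)).mean()
    return l_f + l_m                                     # Eq. (9) / Eq. (12)
```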
Algorithm 1

Training schedule of ACTIG

3.6 Finetuned CLIP

We finetune the pre-trained CLIP (Radford et al., 2021) separately with text-image pairs and attribute-image pairs. The CLIP trained with text-image pairs provides the text and image encodings for the generator, the matching discriminator, the text-to-image mapping network, and the CLIP-guided contrastive loss. The model finetuned with attribute-image pairs, CLIP-A, is used for our attribute-centric contrastive loss. However, we observe that the \(\ell ^2\) norms of the text encoding and image encoding increase at different rates during finetuning, as shown in Fig. 8. This complicates the projection of text encodings into the image encoding space and makes it difficult for the text-to-image mapping network to infer approximate image encodings. Therefore, we add a norm penalty to the objective function to keep the norms of the text encoding and image encoding at the same level,

$$\begin{aligned} \begin{aligned} L_{\text {norm}}=\sigma (\Vert E_{\text {img}}(I)\Vert _2-\tau ) + \sigma (\Vert E_{\text {txt}}(T)\Vert _2-\tau ), \end{aligned} \end{aligned}$$
(13)

where \(\sigma \) denotes the ReLU operation and \(\tau \) is a threshold hyperparameter for the norm penalty.
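A minimal sketch of Eq. (13), averaged over a batch of CLIP embeddings, could look as follows (with \(\tau = 10\) as set in Sect. 4.3):

```python
import torch
import torch.nn.functional as F

def norm_penalty(img_emb, txt_emb, tau=10.0):
    """Eq. (13): penalize CLIP image/text embedding norms exceeding the
    threshold tau during finetuning, so that both modalities stay at the
    same norm level. Inputs have shape (batch, d); the per-sample penalties
    are averaged over the batch."""
    return (F.relu(img_emb.norm(dim=-1) - tau) +
            F.relu(txt_emb.norm(dim=-1) - tau)).mean()
```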

4 Experiments

4.1 Datasets

We train and validate our model on two datasets, CelebA-HQ (Lee et al., 2020) and CUB (Wah et al., 2011). CelebA-HQ is a large-scale face dataset with facial attributes. We use the data split and text annotations proposed by Xia et al. (2021a) and Li et al. (2022), with 23.4k images for training and 1.9k images for testing. CUB is a dataset of 11.8k images covering 200 bird species. For text annotations, we use the captions provided by Reed et al. (2016). Note that only captions with unseen attribute compositions are retained in the test set to evaluate compositional ability. The settings from StyleT2I (Li et al., 2022) are adopted for a fair comparison.

4.2 Evaluation Metrics

FID. We adopt Fréchet Inception Distance (FID) (Heusel et al., 2017) to evaluate the quality of the generated images. FID computes the distance between the feature distributions of the generated images and real images. A lower number denotes that the synthetic images are more realistic.

R-Precision. We adopt R-Precision (Xu et al., 2018) to evaluate the consistency between the input text and the output images. R-Precision calculates the top-1 retrieval accuracy when using the generated image as a query to retrieve the matching text from K candidate texts. If not otherwise specified, the default value of K is 100. We follow Park et al. (2021) and calculate R-Precision using a CLIP finetuned on the whole dataset, which has been demonstrated to be closer to human perception.
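For concreteness, the metric could be computed as in the sketch below; the candidate construction (index 0 holds the ground-truth caption) and the embedding shapes are assumptions, with embeddings taken from the evaluation CLIP.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def r_precision(img_emb, cand_txt_emb):
    """Top-1 retrieval accuracy. img_emb: (M, d) embeddings of generated
    images; cand_txt_emb: (M, K, d) embeddings of K candidate captions per
    image, with the matching caption at index 0."""
    img = F.normalize(img_emb, dim=-1).unsqueeze(1)      # (M, 1, d)
    txt = F.normalize(cand_txt_emb, dim=-1)              # (M, K, d)
    sims = (img * txt).sum(-1)                           # cosine similarities, (M, K)
    return (sims.argmax(dim=-1) == 0).float().mean().item()
```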

User study. Although the above model-based evaluation metrics can indicate the quality of the generated images and text-image consistency, they cannot be fully equated with human perception. Therefore, we invited 12 participants to perform the user study on both datasets to estimate image quality and text-image consistency as Li et al. (2022) did. Given an input text, participants were requested to rank the images synthesized by different models on image quality and image-text consistency respectively.

Fig. 9

Visualization of the dependency parse for the text, “the long beaked bird has a white body with long brown wings”. The dependency matcher localizes the attributes in the text using predefined matcher patterns

4.3 Implementation Details

GAN details. We adopt the generator and discriminator of StyleGAN2 (Karras et al., 2020). The input dimension of the generator is 1024, since we concatenate the CLIP text encodings with the latent codes as input. The resolution of the generated images is set to \(256\times 256\). We integrate the pre-trained CLIP encoders into the matching discriminator; they are frozen while training the GAN. For both datasets, we train the generator and discriminators from scratch on 8 GPUs with Adam (Kingma & Ba, 2014), setting the batch size to 8 per GPU. We alternate between fully supervised training and image-free training: three out of every four iterations use fully supervised training, and the fourth uses image-free training. The text-to-image mapping network, CLIP, and CLIP-A are frozen at this stage.

Text-to-image mapping network details. The text-to-image mapping network consists of three linear transformation layers with GeLU activations and residual connections between the layers. The input text encodings and target image encodings are provided by the finetuned CLIP. We train the text-to-image mapping network with Adam (Kingma & Ba, 2014), setting the batch size to 128 (for CelebA-HQ) and 256 (for CUB).
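A sketch of this architecture is given below; the exact placement of the batch normalization mentioned in Sect. 3.2 and the residual wiring are assumptions.

```python
import torch.nn as nn

class TextToImageMapper(nn.Module):
    """Three linear layers with GELU activations and residual connections;
    input and output dimensions equal the CLIP encoding dimension."""
    def __init__(self, dim=512):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.GELU())
            for _ in range(3)
        ])

    def forward(self, t):
        x = t
        for block in self.blocks:
            x = x + block(x)      # residual connection around each layer
        return x                  # approximate CLIP image encoding i_tilde
```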

Attribute-centric feature augmentation. To compensate for the feature distribution of underrepresented attribute compositions, we first generate texts containing underrepresented attribute compositions, while the requirement for images is relaxed by mapping CLIP text features into the CLIP image space. For the CelebA-HQ dataset, the original captions are generated by a probabilistic context-free grammar based on the known attribute labels. We first collect the keywords of these attribute labels and group them into two categories:

  • Gender: he, man, she, woman.

  • Appearance: arched eyebrows, bags under eyes, bangs, big lips, big nose, black hair, blond hair, brown hair, bushy eyebrows, double chin, goatee, gray hair, high cheekbones, mouth slightly open, mustache, narrow eyes, oval face, pale skin, pointy nose, receding hairline, rosy cheeks, sideburns, straight hair, wavy hair, bald, chubby, smiling, young, eyeglasses, heavy makeup, earrings, hat, lipstick.

We synthesize 10k augmented prompts which are further used in the image-free training. When generating a prompt, the gender is first randomly determined, and then we randomly sample two to six attributes of appearance. The sampled attributes are composed into a prompt according to the syntax rules, for example, “the man has gray hair and straight hair, and he wears lipstick”.
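A simplified stand-in for this sampling procedure is sketched below; the template is an assumption rather than the actual probabilistic context-free grammar, and the appearance list is only an excerpt of the keyword groups above.

```python
import random

GENDER = ["man", "woman"]
APPEARANCE = ["gray hair", "straight hair", "arched eyebrows", "big nose",
              "rosy cheeks", "eyeglasses", "earrings", "lipstick"]  # excerpt

def synthesize_celeba_prompt():
    """Randomly determine the gender, then sample two to six appearance
    attributes with uniform weights and compose them into a prompt."""
    noun = random.choice(GENDER)
    attrs = random.sample(APPEARANCE, k=random.randint(2, 6))
    return f"the {noun} has {', '.join(attrs[:-1])} and {attrs[-1]}"
```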

Different from CelebA-HQ, the captions in the CUB dataset are written by human annotators. To generate captions with underrepresented attribute compositions, we first design an attribute parser based on the dependency matcher implemented in spaCy (see Fig. 9). We use the attribute parser to extract attributes from the prompts in the training set, keeping the noun in each attribute unchanged and replacing the adjective. We divide the adjectives appearing in the dataset into color and shape following Park et al. (2021), and create the attribute library:

  • Color: brown, white, yellow, dark, gray, grey, black, red, rusty, beige, maroon, orange, green, iridescent, lime, pink, pale, purple, blue, taupe, gold, bronze, amber, magenta, silver, lightbrown, flittery, violet, teal, crimson, olive, creamy, metallic, azure, turquoise, indigo, chocolate, ruby, bluegreen, mauve, tawny, ivory, ash, khaki, scarlet, cyan, lemon, rosy, coppery, peachy, blond, earthtone, inky, opalescent, tan.

  • Shape: small, little, short, pointy, narrow, large, long, straight, medium, curved, pointed, thin, tiny, sharp, curving, skinny, stout, chubby, tall.

We also synthesize 10k augmented prompts for the CUB dataset. During synthesis, the colors in the text are replaced with other randomly sampled colors, and the shapes are likewise replaced with new shapes. For example, based on the training prompt “the long beaked bird has a white body with long brown wings”, we replace the attributes and obtain the new prompt “the tiny beaked bird has a blond body with tiny blue wings”.

Finetuned CLIP details. To ensure that ACTIG works properly, we finetune the pre-trained CLIP (ViT-B/32) with text-image pairs and attribute-image pairs from the training splits of CelebA-HQ and CUB. Following Li et al. (2022), only the last few layers of CLIP are finetuned. The hyperparameter \(\tau \) in Eq. 13 is set to 10. Furthermore, we finetune the pre-trained CLIP with text-image pairs from the full CelebA-HQ and CUB datasets, denoted as CLIP-Eval. Note that CLIP-Eval is not involved in any training and is only used for calculating R-Precision.

4.4 Quantitative Results and Comparison

Table 1 Results of text-to-image generation on CelebA-HQ and CUB datasets
Table 2 R-Precision of ACTIG on the CelebA-HQ and CUB datasets for different number of attributes in the input text

Table 1 shows a comparison of advanced text-to-image generation methods on the CelebA-HQ and CUB datasets. On both datasets, our framework, ACTIG, achieves state-of-the-art performance. In terms of R-Precision, ACTIG is 2.5 and 2.3 points higher than the compositional text-to-image generation model StyleT2I (Li et al., 2022) on CelebA-HQ and CUB, respectively. In addition, our end-to-end trained ACTIG has a significant advantage in image quality over StyleT2I, which operates on a pre-trained StyleGAN2. We also reproduce the language-free model Lafite (Zhou et al., 2021) on the two datasets. On both datasets, the end-to-end trained Lafite has higher FID scores than StyleT2I; however, since it does not consider compositional generalization, its generated images do not match the input text as well as those of StyleT2I. Our end-to-end ACTIG focuses on text-image consistency while keeping the output images high-fidelity. The FID score of ACTIG on the CelebA-HQ dataset is very close to that of TediGAN-B, and most of the other models also have FID between 15 and 18. This contradicts the human evaluation of image quality in the user study. We conjecture that this contradiction may be caused by the small number of images in the test split, whose feature distribution differs from the training split.

To demonstrate the attribute compositional generalization of ACTIG, we evaluate text-image consistency given input prompts with different numbers of attributes (see Table 2). R-Precision is low when the number of attributes in the input text is small, since brief descriptions lead to a lack of distinctive semantic features in the generated images. R-Precision increases as the number of constraints in the text increases, showing that ACTIG can generate images that match text with complex attribute compositions. When the number of attributes exceeds 6, there is a slight drop in R-Precision, possibly due to overly complex semantic information.

Our framework, ACTIG, synthesizes underrepresented attribute compositions through attribute augmentation, which randomly combines attributes that exist in the training data. A complete statistical analysis of attribute compositions is rather tedious, since the compositions are highly flexible and diverse: a composition can contain anywhere from 2 to 10 attributes, and, for example, there are more than 10.5K attribute compositions in CelebA-HQ, which we cannot list exhaustively. To better demonstrate the effect of attribute augmentation, we visualize the distribution of 5 common and 5 underrepresented attribute compositions in the training data before and after augmentation, as shown in Fig. 10.

Fig. 10

Distribution of 10 attribute compositions before and after augmentation for CelebA-HQ

4.5 Ablation Studies

Table 3 Ablation study for the components of \(L_{mapping}\) for the text-to-image mapping network on CUB
Table 4 Results for the ablation study of ACTIG on the CUB dataset

Text-to-Image Mapping Loss. The text-to-image mapping network plays a critical role in our framework as it forms the basis for attribute-centric feature augmentation and image-free training. We explore the impact of each component from the objective function \(L_{mapping}\) on image generation performance, which further demonstrates the effectiveness of our text-to-image mapping network. Our results, presented in Table 3, demonstrate the impact of each term when activated, with the best fidelity and image-text consistency (R-Precision) achieved when all terms are utilized. Notably, the mapping loss has a more significant impact on text-image consistency. In addition, we evaluate the performance when the mapping network is not optimized, and the result shows a rapid drop, providing further confirmation of the effectiveness of the mapping network.

Table 5 Performance of ACTIG using different CLIP models

Matching discriminator. Many previous works (Ruan et al., 2021; Zhou et al., 2021; Liao et al., 2022; Tao et al., 2022) use a shared discriminator backbone to estimate fidelity and text-image consistency simultaneously. The discriminator backbone provides the image feature to two sub-networks, one of which converts the features into a scalar representing the truthfulness, and the other combines the feature and text encoding to output the degree of text-image consistency. In contrast, we adopt an independent CLIP-based matching discriminator \(D_m\). We compare the model performance using the shared discriminator backbone (first row in Table 4) and independent \(D_m\) (second row in Table 4) on the CUB dataset. The results show that both image quality and text-image consistency are significantly improved with the independent \(D_m\).

Attribute-Centric Contrastive Loss. We further validate the effect of the attribute-centric contrastive loss \(L_{attr}\) (third row in Table 4). With this loss function, the quality of the generated images improves and the FID decreases by 1.1, while R-Precision increases from 21.8 to 25.6. This demonstrates that the attribute-centric contrastive loss helps the generator capture an independent feature distribution for each attribute during training, while having little impact on the overall feature distribution of the images.

Fig. 11

Qualitative results of different components in the ablation studies

Fig. 12

User study interface for image quality and text-image consistency evaluation: one text-image group for CelebA-HQ

Fig. 13

Score distributions of the user study on CelebA-HQ and CUB

Image-free Training. We activate the image-free training, which is supported by the attribute-centric feature augmentation. The fourth row in Table 4 reports the performance of the full ACTIG. The compositional generalization of ACTIG is further improved by introducing new attribute compositions and the corresponding approximate image encodings into the image-free training. The FID score does not change significantly since no additional real images are introduced.

CLIP finetuning. To verify the impact of CLIP on the text-to-image mapping in Sect. 3.2 and on the overall framework, we test three different CLIP models, namely the pre-trained CLIP (ViT-B/32), CLIP finetuned without the norm penalty, and CLIP finetuned with the norm penalty. The results are shown in Table 5. On both datasets, ACTIG using the pre-trained CLIP has the lowest performance, while ACTIG using CLIP finetuned with the norm penalty achieves the best R-Precision, since the text-to-image mapping is facilitated. Furthermore, whether or not CLIP is finetuned has little influence on FID.

Table 6 Average scores in terms of image quality (Quality) and text-image consistency (Consistency) on the CelebA-HQ and CUB datasets

Qualitative results of ablation studies. To demonstrate the effectiveness of different components in our framework, we provide the qualitative results of the ablation studies in Fig. 11. Four samples are randomly selected, with two from the CUB dataset (top two rows) and two from the CelebA-HQ dataset (bottom two rows). We first ablate the attribute-centric contrastive loss from ACTIG (left); for example, the person in the last row is not young and has no bags under the eyes. We also ablate the image-free training (in which case the attribute-centric feature augmentation is also deactivated); the person in the last row (middle) only presents the lipstick and bags under the eyes, while the person in the last row generated by the full model (right) is of good quality and consistent with the given attribute composition.

Fig. 14

The qualitative results on the CelebA-HQ dataset

Fig. 15

The qualitative results on the CUB dataset

4.6 Qualitative Results

The qualitative results on the CelebA-HQ dataset and the CUB dataset are shown in Figs. 14 and 15, respectively. Our method, ACTIG, outperforms other state-of-the-art models in terms of image quality and text-image consistency on the CelebA-HQ and CUB datasets. For ControlGAN (Li et al., 2019a) and DAE-GAN (Ruan et al., 2021), the output images lack fidelity when the input prompts are complex. TediGAN (Xia et al., 2021a, b) can output high-fidelity images, but sometimes the generated images and the input text do not match at all. Lafite (Zhou et al., 2021) has better compositional generalization, but it sometimes confuses the attributes, for example, reversing the colors of the bird’s head and body; we conjecture that this is overfitting caused by the overrepresented attribute compositions. Compared to StyleT2I (Li et al., 2022), which uses a pre-trained generator, the images generated by ACTIG not only accurately match the input text but also have higher fidelity.

We further compare with a popular diffusion-based model, Stable Diffusion 2.1 (Rombach et al., 2022). The comparison results are shown in Fig. 16. We observe that such diffusion models, despite their heavy resource consumption, do not have a strong advantage on specific datasets in accurately representing underrepresented attribute compositions. In addition, our GAN-based model has a much faster inference speed than diffusion models. Note that popular diffusion models are typically trained on large-scale datasets, e.g., LAION (Schuhmann et al., 2021). Therefore, attribute compositions that are underrepresented for our model may not be underrepresented for the diffusion models.

Fig. 16

Qualitative comparison results of our model and Stable Diffusion 2.1

4.7 User Study

We obtain the text and images used in the user study of Li et al. (2022) and add the images generated by Lafite (Zhou et al., 2021) and ACTIG. 12 participants with different backgrounds are invited to evaluate 40 groups of images (20 for CelebA-HQ and 20 for CUB), each containing seven images generated by seven models from the same text. One text-image group for CelebA-HQ is shown in Fig. 12. Note that the attribute combinations in the test set are unseen by the model during training; even though human perception finds some of these combinations common, they are rare for our model (never seen before).

Each participant sees two types of questions:

  1. Rank the alignment between the image and the given caption. When answering this type of question, the participant is asked to focus on the semantic similarity between the caption and image.

  2. Rank the image quality (how close the generated image is to the real image). When answering this type of question, the participant is asked to focus on the quality of the image (e.g., fidelity, blur, artifacts) instead of the semantic similarity with the caption.

In both types of questions, 1 means the "worst", and 7 represents the "best". Note that one score can only be assigned to one image. The average scores of different models are shown in Table 6. The score distributions on the CelebA-HQ and CUB are shown in Fig. 13. ACTIG receives higher ranking scores for both image quality and image-text consistency.

5 Conclusion and Future Work

We propose a novel attribute-centric compositional text-to-image generation framework, ACTIG, which achieves compositional generalization for both underrepresented and overrepresented attribute compositions. To improve the generalization of underrepresented attribute compositions, we introduce attribute-centric text feature augmentation and image-free training. To overcome the bias of overrepresented attribute compositions, an attribute-centric contrastive loss is proposed to learn the independent attribute distributions through adversarial training. ACTIG achieves state-of-the-art results in terms of image fidelity and text-image consistency. Our framework can be extended similarly to improve compositional generalization of multiple foreground entities, which is our future direction to promote the robustness of generative models.

Our framework relies on the CLIP model to distinguish between various attributes, making its overall performance dependent on CLIP’s ability to accurately understand and encode these attributes. This reliance may limit the model’s effectiveness with complex or nuanced attributes that are not well-represented in CLIP’s training data. A possible solution is to apply advanced CLIP-like models being developed by the community, e.g., Florence (Yuan et al., 2021) or Flamingo (Alayrac et al., 2022), to mitigate this dependency.

Extending our approach to more diverse and complex large-scale datasets presents several challenges. Characterizing and mitigating the sparsity of attribute combinations in these large datasets is particularly difficult, and the abundance of text-image pairs and rich attributes further complicates data analysis. Although our model effectively handles attribute-centric compositions, additional research is needed to address more nuanced textual descriptions and their representations in images.