FocusDiff

Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

1Zhejiang University, 2Ant Group, 3Moonshot AI
*Equal Contribution

Introduction

Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment and thus fail to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, and further introduce a novel reinforcement learning algorithm that emphasizes such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.

(Framework overview figure)


FocusDiff addresses the challenge of fine-grained semantic control in AR-based image generation. While AR models excel at capturing global semantics, they often struggle with subtle local distinctions. FocusDiff enhances text-to-image alignment through two main innovations:

1. FocusDiff-Data: A curated dataset of paired prompts and images with subtle semantic variations.


2. Pair-GRPO: A novel RL algorithm extending Group Relative Policy Optimization to emphasize fine-grained semantic differences during training.

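At its core, Pair-GRPO inherits GRPO's group-relative credit assignment: several images are sampled per prompt, each receives a scalar reward, and each sample's advantage is its reward normalized by its own group's statistics, with no learned value function. A minimal sketch of that baseline step (the pairing-specific extensions are not reproduced here, and `group_relative_advantages` is our own illustrative name):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: compare each sampled image's reward to
    the mean of its sampling group, scaled by the group's std.

    rewards: list of scalar rewards for one group of samples drawn
    from the same prompt.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples scored equally: no relative learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because the baseline is the group mean rather than a critic's estimate, samples that beat their siblings get positive advantage and the rest get negative advantage; Pair-GRPO applies this signal while training on paired prompts so that the reward contrast highlights the fine-grained semantic difference.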

On this basis, we propose a new benchmark, PairComp. Each test case in PairComp contains two similar prompts with subtle differences. By comparing the accuracy of the images the model generates for each prompt, we evaluate whether it has attended to the fine-grained semantic differences between the prompts and produced the corresponding correct images. The two prompts in a test case exhibit word-level differences that lead to noticeable distinctions in certain fine-grained semantic aspects. As shown in the following figure, these differences can be categorized into six types: (1) Overall appearance difference; (2) Color difference; (3) Counting difference; (4) Position difference; (5) Style & Tone difference; (6) Text difference.
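Since a test case is only truly solved when the model gets both variants of a pair right, a natural way to score it is to report accuracy averaged over all prompts alongside a stricter per-pair score. The exact PairComp metric is not specified in this section; the sketch below is one plausible scoring scheme under that assumption, with `paircomp_scores` as a hypothetical helper:

```python
def paircomp_scores(pair_results):
    """pair_results: list of (acc1, acc2) tuples, one per test case,
    where acc1/acc2 in [0, 1] are the accuracies of the images
    generated for the two paired prompts.

    Returns (avg, both):
      avg  -- mean accuracy over all individual prompts;
      both -- mean of per-pair minimums, crediting a pair only to
              the extent BOTH fine-grained variants are rendered
              correctly.
    """
    n = len(pair_results)
    avg = sum(a + b for a, b in pair_results) / (2 * n)
    both = sum(min(a, b) for a, b in pair_results) / n
    return avg, both
```

The gap between `avg` and `both` exposes exactly the failure mode PairComp targets: a model can score well on prompts in isolation while failing to distinguish the two members of a pair.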



Main Results

Automated Metric Evaluation

We employ Janus-Pro as the backbone to develop Janus-FocusDiff, which excels at text-to-image generation with improved vision-language alignment. The following two tables compare it against both diffusion-based and MLLM-based methods on PairComp, GenEval, T2I-CompBench, and DPG-Bench.



Qualitative Examples

We also present a direct qualitative comparison between Janus-FocusDiff-7B and Janus-Pro-7B on pairs of similar prompts with fine-grained semantic differences. For each prompt, we ask each model to generate two images. Janus-Pro-7B struggles to precisely satisfy the fine-grained requirements of similar prompts; moreover, even for the same prompt, its generated images are not consistently aligned with the target semantics. In contrast, our Janus-FocusDiff-7B accurately captures the fine-grained semantic differences between prompts and stably produces high-quality images that meet the specified requirements.


Janus-FocusDiff can further generate images that accurately match counterfactual prompts rarely found in the real world. For instance, given the prompt "square watermelon", Janus-Pro-7B still generates a round one. In contrast, our Janus-FocusDiff-7B successfully generates a watermelon with this counterfactual shape. This indicates that our approach effectively mitigates hallucinated generation, reducing the erroneous bias toward the training distribution.


BibTeX