Recent studies extend the autoregressive paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment and thus fail to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, and further introduce a novel reinforcement learning algorithm that emphasizes such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
FocusDiff addresses the challenge of fine-grained semantic control in AR-based image generation. While AR models excel at capturing global semantics, they often struggle with subtle, fine-grained distinctions.
FocusDiff enhances text-to-image alignment through two main innovations:
1. Paired data: a new dataset of paired texts and images with similar overall expressions but distinct local semantics, which forces the model to attend to subtle differences.
2. Pair-GRPO: a novel RL algorithm extending Group Relative Policy Optimization (GRPO) to emphasize fine-grained semantic differences during training (see the sketch after this list).
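The exact Pair-GRPO update is not spelled out here, so the snippet below is only a minimal sketch of the group-relative advantage computation that GRPO-style training relies on, plus a hypothetical option to pool the reward baseline across the two prompts of a pair; the names `pair_grpo_advantages` and `pool_baseline` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pair_grpo_advantages(rewards_a: np.ndarray,
                         rewards_b: np.ndarray,
                         pool_baseline: bool = True,
                         eps: float = 1e-6):
    """Hypothetical pairing of the reward groups for two similar prompts (A, B).

    With pool_baseline=True, the two groups share one mean/std baseline, so an image
    is credited relative to samples from *both* prompts of the pair, which pushes the
    policy to react to their fine-grained semantic difference. This pooling rule is an
    illustrative assumption, not the paper's definition.
    """
    if pool_baseline:
        pooled = np.concatenate([rewards_a, rewards_b])
        mean, std = pooled.mean(), pooled.std() + eps
        return (rewards_a - mean) / std, (rewards_b - mean) / std
    return group_relative_advantages(rewards_a, eps), group_relative_advantages(rewards_b, eps)

# Example: 4 sampled images per prompt, rewards from a text-image alignment scorer.
adv_a, adv_b = pair_grpo_advantages(np.array([0.8, 0.6, 0.9, 0.4]),
                                    np.array([0.3, 0.5, 0.2, 0.4]))
```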
On this basis, we propose a new benchmark, PairComp. Each test case in PairComp contains two similar prompts with subtle differences. By comparing the accuracy of the images generated by the model for each prompt, we evaluate whether the model has focused on the fine-grained semantic differences in the prompts to produce the corresponding correct images. The two prompts in a test case exhibit word-level differences that lead to noticeable distinctions in certain fine-grained semantic aspects. As shown in the following figure, these differences fall into six categories: (1) Overall appearance; (2) Color; (3) Counting; (4) Position; (5) Style & tone; (6) Text.
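To make the benchmark structure concrete, the sketch below shows one way a PairComp test case and a paired-accuracy check could be represented; the field names, the `paired_accuracy` helper, and the "both prompts must be correct" rule are our assumptions for illustration, not the released benchmark's API or official metric.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# The six fine-grained difference categories described above.
CATEGORIES = ["appearance", "color", "counting", "position", "style_tone", "text"]

@dataclass
class PairCompCase:
    prompt_a: str   # e.g. "a red cup to the left of a book"
    prompt_b: str   # word-level edit, e.g. "a red cup to the right of a book"
    category: str   # one of CATEGORIES

def paired_accuracy(cases: List[PairCompCase],
                    generate: Callable[[str], Any],
                    is_correct: Callable[[Any, str], bool]) -> float:
    """Count a case as solved only if the images for *both* prompts are judged correct,
    i.e. the model actually reacted to the word-level difference (assumed rule)."""
    solved = 0
    for case in cases:
        ok_a = is_correct(generate(case.prompt_a), case.prompt_a)
        ok_b = is_correct(generate(case.prompt_b), case.prompt_b)
        solved += int(ok_a and ok_b)
    return solved / max(len(cases), 1)
```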
We employ Janus-Pro as the backbone and develop Janus-FocusDiff, which excels in text-to-image generation with improved vision-language alignment. Comparison
results against both diffusion-based and MLLM-based methods on PairComp, GenEval, T2I-CompBench, and DPG-Bench are presented in the following two tables.
We also present a direct qualitative comparison between Janus-FocusDiff-7B and Janus-Pro-7B on pairs of similar prompts with fine-grained semantic differences. For each prompt, we ask each model to generate two images. Janus-Pro-7B struggles to precisely satisfy the fine-grained requirements of similar prompts; moreover, even for the same prompt, its generated images are not consistently aligned with the target semantics. In contrast, our
Janus-FocusDiff-7B accurately captures the fine-grained semantic differences between prompts to generate the corresponding images, and stably produces high-quality images that meet the specified requirements.
Janus-FocusDiff can further generate images that more accurately match counterfactual prompts that are rarely found in the real world. For instance, given the prompt "square watermelon", Janus-Pro-7B still generates a round one.
In contrast, our
Janus-FocusDiff-7B successfully generates a watermelon with this counterfactual shape.
This indicates that our approach effectively mitigates hallucinated generation, eliminating the erroneous bias toward the training distribution.