Recent studies extend the autoregressive paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment and thus fail to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, and further introduce a novel reinforcement learning algorithm that emphasizes such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.
FocusDiff addresses the challenge of fine-grained semantic control in AR-based image generation. While AR models excel at capturing global semantics, they often struggle with subtle, fine-grained distinctions.
FocusDiff enhances text-to-image alignment through two main innovations:
1. Paired data: a new dataset of paired texts and images with similar overall expressions but distinct local semantics, which forces the model to attend to subtle differences.
2. Pair-GRPO: a novel RL algorithm extending Group Relative Policy Optimization (GRPO) to emphasize fine-grained semantic differences during training (see the sketch after this list).
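The exact Pair-GRPO update is not spelled out here, so the snippet below is only a minimal sketch of the group-relative advantage computation that GRPO-style training relies on, plus a hypothetical option to pool the reward baseline across the two prompts of a pair; the names `pair_grpo_advantages` and `pool_baseline` are illustrative assumptions, not the released implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pair_grpo_advantages(rewards_a: np.ndarray,
                         rewards_b: np.ndarray,
                         pool_baseline: bool = True,
                         eps: float = 1e-6):
    """Hypothetical pairing of the reward groups for two similar prompts (A, B).

    With pool_baseline=True, the two groups share one mean/std baseline, so an image
    is credited relative to samples from *both* prompts of the pair, which pushes the
    policy to react to their fine-grained semantic difference. This pooling rule is an
    illustrative assumption, not the paper's definition.
    """
    if pool_baseline:
        pooled = np.concatenate([rewards_a, rewards_b])
        mean, std = pooled.mean(), pooled.std() + eps
        return (rewards_a - mean) / std, (rewards_b - mean) / std
    return group_relative_advantages(rewards_a, eps), group_relative_advantages(rewards_b, eps)

# Example: 4 sampled images per prompt, rewards from a text-image alignment scorer.
adv_a, adv_b = pair_grpo_advantages(np.array([0.8, 0.6, 0.9, 0.4]),
                                    np.array([0.3, 0.5, 0.2, 0.4]))
```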
On this basis, we propose a new benchmark, PairComp. Each test case in PairComp contains two similar prompts with subtle differences. By comparing the accuracy of the images generated by the model for each prompt, we evaluate whether the model has focused on the fine-grained semantic differences in the prompts to produce the corresponding correct images. The two prompts in a test case exhibit word-level differences that lead to noticeable distinctions in certain fine-grained semantic aspects. As shown in the following figure, these differences fall into six categories: (1) Overall appearance; (2) Color; (3) Counting; (4) Position; (5) Style & tone; (6) Text.
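To make the benchmark structure concrete, the sketch below shows one way a PairComp test case and a paired-accuracy check could be represented; the field names, the `paired_accuracy` helper, and the "both prompts must be correct" rule are our assumptions for illustration, not the released benchmark's API or official metric.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

# The six fine-grained difference categories described above.
CATEGORIES = ["appearance", "color", "counting", "position", "style_tone", "text"]

@dataclass
class PairCompCase:
    prompt_a: str   # e.g. "a red cup to the left of a book"
    prompt_b: str   # word-level edit, e.g. "a red cup to the right of a book"
    category: str   # one of CATEGORIES

def paired_accuracy(cases: List[PairCompCase],
                    generate: Callable[[str], Any],
                    is_correct: Callable[[Any, str], bool]) -> float:
    """Count a case as solved only if the images for *both* prompts are judged correct,
    i.e. the model actually reacted to the word-level difference (assumed rule)."""
    solved = 0
    for case in cases:
        ok_a = is_correct(generate(case.prompt_a), case.prompt_a)
        ok_b = is_correct(generate(case.prompt_b), case.prompt_b)
        solved += int(ok_a and ok_b)
    return solved / max(len(cases), 1)
```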
We employ Janus-Pro as the backbone and develop Janus-FocusDiff, which excels in text-to-image generation with improved vision-language alignment. Comparison
results against both diffusion-based and MLLM-based methods on PairComp, GenEval, T2I-CompBench, and DPG-Bench are presented in the following two tables.
We also present a direct qualitative comparison between Janus-FocusDiff-7B and Janus-Pro-7B on pairs of similar prompts with fine-grained semantic differences. For each prompt, we ask each model to generate two images. Janus-Pro-7B struggles to precisely satisfy the fine-grained requirements of similar prompts; moreover, even for the same prompt, its generated images are not consistently aligned with the target semantics. In contrast, our
Janus-FocusDiff-7B accurately captures the fine-grained semantic differences between prompts to generate the corresponding images, and stably produces high-quality images that meet the specified requirements.
Janus-FocusDiff can further generate images that more accurately match counterfactual prompts that are rarely found in the real world. For instance, given the prompt "square watermelon", Janus-Pro-7B still generates a round one.
In contrast, our
Janus-FocusDiff-7B successfully generates a watermelon with this counterfactual shape.
This indicates that our approach effectively mitigates hallucinated generation, eliminating the erroneous bias toward the training distribution.