Despite recent advances in text-to-image generation, models still struggle with complex, imaginative prompts. Because their training data offer limited exposure to such diverse and intricate scenarios, text-to-image models often fail to grasp the semantics of these prompts and generate irrelevant images. This work explores how diffusion models can process and generate images from prompts that demand artistic creativity or specialized knowledge. Recognizing the absence of a dedicated evaluation framework for such tasks, we introduce the Realistic-Fantasy Benchmark (RFBench), a new benchmark that blends scenarios from both realistic and fantastical realms. For reality and fantasy scene generation, we further propose a training-free approach, the Realistic-Fantasy Network (RFNet), which integrates diffusion models with LLMs. On RFBench, extensive human evaluations coupled with GPT-based compositional assessments demonstrate our approach's superiority over other state-of-the-art methods.
The Realistic-Fantasy Network (RFNet) operates in two stages. In the first stage, an LLM transforms the initial input prompt into a refined version tailored for image generation. In the second stage, a diffusion model renders the refined prompt through a two-step process to produce outputs rich in detail.
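The sketch below illustrates the two-stage flow at a high level. It is not the paper's implementation: it assumes the LLM is queried through the OpenAI chat API, uses Stable Diffusion v1.5 via the Hugging Face `diffusers` library as the backbone, and collapses the second stage into a single sampling call; the prompt wording is illustrative only.

```python
# Minimal sketch of an RFNet-style two-stage pipeline (assumptions noted above).
import torch
from openai import OpenAI
from diffusers import StableDiffusionPipeline

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def refine_prompt(user_prompt: str) -> str:
    """Stage 1: ask an LLM to expand an imaginative prompt into a
    detailed, generation-friendly scene description."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Rewrite the prompt as a detailed scene description, "
                        "listing each object and its visual attributes."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


def generate_image(refined_prompt: str):
    """Stage 2 (simplified here to one sampling call): render the refined
    description with a diffusion model."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(refined_prompt).images[0]


image = generate_image(refine_prompt("a lion startled by a tiny glowing fairy"))
image.save("scene.png")
```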
When the diffusion model generates images from the details produced in the previous step, a critical challenge arises: the description lists the LLM generates for each object usually overlook the relationships among the objects. For example, interpretations of “a lion” could range from “unaware and asleep” to “frightened and trying to escape.” Although both depictions are valid on their own, descriptions such as “unaware” and “trying to escape” conflict with each other, complicating the image generation process.
To overcome this challenge, we introduce the Semantic Alignment Assessment (SAA) module. This module computes the relevance between different object description vectors and selects the candidate description that best fits the current scenario. By computing cosine similarity among the candidate descriptions, we navigate the ambiguity in the LLM's output and select the details most compatible with the scene for the diffusion model. This step is crucial for maintaining the coherence and accuracy of the generated images and mitigates the risk of conflicting descriptions. Through this module, we ensure textual precision and compatibility, providing clear, consistent instructions for the subsequent diffusion model to generate visually coherent representations.
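A minimal sketch of this selection idea follows. It assumes the candidate descriptions are embedded with a CLIP text encoder from `transformers` and that compatibility is scored against a single scene-context string; the embedding model, scoring rule, and function names are illustrative assumptions, not the paper's exact SAA implementation.

```python
# Sketch of SAA-style candidate selection via cosine similarity (assumptions above).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def embed(texts):
    """Return one pooled CLIP text embedding per input string."""
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return text_encoder(**tokens).pooler_output  # shape (N, dim)


def select_description(candidates, scene_context):
    """Pick the candidate description whose embedding has the highest
    cosine similarity with the scene-context embedding."""
    cand = torch.nn.functional.normalize(embed(candidates), dim=-1)
    ctx = torch.nn.functional.normalize(embed([scene_context]), dim=-1)
    scores = cand @ ctx.T  # cosine similarities, shape (N, 1)
    return candidates[scores.argmax().item()]


lion_candidates = ["a lion, unaware and asleep",
                   "a lion, frightened and trying to escape"]
context = "a tiny glowing fairy startles a lion in a moonlit forest"
print(select_description(lion_candidates, context))
# Expected: the "frightened and trying to escape" reading, which fits this context.
```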
Model | GPT4-CLIP (R & A) | GPT4-CLIP (C & I) | GPT4-CLIP (Avg) | GPT4Score (R & A) | GPT4Score (C & I) | GPT4Score (Avg)
---|---|---|---|---|---|---
Stable Diffusion | 0.573 | 0.552 | 0.561 | 0.667 | 0.440 | 0.541 |
MultiDiffusion | 0.510 | 0.510 | 0.510 | 0.517 | 0.493 | 0.504 |
Attend-and-Excite | 0.523 | 0.560 | 0.546 | 0.633 | 0.520 | 0.570 |
LLM-grounded Diffusion | 0.457 | 0.536 | 0.501 | 0.550 | 0.600 | 0.578 |
BoxDiff | 0.532 | 0.553 | 0.543 | 0.583 | 0.520 | 0.548 |
SDXL | 0.536 | 0.619 | 0.582 | 0.567 | 0.587 | 0.578 |
RFNet (ours) | 0.587 (2%↑) | 0.623 (13%↑) | 0.607 (8%↑) | 0.833 (25%↑) | 0.627 (43%↑) | 0.719 (33%↑) |
@article{yao2024fabricationrealityfantasyscene,
title = {The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation},
author = {Yi Yao and Chan-Feng Hsu and Jhe-Hao Lin and Hongxia Xie and Terence Lin and Yi-Ning Huang and Hong-Han Shuai and Wen-Huang Cheng},
year = {2024},
eprint = {2407.12579},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2407.12579},
}