The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation

(ECCV 2024)

*Equal contribution
¹National Yang Ming Chiao Tung University, ²Jilin University, ³National Taiwan University

Abstract

Despite recent advances, text-to-image generation still struggles with complex, imaginative text prompts. Because their training sets offer limited exposure to diverse and complex data, text-to-image models often fail to grasp the semantics of these difficult prompts and generate irrelevant images. This work explores how diffusion models can process and generate images from prompts that require artistic creativity or specialized knowledge. Because no dedicated evaluation framework exists for such tasks, we introduce a new benchmark, the Realistic-Fantasy Benchmark (RFBench), which blends scenarios from both realistic and fantastical realms. For reality and fantasy scene generation, we propose a training-free approach, the Realistic-Fantasy Network (RFNet), which integrates diffusion models with LLMs. On RFBench, extensive human evaluations and GPT-based compositional assessments show that our approach outperforms other state-of-the-art methods.

Method

Overview

The Realistic-Fantasy Network (RFNet) has two stages. In the first stage, an LLM transforms the initial input prompt into a refined version tailored for image generation. In the second stage, a diffusion model generates the final, richly detailed output through a two-step process.


Semantic Alignment Assessment (SAA) Module

When the diffusion model generates images from the details produced in the previous step, a critical challenge arises: the description lists that LLMs generate for a single object usually overlook the relationships among the descriptions. For example, interpretations of “a lion” could range from “unaware and asleep” to “frightened and trying to escape.” Although both depictions are valid, descriptions such as “unaware” and “trying to escape” conflict with each other, which complicates the image generation process.

To overcome this challenge, we introduce the Semantic Alignment Assessment (SAA) module. This module calculates the relevance between different object vectors and selects the candidate description that best fits the current scenario. By computing the cosine similarity among different descriptions, we can resolve the conflicts in the LLM's output and select the most compatible details for the diffusion model. This step is crucial for keeping the generated images coherent and accurate, because it reduces the risk of conflicting descriptions. The module thus ensures textual precision and compatibility, and gives the subsequent diffusion model clear, consistent instructions for generating visually coherent representations.
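The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each candidate description already has an embedding vector, and it picks one candidate per object by exhaustively maximizing the summed pairwise cosine similarity. The function names, the "mouse" object, and the 2-D toy embeddings are all hypothetical and exist only for illustration.

```python
from itertools import combinations, product

import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def select_compatible(candidates):
    """Pick one description per object with the highest total pairwise similarity.

    candidates: dict mapping object name -> list of (description, embedding) pairs.
    Returns a dict mapping object name -> chosen description.
    """
    names = list(candidates)
    best_score, best_choice = -np.inf, None
    # Exhaustive search over every combination of one candidate per object
    # (an assumption of this sketch; fine for a handful of candidates).
    for combo in product(*(candidates[n] for n in names)):
        score = sum(cosine(a[1], b[1]) for a, b in combinations(combo, 2))
        if score > best_score:
            best_score = score
            best_choice = {n: desc for n, (desc, _) in zip(names, combo)}
    return best_choice


# Toy embeddings invented for illustration; a real system would use a text encoder.
toy = {
    "lion": [
        ("unaware and asleep", np.array([1.0, 0.1])),
        ("frightened and trying to escape", np.array([0.1, 1.0])),
    ],
    "mouse": [
        ("tiptoeing quietly past", np.array([0.9, 0.2])),
    ],
}
chosen = select_compatible(toy)
```

With these toy vectors, the "unaware and asleep" lion is closest to the only mouse description, so it is the candidate kept. Exhaustive search grows exponentially with the number of objects, so a larger scene would likely need a greedy or pairwise scheme instead.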


Qualitative Result

Qualitative comparison on RFBench. The compared models include (a) Stable Diffusion, (b) MultiDiffusion, (c) Attend and Excite, (d) LMD, (e) BoxDiff, (f) SDXL, (g) Ours.

More results on Realistic and Analytical. The compared models include (a) Stable Diffusion, (b) MultiDiffusion, (c) Attend and Excite, (d) LMD, (e) BoxDiff, (f) SDXL, (g) Ours.

More results on Creativity and Imagination. The compared models include (a) Stable Diffusion, (b) MultiDiffusion, (c) Attend and Excite, (d) LMD, (e) BoxDiff, (f) SDXL, (g) Ours.

Quantitative Result

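R & A: Realistic and Analytical. C & I: Creativity and Imagination.

| Model | GPT4-CLIP R & A | GPT4-CLIP C & I | GPT4-CLIP Avg | GPT4Score R & A | GPT4Score C & I | GPT4Score Avg |
|---|---|---|---|---|---|---|
| Stable Diffusion | 0.573 | 0.552 | 0.561 | 0.667 | 0.440 | 0.541 |
| MultiDiffusion | 0.510 | 0.510 | 0.510 | 0.517 | 0.493 | 0.504 |
| Attend and Excite | 0.523 | 0.560 | 0.546 | 0.633 | 0.520 | 0.570 |
| LLM-grounded Diffusion (LMD) | 0.457 | 0.536 | 0.501 | 0.550 | 0.600 | 0.578 |
| BoxDiff | 0.532 | 0.553 | 0.543 | 0.583 | 0.520 | 0.548 |
| SDXL | 0.536 | 0.619 | 0.582 | 0.567 | 0.587 | 0.578 |
| RFNet (ours) | 0.587 (2%↑) | 0.623 (13%↑) | 0.607 (8%↑) | 0.833 (25%↑) | 0.627 (43%↑) | 0.719 (33%↑) |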
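
The page does not state what the percentages in the RFNet row are measured against. A quick arithmetic check suggests they are relative gains over the Stable Diffusion row, rounded to the nearest percent. The sketch below recomputes them from the reported scores; the helper name `relative_gain` is hypothetical, and exact decimal arithmetic with half-up rounding is an assumption chosen so that 42.5% rounds to 43%.

```python
from decimal import Decimal, ROUND_HALF_UP


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, as a whole percentage."""
    ours, baseline = Decimal(ours), Decimal(baseline)
    pct = (ours - baseline) / baseline * 100
    return int(pct.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# Scores copied from the table, in column order:
# GPT4-CLIP (R & A, C & I, Avg), then GPT4Score (R & A, C & I, Avg).
stable_diffusion = ["0.573", "0.552", "0.561", "0.667", "0.440", "0.541"]
rfnet = ["0.587", "0.623", "0.607", "0.833", "0.627", "0.719"]

gains = [relative_gain(o, b) for o, b in zip(rfnet, stable_diffusion)]
print(gains)  # [2, 13, 8, 25, 43, 33]
```

All six recomputed values match the reported figures, which is consistent with Stable Diffusion being the reference, not the strongest baseline in each column.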

BibTeX

@article{yao2024fabricationrealityfantasyscene,
    title          = {The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation}, 
    author         = {Yi Yao and Chan-Feng Hsu and Jhe-Hao Lin and Hongxia Xie and Terence Lin and Yi-Ning Huang and Hong-Han Shuai and Wen-Huang Cheng},
    year           = {2024},
    eprint         = {2407.12579},
    archivePrefix  = {arXiv},
    primaryClass   = {cs.CV},
    url            = {https://arxiv.org/abs/2407.12579}, 
  }