RelationBooth: Towards Relation-aware Customized Object Generation

1Peking University, 2UC Merced, 3National University of Singapore, 4Shanghai AI Laboratory
Teaser Image

Relation-aware image customization: the generated image should strictly follow the predicate relation described in the text prompt while preserving the identity of each object provided by the image prompts.

Abstract

Customized image generation is crucial for delivering personalized content based on user-provided image prompts, aligning large-scale text-to-image diffusion models with individual needs. However, existing models often overlook the relationships between customized objects in generated images. This work addresses that gap by focusing on relation-aware customized image generation, which aims to preserve the identities from image prompts while maintaining the predicate relations described in text prompts. Specifically, we introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts that guide relation generation. We then propose two key modules to tackle the two main challenges: generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses, which are closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on three benchmarks demonstrate the superiority of RelationBooth in generating precise relations while preserving object identities across a diverse set of objects and relations. The source code and trained models will be made publicly available.

Background

Customized image generation has made significant progress, driven by advancements in large-scale text-to-image diffusion models like Stable Diffusion and Imagen. These methods have enabled the generation of personalized content by preserving the identity of objects specified by user inputs, proving valuable in areas such as personalized artwork, branding, virtual fashion, social media content, and augmented reality. However, existing techniques primarily focus on individual object customization, often neglecting the importance of relationships between objects in the context of the provided text prompts. Addressing this gap, relation-aware customized image generation emphasizes not only preserving the identity of multiple objects but also accurately capturing the relationships described in the text prompts, presenting new challenges and opportunities in this field.

Approach

Our approach tackles the challenges in relation-aware customized image generation by addressing both data limitations and model design. We propose using triplets of images, two image prompts and one target image, in which the image prompts depict the same objects as the target image but performing different actions. Leveraging an advanced text-to-image generation model, we ensure object consistency across each triplet while varying the actions, enabling focused relation learning during fine-tuning.
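For concreteness, the sketch below shows how one training sample in such a curated dataset might be organized; the class, field names, and file paths are hypothetical illustrations, not the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RelationSample:
    """One hypothetical training triplet for relation-aware customization."""
    image_prompts: List[str]  # two independent object images carrying identity
    target_image: str         # relation-specific image showing the interaction
    text_prompt: str          # text prompt describing the predicate relation
    relation: str             # the predicate itself

# Example sample: the image prompts show the same cat and teddy bear as the
# target image, but in unrelated poses, so the relation must come from the text.
sample = RelationSample(
    image_prompts=["cat_ref.png", "teddy_bear_ref.png"],
    target_image="cat_hugging_teddy_bear.png",
    text_prompt="a photo of a cat hugging a teddy bear",
    relation="hugging",
)
```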

In terms of model design, we introduce RelationBooth, which utilizes the Low-Rank Adaptation (LoRA) strategy to adapt the text cross-attention layers of diffusion models. In RelationBooth, two key modules are introduced during training to enhance the customization of relationships and identity preservation. First, we introduce a keypoint matching loss (KML) as additional supervision to explicitly encourage the model to adjust object poses, since relationships between objects are closely tied to their poses. Importantly, the KML operates on the latent representation rather than the original image space, aligning with the default diffusion loss. Second, we inject local tokens for multiple objects to improve the distinctiveness of highly overlapping objects. Specifically, we employ the self-distillation method from CLIPSelf to enhance region-language alignment in CLIP's dense features. Through partitioning and pooling, the well-aligned local tokens help mitigate appearance confusion between objects.
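The two modules can be illustrated with the minimal sketch below. It assumes the keypoint matching loss compares latent features sampled at matched keypoint locations, and that local tokens are obtained by cropping and grid-pooling region-aligned CLIP dense features per object box; the function names and these exact formulations are our assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def keypoint_matching_loss(pred_latent, target_latent, keypoints):
    """Sketch of a keypoint matching loss (KML) in latent space.

    pred_latent, target_latent: (B, C, H, W) predicted / reference clean latents.
    keypoints: (B, K, 2) keypoint coordinates normalized to [-1, 1], mapped
               from the target image into the latent grid.
    Features sampled at the same keypoints are compared, pushing the model to
    place object parts (and hence poses) where the relation requires them.
    """
    grid = keypoints.unsqueeze(2)  # (B, K, 1, 2) sampling grid for grid_sample
    pred_feats = F.grid_sample(pred_latent, grid, align_corners=False)
    target_feats = F.grid_sample(target_latent, grid, align_corners=False)
    return F.mse_loss(pred_feats, target_feats)

def local_object_tokens(dense_feats, boxes, grid=2):
    """Sketch of local-token extraction from region-aligned CLIP features.

    dense_feats: (B, C, H, W) dense CLIP features after CLIPSelf-style
                 self-distillation, so pooled regions align with language.
    boxes: list of (x1, y1, x2, y2) object boxes in feature-map coordinates.
    Each box is partitioned into a grid x grid layout and average-pooled,
    giving grid*grid local tokens per object that are injected to keep
    highly overlapping objects distinguishable.
    """
    tokens = []
    for (x1, y1, x2, y2) in boxes:
        region = dense_feats[:, :, y1:y2, x1:x2]             # crop object region
        pooled = F.adaptive_avg_pool2d(region, (grid, grid))  # (B, C, grid, grid)
        tokens.append(pooled.flatten(2).transpose(1, 2))      # (B, grid*grid, C)
    return torch.cat(tokens, dim=1)                           # (B, N*grid*grid, C)
```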

Pipeline and data engine overview.

Results

To comprehensively evaluate relation-aware customized image generation, we developed RelationBench, building on two well-established benchmarks, DreamBench and CustomConcept101. Our method demonstrates strong performance: the generated images closely adhere to the relationships specified in the text prompts and preserve the identities from the image prompts, achieving significant improvements in both visual quality and quantitative results.

Same object with different relations.
Results of RelationBooth
Same relation with different objects.
Results of RelationBooth
Qualitative results comparison.
Multi Comparison Image
Quantitative results on RelationBench.
Results of RelationBooth

Ablation

In our project, we explored several techniques to enhance relation-aware image generation. First, we experimented with fine-tuning text embeddings using blank image prompts, but this underperformed due to the entanglement of identity and relation information. Next, we observed that including the keypoint matching loss (KML) significantly improved image quality and relation accuracy. Lastly, local token injection was crucial for preventing object confusion. Together, these components enable high-quality, relation-aware image customization.


Relation Inversion Task

Notably, after fine-tuning, our RelationLoRA can also be directly integrated into SDXL, effectively enabling the model to address the relation inversion task.
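As a rough illustration of how this could look with the diffusers library (the LoRA file name below is a placeholder; the actual loading path for the released weights may differ):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Attach the fine-tuned relation LoRA (placeholder file name).
pipe.load_lora_weights("relationbooth_lora.safetensors")

# With the LoRA applied, the predicate relation is driven by the text prompt.
image = pipe(
    prompt="a cat is shaking hands with a dog",
    num_inference_steps=30,
).images[0]
image.save("relation_inversion.png")
```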


Integration with CogVideoX-5b-I2V
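A hedged sketch of this combination with the diffusers CogVideoX image-to-video pipeline is shown below; the input image path stands in for a relation-aware customized image generated by RelationBooth, and the prompt and settings are purely illustrative.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the CogVideoX-5b image-to-video pipeline.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

# Use a relation-aware customized image as the conditioning frame (placeholder path).
image = load_image("cat_hugging_teddy_bear_customized.png")

video_frames = pipe(
    prompt="a cat hugging a teddy bear, gentle natural motion",
    image=image,
    num_inference_steps=50,
).frames[0]
export_to_video(video_frames, "relation_video.mp4", fps=8)
```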