Approach

We attribute the failure to two key factors: a lack of relevant data and an ineffective model design. Unlike object customization methods, which commonly create paired training data with augmentations such as flipping or rotation, our approach requires a triplet of images: two image prompts and one target image. The image prompts should contain objects similar to those in the target image but in distinct poses. To collect these triplets, we propose a data engine that curates our finetuning set: we leverage an advanced text-to-image generation model to generate triplets in which the same object pair is shared across all three images. Guided by the text prompt, each image prompt supplies strong identity information, which decouples identity from the relationship in the target image and thereby enhances relation learning.
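The triplet idea above can be sketched as a simple prompt-templating step. The template wording and the helper name `build_triplet_prompts` are illustrative assumptions for exposition, not the paper's actual data engine:

```python
# Sketch of the triplet structure: one shared object pair, three prompts.
# The exact prompt templates are an assumption, not the paper's pipeline.

def build_triplet_prompts(obj_a: str, obj_b: str, relation: str) -> dict:
    """Return prompts for two image prompts (identity) and one target (relation)."""
    return {
        # Image prompts: each object alone, in a neutral pose, to carry identity.
        "image_prompt_1": f"a photo of {obj_a}, standing, full body",
        "image_prompt_2": f"a photo of {obj_b}, standing, full body",
        # Target: the same object pair composed with the desired relation.
        "target": f"a photo of {obj_a} {relation} {obj_b}",
    }

triplet = build_triplet_prompts("a man", "a woman", "shaking hands with")
```

The relation word only appears in the target prompt, so the image prompts stay relation-neutral and supply identity cues alone.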
For the model design, we propose DreamRelation, which applies the Low-Rank Adaptation (LoRA) strategy to the text cross-attention layers of an existing diffusion model to process user-provided text prompts. DreamRelation introduces two key modules during training to enhance relation generation in customized generation. First, since relationships between objects are closely tied to their poses, we introduce a keypoint matching loss (KML) as additional supervision that explicitly encourages the model to manipulate object poses. Importantly, the KML operates on the latent representation rather than the original image space, aligning it with the default diffusion loss. Second, relation-aware customization requires local features from the image prompts, such as the “hands” features needed to generate “shaking hands”, which are not captured by CLIP's coarse image-level features. We therefore introduce dense features from CLIP: through partitioning and pooling, we obtain local tokens that carry fine-grained local information. To further improve the compatibility between these dense features and the image-level features, we employ a self-distillation method that aligns them.
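A minimal numeric sketch of the LoRA update applied to a cross-attention projection: the frozen weight W is augmented with a trainable low-rank product B @ A. The shapes, scaling factor, and zero initialization of B follow generic LoRA conventions and are not DreamRelation's specific configuration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen projection W plus a trainable low-rank update (B @ A).

    x: (batch, d_in) tokens; W: (d_out, d_in) frozen weight;
    A: (r, d_in), B: (d_out, r) with rank r << min(d_in, d_out).
    Only A and B receive gradients during finetuning.
    """
    return x @ W.T + alpha * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
x = rng.normal(size=(4, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))  # common LoRA init: B = 0, so training starts at W
```

With B initialized to zero, the adapted layer initially reproduces the frozen projection exactly, so finetuning starts from the pretrained model's behavior.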
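The text does not give the exact form of the KML. One plausible instantiation, sketched here under the assumption that keypoint coordinates can be extracted in latent space, is a visibility-masked mean squared distance added on top of the diffusion loss:

```python
import numpy as np

def keypoint_matching_loss(pred_kpts, gt_kpts, visibility=None):
    """Mean squared distance between predicted and ground-truth keypoints.

    pred_kpts, gt_kpts: (num_keypoints, 2) coordinates in latent space.
    visibility: optional (num_keypoints,) 0/1 mask for occluded keypoints.
    This formula is an assumption; the paper only states that the loss
    supervises object poses in the latent representation.
    """
    diff = np.sum((pred_kpts - gt_kpts) ** 2, axis=-1)  # per-keypoint sq. distance
    if visibility is not None:
        return float(np.sum(diff * visibility) / max(np.sum(visibility), 1))
    return float(np.mean(diff))

pred = np.array([[0.0, 0.0], [2.0, 0.0]])
gt = np.array([[0.0, 0.0], [0.0, 0.0]])
loss_all = keypoint_matching_loss(pred, gt)                        # (0 + 4) / 2
loss_vis = keypoint_matching_loss(pred, gt, np.array([1.0, 0.0]))  # first kpt only
```

Operating on latent-space coordinates keeps this term at the same resolution as the default diffusion loss, so the two can simply be summed during training.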
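The partitioning-and-pooling step, and one plausible form of the self-distillation alignment, can be sketched as follows. The grid size, region count, and cosine-distance alignment objective are assumptions for illustration; the paper specifies only that dense CLIP features are partitioned and pooled into local tokens and then aligned with the image-level feature:

```python
import numpy as np

def local_tokens(patch_grid, regions=2):
    """Partition an (H, W, D) patch-feature grid into regions x regions blocks
    and average-pool each block into one token -> (regions * regions, D)."""
    H, W, D = patch_grid.shape
    h, w = H // regions, W // regions
    blocks = patch_grid.reshape(regions, h, regions, w, D)
    return blocks.mean(axis=(1, 3)).reshape(regions * regions, D)

def alignment_loss(locals_, global_feat):
    """Assumed self-distillation target: pull the pooled local tokens toward
    the frozen image-level CLIP feature via cosine distance."""
    pooled = locals_.mean(axis=0)
    cos = pooled @ global_feat / (np.linalg.norm(pooled) * np.linalg.norm(global_feat))
    return float(1.0 - cos)

grid = np.arange(16, dtype=float).reshape(4, 4, 1)  # toy 4x4 grid of 1-d features
tokens = local_tokens(grid, regions=2)              # four 2x2 blocks -> (4, 1)
```

Each local token summarizes one spatial region of the image prompt, which is what lets the model pick out part-level cues such as hands rather than relying on a single global embedding.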