Approach

We attribute the failure to two key factors: a lack of relevant data and an ineffective model design. Unlike object customization methods, which commonly create paired training data with augmentations such as flipping or rotation, our approach requires a triplet of images: two image prompts and one target image. The image prompts should contain objects similar to those in the target image but in distinct poses. To collect these triplets, we propose a data engine that curates our finetuning set: we leverage an advanced text-to-image generation model to generate triplets in which the same object pair is shared across all three images. Guided by the text prompt, each image prompt supplies strong identity information, which decouples identity from the relationship in the target image and thereby enhances relation learning.
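The triplet idea above can be sketched as a simple prompt-templating step. The template wording and the helper name `build_triplet_prompts` are illustrative assumptions for exposition, not the paper's actual data engine:

```python
# Sketch of the triplet structure: one shared object pair, three prompts.
# The exact prompt templates are an assumption, not the paper's pipeline.

def build_triplet_prompts(obj_a: str, obj_b: str, relation: str) -> dict:
    """Return prompts for two image prompts (identity) and one target (relation)."""
    return {
        # Image prompts: each object alone, in a neutral pose, to carry identity.
        "image_prompt_1": f"a photo of {obj_a}, standing, full body",
        "image_prompt_2": f"a photo of {obj_b}, standing, full body",
        # Target: the same object pair composed with the desired relation.
        "target": f"a photo of {obj_a} {relation} {obj_b}",
    }

triplet = build_triplet_prompts("a man", "a woman", "shaking hands with")
```

The relation word only appears in the target prompt, so the image prompts stay relation-neutral and supply identity cues alone.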
For the model design, we propose DreamRelation, which applies the Low-Rank Adaptation (LoRA) strategy to the text cross-attention layers of an existing diffusion model to process user-provided text prompts. DreamRelation introduces two key modules during training to enhance relation generation in customized generation. First, since relationships between objects are closely tied to their poses, we introduce a keypoint matching loss (KML) as additional supervision that explicitly encourages the model to manipulate object poses. Importantly, the KML operates on the latent representation rather than the original image space, aligning it with the default diffusion loss. Second, relation-aware customization requires local features from the image prompts, such as the “hands” features needed to generate “shaking hands”, which are not captured by CLIP's coarse image-level features. We therefore introduce dense features from CLIP: through partitioning and pooling, we obtain local tokens that carry fine-grained local information. To further improve the compatibility between these dense features and the image-level features, we employ a self-distillation method that aligns them.
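A minimal numeric sketch of the LoRA update applied to a cross-attention projection: the frozen weight W is augmented with a trainable low-rank product B @ A. The shapes, scaling factor, and zero initialization of B follow generic LoRA conventions and are not DreamRelation's specific configuration:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen projection W plus a trainable low-rank update (B @ A).

    x: (batch, d_in) tokens; W: (d_out, d_in) frozen weight;
    A: (r, d_in), B: (d_out, r) with rank r << min(d_in, d_out).
    Only A and B receive gradients during finetuning.
    """
    return x @ W.T + alpha * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 8, 2
x = rng.normal(size=(4, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = np.zeros((d_out, r))  # common LoRA init: B = 0, so training starts at W
```

With B initialized to zero, the adapted layer initially reproduces the frozen projection exactly, so finetuning starts from the pretrained model's behavior.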
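The text does not give the exact form of the KML. One plausible instantiation, sketched here under the assumption that keypoint coordinates can be extracted in latent space, is a visibility-masked mean squared distance added on top of the diffusion loss:

```python
import numpy as np

def keypoint_matching_loss(pred_kpts, gt_kpts, visibility=None):
    """Mean squared distance between predicted and ground-truth keypoints.

    pred_kpts, gt_kpts: (num_keypoints, 2) coordinates in latent space.
    visibility: optional (num_keypoints,) 0/1 mask for occluded keypoints.
    This formula is an assumption; the paper only states that the loss
    supervises object poses in the latent representation.
    """
    diff = np.sum((pred_kpts - gt_kpts) ** 2, axis=-1)  # per-keypoint sq. distance
    if visibility is not None:
        return float(np.sum(diff * visibility) / max(np.sum(visibility), 1))
    return float(np.mean(diff))

pred = np.array([[0.0, 0.0], [2.0, 0.0]])
gt = np.array([[0.0, 0.0], [0.0, 0.0]])
loss_all = keypoint_matching_loss(pred, gt)                        # (0 + 4) / 2
loss_vis = keypoint_matching_loss(pred, gt, np.array([1.0, 0.0]))  # first kpt only
```

Operating on latent-space coordinates keeps this term at the same resolution as the default diffusion loss, so the two can simply be summed during training.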
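The partitioning-and-pooling step, and one plausible form of the self-distillation alignment, can be sketched as follows. The grid size, region count, and cosine-distance alignment objective are assumptions for illustration; the paper specifies only that dense CLIP features are partitioned and pooled into local tokens and then aligned with the image-level feature:

```python
import numpy as np

def local_tokens(patch_grid, regions=2):
    """Partition an (H, W, D) patch-feature grid into regions x regions blocks
    and average-pool each block into one token -> (regions * regions, D)."""
    H, W, D = patch_grid.shape
    h, w = H // regions, W // regions
    blocks = patch_grid.reshape(regions, h, regions, w, D)
    return blocks.mean(axis=(1, 3)).reshape(regions * regions, D)

def alignment_loss(locals_, global_feat):
    """Assumed self-distillation target: pull the pooled local tokens toward
    the frozen image-level CLIP feature via cosine distance."""
    pooled = locals_.mean(axis=0)
    cos = pooled @ global_feat / (np.linalg.norm(pooled) * np.linalg.norm(global_feat))
    return float(1.0 - cos)

grid = np.arange(16, dtype=float).reshape(4, 4, 1)  # toy 4x4 grid of 1-d features
tokens = local_tokens(grid, regions=2)              # four 2x2 blocks -> (4, 1)
```

Each local token summarizes one spatial region of the image prompt, which is what lets the model pick out part-level cues such as hands rather than relying on a single global embedding.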