PartComposer: Learning and Composing Part-Level Concepts from Single-Image Examples

1Brown University

Abstract

We present PartComposer: a framework for part-level concept learning from single-image examples that enables text-to-image diffusion models to compose novel objects from meaningful components. Existing methods either struggle to effectively learn fine-grained concepts or require a large dataset as input. We propose a dynamic data synthesis pipeline that generates diverse part compositions to address one-shot data scarcity. Most importantly, we propose to maximize the mutual information between denoised latents and structured concept codes via a concept predictor, enabling direct regulation of concept disentanglement and re-composition supervision. Our method achieves strong disentanglement and controllable composition, outperforming subject- and part-level baselines when mixing concepts from the same or different object categories.

Video

Part-level Concepts Compositions in Same-Category Objects

These are part-level concept learning and composition results for same-category objects. Our method works on a wide variety of objects and can learn very fine-grained concepts (more than four parts per object). Feel free to swipe through them.

Part-level Concepts Compositions in Cross-Category Objects

These are part-level concept learning and composition results for cross-category objects, yielding some imaginative new virtual objects.

Problem Statement

Our goal is to learn part-level concepts from single-image examples (only one image per object) and compose them into new objects. We focus on part-level concepts because they enable powerful visual imagination. We emphasize single-image inputs since most real or generated data provide only one image per object, without multi-view or 3D information.



Two major challenges:
- Concept Entanglement at the Part Level: Part-level concepts are more fine-grained than subject-level concepts and require structural information to be composed into plausible objects. This makes it more challenging to learn each concept's identity and to achieve clear disentanglement between different concepts.
- Data Scarcity: A single image per object provides very little data variety (i.e., very few ground-truth part compositions), which challenges the generative model's ability to compose unseen combinations.

Here we demonstrate the above challenges. When trying to compose these four parts across the two images, Break-a-Scene [1], a representative method for learning subject-level concepts, tends to miss or mix the intended concepts. In contrast, our work aims to produce part learning and composition results like these, which clearly reflect the identity of the intended parts.

Our Overall Pipeline

To overcome the above challenges, we make two critical observations and propose corresponding modifications. First, the space of possible part combinations should be properly augmented, while the structural integrity of objects is preserved by the generative model prior. Second, we observe a lack of regulation over the information encoded in different concepts, which results in entangled and ambiguous concept learning and composition.


Dynamic Data Synthesis

To overcome data scarcity (for example, just two input images yield only two known part combinations), we propose Dynamic Data Synthesis. Each training batch contains a synthetic image, in which we randomly paste part combinations across the input examples and allow parts to overlap; this image encourages exploring possible part combinations. The batch also contains an instance image, sampled directly from the input examples with some parts randomly masked out; this image encourages preserving the structural information of objects. Both images are generated on the fly throughout training, as sketched below.
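To make the construction concrete, below is a minimal sketch of how such a batch could be assembled. It assumes each input example is stored as an RGB array plus per-part binary masks at the canvas resolution; the data layout, jitter range, and helper names are illustrative assumptions, not the released implementation.

import random
import numpy as np

def synthesize_batch(examples, canvas_size=(512, 512)):
    # Assumed layout (illustrative): each example is a dict with an RGB
    # "image" array and a dict of per-part binary "masks", all at canvas size.

    # Synthetic image: randomly paste part subsets from every example onto one
    # canvas, with random offsets so that parts may overlap.
    canvas = np.zeros((*canvas_size, 3), dtype=np.uint8)
    pasted_masks = {}
    for ex in examples:
        part_names = list(ex["masks"])
        chosen = random.sample(part_names, k=random.randint(1, len(part_names)))
        for part in chosen:
            dy, dx = random.randint(-32, 32), random.randint(-32, 32)
            mask = np.roll(ex["masks"][part].astype(bool), (dy, dx), axis=(0, 1))
            image = np.roll(ex["image"], (dy, dx), axis=(0, 1))
            canvas[mask] = image[mask]
            pasted_masks[part] = mask

    # Instance image: one real example with a random subset of parts masked out,
    # preserving the object's overall structure for the remaining parts.
    inst = random.choice(examples)
    part_names = list(inst["masks"])
    kept = random.sample(part_names, k=random.randint(1, len(part_names)))
    instance = inst["image"].copy()
    for part in part_names:
        if part not in kept:
            instance[inst["masks"][part].astype(bool)] = 0

    return {"synthetic": (canvas, pasted_masks), "instance": (instance, kept)}

Because the batch is re-synthesized at every training step, the model sees a different part combination each time rather than only the handful of combinations present in the input examples.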

Maximize Mutual Information

To disentangle part-level concepts in the generative latents, we adopt a mutual information maximization framework inspired by InfoGAN [5]: we maximize the mutual information between concept combinations and denoised latents to ensure strong disentanglement and composition capability. Specifically, we encourage the denoised latent \( \tilde{z} \) to retain semantic information about the concept codes \( \mathbf{c} \) by maximizing:

\[ I(\mathbf{c}; \tilde{z}) = H(\mathbf{c}) - H(\mathbf{c} \mid \tilde{z}) \]

Since direct computation is intractable, we approximate the true posterior with a neural network-based concept predictor \( Q(\mathbf{c} \mid \tilde{z}) \), leading to the variational lower bound:

\[ \mathcal{I}_{\text{lower}} = \mathbb{E}_{\mathbf{c}, \tilde{z}} [\log Q(\mathbf{c} \mid \tilde{z})] + H(\mathbf{c}) \]
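This is a lower bound by the standard InfoGAN argument [5]: writing \( P(\mathbf{c} \mid \tilde{z}) \) for the true posterior,

\[ I(\mathbf{c}; \tilde{z}) = H(\mathbf{c}) + \mathbb{E}_{\mathbf{c}, \tilde{z}} [\log Q(\mathbf{c} \mid \tilde{z})] + \mathbb{E}_{\tilde{z}} \left[ D_{\mathrm{KL}}\big( P(\cdot \mid \tilde{z}) \,\|\, Q(\cdot \mid \tilde{z}) \big) \right] \geq \mathcal{I}_{\text{lower}}, \]

since the KL term is non-negative, with equality when \( Q \) matches the true posterior.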

We minimize the negative lower bound:

\[ \mathcal{L}_{\text{Info}} = - \mathbb{E}_{\mathbf{c}, \tilde{z}} [\log Q(\mathbf{c} \mid \tilde{z})] \]

Our concept predictor comprises two heads: a classification head penalized by \( \mathcal{L}_{\text{CLS}} \) for incorrect concept composition, and a segmentation head penalized by \( \mathcal{L}_{\text{SEG}} \) for localization errors. Both heads are jointly trained with the generative model, guiding it to synthesize disentangled, interpretable part compositions.
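A minimal sketch of such a two-headed predictor and its loss is shown below, assuming a small convolutional backbone over the denoised latent; the architecture, label format, and the equal weighting of the two terms are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptPredictor(nn.Module):
    # Sketch of Q(c | z~): a shared backbone over the denoised latent feeding
    # a multi-label classification head and a per-pixel segmentation head.
    def __init__(self, latent_channels=4, num_concepts=8, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
        )
        # Which concepts are present in the composition (multi-label).
        self.cls_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(hidden, num_concepts)
        )
        # Where each concept is located (extra channel for background).
        self.seg_head = nn.Conv2d(hidden, num_concepts + 1, kernel_size=1)

    def forward(self, z_denoised):
        feats = self.backbone(z_denoised)
        return self.cls_head(feats), self.seg_head(feats)

def info_loss(cls_logits, seg_logits, concept_labels, seg_labels):
    # concept_labels: (B, num_concepts) multi-hot floats from the synthesized batch.
    # seg_labels: (B, H, W) integer part labels at latent resolution.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, concept_labels)
    l_seg = F.cross_entropy(seg_logits, seg_labels)
    return l_cls + l_seg  # L_Info ~ L_CLS + L_SEG (weighting assumed)

In training, this loss would be added to the diffusion objective so that gradients flow into both the predictor and the generative model, reflecting the joint training described above.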

Comparison Results

We compare our method with state-of-the-art methods in visual concept learning: Break-a-Scene [1], PartCraft [2], MuDI [3], and PiT [4].

Flexibility of our method

We demonstrate the flexibility and scalability of our pipeline in three aspects: encoding background concepts together with part-level concepts, handling incomplete part-combination prompts, and scaling to learn a large number of concepts from more than two single-image examples.

BibTeX

@article{liu2025partcomposer,
  title={PartComposer: Learning and Composing Part-Level Concepts from Single-Image Examples},
  author={Liu, Junyu and Jones, R Kenny and Ritchie, Daniel},
  journal={arXiv preprint arXiv:2506.03004},
  year={2025}
}

References

[1] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. 2023. Break-a-scene: Extracting multiple concepts from a single image. In SIGGRAPH Asia 2023 Conference Papers. 1–12.

[2] Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. 2024. PartCraft: Crafting Creative Objects by Parts. In European Conference on Computer Vision. Springer, 420–437.

[3] Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. 2024. Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models. In Advances in Neural Information Processing Systems (NeurIPS 2024).

[4] Elad Richardson, Kfir Goldberg, Yuval Alaluf, and Daniel Cohen-Or. 2025. Piece it Together: Part-Based Concepting with IP-Priors. arXiv preprint arXiv:2503.10365 (2025).

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29 (2016).