MMGen: Unified Multi-modal Image Generation and Understanding in One Go

1The University of Hong Kong, 2The University of Sydney, 3AIsphere, 4Tsinghua University, 5Hong Kong University of Science and Technology, 6Texas A&M University
* equal contribution, † corresponding author

Abstract

A unified diffusion framework for multi-modal generation and understanding has transformative potential: it enables seamless, controllable image synthesis together with other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. These include: (1) multi-modal category-conditioned generation, where multi-modal outputs are produced simultaneously in a single inference pass, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces the corresponding RGB image and other aligned modalities, given a specific modality as the condition. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify these various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

Method overview

(1) MM Encoding: Given paired multi-modal images, we first use a shared pretrained VAE encoder to encode each modality into latent patch codes. (2) MM Diffusion: Patch codes corresponding to the same image location are grouped to form the multi-modal patch input x0, which is blended with random noise to create the diffusion input xt. Conditioned on the timestep t, category label y, and task embedding e_t, the MM Diffusion model iteratively predicts the velocity, yielding the denoised multi-modal patches x0_d. (3) MM Decoding: Finally, these patches are reprojected to their original image locations for each modality and decoded back into image pixels using the shared pretrained VAE decoder.
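To make the pipeline concrete, below is a minimal PyTorch-style sketch of one rectified-flow training step following the description above. The module and argument names (vae, model, encode, decode) and the channel-wise grouping of patch codes are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

# Hypothetical components (names are assumptions, not the released code):
#   vae   - shared pretrained VAE reused for every modality (vae.encode / vae.decode)
#   model - multi-modal diffusion transformer that predicts a velocity field

def mm_training_step(vae, model, images, category, task_emb):
    """images: dict mapping modality name (rgb, depth, normal, seg) to (B, 3, H, W) tensors."""
    with torch.no_grad():
        # (1) MM Encoding: encode every modality with the same VAE encoder.
        latents = {m: vae.encode(x) for m, x in images.items()}          # each (B, C, h, w)

    # Group the patch codes of all modalities at each spatial location
    # (channel-wise concatenation is used here as a simple stand-in).
    x0 = torch.cat([latents[m] for m in sorted(latents)], dim=1)         # (B, M*C, h, w)

    # (2) MM Diffusion: blend the clean latents with noise (rectified-flow interpolation).
    t = torch.rand(x0.shape[0], device=x0.device)                        # timestep in [0, 1]
    noise = torch.randn_like(x0)
    xt = (1.0 - t.view(-1, 1, 1, 1)) * x0 + t.view(-1, 1, 1, 1) * noise

    # The transformer predicts the velocity v = noise - x0, conditioned on t, y, and the task.
    v_pred = model(xt, timestep=t, category=category, task=task_emb)
    return F.mse_loss(v_pred, noise - x0)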


Results: Generation and Understanding

(a) Multi-modal category-conditioned generation

Given the category information, multi-modal images (i.e., RGB, depth, normal, and semantic segmentation) are generated simultaneously in a single diffusion process.
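As an illustration of such single-pass sampling, the sketch below integrates the predicted velocity with Euler steps and decodes every modality with the shared VAE. The names and the fixed modality order are assumptions, reusing the hypothetical modules from the training sketch above.

import torch

@torch.no_grad()
def sample_multimodal(vae, model, category, task_emb, shape, steps=50):
    """Jointly sample rgb/depth/normal/seg latents, then decode each with the shared VAE."""
    device = next(model.parameters()).device
    x = torch.randn(shape, device=device)                     # (B, M*C, h, w): pure noise at t = 1

    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = torch.full((shape[0],), float(ts[i]), device=device)
        v = model(x, timestep=t, category=category, task=task_emb)
        x = x + (ts[i + 1] - ts[i]) * v                        # Euler step from t toward 0

    # (3) MM Decoding: split the grouped channels back per modality and decode each one.
    groups = torch.chunk(x, 4, dim=1)                          # assumed order: rgb, depth, normal, seg
    return [vae.decode(g) for g in groups]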

(b) Multi-modal conditioned generation

Given a fine-grained condition input (e.g., depth or normal, highlighted by yellow rectangles), our model can accurately generate the corresponding RGB image and other aligned outputs in parallel; a code-level sketch of this conditioning follows the sub-task examples below.

(b.1) Depth-conditioned generation

(b.2) Normal-conditioned generation

(b.3) Segmentation-conditioned generation
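The exact conditioning mechanism is detailed in the paper; the sketch below assumes one plausible realization in which the condition modality's patch codes are clamped to their clean latent at every sampling step while the remaining modalities are denoised. Names and the modality ordering are again hypothetical.

import torch

@torch.no_grad()
def sample_conditioned(vae, model, cond_image, cond_index, category, task_emb, shape, steps=50):
    """Generate the remaining modalities given one clean modality (e.g. a depth map).

    cond_index selects the channel group holding the condition
    (assumed order: 0 = rgb, 1 = depth, 2 = normal, 3 = seg).
    """
    device = next(model.parameters()).device
    cond_latent = vae.encode(cond_image.to(device))            # clean latent of the condition modality

    x = torch.randn(shape, device=device)
    c = shape[1] // 4                                          # channels per modality group
    sl = slice(cond_index * c, (cond_index + 1) * c)

    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        x[:, sl] = cond_latent                                 # clamp the condition group to its clean latent
        t = torch.full((shape[0],), float(ts[i]), device=device)
        v = model(x, timestep=t, category=category, task=task_emb)
        x = x + (ts[i + 1] - ts[i]) * v

    x[:, sl] = cond_latent
    return [vae.decode(g) for g in torch.chunk(x, 4, dim=1)]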

(c) Multi-modal image visual understanding

Given a reference image (highlighted with yellow rectangles), our framework accurately estimates the associated depth, normal, and semantic segmentation results.

Visual understanding on the ImageNet-1k validation set.


Zero-shot results on the ScanNet dataset.


Applications

(a) Image-to-image translation

Given a reference image, MMGen can interpret it into three visual modalities simultaneously. Each predicted modality can then be fed back into MMGen as a condition to generate a new image.
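Assuming the hypothetical sample_conditioned sketch above, this translation pipeline is simply an understanding pass (RGB as the clean modality) followed by a re-conditioned generation pass:

def translate_image(vae, model, rgb_image, category, task_emb, latent_shape):
    """Image-to-image translation built on the hypothetical sample_conditioned sketch above."""
    # Stage 1: visual understanding, i.e. conditioned sampling with RGB as the clean modality.
    _, depth, normal, seg = sample_conditioned(vae, model, cond_image=rgb_image, cond_index=0,
                                               category=category, task_emb=task_emb,
                                               shape=latent_shape)
    # Stage 2: feed one predicted modality (here depth) back in as the condition for a new image.
    new_rgb, *_ = sample_conditioned(vae, model, cond_image=depth, cond_index=1,
                                     category=category, task_emb=task_emb,
                                     shape=latent_shape)
    return new_rgb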


(b) 3D reconstruction

Starting from a depth map, MMGen generates three visual modalities simultaneously; in particular, the segmentation can be used for 3D reconstruction of the foreground objects without running a separate segmentation model.
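For reference, lifting a depth map and a foreground segmentation mask to a 3D point cloud only requires standard pinhole unprojection. The snippet below is generic geometry (camera intrinsics fx, fy, cx, cy assumed known), not MMGen-specific code.

import numpy as np

def depth_to_pointcloud(depth, mask, fx, fy, cx, cy):
    """Unproject an (H, W) depth map to camera-space 3D points, keeping only foreground pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))             # pixel coordinates, each (H, W)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)                      # (H, W, 3)
    return points[mask > 0]                                    # keep only segmented foreground points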


(c) Adaptation to new modalities

To assess the feasibility of extending MMGen to new modalities, we conducted two experiments with Canny edge, a commonly used modality: (1) fine-tuning an existing modality to the new one (seg → canny); (2) adding an extra modality to support simultaneous generation of five modalities.
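As a note on data preparation for such an extension, Canny edge targets can be generated with OpenCV; the thresholds below are illustrative defaults, not necessarily the values used in the paper.

import cv2

def canny_target(rgb_path, low=100, high=200):
    """Build a 3-channel Canny edge image to pair with existing RGB training samples."""
    gray = cv2.imread(rgb_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, low, high)                         # (H, W) uint8 edge map
    return cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)             # replicate to 3 channels for the shared VAE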


BibTeX

@article{jiepeng2025mmgen,
  author    = {Wang, Jiepeng and Wang, Zhaoqing and Pan, Hao and Liu, Yuan and Yu, Dongdong and Wang, Changhu and Wang, Wenping},
  title     = {MMGen: Unified Multi-modal Image Generation and Understanding in One Go},
  journal   = {arXiv preprint arXiv:2503.20644},
  year      = {2025},
}