Modeling and producing lifelike clothed human images has attracted researchers from different areas for decades, owing to the complexity of highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, but are limited by the accuracy of the modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, yet they still lack controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging the generative power of diffusion models on the controlled basis of rendering. We introduce a novel framework to translate rendered images into their realistic counterparts, consisting of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained Text-to-Image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image, with a Texture-preserving Attention Control (TAC) that preserves fine-grained clothing textures by exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.
In this paper, we propose a novel framework based on diffusion models for rendered-to-real fashion image translation. Our main idea has two aspects. First, we leverage the abundant generative prior of pretrained Text-to-Image (T2I) diffusion models and apply a simple adaptation so that realistic image generation is guided by a distilled rendered prior. Second, we adopt a texture-preserving mechanism that extracts spatial image structure through attention from an inversion pipeline.
To achieve this, we design a diffusion-based method consisting of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). During DKI, we first finetune a pretrained T2I diffusion model on real fashion photos with captions derived from BLIP, adapting its ability to generate high-quality images to our target domain. After this adaptation, we guide the image generation away from the rendered effect, i.e., in its negative direction. Inspired by Textual Inversion, we distill a general rendered "concept" from thousands of rendered fashion images by training a negative domain embedding vector on top of the adapted base model.
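The sketch below illustrates how the DKI stage could look on top of a Stable Diffusion backbone using the diffusers and transformers libraries. It is only a hedged approximation of the description above: the BLIP captioner prepares prompts for the positive (real) domain finetuning (which uses the same denoising loss as shown, but updates the UNet), while negative_embedding_step learns only the negative domain embedding for a placeholder token "<rendered>" in Textual-Inversion style on rendered fashion images. The model IDs, prompt template, placeholder name, and hyperparameters are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
vae, unet = pipe.vae, pipe.unet
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Caption real fashion photos with BLIP to build prompts for positive-domain finetuning.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

def caption_image(pil_image):
    inputs = blip_proc(images=pil_image, return_tensors="pt").to(device)
    return blip_proc.decode(blip.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True)

# Register a placeholder token that will hold the negative (rendered) concept.
placeholder = "<rendered>"  # hypothetical token name, not from the paper
tokenizer.add_tokens([placeholder])
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))
token_embeds = text_encoder.get_input_embeddings().weight

# Freeze everything; only the new embedding row will effectively be updated.
for module in (vae, unet, text_encoder):
    module.requires_grad_(False)
token_embeds.requires_grad_(True)
optimizer = torch.optim.AdamW([token_embeds], lr=5e-4, weight_decay=0.0)

def negative_embedding_step(rendered_pixels):
    """One Textual-Inversion-style step on a batch of rendered images in [-1, 1]."""
    latents = vae.encode(rendered_pixels.to(device)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    prompts = [f"a photo in the style of {placeholder}"] * latents.shape[0]
    ids = tokenizer(prompts, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.to(device)
    pred = unet(noisy_latents, t, encoder_hidden_states=text_encoder(ids)[0]).sample
    loss = F.mse_loss(pred, noise)

    loss.backward()
    # Zero gradients for every embedding row except the placeholder's.
    mask = torch.zeros_like(token_embeds.grad)
    mask[placeholder_id] = 1.0
    token_embeds.grad.mul_(mask)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

After enough steps over the rendered image set, the learned "<rendered>" embedding summarizes the rendered domain and can later be supplied as negative guidance during generation.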
During RIG, we employ a DDIM inversion pipeline to first invert a rendered image into a latent noise map, and then generate its corresponding real image using the previously learned embedding as negative guidance. Similar to recent training-free control methods in T2I generation, we find that the attention maps in the shallow layers of the UNet contain rich spatial image structure and can be used for fine-grained texture preservation during generation. Specifically, we inject the queries and keys of the self-attention from the rendered-image inversion and generation pipeline into the rendered-to-real generation pipeline, which largely improves the consistency of intricate clothing texture details.
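A hedged sketch of the RIG stage is shown below, reusing the finetuned pipeline and the "<rendered>" embedding from the previous sketch. The QKInjectionProcessor is a simplified stand-in for the Texture-preserving Attention Control: it records self-attention queries and keys while reconstructing the rendered image from its inverted noise, and re-injects them while generating the realistic image. The helper names (ddim_invert, rendered_to_real), the choice of shallow layers, the prompts, and the step counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler
from diffusers.models.attention_processor import AttnProcessor2_0


class QKInjectionProcessor:
    """Record self-attention Q/K in one pass, replace them in a later pass."""

    def __init__(self, shared_state):
        self.state = shared_state  # {"mode": "off" | "record" | "inject", "cache": [...]}

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        query, key, value = attn.to_q(hidden_states), attn.to_k(context), attn.to_v(context)

        if self.state["mode"] == "record":
            self.state["cache"].append((query.detach(), key.detach()))
        elif self.state["mode"] == "inject" and self.state["cache"]:
            q_src, k_src = self.state["cache"].pop(0)
            rep = query.shape[0] // q_src.shape[0]  # broadcast when CFG doubles the batch
            query, key = q_src.repeat(rep, 1, 1), k_src.repeat(rep, 1, 1)

        query, key, value = (attn.head_to_batch_dim(t) for t in (query, key, value))
        probs = attn.get_attention_scores(query, key, attention_mask)
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        return attn.to_out[1](attn.to_out[0](out))  # output projection + dropout


@torch.no_grad()
def ddim_invert(pipe, image, prompt, num_steps=50):
    """Invert a preprocessed (1, 3, H, W) image in [-1, 1] into a DDIM noise latent."""
    inverse = DDIMInverseScheduler.from_config(pipe.scheduler.config)
    inverse.set_timesteps(num_steps, device=pipe.device)
    latents = pipe.vae.encode(image).latent_dist.mean * pipe.vae.config.scaling_factor
    ids = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(pipe.device)
    cond = pipe.text_encoder(ids)[0]
    for t in inverse.timesteps:
        noise_pred = pipe.unet(latents, t, encoder_hidden_states=cond).sample
        latents = inverse.step(noise_pred, t, latents).prev_sample
    return latents


@torch.no_grad()
def rendered_to_real(pipe, rendered_image, prompt, num_steps=50):
    state = {"mode": "off", "cache": []}
    # Attach the injection processor only to self-attention ("attn1") layers in the
    # shallow, high-resolution UNet blocks; all other layers keep the default processor.
    processors = {}
    for name in pipe.unet.attn_processors:
        shallow = any(k in name for k in ("down_blocks.0", "down_blocks.1",
                                          "up_blocks.2", "up_blocks.3"))
        processors[name] = QKInjectionProcessor(state) if ("attn1" in name and shallow) else AttnProcessor2_0()
    pipe.unet.set_attn_processor(processors)
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

    noise = ddim_invert(pipe, rendered_image, prompt, num_steps)

    # Pass 1: reconstruct the rendered image and record Q/K at every step and layer.
    state["mode"] = "record"
    pipe(prompt=prompt, latents=noise, num_inference_steps=num_steps, guidance_scale=1.0)

    # Pass 2: regenerate from the same noise, steering away from the rendered domain
    # with the learned embedding and injecting the recorded Q/K for texture consistency.
    state["mode"] = "inject"
    result = pipe(prompt=prompt, negative_prompt="a photo in the style of <rendered>",
                  latents=noise, num_inference_steps=num_steps, guidance_scale=7.5)
    return result.images[0]
```

Because both passes run the same number of denoising steps over the same layers, the recorded Q/K tensors line up one-to-one with the attention calls of the rendered-to-real pass, which is what keeps the injected structure aligned with the clothing textures of the input.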
@misc{hu2024fashionr2r,
      title={FashionR2R: Texture-preserving Rendered-to-Real Image Translation with Diffusion Models},
      author={Rui Hu and Qian He and Gaofeng He and Jiedong Zhuang and Huang Chen and Huafeng Liu and Huamin Wang},
      year={2024},
      eprint={2410.14429},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.14429},
}