WordRobe: Text-Guided Generation of Textured 3D Garments

Anonymous Authors

Method Overview

We propose 'WordRobe', a method to generate different types of 3D garments with openings (armholes, necklines, etc.) and diverse textures via user-friendly text prompts. To achieve this, we incorporate three novel components in WordRobe: a 3D garment latent space (Ω) that encodes unposed 3D garments as latent codes; a Mapping Network (MLPmap) that predicts a garment latent code from an input text prompt; and text-guided texture synthesis that generates high-quality, diverse texture maps for the 3D garments. The figure below provides an overview of the proposed method. At inference time, given an input text prompt, we first obtain its CLIP embedding ψ, which is passed to MLPmap to obtain the latent code ϕ ∈ Ω. We then perform two-step latent decoding of ϕ to generate the 3D garment as a UDF and extract a UV-parametrized mesh representation of it. Finally, we perform text-guided texture synthesis in a single feed-forward step by leveraging ControlNet to obtain the textured 3D garment mesh.
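The inference pipeline above can be sketched in code. This is a minimal illustration only: the module names, layer sizes, and the toy stand-ins for the two decoding stages are assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of WordRobe's inference flow:
# text prompt -> CLIP embedding psi -> MLPmap -> latent code phi -> UDF.
import torch
import torch.nn as nn

CLIP_DIM, LATENT_DIM = 512, 128  # assumed embedding/latent sizes


class MappingNetwork(nn.Module):
    """MLPmap: maps a CLIP text embedding psi to a garment latent code phi."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CLIP_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, psi):
        return self.net(psi)


def generate_garment(psi, mlp_map, decode_coarse, decode_fine):
    phi = mlp_map(psi)           # latent code phi in the garment space Omega
    coarse = decode_coarse(phi)  # first stage of the two-step decoding
    udf = decode_fine(coarse)    # refined unsigned distance field values
    return udf                   # mesh extraction + texturing would follow


# toy stand-ins for the two decoder stages, for illustration only
decode_coarse = nn.Linear(LATENT_DIM, 64)
decode_fine = nn.Linear(64, 32)
psi = torch.randn(1, CLIP_DIM)   # stand-in for a CLIP text embedding
udf = generate_garment(psi, MappingNetwork(), decode_coarse, decode_fine)
```

In the actual method, the decoder outputs UDF values at queried 3D points and the mesh is then extracted from the zero-level set; here the decoders are plain linear layers just to make the data flow concrete.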


Text-Driven 3D Garment Generation & Editing

Composition of 3D Garments

Text-Driven Latent Editing

Sketch Guided Generation

Sketch + Text

3D Garment Extraction from Images

3D Garment Latent Space

WordRobe generates high-quality unposed 3D garment meshes with photorealistic textures from user-friendly text prompts. We achieve this by first learning a latent space of 3D garments using a novel two-stage encoder-decoder framework in a coarse-to-fine manner, representing the 3D garments as unsigned distance fields (UDFs). We also introduce an additional loss function to further disentangle the latent space, promoting better interpolation.
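Because the latent space is encouraged to be disentangled, garments can be blended by simple linear interpolation between latent codes, as in the interpolation demo below. A minimal sketch (the code dimensions and endpoint codes are placeholders):

```python
# Linear interpolation between two garment latent codes in the
# learned space Omega; a disentangled space makes the intermediate
# codes decode to plausible garments.
import numpy as np

LATENT_DIM = 128  # assumed latent size


def interpolate_codes(phi_a, phi_b, steps=5):
    """Return `steps` codes evenly spaced between phi_a and phi_b."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * phi_a + t * phi_b for t in ts]


# placeholder endpoints standing in for encoded garments
phi_skirt = np.zeros(LATENT_DIM)     # e.g. "skirt with wavy hemline"
phi_top = np.ones(LATENT_DIM)        # e.g. "crop-top with droopy sleeves"
path = interpolate_codes(phi_skirt, phi_top, steps=5)
```

Each intermediate code would be passed through the two-stage decoder to obtain an in-between 3D garment.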


3D Garment Interpolation

Interpolation start: "skirt with wavy hemline"

Interpolation end: "crop-top with droopy sleeves"


Mapping CLIP to Garment Latent Space

Once the garment latent space is learned, we train a mapping network to predict garment latent codes from CLIP embeddings. This allows CLIP-guided exploration of the latent space, enabling text-driven 3D garment generation and editing. To train this mapping network, we develop a novel weakly-supervised training scheme that eliminates the need for manual text annotations.
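One training step for such a mapping network can be sketched as follows. This sketch assumes the network is supervised by regressing the garment encoder's latent codes from CLIP embeddings; using CLIP image embeddings of garment renders in place of text embeddings (via CLIP's shared image-text space) is one way to avoid manual captions, and is an assumption here rather than the paper's stated scheme.

```python
# Hypothetical weakly-supervised training step for the mapping network:
# regress the garment encoder's latent code from a CLIP embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM, LATENT_DIM = 512, 128  # assumed sizes

mlp_map = nn.Sequential(
    nn.Linear(CLIP_DIM, 256), nn.ReLU(),
    nn.Linear(256, LATENT_DIM),
)
opt = torch.optim.Adam(mlp_map.parameters(), lr=1e-4)


def train_step(clip_emb, target_latent):
    """One optimization step: predict the latent code and minimize MSE
    against the code produced by the pre-trained garment encoder."""
    opt.zero_grad()
    pred = mlp_map(clip_emb)
    loss = F.mse_loss(pred, target_latent)
    loss.backward()
    opt.step()
    return loss.item()


clip_emb = torch.randn(8, CLIP_DIM)     # e.g. embeddings of garment renders
target = torch.randn(8, LATENT_DIM)     # latents from the garment encoder
loss = train_step(clip_emb, target)
```

At inference, text prompts replace the render embeddings, which is what makes the CLIP joint space useful for weak supervision.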


Results

Comparison with Text2Tex

Comparison with Text-to-3D Methods

Simulations

Hassle-free simulation of generated 3D textured garments in Blender.