Dec 10, 2024 23:00:00

'TRELLIS' is a 3D generation AI model that can automatically generate versatile, high-quality 3D assets from text and images.

A joint research team from Tsinghua University, University of Science and Technology of China, and Microsoft Research has announced a new 3D generation AI model called ' TRELLIS ' that can automatically generate versatile, high-quality 3D assets from text input. TRELLIS uses a new method called 'SLAT (Structured LATents)'.

TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation

https://trellis3d.github.io/

[2412.01506] Structured 3D Latents for Scalable and Versatile 3D Generation
https://arxiv.org/abs/2412.01506

The innovation of SLAT is that it 'achieves efficient data processing by focusing only on the surfaces of 3D objects, and can generate multiple formats of 3D assets from a single representation,' making it easier to generate high-quality 3D assets.

SLAT uses a sparse 3D grid of 64 ^{^3} = approximately 260,000 voxels as its underlying structure, although only about 20,000 of these voxels are actually used, located on the surface of the 3D object.

Each voxel provides two pieces of information: location information, which indicates where the voxel is located in 3D space, and feature information, which describes the shape, color, texture, etc. of that location. SLAT uses an image recognition model called DINOv2 to extract features from images obtained by observing a 3D object from various angles, and collects and averages the features corresponding to the position of each voxel to obtain them. It then generates different types of 3D models from the obtained data, and further allows you to select the optimal format depending on the application. It is also easy to edit to change only specific parts.

The model developed to generate 3D assets using this SLAT representation is TRELLIS. TRELLIS has been developed in three model sizes: Basic (342 million parameters), Large (1.1 billion parameters), and X-Large (2 billion parameters), and has been trained using 64 A100 GPUs with 400,000 steps and a batch size of 256. According to the research team, the generation quality improves as the model size increases.

The text is converted into features through CLIP , and a 3D grid is generated using a custom-developed Rectified Flow Transformer. The research team says that this approach is more efficient than general diffusion models and is suitable for conditioning the generation of text and images.

The 3D model generated from the text actually generated by GPT-4 is as follows.

The TRELLIS logo created with TRELLIS looks like this.

Additionally, the demo published by Hugging Face makes it possible to generate 3D assets from images.

TRELLIS - a Hugging Face Space by JeffreyXiang

https://huggingface.co/spaces/JeffreyXiang/TRELLIS

In the 'Examples' section at the bottom of the demo page, there are a number of example images to input. This time, we chose an image of a house.