DICEPTION: A Generalist Diffusion Model for Vision Perception

¹Zhejiang University,  ²Shanghai AI Laboratory
*Equal Contribution

 


Results of Our Generalist Model

One single model solves multiple perception tasks, on par with SOTA.

Tasks shown: monocular depth estimation, normal estimation, entity segmentation, pose estimation, semantic segmentation, and point-prompted image segmentation.

 


Comparison with SAM-vit-h

We achieve results on par with SAM-vit-h using only 0.06% of their data (600K vs. 1B pixel-level annotated images).
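The data-efficiency figure follows directly from the two counts quoted above; a quick arithmetic check:

```python
# Sanity-check the data-efficiency claim quoted in the text:
# 600K pixel-level annotated images for DICEPTION vs. 1B for SAM-vit-h.
ours = 600_000          # DICEPTION training annotations (from the text)
sam = 1_000_000_000     # SAM-vit-h training annotations (from the text)

fraction = ours / sam
print(f"{fraction:.2%}")  # → 0.06%
```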

SAM-vit-h shows a clear advantage only on certain out-of-distribution datasets that fall outside the scope of our model's training, such as the WoodScape fisheye dataset.

 

1-Point Comparison (columns, left to right: input, ours, SAM)

5-Point Comparison (columns, left to right: input, ours, SAM)

DICEPTION preserves fine details in segmentation, such as hair.

 


Strong Performance on Multiple Tasks with a Single Model and Small Data

DICEPTION achieves strong performance without carefully cherry-picking extremely high-quality data or relying on an exceptionally large amount of data.

 


Adaptation with Minimal Overhead

Even with minimal trainable parameters (~1%) and as few as 50 training samples, the fine-tuned model produces high-quality results.

Image Highlighting

Medical Segmentation
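The ~1% trainable-parameter figure is consistent with low-rank adapter (LoRA-style) fine-tuning; a back-of-envelope sketch, where the layer width and adapter rank below are hypothetical values chosen only for illustration:

```python
# Back-of-envelope: fraction of weights that are trainable when a dense
# d x d linear layer is frozen and adapted with a rank-r low-rank update
# (LoRA-style). d and r are hypothetical, not taken from the paper.
def lora_fraction(d: int, r: int) -> float:
    full = d * d          # frozen dense weight W (d x d)
    adapter = 2 * d * r   # trainable factors A (d x r) and B (r x d)
    return adapter / full

print(f"{lora_fraction(d=4096, r=16):.2%}")  # → 0.78%
```

At this width and rank the adapter holds well under 1% of the layer's weights, so a ~1% overall trainable fraction is plausible once a few such layers are adapted.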