When Preference Labels Fall Short:
Aligning Diffusion Models from Real Data

ICML 2026

Weiyan Chen¹, Weijian Deng², Yao Xiao¹, Weijie Tu³,

Ziyi Dong¹, Ibrahim Radwan⁴, Liang Lin¹,⁵, Pengxu Wei¹,⁵

¹Sun Yat-sen University ²Tsinghua Shenzhen International Graduate School, Tsinghua University

³Australian National University ⁴University of Canberra ⁵Peng Cheng Laboratory


Abstract

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment, and they point toward label-efficient alignment strategies.

When Do Preference Labels Fall Short?

Most existing preference alignment methods rely on pairwise comparisons between generated samples, where one image is labeled as preferred over another. While effective, this formulation carries two limitations that are easy to overlook.

Problem 1. Supervision is bounded by the generators.

The supervision signal is inherently constrained by the quality of the generators that produce the candidates. Even samples labeled as preferred may still contain artifacts, lack realism, or exhibit limited stylistic diversity. The model can therefore only learn to favor the less flawed of two imperfect options, rather than what a genuinely desirable output looks like.

Figure 1. **Preference pairs from Pick-a-Pic v2.** The left group shows preferred images with local generation artifacts, while the right group shows preferred images with unnatural global color. These cases highlight the limitations of preference-based supervision in capturing holistic image quality.

Problem 2. Existing objectives trade realism against diversity.

Different alignment objectives improve different aspects, but rarely deliver balanced gains. Objectives that target specific visual properties (e.g., smoothness or texture consistency) do not consistently improve overall realism across diverse prompts, while reward-based approaches reach higher preference scores but tend to collapse toward more uniform stylistic patterns.

Figure panels: realism versus texture detail on SD-1.5; human preference versus stylization on SD-3.5-M.

Preference Signals from Real Data

Real-Data Curation for Preference Alignment

We present a data curation strategy that constructs structured supervision signals by contrasting real images with controlled variations, without using explicit preference labels. The idea is to first identify a set of images that represent desirable visual properties, and then introduce controlled degradations to create informative contrasts. This lets preference-related signals be derived directly from real data, while keeping the learning process grounded and interpretable.

Figure 3. **Examples of preference pairs derived from real images.** Red contours indicate salient regions where controlled inpainting introduces localized artifacts. The original images act as preferred references, while the degraded counterparts expose interpretable deviations in texture, structure, or semantics, providing effective supervision for preference alignment without labeling.
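The pairing step can be sketched in a few lines. This is a minimal illustration, not the curation pipeline itself: the function name `make_preference_pair`, the `mask` argument, and `noise_std` are hypothetical, and the degradation (mean fill plus Gaussian noise inside the mask) is only a stand-in for the controlled inpainting described above.

```python
import numpy as np

def make_preference_pair(image, mask, rng, noise_std=0.25):
    """Return (preferred, rejected) built from one real image.

    image: float array in [0, 1], shape (H, W, C).
    mask:  bool array, shape (H, W); True marks the region to degrade.
    The degradation below is a hypothetical stand-in for controlled
    inpainting: the masked region is replaced by its mean colour plus noise.
    """
    preferred = image.copy()          # the real image is the preferred sample
    rejected = image.copy()
    region_mean = image[mask].mean(axis=0)                    # shape (C,)
    noise = rng.normal(0.0, noise_std,
                       size=(int(mask.sum()), image.shape[2]))
    rejected[mask] = np.clip(region_mean + noise, 0.0, 1.0)   # localized change
    return preferred, rejected

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))         # toy "real" image
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True                 # toy "salient region"
preferred, rejected = make_preference_pair(image, mask, rng)
```

Only the masked region differs between the two images, so the contrast is local: the pair isolates the degradation instead of mixing it with unrelated differences.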

Preference Alignment with Real-Data-Based Signals

A practical consideration is that real images and their perturbed versions may differ from the model's initial generation distribution, which can make direct preference optimization less stable. To account for this, we adopt a two-stage alignment strategy that incorporates real-data-based signals gradually.

Stage 1 · Distribution Alignment with Real Images

The first stage moves the model closer to the distribution represented by the reference images. Using a Diffusion-DRO (inverse reinforcement learning) objective, the model is trained so that real reference images become more likely under its own distribution than under a frozen reference model, while generated samples are pushed in the opposite direction. This warms up the model toward the realistic, high-quality region described by the curated data before any explicit preference comparison is introduced.
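The direction of this update can be illustrated with a toy loss. This is not the Diffusion-DRO objective, whose exact form is not given here; it is a sketch under the assumption that a lower per-sample denoising error serves as a proxy for higher likelihood. The function name and all arguments are hypothetical.

```python
import numpy as np

def stage1_direction_loss(err_theta_real, err_ref_real,
                          err_theta_gen, err_ref_gen, beta=1.0):
    """Toy logistic loss for the Stage 1 direction (an assumption, not
    the Diffusion-DRO objective).

    err_theta_*: denoising error of the trainable model.
    err_ref_*:   denoising error of the frozen reference model.
    Lower error is treated as a proxy for higher likelihood.
    """
    # Positive when the model fits real images better than the reference.
    real_margin = np.asarray(err_ref_real) - np.asarray(err_theta_real)
    # Positive when the model fits generated samples worse than the reference.
    gen_margin = np.asarray(err_theta_gen) - np.asarray(err_ref_gen)
    # log(1 + exp(-z)) == -log(sigmoid(z)), computed stably.
    real_term = np.logaddexp(0.0, -beta * real_margin)
    gen_term = np.logaddexp(0.0, -beta * gen_margin)
    return float(np.mean(real_term) + np.mean(gen_term))

baseline = stage1_direction_loss([1.0], [1.0], [1.0], [1.0])
improved = stage1_direction_loss([0.5], [1.0], [1.5], [1.0])
```

With identical model and reference errors both margins are zero and the loss sits at its neutral value; it decreases as the model fits real images better and generated samples worse than the reference does.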

Stage 2 · Preference Learning with Constructed Contrastive Samples

The second stage, warm-started from Stage 1, introduces the structured contrasts built during curation. With a Diffusion-DPO objective, each preferred real image is compared against its controlled degradation, teaching the model to favor the interpretable, higher-quality reference over its perturbed counterpart. Because both stages draw their signal entirely from real data and controlled perturbations, the whole pipeline aligns the model without any manually annotated preference labels.
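For reference, the Diffusion-DPO loss compares how much the trainable model improves on the frozen reference for the preferred sample against the same quantity for the rejected sample. The sketch below assumes the per-sample squared noise-prediction errors have already been computed at a shared timestep, and folds the timestep weighting into a single constant `beta`; the function and argument names are illustrative.

```python
import numpy as np

def diffusion_dpo_loss(err_theta_w, err_ref_w, err_theta_l, err_ref_l, beta=1.0):
    """Batch-mean Diffusion-DPO-style loss from per-sample denoising errors.

    err_*_w: squared noise-prediction error on the preferred (real) image.
    err_*_l: squared error on the rejected (degraded) image.
    `theta` is the trainable model, `ref` the frozen reference.
    `beta` absorbs the timestep weighting (a simplifying assumption).
    """
    diff_w = np.asarray(err_theta_w) - np.asarray(err_ref_w)
    diff_l = np.asarray(err_theta_l) - np.asarray(err_ref_l)
    logits = -beta * (diff_w - diff_l)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -logits)))

neutral = diffusion_dpo_loss([1.0], [1.0], [1.0], [1.0])
better = diffusion_dpo_loss([0.5], [1.0], [1.0], [1.0])
```

The loss falls below its neutral value of log 2 when the model reduces its error on the real image more than on the degraded one, which is the behaviour Stage 2 is meant to encourage.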

Experiments

Effectiveness of Real-Data-Based Preference Signals

Real-data-based supervision aligns diffusion models effectively, reaching quality comparable to methods that rely on manually annotated preference pairs.

Table 1. **Method comparison.** Relative improvements are shown as (+gain). ImgRwd: ImageReward [ImageReward]; UniRwd: UnifiedReward [UnifiedReward]; Aes: LAION aesthetic classifier [Aesthetic]. † denotes our re-implementation, trained on Pick-a-Pic v2 [PickScore] using the official code. Best and second-best results are in **bold** and underlined, respectively.

Figure 4. **User study on SD-3.5-M.** Following the protocol of Diffusion-DRO, we randomly sample 60 prompts from HPDv2 and ask users to compare our fine-tuned SD-3.5-M with baselines. Across both comparisons, users prefer our model.

Complementarity with Existing Preference Alignment Models

Real-data-based alignment is complementary to existing preference-based methods: combining the two yields further gains across benchmarks and backbones.
