When Preference Labels Fall Short:
Aligning Diffusion Models from Real Data

ICML 2026

Weiyan Chen¹, Weijian Deng², Yao Xiao¹, Weijie Tu³,
Ziyi Dong¹, Ibrahim Radwan⁴, Liang Lin¹,⁵, Pengxu Wei¹,⁵
¹Sun Yat-sen University   ²Tsinghua Shenzhen International Graduate School, Tsinghua University
³Australian National University   ⁴University of Canberra   ⁵Peng Cheng Laboratory

Abstract

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions for label-efficient alignment strategies.

When Do Preference Labels Fall Short?

Most existing preference alignment methods rely on pairwise comparisons between generated samples, where one image is labeled as preferred over another. While effective, this formulation carries two limitations that are easy to overlook.

Problem 1.  Supervision is bounded by the generators.

The supervision signal is inherently constrained by the quality of the generators that produce the candidates. Even samples labeled as preferred may still contain artifacts, lack realism, or exhibit limited stylistic diversity. The model can therefore only learn to favor the less flawed of two imperfect options, rather than what a genuinely desirable output looks like.

Figure 1. Preference pairs from Pick-a-Pic v2. The left group shows preferred images with local generation artifacts, while the right presents preferred images with unnatural global color. These cases highlight limitations of preference-based supervision in capturing holistic image quality.

Problem 2.  Existing objectives trade realism against diversity.

Different alignment objectives improve different aspects, but rarely deliver balanced gains. Objectives that target specific visual properties (e.g., smoothness or texture consistency) do not consistently improve overall realism across diverse prompts, while reward-based approaches reach higher preference scores but tend to collapse toward more uniform stylistic patterns.

[Figure 2 panels: (a) Preference pairs, realism versus texture detail on SD-1.5; (b) Reward model, human preference versus stylization on SD-3.5-M]
Figure 2. Analysis of preference- and reward-based alignment behaviors. (a) Comparison of realism and texture detail across different preference-based methods on SD-1.5. Methods optimized using pairwise preferences often improve specific visual aspects (e.g., smoothness or texture consistency) but do not consistently yield balanced gains in overall realism across diverse prompts. (b) Comparison of human preference and stylization on SD-3.5-M. Reward-based approaches achieve higher preference scores but tend to produce more uniform stylistic patterns.
Texture Detail: Laplacian variance of the generated images. Realism: SGP-PickScore, defined as the difference between PickScore evaluated on prompts prefixed with “Realistic photo” and “CG render”. Stylization: stylization score on OneIG-Bench. Human Preference Score: normalized average of HPSv3 and UnifiedReward evaluated on DrawBench.
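
As a rough illustration of the two simplest metrics above, the sketch below computes a Laplacian-variance texture score and a prefix-difference realism score. It is deliberately dependency-free and makes several assumptions not stated in the source: images are grayscale nested lists, the Laplacian uses the 4-neighbour kernel over interior pixels only, `score_fn` is a hypothetical stand-in for a PickScore-style scorer, and the way the prefix is joined to the prompt is a guess.

```python
from statistics import pvariance
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def laplacian_variance(img: Image) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    h, w = len(img), len(img[0])
    if h < 3 or w < 3:
        raise ValueError("image must be at least 3x3")
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
        - 4.0 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def sgp_score(score_fn: Callable[[str, Image], float], prompt: str, img: Image) -> float:
    """Score under a 'Realistic photo' prefix minus score under a 'CG render' prefix.

    score_fn is a hypothetical scorer; the prefix formatting is an assumption.
    """
    realistic = score_fn(f"Realistic photo, {prompt}", img)
    rendered = score_fn(f"CG render, {prompt}", img)
    return realistic - rendered


# A flat image has no texture; a checkerboard has a lot.
flat = [[0.5] * 6 for _ in range(6)]
checker = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
```

Under this definition a smoother image scores lower on texture detail, which is why the metric is reported next to realism and not on its own.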

Preference Signals from Real Data

Real-Data Curation for Preference Alignment

We present a data curation strategy that constructs structured supervision signals by contrasting real images with controlled variations, without using explicit preference labels. The idea is to first identify a set of images that represent desirable visual properties, and then introduce controlled degradations to create informative contrasts. This lets preference-related signals be derived directly from real data, while keeping the learning process grounded and interpretable.

Figure 3. Examples of preference pairs derived from real images. Red contours indicate salient regions where controlled inpainting introduces localized artifacts. The original images act as preferred references, while the degraded counterparts expose interpretable deviations in texture, structure, or semantics, providing effective supervision for preference alignment without labeling.
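
The curation idea can be sketched as a function that pairs each real image with a locally degraded copy. This is a minimal sketch, not the pipeline used here: the source degrades salient regions with controlled inpainting, whereas the stand-in below simply flattens a rectangular region to its mean intensity. The names `degrade_region` and `build_pair` and the box format are hypothetical.

```python
from typing import Dict, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def degrade_region(img: Image, box: Box) -> Image:
    """Return a copy of img whose box region is flattened to its mean intensity.

    Stand-in for controlled inpainting: it removes local texture inside the
    region and leaves every pixel outside the region untouched.
    """
    top, left, bottom, right = box
    region = [img[y][x] for y in range(top, bottom) for x in range(left, right)]
    if not region:
        raise ValueError("box selects no pixels")
    mean = sum(region) / len(region)
    out = [row[:] for row in img]  # copy so the real image is never modified
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = mean
    return out


def build_pair(img: Image, box: Box) -> Dict[str, Image]:
    """The real image is the preferred sample; its degraded copy is rejected."""
    return {"preferred": img, "rejected": degrade_region(img, box)}


real = [[float((x * 7 + y * 3) % 5) for x in range(4)] for y in range(4)]
pair = build_pair(real, (1, 1, 3, 3))
```

Because the rejected sample differs from the preferred one only inside a known region, the resulting contrast is localized and easy to inspect, which is the property the caption above attributes to the inpainted pairs.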

Preference Alignment with Real-Data-Based Signals

A practical consideration is that real images and their perturbed versions may differ from the model's initial generation distribution, which can make direct preference optimization less stable. To account for this, we adopt a two-stage alignment strategy that incorporates real-data-based signals gradually.

Stage 1 · Distribution Alignment with Real Images

The first stage moves the model closer to the distribution represented by the reference images. Using a Diffusion-DRO (inverse reinforcement learning) objective, the model is trained so that real reference images become more likely under its own distribution than under a frozen reference model, while generated samples are pushed in the opposite direction. This warms up the model toward the realistic, high-quality region described by the curated data before any explicit preference comparison is introduced.
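
One way to write the ranking described in this paragraph uses the denoising error as a proxy for likelihood. The notation is an assumption introduced here for illustration (noised sample $x_t$, sampled noise $\epsilon$, trainable denoiser $\epsilon_\theta$, frozen reference $\epsilon_{\mathrm{ref}}$, temperature $\beta$), and the exact Diffusion-DRO objective may differ in weighting and sampling details.

```latex
\[
\Delta_\theta(x) = \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2
                 - \lVert \epsilon - \epsilon_{\mathrm{ref}}(x_t, t) \rVert_2^2,
\qquad
\mathcal{L}_{\text{stage 1}} = -\,\mathbb{E}\Big[ \log \sigma\big( -\beta\,
  [\, \Delta_\theta(x^{\mathrm{real}}) - \Delta_\theta(x^{\mathrm{gen}}) \,] \big) \Big]
\]
```

Minimizing this loss lowers $\Delta_\theta$ on real images, so the model denoises them better than the reference does, and raises it on the model's own generated samples.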

Stage 2 · Preference Learning with Constructed Contrastive Samples

The second stage, warm-started from Stage 1, introduces the structured contrasts built during curation. With a Diffusion-DPO objective, each preferred real image is compared against its controlled degradation, teaching the model to favor the interpretable, higher-quality reference over its perturbed counterpart. Because both stages draw their signal entirely from real data and controlled perturbations, the whole pipeline aligns the model without any manually annotated preference labels.
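
A Diffusion-DPO-style loss for a single real/degraded pair can be sketched from precomputed denoising errors. This is an illustrative sketch under assumptions: each `err_*` argument is taken to be a squared noise-prediction error at one sampled timestep, the timestep weighting is folded into `beta`, and the function and argument names are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_style_loss(
    err_model_real: float,
    err_ref_real: float,
    err_model_deg: float,
    err_ref_deg: float,
    beta: float,
) -> float:
    """Preference loss with the real image as winner and its degradation as loser.

    Each gap measures how much worse the trainable model denoises a sample
    than the frozen reference does; the loss falls when the gap on the real
    image is smaller than the gap on the degraded one.
    """
    real_gap = err_model_real - err_ref_real
    deg_gap = err_model_deg - err_ref_deg
    return -log_sigmoid(-beta * (real_gap - deg_gap))


# With identical errors everywhere, the model is indifferent and the loss is log 2.
neutral = dpo_style_loss(1.0, 1.0, 1.0, 1.0, beta=1.0)
# Denoising the real image better than the reference lowers the loss.
improved = dpo_style_loss(0.5, 1.0, 1.0, 1.0, beta=1.0)
```

In a training loop the four errors would come from running the trainable and frozen denoisers on noised versions of the preferred and rejected images, with the loss averaged over a batch.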

Experiments

Effectiveness of Real-Data-Based Preference Signals

Real-data-based supervision aligns diffusion models effectively, reaching quality comparable to methods that rely on manually annotated preference pairs.

Table 1. Method comparison. Relative improvements are shown as (+gain). ImgRwd: ImageReward [ImageReward]; UniRwd: UnifiedReward [UnifiedReward]; Aes: LAION aesthetic classifier [Aesthetic]. † denotes our re-implementation, trained on Pick-a-Pic v2 [PickScore] using official code. Best and second-best results are in bold and underlined, respectively.

Figure 4. User study on SD-3.5-M. Following the protocol of Diffusion-DRO, we randomly sample 60 prompts from HPDv2 and ask users to compare our fine-tuned SD-3.5-M with baselines. Across both comparisons, users prefer our model.

Complementarity with Existing Preference Alignment Models

Real-data-based alignment is complementary to existing preference-based methods: combining the two yields further gains across benchmarks and backbones.

Figure 5. Complementarity with existing preference alignment models. Top row: quantitative results on Pick-a-Pic v2 using SD-1.5 as the base model; real-data-based supervision is integrated with Diffusion-DPO. Bottom row: quantitative results on DrawBench using SD-3.5-M as the base model; real-data-based supervision is used as a complementary post-training step on top of FlowGRPO.

Qualitative Comparison

Figure 6. Qualitative comparison based on SD-1.5. Post-training with real data produces images with improved visual realism and richer texture details. When applied as an additional post-training stage on top of Diffusion-DPO, it also improves the visual realism of the resulting generations. Prompts from top to bottom: (1) a plant. (2) a woman sitting on a table drinking coffee, long shot, wide shot, highly detailed, intricate, professional photography, RAW color, night shot, bokeh, sharp focus, taken with EOS 5D, UHD 8K. (3) a corgi's head.

Figure 7. Qualitative comparison based on SD-3.5-M. Compared to FlowGRPO, post-training with real data yields more realistic lighting and more natural color distributions. When applied as an additional post-training step on top of FlowGRPO, it further alleviates stylistic homogenization. Prompts from top to bottom: (1) a fire hydrant. (2) a person. (3) Anime illustration of Gundam mech suit on Pixiv.