MARBLE: Multi-Aspect Reward BaLancE for Diffusion RL

A gradient-space framework that preserves reward-specific supervision throughout optimization, enabling simultaneous improvement across all reward dimensions in multi-reward diffusion RL.
Canyu Zhao1, Hao Chen1, Yunze Tong1, Yu Qiao2, Jiacheng Li2, Chunhua Shen1,3,✉
1Zhejiang University  ·  2Hithink  ·  3Zhejiang University of Technology
✉ Corresponding author
Figure 1. Comparison of multi-reward training paradigms. Left: Training one model per reward requires maintaining multiple models. Middle: Sequential training produces a single model but demands handcrafted stage schedules. Right: MARBLE trains a single model on all rewards simultaneously with minimal manual effort.
TL;DR. MARBLE harmonizes reward-specific policy gradients into a single update direction, simultaneously improving all K rewards in one training run — no manual reward weighting, no multi-stage curriculum, and at near single-reward training cost.

Abstract

Reinforcement learning (RL) fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, image assessment is inherently multi-dimensional, and the different evaluation criteria must be optimized jointly. Existing practice handles multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x) = \sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model jointly trained on all rewards or require heavily hand-tuned sequential training. We identify weighted-sum reward aggregation as a key source of this failure. The underlying issue is a sample-level mismatch: most rollouts are specialist samples that are highly informative for certain reward dimensions while being inapplicable or uninformative for others, so weighted summation systematically dilutes their supervision. To preserve this signal, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space framework that maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manual reward weighting. We further propose an amortized formulation that exploits the affine structure of the DiffusionNFT loss to reduce the per-step cost from $K{+}1$ backward passes to near the single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine from negative (in 80% of mini-batches under weighted summation) to consistently positive, and runs at 0.97× the training speed of the weighted-sum baseline.

Key Ideas

Three techniques that make multi-reward diffusion RL practical.

Gradient Space

Gradient Harmonization

Per-reward policy gradients are normalized and harmonized via MGDA to find a common descent direction, preserving each reward's learning signal instead of collapsing the rewards into a single scalar.
Efficiency

Amortized Formulation

Because the NFT loss depends on advantages only through an affine mapping, K per-reward backward passes collapse into a single pass with combined advantages, reducing the per-step cost to nearly that of single-reward training.
Stability

Coefficient Smoothing

An EMA on the harmonization coefficients prevents a transient weak batch from suppressing a reward across an entire amortization window, keeping the cached coefficients stable between full MGDA refreshes.

Contributions

What MARBLE brings to multi-reward diffusion RL.

01

The specialist sample problem

We identify that rollout samples are typically informative for only a subset of reward dimensions, making standard scalar reward aggregation intrinsically inefficient for multi-reward training.

02

Gradient-space reward balancing

MARBLE decomposes per-reward advantages, computes reward-specific gradients, and harmonizes them via MGDA with a normalize-and-rescale procedure that removes cross-reward scale disparities while preserving a natural gradient magnitude.

03

Amortized multi-reward training

By exploiting the affine structure of the DiffusionNFT loss, we prove that the convex combination of K per-reward gradients equals a single backward pass with combined advantages (see the identity sketched after this list), enabling cached MGDA weights with near-zero overhead.

04

Simultaneous five-reward improvement

On SD3.5 Medium with PickScore, HPSv2, CLIPScore, OCR, and GenEval, MARBLE improves all reward dimensions simultaneously — weighted-sum baselines consistently fail on specialist rewards.
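To make the amortization claim in item 03 concrete, the identity below is a short sketch of why a convex combination of per-reward gradients collapses into a single backward pass. It assumes only the property stated above, namely that the per-sample loss is affine in its advantage; the notation $\mathcal{L}(\theta; A) = A\,\ell(\theta) + c(\theta)$ is ours.

```latex
% Sketch: the loss is affine in the advantage, L(theta; A) = A*l(theta) + c(theta),
% and the MGDA weights form a convex combination (w_k >= 0, sum_k w_k = 1).
\sum_{k=1}^{K} w_k \,\nabla_\theta \mathcal{L}(\theta; A_k)
  = \nabla_\theta \sum_{k=1}^{K} w_k \bigl( A_k\,\ell(\theta) + c(\theta) \bigr)
  = \nabla_\theta \Bigl( \Bigl(\sum_{k=1}^{K} w_k A_k\Bigr)\,\ell(\theta) + c(\theta) \Bigr)
  = \nabla_\theta \mathcal{L}\Bigl(\theta;\, \sum_{k=1}^{K} w_k A_k\Bigr).
```

Hence one backward pass on the combined advantage $\sum_k w_k A_k$ reproduces the weighted combination of the K per-reward gradients.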

Why Multi-Reward Diffusion RL Is Hard

Two failure modes of weighted-sum reward aggregation that motivate MARBLE's gradient-space design.

1. Specialist samples dilute reward signal

Most rollouts carry strong signal for only a subset of reward dimensions and are uninformative or inapplicable for the rest — an image of a cat carries no signal for OCR rewards, and a generation with strong text rendering may be only average aesthetically. Under $R(x) = \sum_k w_k R_k(x)$, the value of such a specialist sample is diluted by the unrelated dimensions, and the resulting advantage no longer reflects the dimension on which the sample is genuinely useful.

Sample-level specialist structure. Each column denotes one sample, each row shows its per-reward $z$-score advantage. High advantages concentrate on source-specific rewards (OCR, GenEval). Few samples achieve positive advantages across all dimensions — confirming that scalar reward aggregation washes out specialist signal.
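The $z$-score advantages in the heatmap can be computed per reward dimension rather than on the aggregated scalar. The sketch below contrasts the two views; the function name, tensor shapes, and uniform weights are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def group_advantages(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """z-score each column of `scores` over a rollout group (dim 0)."""
    return (scores - scores.mean(dim=0, keepdim=True)) / (scores.std(dim=0, keepdim=True) + eps)

# rewards: (G, K) raw scores from K reward models for G rollouts of the same prompt.
rewards = torch.rand(16, 5)                        # placeholder scores
w = torch.full((5,), 1.0 / 5.0)                    # uniform weighted-sum aggregation

A_per_reward = group_advantages(rewards)           # (G, K): one advantage per reward, kept separate
A_scalar = group_advantages(rewards @ w[:, None])  # (G, 1): the single weighted-sum advantage
# MARBLE works with the K columns of A_per_reward (one policy gradient each), whereas a
# weighted-sum baseline only sees A_scalar, in which a specialist sample's strong signal
# on one dimension is averaged against its mediocre scores on the others.
```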

2. Weighted-sum updates fight individual rewards most of the time

We empirically confirm the dilution at the gradient level: across multi-reward rollouts on SD3.5 Medium, the weighted-sum update direction is anti-aligned with at least one reward's gradient in 80% of mini-batches, meaning the shared update actively pushes against some reward most of the time. MARBLE's harmonized direction eliminates this conflict while keeping the average alignment essentially unchanged, and balances the rewards more evenly.

80% → 0%: mini-batches with at least one reward gradient anti-aligned with the update direction (weighted-sum → MARBLE).
−0.13 → +0.37: worst-reward gradient cosine $\min_k \cos(d, g_k)$.
0.161 → 0.006: across-reward variance $\mathrm{var}_k \cos(d, g_k)$; MARBLE balances the rewards more evenly.
Update-direction harmony. Per-batch $\min_k$, $\mathrm{mean}_k$, and $\mathrm{var}_k$ of $\{\cos(d, g_k)\}$ for the weighted-sum direction (top) and the MARBLE harmonized direction (bottom). MARBLE keeps the worst-reward cosine consistently positive across all measured mini-batches.
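These statistics can be reproduced with a small diagnostic on flattened gradients. In the sketch below, `d` is any candidate update direction (weighted-sum or harmonized) and `per_reward_grads` holds one flattened gradient per reward; the names are ours.

```python
import torch
import torch.nn.functional as F

def harmony_stats(d: torch.Tensor, per_reward_grads: list[torch.Tensor]):
    """min_k, mean_k, var_k of cos(d, g_k) for one mini-batch.

    A negative min means the update direction d is anti-aligned with at least one
    reward's gradient; a small variance means the rewards are treated evenly.
    """
    cos = torch.stack([F.cosine_similarity(d, g, dim=0) for g in per_reward_grads])
    return cos.min().item(), cos.mean().item(), cos.var().item()
```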

Method Overview

From per-reward scoring to harmonized gradient updates.

Overview of MARBLE. Given a prompt batch, the shared model generates images scored by K reward models independently (Step 1). Per-reward policy gradients are computed via separate backpropagation passes (Step 2). A gradient harmonization solver (MGDA) finds a common descent direction that balances all reward objectives (Step 3), and the shared model is updated accordingly.
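The sketch below illustrates Steps 2 and 3 on flattened per-reward gradients, using the standard Frank–Wolfe solver for the MGDA min-norm problem. The normalize-and-rescale choice shown here (unit-normalize each gradient, then rescale the harmonized direction to the mean raw-gradient norm) is our own reading of the method summary, not a verbatim implementation.

```python
import torch

def min_norm_weights(grads: torch.Tensor, iters: int = 25) -> torch.Tensor:
    """Frank-Wolfe solver for the MGDA min-norm problem:
    simplex weights w minimizing ||sum_k w_k g_k||^2 over the rows of `grads` (K, D)."""
    K = grads.shape[0]
    w = torch.full((K,), 1.0 / K)
    for _ in range(iters):
        g = w @ grads                              # current combination, shape (D,)
        t = torch.argmin(grads @ g)                # reward whose gradient conflicts most
        diff = g - grads[t]
        denom = diff.dot(diff)
        if denom < 1e-12:                          # already at the min-norm point
            break
        gamma = torch.clamp(diff.dot(g) / denom, 0.0, 1.0)  # exact line search on [0, 1]
        w = (1.0 - gamma) * w
        w[t] += gamma
    return w

def harmonize(per_reward_grads: list[torch.Tensor]) -> torch.Tensor:
    """Normalize-and-rescale harmonization (sketch): returns a single update direction."""
    raw = torch.stack(per_reward_grads)            # (K, D) flattened per-reward gradients
    norms = raw.norm(dim=1, keepdim=True) + 1e-12
    unit = raw / norms                             # remove cross-reward scale disparities
    w = min_norm_weights(unit)                     # MGDA coefficients on the simplex
    d = w @ unit                                   # common descent direction
    return d * norms.mean() / (d.norm() + 1e-12)   # one way to restore a natural magnitude
```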

Quantitative Results

SD3.5 Medium fine-tuned with five rewards: PickScore, CLIPScore, HPSv2.1, OCR, and GenEval. Rows prefixed with "+" are RL fine-tunes of SD3.5-M; each method's in-domain (training) rewards are those named in its row label, or all five training rewards for the multi-reward methods.

GenEval and OCR are rule-based metrics; PickScore, CLIPScore, HPSv2.1, Aesthetic, ImgRwd, and UniRwd are model-based.

| Model | GenEval | OCR | PickScore | CLIPScore | HPSv2.1 | Aesthetic | ImgRwd | UniRwd | Composite ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SD-XL | 0.55 | 0.14 | 22.42 | 0.287 | 0.280 | 5.60 | 0.76 | 2.93 | −0.46 |
| SD3.5-L | 0.71 | 0.68 | 22.91 | 0.289 | 0.288 | 5.50 | 0.96 | 3.25 | +0.12 |
| FLUX.1-Dev | 0.66 | 0.59 | 22.84 | **0.295** | 0.274 | 5.71 | 0.96 | 3.27 | +0.10 |
| SD3.5-M (w/o CFG) | 0.24 | 0.12 | 20.51 | 0.237 | 0.204 | 5.13 | −0.58 | 2.02 | −2.32 |
| SD3.5-M + CFG | 0.63 | 0.59 | 22.34 | 0.285 | 0.279 | 5.36 | 0.85 | 3.03 | −0.26 |
| + FlowGRPO (GenEval) | **0.95** | 0.66 | 22.51 | 0.293 | 0.274 | 5.32 | 1.06 | 3.18 | +0.12 |
| + FlowGRPO (OCR) | 0.66 | 0.92 | 22.41 | 0.290 | 0.280 | 5.32 | 0.95 | 3.15 | +0.01 |
| + FlowGRPO (PickScore) | 0.54 | 0.68 | 23.50 | 0.280 | 0.316 | 5.90 | 1.29 | 3.37 | +0.36 |
| + DiffusionNFT (sequential) | 0.94 | 0.91 | **23.80** | 0.293 | 0.331 | 6.01 | 1.49 | 3.49 | +1.02 |
| + DiffusionNFT (weighted-sum) | 0.92 | 0.91 | 21.53 | 0.267 | 0.300 | 6.15 | 1.16 | 3.04 | +0.18 |
| + MARBLE (ours) | 0.94 | **0.96** | 22.83 | 0.286 | **0.355** | **6.59** | **1.53** | **3.52** | **+1.12** |

Bold = best per column. Composite is the per-row mean of column-wise z-scores (each metric standardized to zero mean and unit variance across the rows of the table) and aggregates performance over all eight metrics; higher is better. Reading: MARBLE achieves the best score on OCR, HPSv2.1, Aesthetic, ImageReward, and UniReward — five out of eight metrics — in a single model, and the highest Composite score overall. The weighted-sum baseline collapses on PickScore and CLIPScore, while sequential training matches MARBLE only by training in stages with a hand-crafted curriculum.

Training Cost

Measured on 8×H200 with five rewards (K = 5). Speed and memory are normalized by the weighted-sum DiffusionNFT baseline. Amortization (N = 10) brings full per-reward harmonization from 0.56× back to 0.97× the baseline throughput, with only a 1.14× memory bump.

| Method | Relative speed ↑ | GPU memory |
|---|---|---|
| Weighted Sum (DiffusionNFT, K = 5) | 1.00× | 59 GB (1.00×) |
| MARBLE w/o amortization (K = 5) | 0.56× | 67 GB (1.14×) |
| MARBLE w/ amortization (K = 5, N = 10) | 0.97× | 67 GB (1.14×) |

Reading: per-reward gradient harmonization at every step costs $K{+}1$ backward passes, dropping throughput to $0.56\times$. Amortizing the full MGDA solve over $N{=}10$ steps reduces the average per-step cost to $(K{+}N)/N \approx 1.5\times$ backward passes — nearly recovering single-reward training speed at no quality cost (see Quantitative Results above and the EMA decay $\rho$ ablation below).
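A minimal sketch of the amortized step, assuming per-reward advantages `A` of shape (B, K) and cached harmonization weights `w`; the names follow the description above, the affine-loss property is the one stated in the Contributions, and the loss call in the comment is hypothetical.

```python
import torch

def combined_advantage(A: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Collapse per-reward advantages A (B, K) into one advantage per sample with the
    cached harmonization weights w (K,). Because the loss is affine in the advantage,
    a single backward pass on this combined advantage reproduces the w-weighted
    combination of the K per-reward gradients."""
    return A @ w

# Amortized schedule (sketch): run a full MGDA solve (K + 1 backward passes) only every
# N-th step; every other step reuses the cached w and costs one backward pass, i.e.
# (K + N) / N backward passes per step on average (1.5x for K = 5, N = 10).
# loss = nft_loss(outputs, advantages=combined_advantage(A, w))   # hypothetical loss call
```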

Qualitative Comparison

MARBLE produces images that satisfy multiple reward dimensions simultaneously, whereas weighted-sum baselines drop one or more aspects.

Qualitative comparison on prompts that span text rendering, object counting, attribute binding, and aesthetic quality. MARBLE renders correct text (OCR), follows compositional constraints (GenEval), and preserves visual quality in the same model; weighted-sum baselines visibly fail on at least one dimension per row.

More Comparisons

Side-by-side panels covering text rendering, attribute & spatial composition, and counting.

Across all panels MARBLE produces sharper images with fewer blur and distortion artifacts than DiffusionNFT, and reliably satisfies text rendering, attribute binding, spatial layout, and counting requirements where weighted-sum baselines drop one or more aspects.

Ablation: Coefficient Smoothing (EMA Decay ρ)

Between full MGDA solves, MARBLE EMA-smooths the cached harmonization coefficients with decay ρ. Smaller ρ tracks the latest gradient geometry but inherits batch noise; larger ρ smooths more but adapts slowly. ρ = 0.7 is the sweet spot across all reward dimensions.
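The smoothing rule itself is a single exponential moving average on the cached coefficients. The sketch below shows one plausible form, with ρ weighting the cached value as described above; its exact placement in the training loop is our assumption.

```python
import torch

def smooth_coefficients(w_cached: torch.Tensor, w_fresh: torch.Tensor, rho: float = 0.7) -> torch.Tensor:
    """Blend a freshly solved set of MGDA weights into the cache with decay rho.

    Small rho trusts the latest (noisy) solve; large rho keeps the cache inert.
    Between full solves the cached weights are used unchanged."""
    return rho * w_cached + (1.0 - rho) * w_fresh
```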

Columns as in the main results table (GenEval and OCR rule-based; the rest model-based).

| EMA decay ρ | GenEval | OCR | PickScore | CLIPScore | HPSv2.1 | Aesthetic | ImgRwd | UniRwd |
|---|---|---|---|---|---|---|---|---|
| ρ = 0.1 | 0.86 | 0.80 | 21.52 | 0.261 | 0.292 | 5.84 | 1.22 | 2.98 |
| ρ = 0.3 | 0.88 | 0.84 | 21.76 | 0.266 | 0.312 | 6.03 | 1.27 | 3.04 |
| ρ = 0.5 | 0.93 | 0.95 | 22.02 | 0.276 | 0.340 | 6.14 | 1.48 | 3.43 |
| ρ = 0.7 (default) | **0.94** | **0.96** | **22.83** | **0.286** | **0.355** | **6.59** | **1.53** | **3.52** |
| ρ = 0.9 | 0.90 | 0.89 | 22.14 | 0.272 | 0.342 | 6.26 | 1.47 | 3.40 |

Bold = best per column. Reading: ρ too small (0.1, 0.3) lets batch noise in; ρ too large (0.9) is overly inertial; ρ = 0.7 dominates all eight metrics simultaneously.

Training dynamics under different ρ. Per-reward curves over the course of training. ρ = 0.1, 0.3 are noisy and unstable; ρ = 0.9 adapts too slowly and stalls; ρ = 0.5, 0.7 show clean monotone progress, with ρ = 0.7 reaching the highest plateau across all five rewards simultaneously.

BibTeX

@article{zhao2026marblemultiaspectrewardbalance,
  title={MARBLE: Multi-Aspect Reward Balance for Diffusion RL},
  author={Canyu Zhao and Hao Chen and Yunze Tong and Yu Qiao and Jiacheng Li and Chunhua Shen},
  year={2026},
  eprint={2605.06507},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.06507},
}