Motivation
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to auto-regressive (AR) large language models. However, current dLLMs typically adopt a decoding strategy that mirrors AR models: they rely solely on the sequence predicted in the final denoising step as the answer and discard all intermediate predictions. This greedy decoding strategy ignores the rich temporal dynamics of the dLLM sampling process. In our analysis on widely used mathematical benchmark datasets, a consistent and significant discrepancy emerges between two key metrics: the final pass rate, which measures the accuracy of the final output, and the ever-pass rate, which captures how often a correct answer appears at any intermediate decoding step. This discrepancy shows that correct predictions made in early iterations are frequently overwritten or lost in later stages. This insight motivates two complementary methods that leverage temporal dynamics: 1) Temporal Majority Voting, a test-time decoding strategy that aggregates predictions across sampling steps; and 2) Temporal Consistency Reinforcement, a post-training approach designed to encourage stable and consistent generations.
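For concreteness, the following minimal sketch shows how the two metrics can be computed from per-question decoding trajectories; the data layout (a list of extracted intermediate answers per question) and the exact-match comparison are illustrative assumptions, not the exact evaluation harness.

```python
# Minimal sketch, assuming trajectories[i] holds the decoded intermediate
# answers [x_0^1, ..., x_0^T] for question i and gold[i] its reference answer.

def final_pass_rate(trajectories, gold):
    """Fraction of questions whose final-step answer is correct."""
    hits = sum(traj[-1] == g for traj, g in zip(trajectories, gold))
    return hits / len(trajectories)

def ever_pass_rate(trajectories, gold):
    """Fraction of questions answered correctly at any denoising step."""
    hits = sum(any(a == g for a in traj) for traj, g in zip(trajectories, gold))
    return hits / len(trajectories)

# The gap ever_pass_rate - final_pass_rate quantifies how often a correct
# intermediate answer is overwritten or lost before the final step.
```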

Abstract
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten later. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Majority Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable generations without ground-truth supervision. Our experiments across several benchmarks demonstrate that these approaches significantly improve accuracy and reliability. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
Methodology
Temporal Self-Consistency Voting
Because correct answers often appear at intermediate denoising steps, relying solely on the final prediction risks discarding better intermediate outputs. To address this, we propose a method that aggregates predictions across timesteps through a weighted voting mechanism. Formally, given a diffusion sampling trajectory \(\left\{x_0^t\right\}_{t=1}^{T}\), our method selects the final answer \(a^*\) according to a weighted vote over all timesteps: $$ a^* = \arg \max _a \sum_{t=1}^T f(t) \cdot \mathbb{1} \left(\text{meaning}\left(x_0^t\right) = a\right). $$ Here, \(\mathbb{1}(\cdot)\) is the indicator function that returns 1 if the decoded meaning of \(x_0^t\) matches the candidate answer \(a\), and \(f(t)\) is a weighting function over timesteps. Since prediction accuracy generally improves as the sampling step increases (i.e., as the diffusion step decreases), we design \(f(t)\) to be a monotonically decreasing function of the diffusion step \(t\). In experiments, we explore three weighting schemes: constant, linear decay, and exponential decay.
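To make the voting rule concrete, here is a minimal sketch of temporal majority voting over a list of extracted intermediate answers. The answer extraction is assumed to happen upstream, and the specific decay shapes (e.g., the exponential rate `alpha`) are illustrative choices rather than the exact experimental settings.

```python
import math
from collections import defaultdict

def temporal_majority_vote(answers, scheme="exp", alpha=0.1):
    """Weighted vote over intermediate answers, listed in generation order
    (earliest denoising step first, final step last)."""
    T = len(answers)

    def weight(step):
        t = T - step + 1                  # diffusion step t: large early, small late
        if scheme == "constant":
            return 1.0                    # f(t) = 1
        if scheme == "linear":
            return (T - t + 1) / T        # f(t) decays linearly in t
        if scheme == "exp":
            return math.exp(-alpha * t)   # f(t) decays exponentially in t
        raise ValueError(f"unknown scheme: {scheme}")

    scores = defaultdict(float)
    for step, ans in enumerate(answers, start=1):
        scores[ans] += weight(step)       # weighted indicator vote for this answer
    return max(scores, key=scores.get)    # a* = arg max_a sum_t f(t) * 1(meaning = a)
```

Under the linear and exponential schemes, later predictions (smaller diffusion step \(t\)) receive larger weights, matching the design of \(f(t)\) above.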
Post-Training with Temporal Semantic Entropy
We introduce a metric called Temporal Semantic Entropy (TSE), which captures the distribution of semantic variation in the answers generated at each step of the iterative denoising process. During decoding, we obtain a sequence of \(T\) intermediate answers, denoted by \(\{x_0^t\}_{t=1}^{T}\). We group these answers into clusters based on their semantic meaning, forming a set of semantic clusters \(\mathcal{C} = \{C_1, C_2, \ldots, C_K\}\), where each cluster \(C_k = \{x_0^t : \text{meaning}(x_0^t) = k\}\) contains all answers with equivalent semantics (outcome equal to \(k\)). We then define the TSE of a trajectory as: $$ \operatorname{TSE}\left(\left\{x_0^t\right\}_{t=1}^{T}\right)=-\sum_{C_k}\left(\left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right] \log \left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right]\right), $$ which quantifies the uncertainty in the semantic content of the answers over the decoding steps. In our analysis, we observe that high temporal semantic entropy can serve as a signal of model uncertainty. We therefore propose a post-training approach designed to encourage temporal consistency in model outputs: we define the reward \(r_i\) as the negative Temporal Semantic Entropy of response \(o_i\), i.e., \(r_i = -\text{TSE}(o_i)\). This reward encourages outputs whose intermediate token representations maintain semantic stability throughout the generation process, thereby promoting temporal semantic consistency.
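The following minimal sketch computes TSE from a list of extracted intermediate answers, assuming a uniform probability \(p(x_0^t) = 1/T\) over the trajectory and exact-match comparison as a stand-in for the semantic clustering \(\text{meaning}(\cdot)\); both are illustrative simplifications rather than the exact training setup.

```python
import math
from collections import Counter

def temporal_semantic_entropy(answers):
    """TSE over one trajectory of T intermediate answers, with p(x_0^t) = 1/T."""
    T = len(answers)
    cluster_mass = Counter()
    for ans in answers:
        cluster_mass[ans] += 1.0 / T       # probability mass of cluster C_k
    return -sum(p * math.log(p) for p in cluster_mass.values())

def tse_reward(answers):
    """Reward r_i = -TSE(o_i): semantically stable trajectories score higher."""
    return -temporal_semantic_entropy(answers)

# Example: an oscillating trajectory is penalized relative to a stable one.
stable      = ["42", "42", "42", "42"]     # TSE = 0      -> reward 0
oscillating = ["17", "42", "17", "42"]     # TSE = log 2  -> reward ~ -0.69
assert tse_reward(stable) > tse_reward(oscillating)
```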
Experimental Results
Temporal Voting Results
Performance of temporal majority voting on mathematics benchmarks. Within each model group, the "Δ over baseline" row reports the improvement over the corresponding baseline.
| Model | Method | GSM8K 128 | GSM8K 256 | GSM8K 512 | MATH500 128 | MATH500 256 | MATH500 512 | SVAMP 128 | SVAMP 256 | SVAMP 512 | Countdown 128 | Countdown 256 | Countdown 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | baseline | 68.5 | 76.3 | 78.2 | 27.4 | 33.4 | 35.8 | 84.0 | 83.3 | 84.7 | 20.3 | 21.5 | 16.4 |
| + Temporal Voting | Fixed Weighting | 68.0 | 73.4 | 78.3 | 26.6 | 30.8 | 34.2 | 87.0 | 84.3 | 84.3 | 22.7 | 18.8 | 11.3 |
| | Linear Weighting | 70.0 | 78.0 | 78.8 | 28.0 | 34.4 | 34.6 | 87.0 | 84.3 | 84.3 | 24.2 | 21.9 | 16.0 |
| | Exp. Weighting | 70.1 | 78.7 | 78.9 | 28.4 | 35.6 | 36.2 | 86.0 | 84.3 | 84.7 | 25.0 | 23.4 | 16.4 |
| | Δ over baseline | +1.6 | +2.4 | +0.7 | +1.0 | +2.2 | +0.4 | +2.0 | +1.0 | +0.0 | +4.7 | +1.9 | +0.0 |
| LLaDA-1.5 | baseline | 69.8 | 79.4 | 81.1 | 29.0 | 32.4 | 35.4 | 85.3 | 86.3 | 83.3 | 21.5 | 21.1 | 20.7 |
| + Temporal Voting | Fixed Weighting | 68.8 | 75.7 | 80.3 | 27.3 | 30.8 | 34.6 | 87.3 | 85.3 | 84.0 | 23.4 | 22.3 | 18.8 |
| | Linear Weighting | 71.0 | 79.8 | 81.0 | 29.2 | 32.8 | 35.8 | 86.0 | 87.0 | 84.0 | 24.2 | 23.4 | 19.1 |
| | Exp. Weighting | 70.7 | 79.8 | 81.1 | 29.0 | 33.2 | 36.2 | 85.7 | 87.7 | 84.3 | 26.2 | 25.0 | 21.1 |
| | Δ over baseline | +1.9 | +0.4 | +0.0 | +0.0 | +0.8 | +0.8 | +0.4 | +1.4 | +1.0 | +4.7 | +3.9 | +0.4 |
Post-Training with Temporal Semantic Entropy Results
Performance of reinforcement fine-tuning on mathematics benchmarks. Within each model group, the "Δ over SFT" row reports the improvement of our NTSE-reward method over the SFT baseline.
| Model | Method | GSM8K 128 | GSM8K 256 | GSM8K 512 | MATH500 128 | MATH500 256 | MATH500 512 | SVAMP 128 | SVAMP 256 | SVAMP 512 | Countdown 128 | Countdown 256 | Countdown 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | SFT on s1k | 70.2 | 78.7 | 80.1 | 23.8 | 34.4 | 36.8 | 81.7 | 83.3 | 82.7 | 21.5 | 19.9 | 21.5 |
| + RFT | d1 (w/ GT) | 71.7 | 78.3 | 82.3 | 31.0 | 36.0 | 40.4 | 85.7 | 88.0 | 88.7 | 34.8 | 35.5 | 37.9 |
| | ours (NTSE reward) | 72.2 | 78.8 | 80.2 | 30.6 | 34.6 | 38.0 | 84.3 | 89.0 | 88.7 | 38.6 | 53.5 | 44.9 |
| | Δ over SFT | +2.0 | +0.1 | +0.1 | +6.8 | +0.2 | +1.2 | +2.6 | +5.7 | +6.0 | +17.1 | +33.6 | +23.4 |
| LLaDA-1.5 | SFT on s1k | 69.8 | 79.4 | 81.1 | 29.0 | 32.4 | 35.4 | 85.3 | 86.3 | 83.3 | 21.5 | 21.1 | 20.7 |
| + RFT | d1 (w/ GT) | 70.6 | 78.5 | 81.4 | 29.6 | 34.2 | 39.2 | 85.0 | 88.0 | 88.3 | 32.8 | 25.8 | 39.5 |
| | ours (NTSE reward) | 70.3 | 79.5 | 81.4 | 29.2 | 35.6 | 39.0 | 86.0 | 88.0 | 88.7 | 34.0 | 48.8 | 49.5 |
| | Δ over SFT | +0.5 | +0.1 | +0.3 | +0.2 | +3.2 | +3.6 | +0.7 | +1.7 | +5.4 | +12.5 | +27.7 | +28.8 |
Analysis

Key Insights
Temporal Dynamics
We conduct an in-depth analysis of diffusion-based large language models (dLLMs), uncovering the underexplored potential of temporal dynamics. Our findings highlight that the evolving intermediate states during the sampling process carry rich semantic signals that can be leveraged to enhance both reasoning performance and model robustness.
Temporal Majority Voting
A training-free test-time decoding strategy that aggregates predictions across denoising steps and selects the most consistent output, considerably improving accuracy with negligible computational overhead.
Temporal Semantic Entropy
We introduce a metric called Temporal Semantic Entropy, which captures the distribution of semantic variation in the answers generated at each step of the iterative denoising process. It serves as an indicator of the semantic stability of the model's outputs.
Temporal Consistency Reinforcement
A reinforcement learning-based post-training method that uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable and consistent generations. Importantly, leveraging negative TSE as the reward enables performance improvements without requiring ground-truth labels for reward computation.
Citation
todo
Acknowledgements
We would like to thank Muzhi Zhu, Canyu Zhao, and Linhao Zhong at Zhejiang University for their valuable discussions and insightful feedback. We also acknowledge the template from Ximing Xing that helped us build this project homepage.