Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

1Zhejiang University    2Ant Group    3Zhejiang University of Technology    4Stanford University   
*Equal contribution    Corresponding author

Under Review



Motivation

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to auto-regressive (AR) large language models. However, current dLLMs typically adopt a decoding strategy that mirrors AR models: they rely solely on the sequence predicted at the final denoising step as the answer and discard all intermediate predictions. This greedy decoding strategy ignores the rich temporal dynamics of the dLLM sampling process. When we analyze two metrics on widely used mathematical benchmarks, a consistent and significant discrepancy emerges between them. The first metric, the final pass rate, measures the accuracy of the final output; the second, the ever-pass rate, captures how often a correct answer appears at any intermediate decoding step. The gap between them shows that correct predictions made in early iterations are frequently overwritten or lost in later stages. This insight motivates two complementary methods that leverage temporal dynamics: 1) Temporal Majority Voting, a test-time decoding strategy that aggregates predictions across sampling steps, and 2) Temporal Consistency Reinforcement, a post-training approach designed to encourage stable and consistent generations.
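
As a concrete illustration of how these two metrics can be computed, the sketch below scores a set of saved decoding trajectories; the trajectory format and the is_correct answer checker are illustrative assumptions, not part of our released code.

```python
from typing import Callable, Sequence

def pass_rates(
    trajectories: Sequence[Sequence[str]],   # one list of intermediate answers per problem, in decoding order
    ground_truths: Sequence[str],
    is_correct: Callable[[str, str], bool],  # e.g., exact match after answer extraction (hypothetical checker)
) -> tuple:
    """Return (final pass rate, ever-pass rate) over a set of problems."""
    final_hits, ever_hits = 0, 0
    for answers, gt in zip(trajectories, ground_truths):
        if is_correct(answers[-1], gt):               # only the last denoising step counts
            final_hits += 1
        if any(is_correct(a, gt) for a in answers):   # correct at any intermediate step
            ever_hits += 1
    n = len(trajectories)
    return final_hits / n, ever_hits / n
```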

Illustration of temporal oscillation during sampling. (a) Across four datasets, a significant gap is observed between the final answer's pass rate and the ever-pass rate at any intermediate step. This gap reveals the phenomenon we refer to as temporal oscillation, where correct intermediate answers are sometimes overwritten or lost as generation proceeds. (b) Example of temporal oscillation: for a given math problem, the model initially gives the correct answer, 25, at an intermediate step (e.g., timestep 55), matching the ground truth. However, by the final timestep, this correct answer is replaced with an incorrect one: 2.

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon---temporal oscillation---where correct answers often emerge at intermediate denoising steps but are overwritten later in the process. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Majority Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable generations without ground-truth supervision. Experiments across several benchmarks demonstrate that both approaches significantly improve accuracy and reliability. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.


Methodology


Temporal Self-Consistency Voting

Relying solely on the final prediction risks discarding better intermediate outputs. To address this, we propose a method that aggregates predictions across timesteps through a weighted voting mechanism. Formally, given a diffusion sampling trajectory \(\left\{x_0^t\right\}_{t=1}^{T}\), our method selects the final answer \(a^*\) according to a weighted vote over all timesteps: $$ a^* = \arg \max _a \sum_{t=1}^T f(t) \cdot \mathbb{1} \left(\text { meaning }\left(x_0^t\right) = a\right). $$ Here, \(\mathbb{1}(\cdot)\) is the indicator function that returns 1 if the decoded meaning of \(x_0^t\) matches the candidate answer \(a\), and \(f(t)\) is a weighting function over timesteps. Since prediction accuracy generally improves as sampling progresses (i.e., as the diffusion step decreases), we design \(f(t)\) to be a monotonically decreasing function of the diffusion step \(t\). In experiments, we explore three weighting schemes: constant, linear decay, and exponential decay.
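
The sketch below shows how the weighted vote can be implemented for the three weighting schemes; the meaning function (answer extraction / semantic labeling), the chronological ordering of the trajectory list, and the \(\alpha\) default are assumptions made for illustration rather than our exact implementation.

```python
import math
from collections import defaultdict
from typing import Callable, List

def temporal_vote(
    trajectory: List[str],           # intermediate answers x_0^t, ordered from earliest to final step
    meaning: Callable[[str], str],   # maps a decoded answer to its semantic label (e.g., the extracted number)
    scheme: str = "exp",             # "const", "linear", or "exp"
    alpha: float = 0.5,              # decay rate for the exponential scheme (illustrative default)
) -> str:
    """Weighted vote over timesteps; later (less noisy) steps receive larger weights."""
    T = len(trajectory)
    scores = defaultdict(float)
    for step, answer in enumerate(trajectory, start=1):
        if scheme == "const":
            w = 1.0
        elif scheme == "linear":
            w = step / T                       # grows linearly toward the final step
        else:
            w = math.exp(-alpha * (T - step))  # decays exponentially with distance from the final step
        scores[meaning(answer)] += w
    return max(scores, key=scores.get)
```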

Post-Training with Temporal Semantic Entropy

We introduce a metric called Temporal Semantic Entropy, which captures the distribution of semantic variations in the answers generated at each step of the iterative denoising process. During decoding, we obtain a sequence of \(T\) intermediate answers, denoted by \(\{x_0^t\}_{t=1}^{T}\). We group these answers into clusters based on their semantic meaning, forming a set of semantic clusters \(\mathcal{C} = \{C_1, C_2, \ldots, C_K\}\), where each cluster \(C_k = \{x_0^t : \text{meaning}(x_0^t) = k\}\) contains all answers with equivalent semantics (outcome equals \(k\)). We then define the temporal semantic entropy (TSE) of a trajectory as: $$ \operatorname{TSE}(\left\{x_0^t\right\}_{t=1}^{T})=-\sum_{C_k}\left(\left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right] \log \left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right]\right), $$ which quantifies the uncertainty in the semantic content of the answers over the decoding steps. In our analysis, we observe that high temporal semantic entropy can serve as a signal of model uncertainty. We therefore propose a post-training approach designed to encourage temporal consistency in model outputs: we define the reward \(r_i\) as the negative Temporal Semantic Entropy of response \(o_i\), i.e., \(r_i = -\text{TSE}(o_i)\). This reward favors outputs whose intermediate token representations maintain semantic stability throughout the generation process, thereby promoting temporal semantic consistency.
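
For concreteness, here is a minimal sketch of the TSE reward under the simplifying assumption that every intermediate answer carries uniform weight \(p(x_0^t) = 1/T\); the meaning-based clustering helper is hypothetical and stands in for whatever semantic-equivalence check is used (e.g., comparing extracted final answers).

```python
import math
from collections import Counter
from typing import Callable, List

def temporal_semantic_entropy(
    trajectory: List[str],           # intermediate answers x_0^t for t = 1..T
    meaning: Callable[[str], str],   # semantic-equivalence labeling, e.g., the extracted final answer
) -> float:
    """TSE of one trajectory, assuming uniform per-step probability p(x_0^t) = 1/T."""
    T = len(trajectory)
    cluster_sizes = Counter(meaning(a) for a in trajectory)  # |C_k| for each semantic cluster
    tse = 0.0
    for size in cluster_sizes.values():
        p_k = size / T               # aggregated cluster mass: sum of p(x_0^t) over x_0^t in C_k
        tse -= p_k * math.log(p_k)
    return tse

def consistency_reward(trajectory: List[str], meaning: Callable[[str], str]) -> float:
    """Reward r_i = -TSE(o_i): more temporally consistent trajectories receive higher reward."""
    return -temporal_semantic_entropy(trajectory, meaning)
```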


Experimental Results


Temporal Voting Results

Performance of temporal majority voting on mathematics benchmarks. Bold numbers indicate the highest performance within each group, while green values represent improvements over the baseline.

Method                                | GSM8K            | MATH500          | SVAMP            | Countdown
                                      | 128   256   512  | 128   256   512  | 128   256   512  | 128   256   512
LLaDA-8B-Instruct  baseline           | 68.5  76.3  78.2 | 27.4  33.4  35.8 | 84.0  83.3  84.7 | 20.3  21.5  16.4
 + Temporal Voting, fixed weighting   | 68.0  73.4  78.3 | 26.6  30.8  34.2 | 87.0  84.3  84.3 | 22.7  18.8  11.3
 + Temporal Voting, linear weighting  | 70.0  78.0  78.8 | 28.0  34.4  34.6 | 87.0  84.3  84.3 | 24.2  21.9  16.0
 + Temporal Voting, exp. weighting    | 70.1  78.7  78.9 | 28.4  35.6  36.2 | 86.0  84.3  84.7 | 25.0  23.4  16.4
 Δ over baseline                      | +1.6  +2.4  +0.7 | +1.0  +2.2  +0.4 | +2.0  +1.0  +0.0 | +4.7  +1.9  +0.0
LLaDA-1.5  baseline                   | 69.8  79.4  81.1 | 29.0  32.4  35.4 | 85.3  86.3  83.3 | 21.5  21.1  20.7
 + Temporal Voting, fixed weighting   | 68.8  75.7  80.3 | 27.3  30.8  34.6 | 87.3  85.3  84.0 | 23.4  22.3  18.8
 + Temporal Voting, linear weighting  | 71.0  79.8  81.0 | 29.2  32.8  35.8 | 86.0  87.0  84.0 | 24.2  23.4  19.1
 + Temporal Voting, exp. weighting    | 70.7  79.8  81.1 | 29.0  33.2  36.2 | 85.7  87.7  84.3 | 26.2  25.0  21.1
 Δ over baseline                      | +1.9  +0.4  +0.0 | +0.0  +0.8  +0.8 | +0.4  +1.4  +1.0 | +4.7  +3.9  +0.4

Post-Training with Temporal Semantic Entropy Results

Performance of reinforcement fine-tuning on mathematics benchmarks. Bold numbers indicate the highest performance within each group, while green values represent improvements over the baseline.

Method                                | GSM8K            | MATH500          | SVAMP            | Countdown
                                      | 128   256   512  | 128   256   512  | 128   256   512  | 128   256   512
LLaDA-8B-Instruct  SFT on s1k         | 70.2  78.7  80.1 | 23.8  34.4  36.8 | 81.7  83.3  82.7 | 21.5  19.9  21.5
 + RFT, d1 (w/ GT)                    | 71.7  78.3  82.3 | 31.0  36.0  40.4 | 85.7  88.0  88.7 | 34.8  35.5  37.9
 + RFT, ours (NTSE reward)            | 72.2  78.8  80.2 | 30.6  34.6  38.0 | 84.3  89.0  88.7 | 38.6  53.5  44.9
 Δ over baseline                      | +2.0  +0.1  +0.1 | +6.8  +0.2  +1.2 | +2.6  +5.7  +6.0 | +17.1 +33.6 +23.4
LLaDA-1.5  SFT on s1k                 | 69.8  79.4  81.1 | 29.0  32.4  35.4 | 85.3  86.3  83.3 | 21.5  21.1  20.7
 + RFT, d1 (w/ GT)                    | 70.6  78.5  81.4 | 29.6  34.2  39.2 | 85.0  88.0  88.3 | 32.8  25.8  39.5
 + RFT, ours (NTSE reward)            | 70.3  79.5  81.4 | 29.2  35.6  39.0 | 86.0  88.0  88.7 | 34.0  48.8  49.5
 Δ over baseline                      | +0.5  +0.1  +0.3 | +0.2  +3.2  +3.6 | +0.7  +1.7  +5.4 | +12.5 +27.7 +28.8

Analysis

(a) Ablations on \(\alpha\) value selection in temporal voting with exponential weighting. (b) Negative temporal semantic entropy reward curve during reinforcement fine-tuning.

Key Insights


Temporal Dynamics

We conduct an in-depth analysis of diffusion-based large language models (dLLMs), uncovering the underexplored potential of temporal dynamics. Our findings highlight that the evolving intermediate states during the sampling process carry rich semantic signals that can be leveraged to enhance both reasoning performance and model robustness.

Temporal Majority Voting

A training-free test-time decoding strategy that aggregates predictions across denoising steps and selects the most consistent output, considerably improving accuracy with negligible computational overhead.

Temporal Semantic Entropy

We introduce a metric called Temporal Semantic Entropy, which captures the distribution of semantic variations in the answers generated at each step of the iterative denoising process. It serves as an indicator of the semantic stability of the model's outputs.

Temporal Consistency Reinforcement

A reinforcement learning-based post-training method that uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable and consistent generations. Importantly, leveraging negative TSE as the reward enables performance improvements without requiring ground-truth labels for reward computation.


Citation

todo


Acknowledgements

We would like to thank Muzhi Zhu, Canyu Zhao, and Linhao Zhong at Zhejiang University for their valuable discussions and insightful feedback. We also acknowledge the template from Ximing Xing that helped us build this project homepage.