Motivation
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to auto-regressive (AR) large language models. However, current dLLMs typically adopt a decoding strategy that mirrors AR models: they rely solely on the sequence predicted in the final denoising step as the answer and discard all intermediate predictions. This greedy decoding strategy ignores the rich temporal dynamics of the dLLM sampling process. In our analysis on widely used mathematical benchmark datasets, a consistent and significant discrepancy emerges between two key metrics: the final pass rate, which measures the accuracy of the final output, and the ever-pass rate, which captures how often a correct answer appears at any intermediate decoding step. This discrepancy shows that correct predictions made in early iterations are frequently overwritten or lost in later stages. This insight motivates two complementary methods that leverage temporal dynamics: 1) Temporal Majority Voting, a test-time decoding strategy that aggregates predictions across sampling steps; and 2) Temporal Consistency Reinforcement, a post-training approach designed to encourage stable and consistent generations.
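For concreteness, the following minimal sketch shows how the two metrics can be computed from per-question decoding trajectories; the data layout (a list of extracted intermediate answers per question) and the exact-match comparison are illustrative assumptions, not the exact evaluation harness.

```python
# Minimal sketch, assuming trajectories[i] holds the decoded intermediate
# answers [x_0^1, ..., x_0^T] for question i and gold[i] its reference answer.

def final_pass_rate(trajectories, gold):
    """Fraction of questions whose final-step answer is correct."""
    hits = sum(traj[-1] == g for traj, g in zip(trajectories, gold))
    return hits / len(trajectories)

def ever_pass_rate(trajectories, gold):
    """Fraction of questions answered correctly at any denoising step."""
    hits = sum(any(a == g for a in traj) for traj, g in zip(trajectories, gold))
    return hits / len(trajectories)

# The gap ever_pass_rate - final_pass_rate quantifies how often a correct
# intermediate answer is overwritten or lost before the final step.
```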

Abstract
Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten later. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Majority Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable generations without ground-truth supervision. Our experiments across several benchmarks demonstrate that these approaches significantly improve accuracy and reliability. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
Methodology
Temporal Self-Consistency Voting
Because correct answers often appear at intermediate denoising steps, relying solely on the final prediction risks discarding better intermediate outputs. To address this, we propose a method that aggregates predictions across timesteps through a weighted voting mechanism. Formally, given a diffusion sampling trajectory \(\left\{x_0^t\right\}_{t=1}^{T}\), our method selects the final answer \(a^*\) according to a weighted vote over all timesteps: $$ a^* = \arg \max _a \sum_{t=1}^T f(t) \cdot \mathbb{1} \left(\text{meaning}\left(x_0^t\right) = a\right). $$ Here, \(\mathbb{1}(\cdot)\) is the indicator function that returns 1 if the decoded meaning of \(x_0^t\) matches the candidate answer \(a\), and \(f(t)\) is a weighting function over timesteps. Since prediction accuracy generally improves as the sampling step increases (i.e., as the diffusion step decreases), we design \(f(t)\) to be a monotonically decreasing function of the diffusion step \(t\). In experiments, we explore three weighting schemes: constant, linear decay, and exponential decay.
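To make the voting rule concrete, here is a minimal sketch of temporal majority voting over a list of extracted intermediate answers. The answer extraction is assumed to happen upstream, and the specific decay shapes (e.g., the exponential rate `alpha`) are illustrative choices rather than the exact experimental settings.

```python
import math
from collections import defaultdict

def temporal_majority_vote(answers, scheme="exp", alpha=0.1):
    """Weighted vote over intermediate answers, listed in generation order
    (earliest denoising step first, final step last)."""
    T = len(answers)

    def weight(step):
        t = T - step + 1                  # diffusion step t: large early, small late
        if scheme == "constant":
            return 1.0                    # f(t) = 1
        if scheme == "linear":
            return (T - t + 1) / T        # f(t) decays linearly in t
        if scheme == "exp":
            return math.exp(-alpha * t)   # f(t) decays exponentially in t
        raise ValueError(f"unknown scheme: {scheme}")

    scores = defaultdict(float)
    for step, ans in enumerate(answers, start=1):
        scores[ans] += weight(step)       # weighted indicator vote for this answer
    return max(scores, key=scores.get)    # a* = arg max_a sum_t f(t) * 1(meaning = a)
```

Under the linear and exponential schemes, later predictions (smaller diffusion step \(t\)) receive larger weights, matching the design of \(f(t)\) above.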
Post-Training with Temporal Semantic Entropy
We introduce a metric called Temporal Semantic Entropy (TSE), which captures the distribution of semantic variation in the answers generated at each step of the iterative denoising process. During decoding, we obtain a sequence of \(T\) intermediate answers, denoted by \(\{x_0^t\}_{t=1}^{T}\). We group these answers into clusters based on their semantic meaning, forming a set of semantic clusters \(\mathcal{C} = \{C_1, C_2, \ldots, C_K\}\), where each cluster \(C_k = \{x_0^t : \text{meaning}(x_0^t) = k\}\) contains all answers with equivalent semantics (outcome equal to \(k\)). We then define the TSE of a trajectory as: $$ \operatorname{TSE}\left(\left\{x_0^t\right\}_{t=1}^{T}\right)=-\sum_{C_k}\left(\left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right] \log \left[\sum_{x_0^t \in C_k} p\left(x_0^t\right)\right]\right), $$ which quantifies the uncertainty in the semantic content of the answers over the decoding steps. In our analysis, we observe that high temporal semantic entropy can serve as a signal of model uncertainty. We therefore propose a post-training approach designed to encourage temporal consistency in model outputs: we define the reward \(r_i\) as the negative Temporal Semantic Entropy of response \(o_i\), i.e., \(r_i = -\text{TSE}(o_i)\). This reward encourages outputs whose intermediate token representations maintain semantic stability throughout the generation process, thereby promoting temporal semantic consistency.
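The following minimal sketch computes TSE from a list of extracted intermediate answers, assuming a uniform probability \(p(x_0^t) = 1/T\) over the trajectory and exact-match comparison as a stand-in for the semantic clustering \(\text{meaning}(\cdot)\); both are illustrative simplifications rather than the exact training setup.

```python
import math
from collections import Counter

def temporal_semantic_entropy(answers):
    """TSE over one trajectory of T intermediate answers, with p(x_0^t) = 1/T."""
    T = len(answers)
    cluster_mass = Counter()
    for ans in answers:
        cluster_mass[ans] += 1.0 / T       # probability mass of cluster C_k
    return -sum(p * math.log(p) for p in cluster_mass.values())

def tse_reward(answers):
    """Reward r_i = -TSE(o_i): semantically stable trajectories score higher."""
    return -temporal_semantic_entropy(answers)

# Example: an oscillating trajectory is penalized relative to a stable one.
stable      = ["42", "42", "42", "42"]     # TSE = 0      -> reward 0
oscillating = ["17", "42", "17", "42"]     # TSE = log 2  -> reward ~ -0.69
assert tse_reward(stable) > tse_reward(oscillating)
```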
Experimental Results
Temporal Voting Results
Performance of temporal majority voting on mathematics benchmarks. Within each model group, the "Δ over baseline" row reports the improvement over the corresponding baseline.
| Model | Method | GSM8K 128 | GSM8K 256 | GSM8K 512 | MATH500 128 | MATH500 256 | MATH500 512 | SVAMP 128 | SVAMP 256 | SVAMP 512 | Countdown 128 | Countdown 256 | Countdown 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | baseline | 68.5 | 76.3 | 78.2 | 27.4 | 33.4 | 35.8 | 84.0 | 83.3 | 84.7 | 20.3 | 21.5 | 16.4 |
| + Temporal Voting | Fixed Weighting | 68.0 | 73.4 | 78.3 | 26.6 | 30.8 | 34.2 | 87.0 | 84.3 | 84.3 | 22.7 | 18.8 | 11.3 |
| | Linear Weighting | 70.0 | 78.0 | 78.8 | 28.0 | 34.4 | 34.6 | 87.0 | 84.3 | 84.3 | 24.2 | 21.9 | 16.0 |
| | Exp. Weighting | 70.1 | 78.7 | 78.9 | 28.4 | 35.6 | 36.2 | 86.0 | 84.3 | 84.7 | 25.0 | 23.4 | 16.4 |
| | Δ over baseline | +1.6 | +2.4 | +0.7 | +1.0 | +2.2 | +0.4 | +2.0 | +1.0 | +0.0 | +4.7 | +1.9 | +0.0 |
| LLaDA-1.5 | baseline | 69.8 | 79.4 | 81.1 | 29.0 | 32.4 | 35.4 | 85.3 | 86.3 | 83.3 | 21.5 | 21.1 | 20.7 |
| + Temporal Voting | Fixed Weighting | 68.8 | 75.7 | 80.3 | 27.3 | 30.8 | 34.6 | 87.3 | 85.3 | 84.0 | 23.4 | 22.3 | 18.8 |
| | Linear Weighting | 71.0 | 79.8 | 81.0 | 29.2 | 32.8 | 35.8 | 86.0 | 87.0 | 84.0 | 24.2 | 23.4 | 19.1 |
| | Exp. Weighting | 70.7 | 79.8 | 81.1 | 29.0 | 33.2 | 36.2 | 85.7 | 87.7 | 84.3 | 26.2 | 25.0 | 21.1 |
| | Δ over baseline | +1.9 | +0.4 | +0.0 | +0.0 | +0.8 | +0.8 | +0.4 | +1.4 | +1.0 | +4.7 | +3.9 | +0.4 |
Post-Training with Temporal Semantic Entropy Results
Performance of reinforcement fine-tuning on mathematics benchmarks. Within each model group, the "Δ over SFT" row reports the improvement of our NTSE-reward method over the SFT baseline.
| Model | Method | GSM8K 128 | GSM8K 256 | GSM8K 512 | MATH500 128 | MATH500 256 | MATH500 512 | SVAMP 128 | SVAMP 256 | SVAMP 512 | Countdown 128 | Countdown 256 | Countdown 512 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | SFT on s1k | 70.2 | 78.7 | 80.1 | 23.8 | 34.4 | 36.8 | 81.7 | 83.3 | 82.7 | 21.5 | 19.9 | 21.5 |
| + RFT | d1 (w/ GT) | 71.7 | 78.3 | 82.3 | 31.0 | 36.0 | 40.4 | 85.7 | 88.0 | 88.7 | 34.8 | 35.5 | 37.9 |
| | ours (NTSE reward) | 72.2 | 78.8 | 80.2 | 30.6 | 34.6 | 38.0 | 84.3 | 89.0 | 88.7 | 38.6 | 53.5 | 44.9 |
| | Δ over SFT | +2.0 | +0.1 | +0.1 | +6.8 | +0.2 | +1.2 | +2.6 | +5.7 | +6.0 | +17.1 | +33.6 | +23.4 |
| LLaDA-1.5 | SFT on s1k | 69.8 | 79.4 | 81.1 | 29.0 | 32.4 | 35.4 | 85.3 | 86.3 | 83.3 | 21.5 | 21.1 | 20.7 |
| + RFT | d1 (w/ GT) | 70.6 | 78.5 | 81.4 | 29.6 | 34.2 | 39.2 | 85.0 | 88.0 | 88.3 | 32.8 | 25.8 | 39.5 |
| | ours (NTSE reward) | 70.3 | 79.5 | 81.4 | 29.2 | 35.6 | 39.0 | 86.0 | 88.0 | 88.7 | 34.0 | 48.8 | 49.5 |
| | Δ over SFT | +0.5 | +0.1 | +0.3 | +0.2 | +3.2 | +3.6 | +0.7 | +1.7 | +5.4 | +12.5 | +27.7 | +28.8 |
Analysis

Key Insights
Temporal Dynamics
We conduct an in-depth analysis of diffusion-based large language models (dLLMs), uncovering the underexplored potential of temporal dynamics. Our findings highlight that the evolving intermediate states during the sampling process carry rich semantic signals that can be leveraged to enhance both reasoning performance and model robustness.
Temporal Majority Voting
A training-free test-time decoding strategy that aggregates predictions across denoising steps and selects the most consistent output, considerably improving accuracy with negligible computational overhead.
Temporal Semantic Entropy
We introduce a metric called Temporal Semantic Entropy, which captures the distribution of semantic variation in the answers generated at each step of the iterative denoising process. It serves as an indicator of the semantic stability of the model's outputs.
Temporal Consistency Reinforcement
A reinforcement learning-based post-training method that uses negative Temporal Semantic Entropy (TSE) as a reward signal to encourage stable and consistent generations. Importantly, leveraging negative TSE as the reward enables performance improvements without requiring ground-truth labels for reward computation.
Citation
todo
Acknowledgements
We would like to thank Muzhi Zhu, Canyu Zhao, and Linhao Zhong at Zhejiang University for their valuable discussions and insightful feedback. We also acknowledge the template from Ximing Xing that helped us build this project homepage.