RecTok overcomes the limitations of latent space dimensionality, achieving state-of-the-art generation performance through a high-dimensional, semantically rich latent space.
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent features. However, a fundamental trade-off exists between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts.
In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information of VFMs into the forward flow trajectories of flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss.
Combining the strengths of high-dimensional latent spaces with our proposed innovations, RecTok achieves consistent gains in reconstruction, generation, and linear probing as the latent dimension grows.
Increasing the latent dimension is challenging, as it often compromises the model's generation ability. Consequently, previous tokenizers such as SD-VAE and SD3-VAE are restricted to low-dimensional spaces, which in turn limits both reconstruction fidelity and semantic expressiveness.
To address this limitation, previous methods distill semantic information from vision foundation models into the latent space, aiming to enrich the latent representation capacity and accelerate the convergence of downstream generation training. These works effectively increase the latent dimension from 16 to 32.
However, their generation quality in high dimensions still lags behind that of low-dimensional counterparts: continuing to expand the latent dimension still degrades generation quality. RecTok overcomes this bottleneck: we increase the latent dimension to 128 with consistent gains in reconstruction, generation, and linear probing.
Unlike previous works that directly inject semantics into the un-noised latent space, our core insight is that the denoising network is trained on the forward flow rather than on the latent space, as shown in figure (a). To understand the importance of semantic consistency along the flow, we first evaluate discriminative capability along the flow. As shown in figure (b), the linear probing accuracy of previous tokenizers drops markedly as the latent propagates along the forward flow, i.e., over the representations that DiT receives during diffusion training.
We take a more training-consistent perspective: since DiT is trained on the forward flow rather than on the latent features, we enhance the semantics of all flow states through two key innovations: Flow Semantic Distillation (FSD) and Reconstruction and Alignment Distillation (RAD).
We show our RecTok framework above. During training, we apply a random mask to the input image and encode the visible regions with the encoder to obtain $x_0$. We then sample a time step $t$ and use the forward flow to generate the corresponding $x_t$. Subsequently, $x_t$ is fed into two decoders: the Semantic Decoder reconstructs the features of VFMs, while the Pixel Decoder reconstructs the pixel space. After training, both the Semantic Decoder and the VFMs are discarded, ensuring the efficiency of RecTok during inference. In the following sections, we detail FSD and RAD.
We first introduce Flow Semantic Distillation (FSD), which distills the semantics of VFMs into the forward flow trajectory. Fortunately, the forward flow from data $x_0$ to noise $\epsilon$ is independent of the velocity network $v_\theta(x,t)$, allowing us to obtain $x_t=(1-t)x_0+t\epsilon, \ t\in [0, 1]$ directly by interpolating between the encoded latent $x_0 = E_{\theta}(I)$ and Gaussian noise $\epsilon$. Each $x_t$ is then decoded by a lightweight semantic decoder $D_{\text{sem}}$ to obtain semantic features, which are supervised by VFM features.
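To make FSD concrete, below is a minimal PyTorch-style sketch of the loss for one batch. The module interfaces (`encoder`, `sem_decoder`, a frozen `vfm`) and the cosine-distance objective are our assumptions for illustration, not the paper's released code; `t` comes from the shifted sampler described next.

```python
import torch
import torch.nn.functional as F

def fsd_loss(image, encoder, sem_decoder, vfm, t):
    """Flow Semantic Distillation, sketched. `t` holds per-sample
    timesteps (see the shifted sampler below), reshaped so it
    broadcasts over the latent dimensions. The cosine-distance loss
    and module interfaces are illustrative assumptions."""
    x0 = encoder(image)                     # clean latent x_0 = E_theta(I)
    eps = torch.randn_like(x0)              # Gaussian noise

    # The forward flow needs no velocity network: plain interpolation.
    xt = (1 - t) * x0 + t * eps             # x_t = (1 - t) x_0 + t * eps

    with torch.no_grad():
        target = vfm(image)                 # frozen VFM features

    pred = sem_decoder(xt)                  # lightweight semantic decoder
    # Align decoded features with the VFM features at every flow state.
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()
```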
Considering the redundancy in high-dimensional latent spaces, we apply a dimension-dependent shift to the distribution of $t$, following RAE, and sample it as follows:
$$ t = \frac{s t'}{1 + (s - 1)t'}, \quad t' \sim \mathcal{U}(0,1), \quad s = \sqrt{\frac{4096}{r^2 d}} $$
where $r$ and $d$ are the resolution and dimension of the latent feature, respectively.
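In code, this sampler is a direct transcription of the formula; the function name, signature, and defaults are ours:

```python
import torch

def sample_shifted_t(batch_size, r=16, d=128, device="cpu"):
    """Dimension-dependent timestep shift (direct transcription of the
    formula above; the defaults r=16, d=128 are illustrative).

    r: spatial resolution of the latent feature map
    d: channel dimension of the latent
    """
    s = (4096 / (r ** 2 * d)) ** 0.5                  # s = sqrt(4096 / (r^2 d))
    t_prime = torch.rand(batch_size, device=device)   # t' ~ U(0, 1)
    return s * t_prime / (1 + (s - 1) * t_prime)      # shifted t in [0, 1]
```

For example, a $16\times16$ latent with $d=128$ gives $s=\sqrt{4096/32768}\approx 0.35$.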
To further enhance the semantics along the forward flow, we introduce Reconstruction and Alignment Distillation (RAD), inspired by masked image modeling methods, which obtain semantically rich features through pixel or feature reconstruction. Specifically, we apply random masks to the input image and reconstruct the missing regions from the visible noisy latent features. To ensure compatibility with this reconstruction task, we use a transformer-based semantic decoder $D_{\text{sem}}$.
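A hedged sketch of this reconstructive target is shown below. The patch-wise masking, the 60% mask ratio, and restricting the loss to masked positions follow common masked feature modeling practice and are our assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def rad_loss(image, encoder, sem_decoder, vfm, t, mask_ratio=0.6):
    """Reconstruction and Alignment Distillation, sketched. We assume the
    encoder accepts a patch mask (1 = masked, 0 = visible), the VFM
    returns (B, N, D) per-patch features, and `t` broadcasts over the
    latent dimensions."""
    B = image.shape[0]
    with torch.no_grad():
        target = vfm(image)                  # (B, N, D) frozen VFM features
    N = target.shape[1]

    # Random mask with exactly int(mask_ratio * N) masked patches per image.
    k = int(mask_ratio * N)
    mask = (torch.rand(B, N, device=image.device).argsort(dim=1) < k).float()

    x0 = encoder(image, mask)                # encode visible regions only
    eps = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * eps              # noisy visible latents

    pred = sem_decoder(xt, mask)             # transformer decoder predicts all patches
    sim = F.cosine_similarity(pred, target, dim=-1)   # (B, N)
    # Supervise only the masked positions, as in masked feature modeling.
    return ((1 - sim) * mask).sum() / mask.sum()
```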
We compare RecTok with other representative tokenizers in terms of parameter count, GFLOPs, reconstruction quality, and generation performance. RecTok achieves the best performance among ViT-based tokenizers.
As shown in the table below, RecTok with $\text{DiT}^{\text{DH}}\text{-XL}$ achieves the best gFID of 1.34 without classifier-free guidance. When employing classifier-free guidance, we adopt the AutoGuidance strategy and achieve a gFID of 1.13, matching RAE while showing a clear advantage in Inception Score (IS). RecTok also outperforms latent distillation methods such as VA-VAE, which enhance latent-feature semantics rather than the forward flow, by a large margin, indicating that enhancing forward-flow semantics is more beneficial for diffusion generation quality.
Using these two key innovations, we gradually increase the dimensionality of the latent space and observe consistent performance improvements in reconstruction, generation, and linear probing. To the best of our knowledge, RecTok is the first work to achieve this.
We present the overall ablation results in the table below, which clearly demonstrate the performance gains contributed by each proposed component. Although Reconstruction and Alignment Distillation (RAD) slightly degrades reconstruction quality, it leads to improved generation performance. More importantly, this reconstruction ability can be largely recovered by finetuning the pixel decoder after FSD and RAD.
We present PCA projections, similarity heatmaps, and t-SNE embeddings to probe the latent features of RecTok. The visualizations reveal that RecTok captures highly discriminative and semantically aligned features, effectively distinguishing object boundaries and categories. Together with its state-of-the-art generation performance, this semantic richness highlights the potential of RecTok as a foundation for future research on unified tokenizers.
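For reference, a common recipe for such PCA projections, and the one we assume here, maps each patch latent onto its top three principal components for RGB display; the sketch below is ours, not the paper's analysis code.

```python
import torch

def pca_rgb(latent):
    """Project per-patch latents onto 3 principal components for RGB display.
    latent: (B, N, D) patch features; returns (B, N, 3) in [0, 1]."""
    B, N, D = latent.shape
    flat = latent.reshape(-1, D).float()
    # Top-3 principal directions via randomized PCA.
    _, _, v = torch.pca_lowrank(flat, q=3)
    rgb = (flat - flat.mean(0)) @ v           # (B * N, 3) projections
    # Normalize each channel to [0, 1] for display.
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    return rgb.reshape(B, N, 3)
```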
In this work, we address the fundamental challenge posed by the latent dimensionality of visual tokenizers through RecTok. Building on our core insight that semantics should be enhanced along the forward flow rather than only at the un-noised latents, we introduce two key innovations: Flow Semantic Distillation (FSD) and Reconstruction and Alignment Distillation (RAD). Together, FSD and RAD effectively enrich the semantics of RecTok's latent space, and we observe consistent improvements as the latent dimension increases. Experiments on ImageNet-1K demonstrate that RecTok achieves state-of-the-art generation performance while maintaining strong reconstruction quality and semantic representation. We hope this work inspires future research on high-dimensional visual tokenizers.