RecTok overcomes the limitations of latent space dimensionality, achieving state-of-the-art generation performance through a high-dimensional, semantically rich latent space.
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent features. However, a fundamental trade-off exists between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts.
In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information of VFMs into the forward flow trajectories of flow matching. We further enhance the semantics by introducing a masked feature reconstruction loss.
Combining the strengths of high-dimensional latent spaces with our proposed innovations, RecTok achieves consistent gains in reconstruction, generation, and linear probing as the latent dimension grows.
Increasing the latent dimension is challenging, as it often compromises the model's generation ability. Consequently, previous tokenizers such as SD-VAE and SD3-VAE are restricted to low-dimensional spaces, which in turn limits both reconstruction fidelity and semantic expressiveness.
To address this limitation, previous methods distill semantic information from vision foundation models into the latent space, aiming to enrich the latent representation capacity and accelerate the convergence of downstream generation training. These works effectively increase the latent dimension from 16 to 32.
However, their generation quality in high dimensions still lags behind that of low-dimensional counterparts: continuing to expand the latent dimension still degrades generation quality. RecTok overcomes this bottleneck: we increase the latent dimension to 128 with consistent gains in reconstruction, generation, and linear probing.
Unlike previous works that directly inject semantics into the un-noised latent space, our core insight is that the denoising network is trained on the forward flow rather than on the latent space, as shown in figure (a). To understand the importance of semantic consistency along the flow, we first evaluate discriminative capability along the flow. As shown in figure (b), the linear probing accuracy of previous tokenizers drops markedly as the latent propagates along the forward flow, i.e., over the representations that DiT receives during diffusion training.
We take a more training-consistent perspective: since DiT is trained on the forward flow rather than on the latent features, we enhance the semantics of all flow states through two key innovations: Flow Semantic Distillation (FSD) and Reconstruction and Alignment Distillation (RAD).
We show our RecTok framework above. During training, we apply a random mask to the input image and encode the visible regions with the encoder to obtain $x_0$. We then sample a time step $t$ and use the forward flow to generate the corresponding $x_t$. Subsequently, $x_t$ is fed into two decoders: the Semantic Decoder reconstructs the features of VFMs, while the Pixel Decoder reconstructs the pixel space. After training, both the Semantic Decoder and the VFMs are discarded, ensuring the efficiency of RecTok during inference. In the following sections, we detail FSD and RAD.
We first introduce Flow Semantic Distillation (FSD), which distills the semantics of VFMs into the forward flow trajectory. Fortunately, the forward flow from data $x_0$ to noise $\epsilon$ is independent of the velocity network $v_\theta(x,t)$, allowing us to obtain $x_t=(1-t)x_0+t\epsilon, \ t\in [0, 1]$ directly by interpolating between the encoded latent $x_0 = E_{\theta}(I)$ and Gaussian noise $\epsilon$. Each $x_t$ is then decoded by a lightweight semantic decoder $D_{\text{sem}}$ to obtain semantic features, which are supervised by VFM features.
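To make FSD concrete, below is a minimal PyTorch-style sketch of the loss for one batch. The module interfaces (`encoder`, `sem_decoder`, a frozen `vfm`) and the cosine-distance objective are our assumptions for illustration, not the paper's released code; `t` comes from the shifted sampler described next.

```python
import torch
import torch.nn.functional as F

def fsd_loss(image, encoder, sem_decoder, vfm, t):
    """Flow Semantic Distillation, sketched. `t` holds per-sample
    timesteps (see the shifted sampler below), reshaped so it
    broadcasts over the latent dimensions. The cosine-distance loss
    and module interfaces are illustrative assumptions."""
    x0 = encoder(image)                     # clean latent x_0 = E_theta(I)
    eps = torch.randn_like(x0)              # Gaussian noise

    # The forward flow needs no velocity network: plain interpolation.
    xt = (1 - t) * x0 + t * eps             # x_t = (1 - t) x_0 + t * eps

    with torch.no_grad():
        target = vfm(image)                 # frozen VFM features

    pred = sem_decoder(xt)                  # lightweight semantic decoder
    # Align decoded features with the VFM features at every flow state.
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()
```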
Considering the redundancy in high-dimensional latent spaces, we apply a dimension-dependent shift to the distribution of $t$, following RAE, and sample it as follows:
$$ t = \frac{s t'}{1 + (s - 1)t'}, \quad t' \sim \mathcal{U}(0,1), \quad s = \sqrt{\frac{4096}{r^2 d}} $$
where $r$ and $d$ are the resolution and dimension of the latent feature, respectively.
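In code, this sampler is a direct transcription of the formula; the function name, signature, and defaults are ours:

```python
import torch

def sample_shifted_t(batch_size, r=16, d=128, device="cpu"):
    """Dimension-dependent timestep shift (direct transcription of the
    formula above; the defaults r=16, d=128 are illustrative).

    r: spatial resolution of the latent feature map
    d: channel dimension of the latent
    """
    s = (4096 / (r ** 2 * d)) ** 0.5                  # s = sqrt(4096 / (r^2 d))
    t_prime = torch.rand(batch_size, device=device)   # t' ~ U(0, 1)
    return s * t_prime / (1 + (s - 1) * t_prime)      # shifted t in [0, 1]
```

For example, a $16\times16$ latent with $d=128$ gives $s=\sqrt{4096/32768}\approx 0.35$.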
To further enhance the semantics along the forward flow, we introduce Reconstruction and Alignment Distillation (RAD), inspired by masked image modeling methods, which obtain semantically rich features through pixel or feature reconstruction. Specifically, we apply random masks to the input image and reconstruct the missing regions from the visible noisy latent features. To ensure compatibility with this reconstruction task, we use a transformer-based semantic decoder $D_{\text{sem}}$.
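A hedged sketch of this reconstructive target is shown below. The patch-wise masking, the 60% mask ratio, and restricting the loss to masked positions follow common masked feature modeling practice and are our assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def rad_loss(image, encoder, sem_decoder, vfm, t, mask_ratio=0.6):
    """Reconstruction and Alignment Distillation, sketched. We assume the
    encoder accepts a patch mask (1 = masked, 0 = visible), the VFM
    returns (B, N, D) per-patch features, and `t` broadcasts over the
    latent dimensions."""
    B = image.shape[0]
    with torch.no_grad():
        target = vfm(image)                  # (B, N, D) frozen VFM features
    N = target.shape[1]

    # Random mask with exactly int(mask_ratio * N) masked patches per image.
    k = int(mask_ratio * N)
    mask = (torch.rand(B, N, device=image.device).argsort(dim=1) < k).float()

    x0 = encoder(image, mask)                # encode visible regions only
    eps = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * eps              # noisy visible latents

    pred = sem_decoder(xt, mask)             # transformer decoder predicts all patches
    sim = F.cosine_similarity(pred, target, dim=-1)   # (B, N)
    # Supervise only the masked positions, as in masked feature modeling.
    return ((1 - sim) * mask).sum() / mask.sum()
```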
We compare RecTok with other representative tokenizers in terms of parameter count, GFLOPs, reconstruction quality, and generation performance. RecTok achieves the best performance among ViT-based tokenizers.
As shown in the table below, RecTok with $\text{DiT}^{\text{DH}}\text{-XL}$ achieves the best gFID of 1.34 without classifier-free guidance. When employing classifier-free guidance, we adopt the AutoGuidance strategy and achieve a gFID of 1.13, matching RAE while showing a clear advantage in Inception Score (IS). RecTok also outperforms latent distillation methods such as VA-VAE, which enhance latent-feature semantics rather than the forward flow, by a large margin, indicating that enhancing forward-flow semantics is more beneficial for diffusion generation quality.
Using these two key innovations, we gradually increase the dimensionality of the latent space and observe consistent performance improvements in reconstruction, generation, and linear probing. To the best of our knowledge, RecTok is the first work to achieve this.
We present the overall ablation results in the table below, which clearly demonstrate the performance gains contributed by each proposed component. Although Reconstruction and Alignment Distillation (RAD) slightly degrades reconstruction quality, it leads to improved generation performance. More importantly, this reconstruction ability can be largely recovered by finetuning the pixel decoder after FSD and RAD.
We present PCA projections, similarity heatmaps, and t-SNE embeddings to probe the latent features of RecTok. The visualizations reveal that RecTok captures highly discriminative and semantically aligned features, effectively distinguishing object boundaries and categories. Together with its state-of-the-art generation performance, this semantic richness highlights the potential of RecTok as a foundation for future research on unified tokenizers.
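For reference, a common recipe for such PCA projections, and the one we assume here, maps each patch latent onto its top three principal components for RGB display; the sketch below is ours, not the paper's analysis code.

```python
import torch

def pca_rgb(latent):
    """Project per-patch latents onto 3 principal components for RGB display.
    latent: (B, N, D) patch features; returns (B, N, 3) in [0, 1]."""
    B, N, D = latent.shape
    flat = latent.reshape(-1, D).float()
    # Top-3 principal directions via randomized PCA.
    _, _, v = torch.pca_lowrank(flat, q=3)
    rgb = (flat - flat.mean(0)) @ v           # (B * N, 3) projections
    # Normalize each channel to [0, 1] for display.
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    return rgb.reshape(B, N, 3)
```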
In this work, we address the fundamental challenge posed by the latent dimensionality of visual tokenizers through RecTok. Building on our core insight that semantics should be enhanced along the forward flow rather than only at the un-noised latents, we introduce two key innovations: Flow Semantic Distillation (FSD) and Reconstruction and Alignment Distillation (RAD). Together, FSD and RAD effectively enrich the semantics of RecTok's latent space, and we observe consistent improvements as the latent dimension increases. Experiments on ImageNet-1K demonstrate that RecTok achieves state-of-the-art generation performance while maintaining strong reconstruction quality and semantic representation. We hope this work inspires future research on high-dimensional visual tokenizers.