POCA: Post-training Quantization with Temporal Alignment for Codec Avatars

Meta Reality Labs, Cornell University
ECCV 2024

Clean Avatar with the Quantized Decoder Model

Left: Ground-truth avatar captured by the Mugsy capture system (MultiFace, CVPR 2023 3DMV).

Middle: Noisy, jittering avatar rendered with the SoTA post-training quantization (PTQ) algorithm (INT8).

Right: Proposed POCA algorithm, which renders a clean avatar (INT8).

Abstract

Real-time decoding generates high-quality assets for rendering photorealistic Codec Avatars for immersive social telepresence in AR/VR. However, high-quality avatar decoding incurs expensive computation and memory consumption, which necessitates decoder compression (e.g., quantization). Although quantization has been widely studied, the quantization of avatar decoders is an urgent yet under-explored need. Furthermore, the requirement of fast "User-Avatar" deployment prioritizes post-training quantization (PTQ) over time-consuming quantization-aware training (QAT). As the first work in this area, we reveal the sensitivity of avatar decoding quality under low precision. In particular, state-of-the-art (SoTA) QAT and PTQ algorithms introduce massive amounts of temporal noise to the rendered avatars, even at the well-established 8-bit precision. To resolve these issues, we propose a novel PTQ algorithm that quantizes the avatar decoder with low-precision weights and activations (8-bit and 6-bit) without introducing temporal noise to the rendered avatar. Furthermore, the proposed method needs only 10% of the activations of each layer to calibrate the quantization parameters, without any distribution manipulation or extensive boundary search. The proposed method is evaluated on various face avatars with different facial characteristics, compressing the decoder model by 5.3x while recovering quality on par with the full-precision baseline. Beyond avatar rendering, POCA is also applicable to image resolution enhancement tasks, achieving new SoTA image quality.
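For concreteness, below is a minimal sketch (ours, not from the paper) of the symmetric min-max uniform quantizer that standard PTQ baselines apply to weights and activations. It also illustrates the failure mode the abstract alludes to: a single heavy-tailed outlier inflates the calibrated range, coarsening the step size for the bulk of the values. The 5.3x figure quoted above is consistent with storing 32-bit weights at 6 bits (32/6 ≈ 5.3), though the paper may count it differently. All names and shapes here are illustrative.

import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Symmetric min-max uniform quantizer (fake-quant), the usual PTQ baseline.

    Returns the dequantized tensor so the quantization error is easy to inspect.
    """
    qmax = 2 ** (n_bits - 1) - 1          # e.g., 127 for INT8
    scale = x.abs().max() / qmax          # min-max calibration: one outlier sets the step size
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale                  # dequantize back to float

# A heavy-tailed activation: one outlier stretches the range, so the bulk of
# values lands on very few quantization levels and the error grows.
x = torch.cat([torch.randn(10_000) * 0.1, torch.tensor([8.0])])
for bits in (8, 6):
    err = (uniform_quantize(x, bits) - x).abs().mean()
    print(f"INT{bits} mean abs error: {err:.5f}")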

Method

Overall PTQ process of POCA based on layer-wise calibration (Layer Calib). POCA fully quantizes all layers of the VAE decoder (Deep Appearance Model, ACM Trans. Graph. 2018), and the proposed view-dependent + SWD filter is applied to the transpose-convolution layers of the decoder. In other words, the quantizers are calibrated on the most relevant visual information, which further minimizes the impact of the long-tailed distribution of the human face.
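This page does not spell out the view-dependent + SWD filter, so the sketch below shows only the generic layer-wise calibration skeleton it would plug into, under our own assumptions: forward hooks collect each conv/transpose-conv layer's output activations, keep a 10% subsample (a random subsample stands in as a placeholder for POCA's filter), and fit a per-layer INT8 scale from the retained values. Function names, the toy decoder, and all shapes are hypothetical.

import torch
import torch.nn as nn

@torch.no_grad()
def calibrate_layerwise(model: nn.Module, calib_inputs, keep_ratio: float = 0.10):
    """Generic layer-wise PTQ calibration skeleton (a sketch, not POCA itself)."""
    scales, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            flat = output.detach().flatten()
            idx = torch.randperm(flat.numel())[: int(keep_ratio * flat.numel())]
            kept = flat[idx]               # placeholder for POCA's view-dependent + SWD filter
            qmax = 127                     # symmetric INT8 range
            scales[name] = max(scales.get(name, 0.0), (kept.abs().max() / qmax).item())
        return hook

    for name, m in model.named_modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            handles.append(m.register_forward_hook(make_hook(name)))

    for x in calib_inputs:                 # a handful of calibration frames
        model(x)

    for h in handles:
        h.remove()
    return scales                          # per-layer activation scales

# Usage on a toy two-layer decoder head (shapes are illustrative only):
decoder = nn.Sequential(nn.ConvTranspose2d(16, 8, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(8, 3, 4, 2, 1))
calib = [torch.randn(1, 16, 32, 32) for _ in range(4)]
print(calibrate_layerwise(decoder, calib))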


Dynamically Long-tailed Human Face

In practice, the texture and geometry of the human face exhibit different statistical characteristics across regions and expressions. As shown in the figure below, different facial expressions change both the regional and the global distribution, especially its tailedness. Globally optimizing the quantizers with respect to such a long-tailed distribution leads to rounding instability of the activations, which in turn causes unstable geometry and pixel-color representations (see Section 4.3 of the paper).

Slicing the activations of the decoder model into row-channel planes: different expressions show different tailedness (standard deviation) in different regions.
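One possible reading of this slicing, as a hedged sketch: reduce a (C, H, W) activation over the width axis so each (channel, row) plane gets one spread statistic, then compare the resulting maps across expressions. The two mock "expressions" below are synthetic stand-ins, not data from the paper.

import torch

def rowchannel_tailedness(act: torch.Tensor) -> torch.Tensor:
    """Score each row-channel plane of a (C, H, W) activation by its spread.

    Std over the width dimension leaves one statistic per (channel, row) plane;
    it is used here as the spread/tailedness proxy named in the caption.
    """
    return act.std(dim=-1)                 # shape (C, H)

# Mock "expressions": the second adds a heavy tail to a band of rows,
# mimicking a localized expression change (e.g., around the mouth).
neutral = torch.randn(8, 64, 64) * 0.1
smiling = neutral.clone()
smiling[:, 40:48, :] += torch.randn(8, 8, 64) * 2.0

delta = (rowchannel_tailedness(smiling) - rowchannel_tailedness(neutral)).abs()
print("rows with largest tailedness shift:", delta.mean(dim=0).topk(5).indices.tolist())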

BibTeX

@inproceedings{poca2024meng,
  author    = {Jian Meng and Yuecheng Li and Leo Chenghui Li and Syed Shakib Sarwar and Dilin Wang and Jae-sun Seo},
  title     = {POCA: Post-training Quantization with Temporal Alignment for Codec Avatars},
  booktitle = {ECCV},
  year      = {2024},
}