On the Content Bias in Fréchet Video Distance

1University of Maryland, College Park   2Carnegie Mellon University 3Adobe Research  
CVPR 2024

[Teaser videos] Reference videos (left); medium spatial corruption, no temporal corruption (middle): FVD = 317.10; small spatial corruption, severe temporal corruption (right): FVD = 310.52.

FVD is biased toward per-frame quality over temporal consistency. FVD is the primary metric for evaluating video generation models. Ideally, such a metric should capture both the spatial and temporal aspects of video quality. However, our experiments reveal a strong bias toward the quality of individual frames.

In this simple test, we compare a reference set of videos (left) to two corrupted sets. The first set introduces medium spatial corruption (middle), which results in an FVD score of 317.10. The second set induces slightly less spatial corruption together with severe temporal inconsistency (right), yet it receives a lower (better) FVD score of 310.52. This discrepancy highlights the metric's bias toward spatial quality. In the following, we present experiments that quantify this content bias and trace its origin, progressively moving from synthetic to real-world settings.
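
To recap how the metric is computed: FVD fits a Gaussian to features extracted from the reference videos and another to features from the generated videos, then reports the Fréchet distance between the two. Below is a minimal NumPy/SciPy sketch of this distance; the feature extractor itself (an I3D network in standard FVD) is left abstract.

import numpy as np
from scipy import linalg

def frechet_distance(feats_ref, feats_gen):
    # feats_ref, feats_gen: (num_videos, feature_dim) arrays of video features,
    # e.g. extracted with the I3D backbone used by standard FVD.
    mu_r, mu_g = feats_ref.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_ref, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))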


[Corruption examples] Two original videos, each shown with its spatial-only and spatiotemporal corruptions.

Quantifying the temporal sensitivity. We first design distortions that degrade per-frame quality to the same extent while either leaving temporal quality intact or degrading it significantly. By comparing the FVD increase induced by the spatiotemporal corruption against that of the spatial corruption, we can analyze FVD's relative sensitivity to the temporal aspect.
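
As a concrete illustration (a minimal sketch with additive noise standing in for the distortions studied in the paper), the same corruption pattern can either be shared across frames, degrading only spatial quality, or resampled per frame, additionally breaking temporal consistency at matched per-frame quality:

import numpy as np

def corrupt(video, strength=0.1, temporally_consistent=True, seed=0):
    # video: (T, H, W, C) array in [0, 1]. Sharing one noise pattern across
    # frames degrades only spatial quality; drawing a fresh pattern per frame
    # additionally breaks temporal consistency at similar per-frame quality.
    rng = np.random.default_rng(seed)
    t, h, w, c = video.shape
    if temporally_consistent:
        noise = np.broadcast_to(rng.normal(0.0, strength, (1, h, w, c)), video.shape)
    else:
        noise = rng.normal(0.0, strength, video.shape)
    return np.clip(video + noise, 0.0, 1.0)

# spatial_only   = corrupt(video, temporally_consistent=True)
# spatiotemporal = corrupt(video, temporally_consistent=False)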

We first verify that the two distorted video sets share similar frame quality: as shown in the table, the FID difference between the spatial and spatiotemporal distortions is minimal across datasets. We then find that FVD sometimes fails to detect the drop in temporal quality induced by the spatiotemporal corruption. For example, the temporal inconsistency on the FaceForensics dataset raises FVD by only 3.6%. To further gauge how significant this FVD increase is, we compare it with the FVD computed using different feature extractors below.
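
The relative sensitivity can be summarized as the extra FVD increase attributable to the temporal corruption alone. A small sketch, reusing frechet_distance from above and assuming the feature arrays have already been extracted with the chosen backbone:

def temporal_sensitivity(feats_ref, feats_spatial, feats_spatiotemporal):
    # Relative FVD increase attributable to the added temporal inconsistency.
    fvd_spatial = frechet_distance(feats_ref, feats_spatial)
    fvd_spatiotemporal = frechet_distance(feats_ref, feats_spatiotemporal)
    return (fvd_spatiotemporal - fvd_spatial) / fvd_spatial

# A value of about 0.036 corresponds to the 3.6% increase observed on FaceForensics.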


Origin of FVD's content bias. FVD employs an Inflated 3D ConvNet (I3D) model trained for action recognition on the Kinetics-400 dataset, a task known to be biased toward static content features rather than motion. We thus conjecture that FVD's bias can be attributed to the features extracted by such a supervised video classifier.

To disentangle the confounding factors, including the model architecture, training objective, model capacity, and training dataset, we conduct a comparative study across different pretrained models, whose differences are summarized in the table. As shown in the figure on the left, using features from self-supervised models boosts FVD's temporal sensitivity, and training on content-debiased data further mitigates the bias.
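
In effect, the comparative study holds the Fréchet distance fixed and swaps only the feature extractor. A hedged sketch, where extract_i3d_features and extract_videomae_features are hypothetical wrappers around the respective pretrained backbones:

def fvd_with_backbone(ref_videos, gen_videos, extract_features):
    # extract_features maps a batch of clips to an (N, D) feature array.
    feats_ref = extract_features(ref_videos)
    feats_gen = extract_features(gen_videos)
    return frechet_distance(feats_ref, feats_gen)

# fvd_i3d      = fvd_with_backbone(ref, gen, extract_i3d_features)       # supervised, content-biased
# fvd_videomae = fvd_with_backbone(ref, gen, extract_videomae_features)  # self-supervised, more temporally sensitive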


Without improving the temporal quality of the generated videos, the FVD score can still be decreased. We now move from synthetic corruptions to generated videos. Following Kynkäänniemi et al., we probe FVD's perceptual null space, namely the space in which the temporal quality of generated videos remains unchanged while the FVD score can still be effectively manipulated. To do so, we first generate a large candidate set of videos without any motion. We then carefully sample from this set to decrease FVD (the resulting score is denoted FVD*).
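
One simple way to perform this sampling (a sketch only; the exact selection strategy in the paper may differ) is to search over subsets of the motion-free candidate pool and keep the subset with the lowest Fréchet distance to the reference features:

import numpy as np

def probe_null_space(feats_ref, feats_candidates, subset_size, num_trials=1000, seed=0):
    # Random-search sketch: draw subsets of the motion-free candidate features
    # (an (M, D) array) and keep the one minimizing the Frechet distance to the
    # reference set; the resulting score plays the role of FVD*.
    rng = np.random.default_rng(seed)
    best_fvd, best_idx = np.inf, None
    for _ in range(num_trials):
        idx = rng.choice(len(feats_candidates), size=subset_size, replace=False)
        fvd = frechet_distance(feats_ref, feats_candidates[idx])
        if fvd < best_fvd:
            best_fvd, best_idx = fvd, idx
    return best_fvd, best_idx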

As shown in the table above, across different video generators and training datasets, one can reduce FVD by up to half simply by selecting from the candidate videos, despite the complete absence of motion. Conversely, when the FVD features are computed with the VideoMAE-v2 model, which is sensitive to temporal quality, the observed gaps shrink significantly and the scores can hardly be decreased through resampling.

We have now shown that FVD is highly insensitive to the temporal quality and consistency of generated videos, verified the hypothesis that the bias originates from content-biased video features, and demonstrated that self-supervised features mitigate the issue. Next, we extend our study to real-world examples.


[Generated videos] Default StyleGAN-v vs. StyleGAN-v with LSTM motion codes.

Case study I. The default StyleGAN-v model generates natural motions, while its variant with LSTM motion codes generates repeated temporal patterns. However, a previous study found that the FVD metric fails to capture the variant's worse quality. We observe the same trend, as shown in the table on the right. When FVD is computed with VideoMAE-v2 features, the scores agree with human judgment.


[Generated videos] DIGAN frames 0 - 16 vs. frames 128 - 144.

Case study II. The first 16 frames generated by the DIGAN model exhibit natural motions, while the extrapolated frames contain periodic spatiotemporal artifacts. Again, a previous paper observed that FVD fails to distinguish between the two, as confirmed by the table. In contrast, FVD computed with VideoMAE features better follows human judgment.

FVD toolkit. We provide code and pre-computed features for computing FVD with different feature extractors. The toolkit is available in our GitHub repository.

Acknowledgment

We thank Angjoo Kanazawa, Aleksander Holynski, Devi Parikh, and Yogesh Balaji for the early feedback and discussion. We thank Or Patashnik, Richard Zhang, and Hadi Alzayer for their helpful comments and paper proofreading. We thank Ivan Skorokhodov for his help with reproducing the StyleGAN-v ablation experiments. This work is partly supported by NSF grant No. IIS-239076, the Packard Fellowship, as well as NSF grants No. IIS-1910132 and IIS-2213335.

BibTeX

@inproceedings{ge2024content,
      title={On the Content Bias in Fréchet Video Distance},
      author={Ge, Songwei and Mahapatra, Aniruddha and Parmar, Gaurav and Zhu, Jun-Yan and Huang, Jia-Bin},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
}