Abstract
Large-scale datasets have played an indispensable role in the recent success of face generation and editing, and have significantly facilitated the advances of emerging research fields. However, the academic community still lacks a video dataset with diverse facial attribute annotations, which is crucial for research on face-related videos. In this work, we propose a large-scale, high-quality, and diverse video dataset with rich facial attribute annotations, named the High-Quality Celebrity Video Dataset (CelebV-HQ). CelebV-HQ contains 35,666 video clips with a resolution of at least \(512\times 512\), involving 15,653 identities. All clips are manually labeled with 83 facial attributes, covering appearance, action, and emotion. We conduct a comprehensive analysis in terms of age, ethnicity, brightness stability, motion smoothness, head pose diversity, and data quality to demonstrate the diversity and temporal coherence of CelebV-HQ. Furthermore, its versatility and potential are validated on two representative tasks, i.e., unconditional video generation and video facial attribute editing. Finally, we envision the future potential of CelebV-HQ, as well as the new opportunities and challenges it will bring to related research directions. Data, code, and models are publicly available (project page: https://celebv-hq.github.io/; code and models: https://github.com/CelebV-HQ/CelebV-HQ).
H. Zhu and W. Wu—Equal Contribution.
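To make the annotation structure concrete, below is a minimal sketch of how per-clip attribute labels of this kind could be consumed, e.g., to select clips for an attribute-conditioned task. The file name celebvhq_info.json, the "clips"/"attributes" keys, the binary-vector encoding, and the label names are illustrative assumptions, not the released schema; consult the project repository for the actual format.

```python
import json

# Placeholder vocabularies; CelebV-HQ defines 83 labels in total across
# appearance, action, and emotion (the names below are illustrative only).
APPEARANCE = ["young", "wearing_glasses", "black_hair"]  # ... remaining appearance labels
ACTION = ["talk", "smile", "blink"]                      # ... remaining action labels
EMOTION = ["neutral", "happiness", "sadness"]            # ... remaining emotion labels
VOCAB = {"appearance": APPEARANCE, "action": ACTION, "emotion": EMOTION}

def load_clips(path="celebvhq_info.json"):
    """Load per-clip metadata; assumes one JSON object keyed by clip id."""
    with open(path) as f:
        return json.load(f)["clips"]

def clips_with(clips, group, label):
    """Yield ids of clips whose binary attribute vector sets `label` in `group`."""
    idx = VOCAB[group].index(label)
    for clip_id, meta in clips.items():
        if meta["attributes"][group][idx] == 1:
            yield clip_id

if __name__ == "__main__":
    clips = load_clips()
    talking = list(clips_with(clips, "action", "talk"))
    print(f"{len(talking)} clips carry the 'talk' action label")
```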
Acknowledgement
This work is supported by Shanghai AI Laboratory and SenseTime Research. It is also supported by NTU NAP, MOE AcRF Tier 1 (2021-T1-001-088), and the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contributions from the industry partner(s).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhu, H., et al. (2022). CelebV-HQ: A Large-Scale Video Facial Attributes Dataset. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13667. Springer, Cham. https://doi.org/10.1007/978-3-031-20071-7_38
Print ISBN: 978-3-031-20070-0
Online ISBN: 978-3-031-20071-7