Abstract
The massive growth of classroom video data makes it possible to use artificial intelligence to automatically recognize, detect, and caption students’ behaviors, which benefits research in related fields such as pedagogy and educational psychology. However, the lack of a dataset specifically designed for students’ classroom behaviors may hinder such studies. This paper presents a comprehensive dataset for recognizing, detecting, and captioning students’ behaviors in the classroom. We collected videos of 128 classes across different disciplines in 11 classrooms. The dataset consists of a detection part, a recognition part, and a captioning part: the detection part includes a temporal detection module with 4542 samples and an action detection module with 3343 samples, the recognition part contains 4276 samples, and the captioning part contains 4296 samples. Moreover, the students’ behaviors were captured spontaneously in real classes, making the dataset representative and realistic. We analyze the special characteristics of the classroom scene and the technical difficulties of each module (task), which are verified by experiments. Owing to the particularities of classroom scenes, our dataset poses greater challenges to existing methods. We further provide a baseline for each task module and compare the dataset with current mainstream datasets. The results show that our dataset is viable and reliable. Additionally, we present a thorough performance analysis of each baseline model to provide a comprehensive reference for models evaluated on the presented dataset. The dataset and code are available online: https://github.com/BNU-Wu/Student-Class-Behavior-Dataset/tree/master.
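For readers who want to explore the released data programmatically, the minimal sketch below shows one way to enumerate the four task modules summarized above. The directory names, file extensions, and overall layout are illustrative assumptions only; the actual organization of the repository is documented at the link above.

```python
# Minimal sketch (illustrative only): enumerate the four task modules of the
# dataset. Directory layout and file extensions are assumptions, not the
# actual structure of the released repository.
from pathlib import Path
from collections import Counter

DATASET_ROOT = Path("StudentClassBehaviorDataset")  # hypothetical root folder
MODULES = {
    "temporal_detection": "detection/temporal",  # 4542 samples reported in the paper
    "action_detection": "detection/action",      # 3343 samples
    "recognition": "recognition",                 # 4276 samples
    "captioning": "captioning",                   # 4296 samples
}

def count_samples(root: Path = DATASET_ROOT) -> Counter:
    """Count video clips per task module (assumes .mp4/.avi clip files)."""
    counts = Counter()
    for name, rel_dir in MODULES.items():
        module_dir = root / rel_dir
        if module_dir.is_dir():
            counts[name] = sum(
                1 for p in module_dir.rglob("*")
                if p.suffix.lower() in {".mp4", ".avi"}
            )
    return counts

if __name__ == "__main__":
    for module, n in count_samples().items():
        print(f"{module}: {n} video samples")
```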
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest, and no commercial or associative interest, in connection with the submitted work.