Abstract
Knowledge distillation (KD) is one of the most efficient methods for compressing a large deep neural network (the teacher) into a smaller network (the student). Current state-of-the-art KD methods assume that the teacher and the student are trained on data drawn from the same distribution, which keeps the student's accuracy close to the teacher's. However, this strong assumption does not hold in many real-world applications, where there is a distribution mismatch between the teacher's training data and the student's training data, and existing KD methods often fail in this setting. To overcome this problem, we propose a novel KD method that remains effective under distribution mismatch. We first learn a distribution over the student's training data from which we can sample images that are well-classified by the teacher; this lets us discover the region of the data space where the teacher has good knowledge to transfer to the student. We then propose a new loss function for training the student network, which achieves better accuracy than the standard KD loss function. We conduct extensive experiments to demonstrate that our method works well for KD tasks with or without distribution mismatch. To the best of our knowledge, ours is the first method to address the challenge of distribution mismatch in knowledge distillation.
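For reference, the "standard KD loss function" mentioned in the abstract is the temperature-softened objective of Hinton et al. (2015). The sketch below is not the paper's proposed loss; it is a minimal PyTorch illustration of that baseline, plus a deliberately crude, hypothetical confidence filter that only gestures at the idea of "images well-classified by the teacher" (the paper instead learns a distribution over the student's data and samples from it). The names standard_kd_loss, teacher_confident_mask, and the values of T, alpha, and threshold are illustrative assumptions, not taken from the paper.

import torch
import torch.nn.functional as F

def standard_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Baseline KD objective (Hinton et al., 2015): cross-entropy with the
    # ground-truth labels plus KL divergence between temperature-softened
    # teacher and student distributions, scaled by T^2.
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - alpha) * ce + alpha * kl

@torch.no_grad()
def teacher_confident_mask(teacher_logits, threshold=0.9):
    # Illustrative proxy only: keep samples whose teacher top-class
    # probability exceeds a threshold. The paper's method is different:
    # it learns a distribution over the student's training data from
    # which such well-classified samples can be drawn.
    probs = F.softmax(teacher_logits, dim=1)
    return probs.max(dim=1).values >= threshold

Under distribution mismatch, a student batch filtered with teacher_confident_mask and trained with standard_kd_loss would approximate restricting distillation to the region where the teacher is knowledgeable; the paper's contribution is a principled way to find and exploit that region together with an improved loss.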
Notes
1. This is possible because we use benchmark datasets, and the training and test splits are fixed.
Acknowledgment
This research was a collaboration between the Commonwealth of Australia (represented by the Department of Defence) and Deakin University through a Defence Science Partnerships agreement.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Nguyen, D. et al. (2021). Knowledge Distillation with Distribution Mismatch. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science(), vol 12976. Springer, Cham. https://doi.org/10.1007/978-3-030-86520-7_16
DOI: https://doi.org/10.1007/978-3-030-86520-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86519-1
Online ISBN: 978-3-030-86520-7
eBook Packages: Computer Science; Computer Science (R0)