Diffusion-Based Document Layout Generation

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14187)

Abstract

We develop a diffusion-based approach to document layout sequence generation. Layout sequences specify the contents of a document design in an explicit format. Our novel diffusion-based approach operates in the sequence domain rather than the image domain, which permits more complex and realistic layouts. We also introduce a new metric, Document Earth Mover’s Distance (Doc-EMD). By considering similarity between document designs of heterogeneous categories, it addresses a shortcoming of prior document metrics, which only evaluate layouts of the same category. Our empirical analysis shows that our diffusion-based approach is comparable to, or outperforms, previous layout generation methods across various document datasets. Moreover, our metric differentiates documents better than previous metrics in specific cases.
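Doc-EMD builds on the Earth Mover's Distance between document layouts. The abstract does not give the paper's exact cost design, so the following is only a minimal sketch of an EMD-style layout distance: the `(category, cx, cy)` element format and the `cat_penalty` and `miss_cost` weights are assumptions of this illustration, not the authors' definitions. With equal-weight elements, the transport problem reduces to a minimum-cost matching, computed here with SciPy's Hungarian-algorithm solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def layout_distance(a, b, cat_penalty=1.0, miss_cost=1.0):
    """EMD-style distance between two document layouts.

    Each layout is a list of (category, cx, cy) elements with box
    centers normalized to [0, 1]. A category mismatch adds cat_penalty
    to the transport cost; unmatched elements (when the layouts differ
    in size) pay a fixed miss_cost.
    """
    size = max(len(a), len(b))
    # Square cost matrix padded with dummy entries so a full
    # one-to-one assignment always exists.
    cost = np.full((size, size), miss_cost, dtype=float)
    for i, (ca, xa, ya) in enumerate(a):
        for j, (cb, xb, yb) in enumerate(b):
            spatial = np.hypot(xa - xb, ya - yb)
            cost[i, j] = spatial + (cat_penalty if ca != cb else 0.0)
    rows, cols = linear_sum_assignment(cost)  # minimum-cost matching
    return cost[rows, cols].sum() / size
```

Because the cost matrix is transposed when the arguments are swapped, the distance is symmetric, and identical layouts score exactly zero; a full EMD with unequal element weights would instead require an optimal-transport solver such as the POT library cited by the paper.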

L. He—Work done while a research intern at Microsoft Cloud and AI.


Author information

Correspondence to Liu He, Yijuan Lu, or John Corring.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

He, L., Lu, Y., Corring, J., Florencio, D., Zhang, C. (2023). Diffusion-Based Document Layout Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_21

  • DOI: https://doi.org/10.1007/978-3-031-41676-7_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41675-0

  • Online ISBN: 978-3-031-41676-7

  • eBook Packages: Computer Science (R0)
