Abstract
We develop a diffusion-based approach for document layout sequence generation. Layout sequences specify the contents of a document design in an explicit format. Our novel diffusion-based approach works in the sequence domain rather than the image domain in order to permit more complex and realistic layouts. We also introduce a new metric, Document Earth Mover’s Distance (Doc-EMD). By considering similarity between heterogeneous categories of document designs, it addresses the shortcomings of prior document metrics that only evaluate layouts within a single category. Our empirical analysis shows that our diffusion-based approach is comparable to or outperforms previous methods for layout generation across various document datasets. Moreover, our metric differentiates documents better than previous metrics in specific cases.
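The Doc-EMD metric is defined in the full paper; as an illustrative sketch of the earth mover's distance idea it builds on, the snippet below computes an EMD between two equal-weight sets of layout boxes. With uniform weights and equal-sized sets, the EMD reduces to a minimum-cost bipartite matching, brute-forced here for small sets. The function name `layout_emd` and the (x, y, w, h) box parameterization are assumptions for illustration, not the paper's exact formulation.

```python
import math
from itertools import permutations

def layout_emd(boxes_a, boxes_b):
    """Illustrative earth mover's distance between two equal-weight
    sets of layout boxes, each box a (x, y, w, h) tuple.

    With uniform weights and equal set sizes, the EMD reduces to a
    minimum-cost bipartite matching; brute-forced here, so suitable
    only for small sets.
    """
    assert len(boxes_a) == len(boxes_b), "equal-size sets assumed"
    # try every matching of boxes_a to boxes_b, keep the cheapest
    best = min(
        sum(math.dist(a, b) for a, b in zip(boxes_a, perm))
        for perm in permutations(boxes_b)
    )
    return best / len(boxes_a)
```

For identical layouts the distance is zero; translating every box by a fixed offset shifts the distance by exactly that offset, which is the mass-transport intuition the metric relies on.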
L. He—Work done while a research intern at Microsoft Cloud and AI.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
He, L., Lu, Y., Corring, J., Florencio, D., Zhang, C. (2023). Diffusion-Based Document Layout Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_21
DOI: https://doi.org/10.1007/978-3-031-41676-7_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41675-0
Online ISBN: 978-3-031-41676-7