Abstract
We develop a diffusion-based approach for document layout sequence generation. Layout sequences specify the contents of a document design in an explicit format. Our novel diffusion-based approach works in the sequence domain rather than the image domain in order to permit more complex and realistic layouts. We also introduce a new metric, Document Earth Mover’s Distance (Doc-EMD). By considering similarity between heterogeneous categories of document designs, it addresses the shortcomings of prior document metrics that only evaluate layouts within a single category. Our empirical analysis shows that our diffusion-based approach is comparable to or outperforms previous methods for layout generation across various document datasets. Moreover, our metric differentiates documents better than previous metrics in specific cases.
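The Doc-EMD metric is defined in the full paper; as an illustrative sketch of the earth mover's distance idea it builds on, the snippet below computes an EMD between two equal-weight sets of layout boxes. With uniform weights and equal-sized sets, the EMD reduces to a minimum-cost bipartite matching, brute-forced here for small sets. The function name `layout_emd` and the (x, y, w, h) box parameterization are assumptions for illustration, not the paper's exact formulation.

```python
import math
from itertools import permutations

def layout_emd(boxes_a, boxes_b):
    """Illustrative earth mover's distance between two equal-weight
    sets of layout boxes, each box a (x, y, w, h) tuple.

    With uniform weights and equal set sizes, the EMD reduces to a
    minimum-cost bipartite matching; brute-forced here, so suitable
    only for small sets.
    """
    assert len(boxes_a) == len(boxes_b), "equal-size sets assumed"
    # try every matching of boxes_a to boxes_b, keep the cheapest
    best = min(
        sum(math.dist(a, b) for a, b in zip(boxes_a, perm))
        for perm in permutations(boxes_b)
    )
    return best / len(boxes_a)
```

For identical layouts the distance is zero; translating every box by a fixed offset shifts the distance by exactly that offset, which is the mass-transport intuition the metric relies on.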
L. He—Work done while a research intern at Microsoft Cloud and AI.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
He, L., Lu, Y., Corring, J., Florencio, D., Zhang, C. (2023). Diffusion-Based Document Layout Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14187. Springer, Cham. https://doi.org/10.1007/978-3-031-41676-7_21
DOI: https://doi.org/10.1007/978-3-031-41676-7_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41675-0
Online ISBN: 978-3-031-41676-7