Return Format: Output - Body (I) (J) (K) (L)


Return Format

Some structure will be maintained. Text will be returned in a nested list, with paragraphs
always at depth 4 (i.e.,  output.body[i][j][k][l]  will be a paragraph).

If your docx has no tables, output.body will appear as one table with all content in one cell:

[  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1) numbered list",
                "2) continuing numbered list",
                " a) sublist",
                " i) sublist of sublist",
                "3) keeps track of indention levels",
                " a) resets sublist counters"
            ]
        ]
    ]
]
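As a small illustration of the depth-4 rule, the sketch below indexes into a hand-written nested list shaped like the structure above. The name `body` is a stand-in for output.body and is typed out by hand here; nothing is read from a real docx file.

```python
# A hand-written stand-in for output.body of a table-free document:
# one table, one row, one cell, six paragraphs.
body = [  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1) numbered list",
                "2) continuing numbered list",
            ]
        ]
    ]
]

# Four indices (table, row, cell, paragraph) always reach a string.
second = body[0][0][0][1]
print(second)  # Paragraph 2

# Flatten every paragraph in document order.
paragraphs = [
    paragraph
    for table in body
    for row in table
    for cell in row
    for paragraph in cell
]
print(len(paragraphs))  # 6
```

Because every paragraph sits at the same depth, the same four nested loops work whether the text came from a real table or from text outside any table.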

Table cells will appear as table cells. Text outside tables will appear as table cells.

A docx document can contain tables within tables within tables. Docx2Python flattens most of
this nesting to make the content easier to navigate.

Working with output

This package provides several documented helper functions in the  docx2python.iterators
module. Here are a few recipes possible with these functions:

from docx2python.iterators import enum_cells

def remove_empty_paragraphs(tables):
    """Drop empty-string paragraphs from every cell, in place."""
    for (i, j, k), cell in enum_cells(tables):
        tables[i][j][k] = [x for x in cell if x]

>>> tables = [[[['a', 'b'], ['a', '', 'd', '']]]]
>>> remove_empty_paragraphs(tables)
>>> tables
[[[['a', 'b'], ['a', 'd']]]]
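To make the iteration pattern behind this recipe concrete, here is a minimal pure-Python sketch of a depth-based enumerator. The function `enum_at_depth_sketch` is a hypothetical stand-in written for illustration; it is not the library's implementation, and it only assumes that such helpers yield (index tuple, item) pairs, as the recipes here use them.

```python
from typing import Any, Iterator, List, Tuple


def enum_at_depth_sketch(
    nested: List[Any], depth: int
) -> Iterator[Tuple[Tuple[int, ...], Any]]:
    """Yield (index_tuple, item) for every item `depth` levels down.

    Hypothetical stand-in for illustration; not docx2python's own code.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    for i, item in enumerate(nested):
        if depth == 1:
            yield (i,), item
        else:
            for index, sub_item in enum_at_depth_sketch(item, depth - 1):
                yield (i,) + index, sub_item


tables = [[[["a", "b"], ["a", "", "d", ""]]]]

# Depth 3 reaches cells, so the index tuple is (table, row, cell).
for (i, j, k), cell in enum_at_depth_sketch(tables, 3):
    tables[i][j][k] = [x for x in cell if x]

print(tables)  # [[[['a', 'b'], ['a', 'd']]]]
```

Yielding the index tuple alongside each item is what lets a recipe assign back into the nested list rather than only read from it.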

from docx2python.iterators import enum_at_depth

def html_map(tables) -> str:
    """Create an HTML map of document contents.

    Render this in a browser to visually search for data.

    :tables: value could come from, e.g.,
        * docx_to_text_output.document
        * docx_to_text_output.body
    """
    # prepend index tuple to each paragraph
    for (i, j, k, l), paragraph in enum_at_depth(tables, 4):
        tables[i][j][k][l] = " ".join([str((i, j, k, l)), paragraph])

    # wrap each paragraph in <pre> tags
    for (i, j, k), cell in enum_at_depth(tables, 3):
        tables[i][j][k] = "".join(["<pre>{x}</pre>".format(x=x) for x in cell])

    # wrap each cell in <td> tags
    for (i, j), row in enum_at_depth(tables, 2):
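
The listing above stops at the cell-wrapping loop. As a rough, self-contained sketch of the same idea, the function below builds an HTML map with plain nested loops instead of the package's iterators. The name `html_map_sketch`, the `<tr>`, `<table>` and `<html>` wrappers, and the HTML escaping are assumptions added for illustration; they are not taken from the excerpt.

```python
from html import escape
from typing import Any, List


def html_map_sketch(tables: List[Any]) -> str:
    """Return an HTML string with one <table> per table in `tables`.

    Assumes the depth-4 layout tables[i][j][k][l] -> paragraph string.
    Hypothetical helper; the input list is not modified.
    """
    html_tables = []
    for i, table in enumerate(tables):
        html_rows = []
        for j, row in enumerate(table):
            html_cells = []
            for k, cell in enumerate(row):
                # Prefix each paragraph with its index tuple, wrap in <pre>.
                pres = "".join(
                    "<pre>{}</pre>".format(escape(f"{(i, j, k, l)} {paragraph}"))
                    for l, paragraph in enumerate(cell)
                )
                html_cells.append(f"<td>{pres}</td>")
            html_rows.append("<tr>{}</tr>".format("".join(html_cells)))
        html_tables.append(
            '<table border="1">{}</table>'.format("".join(html_rows))
        )
    return "<html><body>{}</body></html>".format("".join(html_tables))


page = html_map_sketch([[[["a", "b"]]]])
print(page)
```

Unlike the recipe above, this sketch builds new strings instead of overwriting the nested list, so the same data can still be indexed after the map is generated.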
