hyperlinked text contents are ignored #406

sandhan26 · 2017-06-19T10:57:25Z

I wrote a small script to fetch the contents of a table in a .docx file. Some of the cells in the table contain hyperlinked text.

While iterating over the cells of the table's rows, I was able to retrieve the text contents using the .text attribute.
However, if a cell contains text with hyperlinks, the hyperlinked text is simply ignored, while the rest of the text is retrieved successfully.
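
For readers following along, here is a minimal sketch of why this can happen. It assumes the usual WordprocessingML layout, in which a hyperlink's runs sit inside a w:hyperlink wrapper rather than directly under the paragraph, so a reader that only visits direct-child runs skips them. It uses only the standard library, not python-docx itself, and the sample XML and both function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical paragraph: one plain run and one run wrapped in w:hyperlink.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:hyperlink>
    <w:r><w:t>the docs</w:t></w:r>
  </w:hyperlink>
</w:p>
"""


def direct_run_text(p):
    """Text of runs that are direct children of the paragraph only."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def all_run_text(p):
    """Text of every w:t descendant, including those inside w:hyperlink."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


paragraph = ET.fromstring(PARAGRAPH_XML)
print(repr(direct_run_text(paragraph)))  # 'See '
print(repr(all_run_text(paragraph)))     # 'See the docs'
```

The first function mirrors the symptom reported here (the linked words go missing); the second shows that the text is still present in the XML and can be reached by walking all descendants.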

q210 · 2017-09-22T13:18:22Z

I stumbled into this problem too.
Sadly, it's not that easy to fix (the reasons are explained here: #377 (comment)).

AFAIK there is some progress on this in issue #85 and an official patch will be released eventually. But if you need this working right now and are content with getting only the text of hyperlinks (without the address part), you can install the package with this quick-and-dirty hack, q210@336ed9f, from my own fork (based on the patch proposed by Brad-Python in #85).

The pip command is:
pip install git+git://github.com/q210/python-docx.git@336ed9fed27ff0460b674d91ba1646ded9cecb59

paaguti · 2018-11-08T12:53:42Z

Running into the same wall. I would appreciate an official patch for this. I tested the git install above and it solves my problem.

Thanks, /PA

paaguti · 2018-11-09T09:07:29Z

I have written a brute-force PoC using just lxml, which seems to work. I hope it helps.

/PA

docx2txt.zip

kart8172 · 2020-01-08T05:44:02Z

I downloaded the file mentioned below and it is not working for me.

pip install git+git://github.com/q210/pythondocx.git@336ed9f

q210 · 2020-01-08T12:21:29Z

@kart8172 if you post a minimal example of your data here (perhaps the .docx file and the script where you are trying to use my fork) and describe your expectations in more detail, I can try to debug the problem.

AAAves · 2021-04-20T11:07:03Z

Hi, I also ran into this problem and tried working on the XML directly to fix it.
It works for me, so you can try it too.

from docx import Document
from docx.oxml.ns import qn

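def find_hyperlink_indoc(doc):
    '''
    :param doc: document object obtained with doc = Document('./xxxx.docx')
    :return: a list of all w:hyperlink elements in the document body.
    '''
    xml_e = doc.element
    hyperlink_list = xml_e.findall('.//' + qn("w:hyperlink"))
    return hyperlink_list


def get_hyperlink_text(hyperlink_item):
    # A hyperlink can span several runs, so join the text of every w:t element.
    t_items = hyperlink_item.findall('.//' + qn("w:t"))
    return ''.join(t.text or '' for t in t_items)


def set_hyperlink_text(hyperlink_item, text):
    # Put the new text in the first w:t element and blank out any others.
    t_items = hyperlink_item.findall('.//' + qn("w:t"))
    if not t_items:
        return
    t_items[0].text = text
    for t in t_items[1:]:
        t.text = ''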

doc = Document('./test.docx')
hl_list = find_hyperlink_indoc(doc)
for item in hl_list:
    print(get_hyperlink_text(item))
    set_hyperlink_text(item, 'testtest')
doc.save('./test_out.docx')

scanny · 2023-09-28T00:29:53Z

Paragraph.text will include the text of hyperlinks in the forthcoming v1.0 due out in a week or so.

realchrisolin mentioned this issue Sep 25, 2020

<w:t> tags not detected as text in certain scenario #870

Closed

scanny closed this as completed Sep 28, 2023
