hyperlinked text contents are ignored #406

Closed
sandhan26 opened this issue Jun 19, 2017 · 7 comments

sandhan26 commented Jun 19, 2017

I wrote a small script to fetch the contents of a table in a docx file. Some of the cells in the table contain hyperlinked text.

While iterating over the cells of each table row, I was able to retrieve the text contents using the .text attribute.
However, if a cell contains text with hyperlinks, the hyperlinked text is silently ignored, while the rest of the text is retrieved successfully.
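
A minimal sketch of one way to collect all the text of a cell, hyperlinks included, by walking the underlying XML rather than relying on .text. It assumes the usual WordprocessingML layout (w:hyperlink wraps ordinary w:r/w:t runs); the function name `text_including_hyperlinks` and the hand-written cell XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by the w: prefix in WordprocessingML.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_including_hyperlinks(element):
    """Return the text of every w:t descendant, one line per w:p paragraph."""
    lines = []
    for p in element.iter(f"{{{W_NS}}}p"):
        # iter() descends into w:hyperlink, so hyperlinked runs are included.
        lines.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return "\n".join(lines)


# Hypothetical table cell: plain text followed by a hyperlinked run.
CELL_XML = f"""
<w:tc xmlns:w="{W_NS}">
  <w:p>
    <w:r><w:t xml:space="preserve">See </w:t></w:r>
    <w:hyperlink><w:r><w:t>the docs</w:t></w:r></w:hyperlink>
  </w:p>
</w:tc>
"""

cell = ET.fromstring(CELL_XML)
print(text_including_hyperlinks(cell))
```

The function only uses `.iter()`, so it should also accept the lxml elements that python-docx wraps (for example the private `cell._tc` attribute), though that part is untested here.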

q210 commented Sep 22, 2017

I stumbled into this problem too.
Sadly, it is not that easy to fix (the reasons are explained in #377 (comment)).

AFAIK there is some progress on this in issue #85, and an official patch will be released eventually. If you need this working right now and are content with getting only the text of hyperlinks (without the address part), you can install the package with this quick-and-dirty hack, q210@336ed9f, from my own fork (based on the patch proposed by Brad-Python in #85).

The pip command is:
pip install git+git://github.com/q210/python-docx.git@336ed9fed27ff0460b674d91ba1646ded9cecb59

paaguti commented Nov 8, 2018

I am running into the same wall and would appreciate an official patch for this. I tested the fork above, and it solves my problem.

Thanks, /PA

paaguti commented Nov 9, 2018

I have written a brute-force PoC using only lxml, which seems to work. I hope it helps.

/PA

docx2txt.zip
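
The attachment is not reproduced in this thread. As a rough illustration of the same brute-force idea (not the attachment's code), here is a sketch that reads the hyperlink text and target straight out of the .docx package. It uses the standard library's `zipfile` and `xml.etree.ElementTree` instead of lxml, and it assumes the main document part lives at `word/document.xml`, which is the usual location but not guaranteed. The function name `hyperlinks_in_docx` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"

DOC_PART = "word/document.xml"               # assumed location of the main part
RELS_PART = "word/_rels/document.xml.rels"   # relationships of that part


def hyperlinks_in_docx(file):
    """Return a list of (text, target) pairs, one per w:hyperlink element.

    `file` is a path or a binary file-like object. `target` is None when the
    hyperlink has no r:id (for example, a link to an internal anchor).
    """
    with zipfile.ZipFile(file) as z:
        body = ET.fromstring(z.read(DOC_PART))
        targets = {}
        if RELS_PART in z.namelist():
            rels = ET.fromstring(z.read(RELS_PART))
            for rel in rels.iter(f"{{{PKG_REL}}}Relationship"):
                targets[rel.get("Id")] = rel.get("Target")

    result = []
    for link in body.iter(f"{{{W}}}hyperlink"):
        # A hyperlink can span several runs, so join all of its w:t elements.
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        result.append((text, targets.get(link.get(f"{{{R}}}id"))))
    return result
```

Calling `hyperlinks_in_docx("./test.docx")` on a real file should give one pair per link; headers, footers, and footnotes live in other parts and are not covered by this sketch.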

kart8172 commented Jan 8, 2020

I installed the package with the command below, and it is not working for me.

pip install git+git://github.com/q210/pythondocx.git@336ed9f

q210 commented Jan 8, 2020

@kart8172 if you post a minimal example of your data here (perhaps a .docx file and the script where you are trying to use my fork) and describe your expectations in more detail, I can try to debug the problem.

AAAves commented Apr 20, 2021

Hi, I also ran into this problem and worked around it by accessing the XML directly.
It works for me; you can try it.

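from docx import Document
from docx.oxml.ns import qn


def find_hyperlink_indoc(doc):
    '''
    :param doc: document loaded with doc = Document('./xxxx.docx')
    :return: a list of all w:hyperlink elements in the document body.
    '''
    xml_e = doc.element
    hyperlink_list = xml_e.findall('.//' + qn("w:hyperlink"))
    return hyperlink_list


def get_hyperlink_text(hyperlink_item):
    # A hyperlink may span several runs, so join every w:t element
    # instead of reading only the first one.
    t_items = hyperlink_item.findall('.//' + qn("w:t"))
    return ''.join(t.text or '' for t in t_items)


def set_hyperlink_text(hyperlink_item, text):
    # Put the new text in the first w:t element and blank out the rest.
    t_items = hyperlink_item.findall('.//' + qn("w:t"))
    if not t_items:
        return  # nothing to replace in a hyperlink without text
    t_items[0].text = text
    for t in t_items[1:]:
        t.text = ''


doc = Document('./test.docx')
hl_list = find_hyperlink_indoc(doc)
for item in hl_list:
    print(get_hyperlink_text(item))
    set_hyperlink_text(item, 'testtest')
doc.save('./test_out.docx')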

Contributor

scanny commented Sep 28, 2023

Paragraph.text will include the text of hyperlinks in the forthcoming v1.0 due out in a week or so.

scanny closed this as completed on Sep 28, 2023