Skip to content

Feature: Paragraph.add_hyperlink() (wip) #278

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

tanyunshi
Copy link

The PR answers to the feature #74

  • Protocol/schema analysis
  • Implementations and tests for paragraph.add_hyperlink
    -> implementation wip
  • Implementations and tests for paragraph.iter_run_level_items()

@tanyunshi
Copy link
Author

Hi again @scanny Could you please me some feekback on this PR ? I wait for your GO to attack the implementation part. :)

@tanyunshi
Copy link
Author

Hello @scanny ,

Thank you very much for the explanation in detail, it really helps :). I understand more now.

I am totally agree with you that we should escape special chars in Hyperlink.address. ^^

For <w:history>, I think you are right. In this document of Microsoft, it says

The history attribute value of true specifies that the target of the current hyperlink must be added to a list of visited hyperlinks when invoked within the document. end example]
The possible values for this attribute are defined by the ST_OnOff simple type (§22.9.2.7).

So the value can be 0/1/True/False.
But I don't think this <w:history> is important as <w:history> has not been set by google docx nor by open office. I suggest we don't implement <w:history> at the moment.

As you have mentioned, <w:anchor> is a bookmark name with '#' prepended, either for the document itself (no <r:id>) or for an external uri (with <r:id>). If neither <r:id> and the bookmark is present, the <w:anchor> is just a harmless bookmark name, e.g., as a link '#anchor_value'.
PS. document of Microsoft suggests that if the hyperlink is an external uri, the anchor attribute should be ignored. But in practice, Word put the anchor to the end of the uri with '#' prepended. I find this behavior is more reasonable as it corresponds to the behavior of the html anchors.
Therefore, we might want to keep the <w:anchor> of the hyperlink, even Hyperlink.address is None and the bookmark cannot find in the document itself. How do you think?

@tanyunshi
Copy link
Author

@scanny
My commits are actually work in progress. I will leave your a message when the commit is ready for the review. So you won't lose too much time on it. :)

Thank you again.

@scanny
Copy link
Contributor

scanny commented Mar 30, 2016

@tanyunshi Thanks for letting me know about the commits, no worries, just let me know when you're ready :)

On the w:anchor bit, the main question in my mind is whether Word complains about loading a document that contains a w:anchor element pointing to a bookmark that is not present in the document. Have you tried making and loading such a document and seeing how Word responds?

If Word doesn't mind I expect we wouldn't either :) But if it fails to load such a document we'll need to do due diligence to avoid allowing the user to create such a document. Does it make sense?

@ryan-rushton
Copy link

@scanny @tanyunshi I was wondering if this is getting merged with the main branch and if so when would we be looking at seeing it?

Thanks for all the awesome work!

@tanyunshi
Copy link
Author

Hey @rushton3179 I am always on the feature. I am a bit busy untils mid-may. I will catch up the dev after mi-may. :")

@ryan-rushton
Copy link

Hey @tanyunshi, no worries if you would ever like a hand implementing it let me know. I wrote a function that inserts hyperlinks and put it in the original feature request thread if its of any help too :)

@scanny scanny force-pushed the feature/hyperlink branch from 6086f5c to 436e8a2 Compare May 26, 2016 00:57
@spohner
Copy link

spohner commented Jul 26, 2016

@tanyunshi Great work! Will you be able to finalize this PR? Where should we start if you want help?

@scanny
Copy link
Contributor

scanny commented Jul 26, 2016

@spohner - I think you can pick this up more or less from the beginning of implementation. The analysis document (in the first commit above) is fairly well developed, so should provide the background needed. @tanyunshi also provided a "spiked" implementation, so I'm sure that's useful background as well, but to proceed with a merge we'll need tests, and it will need to be in test-driven commits.

The first step would be developing the acceptance test(s) in (features/{something}.feature). Then next is the unit-test/code pair that addresses the first error in the acceptance test, and so on in rinse-repeat style until the acceptance test passes. You can look at a couple of the recent pull requests for some idea how it all goes:

These two are pretty good examples:
#261
#291

Merges maintain a strictly test-driven discipline; it's the only way to keep the library robust while keeping maintenance/committer effort within doable levels :) The basic idea is:

  • No code gets merged without its corresponding unit test.
  • No unit test gets written without a failing acceptance test.
  • Each commit is one atomic piece of functionality, generally a single property or method.

Let me know with any questions once you've had a chance to look through :)

@ryan-rushton
Copy link

ryan-rushton commented Apr 23, 2017

@scanny I am currently up to implementing a hyperlink style and adding bookmarks in for anchors. I am having a little trouble as when I zip and unzip word files on my mac, word does not like them and says they are unreadable.

In styles.xml I am attempting to add

<w:lsdException w:name="Hyperlink" w:semiHidden="1" w:unhideWhenUsed="1"/>
<w:lsdException w:name="FollowedHyperlink" w:semiHidden="1" w:unhideWhenUsed="1"/>
   
...
  # In latent styles add this
  <w:style w:type="character" w:styleId="Hyperlink">
    <w:name w:val="Hyperlink"/>
    <w:basedOn w:val="DefaultParagraphFont"/>
    <w:uiPriority w:val="99"/>
    <w:unhideWhenUsed/>
    <w:rsid w:val="00E77728"/>
    <w:rPr>
      <w:color w:val="0563C1" w:themeColor="hyperlink"/>
      <w:u w:val="single"/>
    </w:rPr>
  </w:style>
  <w:style w:type="character" w:styleId="FollowedHyperlink">
    <w:name w:val="FollowedHyperlink"/>
    <w:basedOn w:val="DefaultParagraphFont"/>
    <w:uiPriority w:val="99"/>
    <w:semiHidden/>
    <w:unhideWhenUsed/>
    <w:rsid w:val="00E77728"/>
    <w:rPr>
      <w:color w:val="954F72" w:themeColor="followedHyperlink"/>
      <w:u w:val="single"/>
    </w:rPr>
  </w:style>

Do you have any tips for ensuring the modified default document is correct?
Currently I'm unzipping it, modifying the styles.xml and using zip -r default.zip * and then changing the extension.

I believe this should work as when using the hack I initially wrote when I manually set the rStyle value to 'Hyperlink' it would work as intended when modifying a document that didn't come from the default template.

Also I cannot seem to find any current implementation of inserting bookmarks into runs or paragraphs. The method of generating functions automatically has been a little confusing though so I may be missing something and thought I'd ask.

Other than that the basic feature tests I have written are working and I am just trying to fine tune it. Once I have those done I plan to do some more extensive testing.

Edit: I have gotten the bookmarks working properly thanks to @rev112, combing over his code made me realise I had a small mistake. Thanks!

@scanny
Copy link
Contributor

scanny commented Apr 23, 2017

I think I'll need a little more to go on to fully understand what you're trying to do, but in general you shouldn't have to work with zip files at all. That's all abstracted out by the docx.opc sub-package.

A Word file as a whole is known as a package. It can be saved as a zip archive, which is what a .docx file is, but can also be stored in other ways, such as in a database and so on. Each individual "file" within the zip archive is known as a part. This is the language you'll see reflected in the opc subpackage.

The opc subpackage takes care of all the details of opening (unzipping) and saving (zipping) .docx files and getting the contents into and out of the various Part subtypes. There's already a StylesPart object: https://github.com/python-openxml/python-docx/blob/master/docx/parts/styles.py

The styles part is a subclass of XmlPart, which has the root XML object of that part on .element as a pre-parsed lxml object. Actually lxml is loosely wrapped with xmlchemy, which is the foundation for the docx.oxml layer, but the point is all lxml._Element calls work on every element. Best not to go below that interface for modifying XML.

Pretty much every kind of operation on the XML has been done somewhere in the library already, and can serve as an example. If you explain more specifically what you're trying to accomplish I can probably guide you to those examples.

If all you want to do is add a few styles to the document, there's already API for that. Check out this page in the documentation: http://python-docx.readthedocs.io/en/latest/api/style.html#styles-objects

@ryan-rushton
Copy link

ryan-rushton commented Apr 23, 2017

@scanny Awesome, that should get me through what I need and I will hopefully have a PR ready in the next couple of days. When it comes to testing, so far I have come up with most scenarios I can think of in the acceptance testing and I have noticed you comment before that unit testing is used in python-docx when acceptance test cases fail. If this is the case is there anything I should write unit test for just to ensure everything is as needed? Cheers!

@scanny
Copy link
Contributor

scanny commented Apr 24, 2017

The general flow is you write an acceptance test before any code. It should fail, of course, because the code that implements it isn't there. You mark the acceptance test with the @wip decorator so the test suite can skip these and therefore still pass. This acceptance test gets committed as a separate commit with message starting with 'acpt:...'.

Where the acceptance test fails determines what the first unit test and code is written. Each commit is atomic, generally meaning one unit test and just the code it takes to make it pass. After a few commits, the acceptance test passes, the @wip decorator is removed from it, and that's one cycle.

This pattern is clearly visible in the commit history, which you should study a bit to get the flow. The one for python-pptx has exactly the same pattern and is perhaps more active; just look for a commit starting with 'acpt:' and read up from there.
https://github.com/scanny/python-pptx/commits/master

If you don't get this right you end up rewriting stuff or putting me to unnecessary effort (time) when it comes to merging.

@ryan-rushton
Copy link

Gotchya, while I have actually done this all in one commit on my branch (but using a similar workflow) I can set up a new branch and step through it like this so you get something that is easy to review and has all the necessary tests. It might be a week or so until I get it done but it shouldn't be too far off as it is all currently working.

@scanny
Copy link
Contributor

scanny commented Oct 2, 2023

Hi @tanyunshi, just wanted to let you know that your analysis doc made it into release v1.0.0 going out in the next few days.

Right now that's in this commit: 4935c30

But the SHA may change across rebases and force-pushes as it wends its way to release probably later this week.

In any case, wanted to thank you for that work and let you know it had been incorporated :)

There's no Paragraph.add_hyperlink() included yet, my main interest at the moment is improving granular extraction of text from documents, but many of the supporting objects are there and if you still want to take a crack at it these several years later just let me know :)

@scanny scanny deleted the branch python-openxml:feature/hyperlink October 2, 2023 04:07
@scanny scanny closed this Oct 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants