html5lib is a pure-python library for parsing HTML. It is designed to conform to the HTML specification, which has formalized the error handling algorithms of legacy web browsers, and is now implemented by all major web browsers.
Python 2.6 and above (including 3) are supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others such as IronPython may or may not work; if you wish to try, you are strongly recommended to run the testsuite and report back!
The only required library dependency is six
, this can be found
packaged in PyPi.
Optionally:
datrie
can be used to improve parsing performance (though in almost all cases the improvement is trivial);lxml
is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);genshi
has a treewalker (but not builder); andchardet
(note currently this is only packaged on PyPi for Python 2, though several package managers include unofficial ports to Python 3) can be used as a fallback when character encoding cannot be determined.
html5lib is packaged with distutils. To install it use:
$ python setup.py install
Simple usage follows this pattern:
import html5lib with open("mydocument.html", "r") as fp: document = html5lib.parse(f)
or:
import html5lib document = html5lib.parse("<p>Hello World!")
More documentation is available in the docstrings.
Please report any bugs on the issue tracker.
These are nowadays contained in the html5lib-tests repository and included as a submodule, thus for git checkouts they must be initialized (for release tarballs this is unneeded):
$ git submodule init $ git submodule update
And then they can be run once nose
has been installed with
nosetests
. All should pass.
Pull requests are more than welcome — both to the library and to the documentation. Some useful information:
- We aim to follow PEP 8 in the library, but ignoring the 79-character-per-line limit, instead following a soft limit of 99, but allowing lines over this where it is the readable thing to do.
- We keep pyflakes reporting no errors or warnings at all times.
- We keep the master branch passing all tests at all times on all supported versions.
Travis CI is run against all pull requests and should enforce all of the above.
We also use an external code-review tool, which uses your GitHub login to authenticate. You'll get emails for changes on the review.
There's a mailing list available for support on Google Groups, html5lib-discuss, though you may have more success (and get a far quicker response) asking on IRC in #whatwg on irc.freenode.net.