html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the HTML specification, which formalizes the error-handling algorithms of legacy web browsers and is now implemented by all major web browsers.

Requirements

Python 2.6 and above (including Python 3) is supported. Implementations known to work are CPython (as the reference implementation) and PyPy. Jython is known not to work due to various bugs in its implementation of the language. Others, such as IronPython, may or may not work; if you wish to try one, you are strongly encouraged to run the test suite and report back!

The only required library dependency is six, which is available on PyPI.

Optionally:

datrie can be used to improve parsing performance (though in almost all cases the improvement is trivial);

lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy where it is known to cause segfaults);

genshi has a treewalker (but not builder); and

chardet can be used as a fallback when the character encoding cannot be determined (note that it is currently packaged on PyPI only for Python 2, though several package managers include unofficial ports to Python 3).
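To see which of the optional packages above are importable in the current environment, something like the following can be used. This is a minimal sketch, not part of html5lib: the helper name find_optional_packages is hypothetical, and it relies on importlib.util.find_spec, so it assumes Python 3.4 or later.

```python
import importlib.util

# Import names of the optional packages listed above.
OPTIONAL_PACKAGES = ("datrie", "lxml", "genshi", "chardet")


def find_optional_packages(names=OPTIONAL_PACKAGES):
    """Map each package name to True if it can be imported, else False.

    Hypothetical helper for illustration; html5lib does not provide it.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in find_optional_packages().items():
        print("%-8s %s" % (name, "available" if present else "not installed"))
```

A missing package is not an error: all four are optional, and html5lib works without them.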

Installation

html5lib is packaged with distutils. To install it, use:

$ python setup.py install

Usage

Simple usage follows this pattern:

import html5lib
with open("mydocument.html", "r") as fp:
    document = html5lib.parse(fp)

or:

import html5lib
document = html5lib.parse("<p>Hello World!")

More documentation is available in the docstrings.
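As a rough sketch of what the returned document looks like, the example below parses the same markup with two tree builders and reads the paragraph text back. It assumes the treebuilder and namespaceHTMLElements keyword arguments of html5lib.parse; the variable names are illustrative only.

```python
import html5lib

# Default tree builder: the result is an xml.etree.ElementTree element for
# the root <html>. The parser supplies the missing <html>, <head>, and
# <body> elements, as a browser would. Turning off namespaceHTMLElements
# keeps tag names free of the XHTML namespace prefix, so plain paths work.
document = html5lib.parse("<p>Hello World!", namespaceHTMLElements=False)
paragraph = document.find("body/p")
print(document.tag)
print(paragraph.text)

# The same markup built as an xml.dom.minidom document instead.
dom = html5lib.parse("<p>Hello World!", treebuilder="dom")
dom_paragraph = dom.getElementsByTagName("p")[0]
print(dom_paragraph.firstChild.data)
```

Which tree builder to use depends mostly on which API the rest of your code already works with; the parsing itself is the same.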

Bugs

Please report any bugs on the issue tracker.

Tests

The tests live in the html5lib-tests repository, which is included as a git submodule. For git checkouts the submodule must be initialized (this is unnecessary for release tarballs):

$ git submodule init
$ git submodule update

Once nose has been installed, the tests can be run with nosetests. All of them should pass.

Contributing

Pull requests are more than welcome, both to the library and to the documentation. Some useful information:

We aim to follow PEP 8 in the library, except for the 79-character line limit: we use a soft limit of 99 characters instead, and allow longer lines where that is the more readable choice.

We keep pyflakes reporting no errors or warnings at all times.

We keep the master branch passing all tests at all times on all supported versions.

Travis CI is run against all pull requests and should enforce all of the above.

We also use an external code-review tool, which uses your GitHub login to authenticate. You'll get emails for changes on the review.

Questions?

There's a mailing list available for support on Google Groups, html5lib-discuss, though you may have more success (and get a far quicker response) asking on IRC in #whatwg on irc.freenode.net.

