html5lib
========

+ .. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
+   :target: https://travis-ci.org/html5lib/html5lib-python
+
html5lib is a pure-python library for parsing HTML. It is designed to
conform to the WHATWG HTML specification, as is implemented by all major
web browsers.


- Requirements
- ------------
+ Usage
+ -----

- Python 2.6 and above as well as Python 3.0 and above are
- supported. Implementations known to work are CPython (as the reference
- implementation) and PyPy. Jython is known *not* to work due to various
- bugs in its implementation of the language. Others such as IronPython
- may or may not work; if you wish to try, you are strongly encouraged
- to run the testsuite and report back!
+ Simple usage follows this pattern:

- The only required library dependency is ``six``, this can be found
- packaged in PyPI.
+ .. code-block:: python

- Optionally:
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       document = html5lib.parse(f)

- - ``datrie`` can be used to improve parsing performance (though in
-   almost all cases the improvement is marginal);
+ or:

- - ``lxml`` is supported as a tree format (for both building and
-   walking) under CPython (but *not* PyPy where it is known to cause
-   segfaults);
+ .. code-block:: python

- - ``genshi`` has a treewalker (but not builder); and
+   import html5lib
+   document = html5lib.parse("<p>Hello World!")

- - ``charade`` can be used as a fallback when character encoding cannot
-   be determined; ``chardet``, from which it was forked, can also be used
-   on Python 2.
+ By default, the ``document`` will be an ``xml.etree`` element instance.
+ Whenever possible, html5lib chooses the accelerated ``ElementTree``
+ implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
+
+ Two other tree types are supported: ``xml.dom.minidom`` and
+ ``lxml.etree``. To use an alternative format, specify the name of
+ a treebuilder:
+
+ .. code-block:: python
+
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
+
+ To have more control over the parser, create a parser object explicitly.
+ For instance, to make the parser raise exceptions on parse errors, use:
+
+ .. code-block:: python
+
+   import html5lib
+   with open("mydocument.html", "rb") as f:
+       parser = html5lib.HTMLParser(strict=True)
+       document = parser.parse(f)
+
+ When you're instantiating parser objects explicitly, pass a treebuilder
+ class as the ``tree`` keyword argument to use an alternative document
+ format:
+
+ .. code-block:: python
+
+   import html5lib
+   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+   minidom_document = parser.parse("<p>Hello World!")
+
+ More documentation is available at http://html5lib.readthedocs.org/.


Installation
------------

- html5lib is packaged with distutils. To install it use::
+ html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
+ use:

-   $ python setup.py install
+ .. code-block:: bash

+   $ pip install html5lib

- Usage
- -----

- Simple usage follows this pattern::
+ Optional Dependencies
+ ---------------------

-   import html5lib
-   with open("mydocument.html", "r") as fp:
-       document = html5lib.parse(f)
+ The following third-party libraries may be used for additional
+ functionality:

- or::
+ - ``datrie`` can be used to improve parsing performance (though in
+   almost all cases the improvement is marginal);

-   import html5lib
-   document = html5lib.parse("<p>Hello World!")
+ - ``lxml`` is supported as a tree format (for both building and
+   walking) under CPython (but *not* PyPy where it is known to cause
+   segfaults);

- More documentation is available in the docstrings.
+ - ``genshi`` has a treewalker (but not builder); and
+
+ - ``charade`` can be used as a fallback when character encoding cannot
+   be determined; ``chardet``, from which it was forked, can also be used
+   on Python 2.


Bugs
@@ -70,28 +105,21 @@ Please report any bugs on the `issue tracker
Tests
-----

- These are contained in the html5lib-tests repository and included as a
- submodule, thus for git checkouts they must be initialized (for
- release tarballs this is unneeded)::
+ Unit tests require the ``nose`` library and can be run using the
+ ``nosetests`` command in the root directory. All should pass.
+
+ Test data are contained in a separate `html5lib-tests
+ <https://github.com/html5lib/html5lib-tests>`_ repository and included
+ as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

- And then they can be run, with ``nose`` installed, using the
- ``nosetests`` command in the root directory. All should pass.
+ This is unneeded for release tarballs.

If you have all compatible Python implementations available on your
- system, you can run tests on all of them by using tox::
-
-   $ pip install tox
-   $ tox
-   ...
-   _______________________ summary ______________________
-   py26: commands succeeded
-   py27: commands succeeded
-   py32: commands succeeded
-   py33: commands succeeded
-   congratulations :)
+ system, you can run tests on all of them using the ``tox`` utility,
+ which can be found on PyPI.


Contributing
@@ -121,5 +149,5 @@ Questions?

There's a mailing list available for support on Google Groups,
`html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
- though you may have more success (and get a far quicker response)
- asking on IRC in #whatwg on irc.freenode.net.
+ though you may get a quicker response asking on IRC in #whatwg on
+ irc.freenode.net.