@@ -4,22 +4,25 @@ The moving parts
html5lib consists of a number of components, which are responsible for
handling its features.

+ Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+ Several tree representations are supported, as are translations to other formats via *tree adapters*.
+ The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+ The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.

Tree builders
-------------

The parser reads HTML by tokenizing the content and building a tree that
- the user can later access. There are three main types of trees that
- html5lib can build:
+ the user can later access. html5lib can build three types of trees:

- * ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+ * ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
  which can be found in the standard library. Whenever possible, the
  accelerated ``ElementTree`` implementation (i.e.
  ``xml.etree.cElementTree`` on Python 2.x) is used.

- * ``dom`` - builds a tree based on ``xml.dom.minidom``.
+ * ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.

- * ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+ * ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
  API. The performance gains are relatively small compared to using the
  accelerated ``ElementTree`` module.
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

- When instantiating a parser object, you have to pass a tree builder
- class in the ``tree`` keyword attribute:
+ To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.

- .. code-block:: python
-
-   import html5lib
-   parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-   document = parser.parse("<p>Hello World!")
-
- To get a builder class by name, use the ``getTreeBuilder`` function:
+ When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:

.. code-block:: python

  import html5lib
-   parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+   TreeBuilder = html5lib.getTreeBuilder("dom")
+   parser = html5lib.HTMLParser(tree=TreeBuilder)
  minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
Tree walkers
------------

- Once a tree is ready, you can work on it either manually, or using
- a tree walker, which provides a streaming view of the tree. html5lib
- provides walkers for all three supported types of trees (``etree``,
- ``dom`` and ``lxml``).
+ In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+ html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.

The implementation of walkers can be found in `html5lib/treewalkers/
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
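
As a rough illustration of what a "streaming view" of a tree means, the sketch below walks a standard-library ``xml.etree.ElementTree`` element and yields one token per event. It is a simplified stand-in, not html5lib's walker: the ``walk`` function and the reduced token dictionaries are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield a simplified token stream for an ElementTree element.

    Hypothetical helper for illustration; the token dictionaries are a
    reduced shape and not html5lib's exact token format.
    """
    yield {"type": "StartTag", "name": element.tag, "data": dict(element.attrib)}
    if element.text:
        # Text that appears before the first child element.
        yield {"type": "Characters", "data": element.text}
    for child in element:
        # Recurse into the child, then emit the text that follows it.
        yield from walk(child)
        if child.tail:
            yield {"type": "Characters", "data": child.tail}
    yield {"type": "EndTag", "name": element.tag}


root = ET.fromstring('<p class="greeting">Hello <b>World</b>!</p>')
tokens = list(walk(root))
for token in tokens:
    print(token)
```

Because the tokens come from a generator, a consumer such as a serializer or a filter can process them one at a time without building a second copy of the tree.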

- Walkers make consuming HTML easier. html5lib uses them to provide you
- with has a couple of handy tools.
+ html5lib provides a few tools for consuming token streams:

+ * :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
+ * filters, to manipulate the token stream.

HTMLSerializer
~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
  '>'
  'Witam wszystkich'

- You can customize the serializer behaviour in a variety of ways, consult
- the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
- documentation.
+ You can customize the serializer behaviour in a variety of ways. Consult
+ the :class:`~html5lib.serializer.HTMLSerializer` documentation.


Filters
~~~~~~~

- You can alter the stream content with filters provided by html5lib:
+ html5lib provides several filters

* :class:`alphabeticalattributes.Filter
  <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
  the document

* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
-   ``LintError`` exceptions on invalid tag and attribute names, invalid
+   :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
  PCDATA, etc.

* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
-   removes tags from the stream which are not necessary to produce valid
+   removes tags from the token stream which are not necessary to produce valid
  HTML

* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:

* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
  collapses all whitespace characters to single spaces unless they're in
-   ``<pre/>`` or ``textarea`` tags.
+   ``<pre/>`` or ``<textarea/>`` tags.

- To use a filter, simply wrap it around a stream:
+ To use a filter, simply wrap it around a token stream:

.. code-block:: python

@@ -142,9 +137,11 @@ To use a filter, simply wrap it around a stream:
Tree adapters
-------------

- Used to translate one type of tree to another. More documentation
- pending, sorry.
+ Tree adapters can be used to translate between tree formats.
+ Two adapters are provided by html5lib:

+ * :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
+ * :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.

Encoding discovery
------------------
@@ -156,14 +153,14 @@ the following way:
* The encoding may be explicitly specified by passing the name of the
  encoding as the encoding parameter to the
  :meth:`~html5lib.html5parser.HTMLParser.parse` method on
-   ``HTMLParser`` objects.
+   :class:`~html5lib.html5parser.HTMLParser` objects.

* If no encoding is specified, the parser will attempt to detect the
  encoding from a ``<meta>`` element in the first 512 bytes of the
  document (this is only a partial implementation of the current HTML
-   5 specification).
+   specification).

- * If no encoding can be found and the chardet library is available, an
+ * If no encoding can be found and the :mod:`chardet` library is available, an
  attempt will be made to sniff the encoding from the byte pattern.

* If all else fails, the default encoding will be used. This is usually