Commit 637826f

Update and expand "moving parts" doc
1 parent c8fca0e commit 637826f

doc/movingparts.rst

Lines changed: 31 additions & 34 deletions
@@ -4,22 +4,25 @@ The moving parts
 html5lib consists of a number of components, which are responsible for
 handling its features.
 
+Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
+Several tree representations are supported, as are translations to other formats via *tree adapters*.
+The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
+The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
 
 Tree builders
 -------------
 
 The parser reads HTML by tokenizing the content and building a tree that
-the user can later access. There are three main types of trees that
-html5lib can build:
+the user can later access. html5lib can build three types of trees:
 
-* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
+* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
   which can be found in the standard library. Whenever possible, the
   accelerated ``ElementTree`` implementation (i.e.
   ``xml.etree.cElementTree`` on Python 2.x) is used.
 
-* ``dom`` - builds a tree based on ``xml.dom.minidom``.
+* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
 
-* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
+* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
   API. The performance gains are relatively small compared to using the
   accelerated ``ElementTree`` module.
 
@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
   with open("mydocument.html", "rb") as f:
       lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
 
-When instantiating a parser object, you have to pass a tree builder
-class in the ``tree`` keyword attribute:
+To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
 
-.. code-block:: python
-
-  import html5lib
-  parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
-  document = parser.parse("<p>Hello World!")
-
-To get a builder class by name, use the ``getTreeBuilder`` function:
+When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
 
 .. code-block:: python
 
   import html5lib
-  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  TreeBuilder = html5lib.getTreeBuilder("dom")
+  parser = html5lib.HTMLParser(tree=TreeBuilder)
   minidom_document = parser.parse("<p>Hello World!")
 
 The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,16 @@ The implementation of builders can be found in `html5lib/treebuilders/
 Tree walkers
 ------------
 
-Once a tree is ready, you can work on it either manually, or using
-a tree walker, which provides a streaming view of the tree. html5lib
-provides walkers for all three supported types of trees (``etree``,
-``dom`` and ``lxml``).
+In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
+html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
 
 The implementation of walkers can be found in `html5lib/treewalkers/
 <https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
 
-Walkers make consuming HTML easier. html5lib uses them to provide you
-with has a couple of handy tools.
+html5lib provides a few tools for consuming token streams:
 
+* :class:`~html5lib.serializer.HTMLSerializer`, to generate a stream of bytes; and
+* filters, to manipulate the token stream.
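
A rough illustration of the walker and filter idea described in this hunk, using only the standard library. The names ``walk``, ``uppercase_text`` and ``serialize`` and the token dictionaries are hypothetical, invented for this sketch. html5lib's real tokens carry more fields, and its serializer handles escaping, which this one does not.

```python
import xml.etree.ElementTree as ET


def walk(element):
    """Yield a flat token stream for an ElementTree element (simplified)."""
    yield {"type": "StartTag", "name": element.tag, "data": dict(element.attrib)}
    if element.text:
        yield {"type": "Characters", "data": element.text}
    for child in element:
        yield from walk(child)
        if child.tail:
            yield {"type": "Characters", "data": child.tail}
    yield {"type": "EndTag", "name": element.tag}


def uppercase_text(tokens):
    """A toy filter: wrap a token stream and rewrite its Characters tokens."""
    for token in tokens:
        if token["type"] == "Characters":
            token = dict(token, data=token["data"].upper())
        yield token


def serialize(tokens):
    """Join tokens back into markup. No escaping is done; illustration only."""
    parts = []
    for token in tokens:
        if token["type"] == "StartTag":
            attrs = "".join(' %s="%s"' % item for item in token["data"].items())
            parts.append("<%s%s>" % (token["name"], attrs))
        elif token["type"] == "EndTag":
            parts.append("</%s>" % token["name"])
        else:
            parts.append(token["data"])
    return "".join(parts)


tree = ET.fromstring('<p class="greeting">Hello <b>World</b>!</p>')
# The filter sits between the walker and the serializer.
html = serialize(uppercase_text(walk(tree)))
```

Because each stage is a generator over tokens, filters can be stacked by nesting the calls, and the tree itself is never modified.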
 
 HTMLSerializer
 ~~~~~~~~~~~~~~
@@ -90,15 +86,14 @@ The serializer lets you write HTML back as a stream of bytes.
   '>'
   'Witam wszystkich'
 
-You can customize the serializer behaviour in a variety of ways, consult
-the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
-documentation.
+You can customize the serializer behaviour in a variety of ways. Consult
+the :class:`~html5lib.serializer.HTMLSerializer` documentation.
 
 
 Filters
 ~~~~~~~
 
-You can alter the stream content with filters provided by html5lib:
+html5lib provides several filters
 
 * :class:`alphabeticalattributes.Filter
   <html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +105,11 @@ You can alter the stream content with filters provided by html5lib:
   the document
 
 * :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
-  ``LintError`` exceptions on invalid tag and attribute names, invalid
+  :exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
   PCDATA, etc.
 
 * :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
-  removes tags from the stream which are not necessary to produce valid
+  removes tags from the token stream which are not necessary to produce valid
   HTML
 
 * :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +120,9 @@ You can alter the stream content with filters provided by html5lib:
 
 * :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
   collapses all whitespace characters to single spaces unless they're in
-  ``<pre/>`` or ``textarea`` tags.
+  ``<pre/>`` or ``<textarea/>`` tags.
 
-To use a filter, simply wrap it around a stream:
+To use a filter, simply wrap it around a token stream:
 
 .. code-block:: python
 
@@ -142,9 +137,11 @@ To use a filter, simply wrap it around a stream:
 Tree adapters
 -------------
 
-Used to translate one type of tree to another. More documentation
-pending, sorry.
+Tree adapters can be used to translate between tree formats.
+Two adapters are provided by html5lib:
 
+* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
+* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
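
To make "calls a SAX handler based on the tree" concrete, here is a minimal standard-library sketch of the same idea. It does not use html5lib: ``drive_sax``, ``_emit`` and ``EventRecorder`` are hypothetical names for this example, and the real ``to_sax()`` may differ in its arguments and in the exact callbacks it invokes.

```python
import xml.etree.ElementTree as ET
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class EventRecorder(ContentHandler):
    """A SAX handler that records the callbacks it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def _emit(element, handler):
    # Replay one element, its text, its children and their tails as SAX calls.
    handler.startElement(element.tag, AttributesImpl(dict(element.attrib)))
    if element.text:
        handler.characters(element.text)
    for child in element:
        _emit(child, handler)
        if child.tail:
            handler.characters(child.tail)
    handler.endElement(element.tag)


def drive_sax(root, handler):
    """Translate an in-memory tree into a sequence of SAX callbacks."""
    handler.startDocument()
    _emit(root, handler)
    handler.endDocument()


recorder = EventRecorder()
drive_sax(ET.fromstring("<p>Hello <b>World</b>!</p>"), recorder)
```

Any existing ``ContentHandler`` subclass can be passed in place of ``EventRecorder``, which is the appeal of an adapter: code written against the SAX interface can consume a tree without knowing how it was built.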
 
 Encoding discovery
 ------------------
@@ -156,14 +153,14 @@ the following way:
 * The encoding may be explicitly specified by passing the name of the
   encoding as the encoding parameter to the
   :meth:`~html5lib.html5parser.HTMLParser.parse` method on
-  ``HTMLParser`` objects.
+  :class:`~html5lib.html5parser.HTMLParser` objects.
 
 * If no encoding is specified, the parser will attempt to detect the
   encoding from a ``<meta>`` element in the first 512 bytes of the
   document (this is only a partial implementation of the current HTML
-  5 specification).
+  specification).
 
-* If no encoding can be found and the chardet library is available, an
+* If no encoding can be found and the :mod:`chardet` library is available, an
   attempt will be made to sniff the encoding from the byte pattern.
 
 * If all else fails, the default encoding will be used. This is usually
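
The fallback order in the list above can be sketched as a small function. This is an approximation, not html5lib's implementation: ``guess_encoding``, ``sniff_with_chardet`` and the regular expression are hypothetical, the ``<meta>`` check is far cruder than the specification's prescan, and ``default`` is a required argument because the default named in the doc is cut off in this diff.

```python
import re

# Crude stand-in for the <meta> prescan: catches charset=... inside a meta tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_with_chardet(data):
    """Return chardet's guess, or None if chardet is missing or unsure."""
    try:
        import chardet  # optional third-party dependency
    except ImportError:
        return None
    return chardet.detect(data)["encoding"]


def guess_encoding(data, default, explicit=None, sniff=sniff_with_chardet):
    """Pick an encoding for ``data`` (bytes), following the order listed above."""
    # 1. An explicitly specified encoding always wins.
    if explicit is not None:
        return explicit
    # 2. Look for a <meta> declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Sniff the byte pattern, if a sniffer is available and confident.
    sniffed = sniff(data)
    if sniffed:
        return sniffed
    # 4. Otherwise fall back to the caller-supplied default.
    return default
```

For example, ``guess_encoding(b'<meta charset="ISO-8859-2"><p>hi', default="utf-8")`` stops at step 2, while passing ``explicit="koi8-r"`` short-circuits at step 1 regardless of the document's contents. The ``"utf-8"`` default here is an arbitrary choice for the demonstration.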
