Commit 6e822c2

updated README, pulled all 0.95 - 1.0 changes from git logs
1 parent ae6520f commit 6e822c2

2 files changed: +127 -52 lines changed

CHANGES.rst

Lines changed: 49 additions & 2 deletions
@@ -6,9 +6,56 @@ Change Log
 
 Released on XXX, 2013
 
+* Implementation updated to implement the `HTML specification
+  <http://www.whatwg.org/specs/web-apps/current-work/>`_ as of 5th May
+  2013 (`SVN <http://svn.whatwg.org/webapps/>`_ revision r7867).
+
+* Python 3.2+ supported in a single codebase using the ``six`` library.
+
+* Removed support for Python 2.5 and older.
+
+* Removed the deprecated Beautiful Soup 3 treebuilder.
+  ``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that
+  since it doesn't support namespaces, foreign content like SVG and
+  MathML is parsed incorrectly.
+
 * Removed ``simpletree`` from the package. The default tree builder is
-  now ``etree`` (using the ``xml.etree.ElementTree/cElementTree``
-  implementation).
+  now ``etree`` (using the ``xml.etree.cElementTree`` implementation if
+  available, and ``xml.etree.ElementTree`` otherwise).
+
+* Removed the ``XHTMLSerializer`` as it never actually guaranteed its
+  output was well-formed XML, and hence provided little of use.
+
+* Optional heuristic character encoding detection now based on
+  ``charade`` for Python 2.6 - 3.3 compatibility.
+
+* Optional ``Genshi`` treewalker support fixed.
+
+* Many bugfixes, including:
+
+  * #33: null in attribute value breaks XML AttValue;
+
+  * #4: nested, indirect descendant, <button> causes infinite loop;
+
+  * `Google Code 215
+    <http://code.google.com/p/html5lib/issues/detail?id=215>`_: Properly
+    detect seekable streams;
+
+  * `Google Code 206
+    <http://code.google.com/p/html5lib/issues/detail?id=206>`_: add
+    support for <video preload=...>, <audio preload=...>;
+
+  * `Google Code 205
+    <http://code.google.com/p/html5lib/issues/detail?id=205>`_: add
+    support for <video poster=...>;
+
+  * `Google Code 202
+    <http://code.google.com/p/html5lib/issues/detail?id=202>`_: Unicode
+    file breaks InputStream.
+
+* Source code is now mostly PEP 8 compliant.
+
+* Test harness has been improved and now depends on ``nose``.
 
 
 0.95
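
The ``etree`` entry in the change log diff above describes a fallback between two standard-library modules. A minimal sketch of that import pattern follows. It is not taken from html5lib's source; the alias ``ElementTree`` and the helper ``parse_fragment`` are names chosen here for illustration only.

```python
# Prefer the C-accelerated module where it still exists (Python 2.x and
# Python 3 releases before 3.9); otherwise fall back to the plain module.
try:
    import xml.etree.cElementTree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree


def parse_fragment(markup):
    """Hypothetical helper: parse well-formed XML with whichever module loaded."""
    return ElementTree.fromstring(markup)


element = parse_fragment("<p>Hello World!</p>")
print(element.tag, element.text)
```

On Python 3.3 and later, ``xml.etree.ElementTree`` already uses the C accelerator when it is available, so the fallback branch costs nothing there.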

README.rst

Lines changed: 78 additions & 50 deletions
@@ -1,63 +1,98 @@
 html5lib
 ========
 
+.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
+  :target: https://travis-ci.org/html5lib/html5lib-python
+
 html5lib is a pure-python library for parsing HTML. It is designed to
 conform to the WHATWG HTML specification, as is implemented by all major
 web browsers.
 
 
-Requirements
-------------
+Usage
+-----
 
-Python 2.6 and above as well as Python 3.0 and above are
-supported. Implementations known to work are CPython (as the reference
-implementation) and PyPy. Jython is known *not* to work due to various
-bugs in its implementation of the language. Others such as IronPython
-may or may not work; if you wish to try, you are strongly encouraged
-to run the testsuite and report back!
+Simple usage follows this pattern:
 
-The only required library dependency is ``six``, this can be found
-packaged in PyPI.
+.. code-block:: python
 
-Optionally:
+  import html5lib
+  with open("mydocument.html", "rb") as f:
+      document = html5lib.parse(f)
 
-- ``datrie`` can be used to improve parsing performance (though in
-  almost all cases the improvement is marginal);
+or:
 
-- ``lxml`` is supported as a tree format (for both building and
-  walking) under CPython (but *not* PyPy where it is known to cause
-  segfaults);
+.. code-block:: python
 
-- ``genshi`` has a treewalker (but not builder); and
+  import html5lib
+  document = html5lib.parse("<p>Hello World!")
 
-- ``charade`` can be used as a fallback when character encoding cannot
-  be determined; ``chardet``, from which it was forked, can also be used
-  on Python 2.
+By default, the ``document`` will be an ``xml.etree`` element instance.
+Whenever possible, html5lib chooses the accelerated ``ElementTree``
+implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).
+
+Two other tree types are supported: ``xml.dom.minidom`` and
+``lxml.etree``. To use an alternative format, specify the name of
+a treebuilder:
+
+.. code-block:: python
+
+  import html5lib
+  with open("mydocument.html", "rb") as f:
+      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
+
+To have more control over the parser, create a parser object explicitly.
+For instance, to make the parser raise exceptions on parse errors, use:
+
+.. code-block:: python
+
+  import html5lib
+  with open("mydocument.html", "rb") as f:
+      parser = html5lib.HTMLParser(strict=True)
+      document = parser.parse(f)
+
+When you're instantiating parser objects explicitly, pass a treebuilder
+class as the ``tree`` keyword argument to use an alternative document
+format:
+
+.. code-block:: python
+
+  import html5lib
+  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
+  minidom_document = parser.parse("<p>Hello World!")
+
+More documentation is available at http://html5lib.readthedocs.org/.
 
 
 Installation
 ------------
 
-html5lib is packaged with distutils. To install it use::
+html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it,
+use:
 
-  $ python setup.py install
+.. code-block:: bash
 
+  $ pip install html5lib
 
-Usage
------
 
-Simple usage follows this pattern::
+Optional Dependencies
+---------------------
 
-  import html5lib
-  with open("mydocument.html", "r") as fp:
-      document = html5lib.parse(f)
+The following third-party libraries may be used for additional
+functionality:
 
-or::
+- ``datrie`` can be used to improve parsing performance (though in
+  almost all cases the improvement is marginal);
 
-  import html5lib
-  document = html5lib.parse("<p>Hello World!")
+- ``lxml`` is supported as a tree format (for both building and
+  walking) under CPython (but *not* PyPy where it is known to cause
+  segfaults);
 
-More documentation is available in the docstrings.
+- ``genshi`` has a treewalker (but not builder); and
+
+- ``charade`` can be used as a fallback when character encoding cannot
+  be determined; ``chardet``, from which it was forked, can also be used
+  on Python 2.
 
 
 Bugs
@@ -70,28 +105,21 @@ Please report any bugs on the `issue tracker
 Tests
 -----
 
-These are contained in the html5lib-tests repository and included as a
-submodule, thus for git checkouts they must be initialized (for
-release tarballs this is unneeded)::
+Unit tests require the ``nose`` library and can be run using the
+``nosetests`` command in the root directory. All should pass.
+
+Test data are contained in a separate `html5lib-tests
+<https://github.com/html5lib/html5lib-tests>`_ repository and included
+as a submodule, thus for git checkouts they must be initialized::
 
   $ git submodule init
   $ git submodule update
 
-And then they can be run, with ``nose`` installed, using the
-``nosetests`` command in the root directory. All should pass.
+This is unneeded for release tarballs.
 
 If you have all compatible Python implementations available on your
-system, you can run tests on all of them by using tox::
-
-  $ pip install tox
-  $ tox
-  ...
-  _______________________ summary ______________________
-  py26: commands succeeded
-  py27: commands succeeded
-  py32: commands succeeded
-  py33: commands succeeded
-  congratulations :)
+system, you can run tests on all of them using the ``tox`` utility,
+which can be found on PyPI.
 
 
 Contributing
@@ -121,5 +149,5 @@ Questions?
 
 There's a mailing list available for support on Google Groups,
 `html5lib-discuss <http://groups.google.com/group/html5lib-discuss>`_,
-though you may have more success (and get a far quicker response)
-asking on IRC in #whatwg on irc.freenode.net.
+though you may get a quicker response asking on IRC in #whatwg on
+irc.freenode.net.
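
The README text in the diff above says that ``html5lib.parse`` returns an ``xml.etree`` element by default. The sketch below shows one way to inspect that result. It assumes html5lib is installed and that the parser is left at its default of placing HTML elements in the XHTML namespace; ``XHTML`` and ``paragraph`` are names introduced here, not part of the library.

```python
import html5lib

# Assumed default: element tags carry the XHTML namespace as a prefix.
XHTML = "{http://www.w3.org/1999/xhtml}"

# The parser supplies the <html>, <head> and <body> elements that the
# input string leaves out, and returns the root <html> element.
document = html5lib.parse("<p>Hello World!")

# Look up the paragraph using its namespaced tag name.
paragraph = document.find(".//" + XHTML + "p")
print(document.tag)
print(paragraph.text)
```

Because the tags are namespaced, a plain ``document.find(".//p")`` would not match; the namespace prefix has to be part of the search path.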
