From 510bd5b0be16360d5362eb48e241fc43b4d8eb7f Mon Sep 17 00:00:00 2001 From: Google Code Exporter Date: Sun, 22 Mar 2015 09:18:55 -0400 Subject: [PATCH] Migrating wiki contents from Google Code --- Ports.md | 19 ++++ ProjectHome.md | 1 + UserDocumentation.md | 226 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 246 insertions(+) create mode 100644 Ports.md create mode 100644 ProjectHome.md create mode 100644 UserDocumentation.md diff --git a/Ports.md b/Ports.md new file mode 100644 index 0000000..cd6afc3 --- /dev/null +++ b/Ports.md @@ -0,0 +1,19 @@ +html5lib ports (providing a similar public API): + + * Python + * [html5lib-python](https://github.com/html5lib/html5lib-python): the original html5lib implementation + * PHP + * [html5lib-php](https://github.com/html5lib/html5lib-php): dated, unmaintained, port + * Ruby + * [html5lib-ruby](https://github.com/html5lib/html5lib-ruby): dated, unmaintained, port + * Dart + * [Dart html5lib](https://github.com/dart-lang/html5lib): a third-party port to Dart. 
+ +Other HTML parsers: + + * JavaScript + * [HTML5 parser for node.js](https://github.com/aredridel/html5) by Aria Stewart + * [dom.js implementation of DOM4](https://github.com/andreasgal/dom.js) by Andreas Gal and David Flanagan + * [Live DOM Viewer](http://livedom.validator.nu/) compiled via GWT by Henri Sivonen + * Java + * [Validator.nu HTML parser](http://about.validator.nu/htmlparser/) by Henri Sivonen \ No newline at end of file diff --git a/ProjectHome.md b/ProjectHome.md new file mode 100644 index 0000000..3ce7ef3 --- /dev/null +++ b/ProjectHome.md @@ -0,0 +1 @@ +# NOTE: html5lib is now hosted at github: https://github.com/html5lib # \ No newline at end of file diff --git a/UserDocumentation.md b/UserDocumentation.md new file mode 100644 index 0000000..905d37d --- /dev/null +++ b/UserDocumentation.md @@ -0,0 +1,226 @@ +# Using html5lib # + +## Installation ## + +Releases can be installed using `pip` in the usual way: +``` + $ pip install html5lib +``` + +The development version can be installed by cloning the source repository using Mercurial and running: +``` + $ python setup.py develop +``` +in the `python` directory. + +## Tests ## + +The development version of html5lib comes with an extensive test suite. All the tests can be run by invoking +runtests.py in the tests/ directory or by running +``` +$ python setup.py nosetests +``` + +## Parsing HTML ## + +Simple usage follows this pattern: +``` +import html5lib +f = open("mydocument.html") +doc = html5lib.parse(f) +``` +This will return a tree in a custom "simpletree" format. More interesting is the ability to use a variety of standard tree formats; currently minidom, ElementTree, lxml and BeautifulSoup (deprecated) formats are supported by default. 
To do this you pass a string indicating the name of the tree format to use as the "treebuilder" argument to the parse method: +``` +import html5lib +f = open("mydocument.html") +doc = html5lib.parse(f, treebuilder="lxml") +``` + +It is also possible to explicitly create a parser object: +``` +import html5lib +f = open("mydocument.html") +parser = html5lib.HTMLParser() +doc = parser.parse(f) +``` +To output non-simpletree tree formats when explicitly creating a parser, you need to pass a TreeBuilder class as the "tree" argument to the HTMLParser. For +the built-in treebuilders this can be conveniently obtained from the treebuilders.getTreeBuilder function e.g. for minidom: +``` +import html5lib +from html5lib import treebuilders + +f = open("mydocument.html") +parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom")) +minidom_document = parser.parse(f) +``` + +For a BeautifulSoup tree replace the string "dom" with "beautifulsoup". For +ElementTree the procedure is slightly more involved as there are many libraries +that support the ElementTree API. Therefore getTreeBuilder accepts a second +argument which is the ElementTree implementation that is desired (in the future +this may be extended, for example to allow multiple DOM libraries to be used): + +``` +import html5lib +from html5lib import treebuilders +from xml.etree import cElementTree + +f = open("mydocument.html") +parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("etree", cElementTree)) +etree_document = parser.parse(f) +``` + +If you are using the excellent lxml library, using the generic etree treebuilder described above will fail. 
Instead you must use the lxml builder: + +``` +import html5lib +from html5lib import treebuilders +from lxml import etree + +f = open("mydocument.html") +parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml")) +etree_document = parser.parse(f) +``` + +### SAX Events ### +The WHATWG spec is not very streaming-friendly as it requires rearrangement of +subtrees in some situations. However html5lib allows SAX events to be created +from a DOM tree using html5lib.treebuilders.dom.dom2sax. + +### Character encoding ### + +Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way: + + * The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse + + * If no encoding is specified, the parser will attempt to detect the encoding from a + +<meta> + + element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification) + + * If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern + + * If all else fails, the default encoding (usually Windows-1252) will be used + +#### Examples #### + +Explicit encoding specification: +``` +import html5lib +import urllib2 +p = html5lib.HTMLParser() +p.parse(urllib2.urlopen("http://yahoo.co.jp").read(), encoding="euc-jp") +``` + +Automatic detection from a meta element: +``` +import html5lib +import urllib2 +p = html5lib.HTMLParser() +p.parse(urllib2.urlopen("http://www.mozilla-japan.org/").read()) +``` + +## Sanitizing Tokenizer ## + +When building web applications it is often necessary to remove unsafe markup and +CSS from user entered content. html5lib provides a custom tokenizer for this +purpose. It only allows known safe element tokens through and converts others +to text. 
Similarly, a variety of unsafe CSS constructs are removed from the +stream. For more details on the default configuration of the sanitizer, see +http://wiki.whatwg.org/wiki/Sanitization_rules. The sanitizer can be used by +passing it as the tokenizer argument to the parser: +``` +import html5lib +from html5lib import sanitizer + +p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer) +p.parse("") +``` + +## Treewalkers ## + +Treewalkers provide a streaming view of a tree. They are useful for filtering +and serializing the stream. html5lib provides a variety of treewalkers for +working with different tree types. For example, to stream a DOM tree: +``` +from html5lib import treewalkers +walker = treewalkers.getTreeWalker("dom") + +stream = walker(dom_tree) #stream is an iterable representing each token in the + #tree +``` + +Treewalkers are available for all the tree types supported by the HTMLParser plus +xml.dom.pulldom ("pulldom"), genshi streams ("genshi") and an lxml-optimized +elementtree ("lxml"). As for the treebuilders, treewalkers.getTreeWalker takes a +second argument, implementation, containing an object implementing the ElementTree +API. + +### Sanitization using treewalkers ### +You may wish to sanitize content which has been parsed into a tree by some other code. This may be done using the sanitizer filter: + +``` +from html5lib import treewalkers, filters +from html5lib.filters import sanitizer + +walker = treewalkers.getTreeWalker("dom") + +stream = walker(dom_tree) +clean_stream = sanitizer.Filter(stream) +``` + +### Serialization of Streams ### + +html5lib provides HTML and XHTML serializers which work on streams produced by the treewalkers. These are implemented as generators with each item in the generator representing a single tag. 
A full example of parsing and serializing content looks like: + +``` +import html5lib +from html5lib import treebuilders, treewalkers, serializer +from html5lib.filters import sanitizer + +p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom")) + +dom_tree = p.parse("

Hello World

") + +walker = treewalkers.getTreeWalker("dom") + +stream = walker(dom_tree) + +s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False) +output_generator = s.serialize(stream) + +for item in output_generator: + print item + + + + + +

+ +Hello + + +World +

+ + + +``` + + +# Bugs # + +Please report any bugs on the issue tracker: +http://code.google.com/p/html5lib/issues/list + +# Ports # + +There is a listing of [html5lib ports](Ports.md) to JavaScript, Ruby and more. + +# Get Involved # + +Contributions to code or documentation are actively encouraged. Submit +patches to the issue tracker or discuss changes on IRC in the #whatwg +channel on freenode.net \ No newline at end of file