Commit 0ab2c61

author
Steve Canny
committed
first version
1 parent bc0290a commit 0ab2c61

28 files changed

+2925
-0
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,3 @@
1+
.coverage
12
/_scratch/
3+
/.tox/

HISTORY.rst

Whitespace-only changes.

LICENSE

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
The MIT License (MIT)
2+
Copyright (c) 2013 Steve Canny, https://github.com/scanny
3+
4+
Permission is hereby granted, free of charge, to any person obtaining a copy
5+
of this software and associated documentation files (the "Software"), to deal
6+
in the Software without restriction, including without limitation the rights
7+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8+
copies of the Software, and to permit persons to whom the Software is
9+
furnished to do so, subject to the following conditions:
10+
11+
The above copyright notice and this permission notice shall be included in
12+
all copies or substantial portions of the Software.
13+
14+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
20+
THE SOFTWARE.

MANIFEST.in

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
include HISTORY.rst LICENSE README.rst tox.ini
2+
recursive-include tests *.py
3+
recursive-include tests *.txt

Makefile

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
MAKE = make
2+
PYTHON = python
3+
SETUP = $(PYTHON) ./setup.py
4+
5+
.PHONY: clean
6+
7+
help:
8+
@echo "Please use \`make <target>' where <target> is one or more of"
9+
@echo " clean delete intermediate work product and start fresh"
10+
11+
clean:
12+
find . -type f -name \*.pyc -exec rm {} \;
13+
rm -rf dist *.egg-info .coverage .DS_Store
14+
15+
coverage:
16+
py.test --cov-report term-missing --cov=cxml tests/
17+
18+
sdist:
19+
$(SETUP) sdist
20+
21+
test: clean
22+
flake8
23+
py.test -x

README.rst

Lines changed: 155 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,155 @@
1+
2+
cxml - Compact XML translator
3+
=============================
4+
5+
.. highlight:: python
6+
7+
`cxml` translates a Compact XML (CXML) expression into the corresponding
8+
pretty-printed XML snippet. For example::
9+
10+
from cxml import xml
11+
12+
xml('w:p/(w:pPr/w:jc{w:val=right},w:r/w:t"Right-aligned")')
13+
14+
.. highlight:: xml
15+
16+
becomes::
17+
18+
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
19+
<w:pPr>
20+
<w:jc w:val="right"/>
21+
</w:pPr>
22+
<w:r>
23+
<w:t>Right-aligned</w:t>
24+
</w:r>
25+
</w:p>
26+
27+
28+
Who cares?
29+
----------
30+
31+
The motivation for a compact XML expression language arose out of the testing
32+
requirements of the `python-docx` and `python-pptx` libraries. The
33+
*WordprocessingML* and *PresentationML* file formats are XML-based and many
34+
operations in those libraries involve the recognition or modification of XML.
35+
The tests then require a great many XML snippets to test all the possible
36+
combinations the code must recognize or produce.
37+
38+
Including full-sized XML snippets in the test code is both distracting and
39+
tedious. By compressing the specification of a snippet to fit on a single
40+
line (in most cases), the test code is much more compact and expressive.
41+
42+
43+
Syntax
44+
------
45+
46+
CXML syntax borrows from that of XPath.
47+
48+
.. highlight:: python
49+
50+
An element is specified by its name::
51+
52+
>>> xml('foobar')
53+
<foobar/>
54+
55+
A child is specified by name following a slash::
56+
57+
>>> xml('foo/bar')
58+
<foo>
59+
<bar/>
60+
</foo>
61+
62+
XML output is pretty-printed with 2-space indentation.
63+
64+
Multiple child elements are specified by separating them with a comma and
65+
enclosing them in parentheses::
66+
67+
>>> xml('foo/(bar,baz)')
68+
<foo>
69+
<bar/>
70+
<baz/>
71+
</foo>
72+
73+
Element attributes are specified in braces after the element name::
74+
75+
>>> xml('foo{a=b}')
76+
<foo a="b"/>
77+
78+
Multiple attributes are separated by commas::
79+
80+
>>> xml('foo{a=b,b=c}')
81+
<foo a="b" b="c"/>
82+
83+
Whitespace is permitted (and ignored) between tokens in most places. However,
84+
after using CXML quite a bit I don't find it useful::
85+
86+
>>> xml(' foo {a=b, b=c}')
87+
<foo a="b" b="c"/>
88+
89+
Attribute text may be surrounded by double-quotes, which is handy when the
90+
text contains a comma or a closing brace::
91+
92+
>>> xml('foo{a=b,b="c,}g"}')
93+
<foo a="b" b="c,}g"/>
94+
95+
Text immediately following the attributes' closing brace is interpreted as
96+
the text of the element. Whitespace within the text is preserved::
97+
98+
>>> xml('foo{a=b,b=c} bar ')
99+
<foo a="b" b="c"> bar </foo>
100+
101+
Element text may also be enclosed in quotes, which allows it to contain
102+
a comma or slash that would otherwise be interpreted as the next token::
103+
104+
>>> xml('foo{a=b}"bar/baz, barfoo"')
105+
<foo a="b">bar/baz, barfoo</foo>
106+
107+
An element having a namespace prefix appears with the corresponding namespace
108+
declaration::
109+
110+
>>> xml('a:foo')
111+
<a:foo xmlns:a="http://foo/a"/>
112+
113+
A different namespace prefix in a descendant element causes the corresponding
114+
namespace declaration to be added to the root element, in the order
115+
encountered::
116+
117+
>>> xml('a:foo/(b:bar,c:baz)')
118+
<a:foo xmlns:a="http://foo/a" xmlns:b="http://foo/b" xmlns:c="http://foo/c">
119+
<b:bar/>
120+
<c:baz/>
121+
</a:foo>
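
The "order encountered" rule above can be sketched in a few lines. This is only an illustration, not the library's implementation: the `NSMAP` table and both helper functions are hypothetical names, and the URIs are simply the ones that appear in the README examples.

```python
# Hypothetical prefix-to-URI table, copied from the README examples.
# The mapping actually used by cxml is not shown in this commit excerpt.
NSMAP = {'a': 'http://foo/a', 'b': 'http://foo/b', 'c': 'http://foo/c'}


def prefixes_in_order(qnames):
    """Return the distinct namespace prefixes in *qnames*, first-seen order."""
    seen = []
    for qname in qnames:
        if ':' not in qname:
            continue  # unprefixed element, no declaration needed
        prefix = qname.split(':', 1)[0]
        if prefix not in seen:
            seen.append(prefix)
    return seen


def nsdecls(qnames, nsmap=NSMAP):
    """Render the xmlns declarations a root element would carry."""
    return ' '.join(
        'xmlns:%s="%s"' % (pfx, nsmap[pfx])
        for pfx in prefixes_in_order(qnames)
    )


# Element names in document order for 'a:foo/(b:bar,c:baz)'
print(nsdecls(['a:foo', 'b:bar', 'c:baz']))
```

A list is used rather than a set so that first-seen order is kept, which matters when XML is compared by its string value.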
122+
123+
A namespace can be explicitly declared as an attribute of an element, in
124+
which case it will appear whether a child element in that namespace is
125+
present or not::
126+
127+
>>> xml('a:foo{b:}')
128+
<a:foo xmlns:a="http://foo/a" xmlns:b="http://foo/b"/>
129+
130+
An explicit namespace appears immediately after the root element namespace
131+
(if it has one) when placed on the root element. This allows namespace
132+
declarations to appear in a different order than the order encountered. This
133+
is occasionally handy when matching XML by its string value.
134+
135+
An explicit namespace may also be placed on a child element, in which case
136+
the corresponding namespace declaration appears on that child rather than the
137+
root element::
138+
139+
>>> xml('a:foo/b:bar{b:,c:}')
140+
<a:foo xmlns:a="http://foo/a">
141+
<b:bar xmlns:b="http://foo/b" xmlns:c="http://foo/c"/>
142+
</a:foo>
143+
144+
Putting all these together, a reasonably complex XML snippet can be condensed
145+
quite a bit::
146+
147+
>>> xml('w:p/(w:pPr/w:jc{w:val=right},w:r/w:t"Right-aligned")')
148+
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
149+
<w:pPr>
150+
<w:jc w:val="right"/>
151+
</w:pPr>
152+
<w:r>
153+
<w:t>Right-aligned</w:t>
154+
</w:r>
155+
</w:p>

cxml/__init__.py

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
# encoding: utf-8
2+
3+
"""
4+
API for CXML translator.
5+
"""
6+
7+
from __future__ import (
8+
absolute_import, division, print_function, unicode_literals
9+
)
10+
11+
12+
__version__ = '0.9.6'
13+
14+
15+
from .lexer import CxmlLexer
16+
from .parser import CxmlParser
17+
from .symbols import root
18+
from .translator import CxmlTranslator
19+
20+
21+
def xml(cxml):
22+
"""
23+
Return the XML generated from *cxml*.
24+
"""
25+
lexer = CxmlLexer(cxml)
26+
parser = CxmlParser(lexer)
27+
root_ast = parser.parse(root)
28+
root_element = CxmlTranslator.translate(root_ast)
29+
return root_element.xml

cxml/lexer.py

Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# encoding: utf-8
2+
3+
"""
4+
Lexical analyzer (a.k.a. lexer, tokenizer) for the CXML language.
5+
"""
6+
7+
from __future__ import (
8+
absolute_import, division, print_function, unicode_literals
9+
)
10+
11+
from .lib.lexer import Lexer
12+
13+
from .symbols import (
14+
COLON, COMMA, EQUAL, LBRACE, LPAREN, NAME, RBRACE, RPAREN, SLASH, SNTL,
15+
TEXT
16+
)
17+
18+
19+
alphas = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
20+
nums = '0123456789'
21+
22+
name_start_chars = alphas + '_'
23+
name_chars = alphas + nums + '_-.'
24+
25+
punctuation = ':,=/{}()'
26+
27+
28+
class CxmlLexer(Lexer):
29+
"""
30+
Lexer object for CXML.
31+
"""
32+
def _lex_start(self):
33+
"""
34+
The starting and fallback state of the lexer, where it is in-between
35+
tokens.
36+
"""
37+
# should only be entering this state in-between tokens
38+
assert self._start == self._pos
39+
40+
peek = self._peek
41+
42+
# test EOF first to avoid __contains__ errors
43+
if peek is None:
44+
return self._lex_eof
45+
46+
# ignore whitespace as a priority
47+
elif peek == ' ':
48+
return self._lex_whitespace
49+
50+
elif peek in name_start_chars:
51+
return self._lex_name
52+
53+
elif peek in punctuation:
54+
return self._lex_punctuation
55+
56+
elif peek == '"':
57+
return self._lex_quoted_string
58+
59+
else:
60+
raise SyntaxError(
61+
"at character '%s' in '%s'" % (peek, self._input)
62+
)
63+
64+
def _lex_eof(self):
65+
"""
66+
Emit `SNTL` token and end parsing by returning |None|.
67+
"""
68+
assert self._start == self._pos == self._len
69+
self._emit(SNTL)
70+
return None
71+
72+
def _lex_name(self):
73+
"""
74+
Emit maximal sequence of name characters.
75+
"""
76+
self._accept_run(name_chars)
77+
self._emit(NAME)
78+
return self._lex_start
79+
80+
def _lex_punctuation(self):
81+
"""
82+
Emit the appropriate single-character punctuation token, such as
83+
COLON.
84+
"""
85+
symbol = self._next()
86+
87+
token_type = {
88+
':': COLON, ',': COMMA, '{': LBRACE, '}': RBRACE,
89+
'=': EQUAL, '/': SLASH, '(': LPAREN, ')': RPAREN,
90+
}[symbol]
91+
92+
self._emit(token_type)
93+
return self._lex_text if symbol in '=}' else self._lex_start
94+
95+
def _lex_quoted_string(self):
96+
"""
97+
Emit the text of a quoted string as a TEXT token, discarding the
98+
enclosing quote characters.
99+
"""
100+
# skip over opening quote
101+
self._skip()
102+
103+
# accept any character until another double-quote or EOF
104+
self._accept_until('"')
105+
self._emit(TEXT)
106+
107+
# raise unterminated if next character not closing quote
108+
if self._peek != '"':
109+
raise SyntaxError("unterminated quote")
110+
self._skip()
111+
112+
return self._lex_start
113+
114+
def _lex_text(self):
115+
"""
116+
Parse a string value, either a quoted string or a raw string, which
117+
is terminated by a comma, closing brace, slash, or right paren.
118+
"""
119+
peek = self._peek
120+
121+
if peek is None:
122+
return self._lex_eof
123+
124+
if peek == '"':
125+
return self._lex_quoted_string
126+
127+
if peek not in ',}/)':
128+
self._accept_until(',}/)')
129+
self._emit(TEXT)
130+
131+
return self._lex_start
132+
133+
def _lex_whitespace(self):
134+
"""
135+
Consume all whitespace at current position and ignore it.
136+
"""
137+
self._accept_run(' ')
138+
self._ignore()
139+
return self._lex_start
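
`CxmlLexer` is written as a set of state functions: each `_lex_*` method consumes some input, possibly emits a token, and returns the next state, or `None` to stop. The base class in `cxml/lib/lexer.py` is not part of this excerpt, so the following is a minimal, hypothetical stand-in (`MiniLexer` and its `tokens()` driver are assumed names) that handles only names, slashes, and spaces, to show how such a driver loop might work.

```python
# Simplified, hypothetical stand-in for the Lexer base class; the helper
# names (_peek, _accept_run, _emit, _ignore) mirror those CxmlLexer calls,
# but their real implementations are not shown in this commit excerpt.
NAME, SLASH, SNTL = 'NAME', 'SLASH', 'SNTL'

ALPHAS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
NAME_START = ALPHAS + '_'
NAME_CHARS = ALPHAS + '0123456789' + '_-.'


class MiniLexer(object):
    def __init__(self, text):
        self._input = text
        self._len = len(text)
        self._start = 0   # where the token being scanned begins
        self._pos = 0     # current read position
        self._tokens = []

    @property
    def _peek(self):
        """Next unread character, or None at end of input."""
        return self._input[self._pos] if self._pos < self._len else None

    def _accept_run(self, chars):
        """Advance past a maximal run of characters found in *chars*."""
        while self._peek is not None and self._peek in chars:
            self._pos += 1

    def _emit(self, token_type):
        """Record the scanned span as a token and start a new one."""
        lexeme = self._input[self._start:self._pos]
        self._tokens.append((token_type, lexeme))
        self._start = self._pos

    def _ignore(self):
        """Discard the scanned span without emitting a token."""
        self._start = self._pos

    def tokens(self):
        """Run state functions until one returns None."""
        state = self._lex_start
        while state is not None:
            state = state()
        return self._tokens

    def _lex_start(self):
        peek = self._peek
        if peek is None:          # end of input: emit sentinel and stop
            self._emit(SNTL)
            return None
        if peek == ' ':           # whitespace between tokens is dropped
            self._accept_run(' ')
            self._ignore()
            return self._lex_start
        if peek == '/':
            self._pos += 1
            self._emit(SLASH)
            return self._lex_start
        if peek in NAME_START:
            self._accept_run(NAME_CHARS)
            self._emit(NAME)
            return self._lex_start
        raise SyntaxError("at character '%s' in '%s'" % (peek, self._input))


print(MiniLexer('foo / bar').tokens())
```

Returning the next state from each method keeps context-dependent rules local. In the real lexer, for instance, `_lex_punctuation` switches to `_lex_text` after `=` or `}` so that raw text is read until the next delimiter.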

cxml/lib/__init__.py

Whitespace-only changes.
