Commit ca4a3e6

Merge branch 'PHP-5.3' into PHP-5.4
* PHP-5.3: merged PCRE 8.32
2 parents a3f020a + 357ab3c commit ca4a3e6

42 files changed, +11916 -6984 lines changed

ext/pcre/config.w32

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@ AC_DEFINE('HAVE_BUNDLED_PCRE', 1, 'Using bundled PCRE library');
 AC_DEFINE('HAVE_PCRE', 1, 'Have PCRE library');
 PHP_PCRE="yes";
 PHP_INSTALL_HEADERS("ext/pcre", "php_pcre.h pcrelib/");
+ADD_FLAG("CFLAGS_PCRE", " /D HAVE_CONFIG_H");

ext/pcre/config0.m4

Lines changed: 2 additions & 1 deletion
@@ -59,7 +59,8 @@ PHP_ARG_WITH(pcre-regex,,
 pcrelib/pcre_ord2utf8.c pcrelib/pcre_refcount.c pcrelib/pcre_study.c \
 pcrelib/pcre_tables.c pcrelib/pcre_valid_utf8.c \
 pcrelib/pcre_version.c pcrelib/pcre_xclass.c"
-PHP_NEW_EXTENSION(pcre, $pcrelib_sources php_pcre.c, no,,-I@ext_srcdir@/pcrelib)
+PHP_PCRE_CFLAGS="-DHAVE_CONFIG_H -I@ext_srcdir@/pcrelib"
+PHP_NEW_EXTENSION(pcre, $pcrelib_sources php_pcre.c, no,,$PHP_PCRE_CFLAGS)
 PHP_ADD_BUILD_DIR($ext_builddir/pcrelib)
 PHP_INSTALL_HEADERS([ext/pcre], [php_pcre.h pcrelib/])
 AC_DEFINE(HAVE_BUNDLED_PCRE, 1, [ ])

ext/pcre/pcrelib/ChangeLog

Lines changed: 164 additions & 0 deletions
@@ -1,6 +1,170 @@
 ChangeLog for PCRE
 ------------------
 
+Version 8.32 30-November-2012
+-----------------------------
+
+1. Improved JIT compiler optimizations for first character search and single
+character iterators.
+
+2. Supporting IBM XL C compilers for PPC architectures in the JIT compiler.
+Patch by Daniel Richard G.
+
+3. Single character iterator optimizations in the JIT compiler.
+
+4. Improved JIT compiler optimizations for character ranges.
+
+5. Rename the "leave" variable names to "quit" to improve WinCE compatibility.
+Reported by Giuseppe D'Angelo.
+
+6. The PCRE_STARTLINE bit, indicating that a match can occur only at the start
+of a line, was being set incorrectly in cases where .* appeared inside
+atomic brackets at the start of a pattern, or where there was a subsequent
+*PRUNE or *SKIP.
+
+7. Improved instruction cache flush for POWER/PowerPC.
+Patch by Daniel Richard G.
+
+8. Fixed a number of issues in pcregrep, making it more compatible with GNU
+grep:
+
+(a) There is now no limit to the number of patterns to be matched.
+
+(b) An error is given if a pattern is too long.
+
+(c) Multiple uses of --exclude, --exclude-dir, --include, and --include-dir
+are now supported.
+
+(d) --exclude-from and --include-from (multiple use) have been added.
+
+(e) Exclusions and inclusions now apply to all files and directories, not
+just to those obtained from scanning a directory recursively.
+
+(f) Multiple uses of -f and --file-list are now supported.
+
+(g) In a Windows environment, the default for -d has been changed from
+"read" (the GNU grep default) to "skip", because otherwise the presence
+of a directory in the file list provokes an error.
+
+(h) The documentation has been revised and clarified in places.
+
+9. Improve the matching speed of capturing brackets.
+
+10. Changed the meaning of \X so that it now matches a Unicode extended
+grapheme cluster.
+
+11. Patch by Daniel Richard G to the autoconf files to add a macro for sorting
+out POSIX threads when JIT support is configured.
+
+12. Added support for PCRE_STUDY_EXTRA_NEEDED.
+
+13. In the POSIX wrapper regcomp() function, setting re_nsub field in the preg
+structure could go wrong in environments where size_t is not the same size
+as int.
+
+14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be
+set.
+
+15. The EBCDIC support had decayed; later updates to the code had included
+explicit references to (e.g.) \x0a instead of CHAR_LF. There has been a
+general tidy up of EBCDIC-related issues, and the documentation was also
+not quite right. There is now a test that can be run on ASCII systems to
+check some of the EBCDIC-related things (but is it not a full test).
+
+16. The new PCRE_STUDY_EXTRA_NEEDED option is now used by pcregrep, resulting
+in a small tidy to the code.
+
+17. Fix JIT tests when UTF is disabled and both 8 and 16 bit mode are enabled.
+
+18. If the --only-matching (-o) option in pcregrep is specified multiple
+times, each one causes appropriate output. For example, -o1 -o2 outputs the
+substrings matched by the 1st and 2nd capturing parentheses. A separating
+string can be specified by --om-separator (default empty).
+
+19. Improving the first n character searches.
+
+20. Turn case lists for horizontal and vertical white space into macros so that
+they are defined only once.
+
+21. This set of changes together give more compatible Unicode case-folding
+behaviour for characters that have more than one other case when UCP
+support is available.
+
+(a) The Unicode property table now has offsets into a new table of sets of
+three or more characters that are case-equivalent. The MultiStage2.py
+script that generates these tables (the pcre_ucd.c file) now scans
+CaseFolding.txt instead of UnicodeData.txt for character case
+information.
+
+(b) The code for adding characters or ranges of characters to a character
+class has been abstracted into a generalized function that also handles
+case-independence. In UTF-mode with UCP support, this uses the new data
+to handle characters with more than one other case.
+
+(c) A bug that is fixed as a result of (b) is that codepoints less than 256
+whose other case is greater than 256 are now correctly matched
+caselessly. Previously, the high codepoint matched the low one, but not
+vice versa.
+
+(d) The processing of \h, \H, \v, and \ in character classes now makes use
+of the new class addition function, using character lists defined as
+macros alongside the case definitions of 20 above.
+
+(e) Caseless back references now work with characters that have more than
+one other case.
+
+(f) General caseless matching of characters with more than one other case
+is supported.
+
+22. Unicode character properties were updated from Unicode 6.2.0
+
+23. Improved CMake support under Windows. Patch by Daniel Richard G.
+
+24. Add support for 32-bit character strings, and UTF-32
+
+25. Major JIT compiler update (code refactoring and bugfixing).
+Experimental Sparc 32 support is added.
+
+26. Applied a modified version of Daniel Richard G's patch to create
+pcre.h.generic and config.h.generic by "make" instead of in the
+PrepareRelease script.
+
+27. Added a definition for CHAR_NULL (helpful for the z/OS port), and use it in
+pcre_compile.c when checking for a zero character.
+
+28. Introducing a native interface for JIT. Through this interface, the compiled
+machine code can be directly executed. The purpose of this interface is to
+provide fast pattern matching, so several sanity checks are not performed.
+However, feature tests are still performed. The new interface provides
+1.4x speedup compared to the old one.
+
+29. If pcre_exec() or pcre_dfa_exec() was called with a negative value for
+the subject string length, the error given was PCRE_ERROR_BADOFFSET, which
+was confusing. There is now a new error PCRE_ERROR_BADLENGTH for this case.
+
+30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints
+greater than 0x7fffffff (which cannot be represented in UTF-8, even under
+the "old" RFC 2279). Instead, it ended up passing a negative length to
+pcre_exec().
+
+31. Add support for GCC's visibility feature to hide internal functions.
+
+32. Running "pcretest -C pcre8" or "pcretest -C pcre16" gave a spurious error
+"unknown -C option" after outputting 0 or 1.
+
+33. There is now support for generating a code coverage report for the test
+suite in environments where gcc is the compiler and lcov is installed. This
+is mainly for the benefit of the developers.
+
+34. If PCRE is built with --enable-valgrind, certain memory regions are marked
+unaddressable using valgrind annotations, allowing valgrind to detect
+invalid memory accesses. This is mainly for the benefit of the developers.
+
+25. (*UTF) can now be used to start a pattern in any of the three libraries.
+
+26. Give configure error if --enable-cpp but no C++ compiler found.
+
+
 Version 8.31 06-July-2012
 -------------------------
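Note on ChangeLog item 29: the change separates two argument errors that previously shared one error code. The C sketch below illustrates that ordering of checks under stated assumptions. The function name `sketch_check_subject_args` and the `SKETCH_` constants are hypothetical stand-ins; they are neither PCRE's code nor the numeric values defined in pcre.h.

```c
/* Hypothetical stand-ins for the two error codes named in item 29.
   The numeric values are illustrative only, not taken from pcre.h. */
enum {
    SKETCH_OK = 0,
    SKETCH_ERROR_BADOFFSET = -1,
    SKETCH_ERROR_BADLENGTH = -2
};

/* Validate a subject length and a start offset. A negative length is
   reported with its own error first, so it can no longer be mistaken
   for a bad offset. */
int sketch_check_subject_args(int length, int start_offset)
{
    if (length < 0)
        return SKETCH_ERROR_BADLENGTH;
    if (start_offset < 0 || start_offset > length)
        return SKETCH_ERROR_BADOFFSET;
    return SKETCH_OK;
}
```

Checking the length before the offset matters in this sketch: with a negative length, any offset would also look out of range, which is the kind of confusing report the item describes.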

ext/pcre/pcrelib/HACKING

Lines changed: 20 additions & 15 deletions
@@ -49,16 +49,17 @@ complexity in Perl regular expressions, I couldn't do this. In any case, a
 first pass through the pattern is helpful for other reasons.
 
 
-Support for 16-bit data strings
--------------------------------
+Support for 16-bit and 32-bit data strings
+-------------------------------------------
 
-From release 8.30, PCRE supports 16-bit as well as 8-bit data strings, by being
-compilable in either 8-bit or 16-bit modes, or both. Thus, two different
-libraries can be created. In the description that follows, the word "short" is
+From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
+release 8.32, PCRE supports 32-bit data strings. The library can be compiled
+in any combination of 8-bit, 16-bit or 32-bit modes, creating different
+libraries. In the description that follows, the word "short" is
 used for a 16-bit data quantity, and the word "unit" is used for a quantity
-that is a byte in 8-bit mode and a short in 16-bit mode. However, so as not to
-over-complicate the text, the names of PCRE functions are given in 8-bit form
-only.
+that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned
+integer in 32-bit mode. However, so as not to over-complicate the text, the
+names of PCRE functions are given in 8-bit form only.
 
 
 Computing the memory requirement: how it was
@@ -138,9 +139,10 @@ Format of compiled patterns
 ---------------------------
 
 The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
-shorts in 16-bit mode), containing items of variable length. The first unit in
-an item contains an opcode, and the length of the item is either implicit in
-the opcode or contained in the data that follows it.
+shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing
+items of variable length. The first unit in an item contains an opcode, and
+the length of the item is either implicit in the opcode or contained in the
+data that follows it.
 
 In many cases listed below, LINK_SIZE data values are specified for offsets
 within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
@@ -207,7 +209,8 @@ Matching literal characters
 
 The OP_CHAR opcode is followed by a single character that is to be matched
 casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one unit long.
+the character may be more than one unit long. In UTF-32 mode, characters
+are always exactly one unit long.
 
 
 Repeating single characters
@@ -228,7 +231,8 @@ following opcodes, which come in caseful and caseless versions:
 OP_POSQUERY OP_POSQUERYI
 
 Each opcode is followed by the character that is to be repeated. In ASCII mode,
-these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable.
+these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
+UTF-32 mode these are one-unit items.
 Those with "MIN" in their names are the minimizing versions. Those with "POS"
 in their names are possessive versions. Other repeats make use of these
 opcodes:
@@ -299,7 +303,7 @@ bit map containing a 1 bit for every character that is acceptable. The bits are
 counted from the least significant end of each unit. In caseless mode, bits for
 both cases are set.
 
-The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16 mode,
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode,
 subject characters with values greater than 255 can be handled correctly. For
 OP_CLASS they do not match, whereas for OP_NCLASS they do.
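Note on the hunk above: the bit map lookup and the OP_CLASS/OP_NCLASS distinction can be sketched in C for the 8-bit case. This is a minimal illustration, not PCRE's matcher. The `sketch_` names are hypothetical, and the 32-byte map size is an assumption derived from needing one bit for each value 0..255.

```c
#include <stdint.h>

#define SKETCH_MAP_BYTES 32  /* assumed: 256 bits, one per character value 0..255 */

/* Set the bit for character value c (caller guarantees c <= 255),
   counting bits from the least significant end of each byte. */
void sketch_map_set(uint8_t map[SKETCH_MAP_BYTES], unsigned c)
{
    map[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Test a subject character against the map. Values above 255 have no
   bit in the map: they fail for an OP_CLASS-style item and match for
   an OP_NCLASS-style item. */
int sketch_class_matches(const uint8_t map[SKETCH_MAP_BYTES],
                         uint32_t c, int is_nclass)
{
    if (c > 255)
        return is_nclass ? 1 : 0;
    return (map[c / 8] >> (c % 8)) & 1;
}
```

Because the map itself cannot represent values above 255, the opcode is the only place left to record how such characters should be treated, which is the reason the text gives for having two opcodes.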

@@ -412,7 +416,8 @@ OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
 is OP_REVERSE, followed by a two byte (one short) count of the number of
 characters to move back the pointer in the subject string. In ASCII mode, the
 count is a number of units, but in UTF-8/16 mode each character may occupy more
-than one unit. A separate count is present in each alternative of a lookbehind
+than one unit; in UTF-32 mode each character occupies exactly one unit.
+A separate count is present in each alternative of a lookbehind
 assertion, allowing them to have different fixed lengths.
 
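Note on the HACKING changes above: several hunks turn on how many units one character occupies in each mode. The C sketch below computes that count for the standard UTF-8, UTF-16 and UTF-32 encodings. The function name `sketch_units_for` is hypothetical, and the sketch does not reject surrogate code points or handle the longer sequences permitted by older UTF-8 definitions.

```c
#include <stdint.h>

/* Number of code units needed for one code point at a given unit width
   (8, 16 or 32 bits), using the standard UTF-8/UTF-16/UTF-32 encodings.
   Returns 0 for an unsupported width or a code point above 0x10FFFF. */
int sketch_units_for(uint32_t cp, int unit_bits)
{
    if (cp > 0x10FFFF)
        return 0;
    switch (unit_bits) {
    case 8:
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    case 16:
        return cp < 0x10000 ? 1 : 2;  /* surrogate pair above the BMP */
    case 32:
        return 1;                     /* always exactly one unit */
    default:
        return 0;
    }
}
```

For a lookbehind this means that moving back n characters is a fixed n units in UTF-32 mode, while in UTF-8 or UTF-16 mode the distance in units depends on the characters actually present in the subject.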

ext/pcre/pcrelib/NEWS

Lines changed: 41 additions & 1 deletion
@@ -1,6 +1,46 @@
 News about PCRE releases
 ------------------------
 
+Release 8.32 30-November-2012
+-----------------------------
+
+This release fixes a number of bugs, but also has some new features. These are
+the highlights:
+
+. There is now support for 32-bit character strings and UTF-32. Like the
+16-bit support, this is done by compiling a separate 32-bit library.
+
+. \X now matches a Unicode extended grapheme cluster.
+
+. Case-independent matching of Unicode characters that have more than one
+"other case" now makes all three (or more) characters equivalent. This
+applies, for example, to Greek Sigma, which has two lowercase versions.
+
+. Unicode character properties are updated to Unicode 6.2.0.
+
+. The EBCDIC support, which had decayed, has had a spring clean.
+
+. A number of JIT optimizations have been added, which give faster JIT
+execution speed. In addition, a new direct interface to JIT execution is
+available. This bypasses some of the sanity checks of pcre_exec() to give a
+noticeable speed-up.
+
+. A number of issues in pcregrep have been fixed, making it more compatible
+with GNU grep. In particular, --exclude and --include (and variants) apply
+to all files now, not just those obtained from scanning a directory
+recursively. In Windows environments, the default action for directories is
+now "skip" instead of "read" (which provokes an error).
+
+. If the --only-matching (-o) option in pcregrep is specified multiple
+times, each one causes appropriate output. For example, -o1 -o2 outputs the
+substrings matched by the 1st and 2nd capturing parentheses. A separating
+string can be specified by --om-separator (default empty).
+
+. When PCRE is built via Autotools using a version of gcc that has the
+"visibility" feature, it is used to hide internal library functions that are
+not part of the public API.
+
+
 Release 8.31 06-July-2012
 -------------------------
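Note on the case-independence highlight above: treating "all three (or more) characters" as equivalent amounts to a lookup in sets of case-equivalent code points. The C sketch below shows the idea with a tiny hand-written table. The table and the `sketch_` names are hypothetical and cover only two sets; this is not the data that PCRE generates into pcre_ucd.c, and ordinary two-case pairs such as A/a are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of case-equivalent sets, each terminated by 0. */
static const uint32_t sketch_sets[][4] = {
    { 0x03A3, 0x03C3, 0x03C2, 0 },  /* Greek Sigma: capital, small, final small */
    { 0x004B, 0x006B, 0x212A, 0 },  /* K, k, Kelvin sign */
};

/* Return the index of the set containing c, or -1 if c is in none. */
static int sketch_find_set(uint32_t c)
{
    size_t n = sizeof sketch_sets / sizeof sketch_sets[0];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; sketch_sets[i][j] != 0; j++)
            if (sketch_sets[i][j] == c)
                return (int)i;
    return -1;
}

/* Caseless comparison limited to the sets above: two characters match
   if they are identical or belong to the same set. */
int sketch_caseless_equal(uint32_t a, uint32_t b)
{
    if (a == b)
        return 1;
    int sa = sketch_find_set(a);
    return sa >= 0 && sa == sketch_find_set(b);
}
```

Comparing set membership rather than a single "other case" value is what makes the relation symmetric across all members, so the final sigma matches the capital as readily as the capital matches the final sigma.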

@@ -9,7 +49,7 @@ This is mainly a bug-fixing release, with a small number of developments:
 . The JIT compiler now supports partial matching and the (*MARK) and
 (*COMMIT) verbs.
 
-. PCRE_INFO_MAXLOOKBEHIND can be used to find the longest lookbehing in a
+. PCRE_INFO_MAXLOOKBEHIND can be used to find the longest lookbehind in a
 pattern.
 
 . There should be a performance improvement when using the heap instead of the
