|
1 | 1 | ChangeLog for PCRE
|
2 | 2 | ------------------
|
3 | 3 |
|
4 |
| -Version 3.0 02-Jan-02 |
| 4 | +Version 4.00 .... |
| 5 | +----------------- |
| 6 | + |
| 7 | +1. If a comment in an extended regex that started immediately after a meta-item |
| 8 | +extended to the end of string, PCRE compiled incorrect data. This could lead to |
| 9 | +all kinds of weird effects. Example: /#/ was bad; /()#/ was bad; /a#/ was not. |
| 10 | + |
| 11 | +2. Moved to autoconf 2.53 and libtool 1.4.2. |
| 12 | + |
| 13 | +3. Perl 5.8 no longer needs "use utf8" for doing UTF-8 things. Consequently, |
| 14 | +the special perltest8 script is no longer needed - all the tests can be run |
| 15 | +from a single perltest script. |
| 16 | + |
| 17 | +4. From 5.004, Perl has not included the VT character (0x0b) in the set defined |
| 18 | +by \s. It has now been removed in PCRE. This means it isn't recognized as |
| 19 | +whitespace in /x regexes either, which is the same as Perl. Note that the POSIX |
| 20 | +class [:space:] *does* include VT, thereby creating a mess. |
| 21 | + |
| 22 | +5. Added the class [:blank:] (a GNU extension from Perl 5.8) to match only |
| 23 | +space and tab. |
| 24 | + |
| 25 | +6. Perl 5.005 was a long time ago. It's time to amalgamate the tests that use |
| 26 | +its new features into the main test script, reducing the number of scripts. |
| 27 | + |
| 28 | +7. Perl 5.8 has changed the meaning of patterns like /a(?i)b/. Earlier |
| 29 | +versions were backward compatible, and made the (?i) apply to the whole |
| 30 | +pattern, as if /i were given. Now it behaves more logically, and applies the |
| 31 | +option setting only to what follows. PCRE has been changed to follow suit. |
| 32 | +However, if it finds option settings right at the start of the pattern, it |
| 33 | +extracts them into the global options, as before. Thus, they show up in the |
| 34 | +info data. |
| 35 | + |
| 36 | +8. Added support for the \Q...\E escape sequence. Characters in between are |
| 37 | +treated as literals. This is slightly different from Perl in that $ and @ are |
| 38 | +also handled as literals inside the quotes. In Perl, they will cause variable |
| 39 | +interpolation. Note the following examples: |
| 40 | + |
| 41 | + Pattern PCRE matches Perl matches |
| 42 | + |
| 43 | + \Qabc$xyz\E abc$xyz abc followed by the contents of $xyz |
| 44 | + \Qabc\$xyz\E abc\$xyz abc\$xyz |
| 45 | + \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 46 | + |
| 47 | +9. Re-organized 3 code statements in pcretest to avoid "overflow in |
| 48 | +floating-point constant arithmetic" warnings from a Microsoft compiler. Added a |
| 49 | +(size_t) cast to one statement in pcretest and one in pcreposix to avoid |
| 50 | +signed/unsigned warnings. |
| 51 | + |
| 52 | +10. SunOS4 doesn't have strtoul(). This was used only for unpicking the -o |
| 53 | +option for pcretest, so I've replaced it by a simple function that does just |
| 54 | +that job. |
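
A minimal sketch of the kind of replacement item 10 describes, assuming the option value is a plain run of decimal digits. The function name get_decimal and its signature are hypothetical and are not taken from pcretest; there is no overflow checking.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not the pcretest code): parse a run of decimal
   digits into an int without relying on strtoul(). Parsing stops at the
   first non-digit; if endptr is not NULL it receives that position. */
int get_decimal(const char *str, const char **endptr)
{
    int result = 0;
    while (*str != 0 && isdigit((unsigned char)*str))
        result = result * 10 + (*str++ - '0');
    if (endptr != NULL) *endptr = str;
    return result;
}
```

Calling get_decimal("42x", &end) yields 42 and leaves end pointing at the 'x'.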
| 55 | + |
| 56 | +11. pcregrep was ending with code 0 instead of 2 for the commands "pcregrep" or |
| 57 | +"pcregrep -". |
| 58 | + |
| 59 | +12. Added "possessive quantifiers" ?+, *+, ++, and {,}+ which come from Sun's |
| 60 | +Java package. This provides some syntactic sugar for simple cases of what my |
| 61 | +documentation calls "once-only subpatterns". A pattern such as x*+ is the |
| 62 | +same as (?>x*). In other words, if what is inside (?>...) is just a single |
| 63 | +repeated item, you can use this simplified notation. Note that this only makes sense |
| 64 | +with greedy quantifiers. Consequently, the use of the possessive quantifier |
| 65 | +forces greediness, whatever the setting of the PCRE_UNGREEDY option. |
| 66 | + |
| 67 | +13. A change of greediness default within a pattern was not taking effect at |
| 68 | +the current level for patterns like /(b+(?U)a+)/. It did apply to parenthesized |
| 69 | +subpatterns that followed. Patterns like /b+(?U)a+/ worked because the option |
| 70 | +was abstracted outside. |
| 71 | + |
| 72 | +14. PCRE now supports the \G assertion. It is true when the current matching |
| 73 | +position is at the start point of the match. This differs from \A when the |
| 74 | +starting offset is non-zero. Used with the /g option of pcretest (or similar |
| 75 | +code), it works in the same way as it does for Perl's /g option. |
| 76 | + |
| 77 | +15. Some bugs concerning the handling of certain option changes within patterns |
| 78 | +have been fixed. These applied to options other than (?ims). For example, |
| 79 | +"a(?x: b c )d" did not match "XabcdY" but did match "Xa b c dY". It should have |
| 80 | +been the other way round. Some of this was related to change 7 above. |
| 81 | + |
| 82 | +16. PCRE now gives errors for /[.x.]/ and /[=x=]/ as unsupported POSIX |
| 83 | +features, as Perl does. Previously, PCRE gave the errors only for /[[.x.]]/ |
| 84 | +and /[[=x=]]/. PCRE now also gives an error for /[:name:]/ because it supports |
| 85 | +POSIX classes only within a class (e.g. /[[:alpha:]]/). |
| 86 | + |
| 87 | +17. Added support for Perl's \C escape. This matches one byte, even in UTF8 |
| 88 | +mode. Unlike ".", it always matches newline, whatever the setting of |
| 89 | +PCRE_DOTALL. However, PCRE does not permit \C to appear in lookbehind |
| 90 | +assertions. (Perl allows it, but it doesn't (in general) work because it can't |
| 91 | +calculate the length of the lookbehind. At least, that's the case for Perl |
| 92 | +5.8.0.) |
| 93 | + |
| 94 | +18. Added an error diagnosis for escapes that PCRE does not support: these are |
| 95 | +\L, \l, \N, \P, \p, \U, \u, and \X. |
| 96 | + |
| 97 | +19. Although correctly diagnosing a missing ']' in a character class, PCRE was |
| 98 | +reading past the end of the pattern in cases such as /[abcd/. |
| 99 | + |
| 100 | +20. PCRE was getting more memory than necessary for patterns with classes that |
| 101 | +contained both POSIX named classes and other characters, e.g. /[[:space:]abc]/. |
| 102 | + |
| 103 | +21. Added some code, conditional on #ifdef VPCOMPAT, to make life easier for |
| 104 | +compiling PCRE for use with Virtual Pascal. |
| 105 | + |
| 106 | +22. Small fix to the Makefile to make it work properly if the build is done |
| 107 | +outside the source tree. |
| 108 | + |
| 109 | +23. Added a new extension: a condition to go with recursion. If a conditional |
| 110 | +subpattern starts with (?(R) the "true" branch is used if recursion has |
| 111 | +happened, whereas the "false" branch is used only at the top level. |
| 112 | + |
| 113 | +24. When there was a very long string of literal characters (over 255 bytes |
| 114 | +without UTF support, over 250 bytes with UTF support), the computation of how |
| 115 | +much memory was required could be incorrect, leading to segfaults or other |
| 116 | +strange effects. |
| 117 | + |
| 118 | +25. PCRE was incorrectly assuming anchoring (either to start of subject or to |
| 119 | +start of line for a non-DOTALL pattern) when a pattern started with (.*) and |
| 120 | +there was a subsequent back reference to those brackets. This meant that, for |
| 121 | +example, /(.*)\d+\1/ failed to match "abc123bc". Unfortunately, it isn't |
| 122 | +possible to check for precisely this case. All we can do is abandon the |
| 123 | +optimization if .* occurs inside capturing brackets when there are any back |
| 124 | +references whatsoever. |
| 125 | + |
| 126 | +26. The optimizations for finding the first character of a |
| 127 | +non-anchored pattern, and for finding a character that is required later in the |
| 128 | +match, were failing in some cases. This didn't break the matching; it just |
| 129 | +failed to optimize when it could. The way this is done has been re-implemented. |
| 130 | + |
| 131 | +27. Fixed typo in error message for invalid (?R item (it said "(?p"). |
| 132 | + |
| 133 | +28. Added a new feature that provides some of the functionality that Perl |
| 134 | +provides with (?{...}). The facility is termed a "callout". The way it is done |
| 135 | +in PCRE is for the caller to provide an optional function, by setting |
| 136 | +pcre_callout to its entry point. Like pcre_malloc and pcre_free, this is a |
| 137 | +global variable. By default it is unset, which disables all calling out. To get |
| 138 | +the function called, the regex must include (?C) at appropriate points. This |
| 139 | +is, in fact, equivalent to (?C0), and any number <= 255 may be given with (?C). |
| 140 | +This provides a means of identifying different callout points. When PCRE |
| 141 | +reaches such a point in the regex, if pcre_callout has been set, the external |
| 142 | +function is called. It is provided with data in a structure called |
| 143 | +pcre_callout_block, which is defined in pcre.h. If the function returns 0, |
| 144 | +matching continues; if it returns a non-zero value, the match at the current |
| 145 | +point fails. However, backtracking will occur if possible. |
| 146 | + |
| 147 | +29. pcretest is upgraded to test the callout functionality. It provides a |
| 148 | +callout function that displays information. By default, it shows the start of |
| 149 | +the match and the current position in the text. There are some new data escapes |
| 150 | +to vary what happens: |
| 151 | + |
| 152 | + \C+ in addition, show current contents of captured substrings |
| 153 | + \C- do not supply a callout function |
| 154 | + \C!n return 1 when callout number n is reached |
| 155 | + \C!n!m return 1 when callout number n is reached for the mth time |
| 156 | + |
| 157 | +30. If pcregrep was called with the -l option and just a single file name, it |
| 158 | +output "<stdin>" if a match was found, instead of the file name. |
| 159 | + |
| 160 | +31. Improve the efficiency of the POSIX API to PCRE. If the number of capturing |
| 161 | +slots is less than POSIX_MALLOC_THRESHOLD, use a block on the stack to pass to |
| 162 | +pcre_exec(). This saves a malloc/free per call. The default value of |
| 163 | +POSIX_MALLOC_THRESHOLD is 5; it can be changed by --with-posix-malloc-threshold |
| 164 | +when configuring. |
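
The idea in item 31 can be sketched as follows. This is an illustration only, not the pcreposix source: the function run_with_vector, the used_heap flag, and the factor of three ints per capturing slot are assumptions made for the example, and the loop stands in for the real call to pcre_exec().

```c
#include <stdlib.h>

#define POSIX_MALLOC_THRESHOLD 5   /* default value quoted in item 31 */

/* Hypothetical sketch: use a stack block for small vectors and fall back
   to malloc() otherwise. Assumes 3 ints per capturing slot. Returns 0 on
   success, -1 if malloc() fails; *used_heap reports which path was taken. */
int run_with_vector(size_t nmatch, int *used_heap)
{
    int small[POSIX_MALLOC_THRESHOLD * 3];
    int *ovector = small;
    size_t n = nmatch * 3;
    size_t i;

    *used_heap = 0;
    if (nmatch >= POSIX_MALLOC_THRESHOLD) {
        ovector = malloc(n * sizeof(int));
        if (ovector == NULL) return -1;
        *used_heap = 1;
    }

    /* Stand-in for the matcher filling in the offsets vector. */
    for (i = 0; i < n; i++) ovector[i] = -1;

    if (*used_heap) free(ovector);
    return 0;
}
```

With the default threshold, a call with nmatch of 4 stays on the stack, while a call with nmatch of 5 allocates and frees a heap block.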
| 165 | + |
| 166 | +32. The default maximum size of a compiled pattern is 64K. There have been a |
| 167 | +few cases of people hitting this limit. The code now uses macros to handle the |
| 168 | +storing of links as offsets within the compiled pattern. It defaults to 2-byte |
| 169 | +links, but this can be changed to 3 or 4 bytes by --with-link-size when |
| 170 | +configuring. Tests 2 and 5 work only with 2-byte links because they output |
| 171 | +debugging information about compiled patterns. |
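
A rough sketch of storing links as fixed-width offsets, as item 32 describes. The names put_link and get_link are hypothetical, functions are used instead of macros for readability, and the most-significant-byte-first order is an assumption of this example.

```c
#include <stddef.h>

#define LINK_SIZE 2   /* 2-byte links by default; 3 or 4 would also work here */

/* Hypothetical helper: store 'value' at code[offset] in LINK_SIZE bytes,
   most significant byte first. */
void put_link(unsigned char *code, size_t offset, unsigned long value)
{
    int i;
    for (i = 0; i < LINK_SIZE; i++) {
        int shift = 8 * (LINK_SIZE - 1 - i);
        code[offset + i] = (unsigned char)((value >> shift) & 0xff);
    }
}

/* Hypothetical helper: read back a link stored by put_link(). */
unsigned long get_link(const unsigned char *code, size_t offset)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < LINK_SIZE; i++)
        value = (value << 8) | code[offset + i];
    return value;
}
```

With LINK_SIZE set to 2 the largest offset that can be stored is 65535, which corresponds to the 64K default limit mentioned above.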
| 172 | + |
| 173 | +33. Internal code re-arrangements: |
| 174 | + |
| 175 | + (a) Moved the debugging function for printing out a compiled regex into |
| 176 | + its own source file (printint.c) and used #include to pull it into |
| 177 | + pcretest.c and, when DEBUG is defined, into pcre.c, instead of having |
| 178 | + two separate copies. |
| 179 | + |
| 180 | + (b) Defined the list of op-code names for debugging as a macro in |
| 181 | + internal.h so that it is next to the definition of the opcodes. |
| 182 | + |
| 183 | + (c) Defined a table of op-code lengths for simpler skipping along compiled |
| 184 | + code. This is again a macro in internal.h so that it is next to the |
| 185 | + definition of the opcodes. |
| 186 | + |
| 187 | +34. Added support for recursive calls to individual subpatterns, along the |
| 188 | + lines of Robin Houston's patch (but implemented somewhat differently). |
| 189 | + |
| 190 | +35. Further mods to the Makefile to help Win32. Also, added code to pcregrep |
| 191 | + to allow it to read and process whole directories in Win32. This code was |
| 192 | + contributed by Lionel Fourquaux; it has not been tested by me. |
| 193 | + |
| 194 | +36. Added support for named subpatterns. The Python syntax (?P<name>...) is |
| 195 | + used to name a group. Names consist of alphanumerics and underscores, and |
| 196 | + must be unique. Back references use the syntax (?P=name) and recursive |
| 197 | + calls use (?P>name) which is a PCRE extension to the Python extension. |
| 198 | + Groups still have numbers. The function pcre_fullinfo() can be used after |
| 199 | + compilation to extract a name/number map. There are three relevant calls: |
| 200 | + |
| 201 | + PCRE_INFO_NAMEENTRYSIZE yields the size of each entry in the map |
| 202 | + PCRE_INFO_NAMECOUNT yields the number of entries |
| 203 | + PCRE_INFO_NAMETABLE yields a pointer to the map. |
| 204 | + |
| 205 | + The map is a vector of fixed-size entries. The size of each entry depends |
| 206 | + on the length of the longest name used. The first two bytes of each entry |
| 207 | + are the group number, most significant byte first. There follows the |
| 208 | + corresponding name, zero terminated. The names are in alphabetical order. |
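
The map layout described in item 36 can be walked as in the sketch below. The function lookup_group_number is hypothetical; in real use the table pointer, entry count, and entry size would be obtained through pcre_fullinfo() with the three options listed above, whereas here they are simply passed in as arguments.

```c
#include <string.h>

/* Hypothetical helper: find the group number for 'name' in a name table
   laid out as described in item 36. Each entry is entry_size bytes: a
   2-byte group number (most significant byte first) followed by the
   zero-terminated name. Returns -1 if the name is not present. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

A linear scan keeps the example short; because the names are in alphabetical order, a binary search would also be possible.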
| 209 | + |
| 210 | + |
| 211 | +Version 3.9 02-Jan-02 |
5 | 212 | ---------------------
|
6 | 213 |
|
7 | 214 | 1. A bit of extraneous text had somehow crept into the pcregrep documentation.
|
|