Regular Expressions


A regular expression is a pattern that is compared against a text to check whether the text
contains what the pattern specifies.

To practice and build understanding, we can use this environment:

http://www.regular-expressions.info/regexbuddy.html

Basic syntax of a regular expression

The dot

.
The dot stands for any single character. Writing a dot in a pattern means that there is one
character at that position, whatever it may be: a letter from A to Z (lowercase or uppercase),
a digit from 0 to 9, or some other symbol.

Examples:
ca.a matches cana, cama, casa, caja, etc.
It does not match casta or caa
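
The dot examples can be checked with a minimal sketch. It assumes Python's re module and uses fullmatch() so that the whole word has to fit the pattern, which is how the examples above are read; the word list is only illustrative.

```python
import re

# The dot stands for exactly one character.
pattern = re.compile(r"ca.a")

words = ["cana", "cama", "casa", "caja", "casta", "caa"]

# fullmatch() succeeds only when the entire word fits the pattern.
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['cana', 'cama', 'casa', 'caja']
```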

Start and end of string

To tell the pattern where the string starts or ends, use ^ for the start and $ for the end.

Examples:
^olivas matches olivas verdes, but not quiero olivas
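
A small sketch of both anchors, assuming Python's re module. The pattern olivas$ is an added illustration of the end anchor and does not come from the text above.

```python
import re

starts_with_olivas = re.compile(r"^olivas")
ends_with_olivas = re.compile(r"olivas$")  # illustrative counterpart for $

texts = ["olivas verdes", "quiero olivas"]

# search() scans the text, but the anchors pin the match to one end.
at_start = [t for t in texts if starts_with_olivas.search(t)]
at_end = [t for t in texts if ends_with_olivas.search(t)]

print(at_start)  # ['olivas verdes']
print(at_end)    # ['quiero olivas']
```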

Quantifiers

To indicate that an element of the pattern repeats an indeterminate number of times, use + or
* . The + means that the preceding element appears one or more times. The * means that the
preceding element appears zero or more times.

Examples:
gafas+ matches gafassss but not gafa
whereas
clo*aca matches claca, cloaca, cloooooooaca, etc.
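
The difference between + and * can be sketched as follows, assuming Python's re module and whole-word matching with fullmatch(); the test words are the ones from the examples.

```python
import re

plus = re.compile(r"gafas+")    # the final s must appear at least once
star = re.compile(r"clo*aca")   # the o may be absent

plus_results = {w: bool(plus.fullmatch(w)) for w in ["gafas", "gafassss", "gafa"]}
star_results = {w: bool(star.fullmatch(w)) for w in ["claca", "cloaca", "cloooooooaca"]}

print(plus_results)  # {'gafas': True, 'gafassss': True, 'gafa': False}
print(star_results)  # {'claca': True, 'cloaca': True, 'cloooooooaca': True}
```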

To tell the pattern that an element may appear once or not at all, use the question mark as
follows:
coches? matches coche and coches

The pattern looks for the word coche, and the quantifier applies only to the s.

The quantifiers seen so far are + , * and ? .
To set how many times an element repeats, use curly braces { } and write inside them either
the interval or the exact number of repetitions.

Examples:
abc{4} matches abcccc, but not abc, abcc, etc.
abc{1,3} matches abc, abcc, abccc, but not abcccc

If one bound is left empty, it means an unbounded number. For example, x{5,} means that the x
must be repeated 5 or more times.
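
The optional element and the curly-brace forms can be tried out with a short sketch. It assumes Python's re module; the helper name whole is hypothetical and exists only for this illustration.

```python
import re

def whole(pattern, word):
    """Return True when the entire word fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, word) is not None

optional = [whole(r"coches?", w) for w in ["coche", "coches", "cochess"]]
exact = [whole(r"abc{4}", w) for w in ["abcccc", "abc", "abcc"]]
interval = [whole(r"abc{1,3}", w) for w in ["abc", "abcc", "abccc", "abcccc"]]
at_least = [whole(r"x{5,}", w) for w in ["xxxx", "xxxxx", "xxxxxxx"]]

print(optional)  # [True, True, False]
print(exact)     # [True, False, False]
print(interval)  # [True, True, True, False]
print(at_least)  # [False, True, True]
```

Note that the quantifier applies only to the element right before it, so in abc{4} only the c is repeated.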

Ranges

Square brackets [] in a pattern specify the set of valid characters to compare against. It is
enough for any one of them to be present for the condition to hold. Inside the brackets we can
place any number of characters, one after another, or a range of letters or of the digits 0
to 9.

Examples:
c[ao]sa matches casa and cosa
[a-f] matches any single letter from a to f
[0-9][2-6][ANR] matches 12A, 35N, 84R, etc.
but not 21A, 33L, 3A, etc.

Inside square brackets, the ^ symbol no longer means start of string. Placed right after the
opening bracket, it negates the set: [^a-zA-Z] matches any character that is NOT a letter
(neither lowercase nor uppercase), and [^@ ] matches any character except @ and the space.
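
A sketch of character classes and their negation, assuming Python's re module. The negated classes are written here as [^a-zA-Z] and [^@ ], the form Python accepts; the sample strings and the helper name whole are illustrative assumptions.

```python
import re

def whole(pattern, word):
    """Return True when the entire word fits the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, word) is not None

vowel_choice = [whole(r"c[ao]sa", w) for w in ["casa", "cosa", "cusa"]]
codes = [whole(r"[0-9][2-6][ANR]", w) for w in ["12A", "35N", "84R", "21A", "33L", "3A"]]

# A leading ^ inside the brackets negates the class.
non_letters = re.findall(r"[^a-zA-Z]", "a1b2-c")
not_at_or_space = re.findall(r"[^@ ]", "a @b")

print(vowel_choice)     # [True, True, False]
print(codes)            # [True, True, True, False, False, False]
print(non_letters)      # ['1', '2', '-']
print(not_at_or_space)  # ['a', 'b']
```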

Alternation

To alternate between several options, use the | symbol (vertical bar; on some keyboard
layouts it is typed with Alt Gr + 1). This mechanism acts as a disjunction that lets us give
several options. If any one of them matches, the pattern succeeds.

Examples:
aleman(ia|es) matches alemania and alemanes
(norte|sur|este|oeste) matches any of the cardinal points.
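
Alternation can be sketched like this, assuming Python's re module; the extra words alemana and centro are added only to show a non-match.

```python
import re

country = re.compile(r"aleman(ia|es)")
cardinal = re.compile(r"(norte|sur|este|oeste)")

# Each option is tried in turn; one successful option is enough.
country_hits = [w for w in ["alemania", "alemanes", "alemana"] if country.fullmatch(w)]
cardinal_hits = [w for w in ["norte", "sur", "este", "oeste", "centro"] if cardinal.fullmatch(w)]

print(country_hits)   # ['alemania', 'alemanes']
print(cardinal_hits)  # ['norte', 'sur', 'este', 'oeste']
```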

Grouping

Parentheses group part of a pattern. As the previous example shows, this is useful for
delimiting an alternation, but grouping a sub-pattern also lets us work with it as if it were
a single element.

Examples:
(abc)+ matches abc, abcabc, abcabcabc, etc.
ca(sca)?da matches cascada and cada
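
A brief sketch of quantifiers applied to a whole group, assuming Python's re module; the non-matching words abcab and casda are illustrative additions.

```python
import re

# The + and ? now act on the parenthesized group, not on a single character.
repeated = [bool(re.fullmatch(r"(abc)+", w)) for w in ["abc", "abcabc", "abcab"]]
optional_group = [bool(re.fullmatch(r"ca(sca)?da", w)) for w in ["cascada", "cada", "casda"]]

print(repeated)        # [True, True, False]
print(optional_group)  # [True, True, False]
```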

Escaping characters

If, for example, we want the pattern to contain a literal dot or asterisk without it being
interpreted as a metacharacter, we have to escape it. This is done by placing a backslash
right before it: \. or \*
This works for any metacharacter that you want to appear literally rather than be
interpreted.
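
The effect of escaping can be seen in a minimal sketch, assuming Python's re module. The sample strings are illustrative, and re.escape() is shown as one way to add the backslashes automatically when the literal text is not known in advance.

```python
import re

text = "3.14 and 3x14"

# Unescaped, the dot matches any character; escaped, only a literal dot.
unescaped = re.findall(r"3.14", text)
escaped = re.findall(r"3\.14", text)

# re.escape() builds a pattern that matches the given text literally.
literal = re.escape("a.b*c")
auto = re.findall(literal, "a.b*c axbbc")

print(unescaped)  # ['3.14', '3x14']
print(escaped)    # ['3.14']
print(auto)       # ['a.b*c']
```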

The following table lists the wildcard characters used to build patterns and their meaning,
together with a short example of each.

Character  Meaning                          Example          Result
\          Marks a special character        /\$ftp/          Finds the text $ftp
^          Start of a line                  /^-/             Lines that start with -
$          End of a line                    /s$/             Lines that end with s
.          Any character (except newline)   /\b.\b/          One-letter words
|          Separates alternatives           /(L|l|f)ocal/    Finds Local, local, focal
()         Groups characters                /(vocal)/        Finds vocal
[]         Set of alternative characters    /escrib[aoe]/    Matches escriba, escribo, escribe

The following table describes the quantifiers that can be applied to the characters that form
the pattern. Each quantifier acts on the character or parenthesized group immediately before
it.

Quantifier  Description             Example         Result
*           Repeat 0 or more times  /1*234/         Matches 234, 1234, 11234...
+           Repeat 1 or more times  /a+mar/         Matches amar, aamar, aaamar...
?           0 or 1 times            /a?mar/         Matches amar, mar
{n}         Exactly n times         /p{2}sado/      Matches ppsado
{n,}        At least n times        /(m){2,}ala/    Matches mmala, mmmala...
{m,n}       Between m and n times   /tal{1,3}a/     Matches tala, talla, tallla

Escape  Meaning                                     Example       Result
\b      Start or end of a word                      /\bver\b/     Finds ver in "ver de", but not in "verde"
\B      Position that is not a word boundary        /\Bver\B/     Matches ver in "Valverde", but not in "verde"
\d      A digit                                     /[A-Z]\d/     Matches "A4"
\D      Any character that is not a digit           /[A-Z]\D/     Fails on "A4"
\0      Null character
\t      ASCII character 9 (tab)
\f      Form feed
\n      Line feed
\w      Any alphanumeric or _, [a-zA-Z0-9_]         /\w+/         Finds frase in "frase.", but not the period (.)
\W      Opposite of \w, [^a-zA-Z0-9_]               /\W/          Would find only the period (.)
\s      Whitespace character (such as tab)          /\sSi\s/      Finds Si in "Digo Si ", but not in "Digo Sientate"
\S      Opposite of \s
\cX     Control character X                         \cI           The tab character
\oNN    Octal character NN
\xhh    Character with hexadecimal code hh          /\x41/        Finds the A (ASCII hex 41) in "letra A"
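
Several rows of the table can be verified with a short sketch. It assumes Python's re module, where patterns are written as raw strings without the surrounding slashes; note also that for ordinary text strings Python's \w and \d are Unicode-aware, so they accept more than the ASCII sets shown in the table.

```python
import re

# \b and \B: word boundary and its opposite.
whole_word = [bool(re.search(r"\bver\b", s)) for s in ["ver de", "verde"]]
inside_word = [bool(re.search(r"\Bver\B", s)) for s in ["Valverde", "verde"]]

# \d and \D: digit and non-digit.
digit = bool(re.search(r"[A-Z]\d", "A4"))
non_digit = bool(re.search(r"[A-Z]\D", "A4"))

# \w and \W: word character and its opposite.
word_chars = re.findall(r"\w+", "frase.")
non_word = re.findall(r"\W", "frase.")

# \s: whitespace on both sides of Si.
spaced = [bool(re.search(r"\sSi\s", s)) for s in ["Digo Si ", "Digo Sientate"]]

# \xhh: character given by its hexadecimal code.
hex_a = re.search(r"\x41", "letra A").group()

print(whole_word, inside_word)  # [True, False] [True, False]
print(digit, non_digit)         # True False
print(word_chars, non_word)     # ['frase'] ['.']
print(spaced, hex_a)            # [True, False] A
```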

Greedy and Lazy Repetition

The repetition operators or quantifiers are greedy. They will expand the match as far as they
can, and only give back if they must to satisfy the remainder of the regex. The regex <.+>
will match <EM>first</EM> in This is a <EM>first</EM> test .

The reason is that the plus is greedy: it repeats the preceding token as often as possible and
gives back characters only when the rest of the regex would otherwise fail. Like the plus, the
star and the repetition using curly braces are greedy.

Place a question mark after the quantifier to make it lazy. <.+?> will match <EM> in the
above string.

A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly
match an HTML tag without regard to attributes. The negated character class is more
specific than the dot, which helps the regex engine find matches quickly.
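
The three patterns can be compared side by side in a minimal sketch, assuming Python's re module, whose quantifiers are greedy by default and become lazy with a trailing question mark.

```python
import re

text = "This is a <EM>first</EM> test"

greedy = re.findall(r"<.+>", text)       # the plus grabs as much as it can
lazy = re.findall(r"<.+?>", text)        # the lazy plus stops at the first >
negated = re.findall(r"<[^<>]+>", text)  # the class can never run past a bracket

print(greedy)   # ['<EM>first</EM>']
print(lazy)     # ['<EM>', '</EM>']
print(negated)  # ['<EM>', '</EM>']
```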

Watch Out for The Greediness!

Suppose you want to use a regex to match an HTML tag. You know that the input will be a
valid HTML file, so the regular expression does not need to exclude any invalid use of angle
brackets. If it sits between angle brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+> . They will be surprised
when they test it on a string like This is a <EM>first</EM> test . You might expect the regex
to match <EM> and when continuing after that match, </EM> .

But it does not. The regex will match <EM>first</EM> . Obviously not what we wanted. The
reason is that the plus is greedy. That is, the plus causes the regex engine to repeat the
preceding token as often as possible. Only if that causes the entire regex to fail, will the
regex engine backtrack. That is, it will go back to the plus, make it give up the last iteration,
and proceed with the remainder of the regex. Let's take a look inside the regex engine to
see in detail how this works and why this causes our regex to fail. After that, I will present
you with two possible solutions.

Like the plus, the star and the repetition using curly braces are greedy.

Looking Inside The Regex Engine

The first token in the regex is < . This is a literal. As we already know, the first place where it
will match is the first < in the string. The next token is the dot, which matches any character
except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine
will repeat the dot as many times as it can. The dot matches E , so the regex continues to try
to match the dot with the next character. M is matched, and the dot is repeated once
more. The next character is the > . You should see the problem by now. The dot matches the
> , and the engine continues repeating the dot. The dot will match all remaining characters
in the string. The dot fails when the engine has reached the void after the end of the string.
Only at this point does the regex engine continue with the next token: > .

So far, <.+ has matched <EM>first</EM> test and the engine has arrived at the end of the
string. > cannot match here. The engine remembers that the plus has repeated the dot
more often than is required. (Remember that the plus requires the dot to match only once.)
Rather than admitting failure, the engine will backtrack. It will reduce the repetition of the
plus by one, and then continue trying the remainder of the regex.

So the match of .+ is reduced to EM>first</EM> tes . The next token in the regex is still > .
But now the next character in the string is the last t . Again, these cannot match, causing the
engine to backtrack further. The total match so far is reduced to <EM>first</EM> te . But >
still cannot match. So the engine continues backtracking until the match of .+ is reduced to
EM>first</EM . Now, > can match the next character in the string. The last token in the
regex has been matched. The engine reports that <EM>first</EM> has been successfully
matched.

Remember that the regex engine is eager to return a match. It will not continue
backtracking further to see if there is another possible match. It will report the first valid
match it finds. Because of greediness, this is the leftmost longest match.

Laziness Instead of Greediness

The quick fix to this problem is to make the plus lazy instead of greedy. Lazy quantifiers are
sometimes also called "ungreedy" or "reluctant". You can do that by putting a question
mark behind the plus in the regex. You can do the same with the star, the curly braces and
the question mark itself. So our example becomes <.+?> . Let's have another look inside the
regex engine.

Again, < matches the first < in the string. The next token is the dot, this time repeated by a
lazy plus. This tells the regex engine to repeat the dot as few times as possible. The
minimum is one. So the engine matches the dot with E . The requirement has been met, and
the engine continues with > and M . This fails. Again, the engine will backtrack. But this
time, the backtracking will force the lazy plus to expand rather than reduce its reach. So the
match of .+ is expanded to EM , and the engine tries again to continue with > . Now, > is
matched successfully. The last token in the regex has been matched. The engine reports
that <EM> has been successfully matched. That's more like it.

An Alternative to Laziness

In this case, there is a better option than making the plus lazy. We can use a greedy plus and
a negated character class: <[^>]+> . The reason why this is better is because of the
backtracking. When using the lazy plus, the engine has to backtrack for each character in the
HTML tag that it is trying to match. When using the negated character class, no backtracking
occurs at all when the string contains valid HTML code. Backtracking slows down the regex
engine. You will not notice the difference when doing a single search in a text editor. But
you will save plenty of CPU cycles when using such a regex repeatedly in a tight loop in a
script that you are writing, or perhaps in a custom syntax coloring scheme for EditPad Pro.

Finally, remember that this tutorial only talks about regex-directed engines. Text-directed
engines do not backtrack. They do not get the speed penalty, but they also do not support
lazy repetition operators.

Use Round Brackets for Grouping

By placing part of a regular expression inside round brackets or parentheses, you can group
that part of the regular expression together. This allows you to apply a regex operator, e.g. a
repetition operator, to the entire group. I have already used round brackets for this purpose
in previous topics throughout this tutorial.

Note that only round brackets can be used for grouping. Square brackets define a character
class, and curly braces are used by a special repetition operator.

Round Brackets Create a Backreference

Besides grouping part of a regular expression together, round brackets also create a
"backreference". A backreference stores the part of the string matched by the part of the
regular expression inside the parentheses.

That is, unless you use non-capturing parentheses. Remembering part of the regex match in
a backreference slows down the regex engine because it has more work to do. If you do not
use the backreference, you can speed things up by using non-capturing parentheses, at the
expense of making your regular expression slightly harder to read.

The regex Set(Value)? matches Set or SetValue . In the first case, the first backreference will be
empty, because it did not match anything. In the second case, the first backreference will
contain Value .

If you do not use the backreference, you can optimize this regular expression into
Set(?:Value)? . The question mark and the colon after the opening round bracket are the
special syntax that you can use to tell the regex engine that this pair of brackets should not
create a backreference. Note the question mark after the opening bracket is unrelated to
the question mark at the end of the regex. That question mark is the regex operator that
makes the previous token optional. This operator cannot appear after an opening round
bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is
no confusion between the question mark as an operator to make a token optional, and the
question mark as a character to change the properties of a pair of round brackets. The colon
indicates that the change we want to make is to turn off capturing the backreference.
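
A small sketch of the difference between the two kinds of parentheses, assuming Python's re module. One flavor-specific detail to keep in mind: Python reports a group that did not take part in the match as None rather than as an empty string.

```python
import re

capturing = re.compile(r"Set(Value)?")
non_capturing = re.compile(r"Set(?:Value)?")

m_short = capturing.fullmatch("Set")
m_long = capturing.fullmatch("SetValue")

print(m_short.group(1))  # None (the group did not participate)
print(m_long.group(1))   # Value

# The .groups attribute counts capturing groups only.
print(capturing.groups, non_capturing.groups)  # 1 0

# Both patterns accept the same strings.
print(bool(non_capturing.fullmatch("SetValue")))  # True
```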

How to Use Backreferences

Backreferences allow you to reuse part of the regex match. You can reuse it inside the
regular expression (see below), or afterwards. What you can do with it afterwards, depends
on the tool or programming language you are using. The most common usage is in search-
and-replace operations. The replacement text will use a special syntax to allow text matched
by capturing groups to be reinserted. This syntax differs greatly between various tools and
languages, far more than the regex syntax does. Please check the replacement text
reference for details.

Using Backreferences in The Regular Expression

Backreferences can not only be used after a match has been found, but also during the
match. Suppose you want to match a pair of opening and closing HTML tags, and the text in
between. By putting the opening tag into a backreference, we can reuse the name of the tag
for the closing tag. Here's how: <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> . This regex contains only one
pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first
backreference. This backreference is reused with \1 (backslash one). The / before it is
simply the forward slash in the closing HTML tag that we are trying to match.
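
The paired-tag regex can be tried out in a minimal sketch, assuming Python's re module. The sample text is made up for illustration and uses uppercase tags so that [A-Z] applies without a case-insensitive flag.

```python
import re

tag_pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")

# Illustrative input: two proper pairs and one mismatched pair.
text = "Some <B>bold</B> and <I>italic</I> text, plus a <B>mismatch</I>"

# group(1) is the tag name that \1 had to repeat in the closing tag.
pairs = [(m.group(1), m.group(0)) for m in tag_pair.finditer(text)]
print(pairs)  # [('B', '<B>bold</B>'), ('I', '<I>italic</I>')]
```

The mismatched pair is skipped because \1 holds B and the closing tag offers I.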

To figure out the number of a particular backreference, scan the regular expression from
left to right and count the opening round brackets. The first bracket starts backreference
number one, the second number two, etc. Non-capturing parentheses are not counted. This
fact means that non-capturing parentheses have another benefit: you can insert them into a
regular expression without changing the numbers assigned to the backreferences. This can
be very useful when modifying a complex regular expression.

You can reuse the same backreference more than once. ([a-c])x\1x\1 will match axaxa , bxbxb
and cxcxc .
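
Group numbering and repeated use of one backreference can be sketched as follows, assuming Python's re module; the pattern (?:x)(a)(b) and the word axbxa are illustrative additions.

```python
import re

# The same backreference may be used several times.
words = ["axaxa", "bxbxb", "cxcxc", "axbxa"]
repeated = [bool(re.fullmatch(r"([a-c])x\1x\1", w)) for w in words]
print(repeated)  # [True, True, True, False]

# Non-capturing parentheses are skipped when the groups are numbered.
numbered = re.fullmatch(r"(?:x)(a)(b)", "xab")
print(numbered.groups())  # ('a', 'b')
print(numbered.group(2))  # b
```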

Looking Inside The Regex Engine

Let's see how the regex engine applies the above regex to the string Testing <B><I>bold
italic</I></B> text . The first token in the regex is the literal < . The regex engine will traverse
the string until it can match at the first < in the string. The next token is [A-Z] . The regex
engine also takes note that it is now inside the first pair of capturing parentheses. [A-Z]
matches B . The engine advances to [A-Z0-9] and > . This match fails. However, because of
the star, that's perfectly fine. The position in the string remains at > . The position in the
regex is advanced to [^>] .

This step crosses the closing bracket of the first pair of capturing parentheses. This prompts
the regex engine to store what was matched inside them into the first backreference. In this
case, B is stored.

After storing the backreference, the engine proceeds with the match attempt. [^>] does not
match > . Again, because of another star, this is not a problem. The position in the string
remains at > , and position in the regex is advanced to > . These obviously match. The next
token is a dot, repeated by a lazy star. Because of the laziness, the regex engine will initially
skip this token, taking note that it should backtrack in case the remainder of the regex fails.

The engine has now arrived at the second < in the regex, and the second < in the string.
These match. The next token is / . This does not match I , and the engine is forced to
backtrack to the dot. The dot matches the second < in the string. The star is still lazy, so the
engine again takes note of the available backtracking position and advances to < and I .
These do not match, so the engine again backtracks.

The backtracking continues until the dot has consumed <I>bold italic . At this point, < matches
the third < in the string, and the next token is / which matches / . The next token is \1 .
Note that the token is the backreference, and not B . The engine does not substitute the
backreference in the regular expression. Every time the engine arrives at the backreference,
it will read the value that was stored. This means that if the engine had backtracked beyond
the first pair of capturing parentheses before arriving the second time at \1 , the new value
stored in the first backreference would be used. But this did not happen here, so B it is. This
fails to match at I , so the engine backtracks again, and the dot consumes the third < in the
string.

Backtracking continues again until the dot has consumed <I>bold italic</I> . At this point, <
matches < and / matches / . The engine arrives again at \1 . The backreference still holds B .
B matches B . The last token in the regex, > matches > . A complete match has been found:
<B><I>bold italic</I></B> .

Backtracking Into Capturing Groups

You may have wondered about the word boundary \b in the <([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
mentioned above. This is to make sure the regex won't match incorrectly paired tags such as
<boo>bold</b> . You may think that cannot happen because the capturing group matches boo
which causes \1 to try to match the same, and fail. That is indeed what happens. But then
the regex engine backtracks.

Let's take the regex <([A-Z][A-Z0-9]*)[^>]*>.*?</\1> without the word boundary and look inside
the regex engine at the point where \1 fails the first time. First, .*? continues to expand
until it has reached the end of the string, and </\1> has failed to match each time .*?
matched one more character.

Then the regex engine backtracks into the capturing group. [A-Z0-9]* has matched oo , but
would just as happily match o or nothing at all. When backtracking, [A-Z0-9]* is forced to
give up one character. The regex engine continues, exiting the capturing group a second
time. Since [A-Z][A-Z0-9]* has now matched bo , that is what is stored into the capturing
group, overwriting boo that was stored before. [^>]* matches the second o in the opening
tag. >.*?</ matches >bold< . \1 fails again.

The regex engine does all the same backtracking once more, until [A-Z0-9]* is forced to give
up another character, causing it to match nothing, which the star allows. The capturing
group now stores just b . [^>]* now matches oo . >.*?</ once again matches >bold< . \1 now
succeeds, as does > and an overall match is found. But not the one we wanted.

There are several solutions to this. One is to use the word boundary. When [A-Z0-9]*
backtracks the first time, reducing the capturing group to bo , \b fails to match between o
and o . This forces [A-Z0-9]* to backtrack again immediately. The capturing group is reduced
to b and the word boundary fails between b and o . There are no further backtracking
positions, so the whole match attempt fails.

The reason we need the word boundary is that we're using [^>]* to skip over any attributes
in the tag. If your paired tags never have any attributes, you can leave that out, and use
<([A-Z][A-Z0-9]*)>.*?</\1> . Each time [A-Z0-9]* backtracks, the > that follows it will fail to
match, quickly ending the match attempt.
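
The effect of the word boundary can be observed in a short sketch, assuming Python's re module. The re.IGNORECASE flag is an assumption made here so that [A-Z] accepts the lowercase tags of the example.

```python
import re

flags = re.IGNORECASE  # assumed, so that [A-Z] also matches lowercase tags
without_boundary = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>.*?</\1>", flags)
with_boundary = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>", flags)
no_attributes = re.compile(r"<([A-Z][A-Z0-9]*)>.*?</\1>", flags)

sample = "<boo>bold</b>"

# Without \b the engine backtracks into the group until it holds just "b".
m = without_boundary.search(sample)
print(m.group(0), m.group(1))  # <boo>bold</b> b

# With \b, or with no room for attributes, the shortened capture is rejected.
print(with_boundary.search(sample))  # None
print(no_attributes.search(sample))  # None
```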

If you didn't expect the regex engine to backtrack into capturing groups, you can use an
atomic group. The regex engine always backtracks into capturing groups, and never
captures atomic groups. You can put the capturing group inside an atomic group to get an
atomic capturing group: (?>(atomic capture)) . In this case, we can put the whole opening tag
into the atomic group: (?><([A-Z][A-Z0-9]*)[^>]*>).*?</\1> . The tutorial section on atomic
grouping has all the details.

Backreferences to Failed Groups

The previous section applies to all regex flavors, except those few that don't support
capturing groups at all. Flavors behave differently when you start doing things that don't fit
the "match the text matched by a previous capturing group" job description.

There is a difference between a backreference to a capturing group that matched nothing,
and one to a capturing group that did not participate in the match at all. The regex (q?)b\1
will match b . q? is optional and matches nothing, causing (q?) to successfully match and
capture nothing. b matches b and \1 successfully matches the nothing captured by the
group.

The regex (q)?b\1 however will fail to match b . (q) fails to match at all, so the group never
gets to capture anything at all. Because the whole group is optional, the engine does
proceed to match b . However, the engine now arrives at \1 which references a group that
did not participate in the match attempt at all. This causes the backreference to fail to
match at all, mimicking the result of the group. Since there's no ? making \1 optional, the
overall match attempt fails.

The only exception is JavaScript. According to the official ECMA standard, a backreference to
a non-participating capturing group must successfully match nothing just like a
backreference to a participating group that captured nothing does. In other words, in
JavaScript, (q?)b\1 and (q)?b\1 both match b .
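
The two cases can be told apart in a minimal sketch, assuming Python's re module, which in my experience behaves like the majority of flavors described here rather than like JavaScript.

```python
import re

# The group takes part in the match and captures an empty string.
empty_capture = re.fullmatch(r"(q?)b\1", "b")
print(repr(empty_capture.group(1)))  # ''

# The group is skipped entirely, so the backreference has nothing to refer to.
non_participating = re.fullmatch(r"(q)?b\1", "b")
print(non_participating)  # None

# When the group does participate, the same regex works as expected.
print(bool(re.fullmatch(r"(q)?b\1", "qbq")))  # True
```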

Forward References and Invalid References

Modern flavors, notably JGsoft, .NET, Java, Perl, PCRE and Ruby allow forward references.
That is: you can use a backreference to a group that appears later in the regex. Forward
references are obviously only useful if they're inside a repeated group. Then there can be
situations in which the regex engine evaluates the backreference after the group has
already matched. Before the group is attempted, the backreference will fail like a
backreference to a failed group does.

If forward references are supported, the regex (\2two|(one))+ will match oneonetwo . At the
start of the string, \2 fails. Trying the other alternative, one is matched by the second
capturing group, and subsequently by the first group. The first group is then repeated. This
time, \2 matches one as captured by the second group. two then matches two . With two
repetitions of the first group, the regex has matched the whole subject string.

A nested reference is a backreference inside the capturing group that it references, e.g.
(\1two|(one))+ . This regex will give exactly the same behavior with flavors that support
forward references. Some flavors that don't support forward references do support nested
references. This includes JavaScript.

With all other flavors, using a backreference before its group in the regular expression is the
same as using a backreference to a group that doesn't exist at all. All flavors discussed in this
tutorial, except JavaScript and Ruby, treat backreferences to undefined groups as an error.
In JavaScript and Ruby, they always result in a zero-width match. For Ruby this is a potential
pitfall. In Ruby, (a)(b)?\2 will fail to match a , because \2 references a non-participating
group. But (a)(b)?\7 will match a . For JavaScript this is logical, as backreferences to non-
participating groups do the same. Both regexes will match a .
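
Python is not named in the lists above, so the following is only a sketch of how one additional flavor reacts, assuming Python's re module: as far as I can tell it rejects forward, nested and undefined references when the pattern is compiled. The helper name compiles is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

forward = compiles(r"(\2two|(one))+")  # reference to a later group
nested = compiles(r"(\1two|(one))+")   # reference to the group it sits in
undefined = compiles(r"(a)(b)?\7")     # reference to a group that does not exist
ordinary = compiles(r"(a)(b)?\2")      # ordinary backreference

print(forward, nested, undefined, ordinary)  # False False False True
```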

Repetition and Backreferences

As I mentioned in the above inside look, the regex engine does not permanently substitute
backreferences in the regular expression. It will use the last match saved into the
backreference each time it needs to be used. If a new match is found by capturing
parentheses, the previously saved match is overwritten. There is a clear difference between
([abc]+) and ([abc])+ . Though both successfully match cab , the first regex will put cab into the
first backreference, while the second regex will only store b . That is because in the second
regex, the plus caused the pair of parentheses to repeat three times. The first time, c was
stored. The second time a and the third time b . Each time, the previous value was
overwritten, so b remains.

This also means that ([abc]+)=\1 will match cab=cab , and that ([abc])+=\1 will not. The reason is
that when the engine arrives at \1 , it holds b which fails to match c . Obvious when you
look at a simple example like this one, but a common cause of difficulty with regular
expressions nonetheless. When using backreferences, always double check that you are
really capturing what you want.
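
The difference between repeating inside and outside the parentheses can be checked with a short sketch, assuming Python's re module; the string cab=b is an illustrative addition.

```python
import re

# Quantifier inside the group: the whole run is captured.
whole_run = re.fullmatch(r"([abc]+)", "cab").group(1)
# Quantifier outside the group: each repetition overwrites the capture.
last_only = re.fullmatch(r"([abc])+", "cab").group(1)
print(whole_run, last_only)  # cab b

same_run = re.fullmatch(r"([abc]+)=\1", "cab=cab") is not None
last_char = re.fullmatch(r"([abc])+=\1", "cab=cab") is not None
print(same_run, last_char)  # True False

# The second regex only accepts the last repeated character after the =.
print(bool(re.fullmatch(r"([abc])+=\1", "cab=b")))  # True
```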

Useful Example: Checking for Doubled Words

When editing text, doubled words such as "the the" easily creep in. Using the regex
\b(\w+)\s+\1\b in your text editor, you can easily find them. To delete the second word,
simply type in \1 as the replacement text and click the Replace button.
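
Outside a text editor, the same search and replace can be sketched in code. This assumes Python's re module, where the replacement text also uses \1 for the first group; the sample sentence is illustrative.

```python
import re

doubled = re.compile(r"\b(\w+)\s+\1\b")

text = "Paris in the the spring"

# Locate the doubled words, then keep only the first copy of each.
found = [m.group(0) for m in doubled.finditer(text)]
fixed = doubled.sub(r"\1", text)

print(found)  # ['the the']
print(fixed)  # Paris in the spring
```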

Parentheses and Backreferences Cannot Be Used Inside Character Classes

Round brackets cannot be used inside character classes, at least not as metacharacters.
When you put a round bracket in a character class, it is treated as a literal character. So the
regex [(a)b] matches a , b , ( and ) .

Backreferences also cannot be used inside a character class. The \1 in regex like (a)[\1b] will
be interpreted as an octal escape in most regex flavors. So this regex will match an a
followed by either \x01 or a b .
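
Both points can be illustrated with a minimal sketch, assuming Python's re module, which as far as I know is among the flavors that read \1 inside a character class as an octal escape.

```python
import re

# Round brackets inside a character class are plain literal characters.
literal_parens = re.findall(r"[(a)b]", "(a)b c")
print(literal_parens)  # ['(', 'a', ')', 'b']

# Inside the class, \1 means the character with code 1, not the first group.
octal_in_class = re.compile(r"(a)[\1b]")
results = [bool(octal_in_class.fullmatch(s)) for s in ["a\x01", "ab", "aa"]]
print(results)  # [True, True, False]
```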

Application

Now that we know how regular expressions work, we can apply them to our process. Let's see
how this would work in the test software:

Replace:

<\!\-\-:en\-\->(.+?)<\!\-\-:\-\->

with:

<english>\1</english>

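The same replacement can be sketched in code, assuming Python's re module instead of the test software. The pattern and the replacement text are copied from above; the input string is a made-up example.

```python
import re

# Pattern and replacement exactly as given above.
pattern = re.compile(r"<\!\-\-:en\-\->(.+?)<\!\-\-:\-\->")
replacement = r"<english>\1</english>"

# Hypothetical input with two marked sections.
source = "<!--:en-->Hello<!--:--> and <!--:en-->Goodbye<!--:-->"

result = pattern.sub(replacement, source)
print(result)  # <english>Hello</english> and <english>Goodbye</english>
```

The lazy (.+?) keeps each marked section separate; a greedy (.+) would swallow everything from the first opening marker to the last closing one.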