Regex basics

What is a regex?

A regex (regular expression) is a pattern that you want to look for in strings. Each place where the pattern is found is called a match.

Simple regexes

The simplest example is looking for one specific character in a sentence. Consider the simple regex r, which matches every letter r in the sentence.

The result of the matching process is a set of match-objects (one for each match). These objects naturally contain the matched substring, but also extra information like its position in the original string.
 
In the next example, the regex (or pattern) is two characters long:
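As a rough sketch of a two-character pattern, here is how it could look in Python (the re module is introduced in the next section). The pattern he is only an assumed example, and the sentence is borrowed from further down in this tutorial:

```python
import re

# Hypothetical two-character pattern, chosen only for illustration.
pattern = re.compile(r"he")
sentence = "The frog jumped in the water."

# Each match-object carries the matched substring and its position.
matches = [(m.group(), m.span()) for m in pattern.finditer(sentence)]
print(matches)  # [('he', (1, 3)), ('he', (20, 22))]
```

Both characters have to appear next to each other, in that order, for a match to be reported.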

 

Regexes in Python

Regexes are not part of the Python language itself, but they are available through the re module of the standard library, which you have to import. When creating and using regexes, you typically don’t want Python to interpret the backslashes in your strings as escape sequences (such as \n for a newline) before the re module sees them. Therefore you need to use raw strings! The prefix r does the job:

    >>> myStr = "This is a\nline"     # Just an ordinary Python string
    >>> print(myStr)
    This is a
    line
    >>> print(myStr[9])               # Newline '\n' is printed
    
    
    >>> 
    >>> myStr = r"This is a\nline"    # A raw string
    >>> print(myStr)
    This is a\nline
    >>> print(myStr[9])               # Character '\' is printed
    \
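To see why this matters for regexes, here is a minimal sketch. The token \b (a word boundary, not covered yet) and the sample text are assumptions picked for illustration; in an ordinary string Python turns \b into a backspace character before the re module ever sees it:

```python
import re

ordinary = "a\nb"   # backslash-n becomes one newline character
raw = r"a\nb"       # backslash and 'n' stay two separate characters

print(len(ordinary))  # 3
print(len(raw))       # 4

text = "The frog jumped"

# Ordinary string: the pattern contains two backspace characters.
without_raw = re.search("\bfrog\b", text)
# Raw string: the re module receives \b and treats it as a word boundary.
with_raw = re.search(r"\bfrog\b", text)

print(without_raw)       # None
print(with_raw.group())  # frog
```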

 
Now we are ready for a simple regex example in Python. Please note the several steps being taken:

  1. import re
    First you’ve got to import the regex module.
     
  2. p = re.compile(r"r")
    You can represent the regex as a (raw) string. The re module compiles it into a real regex object. This compilation has to be done only once. Once you have your regex object, you can use it as often as you like to search in many strings.
     
  3. m_all = p.finditer("The frog jumped in the water.")
    Use the regex object to search in the string “The frog …”. While searching, we say that the matching engine is running. Think of it as a piece of software doing a (sometimes heavy) task. When ready, a set of match-objects is returned as an iterable object.
     
  4. for m in m_all:
        print(m)

    Print out the match-objects.

Put together, the complete session looks like this (output as printed by Python 3.7 or later):

    >>> import re
    >>> p = re.compile(r"r")
    >>> m_all = p.finditer("The frog jumped in the water.")
    >>> for m in m_all:
    ...     print(m)
    ...
    <re.Match object; span=(5, 6), match='r'>
    <re.Match object; span=(27, 28), match='r'>
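Step 2 says that a compiled regex can be reused on many strings. A minimal sketch of that idea follows; the two extra sentences are made up for illustration:

```python
import re

p = re.compile(r"r")  # compiled once

# The first sentence comes from the example above; the others are hypothetical.
sentences = [
    "The frog jumped in the water.",
    "No match in this one.",
    "Three rivers",
]

# Reuse the same regex object for every sentence.
spans = {s: [m.span() for m in p.finditer(s)] for s in sentences}

for sentence, found in spans.items():
    print(sentence, found)
```

A sentence without any r simply yields an empty list, because finditer(..) then has nothing to iterate over.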

From this, we can derive the following definitions:

Compiled regex-object / Pattern : The result of compiling a regex. Once compiled, the regex/pattern object can be used to analyze several strings.
Matching engine : The matching engine is the software doing the actual matching work.
Match-objects : The typical output from the matching engine is a set of match-objects, possibly returned through an iterator.

 
So far we have only used the finditer(..) function on our pattern objects. But Python has more to offer. The table below gives an overview of the functions you can use on a regex/pattern object. Not everything in this table has been covered yet, but I’ve decided to make the table complete, so you can come back to it and fully understand it when you’ve finished the regex tutorial.

    Functions for a regex/pattern object:
     
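  • p.match(..)   Looks for a match only at the beginning of the string. Returns one match-object, or None if there is no match there.
     
  • p.search(..)   Looks everywhere in the string. Returns the first match-object found, or None if there is no match. Cannot return more than one match-object.
     
  • p.findall(..)   Returns all the matches as a list of strings. This function doesn’t return actual match-objects! If the regex contains capturing groups, the list holds the contents of the groups instead of the whole matches (as tuples when there is more than one group).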
     
  • p.finditer(..)   Returns all matches as match-objects in an iterable.
     
  • p.split(..)   Splits the string apart wherever the regex matches, and returns a list of the pieces. The actual matches do not appear in the list, unless capturing parentheses were used in the regex. More on capturing and non-capturing parentheses in the chapter about character groups.
     
  • p.sub(replacement, ..)   Returns a new string in which all the matches are replaced by replacement. Usually the replacement is just a string. But it can also be a function that returns a string. Such a function is given the match-object automatically when called.
    Within the replacement parameter, one can use backreferences to groups. The reference can use the group number, like \1 or \g<1>, but it can also use the group name like \g<name>.
     
  • p.subn(replacement, ..)   Similar to the previous function, but it returns a tuple (new_string, number_of_replacements), so you also get the number of replacements that took place.
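The functions in this table can be tried out with a short sketch like the one below. The sample sentence and the small patterns are assumptions chosen for illustration only:

```python
import re

p = re.compile(r"o")
s = "The frog jumped in the pond"   # hypothetical sample sentence

no_match = p.match(s)                          # None: s does not start with 'o'
first = p.search(s)                            # match-object for the first 'o' only
as_strings = p.findall(s)                      # plain strings, no match-objects
as_spans = [m.span() for m in p.finditer(s)]   # one match-object per match

# split: the separators disappear, unless the regex has capturing parentheses.
pieces = re.compile(r"-").split("a-b-c")
pieces_with_sep = re.compile(r"(-)").split("a-b-c")

# sub: the replacement can be a string, a function, or use backreferences.
replaced = p.sub("0", s)
upper = p.sub(lambda m: m.group().upper(), s)
swapped = re.compile(r"(a)(b)").sub(r"\2\1", "ab ab")

# subn: like sub, but also reports the number of replacements.
replaced_n = p.subn("0", s)

print(no_match)         # None
print(first.span())     # (6, 7)
print(as_strings)       # ['o', 'o']
print(as_spans)         # [(6, 7), (24, 25)]
print(pieces)           # ['a', 'b', 'c']
print(pieces_with_sep)  # ['a', '-', 'b', '-', 'c']
print(replaced)         # The fr0g jumped in the p0nd
print(upper)            # The frOg jumped in the pOnd
print(swapped)          # ba ba
print(replaced_n)       # ('The fr0g jumped in the p0nd', 2)
```

Because match(..) and search(..) return None when nothing is found, it is wise to check the result before calling functions on it.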

 
So far, we haven’t done much with our match-objects. Python offers several functions to extract information from them:

    Functions for a match-object:
     
  • m.group(n)   Each group in a regular expression has a number. This function returns the substring that corresponds to group n. When n = 0 (the default), it returns the whole matched substring. More on groups in the corresponding chapter.
     
  • m.groups()   Returns a tuple with the substrings of all the capturing groups. The tuple starts from group 1.
     
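  • m.start()   Returns the start position of the match in the searched string.
     
  • m.end()   Returns the end position, which is the index just past the last matched character.
     
  • m.span()   Returns the tuple (start, end).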
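A small sketch of these functions in action. The pattern (fr)(og), with its two capturing groups, is an assumption made up for this illustration:

```python
import re

s = "The frog jumped in the water."
p = re.compile(r"(fr)(og)")   # hypothetical pattern with two groups
m = p.search(s)

print(m.group(0))   # frog  (the whole match)
print(m.group(1))   # fr
print(m.group(2))   # og
print(m.groups())   # ('fr', 'og')
print(m.start())    # 4
print(m.end())      # 8
print(m.span())     # (4, 8)

# start and end can be used directly to slice the original string.
print(s[m.start():m.end()])   # frog
```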