Python tutorial-Regular Expressions
Regular Expressions
A regular expression is a pattern that matches a piece of text, and it is a powerful feature of many programming languages. Python provides the re module, which contains many functions for working with regular expressions. Those functions will be discussed later.
Creating patterns to match text
A pattern to match any one character at the beginning (except a newline):
The dot (.) is a wildcard that matches any single character except a newline. For example, the pattern '.ython' matches any text that adds exactly one character in front of 'ython' (e.g. Python, Jython, Lython, Vython…).
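As a minimal sketch of the dot wildcard, the snippet below checks a few made-up sample words against '.ython'. It assumes re.fullmatch semantics, so the whole word has to fit the pattern.

```python
import re

# '.' stands for exactly one character, and never for a newline.
pattern = re.compile(r'.ython')

# Sample words chosen for illustration only.
words = ['Python', 'Jython', 'ython', 'Cpython', '\nython']

# fullmatch() succeeds only if the entire string matches the pattern.
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['Python', 'Jython']
```

'ython' fails because the dot must consume one character, 'Cpython' fails because it adds two characters, and '\nython' fails because the dot does not match a newline.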
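A pattern to match only B or C at the beginning:
This pattern can be written as '[BC]an'. Square brackets define a set of allowed characters, so no pipe is needed between them; inside brackets, | is treated as a literal pipe character.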
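The following sketch, using made-up sample words, shows how a character set behaves. Inside square brackets every character is a literal alternative, so [BC] means "B or C", while a | placed inside the brackets is matched as an ordinary pipe character.

```python
import re

# A character set: exactly one character, which must be B or C.
pattern = re.compile(r'[BC]an')

words = ['Ban', 'Can', 'Dan', '|an']
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['Ban', 'Can']

# With a pipe inside the brackets, the pipe itself becomes an allowed character.
pipe_pattern = re.compile(r'[B|C]an')
pipe_matches = [w for w in words if pipe_pattern.fullmatch(w)]
print(pipe_matches)  # ['Ban', 'Can', '|an']
```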
A pattern to match any character from a to z, A to Z, or 0 to 9:
Write '[a-z]' to match a lowercase letter from a to z, '[A-Z]' to match an uppercase letter from A to Z, and '[0-9]' to match a digit from 0 to 9. Ranges can be combined: '[a-zA-Z0-9]' matches any single lowercase letter, uppercase letter, or digit.
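To illustrate the ranges, here is a small sketch that uses re.findall on a made-up sample string to collect every character that falls into each range.

```python
import re

# Sample text chosen for illustration only.
text = 'Room 42B, floor 3'

lower = re.findall(r'[a-z]', text)        # lowercase letters only
upper = re.findall(r'[A-Z]', text)        # uppercase letters only
digits = re.findall(r'[0-9]', text)       # digits only
alnum = re.findall(r'[a-zA-Z0-9]', text)  # letters and digits, no spaces or commas

print(''.join(lower))   # oomfloor
print(upper)            # ['R', 'B']
print(digits)           # ['4', '2', '3']
print(''.join(alnum))   # Room42Bfloor3
```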
A pattern to match any character except certain characters:
A caret (^) placed directly after the opening bracket negates the set. For example, '[^abc]' matches any single character except a, b, or c.
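A short sketch of a negated set, run on a made-up sample string: every character other than a, b, or c is collected.

```python
import re

# '[^abc]' matches one character that is NOT a, b, or c.
others = re.findall(r'[^abc]', 'a1b2c3d')
print(others)  # ['1', '2', '3', 'd']
```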
A pattern to match sub-strings:
To match a sub-string in a text, write the sub-string itself as the pattern (e.g. 'Python'). To match either of two alternative sub-strings, separate them with the pipe character (|) (e.g. 'Python|Jython').
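As a rough sketch of alternation, the snippet below searches a made-up sentence for either of the two sub-strings and lists the hits in the order they appear.

```python
import re

# The pipe separates two complete alternatives.
pattern = re.compile(r'Python|Jython')

# Sample sentence chosen for illustration only.
text = 'Jython runs on the JVM; Python is the reference language.'

found = pattern.findall(text)
print(found)  # ['Jython', 'Python']
```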
A pattern to match characters that are optional:
To make a character or sub-pattern optional, add a question mark (?) after it. For example, to match strings such as 'http://www.google.com', 'www.google.com', 'http://google.com', and 'google.com', you can write: r'(http://)?(www\.)?google\.com'
Note: We use \ to prevent the dot from being treated as a wildcard character, and the r prefix to make a raw string, which reduces the number of backslashes. Without r, the pattern would be written as '(http://)?(www\\.)?google\\.com'. A sub-pattern that should be optional as a whole needs to be put in parentheses.
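The sketch below tries the optional sub-patterns against a few candidate strings. The candidates are made up for illustration, and the sketch escapes every literal dot, including the one before 'com', so that it cannot act as a wildcard.

```python
import re

# Each parenthesised group followed by ? may appear once or not at all.
pattern = re.compile(r'(http://)?(www\.)?google\.com')

candidates = [
    'http://www.google.com',
    'www.google.com',
    'http://google.com',
    'google.com',
    'ftp://google.com',   # the ftp:// prefix is not part of the pattern
]
matched = [c for c in candidates if pattern.fullmatch(c)]
print(matched)
# ['http://www.google.com', 'www.google.com', 'http://google.com', 'google.com']

# A raw string and a doubly escaped normal string describe the same pattern.
same = r'(www\.)?' == '(www\\.)?'
print(same)  # True
```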
A pattern to match a sub-string at the beginning of a text:
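Sometimes you want to match only a sub-string at the beginning of a text. For example, to match 'http:' at the start of 'http://www.worldbestlearningcenter.com', you could write the pattern 'ht+p:' and apply it with re.match, which only looks at the start of the string. To anchor the pattern itself to the beginning, prefix it with a caret: '^ht+p:'.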
Note: + matches the preceding character one or more times. Therefore the pattern 'ht+p' matches not only 'http' but also 'htp' and 'httttttttttp'.
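A minimal sketch of both ideas, assuming the pattern is applied with re.match (which only tries the start of the string) and compared with re.search (which scans the whole string). The second sample string is made up for illustration.

```python
import re

pattern = re.compile(r'ht+p:')
url = 'http://www.worldbestlearningcenter.com'

# match() succeeds only if the pattern fits at position 0.
start = pattern.match(url)
print(start.group())  # http:

# The same sub-string later in the text is ignored by match() ...
sentence = 'visit http://www.worldbestlearningcenter.com'
print(pattern.match(sentence))  # None
# ... but found by search(), which is not tied to the beginning.
print(pattern.search(sentence).group())  # http:

# '+' allows one or more t characters, but not zero.
many_t = pattern.match('httttp://x').group()
no_t = pattern.match('hp://x')
print(many_t)  # httttp:
print(no_t)    # None
```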