Regular Expressions

 A Regular Expression (RegEx) is a sequence of characters that defines a
search pattern.
 Regular Expressions are used in programming languages to match text
patterns.
 It is possible to check whether a text or a string matches a regular expression.
 A great thing about regular expressions: the core syntax is largely the same
across programming and scripting languages (e.g. Python, Perl, Java), although details differ between engines.
 Python has a module named re to work with RegEx.
Metacharacters
 Metacharacters are characters that are interpreted in a special way by a RegEx engine.
Here's a list of metacharacters:
[] . ^ $ * + ? {} () \ |
[] Square brackets
 Square brackets specify a set of characters you wish to match.
 Specify a range of characters using - inside square brackets.
[a-e] is the same as [abcde]
[1-4] is the same as [1234]
[0-39] is the same as [01239]

 You can complement (invert) the character set by using the caret ^ symbol at the start of a square bracket.
[^abc] means any character except a or b or c
[^0-9] means any non-digit character
 [Pp]ython
Match "Python" or "python"
 rub[ye]
Match "ruby" or "rube"
 [aeiou]
Match any one lowercase vowel
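The character sets above can be tried out with re.findall(). This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Each pattern below is one of the character sets described above.
either_case = re.findall(r"[Pp]ython", "Python and python")
ruby_or_rube = re.findall(r"rub[ye]", "ruby rube rubx")
vowels = re.findall(r"[aeiou]", "regex")
non_digits = re.findall(r"[^0-9]", "a1b2")
zero_to_three_or_nine = re.findall(r"[0-39]", "0123456789")

print(either_case)            # ['Python', 'python']
print(ruby_or_rube)           # ['ruby', 'rube']
print(vowels)                 # ['e', 'e']
print(non_digits)             # ['a', 'b']
print(zero_to_three_or_nine)  # ['0', '1', '2', '3', '9']
```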
. Period
 A period matches any single character (except newline '\n').
^ Caret
 The caret symbol ^ is used to check if a string starts with a certain
character.
$ Dollar
 The dollar symbol $ is used to check if a string ends with a certain
character.
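A short sketch of the period, caret and dollar metacharacters. The sample strings are assumptions chosen only to show the behaviour.

```python
import re

text = "The rain in Spain"

# "." matches any single character except a newline, so "c\nt" is skipped.
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")

# "^" anchors the pattern at the start of the string, "$" at the end.
starts_with_the = re.search(r"^The", text) is not None
starts_with_rain = re.search(r"^rain", text) is not None
ends_with_spain = re.search(r"Spain$", text) is not None

print(dot_matches)       # ['cat', 'cot', 'cut']
print(starts_with_the)   # True
print(starts_with_rain)  # False
print(ends_with_spain)   # True
```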
* Star
 The star symbol * matches zero or more occurrences of the pattern to its
left.
+ Plus
 The plus symbol + matches one or more occurrences of the pattern
to its left.
? Question Mark
 The question mark symbol ? matches zero or one occurrence of the
pattern to its left.
{} Braces
 Consider the quantifier {n,m}.
 This means at least n, and at most m, repetitions of the pattern to its left.
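The four quantifiers can be compared side by side. The sketch below uses re.fullmatch() so that the whole string has to fit the pattern; the helper name matches() and the test words are hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

words = ["mn", "man", "maaan"]

star = [matches(r"ma*n", w) for w in words]      # zero or more 'a'
plus = [matches(r"ma+n", w) for w in words]      # one or more 'a'
optional = [matches(r"ma?n", w) for w in words]  # zero or one 'a'
braces = [matches(r"a{2,3}", w) for w in ["a", "aa", "aaa", "aaaa"]]

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
print(braces)    # [False, True, True, False]
```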
| Alternation
 Vertical bar | is used for alternation (or operator).
() Group
 Parentheses () are used to group sub-patterns.
 For example, (a|b|c)xz matches any string that contains
either a or b or c followed by xz
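A small sketch of why the parentheses matter in (a|b|c)xz. Without them the alternation splits the whole pattern into a, b, or cxz. The sample strings are assumptions for illustration.

```python
import re

# With a group: one of a, b, c and then "xz".
grouped = re.search(r"(a|b|c)xz", "ab cxz")
grouped_text = grouped.group()    # the whole match
grouped_letter = grouped.group(1) # the part matched inside ( )

# Without a group: "a" alone, or "b" alone, or "cxz".
ungrouped_text = re.search(r"a|b|cxz", "ab cxz").group()

# findall() returns the group contents when the pattern has one group.
letters = re.findall(r"(a|b|c)xz", "axz cabxz")

print(grouped_text)    # cxz
print(grouped_letter)  # c
print(ungrouped_text)  # a
print(letters)         # ['a', 'b']
```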
\ Backslash
 Backslash \ is used to escape various characters, including all
metacharacters.
 For example,
\$a matches if a string contains $ followed by a.
Here, $ is not interpreted by a RegEx engine in a special way.
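The effect of escaping can be checked directly. This is a minimal sketch with a made-up sample string; re.escape() is the standard library helper that adds the backslashes for you.

```python
import re

text = "cost: $a"

# Escaped: "$" is treated as a literal dollar sign.
escaped = re.search(r"\$a", text)

# Unescaped: "$" means end of string, and nothing can follow it, so no match.
unescaped = re.search(r"$a", text)

# re.escape() builds the escaped pattern from a literal string.
auto_escaped = re.escape("$a")

print(escaped.group())  # $a
print(unescaped)        # None
print(auto_escaped)     # \$a
```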
search() vs. match()
 match() checks for a match only at the beginning of the string.
 search() checks for a match anywhere in the string.
 Ex:
import re
x = re.search("cat","A cat and a rat can't be friends.")
print (x)
y = re.search("dog","A cat and a rat can't be friends.")
print (y)
x = re.match("cat","cat and a rat can't be friends.")
print (x)
y = re.match("cat","A cat and a rat can't be friends.")
print (y)

Output:
<re.Match object; span=(2, 5), match='cat'>
None
<re.Match object; span=(0, 3), match='cat'>
None
 Ex:
import re
if re.search("cat", "A cat and a rat can't be friends."):
    print("Cat has been found.")
else:
    print("No cat has been found.")

 Output:
Cat has been found.
group() and groups()
Example:
import re

string = '39801 356, 2102 1111'

# Three-digit number followed by a space followed by a two-digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object (or None if nothing matched).
match = re.search(pattern, string)
find = re.findall(pattern, string)

if match:
    print(match.group())
    print(match.group(1))
    print(match.groups())
else:
    print("pattern not found")

print(find)
Output:
801 35
801
('801', '35')
[('801', '35'), ('102', '11')]
Using r prefix before RegEx
 When the r or R prefix is used before a string literal, it means a
raw string. For example, '\n' is a newline whereas r'\n' is
two characters: a backslash \ followed by n.
 Backslash \ is used to escape various characters, including all
metacharacters. With the r prefix, Python leaves each \ in the string as it is,
so the backslash reaches the RegEx engine unchanged.
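A quick way to see the difference is to compare string lengths and to try \b, which is a backspace character in a normal string but a word boundary for the RegEx engine. The sample text is an assumption for illustration.

```python
import re

normal_len = len("\n")   # one character: a newline
raw_len = len(r"\n")     # two characters: backslash and n

text = "cat concat cat"

# Normal string: "\b" becomes a backspace before the engine sees it.
without_raw = re.findall("\bcat\b", text)

# Raw string: the engine receives \b and treats it as a word boundary.
with_raw = re.findall(r"\bcat\b", text)

print(normal_len, raw_len)  # 1 2
print(without_raw)          # []
print(with_raw)             # ['cat', 'cat']
```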
import re
x = re.search(r'^A and',"A and a rat can't be friends.")
print ('output 1:\n',x.group())
x = re.search(r'(^A) cat (a)nd',"A cat and a rat can't be friends.")
print ('output 2:\n',x.group())
x = re.search(r'(^A) (a)nd',"A and a rat can't be friends.")
print ('output 3:\n',x.group(1))

Output:
output 1:
A and
output 2:
A cat and
output 3:
A
 Ex:
import re
x = re.search(r'(^A) (and)',"A and a rat can't be friends.")
print ('output 4:\n',x.group(2))
x = re.search(r'(^A) ([a-z])nd',"A and a rat can't be friends.")
print ('output 5:\n',x.group(2))
x = re.search(r'(^A) (and)',"A and a rat can't be friends.")
print ('output 6:\n',x.groups())
Output:

output 4:
and
output 5:
a
output 6:
('A', 'and')
split() Function
 The split() function returns a list where the string has been split at
each match:

import re

txt = "The rain in Spain"


x = re.split(r"\s", txt)
print(x)

Output:
['The', 'rain', 'in', 'Spain']
sub() Function
 The sub() function replaces the matches with the text of your choice:

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

Output:
The9rain9in9Spain
 Ex:
import re
x = re.search(".","ab\n")
print (x.group())
x = re.search(".+","ab\na")
print (x.group())
x = re.search(".*","")
print (x.group())
x = re.search("^a+","aabc")
print (x.group())
x = re.search("bc$","abc")
print (x.group())

Output:
a
ab
(empty line, because .* matched the empty string)
aa
bc
 Ex:

import re
match = re.search(r'iii', 'piiig')
print (match.group())
match = re.search(r'..g', 'piiig')
print (match.group())
match = re.search(r'\d\d\d', 'p123g')
print (match.group())
match = re.search(r'\w\w\w', '@@aB0d!!')
print (match.group())
match = re.search(r'\s', 'ab\nc')
print (match.group())
match = re.search(r'\s', 'ab c')
print (match.group())
match = re.search(r'\S+', 'ab\nc')
print (match.group())

Output:
iii
iig
123
aB0
(a newline character, shown as blank lines)
(a single space)
ab
 Ex:
import re
text = "abc-xyz@yahoo.ac.in"
match = re.search(r'[\w-]+@[\w-]+', text)
print(match.group())
match = re.search(r'[\w-]+@[\w.-]+', text)
print(match.group())
match = re.search(r'[\w]+@[\w.-]+', text)
print(match.group())

Output:
abc-xyz@yahoo
abc-xyz@yahoo.ac.in
xyz@yahoo.ac.in
 Ex:

import re
x = re.search('[^a-z]+',"ABC")
print ('output 1:\n',x.group())
x = re.search('[^a-z]',"ABC")
print ('output 2:\n',x.group())

 Output:
output 1:
ABC
output 2:
A
Search and Replace
 Syntax:
re.sub(pattern, repl, string, count=0)

 This method replaces occurrences of the RE pattern in string with repl.
All occurrences are replaced unless count is given, in which case at most
count replacements are made. The method returns the modified string.
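import re

phone = "2004-959-559 # This is #Phone Number"

# Replace only the first '#'
num = re.sub(r'#', "a", phone, count=1)
print ("Phone Num : ", num)

# Remove everything from the first '#' to the end of the string
num = re.sub(r'#.*', "", phone)
print ("Phone Num : ", num)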

Output:
Phone Num : 2004-959-559 a This is #Phone Number
Phone Num : 2004-959-559
Repetition Cases
 Ex:
import re
x = re.search('ruby?',"rub")
print('output 1:\n',x.group())
x = re.search('ruby?',"ruby")
print('output 2:\n',x.group())
x = re.search('ruby*',"rub")
print('output 3:\n',x.group())
x = re.search('ruby+',"rubyyy")
print('output 4:\n',x.group())

Output:
output 1:
rub
output 2:
ruby
output 3:
rub
output 4:
rubyyy
 Ex:
import re
x = re.search(r'\d{3}',"0Ab456")
print ('output 1:\n',x.group())
x = re.search(r'\d{3,}',"0Ab456789")
print ('output 2:\n',x.group())
x = re.search(r'\d{3,5}',"0Ab456789")
print ('output 3:\n',x.group())
x = re.search(r'\d{3}',"0Ab4c56")
print ('output 4:\n',x.group())

Output:
output 1:
456
output 2:
456789
output 3:
45678
AttributeError (the last search finds no match and returns None, so x.group() fails)
Grouping with Parentheses
 Ex:
import re
x = re.search(r'\D{3}',"0Abc456")
print('output 1:\n',x.group())
x = re.search(r'\D',"abc0123012")
print('output 2:\n',x.group())
x = re.search(r'\D\d+',"abc012301a2")
print('output 3:\n',x.group())
x = re.search(r'(\D\d)+',"abc012ef34")
print('output 4:\n',x.group())
x = re.search(r'(\D\d)+',"c0s234a")
print('output 5:\n',x.group())

Output:
output 1:
Abc
output 2:
a
output 3:
c012301
output 4:
c0
output 5:
c0s2
 Ex:
import re
x = re.search('([Pp]ython?)+',"Python and Java")
print('output 1:\n',x.group())
x = re.search ('([Pp]ython,?)+',"Python,python,Python")
print('output 2:\n',x.group())

Output:
output 1:
Python
output 2:
Python,python,Python
findall() method
 findall() finds all the matches and returns them as a list of strings, with
each string representing one match.
 If the pattern is not found, re.findall() returns an empty list.

 Ex:
import re
text = "abc-xyz@yahoo.ac.in, qwe@gmail.com, iop"
match = re.findall(r'[\w-]+@[\w.-]+', text)
print(match)
for i in match:
    print(i)

 Output:
['abc-xyz@yahoo.ac.in', 'qwe@gmail.com']
abc-xyz@yahoo.ac.in
qwe@gmail.com
 Ex:
import re
string = 'hello 1 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string)


print(result)

Output:
['1', '89', '34']
findall() and groups
 Ex:
import re
text = "abc-xyz@yahoo.ac.in, qwe@gmail.com, iop"
match = re.findall(r'([\w-]+)@([\w.-]+)', text)
print(match)
for i in match:
    print(i)
    print(i[0])
    print(i[1])
 Output:
[('abc-xyz', 'yahoo.ac.in'), ('qwe', 'gmail.com')]
('abc-xyz', 'yahoo.ac.in')
abc-xyz
yahoo.ac.in
('qwe', 'gmail.com')
qwe
gmail.com
 Python program to search for some literal strings in a string
 Program:
import re
patterns = ['fox', 'dog', 'horse']
text = 'The quick brown fox jumps over the lazy dog.'
for pattern in patterns:
    print('Searching for "%s" in "%s" ->' % (pattern, text))
    if re.search(pattern, text):
        print('Matched!')
    else:
        print('Not Matched!')

Output:
Searching for "fox" in "The quick brown fox jumps over the lazy dog." ->
Matched!
Searching for "dog" in "The quick brown fox jumps over the lazy dog." ->
Matched!
Searching for "horse" in "The quick brown fox jumps over the lazy dog." ->
Not Matched!
 Regex for the language that accepts strings
containing 'ab' as a substring:

import re
inp = input("Enter string: ")
if re.search("(a+b)*ab(a+b)*", inp):
    print("Match found.")
else:
    print("Match not found.")

Sample inputs and results:
aaabab -> Match found.
ab -> Match found.
aaa -> Match not found.
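The pattern above looks like formal-language notation, where (a+b)* means "any string of a's and b's". In Python's re syntax + means "one or more", so a closer translation uses the character set [ab]. The sketch below is one possible way to write it, assuming the alphabet is {a, b}; the function name contains_ab is hypothetical.

```python
import re

def contains_ab(s):
    """True if s is made only of 'a' and 'b' and contains 'ab'."""
    return re.fullmatch(r"[ab]*ab[ab]*", s) is not None

results = {s: contains_ab(s) for s in ["aaabab", "ab", "aaa", "bab", "xaby"]}
print(results)

# Read as Python syntax, the original pattern rejects "bab" when the whole
# string must match, even though "bab" contains "ab".
literal_reading = re.fullmatch(r"(a+b)*ab(a+b)*", "bab")
print(literal_reading)  # None
```

With re.search() the original pattern still reports the right answer for the sample inputs, because both groups may match zero times and the literal ab in the middle does the work.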
Problem
 Write a Python program to remove leading zeros
from an IP address.
 Code:
import re
ip = input("Enter IP address: ")
ip = ip.lstrip('0')              # strip zeros at the very start of the string
ip2 = re.sub(r'\.0+', '.', ip)   # strip zeros that follow each dot
print (ip)
print (ip2)

Output:
Enter IP address: 0014.01.0003.3
14.01.0003.3
14.1.3.3
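The two-step approach above works for the sample input, but it would also empty a part that consists only of zeros (for example 0.0.0.0). A possible alternative, sketched here under the assumption that each part should keep at least one digit, uses a word boundary and a capture group; strip_leading_zeros is a hypothetical helper name.

```python
import re

def strip_leading_zeros(ip):
    # \b0+ : zeros at the start of a number
    # (\d) : at least one more digit, which is kept via \1
    return re.sub(r"\b0+(\d)", r"\1", ip)

sample = strip_leading_zeros("0014.01.0003.3")
all_zero_parts = strip_leading_zeros("000.010.100.001")
untouched = strip_leading_zeros("10.0.0.1")

print(sample)          # 14.1.3.3
print(all_zero_parts)  # 0.10.100.1
print(untouched)       # 10.0.0.1
```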
 Example:
import re
x = re.search('[-a-z]+',"a-b@c")
print('output 1:\n',x.group())

Output:
output 1:
a-b
Problem
 Write a regular expression that matches
strings which start with a sequence of digits
(at least one digit), followed by a blank and then
arbitrary characters.
 Code:
import re
x = re.search('^[0-9]+ .*',"0123 abc")
print('output:\n',x.group())

Output:
output:
0123 abc

#re.start() method
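In Python's re module, start() is a method of the match object rather than a function of re itself, together with end() and span(). A minimal sketch with a made-up sample string:

```python
import re

text = "abc 0123 xyz"
m = re.search(r"[0-9]+", text)

start = m.start()   # index of the first matched character
end = m.end()       # index just past the last matched character
span = m.span()     # (start, end) as a tuple
matched = text[start:end]

print(start, end)  # 4 8
print(span)        # (4, 8)
print(matched)     # 0123
```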
Problem
 We have an imaginary phone list. Not all entries
contain a phone number, but if a phone number
exists it is the first part of an entry. It is followed,
separated by a blank, by a surname, which is
followed by first names. Surname and first
names are separated by a comma and a space.
 The task is to print the list in the following order:
first_name last_name phone_number
 Ex:
555-8396 Neu, Allison
555-5299 Putz, Lionel
555-7334 Simpson, Homer Jay

Expected output:
Allison Neu 555-8396
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
 Code:
import re
l = ["555-8396 Neu, Allison", "555-5299 Putz, Lionel",
     "555-7334 Simpson, Homer Jay"]

for i in l:
    # group 3 uses .+ so that first names with a space ("Homer Jay") are captured whole
    res = re.search(r"([0-9-]+)\s([A-Za-z]+),\s(.+)", i)
    print(res.group(3) + " " + res.group(2) + " " + res.group(1))

Output:
Allison Neu 555-8396
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
 Ex:
555-8396 Neu, Allison
Burns, Montgomery
555-5299 Putz, Lionel
555-7334 Simpson, Homer Jay

Expected Output:
Allison Neu 555-8396
Montgomery Burns
Lionel Putz 555-5299
Homer Jay Simpson 555-7334
Code:
import re
l = ["555-8396 Neu, Allison", "Burns, Montgomery",
     "555-5299 Putz, Lionel", "555-7334 Simpson, Homer Jay"]
for i in l:
    # the phone number is optional here, so group 1 may be an empty string
    res = re.search(r"([0-9-]*)\s*([A-Za-z]+),\s+(.+)", i)
    print(res.group(3) + " " + res.group(2) + " " + res.group(1))
 RE that matches a string that has an 'a' followed by one or more 'b's.
re.search(r"ab+", inp)
 RE that matches a string that has an 'a' followed by two 'b's.
re.search(r"ab{2}", inp)
 RE that matches a string that has an 'a' followed by two or three 'b's.
re.search(r"ab{2,3}", inp)
 RE that matches a string that has an 'a' followed by anything, ending in
'b'.
re.search(r"a.*b$", inp)
 RE that matches a word at the beginning of a string.
re.search(r"^\w+", inp)
 RE that matches a word containing 'z'.
re.search(r"\w*z\w*", inp)
 RE to check for a number at the end of a string.
re.search(r".*[0-9]$", inp)
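The one-line patterns above can be tried against concrete inputs. In this sketch the helper first_match() and all sample strings are assumptions added for illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None

one_or_more_b = first_match(r"ab+", "xabbby")
needs_two_b = first_match(r"ab{2}", "ab")
two_or_three_b = first_match(r"ab{2,3}", "abbbb")
a_then_b_at_end = first_match(r"a.*b$", "a12b")
leading_word = first_match(r"^\w+", "hello world")
word_with_z = first_match(r"\w*z\w*", "a lazy dog")
ends_in_digit = first_match(r".*[0-9]$", "room 42")

print(one_or_more_b)    # abbb
print(needs_two_b)      # None
print(two_or_three_b)   # abbb
print(a_then_b_at_end)  # a12b
print(leading_word)     # hello
print(word_with_z)      # lazy
print(ends_in_digit)    # room 42
```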
 Write a Python program to extract year, month and date from a URL.

import re
def extract_date(url):
    return re.findall(r'/(\d{4})/(\d{1,2})/(\d{1,2})/', url)

url1 = "https://www.washingtonpost.com/news/football-insider/wp/2016/09/02/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"
print(extract_date(url1))

Output:
[('2016', '09', '02')]
 Write a Python program to convert a date from yyyy-mm-dd
format to dd-mm-yyyy format.

import re
def change_date_format(dt):
    return re.sub(r'(\d{4})-(\d{1,2})-(\d{1,2})', r'\3-\2-\1', dt)

dt1 = "2026-01-02"
print("Original date in YYYY-MM-DD Format:", dt1)
print("New date in DD-MM-YYYY Format:", change_date_format(dt1))

Output:
Original date in YYYY-MM-DD Format: 2026-01-02
New date in DD-MM-YYYY Format: 02-01-2026
 Write a Python program to print the numbers in
a given string.

import re
# Sample string.
text = "Ten 10, Twenty 20, Thirty 30"
result = re.findall(r"\d+", text)
for i in result:
    print(i)

Output:
10
20
30
 Write a Python program to search for numbers (digits 0-9) of
length 1 to 3 in a given string.

import re
text = "Exercises number 1, 12, 13, and 345 are important"
results = re.findall(r"([0-9]{1,3})", text)
print("Number of length 1 to 3")
for n in results:
    print(n)

Output:
Number of length 1 to 3
1
12
13
345
 Write a Python program to find all words
starting with 'a' or 'e' in a given string.

import re
text = "The following example creates an ArrayList with a capacity of 50 elements."
words = re.findall(r"\b[ae]\w*", text)
print(words)

Output:
['example', 'an', 'a', 'elements']
 Write a Python program to replace at most 2
occurrences of space, comma, or dot with a
colon.
import re
text = 'Python Exercises, PHP exercises.'
print(re.sub("[ ,.]", ":", text, count=2))

Output:
Python:Exercises: PHP exercises.
 Write a Python program to find all five-character
words in a string.

import re
text = 'The quick brown fox jumps over the lazy dog.'
# \b on both sides so that only whole five-character words match
print(re.findall(r"\b\w{5}\b", text))

Output:
['quick', 'brown', 'jumps']
 Write a Python program to replace all occurrences
of space, comma, or dot with a colon.
 Input: Python Exercises, PHP exercises.
 Output: Python:Exercises::PHP:exercises:
import re
text = 'Python Exercises, PHP exercises.'
print(re.sub("[ ,.]", ":", text))

Output:
Python:Exercises::PHP:exercises:
 Write a Python program to remove multiple
spaces in a string.

import re
text1 = 'Python      Exercises'
print("Original string:",text1)
print("Without extra spaces:",re.sub(' +',' ',text1))

Output:
Original string: Python      Exercises
Without extra spaces: Python Exercises
 Write a Python program to remove everything
except alphanumeric characters from a string.

import re
text1 = '**//Python Exercises// - 12. '
print("Original string:", text1)
print("Alphanumeric characters only:", re.sub(r'[\W_]+', '', text1))

Output:
Original string: **//Python Exercises// - 12.
Alphanumeric characters only: PythonExercises12
 Write a Python program to split a string at
uppercase letters.
import re
text = "PythonTutorialAndExercises"
print(re.findall('[A-Z][^A-Z]*', text))

Output:
['Python', 'Tutorial', 'And', 'Exercises']
 Write a Python program to remove the
parenthesized part of each string.

import re
items = ["example(.com)", "w3resource", "github(.com)",
         "stackoverflow(.com)"]
for item in items:
    print(re.sub(r"\([^)]+\)", "", item))

Output:
example
w3resource
github
stackoverflow
MCQ
1) What does the function re.match do?
a) matches a pattern at the start of the string
b) matches a pattern at any position in the string
c) such a function does not exist
d) none of the mentioned
2) What does the function re.search do?
a) matches a pattern at the start of the string
b) matches a pattern at any position in the string
c) such a function does not exist
d) none of the mentioned
3) What will be the output of the following Python code?
sentence = 'we are humans'
matched = re.match(r'(.*) (.*?) (.*)', sentence)
print(matched.groups())
a) ('we', 'are', 'humans')
b) (we, are, humans)
c) ('we', 'humans')
d) 'we are humans'
4) ________ matches the start of the string.
________ matches the end of the string.
a) '^', '$'
b) '$', '^'
c) '$', '?'
d) '?', '^'
5) What will be the output of the following Python function?
re.findall("hello world", "hello", 1)

a) ["hello"]
b) [ ]
c) hello
d) hello world
6) What will be the output of the following Python code?
re.sub('morning', 'evening', 'good morning')

a) 'good evening'
b) 'good'
c) 'morning'
d) 'evening'
7) What will be the output of the following Python code?
re.findall('good', 'good is good')
re.findall('good', 'bad is good')

a)
['good', 'good']
['good']
b)
('good', 'good')
(good)
c)
('good')
('good')
d)
['good']
['good']
8) What will be the output of the following Python code?

re.split(r'(n\d)=', 'n1=3.1, n2=5, n3=4.565')

a) Error
b) ['', 'n1', '3.1, ', 'n2', '5, ', 'n3', '4.565']
c) ['n1', '3.1, ', 'n2', '5, ', 'n3', '4.565']
d) ['3.1, ', '5, ', '4.565']
9) What will be the output of the following Python code?

import re
s = 'abc123 xyz666 lmn-11 def77'
re.sub(r'\b([a-z]+)(\d+)', r'\2\1:', s)
a) '123abc: 666xyz: lmn-11 77def:'
b) '77def: lmn-11: 666xyz: 123abc'
c) 'abc123:', 'xyz666:', 'lmn-11:', 'def77:'
d) 'abc123: xyz666: lmn-11: def77'
10) Which of the following statements regarding the output of
the function re.match is incorrect?
a) 'pq*' will match 'pq'
b) 'pq?' matches 'p'
c) 'p{4}, q' does not match 'pppq'
d) 'pq+' matches 'p'
11) Which of the following lines of code will not show a
match?
a) re.match('ab*', 'a')
b) re.match('ab*', 'ab')
c) re.match('ab*', 'abb')
d) re.match('ab*', 'ba')
Answers:
1) a
2) b
3) a
4) a
5) b
6) a
7) a
8) b
9) a
10) d
11) d
