Chapter 4 - Regular Expression


Unit 4: Regular Expressions, the grep Family of Commands and sed

Quite often a Unix user needs to search for one or more records in a database
or one or more lines in a text file. Such a search could be for finding or extracting a

1. file, by its filename, from among a large number of filenames,
2. line containing a specific word or phrase in a document,
3. record based on a certain data item such as designation or name,
4. selected portion of the output of a program, and so on.

REGULAR EXPRESSIONS

The term regular expression comes from theoretical computer science. In its
simplest form, it is defined as a language for specifying patterns that match a
sequence of characters. These patterns are made up of one or more of the following.

1. Normal characters, which match exactly the same character in the input.
2. Character classes, which match any single character in the class.
3. Certain other special characters (metacharacters), which specify the way in which
parts of an expression are to be matched against the input.

Metacharacters and their Meaning

^ - The Caret or Circumflex Character: This metacharacter anchors a pattern to the
beginning of a line. It is used to search for and extract lines or records that begin
with a specific pattern.
$ - The Dollar Character: This metacharacter anchors a pattern to the end of a line.
It is used to search for and extract lines or records that end with a specific pattern.
. - The Dot Character: The dot matches any single character, except a newline
character.
* - The Asterisk Character: The asterisk stands for zero or more occurrences of the
preceding character. For example, ab*c matches ac, abc, abbc and so on.
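The four metacharacters above can be tried out with a short sketch such as the following. The word list and the variable names are made up for illustration; any small text input would do.

```shell
# Sample input: one word per line (hypothetical data for illustration).
words='cat
concat
cart
ct
coat
caat'

# ^ anchors the match to the start of the line.
starts_with_ca=$(printf '%s\n' "$words" | grep '^ca')

# $ anchors the match to the end of the line.
ends_with_at=$(printf '%s\n' "$words" | grep 'at$')

# . matches exactly one character: c, any one character, t, and nothing else.
c_dot_t=$(printf '%s\n' "$words" | grep '^c.t$')

# * repeats the preceding character zero or more times: ct, cat, caat.
c_a_star_t=$(printf '%s\n' "$words" | grep '^ca*t$')

printf 'begins with ca:\n%s\n' "$starts_with_ca"
printf 'ends with at:\n%s\n' "$ends_with_at"
printf 'c.t:\n%s\n' "$c_dot_t"
printf 'ca*t:\n%s\n' "$c_a_star_t"
```

Note that ct is selected by ca*t because the asterisk also allows zero occurrences of a.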

Character Class

There are situations when it is necessary to match one character from a set of
characters. In Unix this set of characters, of which exactly one is matched, is
referred to as a character class. The set is written within a pair of square
brackets, [ and ]. For example, if the user wants to extract all lines that contain,
anywhere on the line, a pattern that begins with chap and ends with any one of the
digits 1, 2, 3 or 4, then the search pattern is chap[1234]. The same search pattern
can also be written as chap[1-4]. The hyphen (-) indicates a range of characters in
the set. Here [1-4] means any one of the characters in the set {1,2,3,4}.
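As a rough check that the list form and the range form select the same lines, consider the sketch below. The filenames are invented for the example.

```shell
# Hypothetical list of names used only for illustration.
names='chap1.txt
chap3.txt
chap5.txt
chapter.txt
mychap4.bak'

# List form: chap followed by one of 1, 2, 3 or 4.
by_list=$(printf '%s\n' "$names" | grep 'chap[1234]')

# Range form: the same set written with a hyphen.
by_range=$(printf '%s\n' "$names" | grep 'chap[1-4]')

printf '%s\n' "$by_list"
```

mychap4.bak is selected as well, because the pattern may occur anywhere on the line.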

Searching for Patterns Having Metacharacters

Sometimes it is necessary to search for and extract lines that contain the
metacharacters themselves. This can be done by de-specializing the metacharacters
that appear in the search pattern. The metacharacter \ (backslash) removes the
special meaning of the character that immediately follows it. For example, to search
for and extract all lines that contain the $ character, the regular expression has
to be '\$'.
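The effect of the backslash can be seen by comparing an escaped and an unescaped $ on the same input. The sample lines and variable names below are assumptions made for the example.

```shell
# Hypothetical input lines; single quotes keep the shell from expanding $.
lines='price is $5
no currency here
total$'

# \$ matches a literal dollar sign anywhere on the line.
with_dollar=$(printf '%s\n' "$lines" | grep '\$')

# Unescaped, $ is the end-of-line anchor: lines that end with "here".
ends_here=$(printf '%s\n' "$lines" | grep 'here$')

# The same idea applies to the dot: \. matches only a real dot.
literal_dot=$(printf '%s\n' 'a.b' 'axb' | grep 'a\.b')

printf '%s\n' "$with_dollar"
```

The pattern is placed in single quotes so that the shell passes the backslash and the $ to grep unchanged.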

THE grep FAMILY

This family consists of three commands: grep, egrep (extended grep) and fgrep
(fixed grep).

The grep Command

This command is used to search, select and print specified records or lines from an
input file. grep is an acronym for globally search for a regular expression and
print; the name comes from the ed editor command g/re/p.
$ grep [options] pattern [filename1] [filename2] …
grep Options: grep has a number of options, such as the inverse option -v, the ignore
option -i, the filename option -l, the line number option -n and the count option -c.
The inverse option: -v Generally grep searches for lines or records containing a
pattern and prints them out. With this option the selection is inverted: only the
lines that do not contain the pattern are printed.
The ignore option: -i Normally, grep distinguishes between uppercase and lowercase
letters. This option (ignore case) matches the pattern without regard to case.
The filename option: -l When this option is used, only the names of the files in
which the required pattern is present are printed.
The count option: -c This option prints, for each file given as an argument, the
number of lines or records that contain the pattern.
The line number option: -n This option prints each selected line or record prefixed
with its line number.
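The options above might be exercised as in the following sketch. The two employee files, their contents and the variable names are hypothetical; mktemp -d is assumed to be available for creating a scratch directory.

```shell
# Scratch directory with two small hypothetical employee files.
dir=$(mktemp -d)
printf '%s\n' 'Anil Manager' 'Bina clerk' 'Chitra manager' > "$dir/emp1.txt"
printf '%s\n' 'Dev clerk' 'Esha officer' > "$dir/emp2.txt"

# -v: print the lines that do NOT contain the pattern.
not_clerk=$(cd "$dir" && grep -v 'clerk' emp1.txt)

# -i: ignore case, so both "Manager" and "manager" match.
any_case=$(cd "$dir" && grep -i 'manager' emp1.txt)

# -l: print only the names of the files that contain the pattern.
files_with_officer=$(cd "$dir" && grep -l 'officer' emp1.txt emp2.txt)

# -c: print the number of matching lines for each file.
clerk_counts=$(cd "$dir" && grep -c 'clerk' emp1.txt emp2.txt)

# -n: prefix each matching line with its line number.
numbered=$(cd "$dir" && grep -n 'clerk' emp1.txt)

printf '%s\n' "$clerk_counts" "$numbered"
rm -rf "$dir"
```

When more than one file is named, grep prefixes each count with the filename, which is why the -c result has one line per file.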

THE egrep COMMAND

egrep stands for extended grep. It is so called because it supports additional
metacharacters, notably the plus (+) character, which matches one or more occurrences
of the preceding character, and the question mark (?) character, which matches zero
or one occurrence of it. This command is the most powerful member of the grep command
family. Its foremost advantage is that multiple search patterns can be handled very
easily: the pipe (|) character is used to separate alternative patterns.
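A small sketch of the three extended metacharacters follows. It uses grep -E, the POSIX spelling that is equivalent to egrep, because some current systems no longer ship egrep or print a warning for it. The word list is invented for the example.

```shell
# Hypothetical word list for illustration.
text='color
colour
colouur
gooogle
ggle
cat
dog
bird'

# ? : zero or one occurrence of the preceding character (the u is optional).
optional_u=$(printf '%s\n' "$text" | grep -E '^colou?r$')

# + : one or more occurrences of the preceding character (at least one o).
one_or_more_o=$(printf '%s\n' "$text" | grep -E '^go+gle$')

# | : alternative patterns; a line is selected if either one matches.
cat_or_dog=$(printf '%s\n' "$text" | grep -E 'cat|dog')

printf '%s\n' "$optional_u" "$one_or_more_o" "$cat_or_dog"
```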

THE fgrep COMMAND

fgrep stands for fixed grep or fixed character grep. This command uses only fixed
character patterns. In other words, it does not allow the use of regular expressions;
every character in the pattern is taken literally. Because this command works only
with fixed patterns and does not have to interpret any regular expression, it is
typically the fastest of the pattern-searching programs, and it is used for searching
large files. An important feature of this command is that, like egrep, it also
accepts multiple search patterns.
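The difference between a fixed pattern and a regular expression can be illustrated as below. The sketch uses grep -F, the POSIX equivalent of fgrep, and supplies the multiple patterns with repeated -e options, which is one of several ways to do so. The input lines are made up.

```shell
# Hypothetical input lines; single quotes keep $ and * away from the shell.
lines='cost is $5.00
cost is 5x00
a.*b literal
aXXb'

# Fixed pattern: only the line containing the literal text a.*b is selected.
fixed=$(printf '%s\n' "$lines" | grep -F 'a.*b')

# Same text as a regular expression: a, then anything, then b.
as_regex=$(printf '%s\n' "$lines" | grep 'a.*b')

# Several fixed patterns at once, each given with its own -e.
multi=$(printf '%s\n' "$lines" | grep -F -e '$5' -e 'XX')

printf '%s\n' "$fixed" "$multi"
```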

THE STREAM EDITOR—sed

sed is an acronym for stream editor. It is an extremely powerful editor with which
one can make quick and easy changes to a file without entering an interactive editor
such as vi or emacs.
The general format of a sed command is as follows.
$ sed options 'address_actionlist' filelist
Here the action part of the address_actionlist specifies the action or actions to be
taken, and the address part identifies the line (record) or lines (records) on which
these actions are to be taken. The filelist holds zero or more filenames from which
lines are picked up one by one, processed and sent on to the standard output, which
by default is the monitor. If no filename is given, sed reads from the standard input.
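To make the address and action parts concrete, the sketch below applies one action, d (delete), with three kinds of address: a line number, a range of line numbers and a regular expression. The input text and variable names are assumptions; the filelist is empty here, so sed reads from the pipe.

```shell
# Hypothetical four-line input.
input='one
two
three
four'

# Address 2, action d: delete the second line.
del_line2=$(printf '%s\n' "$input" | sed '2d')

# Address 2,3 (a range), action d: delete lines 2 through 3.
del_range=$(printf '%s\n' "$input" | sed '2,3d')

# Address /^f/ (a regular expression), action d: delete lines starting with f.
del_regex=$(printf '%s\n' "$input" | sed '/^f/d')

printf '%s\n' "$del_line2"
```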

Operational Mechanism of sed

sed reads in one line at a time, holds it in a memory space called the pattern space
and acts on it as specified in the sed command. It then reads in the next line, acts
on it in the same manner and so on. By default, all the processed lines are sent to
the standard output, the monitor. sed's operational mechanism is shown in Fig.
6.3. This processing does not affect the original contents of the file in any way. If
required, the processed output can be written to a separate file.
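That the input file is left untouched can be checked with a sketch like the following, which redirects the processed output to a second file. The filenames in.txt and out.txt are hypothetical, and mktemp -d is assumed to be available.

```shell
# Scratch directory and a hypothetical three-line input file.
dir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' 'gamma' > "$dir/in.txt"

before=$(cat "$dir/in.txt")

# Delete line 2, sending the processed lines to a separate file.
sed '2d' "$dir/in.txt" > "$dir/out.txt"

after=$(cat "$dir/in.txt")      # original file, read again after sed has run
result=$(cat "$dir/out.txt")    # processed copy

printf 'original:\n%s\nprocessed:\n%s\n' "$after" "$result"
rm -rf "$dir"
```

The output is redirected to a different file on purpose: redirecting to the input file itself would truncate it before sed could read it.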

As shown in Fig. 6.2, every line/record read from the input file is held in a memory
area that is called the pattern space, and all the commands are applied to it one
by one. Because sed reads in and works on one line at a time, one can alter very
large files without invoking an editor or worrying about memory or disk-space
requirements.
The q Command: Quitting sed. When this command is used, all the lines up to and
including the addressed line of the input file are picked up for processing, and
then sed quits.
The d Command: Deleting Lines. Unnecessary lines or records can be deleted
by using the delete command d.
The p Command and the -n Option: Printing Lines. Required lines or records can be
printed by using the p command. Because sed prints every line by default, p is
normally combined with the -n option, which suppresses the default output so that
only the selected lines appear.
The s Command: Substitution. This is one of the most widely used commands. The s
command replaces text that matches a pattern with a replacement string.
The a Command: Appending. One or more lines or records can be appended after the
addressed line of the input by using the append command a.
The i Command: Inserting the Text. Using this command, one can insert text before
the addressed line of the input.
The c Command: Changing the Text. Using this command, one can replace one or more
lines or records of the input with new text.
The w Command: Writing Files. One can write the selected lines to a separate file
by using the write command w.
The r Command: Reading a File. The contents of a given file can be read in and
placed after the addressed line by using the read command r.
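The commands listed above can be compared side by side with a sketch such as this one. The input file, the extra file and the variable names are hypothetical, and mktemp -d is assumed to be available. The a, i and c commands are written in the portable form, with a backslash and the new text on the following line.

```shell
# Scratch directory with a hypothetical input file and a file to read in.
dir=$(mktemp -d)
printf '%s\n' 'one' 'two' 'three' 'four' > "$dir/in.txt"
printf '%s\n' 'EXTRA' > "$dir/extra.txt"

out_q=$(sed '2q' "$dir/in.txt")        # q: stop after line 2
out_d=$(sed '2,3d' "$dir/in.txt")      # d: delete lines 2 to 3
out_p=$(sed -n '2p' "$dir/in.txt")     # p with -n: print only line 2
out_s=$(sed 's/o/0/' "$dir/in.txt")    # s: replace the first o on each line

# a: append a line after line 2.
out_a=$(sed '2a\
ADDED' "$dir/in.txt")

# i: insert a line before line 2.
out_i=$(sed '2i\
INSERTED' "$dir/in.txt")

# c: change (replace) line 2.
out_c=$(sed '2c\
CHANGED' "$dir/in.txt")

# w: write the lines that start with t to a separate file.
sed -n "/^t/w $dir/t_lines.txt" "$dir/in.txt"
out_w=$(cat "$dir/t_lines.txt")

# r: read extra.txt in after line 2.
out_r=$(sed "2r $dir/extra.txt" "$dir/in.txt")

printf '%s\n' "$out_s"
rm -rf "$dir"
```

In every case the results go to the standard output or to the file named with w; in.txt itself is not modified.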
