Chapter 4 - Regular Expression
Quite often a Unix user is involved in searching one or more records from a database
or one or more lines from a text file. Such a search could be for finding or extracting
a
REGULAR EXPRESSIONS
The term regular expression comes from theoretical computer science. In its
simplest form, it is defined as a language for specifying patterns that match a
sequence of characters. These patterns are made up of one or more of the following.
1. Normal characters that match exactly the same character in the input.
2. Character classes that match any single character in the class.
3. Certain other special characters that specify the way in which parts of an expression are
to be matched against the input.
Character Class
There are situations when it is necessary to match one character from within a set of
characters. In Unix such a set, out of which only one character is matched, is
referred to as a character class. The set is written within a pair of square
brackets, [ and ]. For example, if the user wants to extract all lines that contain a
pattern (anywhere on the line) that begins with chap and ends with any one of the
digits 1, 2, 3 or 4, then the search pattern is chap[1234]. The same search pattern
can also be written as chap[1-4]. The hyphen (-) indicates a range of characters in
the set. Here [1-4] means any one of the characters in the set {1,2,3,4}.
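The equivalence of the two forms can be checked with a short sketch. It assumes a POSIX shell and grep; the sample lines and the variable names (sample, listed, ranged) are made up for illustration.

```shell
# Hypothetical sample input, one entry per line.
sample='chap1
chap3
chap5
chapter
mychap2.txt'

# [1234] and [1-4] describe the same character class,
# so both commands select the same lines.
listed=$(printf '%s\n' "$sample" | grep 'chap[1234]')
ranged=$(printf '%s\n' "$sample" | grep 'chap[1-4]')

# Prints chap1, chap3 and mychap2.txt; chap5 and chapter do not match.
printf '%s\n' "$ranged"
```

Note that mychap2.txt is selected because the pattern may occur anywhere on the line.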
The grep family consists of three commands: grep, egrep (extended grep) and fgrep
(fixed grep).
The grep command is used to search, select and print specified records or lines from an
input file. The name grep comes from the ed editor command g/re/p, which stands for
globally search for a regular expression and print.
$ grep [options] pattern [filename1] [filename2] …
grep Options grep has a number of options, such as the inverse option -v, the ignore
option -i, the filename option -l, the line number option -n and the count option -c.
The inverse option: -v Generally grep searches for lines or records containing a
pattern, and prints them out. With this option the selection is inverted: the lines
that do not contain the pattern are printed.
The ignore option: -i Normally, grep distinguishes between uppercase and lowercase
letters. This option (ignore case) matches the pattern without considering the
case.
The filename option: -l When this option is used, only the names of the files in which
the required pattern is present are printed.
The count option: -c This option counts the lines or records that contain
the pattern and prints a count for each file given as an argument.
The line number option: -n This option prints each selected line or record
preceded by its line number.
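The five options can be tried side by side with a small sketch. It assumes a POSIX shell with mktemp available; the file names a.txt and b.txt, their contents and the variable names are hypothetical.

```shell
# Hypothetical sample files in a scratch directory (removed on exit).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'Unix is old\nunix is small\nLinux is newer\n' > "$dir/a.txt"
printf 'nothing relevant here\n' > "$dir/b.txt"

inverse=$(grep -v 'Unix' "$dir/a.txt")             # lines WITHOUT "Unix"
nocase=$(grep -i 'unix' "$dir/a.txt")              # "Unix" and "unix" both match
names=$(cd "$dir" && grep -l 'Unix' a.txt b.txt)   # only a.txt contains the pattern
count=$(grep -c 'is' "$dir/a.txt")                 # number of matching lines: 3
numbered=$(grep -n 'small' "$dir/a.txt")           # 2:unix is small

printf '%s\n' "$inverse" "$nocase" "$names" "$count" "$numbered"
```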
egrep stands for extended grep. It is so called because it has additional
metacharacters, notably the plus (+) character, which matches one or more
occurrences of the preceding character, and the question mark (?) character, which
matches zero or one occurrence. This command is the most powerful member of the
grep command family. Its foremost advantage is that multiple search patterns can be
handled very easily: the pipe (|) character is used to separate alternative
patterns.
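A brief sketch of the three metacharacters follows. It uses grep -E, the POSIX spelling of egrep, since some recent systems warn about or omit the egrep name; the sample words and variable names are made up for illustration.

```shell
# Hypothetical sample words, one per line.
sample='color
colour
colouur
goood
gd
sed
awk'

# ? : the preceding character is optional (zero or one occurrence).
optional=$(printf '%s\n' "$sample" | grep -E 'colou?r')   # color, colour
# + : the preceding character occurs one or more times.
repeated=$(printf '%s\n' "$sample" | grep -E 'go+d')      # goood
# | : either of the alternative patterns may match.
either=$(printf '%s\n' "$sample" | grep -E 'sed|awk')     # sed, awk

printf '%s\n' "$optional" "$repeated" "$either"
```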
fgrep stands for fixed grep or fixed character grep. This command uses only fixed
character patterns. In other words, it does not allow the use of regular expressions.
Because it works only with fixed patterns and does not interpret any regular
expression, it is generally the fastest of the pattern-searching programs and is
well suited to searching large files. Like egrep, this command also accepts
multiple search patterns.
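The difference between a fixed pattern and a regular expression can be seen in a short sketch. It uses grep -F, the POSIX spelling of fgrep; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample input. Single quotes keep $5.00 literal in the shell.
sample='price is $5.00
price is 5x00
a.b
axb'

# As a regular expression, "." matches any character: a.b and axb both match.
as_regex=$(printf '%s\n' "$sample" | grep 'a.b')
# With -F the pattern is a literal string: only a.b matches.
as_fixed=$(printf '%s\n' "$sample" | grep -F 'a.b')
# Several fixed strings can be given, one per -e option.
multi=$(printf '%s\n' "$sample" | grep -F -e '$5.00' -e 'axb')

printf '%s\n' "$as_regex" "$as_fixed" "$multi"
```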
sed is an acronym for stream editor. It is an extremely powerful editor with which
one can make quick and easy changes to a file without entering an interactive
editor such as vi or emacs.
The general format of a sed command is as follows.
$ sed options 'address_actionlist' filelist
Here the action part of the address_actionlist specifies the action or actions to be
taken, and the address part identifies the line (record) or lines (records) on which
these actions are to be taken. The filelist holds zero or more filenames from which
lines are picked up one by one, processed and sent on to the standard output, that
is, the monitor. When no filename is given, sed reads its standard input.
sed reads in one line at a time, holds it in a memory space called the pattern space
and acts on it as specified in the sed command. It then reads in the next line, acts
on it in the same manner, and so on. By default, all the processed lines are sent to
the standard output, the monitor. The operational mechanism of sed is shown in Fig.
6.3. This processing does not affect the original contents of the file in any way. If
required, the processed output can be written to a separate file.
As shown in Fig. 6.2, every line or record read from the input file is held in the
pattern space and all the commands are applied to it, one by one. Because sed reads
in and works on one line at a time, one can alter very large files without invoking
an editor or worrying about memory or disk-space requirements.
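A minimal sketch of this behaviour, assuming a POSIX shell with mktemp: the address is line 2, the action is d (delete), and the output is redirected to a separate file. The file names in.txt and out.txt and the variable names are hypothetical.

```shell
# Hypothetical input file in a scratch directory (removed on exit).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'one\ntwo\nthree\n' > "$dir/in.txt"

# Address 2, action d: the processed lines go to standard output,
# which is redirected here into a separate file.
sed '2d' "$dir/in.txt" > "$dir/out.txt"

original=$(cat "$dir/in.txt")     # still one, two, three
processed=$(cat "$dir/out.txt")   # one, three

printf '%s\n' "$original" "$processed"
```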
The q Command: Quitting sed When this command is used, all the lines up to
and including the addressed line of the input file are processed, and then
sed quits.
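As a small illustration, assuming a POSIX shell and made-up input, quitting at line 3 prints only the first three lines:

```shell
# sed stops after printing line 3, so lines 4 and 5 are never processed.
first_three=$(printf '1\n2\n3\n4\n5\n' | sed '3q')

printf '%s\n' "$first_three"
```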
The d Command: Deleting Lines Unnecessary lines or records can be deleted
by using the delete command d. The addressed lines are left out of the output; the
input file itself is not changed.
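The sketch below shows deletion by line range and by pattern. The sample lines and variable names are hypothetical.

```shell
# Hypothetical sample input.
sample='alpha
# a comment
beta
gamma'

# Delete lines 2 through 3 (addressed by line number).
no_middle=$(printf '%s\n' "$sample" | sed '2,3d')      # alpha, gamma
# Delete every line that starts with # (addressed by pattern).
no_comments=$(printf '%s\n' "$sample" | sed '/^#/d')   # alpha, beta, gamma

printf '%s\n' "$no_middle" "$no_comments"
```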
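The p Command and the -n option: Printing Lines Required lines or
records can be printed by using the p command. Because sed prints every line by
default, p is normally combined with the -n option, which suppresses the default
output so that only the addressed lines appear.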
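The effect of -n is easiest to see by running the same p command with and without it. This is a sketch with made-up input and variable names.

```shell
# Without -n every line is printed once by default, and p prints
# line 2 a second time.
duplicated=$(printf 'a\nb\nc\n' | sed '2p')     # a, b, b, c
# With -n the default output is suppressed, so only line 2 appears.
only_second=$(printf 'a\nb\nc\n' | sed -n '2p') # b

printf '%s\n' "$duplicated" "$only_second"
```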
The s Command: Substitution This is one of the most widely used commands.
Substitutions are made using the s command, which replaces text matching a pattern
with a replacement string.
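A short sketch of the common forms of s follows: first occurrence only, every occurrence with the g flag, and substitution restricted to an addressed line. The input text and variable names are made up.

```shell
# Replace only the first match on each line.
first=$(printf 'cat cat cat\n' | sed 's/cat/dog/')     # dog cat cat
# The g flag replaces every match on the line.
all=$(printf 'cat cat cat\n' | sed 's/cat/dog/g')      # dog dog dog
# An address limits the substitution to line 2.
line2=$(printf 'cat\ncat\n' | sed '2s/cat/dog/')       # cat, dog

printf '%s\n' "$first" "$all" "$line2"
```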
The a Command: Appending One or more lines of text can be appended after the
addressed line or record by using the append command a.
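A minimal sketch of appending, written in the portable form where the a command is followed by a backslash and the text starts on the next line. The input and the text NEW are hypothetical.

```shell
# Append the line NEW after line 1 of the input.
appended=$(printf 'first\nsecond\n' | sed '1a\
NEW')

# Prints first, NEW, second.
printf '%s\n' "$appended"
```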
The i Command: Inserting the Text Using this command, one can insert
text before the addressed line of the input.
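Inserting can be sketched in the same portable backslash-newline form. The input and the text HEADER are made up for illustration.

```shell
# Insert the line HEADER before line 1 of the input.
inserted=$(printf 'first\nsecond\n' | sed '1i\
HEADER')

# Prints HEADER, first, second.
printf '%s\n' "$inserted"
```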
The c Command: Changing the Text Using this command one can change (replace) one
or more lines or records of an input file.
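A small sketch of the change command, again in the portable backslash-newline form, with hypothetical input and replacement text:

```shell
# Replace line 2 of the input with the line REPLACED.
changed=$(printf 'first\nsecond\nthird\n' | sed '2c\
REPLACED')

# Prints first, REPLACED, third.
printf '%s\n' "$changed"
```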
The w Command: Writing Files One can write the lines selected by a sed command
to a separate file by using the write command w.
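The sketch below writes only the lines that begin with error to a separate file. It assumes a POSIX shell with mktemp; the file name errors.txt, the input lines and the variable names are hypothetical.

```shell
# Scratch directory for the output file (removed on exit).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# -n suppresses normal output; w writes each addressed line to the file.
# Double quotes let the shell expand $dir inside the sed script.
printf 'error: disk\ninfo: ok\nerror: net\n' | sed -n "/^error/w $dir/errors.txt"

saved=$(cat "$dir/errors.txt")   # error: disk, error: net
printf '%s\n' "$saved"
```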
The r Command: Reading a File The contents of a given file can be read in and
placed after the addressed line of the input by using the read command r.
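As a sketch, the contents of a small file can be read in after line 1 of the input. It assumes a POSIX shell with mktemp; the file name sig.txt, its contents and the variable names are made up.

```shell
# Scratch directory holding the file to be read in (removed on exit).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'SIGNATURE\n' > "$dir/sig.txt"

# After line 1 is printed, the contents of sig.txt are copied to the output.
merged=$(printf 'body 1\nbody 2\n' | sed "1r $dir/sig.txt")

# Prints body 1, SIGNATURE, body 2.
printf '%s\n' "$merged"
```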