Regular Expression


Article

Regular Expressions and the Java Programming Language
Applications frequently require text processing for features like word searches, email validation,
or XML document integrity checks. This often involves pattern matching. Languages like Perl, sed, and
awk simplify pattern matching through regular expressions, strings of characters that
define patterns used to search for matching text. Pattern matching in the Java programming
language used to require the StringTokenizer class, together with many charAt and substring calls,
to read through the characters or tokens and process the text. This often led to complex or messy
code.
Until now.
The Java 2 Platform, Standard Edition (J2SE), version 1.4, contains a new package called
java.util.regex, enabling the use of regular expressions. The new functionality includes
metacharacters, which give regular expressions their versatility.
This article provides an overview of the use of regular expressions, and details how to use
regular expressions with the java.util.regex package, using the following common scenarios
as examples:
• Simple word replacement
• Email validation
• Removal of control characters from a file
• File searching
To compile the code in these examples and to use regular expressions in your applications, you'll
need to install J2SE version 1.4.
Regular Expression Constructs
A regular expression is a pattern of characters that describes a set of strings. You can use the
java.util.regex package to find, display, or modify some or all of the occurrences of a pattern
in an input sequence.
The simplest form of a regular expression is a literal string, such as "Java" or "programming."
Regular expression matching also allows you to test whether a string fits into a specific syntactic
form, such as an email address.
To develop regular expressions, you combine ordinary characters with special characters such as:

\ $ ^ . * + ? [ ]

Any other character appearing in a regular expression is ordinary, unless a \ precedes it.
Special characters serve a special purpose. For instance, the . matches any single character
except a line terminator. A regular expression like s.n matches any three-character string that
begins with s and ends with n, including sun and son.
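As a minimal sketch of the s.n example (the class and method names are illustrative and not part of the original article), the static Pattern.matches method can test whether an entire string fits the pattern:

```java
import java.util.regex.Pattern;

class DotDemo {
    // Returns true when the whole input matches the regular expression.
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fits("s.n", "sun"));   // true
        System.out.println(fits("s.n", "son"));   // true
        System.out.println(fits("s.n", "soon"));  // false: four characters
        System.out.println(fits("s.n", "s\nn"));  // false: by default . skips line terminators
    }
}
```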
There are many special characters used in regular expressions to find words at the beginning of
lines, words that ignore case or are case-specific, and special characters that give a range, such as
a-e, meaning any letter from a to e.
Regular expression usage using this new package is Perl-like, so if you are familiar with using
regular expressions in Perl, you can use the same expression syntax in the Java programming
language. If you're not familiar with regular expressions here are a few to get you started:

Construct Matches

Characters

x The character x

\\ The backslash character

\0n The character with octal value 0n (0 <= n <= 7)

\0nn The character with octal value 0nn (0 <= n <= 7)

\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh The character with hexadecimal value 0xhh

\uhhhh The character with hexadecimal value 0xhhhh

\t The tab character ('\u0009')

\n The newline (line feed) character ('\u000A')

\r The carriage-return character ('\u000D')

\f The form-feed character ('\u000C')

\a The alert (bell) character ('\u0007')

\e The escape character ('\u001B')

\cx The control character corresponding to x

Character Classes
[abc] a, b, or c (simple class)

[^abc] Any character except a, b, or c (negation)

[a-zA-Z] a through z or A through Z, inclusive (range)

[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)

[a-z&&[^m-p]] a through z, except for m through p: [a-lq-z] (subtraction)

[a-z&&[def]] d, e, or f (intersection)

Predefined Character Classes

. Any character (may or may not match line terminators)

\d A digit: [0-9]

\D A non-digit: [^0-9]

\s A whitespace character: [ \t\n\x0B\f\r]

\S A non-whitespace character: [^\s]

\w A word character: [a-zA-Z_0-9]

\W A non-word character: [^\w]

Check the documentation about the Pattern class for more specific details and examples.
Classes and Methods
The following classes match character sequences against patterns specified by regular
expressions.
Pattern Class
An instance of the Pattern class represents a regular expression that is specified in string form
in a syntax similar to that used by Perl.
A regular expression, specified as a string, must first be compiled into an instance of the
Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary
character sequences against the regular expression. Many matchers can share the same pattern
because it is stateless.
The compile method compiles the given regular expression into a pattern, then the matcher
method creates a matcher that will match the given input against this pattern. The pattern
method returns the regular expression from which this pattern was compiled.
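The compile, matcher, and pattern methods described above can be sketched as follows. This is only an illustration: the class name PatternBasics and the helper matchAll are assumptions made for the example, not part of the API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternBasics {
    // Compiles the regex once, then creates a separate matcher for each input.
    static boolean[] matchAll(String regex, String[] inputs) {
        Pattern p = Pattern.compile(regex);
        boolean[] results = new boolean[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            Matcher m = p.matcher(inputs[i]);
            results[i] = m.matches();
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[a-z]+");
        // pattern() returns the string the pattern was compiled from
        System.out.println(p.pattern());        // [a-z]+
        boolean[] r = matchAll("[a-z]+", new String[] {"java", "J2SE"});
        System.out.println(r[0] + " " + r[1]);  // true false
    }
}
```

Because one compiled Pattern serves every input here, the cost of compiling the expression is paid only once.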
The split method is a convenience method that splits the given input sequence around matches
of this pattern. The following example demonstrates:
/*
 * Uses split to break up a string of input separated by
 * commas and/or whitespace.
 */
import java.util.regex.*;

public class Splitter {

    public static void main(String[] args) throws Exception {
        // Create a pattern to match breaks
        Pattern p = Pattern.compile("[,\\s]+");
        // Split input with the pattern
        String[] result =
            p.split("one,two, three four , five");
        for (int i = 0; i < result.length; i++)
            System.out.println(result[i]);
    }
}

Matcher Class
Instances of the Matcher class are used to match character sequences against a given
pattern. Input is provided to matchers using the CharSequence interface to support
matching against characters from a wide variety of input sources.
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a
matcher can be used to perform three different kinds of match operations:
• The matches method attempts to match the entire input sequence against the pattern.
• The lookingAt method attempts to match the input sequence, starting at the beginning,
against the pattern.
• The find method scans the input sequence looking for the next sequence that matches the
pattern.
Each of these methods returns a boolean indicating success or failure. More information about a
successful match can be obtained by querying the state of the matcher.
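The difference between the three match operations is easiest to see side by side. The following is a small sketch; the class MatchKinds and its helper methods are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKinds {
    // True only if the whole input matches the pattern.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if a match begins at index 0 (it need not reach the end).
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // True if a match occurs anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "cat";
        System.out.println(whole(regex, "cat"));             // true
        System.out.println(whole(regex, "cats and dogs"));   // false
        System.out.println(prefix(regex, "cats and dogs"));  // true
        System.out.println(prefix(regex, "two cats"));       // false
        System.out.println(anywhere(regex, "two cats"));     // true

        // After a successful match, the matcher can be queried for details.
        Matcher m = Pattern.compile(regex).matcher("two cats");
        if (m.find()) {
            System.out.println(m.group() + " at " + m.start() + "-" + m.end()); // cat at 4-7
        }
    }
}
```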
This class also defines methods for replacing matched sequences by new strings whose contents
can, if desired, be computed from the match result.
The appendReplacement method appends everything up to the next match and the replacement
for that match. The appendTail method appends the remaining input after the last match.
For instance, when replacing cat with dog in the string blahcatblahcatblah, the first
appendReplacement appends blahdog. The second appendReplacement appends blahdog, and
appendTail appends blah, resulting in: blahdogblahdogblah. See Simple Word Replacement for an example.
CharSequence Interface
The CharSequence interface provides uniform, read-only access to many different types of
character sequences. You supply the data to be searched from different sources. String,
StringBuffer and CharBuffer implement CharSequence, so they are easy sources of data to
search through. If you don't care for one of the available sources, you can write your own input
source by implementing the CharSequence interface.
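To illustrate, the sketch below runs the same pattern over a String, a StringBuffer, and a CharBuffer. The class name CharSequenceDemo and the helper countMatches are assumptions made for this example.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSequenceDemo {
    // Counts the matches of p in any CharSequence implementation.
    static int countMatches(Pattern p, CharSequence input) {
        Matcher m = p.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("cat");
        String text = "one cat, two cats";
        System.out.println(countMatches(p, text));                    // 2
        System.out.println(countMatches(p, new StringBuffer(text)));  // 2
        System.out.println(countMatches(p, CharBuffer.wrap(text)));   // 2
    }
}
```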
Example Regex Scenarios
The following code samples demonstrate the use of the java.util.regex package for various
common scenarios:
Simple Word Replacement
/*
 * This code writes "one dog, two dogs in the yard"
 * to the standard-output stream:
 */
import java.util.regex.*;

public class Replacement {

    public static void main(String[] args)
            throws Exception {
        // Create a pattern to match cat
        Pattern p = Pattern.compile("cat");
        // Create a matcher with an input string
        Matcher m = p.matcher("one cat," +
                " two cats in the yard");
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        // Loop through and create a new String
        // with the replacements
        while (result) {
            m.appendReplacement(sb, "dog");
            result = m.find();
        }
        // Add the last segment of input to
        // the new String
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}

Email Validation
The following code checks for a few characters that should or should not appear in an email
address. It is not a complete email validation program that covers all possible email scenarios,
but you can extend it as needed.
/*
 * Checks for invalid characters
 * in email addresses
 */
import java.util.regex.*;

public class EmailValidation {
    public static void main(String[] args)
            throws Exception {
        String input = "@sun.com";
        // Checks for email addresses starting with
        // inappropriate symbols like dots or @ signs.
        Pattern p = Pattern.compile("^\\.|^\\@");
        Matcher m = p.matcher(input);
        if (m.find())
            System.err.println("Email addresses don't start" +
                    " with dots or @ signs.");
        // Checks for email addresses that start with
        // www. and prints a message if it does.
        p = Pattern.compile("^www\\.");
        m = p.matcher(input);
        if (m.find()) {
            System.out.println("Email addresses don't start" +
                    " with \"www.\", only web pages do.");
        }
        // Matches runs of characters that this simplified
        // check does not allow, so they can be deleted.
        p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
        m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        boolean deletedIllegalChars = false;

        while (result) {
            deletedIllegalChars = true;
            m.appendReplacement(sb, "");
            result = m.find();
        }

        // Add the last segment of input to the new String
        m.appendTail(sb);

        input = sb.toString();

        if (deletedIllegalChars) {
            System.out.println("It contained incorrect characters" +
                    " , such as spaces or commas.");
        }
    }
}

Removing Control Characters from a File


/* This class removes control characters from a named
 * file.
 */
import java.util.regex.*;
import java.io.*;

public class Control {

    public static void main(String[] args)
            throws Exception {
        // Create file objects for the input and output
        // files (the names are hard-coded here):
        File fin = new File("fileName1");
        File fout = new File("fileName2");
        // Open an input and an output stream
        FileInputStream fis =
            new FileInputStream(fin);
        FileOutputStream fos =
            new FileOutputStream(fout);

        BufferedReader in = new BufferedReader(
            new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(fos));

        // The pattern matches control characters
        Pattern p = Pattern.compile("\\p{Cntrl}");
        Matcher m = p.matcher("");
        String aLine = null;
        while ((aLine = in.readLine()) != null) {
            m.reset(aLine);
            // Replaces control characters with an empty
            // string.
            String result = m.replaceAll("");
            out.write(result);
            out.newLine();
        }
        in.close();
        out.close();
    }
}

File Searching
/*
 * Prints out the // comments found in a .java file.
 */
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;

public class CharBufferExample {

    public static void main(String[] args) throws Exception {
        // Create a pattern to match comments
        Pattern p =
            Pattern.compile("//.*$", Pattern.MULTILINE);

        // Get a Channel for the source file
        File f = new File("Replacement.java");
        FileInputStream fis = new FileInputStream(f);
        FileChannel fc = fis.getChannel();

        // Get a CharBuffer from the source file
        ByteBuffer bb =
            fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        Charset cs = Charset.forName("ISO-8859-1");
        CharsetDecoder cd = cs.newDecoder();
        CharBuffer cb = cd.decode(bb);

        // Run some matches
        Matcher m = p.matcher(cb);
        while (m.find())
            System.out.println("Found comment: " + m.group());
        fis.close();
    }
}

Conclusion
Pattern matching in the Java programming language is now as flexible as in many other
programming languages. Regular expressions can be put to use in applications to ensure data is
formatted correctly before being entered into a database, or sent to some other part of an
application, and they can be used for a wide variety of administrative tasks. In short, you can use
regular expressions anywhere in your Java programming that calls for pattern matching.
For More Information
Package java.util.regex
Java Programming Forum
About the Authors
Dana Nourie is a JDC technical writer. She enjoys exploring the Java platform, especially
creating interactive web applications using servlets and JavaServer Pages technologies, such as
the JDC Quizzes and Learning Paths and Step-by-Step pages. She is also a scuba diver and is
looking for the Pacific Cold Water Seahorse.
Mike McCloskey is a Sun engineer, working in Core Libraries for J2SE. He has made
contributions in java.lang, java.util, java.io and java.math, as well as the new
packages java.util.regex and java.nio. He enjoys playing racquetball and writing science
fiction.

Introduction
What Are Regular Expressions?
Regular expressions are a way to describe a set of strings based on common characteristics
shared by each string in the set. They can be used to search, edit, or manipulate text and data.
You must learn a specific syntax to create regular expressions — one that goes beyond the
normal syntax of the Java programming language. Regular expressions vary in complexity, but
once you understand the basics of how they're constructed, you'll be able to decipher (or create)
any regular expression.
This trail teaches the regular expression syntax supported by the java.util.regex API and
presents several working examples to illustrate how the various objects interact. In the world of
regular expressions, there are many different flavors to choose from, such as grep, Perl, Tcl,
Python, PHP, and awk. The regular expression syntax in the java.util.regex API is most
similar to that found in Perl.
How Are Regular Expressions Represented in This Package?
The java.util.regex package primarily consists of three classes: Pattern, Matcher, and
PatternSyntaxException.
• A Pattern object is a compiled representation of a regular expression. The
Pattern class provides no public constructors. To create a pattern, you must first
invoke one of its public static compile methods, which will then return a
Pattern object. These methods accept a regular expression as the first argument;
the first few lessons of this trail will teach you the required syntax.
• A Matcher object is the engine that interprets the pattern and performs match
operations against an input string. Like the Pattern class, Matcher defines no
public constructors. You obtain a Matcher object by invoking the matcher
method on a Pattern object.
• A PatternSyntaxException object is an unchecked exception that indicates a
syntax error in a regular expression pattern.
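Because PatternSyntaxException is unchecked, the compiler does not force you to handle it, but code that accepts expressions from users may still want to catch it. The following sketch shows one way to do so; the class SyntaxCheck and the method describeError are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheck {
    // Returns null when the regex compiles, otherwise a description of the error.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription() explains the problem; getIndex() gives its
            // approximate position in the pattern, or -1 if unknown.
            return e.getDescription() + " (index " + e.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("[a-z]"));  // null: the pattern is valid
        System.out.println(describeError("[a-z"));   // reports the unclosed class
    }
}
```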
The last few lessons of this trail explore each class in detail. But first, you must understand how
regular expressions are actually constructed. Therefore, the next section introduces a simple test
harness that will be used repeatedly to explore their syntax.

Test Harness
This section defines a reusable test harness, RegexTestHarness.java, for exploring the regular
expression constructs supported by this API. The command to run this code is java
RegexTestHarness; no command-line arguments are accepted. The application loops
repeatedly, prompting the user for a regular expression and input string. Using this test harness is
optional, but you may find it convenient for exploring the test cases discussed in the following
pages.
import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {

    public static void main(String[] args){
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            // Compile whatever the user types as the regular expression
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));

            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            // Report every match together with its start and end index
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if(!found){
                console.format("No match found.%n");
            }
        }
    }
}

Before continuing to the next section, save and compile this code to ensure that your
development environment supports the required packages.
String Literals
The most basic form of pattern matching supported by this API is the match of a string literal.
For example, if the regular expression is foo and the input string is foo, the match will succeed
because the strings are identical. Try this out with the test harness:
Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.
This match was a success. Note that while the input string is 3 characters long, the start index is
0 and the end index is 3. By convention, ranges are inclusive of the beginning index and
exclusive of the end index, as shown in the following figure:

[Figure: The string literal "foo", with numbered cells and index values.]
Each character in the string resides in its own cell, with the index positions pointing between
each cell. The string "foo" starts at index 0 and ends at index 3, even though the characters
themselves only occupy cells 0, 1, and 2.
With subsequent matches, you'll notice some overlap; the start index for the next match is the
same as the end index of the previous match:

Enter your regex: foo


Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.
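The same index convention is used by String.substring, so the start and end values of a match can be passed to it directly. The following sketch (the class IndexDemo and the method spans are illustrative names, not part of the tutorial) reproduces the output above in compact form:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexDemo {
    // Lists each match as text@start-end.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // substring follows the same rule: start inclusive, end exclusive
            String text = input.substring(m.start(), m.end());
            out.add(text + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("foo", "foofoofoo"));
        // [foo@0-3, foo@3-6, foo@6-9]
    }
}
```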

Metacharacters
This API also supports a number of special characters that affect the way a pattern is matched.
Change the regular expression to cat. and the input string to cats. The output will appear as
follows:
Enter your regex: cat.
Enter input string to search: cats
I found the text "cats" starting at index 0 and ending at index 4.
The match still succeeds, even though the dot "." is not present in the input string. It succeeds
because the dot is a metacharacter — a character with special meaning interpreted by the
matcher. The metacharacter "." means "any character" which is why the match succeeds in this
example.
The metacharacters supported by this API are: ([{\^-$|]})?*+.

Note: In certain situations the special characters listed above will not be treated as
metacharacters. You'll encounter this as you learn more about how regular expressions are
constructed. You can, however, use this list to check whether or not a specific character will ever
be considered a metacharacter. For example, the characters ! @ and # never carry a special
meaning.

There are two ways to force a metacharacter to be treated as an ordinary character:


• precede the metacharacter with a backslash, or
• enclose it within \Q (which starts the quote) and \E (which ends it).
When using this technique, the \Q and \E can be placed at any location within the expression,
provided that the \Q comes first.
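Both quoting techniques can be sketched in Java source as follows. Note that each backslash is doubled here because the expression is written inside a string literal; the class QuoteDemo and the method found are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "1+1=2";
        // Unquoted, + is a quantifier: "one or more 1s, then a 1".
        System.out.println(found("1+1", input));        // false
        // A backslash makes + an ordinary character.
        System.out.println(found("1\\+1", input));      // true
        // \Q ... \E quotes everything in between.
        System.out.println(found("\\Q1+1\\E", input));  // true
    }
}
```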

Character Classes

If you browse through the Pattern class specification, you'll see tables summarizing
the supported regular expression constructs. In the "Character Classes" section
you'll find the following:

Character Classes

[abc] a, b, or c (simple class)

[^abc] Any character except a, b, or c (negation)

[a-zA-Z] a through z, or A through Z, inclusive (range)

[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)

[a-z&&[def]] d, e, or f (intersection)

[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)

[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)

The left-hand column specifies the regular expression constructs, while the right-
hand column describes the conditions under which each construct will match.

Note: The word "class" in the phrase "character class" does not refer to a .class
file. In the context of regular expressions, a character class is a set of characters
enclosed within square brackets. It specifies the characters that will successfully
match a single character from a given input string.

Simple Classes
The most basic form of a character class is to simply place a set of characters side-
by-side within square brackets. For example, the regular expression [bcr]at will
match the words "bat", "cat", or "rat" because it defines a character class
(accepting either "b", "c", or "r") as its first character.

Enter your regex: [bcr]at


Enter input string to search: bat
I found the text "bat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at


Enter input string to search: cat
I found the text "cat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at


Enter input string to search: rat
I found the text "rat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at


Enter input string to search: hat
No match found.
In the above examples, the overall match succeeds only when the first letter
matches one of the characters defined by the character class.
Negation
To match all characters except those listed, insert the "^" metacharacter at the
beginning of the character class. This technique is known as negation.

Enter your regex: [^bcr]at


Enter input string to search: bat
No match found.

Enter your regex: [^bcr]at


Enter input string to search: cat
No match found.

Enter your regex: [^bcr]at


Enter input string to search: rat
No match found.

Enter your regex: [^bcr]at


Enter input string to search: hat
I found the text "hat" starting at index 0 and ending at index 3.
The match is successful only if the first character of the input string does not
match any of the characters defined by the character class.

Ranges
Sometimes you'll want to define a character class that includes a range of values,
such as the letters "a through h" or the numbers "1 through 5". To specify a range,
simply insert the "-" metacharacter between the first and last character to be
matched, such as [1-5] or [a-h]. You can also place different ranges beside each
other within the class to further expand the match possibilities. For example,
[a-zA-Z] will match any letter of the alphabet: a to z (lowercase) or A to Z (uppercase).

Here are some examples of ranges and negation:


Enter your regex: [a-c]
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: [a-c]


Enter input string to search: b
I found the text "b" starting at index 0 and ending at index 1.

Enter your regex: [a-c]


Enter input string to search: c
I found the text "c" starting at index 0 and ending at index 1.

Enter your regex: [a-c]


Enter input string to search: d
No match found.

Enter your regex: foo[1-5]


Enter input string to search: foo1
I found the text "foo1" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]


Enter input string to search: foo5
I found the text "foo5" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]


Enter input string to search: foo6
No match found.

Enter your regex: foo[^1-5]


Enter input string to search: foo1
No match found.

Enter your regex: foo[^1-5]


Enter input string to search: foo6
I found the text "foo6" starting at index 0 and ending at index 4.
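Outside the test harness, the same simple classes, negations, and ranges can be checked programmatically. This is a minimal sketch; the class RangeDemo and the method fits are names assumed for the example.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // True when the entire input matches the regular expression.
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fits("[bcr]at", "bat"));     // true
        System.out.println(fits("[bcr]at", "hat"));     // false
        System.out.println(fits("[^bcr]at", "hat"));    // true
        System.out.println(fits("foo[1-5]", "foo5"));   // true
        System.out.println(fits("foo[1-5]", "foo6"));   // false
        System.out.println(fits("foo[^1-5]", "foo6"));  // true
    }
}
```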

Unions
You can also use unions to create a single character class comprised of two or more
separate character classes. To create a union, simply nest one class inside the
other, such as [0-4[6-8]]. This particular union creates a single character class that
matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8.

Enter your regex: [0-4[6-8]]


Enter input string to search: 0
I found the text "0" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]


Enter input string to search: 5
No match found.

Enter your regex: [0-4[6-8]]


Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]


Enter input string to search: 8
I found the text "8" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]


Enter input string to search: 9
No match found.

Intersections
To create a single character class matching only the characters common to all of its
nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a
single character class matching only the numbers common to both character
classes: 3, 4, and 5.

Enter your regex: [0-9&&[345]]


Enter input string to search: 3
I found the text "3" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]


Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]


Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]


Enter input string to search: 2
No match found.

Enter your regex: [0-9&&[345]]


Enter input string to search: 6
No match found.
And here's an example that shows the intersection of two ranges:

Enter your regex: [2-8&&[4-6]]


Enter input string to search: 3
No match found.

Enter your regex: [2-8&&[4-6]]


Enter input string to search: 4
I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]


Enter input string to search: 5
I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]


Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]


Enter input string to search: 7
No match found.

Subtraction
Finally, you can use subtraction to negate one or more nested character classes,
such as [0-9&&[^345]]. This example creates a single character class that matches
everything from 0 to 9, except the numbers 3, 4, and 5.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 2
I found the text "2" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 3
No match found.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 4
No match found.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 5
No match found.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 6
I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]


Enter input string to search: 9
I found the text "9" starting at index 0 and ending at index 1.
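One way to check what a union, intersection, or subtraction actually accepts is to test every digit against the class and collect the ones that match. The helper below is a sketch for that purpose; the class ClassMembers and the method digitsMatching are hypothetical names, not part of the API.

```java
import java.util.regex.Pattern;

class ClassMembers {
    // Returns the digits 0 through 9 that the given character class accepts.
    static String digitsMatching(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder accepted = new StringBuilder();
        for (char c = '0'; c <= '9'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                accepted.append(c);
            }
        }
        return accepted.toString();
    }

    public static void main(String[] args) {
        System.out.println(digitsMatching("[0-4[6-8]]"));     // union: 01234678
        System.out.println(digitsMatching("[0-9&&[345]]"));   // intersection: 345
        System.out.println(digitsMatching("[2-8&&[4-6]]"));   // intersection: 456
        System.out.println(digitsMatching("[0-9&&[^345]]"));  // subtraction: 0126789
    }
}
```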
Now that we've covered how character classes are created, you may want to review
the Character Classes table before continuing with the next section.


Predefined Character Classes

The Pattern API contains a number of useful predefined character classes, which
offer convenient shorthands for commonly used regular expressions:

Predefined Character Classes

. Any character (may or may not match line terminators)

\d A digit: [0-9]

\D A non-digit: [^0-9]

\s A whitespace character: [ \t\n\x0B\f\r]

\S A non-whitespace character: [^\s]

\w A word character: [a-zA-Z_0-9]

\W A non-word character: [^\w]

In the table above, each construct in the left-hand column is shorthand for the
character class in the right-hand column. For example, \d means a range of digits
(0-9), and \w means a word character (any lowercase letter, any uppercase letter,
the underscore character, or any digit). Use the predefined classes whenever
possible. They make your code easier to read and eliminate errors introduced by
malformed character classes.

Constructs beginning with a backslash are called escaped constructs. We previewed escaped
constructs in the String Literals section where we mentioned the use of backslash and \Q and \E
for quotation. If you are using an escaped construct within a string literal, you must precede the
backslash with another backslash for the string to compile. For example:

private final String REGEX = "\\d"; // a single digit


In this example \d is the regular expression; the extra backslash is required for the
code to compile. The test harness reads the expressions directly from the Console,
however, so the extra backslash is unnecessary.
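A short sketch can make the distinction between the source text and the run-time string visible. The class EscapeDemo and its methods are illustrative names only.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Written with two backslashes in source; at run time the string
    // holds just two characters: a backslash and the letter d.
    static final String REGEX = "\\d";

    static int regexLength() {
        return REGEX.length();
    }

    // True when the whole input is exactly one digit.
    static boolean isSingleDigit(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(regexLength());        // 2
        System.out.println(isSingleDigit("7"));   // true
        System.out.println(isSingleDigit("a"));   // false
        System.out.println(isSingleDigit("42"));  // false: two digits
    }
}
```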

The following examples demonstrate the use of predefined character classes.

Enter your regex: .


Enter input string to search: @
I found the text "@" starting at index 0 and ending at index 1.

Enter your regex: .


Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: .


Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \d


Enter input string to search: 1
I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: \d


Enter input string to search: a
No match found.

Enter your regex: \D


Enter input string to search: 1
No match found.

Enter your regex: \D


Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \s


Enter input string to search:
I found the text " " starting at index 0 and ending at index 1.

Enter your regex: \s


Enter input string to search: a
No match found.

Enter your regex: \S


Enter input string to search:
No match found.

Enter your regex: \S


Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w


Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w


Enter input string to search: !
No match found.

Enter your regex: \W


Enter input string to search: a
No match found.

Enter your regex: \W


Enter input string to search: !
I found the text "!" starting at index 0 and ending at index 1.
In the first three examples, the regular expression is simply . (the "dot"
metacharacter) that indicates "any character." Therefore, the match is successful in
all three cases (a randomly selected @ character, a digit, and a letter). The
remaining examples each use a single regular expression construct from the
Predefined Character Classes table. You can refer to this table to figure out the logic
behind each match:

• \d matches all digits
• \s matches whitespace characters
• \w matches word characters
Alternatively, a capital letter means the opposite:
• \D matches non-digits
• \S matches non-whitespace characters
• \W matches non-word characters
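As a rough illustration of how each predefined class partitions an input, the sketch below counts the characters that each construct matches in a sample string. The class PredefinedDemo, the method count, and the sample input are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedDemo {
    // Counts how many times the single-character regex matches in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "Route 66!";  // nine characters
        System.out.println(count("\\d", input));  // 2 digits
        System.out.println(count("\\D", input));  // 7 non-digits
        System.out.println(count("\\s", input));  // 1 whitespace character
        System.out.println(count("\\S", input));  // 8 non-whitespace characters
        System.out.println(count("\\w", input));  // 7 word characters
        System.out.println(count("\\W", input));  // 2 non-word characters
    }
}
```

Each lowercase class and its uppercase counterpart always add up to the length of the input, since every character belongs to exactly one of the two.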
Quantifiers
Quantifiers allow you to specify the number of occurrences to match
against. For convenience, the three sections of the Pattern API
specification describing greedy, reluctant, and possessive quantifiers
are presented below. At first glance it may appear that the quantifiers
X?, X?? and X?+ do exactly the same thing, since they all promise to
match "X, once or not at all". There are subtle implementation
differences which will be explained near the end of this section.

Quantifiers

Greedy Reluctant Possessive Meaning

X? X?? X?+ X, once or not at all

X* X*? X*+ X, zero or more times

X+ X+? X++ X, one or more times

X{n} X{n}? X{n}+ X, exactly n times

X{n,} X{n,}? X{n,}+ X, at least n times

X{n,m} X{n,m}? X{n,m}+ X, at least n but not more than m times

Let's start our look at greedy quantifiers by creating three different regular expressions:
the letter "a" followed by either ?, *, or +. Let's see what happens when these expressions
are tested against an empty input string "":

Enter your regex: a?
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search:
No match found.

• Zero-Length Matches
• In the above example, the match is successful in the first two cases
because the expressions a? and a* both allow for zero occurrences of
the letter a. You'll also notice that the start and end indices are both
zero, which is unlike any of the examples we've seen so far. The empty
input string "" has no length, so the test simply matches nothing at
index 0. Matches of this sort are known as zero-length matches. A
zero-length match can occur in several cases: in an empty input string,
at the beginning of an input string, after the last character of an input
string, or in between any two characters of an input string. Zero-length
matches are easily identifiable because they always start and end at
the same index position.
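The empty-input case can be reproduced with a small loop around Matcher.find. This is only a sketch: the class name and the findAll helper are made up for this example and are not the test harness used elsewhere in the article. A zero-length match is recognizable because start() and end() return the same value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {

    // Collects every match as "text"[start,end); a zero-length match has start == end.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add("\"" + matcher.group() + "\"[" + matcher.start() + "," + matcher.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a?", "")); // [""[0,0)]  one zero-length match
        System.out.println(findAll("a*", "")); // [""[0,0)]  one zero-length match
        System.out.println(findAll("a+", "")); // []         a+ needs at least one "a"
    }
}
```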

• Let's explore zero-length matches with a few more examples. Change the input string to a
single letter "a" and you'll notice something interesting:

• Enter your regex: a?
• Enter input string to search: a
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "" starting at index 1 and ending at index 1.

• Enter your regex: a*
• Enter input string to search: a
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "" starting at index 1 and ending at index 1.

• Enter your regex: a+
• Enter input string to search: a
• I found the text "a" starting at index 0 and ending at index 1.
• All three quantifiers found the letter "a", but the first two also found a
zero-length match at index 1; that is, after the last character of the
input string. Remember, the matcher sees the character "a" as sitting
in the cell between index 0 and index 1, and our test harness loops
until it can no longer find a match. Depending on the quantifier used,
the presence of "nothing" at the index after the last character may or
may not trigger a match.

• Now change the input string to the letter "a" five times in a row and you'll get the
following:

• Enter your regex: a?
• Enter input string to search: aaaaa
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "a" starting at index 1 and ending at index 2.
• I found the text "a" starting at index 2 and ending at index 3.
• I found the text "a" starting at index 3 and ending at index 4.
• I found the text "a" starting at index 4 and ending at index 5.
• I found the text "" starting at index 5 and ending at index 5.

• Enter your regex: a*
• Enter input string to search: aaaaa
• I found the text "aaaaa" starting at index 0 and ending at index 5.
• I found the text "" starting at index 5 and ending at index 5.

• Enter your regex: a+
• Enter input string to search: aaaaa
• I found the text "aaaaa" starting at index 0 and ending at index 5.
• The expression a? finds an individual match for each character, since it
matches when "a" appears zero or one times. The expression a* finds
two separate matches: all of the letter "a"'s in the first match, then the
zero-length match after the last character at index 5. And finally, a+
matches all occurrences of the letter "a", ignoring the presence of
"nothing" at the last index.

• At this point, you might be wondering what the results would be if the first two
quantifiers encounter a letter other than "a". For example, what happens if it encounters
the letter "b", as in "ababaaaab"?
• Let's find out:

• Enter your regex: a?
• Enter input string to search: ababaaaab
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "" starting at index 1 and ending at index 1.
• I found the text "a" starting at index 2 and ending at index 3.
• I found the text "" starting at index 3 and ending at index 3.
• I found the text "a" starting at index 4 and ending at index 5.
• I found the text "a" starting at index 5 and ending at index 6.
• I found the text "a" starting at index 6 and ending at index 7.
• I found the text "a" starting at index 7 and ending at index 8.
• I found the text "" starting at index 8 and ending at index 8.
• I found the text "" starting at index 9 and ending at index 9.

• Enter your regex: a*
• Enter input string to search: ababaaaab
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "" starting at index 1 and ending at index 1.
• I found the text "a" starting at index 2 and ending at index 3.
• I found the text "" starting at index 3 and ending at index 3.
• I found the text "aaaa" starting at index 4 and ending at index 8.
• I found the text "" starting at index 8 and ending at index 8.
• I found the text "" starting at index 9 and ending at index 9.

• Enter your regex: a+
• Enter input string to search: ababaaaab
• I found the text "a" starting at index 0 and ending at index 1.
• I found the text "a" starting at index 2 and ending at index 3.
• I found the text "aaaa" starting at index 4 and ending at index 8.
• Even though the letter "b" appears in cells 1, 3, and 8, the output
reports a zero-length match at those locations. The regular expression
a? is not specifically looking for the letter "b"; it's merely looking for
the presence (or lack thereof) of the letter "a". If the quantifier allows
for a match of "a" zero times, anything in the input string that's not an
"a" will show up as a zero-length match. The remaining a's are
matched according to the rules discussed in the previous examples.
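To see exactly where the zero-length matches fall, one option is to record only the matches whose start and end indices coincide. The following sketch does that for the same input string; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthPositions {

    // Returns the start index of every zero-length match (start == end).
    static List<Integer> zeroLengthStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<Integer>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                starts.add(matcher.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        // The b's sit at indices 1, 3, and 8; index 9 is the end of the input.
        System.out.println(zeroLengthStarts("a?", input)); // [1, 3, 8, 9]
        System.out.println(zeroLengthStarts("a*", input)); // [1, 3, 8, 9]
        System.out.println(zeroLengthStarts("a+", input)); // []
    }
}
```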

• To match a pattern exactly n times, simply specify the number inside a set of
braces:

• Enter your regex: a{3}
• Enter input string to search: aa
• No match found.

• Enter your regex: a{3}
• Enter input string to search: aaa
• I found the text "aaa" starting at index 0 and ending at index 3.

• Enter your regex: a{3}
• Enter input string to search: aaaa
• I found the text "aaa" starting at index 0 and ending at index 3.
• Here, the regular expression a{3} is searching for three occurrences of
the letter "a" in a row. The first test fails because the input string does
not have enough a's to match against. The second test contains
exactly 3 a's in the input string, which triggers a match. The third test
also triggers a match because there are exactly 3 a's at the beginning
of the input string. Anything following that is irrelevant to the first
match. If the pattern should appear again after that point, it would
trigger subsequent matches:


• Enter your regex: a{3}
• Enter input string to search: aaaaaaaaa
• I found the text "aaa" starting at index 0 and ending at index 3.
• I found the text "aaa" starting at index 3 and ending at index 6.
• I found the text "aaa" starting at index 6 and ending at index 9.
• To require a pattern to appear at least n times, add a comma after the
number:


• Enter your regex: a{3,}
• Enter input string to search: aaaaaaaaa
• I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.
• With the same input string, this test finds only one match, because the
9 a's in a row satisfy the need for "at least" 3 a's.

• Finally, to specify an upper limit on the number of occurrences, add a second number
inside the braces:

• Enter your regex: a{3,6} // find at least 3 (but no more than 6) a's in a row
• Enter input string to search: aaaaaaaaa
• I found the text "aaaaaa" starting at index 0 and ending at index 6.
• I found the text "aaa" starting at index 6 and ending at index 9.
• Here the first match is forced to stop at the upper limit of 6 characters.
The second match includes whatever is left over, which happens to be
three a's, the minimum number of characters allowed for this
match. If the input string were one character shorter, there would not
be a second match since only two a's would remain.
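The brace quantifiers above can be compared side by side in code. This is a minimal sketch with a hypothetical matchedTexts helper that returns the text of each successive match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedQuantifierDemo {

    // Returns the text of every match, in the order the matcher finds them.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa";
        System.out.println(matchedTexts("a{3}", nine));         // [aaa, aaa, aaa]
        System.out.println(matchedTexts("a{3,}", nine));        // [aaaaaaaaa]
        System.out.println(matchedTexts("a{3,6}", nine));       // [aaaaaa, aaa]
        // One character shorter: only two a's remain after the first match.
        System.out.println(matchedTexts("a{3,6}", "aaaaaaaa")); // [aaaaaa]
    }
}
```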

Capturing Groups and Character Classes with Quantifiers


• Until now, we've only tested quantifiers on input strings containing one
character. In fact, quantifiers can only attach to one character at a
time, so the regular expression "abc+" would mean "a, followed by b,
followed by c one or more times". It would not mean "abc" one or more
times. However, quantifiers can also attach to Character Classes and
Capturing Groups, such as [abc]+ (a or b or c, one or more times) or
(abc)+ (the group "abc", one or more times).

• Let's illustrate by specifying the group (dog), three times in a row.



• Enter your regex: (dog){3}
• Enter input string to search: dogdogdogdogdogdog
• I found the text "dogdogdog" starting at index 0 and ending at index 9.
• I found the text "dogdogdog" starting at index 9 and ending at index 18.

• Enter your regex: dog{3}
• Enter input string to search: dogdogdogdogdogdog
• No match found.
• Here the first example finds two matches, since the quantifier applies
to the entire capturing group. Remove the parentheses, however, and
the match fails because the quantifier {3} now applies only to the
letter "g".

• Similarly, we can apply a quantifier to an entire character class:


• Enter your regex: [abc]{3}
• Enter input string to search: abccabaaaccbbbc
• I found the text "abc" starting at index 0 and ending at index 3.
• I found the text "cab" starting at index 3 and ending at index 6.
• I found the text "aaa" starting at index 6 and ending at index 9.
• I found the text "ccb" starting at index 9 and ending at index 12.
• I found the text "bbc" starting at index 12 and ending at index 15.

• Enter your regex: abc{3}
• Enter input string to search: abccabaaaccbbbc
• No match found.
• Here the quantifier {3} applies to the entire character class in the first
example, but only to the letter "c" in the second.
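The scope of a quantifier can also be checked programmatically. In the sketch below, the firstMatch helper (a name invented for this example) returns the first matched text, or null when nothing matches. The inputs "doggg" and "abccc" are extra test strings added here to show what the unparenthesized patterns actually accept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierScopeDemo {

    // Returns the text of the first match, or null if the regex does not match.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String dogs = "dogdogdogdogdogdog";
        System.out.println(firstMatch("(dog){3}", dogs));    // dogdogdog
        System.out.println(firstMatch("dog{3}", dogs));      // null: {3} applies to "g" only
        System.out.println(firstMatch("dog{3}", "doggg"));   // doggg

        String letters = "abccabaaaccbbbc";
        System.out.println(firstMatch("[abc]{3}", letters)); // abc
        System.out.println(firstMatch("abc{3}", letters));   // null: {3} applies to "c" only
        System.out.println(firstMatch("abc{3}", "abccc"));   // abccc
    }
}
```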

Differences Among Greedy, Reluctant, and Possessive Quantifiers


• There are subtle differences among greedy, reluctant, and possessive
quantifiers.

• Greedy quantifiers are considered "greedy" because they force the matcher to read in, or
eat, the entire input string prior to attempting the first match. If the first match attempt
(the entire input string) fails, the matcher backs off the input string by one character and
tries again, repeating the process until a match is found or there are no more characters
left to back off from. Depending on the quantifier used in the expression, the last thing it
will try matching against is 1 or 0 characters.
• The reluctant quantifiers, however, take the opposite approach: They start at the
beginning of the input string, then reluctantly eat one character at a time looking for a
match. The last thing they try is the entire input string.
• Finally, the possessive quantifiers always eat the entire input string, trying once (and only
once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off,
even if doing so would allow the overall match to succeed.
• To illustrate, consider the input string xfooxxxxxxfoo.

• Enter your regex: .*foo // greedy quantifier
• Enter input string to search: xfooxxxxxxfoo
• I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

• Enter your regex: .*?foo // reluctant quantifier
• Enter input string to search: xfooxxxxxxfoo
• I found the text "xfoo" starting at index 0 and ending at index 4.
• I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

• Enter your regex: .*+foo // possessive quantifier
• Enter input string to search: xfooxxxxxxfoo
• No match found.
• The first example uses the greedy quantifier .* to find "anything", zero
or more times, followed by the letters "f" "o" "o". Because the
quantifier is greedy, the .* portion of the expression first eats the
entire input string. At this point, the overall expression cannot
succeed, because the last three letters ("f" "o" "o") have already
been consumed. So the matcher slowly backs off one letter at a time
until the rightmost occurrence of "foo" has been regurgitated, at which
point the match succeeds and the search ends.

• The second example, however, is reluctant, so it starts by first consuming "nothing".
Because "foo" doesn't appear at the beginning of the string, it's forced to swallow the first
letter (an "x"), which triggers the first match, from index 0 to index 4. Our test harness
continues the process until the input string is exhausted, finding another match from
index 4 to index 13.
• The third example fails to find a match because the quantifier is possessive. In this case,
the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at
the end of the expression. Use a possessive quantifier for situations where you want to
seize all of something without ever backing off; it will outperform the equivalent greedy
quantifier in cases where the match is not immediately found.
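The three behaviors can be compared directly in Java. The sketch below uses a hypothetical matchedTexts helper and the same input string as the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierKindsDemo {

    // Returns the text of every match, in the order the matcher finds them.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(matchedTexts(".*foo", input));  // greedy:     [xfooxxxxxxfoo]
        System.out.println(matchedTexts(".*?foo", input)); // reluctant:  [xfoo, xxxxxxfoo]
        System.out.println(matchedTexts(".*+foo", input)); // possessive: []
    }
}
```

The possessive form returns an empty list because .*+ never gives back the trailing "foo" it has already consumed.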
Capturing Groups

In the previous section, we saw how quantifiers attach to one character, character
class, or capturing group at a time. But until now, we have not discussed the notion
of capturing groups in any detail.

Capturing groups are a way to treat multiple characters as a single unit. They are created by
placing the characters to be grouped inside a set of parentheses. For example, the regular
expression (dog) creates a single group containing the letters "d" "o" and "g". The portion of
the input string that matches the capturing group will be saved in memory for later recall via
backreferences (as discussed below in the section, Backreferences).
Numbering
As described in the Pattern API, capturing groups are numbered by counting their
opening parentheses from left to right. In the expression ((A)(B(C))), for example,
there are four such groups:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)
To find out how many groups are present in the expression, call the groupCount
method on a matcher object. The groupCount method returns an int showing the
number of capturing groups present in the matcher's pattern. In this example,
groupCount would return the number 4, showing that the pattern contains 4
capturing groups.
There is also a special group, group 0, which always represents the entire expression. This group
is not included in the total reported by groupCount. Groups beginning with (? are pure,
non-capturing groups that do not capture text and do not count towards the group total. (You'll see
examples of non-capturing groups later in the section Methods of the Pattern Class.)
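The numbering rules can be verified with a short program. This is a sketch only: the class name, the countGroups helper, the input "ABC", and the second pattern (?:A)(B) are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {

    // groupCount depends only on the pattern, so an empty input is enough here.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("((A)(B(C)))")); // 4
        System.out.println(countGroups("(?:A)(B)"));    // 1, because (?:A) does not capture

        Matcher matcher = Pattern.compile("((A)(B(C)))").matcher("ABC");
        if (matcher.matches()) {
            // Group 0 is the whole match and is not counted by groupCount.
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println(i + ": " + matcher.group(i));
            }
            // Prints:
            // 0: ABC
            // 1: ABC
            // 2: A
            // 3: BC
            // 4: C
        }
    }
}
```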
It's important to understand how groups are numbered because some Matcher methods accept an
int specifying a particular group number as a parameter:
• public int start(int group): Returns the start index of the subsequence
captured by the given group during the previous match operation.
• public int end(int group): Returns the index of the last character, plus
one, of the subsequence captured by the given group during the previous
match operation.
• public String group(int group): Returns the input subsequence captured
by the given group during the previous match operation.
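A brief sketch of these three methods in use follows. The describeGroup helper, the pattern (\d+)-(\d+), and the sample input are hypothetical and serve only to show how the group number selects which captured subsequence is reported.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupPositionsDemo {

    // Describes one group of the first match as: text [start,end)
    // Returns null when the regex does not match the input at all.
    static String describeGroup(String regex, String input, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return matcher.group(group) + " [" + matcher.start(group) + "," + matcher.end(group) + ")";
    }

    public static void main(String[] args) {
        String regex = "(\\d+)-(\\d+)";     // two capturing groups around a hyphen
        String input = "pages 12-345 only"; // hypothetical sample text

        System.out.println(describeGroup(regex, input, 0)); // 12-345 [6,12)
        System.out.println(describeGroup(regex, input, 1)); // 12 [6,8)
        System.out.println(describeGroup(regex, input, 2)); // 345 [9,12)
    }
}
```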
Backreferences
The section of the input string matching the capturing group(s) is saved in memory
for later recall via backreference. A backreference is specified in the regular
expression as a backslash (\) followed by a digit indicating the number of the group
to be recalled. For example, the expression (\d\d) defines one capturing group
matching two digits in a row, which can be recalled later in the expression via the
backreference \1.

To match any 2 digits, followed by the exact same two digits, you would use (\d\d)\1 as the
regular expression:

Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.
If you change the last two digits the match will fail:

Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.
For nested capturing groups, backreferencing works in exactly the same way:
Specify a backslash followed by the number of the group to be recalled.

Boundary Matchers
Until now, we've only been interested in whether or not a match is found at some location within
a particular input string. We never cared about where in the string the match was taking place.
You can make your pattern matches more precise by specifying such information with boundary
matchers. For example, maybe you're interested in finding a particular word, but only if it
appears at the beginning or end of a line. Or maybe you want to know if the match is taking
place on a word boundary, or at the end of the previous match.
The following table lists and explains all the boundary matchers.
Boundary Matchers
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
The following examples demonstrate the use of boundary matchers ^ and $. As noted above, ^
matches the beginning of a line, and $ matches the end.
Enter your regex: ^dog$
Enter input string to search: dog
I found the text "dog" starting at index 0 and ending at index 3.

Enter your regex: ^dog$
Enter input string to search:       dog
No match found.

Enter your regex: \s*dog$
Enter input string to search:             dog
I found the text "            dog" starting at index 0 and ending at index 15.

Enter your regex: ^dog\w*
Enter input string to search: dogblahblah
I found the text "dogblahblah" starting at index 0 and ending at index 11.
The first example is successful because the pattern occupies the entire input string. The second
example fails because the input string contains extra whitespace at the beginning. The third
example specifies an expression that allows for unlimited white space, followed by "dog" on the
end of the line. The fourth example requires "dog" to be present at the beginning of a line
followed by an unlimited number of word characters.
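These four cases translate directly into Java. The sketch below assumes a hypothetical found helper and uses three leading spaces for the whitespace cases; the exact number of spaces does not change the outcome. Without the MULTILINE flag, ^ and $ refer to the beginning and end of the whole input.

```java
import java.util.regex.Pattern;

class LineAnchorDemo {

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^dog$", "dog"));           // true: pattern spans the input
        System.out.println(found("^dog$", "   dog"));        // false: leading whitespace
        System.out.println(found("\\s*dog$", "   dog"));     // true: whitespace is allowed
        System.out.println(found("^dog\\w*", "dogblahblah")); // true: word characters follow
    }
}
```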
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a
longer string), just use \b on either side, for example \bdog\b:

Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
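The word-boundary examples can be reproduced as follows. This is a sketch with a hypothetical firstMatchStart helper that returns the start index of the first match, or -1 if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {

    // Returns the start index of the first match, or -1 when nothing matches.
    static int firstMatchStart(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";

        System.out.println(firstMatchStart("\\bdog\\b", dog));    // 4
        System.out.println(firstMatchStart("\\bdog\\b", doggie)); // -1: "dog" runs into "gie"
        System.out.println(firstMatchStart("\\bdog\\B", dog));    // -1: a boundary follows "dog"
        System.out.println(firstMatchStart("\\bdog\\B", doggie)); // 4
    }
}
```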
To require the match to occur only at the end of the previous match, use \G:
Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
Here the second example finds only one match, because the second occurrence of "dog" does not
start at the end of the previous match.
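The effect of \G is easiest to see by counting matches. In the sketch below, countMatches is a hypothetical helper, and the input "dogdog" is an extra case added to show that \G does allow a second match when it begins exactly where the first one ended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreviousMatchDemo {

    // Counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("dog", "dog dog"));    // 2
        System.out.println(countMatches("\\Gdog", "dog dog")); // 1: the space breaks the chain
        System.out.println(countMatches("\\Gdog", "dogdog"));  // 2: matches are contiguous
    }
}
```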
