Using Oracle10g's Regexp_like evaluation operator in place of the traditional Like operator
By John Adolph Palinski
Oracle's new database version 10g includes a powerful new tool for evaluating regular expressions. Regular expressions are a notation for describing textual patterns. This tool is the Regexp_like evaluation operator. It has the ability to perform regular POSIX and Unicode expression matching. POSIX and Unicode tools have been available for years in Unix environments, but not in Oracle. In pre-10g versions, Oracle users had to be content with the simple Like operator pattern matching. No more, Oracle has now given you a great tool to enhance your evaluations.
The Regexp_like operator has a different syntax than the Like operator. The operator is a function that accepts two parameters and returns a Boolean value. The parameters are the target column and the regular expression. The regular expression is enclosed by single quote and can contain a variety of symbols which are discussed in the following sections. The following is a syntax template of the operator:
regexp_like (argument, regular_expression)
To illustrate the operator, the Select statement in Listing 1 identifies sales people whose last name begins with a "W" and ends with an "N". A comparable Like operator pattern would be 'W%N'.
Listing 1 - Using the Regexp_like operator to identify sales people whose last names begin with a "W" and end with an "N"
SQL> select last_name
2 from salesmen
3 where regexp_like (last_name, '^W.*N$');
LAST_NAME
--------------------
WASHINGTON
WASHINGTON
SQL>
The regular expression '^W.*N$' appears strange after using the simple Like characters, but it isn't hard to understand. The carot (^) and dollar sign ($) symbols are anchors. The (^) is the beginning anchor symbol and the ($) is the ending anchor symbol. The period (.) is a one-character wildcard similar to the Like operator underbar (_). The asterisk (*) is a quantifier or repitition operator that causes the associated character to repeat indefinitely. In this case the repeating character is the wildcard character (.). The associated symbol can be an alphbetic letter or any other symbol. The combination of the .* symbols is equivalent to the Like operator % symbol. The Listing 1 regular expression states the following:
^W - Begin the expression with a W character
.* - Ignore an unlimited number of characters
N$ - End the expression with the N character
Thus, the values "WASHINGTON", "WN", "W4N", or "WOODEN" would meet the criteria of the regular expression.
The true power of the Regexp_like operator is its ability to support POSIX (Portable Operating System Interface) character classes. This allows you the ability to test for upper and lower case characters. You can do alternate character evaluations. Test whether a character is a digit or alphabetic character. You can even test whether the character is within a specific range. To further illustrate the versatility of the Regexp_like operator, Listing 2 expands the regular expression seen in Listing 1. The alternation operator (|) is added to the initial anchor value. This alternation operator now searches for last names that begin with a "W" or "V".
Listing 2 - Using the Regexp_like operator to identify sales people whose last names begin with a "W" or "V" and end with an "N"
SQL> select last_name
2 from salesmen
3 where regexp_like (last_name, '^(W|V).*N$');
LAST_NAME
--------------------
WASHINGTON
VAN BUREN
WASHINGTON
SQL>
The following tables describe the various POSIX regualar expression characters. You have already seen examples of the carot ^ and $ anchor characters described in Table 1.
Table 1 - Anchoring characters
| Character |
Description |
| ^ |
Anchors the following character to the beginning of the expression |
| $ |
Anchors the preceding character to the end of the expression |
Table 2 contains the asterisk * repetition operator that was used in Listings 1 and 2. The table also describes other repetition operators that have a more finite scope. The question mark (?) operator has a scope of 0 to 1 times. The plus symbol (+) has a scope of at least 1. The bracket {} operator allows you to specify exactly the number of matches that are to occur.
Table 2 - Quantifiers, or Repetition Operators
| Character |
Description |
| * |
Causes the preceding pattern character to be evaluated and matched 0 or more times |
| ? |
Causes the preceding pattern character to be evaluated and matched 0 or 1 addition time |
| + |
Causes the preceding pattern character to be evaluated and matched 1 or more times |
| {m} |
Causes the preceding pattern character to be evaluated and matched exactly the specified (m) number of times |
| {m,} |
Causes the preceding pattern character to be evaluated and matched at least the number of specified (m) times |
| {m.n} |
Causes the preceding pattern character to be evaluated and matched at least the number of specified (m) times and no more than the specified (n) number of times |
Listing 3 illustrates another of the repitition operators. The Regexp_like operator is used to identify sales people that have combinations of the F and L characters in their last name. The wild card (.) and asterisk (*) symbols begin the expression. The symbols essentially mean ignore any character until an F or L is encountered. The bracketed {2} indicates that the target characters "F" or "L" must be repeated. In other words, combinations of "FF", "FL", "LF", or "LL" must be met for the evaluation to be true. Finally, the expression is completed with the wild card (.) and asterisk (*) symbols. These symbols state that any characters following the tested symbols should be ignored.
Listing 3 - Using the Regexp_like operator to identify sales people with an "FF", "FL", "LF", or "LL" in their last name
SQL> select last_name
2 from salesmen
3 where regexp_like (last_name, '.*(F|L){2}.*');
LAST_NAME
--------------------
JEFFERSON
FILLMORE
SQL>
Table 3 descibes a set of Character classes that allow you to test the character type. You can test whether the character is uppercase or lowercase. You can also test to determine whether the character is alphabetic or a digit.
Table 3 - Predefined POSIX Character Classes
| Character Class |
Description |
| [:alpha:] |
Tests for alphabetic characters such as a, B, c. |
| [:lower:] |
Tests for lowercase alphabetic characters such as a, b, c. |
| [:upper:] |
Tests for uppercase alphabetic characters such as A, B, C. |
| [:digit:] |
Tests for numeric digits such as 1, 2, 3 |
| [:alnum:] |
Tests for alphanumeric characters such as a, B, c, 1, 2, 3. |
| [:space:] |
Tests for space characters such as carriage return, newline, vertical tab, and form feed. |
| [:punct:] |
Tests for punction characters |
| [:cntrl:] |
Tests for control characters |
| [:print:] |
Tests for printable characters |
Character classes are contained in a character list. The character list is enclosed by a pair of brackets ([]). In addition, the Character class is enclosed by brackets. The following is a syntax template:
[[character class 1] [character class 2] ...]
Listing 4 illustrates the use of the digit Character class to determine sales people whose address has four or more consecutive digits. Notice that the regular expression does not contain an anchor. This means that the tested characters can exist anywhere in the expression, not necessarily at the beginning.
Listing 4 - Using the Regexp_like operator to identify sales people with four digit addresses
SQL> select last_name, address
2 from salesmen
3 where regexp_like (address, '[[:digit:]]{4}');
LAST_NAME ADDRESS
-------------------- -------------------------
WASHINGTON 1629 S. 47TH ST.
ADAMS 2742 HELMUT RD.
ROSS 4501 RODEO DRIVE
MONROE 6570 HIGHLAND RD.
POLK 29809 W. LOWE CIRCLE
FILLMORE 9645 BLONDO RD.
PIERCE 8730 CREST ST.
BUCHANAN 6739 MAIN ST.
JOHNSON 1908 SPAULDING RD.
9 rows selected.
SQL>
Listing 5 increments the Listing 4 repetitive operator. The Regexp_like operator is now evaluating sales people addresses searching for those that have five consecutive digits. Notice that Polk is the only employee now displayed.
Listing 5 - Using the Regexp_like operator to identify sales people with five consecutive digits in the address
SQL> select last_name, address
2 from salesmen
3 where regexp_like (address, '[[:digit:]]{5}')
LAST_NAME ADDRESS
-------------------- -------------------------
POLK 29809 W. LOWE CIRCLE
SQL>
Listing 6 further tests the Character class. The Select statement in this listing was modified. The [:digit:] class was replaced with the [:alpha:] class. In addition, the Character class is anchored. Anchoring the character class causes the operator to evaluate whether the first four expression characters in the target Address column are alphabetic. Each Address value in the table actually begins with two or more digits. Thus, the result of the Select statement is an empty result set. There are no sales people addresses that begin with an alphabetic character. Notice that the beginning anchor is outside the character list brackets.
Listing 6 - Using the Regexp_like operator to identify sales people whose adresses begin with an alphabetic character
SQL> select last_name, address
2 from salesmen
3 where regexp_like (address, '^[[:alpha:]]{4}');
no rows selected
One more example should help you understand the Character class. The goal of the Listing 7 Select statement is to identify the Address values that begin with two digits followed by a space. In order to accomplish this task, the regular expression has the following components:
- A beginning anchor tag (^)
- Three character lists containing a Character class.
- A [:digit:] Character class. This symbol along with the anchor causes Oracle to look for a digit in the first position.
- A second [:digit:] Character class. Since there is no repitition character, Oracle is looking for a digit in the second position.
- A [:space:] Character class. Oracle is looking for a space in the third position.
Listing 7 - Using the Regexp_like operator and Character classes to identify two digit address numbers
SQL> Select last_name, address
2 from salesmen
3 where regexp_like (address, '^[[:digit:]][[:digit:]][[:space:]]');
LAST_NAME ADDRESS
-------------------- -------------------------
JEFFERSON 17 WATERLOO RD.
PITCHER 34 WASHINGTON ST.
TAYLOR 87 SHERMAN #5
SQL>
Some things to remember about Character classes are:
- Character classes are enclosed by double brackets [[ ]]. The outer bracket is the character list.
- Character classes are a position indicator just as the . (wildcard) is a position indicator.
- Character classes are interchangeable with wildcard characters or literal characters.
Table 4 contains the last set of characters. You have seen the alternate symbol (|) in Listing 3. Its use in that Select statement allowed for the selection of Washington and Van Buren records. The two separated values were enclosed by the group characters. These characters consist of a set of parentheses (). The group characters can also be used to define multiple characters. For example, the expression '(ssi){2}' would identify the value Mississippi since the characters "ssi" are repeated.
Table 4 - Alternate Matching and Grouping of Expressions
| Character |
Description |
| | |
Separates alternatives which are used with the grouping operator ( ). It essentially means Or. |
| () |
Groups a set of characters for alternatives or repetition. |
| [char] |
Indicates a character list; most characters inside a list are understood as literals, with the exception of character classes, and the ^ and - characters. When the ^ symbol is used in this context it is consider a negation symbol. The hyphen (-) symbol in this context indicates a range (i.e. 3 - 7). |
The Table 4 [char] character requires some explanation. The brackets indicate a character list. The brackets have been used in many of the preceding Character class examples. However, the brackets can also contain settings other than Character classes. Two important symbols are the carot (^) and hyphen (-).
Carots (^) placed inside a character list are not anchor symbols. They are the negate symbol. Review Listing 8. It is a modification of Listing 6. If you remember, the purpose of the Listing 5 Select statement is to identify addresses that begin with four alphabetic characters. The beginning anchor (^) symbol was used along with the [:alpha:] character class. The result was an empty result set since none of the target addresses begin with an alphabetic character. In Listing 8, the carot (^) symbol was added to the character list. It is the second carot in the expression. It negated the expression. The Regexp_like operator is now searching for addesses that do not have alphabetic characters in the first four positions. The Select statement now returns a result set.
Listing 8 - Using the Negate carot to identify addresses that do not begin with alphabetic characters
SQL> select last_name, address
2 from salesmen
3 where regexp_like (address, '^[^[:alpha:]]{4}');
LAST_NAME ADDRESS
-------------------- -------------------------
WASHINGTON 1629 S. 47TH ST.
ADAMS 2742 HELMUT RD.
ROSS 4501 RODEO DRIVE
MADISON 444 N. 16TH PLAZA
MONROE 6570 HIGHLAND RD.
ADAMS 817 CARPENTER BLVD.
JACKSON 982 N. 197TH ST
VAN BUREN 419 E. HAMPTON RD.
HARRISON 496 S. 67TH ST.
TYLER 689 BELLEVUE BLVD.
POLK 29809 W. LOWE CIRCLE
WASHINGTON 907 DREXEL ST.
FILLMORE 9645 BLONDO RD.
PIERCE 8730 CREST ST.
BUCHANAN 6739 MAIN ST.
LINCOLN 438 KRUG CIRCLE
JOHNSON 1908 SPAULDING RD.
17 rows selected.
SQL>
A hyphen (-) is used in the character list to indicate a range of values. Listing 9 illustrates this symbol. It is used in a character list to delineate the characters K and R. In essence, it is stating that any character between K and R will match the expression. The character list was anchored to the beginning of the regular expression with a carot (^). The result is the Regexp_like operator is searching for Last_name values that begin with any letter between K and R.
Listing 9 - Using the range (-) symbol to identify sales people whose names begin with characters between K and R
SQL> select last_name
2 from salesmen
3 where regexp_like (last_name, '^[K-R]');
LAST_NAME
--------------------
ROSS
MADISON
MONROE
PITCHER
POLK
PIERCE
LINCOLN
7 rows selected.
SQL>
I hope that you have seen the power of the new Regexp_like operator. It should be a very good tool set addition.
John Adolph Palinski is a former adjunct faculty member at the University of Nebraska at Omaha and Iowa Western Community College. He is the author of the "Oracle Database Construction Kit", Que, 1997, the "Oracle SQL and PL/SQL Handbook: A Guide for Data Administrators, Developers, and Business Analysts", Addison Wesley Longman, 2002, and "Oracle 9i Developer: Developing Web Applications With Forms Builder", Course Technology, 2003. He has also written numerous articles for periodicals such as Oracle Internals and Oracle Professional. He provides Oracle training and consulting services. One of his passions is to help customers set up logical business models to improve information efficiency and effectiveness. He may be contacted at www.oracle-trainer.com.