Using Regular Expressions with PHP

July 24, 2013

PHP Provides Three Sets of Regular Expression Functions

PHP is an open source language for producing dynamic web pages. PHP has three sets of functions that allow you to work with regular expressions.

The most important set of regex functions starts with preg. These functions are a PHP wrapper around the PCRE library (Perl-Compatible Regular Expressions). Anything said about the PCRE regex flavor in the regular expression tutorial on this website applies to PHP's preg functions. You should use the preg functions for all new PHP code that uses regular expressions. PHP includes PCRE by default as of PHP 4.2.0 (April 2002).

The oldest set of regex functions is the one whose names start with ereg. They implement POSIX Extended Regular Expressions, like the traditional UNIX egrep command. These functions exist mainly for backward compatibility with PHP 3. They were officially deprecated as of PHP 5.3.0 and removed in PHP 7.0.0. Many of the more modern regex features such as lazy quantifiers, lookaround and Unicode are not supported by the ereg functions. Don't let the "extended" moniker fool you. The POSIX standard was defined in 1986, and regular expressions have come a long way since then.

The last set is a variant of the ereg set, prefixing mb_ for "multibyte" to the function names. While ereg treats the regex and subject string as a series of 8-bit characters, mb_ereg can work with multi-byte characters from various code pages. If you want your regex to treat Far East characters as individual characters, you'll either need to use the mb_ereg functions, or the preg functions with the /u modifier. mb_ereg is available in PHP 4.2.0 and later. It uses the same POSIX ERE flavor.

The preg Function Set

All of the preg functions require you to specify the regular expression as a string using Perl syntax. In Perl, /regex/ defines a regular expression. In PHP, this becomes preg_match('/regex/', $subject). When forward slashes are used as the regex delimiter, any forward slashes in the regular expression have to be escaped with a backslash. So http://www\.zjjv\.com/ becomes '/http:\/\/www\.zjjv\.com\//'. Just like Perl, the preg functions allow any character other than a letter, digit, backslash or whitespace as the regex delimiter. The URL regex would be more readable as '%http://www\.zjjv\.com/%' using percentage signs as the regex delimiters, since then you don't need to escape the forward slashes. You would have to escape percentage signs if the regex contained any.
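As a minimal sketch of the delimiter choice described above, the snippet below runs the same URL regex with two different delimiters. The subject string and the www.example.com domain are placeholders chosen for illustration, not taken from the text.

```php
<?php
// Hypothetical subject string; www.example.com is a placeholder domain.
$subject = 'Visit http://www.example.com/ for details.';

// Forward slashes as delimiters: every literal slash must be escaped.
$withSlashes = preg_match('/http:\/\/www\.example\.com\//', $subject);

// Percentage signs as delimiters: the slashes can stay unescaped.
$withPercent = preg_match('%http://www\.example\.com/%', $subject);

// Both calls test the same regex, so both report a match.
var_dump($withSlashes, $withPercent);
```

Both patterns describe the same regex. Only the amount of escaping differs.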

Unlike programming languages like C# or Java, PHP does not require all backslashes in strings to be escaped. If you want to include a backslash as a literal character in a PHP string, you only need to escape it if it is followed by another character that needs to be escaped. In single-quoted strings, only the single quote and the backslash itself need to be escaped. That is why in the above regex, I didn't have to double the backslashes in front of the literal dots. The regex \\ to match a single backslash would become '/\\\\/' as a PHP preg string. Unless you want to use variable interpolation in your regular expression, you should always use single-quoted strings for regular expressions in PHP, to avoid messy duplication of backslashes.
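The backslash rules can be checked with a short sketch. The subject strings here are made up for the example.

```php
<?php
// In a single-quoted string, '\\' is one literal backslash,
// so $path holds the 7 characters C:\temp
$path = 'C:\\temp';

// The regex \\ matches one backslash. Written as a single-quoted
// PHP string, each of its two backslashes has to be doubled.
$count = preg_match_all('/\\\\/', $path);

// \. needs no doubling: PHP leaves a backslash before a dot alone.
$hasDot = preg_match('/\.txt$/', 'notes.txt');

var_dump(strlen($path), $count, $hasDot);
```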

Regex matching options such as case insensitivity are specified in the same way as in Perl. '/regex/i' applies the regex case insensitively. '/regex/s' makes the dot match all characters, including line breaks. '/regex/m' makes the start and end of line anchors match at embedded newlines in the subject string. '/regex/x' turns on free-spacing mode. You can specify multiple letters to turn on several options. '/regex/misx' turns on all four options.
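The effect of each option can be seen in a small sketch like the following. The two-line subject string is an assumption made up for the example.

```php
<?php
$text = "First line\nsecond LINE";

// /i: case insensitive, so "line" also matches "LINE".
$i = preg_match_all('/line/i', $text);

// /s: the dot also matches the newline, so .* can span both lines.
$s   = preg_match('/First.*second/s', $text);
$noS = preg_match('/First.*second/', $text);

// /m: ^ also matches right after the embedded newline.
$m   = preg_match('/^second/m', $text);
$noM = preg_match('/^second/', $text);

// /x: whitespace and # comments inside the pattern are ignored.
$x = preg_match('/ first \s line   # two words
                 /ix', $text);

var_dump($i, $s, $noS, $m, $noM, $x);
```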

A special option is /u, which turns on the Unicode matching mode instead of the default 8-bit matching mode. You should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or scripts. PHP will interpret both the pattern and the subject of '/regex/u' as UTF-8 strings rather than as strings of 8-bit characters.
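A short sketch of the difference /u makes, assuming a UTF-8 subject. The subject is written with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// The letter é encoded as UTF-8: two bytes, C3 A9.
$subject = "\xC3\xA9";

// Without /u the dot matches a single byte, so ^.$ cannot
// cover the two-byte string.
$bytes = preg_match('/^.$/', $subject);

// With /u the subject is read as UTF-8 and the dot matches
// the whole character.
$chars = preg_match('/^.$/u', $subject);

// \p{L} matches any Unicode letter when /u is in effect.
$letter = preg_match('/^\p{L}$/u', $subject);

var_dump($bytes, $chars, $letter);
```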

Like the ereg function, int preg_match (string pattern, string subject [, array groups]) tests whether the regular expression pattern matches the subject string or part of the subject string. It returns 1 if the pattern matches, 0 if it does not, and FALSE if an error occurred. If you specify the third parameter, preg_match will store the substring matched by the first capturing group in $groups[1]. $groups[2] will contain the text matched by the second group, and so on. If the regex pattern uses named capture, you can access the groups by name with $groups['name']. $groups[0] will hold the overall match.
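A minimal sketch of reading the groups array, using a made-up date string and a pattern with one named and two numbered groups:

```php
<?php
$subject = 'Released on 2013-07-24.';
$pattern = '/(?<year>\d{4})-(\d{2})-(\d{2})/';

if (preg_match($pattern, $subject, $groups)) {
    $whole = $groups[0];       // overall match
    $year  = $groups['year'];  // named group, also stored in $groups[1]
    $month = $groups[2];       // second capturing group
    $day   = $groups[3];       // third capturing group
    echo "$whole => $year / $month / $day\n";
}
```

A named group still counts as a numbered group, which is why the month ends up in $groups[2] rather than $groups[1].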

int preg_match_all (string pattern, string subject, array matches, int flags) fills the array "matches" with all the matches of the regular expression pattern in the subject string. If you specify PREG_SET_ORDER as the flag, then $matches[0] is an array containing the match and backreferences of the first match, just like the $groups array filled by preg_match. $matches[1] holds the results for the second match, and so on. If you specify PREG_PATTERN_ORDER, then $matches[0] is an array with full consecutive regex matches, $matches[1] an array with the first backreference of all matches, $matches[2] an array with the second backreference of each match, etc.
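The two orderings are easiest to compare side by side. The sketch below uses an invented subject string and a pattern with two capturing groups.

```php
<?php
$subject = 'a=1, b=2';
$pattern = '/(\w)=(\d)/';

// One sub-array per match.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet[0] is ['a=1', 'a', '1'], $bySet[1] is ['b=2', 'b', '2']

// One sub-array per group.
preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] is ['a=1', 'b=2'], [1] is ['a', 'b'], [2] is ['1', '2']

print_r($bySet);
print_r($byPattern);
```

PREG_SET_ORDER tends to be handier when looping over matches one at a time, while PREG_PATTERN_ORDER suits cases where only one group's values are needed.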

array preg_grep (string pattern, array subjects) returns an array that contains all the strings in the array "subjects" that can be matched by the regular expression pattern.
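As a quick illustration, with a hypothetical list of file names:

```php
<?php
$files = ['index.php', 'style.css', 'lib.php', 'README'];

// Keep only the strings that the regex can match.
$phpFiles = preg_grep('/\.php$/', $files);

// The original array keys are preserved:
// [0 => 'index.php', 2 => 'lib.php']
print_r($phpFiles);
```

Because the keys are preserved, wrap the result in array_values() if you need a consecutively indexed list.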

mixed preg_replace (mixed pattern, mixed replacement, mixed subject [, int limit]) returns a string with all matches of the regex pattern in the subject string replaced with the replacement string. At most limit replacements are made. One key difference from ereg_replace is that all parameters, except limit, can be arrays instead of strings. In that case, preg_replace does its job multiple times, iterating over the elements in the arrays simultaneously. You can also use strings for some parameters, and arrays for others. Then the function will iterate over the arrays, and use the same strings for each iteration. Using arrays for the pattern and replacement allows you to perform a sequence of search-and-replace operations on a single subject string. Using an array for the subject allows you to perform the same search-and-replace operation on many subject strings.
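The array forms might be used along these lines. The patterns and subject strings are invented for the example.

```php
<?php
// Arrays for pattern and replacement: the pairs are applied in
// sequence to a single subject. First collapse runs of whitespace
// into one space, then drop a leading or trailing space.
$clean = preg_replace(
    ['/\s+/', '/^ | $/'],
    [' ', ''],
    "  too   many\tspaces  "
);

// Array for the subject: the same replacement is applied to each
// element and an array is returned.
$masked = preg_replace('/\d+/', '#', ['a1', 'b22', 'c']);

// The limit parameter caps the number of replacements per subject.
$limited = preg_replace('/a/', 'b', 'aaa', 2);

var_dump($clean, $masked, $limited);
```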

preg_replace_callback (mixed pattern, callback replacement, mixed subject [, int limit]) works just like preg_replace, except that the second parameter takes a callback instead of a string or an array of strings. The callback function will be called for each match. The callback should accept a single parameter. This parameter will be an array of strings, with element 0 holding the overall regex match, and the other elements the text matched by capturing groups. This is the same array you'd get from preg_match. The callback function should return the text that the match should be replaced with. Return an empty string to delete the match. Return $groups[0] to skip this match.

Callbacks allow you to do powerful search-and-replace operations that you cannot do with regular expressions alone. E.g. if you search for the regex (\d+)\+(\d+), you can replace 2+3 with 5 using the callback:

function regexadd($groups) {
    // $groups[1] and $groups[2] hold the digits captured by the two (\d+) groups
    return $groups[1] + $groups[2];
}
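To show the callback in use, here is a minimal sketch that passes it to preg_replace_callback. The function is repeated so the snippet runs on its own, with an explicit string cast added, and the subject string is made up.

```php
<?php
// Same callback as above, repeated so this snippet is self-contained.
function regexadd($groups) {
    return (string) ($groups[1] + $groups[2]);
}

// The callback can be passed by name as a string.
$result = preg_replace_callback(
    '/(\d+)\+(\d+)/',
    'regexadd',
    'The sum 2+3 is easy; so is 10+20.'
);

echo $result, "\n"; // The sum 5 is easy; so is 30.
```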

array preg_split (string pattern, string subject [, int limit]) works just like split, except that it uses the Perl syntax for the regex pattern.
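A small sketch of preg_split with and without a limit, using an invented comma-separated string:

```php
<?php
$list = 'red , green,blue ,  yellow';

// Split on commas, swallowing any whitespace around them.
$parts = preg_split('/\s*,\s*/', $list);
// ['red', 'green', 'blue', 'yellow']

// With a limit of 2, the last element holds the unsplit remainder.
$limited = preg_split('/\s*,\s*/', $list, 2);
// ['red', 'green,blue ,  yellow']

print_r($parts);
print_r($limited);
```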

See the PHP manual for more information on the preg function set

The ereg Function Set

The ereg functions require you to specify the regular expression as a string, as you would expect. ereg('regex', "subject") checks if regex matches subject. You should use single quotes when passing a regular expression as a literal string. Several special characters like the dollar and backslash are also special characters in double-quoted PHP strings, but not in single-quoted PHP strings.

int ereg (string pattern, string subject [, array groups]) returns the length of the match if the regular expression pattern matches the subject string or part of the subject string, or zero otherwise. Since zero evaluates to False and non-zero evaluates to True, you can use ereg in an if statement to test for a match. If you specify the third parameter, ereg will store the substring matched by the part of the regular expression between the first pair of round brackets in $groups[1]. $groups[2] will contain the second pair, and so on. Note that grouping-only round brackets are not supported by ereg. ereg is case sensitive. eregi is the case insensitive equivalent.

string ereg_replace (string pattern, string replacement, string subject) replaces all matches of the regex pattern in the subject string with the replacement string. You can use backreferences in the replacement string. \\0 is the entire regex match, \\1 is the first backreference, \\2 the second, etc. The highest possible backreference is \\9. ereg_replace is case sensitive. eregi_replace is the case insensitive equivalent.

array split (string pattern, string subject [, int limit]) splits the subject string into an array of strings using the regular expression pattern. The array will contain the substrings between the regular expression matches. The text actually matched is discarded. If you specify a limit, the resulting array will contain at most that many substrings. The subject string will be split at most limit-1 times, and the last item in the array will contain the unsplit remainder of the subject string. split is case sensitive. spliti is the case insensitive equivalent.

See the PHP manual for more information on the ereg function set

The mb_ereg Function Set

The mb_ereg functions work exactly the same as the ereg functions, with one key difference: while ereg treats the regex and subject string as a series of 8-bit characters, mb_ereg can work with multi-byte characters from various code pages. E.g. encoded with Windows code page 936 (Simplified Chinese), the word 中国 ("China") consists of four bytes: D6D0B9FA. Using the ereg function with the regular expression . on this string would yield the first byte D6 as the result. The dot matched exactly one byte, as the ereg functions are byte-oriented. Using the mb_ereg function after calling mb_regex_encoding("CP936") would yield the bytes D6D0 or the first character as the result.

To make sure your regular expression uses the correct code page, call mb_regex_encoding() to set the code page. If you don't, the code page returned by or set by mb_internal_encoding() is used instead.

If your PHP script uses UTF-8, you can use the preg functions with the /u modifier to match multi-byte UTF-8 characters instead of individual bytes. The preg functions do not support any other code pages.

See the PHP manual for more information on the mb_ereg function set

Further Reading

Mastering Regular Expressions

The book Mastering Regular Expressions explains everything you want to know, and don't want to know, about regular expressions. It also has an excellent chapter on PHP's preg function set, with details on the underlying PCRE regex engine and plenty of example PHP code showing more advanced techniques. The book does not cover the ereg and mb_ereg function sets.

My review of the book Mastering Regular Expressions

Page last updated: 03 October 2012

Site last updated: 17 June 2013

Copyright © 2003-2013 Jan Goyvaerts. All rights reserved.