COMP284 Scripting Languages
Lecture 6: PHP (Part 4)
Handouts
Ullrich Hustadt
Department of Computer Science
School of Electrical Engineering, Electronics, and Computer Science
University of Liverpool
Contents
1 Regular Expressions
Overview
PCRE
Characters
Escape sequence and meta-characters
Character classes
Anchors
Quantifiers
Capturing subpatterns
Alternations
Modifiers
2 PCRE functions
COMP284 Scripting Languages Lecture 6 Slide L6 1
Regular Expressions Overview
PHP and regular expressions
Input validation is an important step in preventing incorrect or
dangerous data entering an application
Regular expression matching is a useful technique for input validation
Regular expressions are also useful for
data extraction
data cleaning and data transformation
Over time, PHP has supported different variants of regular expressions
POSIX-extended regular expressions
→ deprecated in PHP 5.3.0, removed in PHP 7.0.0
Shell-style wildcard patterns
→ still supported, but not particularly powerful
Perl-compatible regular expressions (PCRE)
→ closely resemble Perl regular expressions
Perl-compatible regular expressions
In PHP, a regular expression is a string the content of which resembles a
Perl regular expression
'/(\d+\.\d+) seconds/'    "/[bc][b-e][^bcd]/"
The character / acts as a delimiter indicating the start and end of the
regular expression
→ carried over from Perl
→ the ‘closing’ delimiter can be followed by modifiers
In PHP any non-alphanumeric, non-backslash, non-whitespace character
can be used as delimiter
If the delimiter also occurs inside the pattern it must be escaped using
a backslash \
Backslash behaviour and differences between single- and double-quoted
strings carry over
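The delimiter and quoting rules above can be illustrated with a minimal sketch. The URL and the variable names are made up for this example and are not part of the lecture.

```php
<?php
// Hypothetical target string, chosen only for illustration.
$url = "https://example.org/docs/index.html";

// With / as delimiter, every / inside the pattern needs a backslash.
$withSlash = preg_match('/^https:\/\/[\w.]+\/docs\//', $url);

// With # as delimiter, the slashes can be written as they are.
$withHash = preg_match('#^https://[\w.]+/docs/#', $url);

// Single quotes pass \d through unchanged; in double quotes a literal
// backslash is most safely written as \\.
preg_match('/(\d+\.\d+) seconds/', "Delay 10.486 seconds", $m1);
preg_match("/(\\d+\\.\\d+) seconds/", "Delay 10.486 seconds", $m2);

var_dump($withSlash, $withHash);   // int(1) int(1)
var_dump($m1[1] === $m2[1]);       // bool(true)
```

Choosing a delimiter that does not occur in the pattern usually keeps the regular expression easier to read.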
Regular expressions: Characters
The simplest regular expression just consists of a sequence of
alphanumeric characters and
non-alphanumeric characters escaped by a backslash
that matches exactly this sequence of characters occurring as a substring
in the target string
"/cbc/"
matches "ababcbcdcde"
'/1\+2/'
matches "21+23=44"
Strictly speaking, not all non-alphanumeric characters need to be
escaped by a backslash, only those that are ‘reserved’
Regular expressions: Escape sequences and
meta-characters
There are various meta-characters and escape sequences that match
characters other than themselves:
. Matches any character except \n
\w Matches a ‘word’ character (alphanumeric plus ‘_’)
\W Matches a non-‘word’ character
\s Matches a whitespace character
\S Matches a non-whitespace character
\d Matches a decimal digit character
\D Matches a non-digit character
"/\w.\d/"
matches "-N$1.00"
'/\W.\D/'
matches "-N$1.00"
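To see which part of the target string each of the two patterns matches, one can inspect index 0 of the match array filled in by preg_match. This is only a small sketch; the variable names are arbitrary.

```php
<?php
// Single quotes, so that $1 is certainly not read as a variable.
$target = '-N$1.00';

preg_match('/\w.\d/', $target, $m1);   // word char, any char, digit
preg_match('/\W.\D/', $target, $m2);   // non-word char, any char, non-digit

echo $m1[0], "\n";   // N$1
echo $m2[0], "\n";   // -N$
```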
Regular expressions: Character class
A character class, a list of characters and escape sequences enclosed in
square brackets, matches any single character from within the class,
for example, [ad\t\n\-\\09]
One may specify a range of characters with a hyphen - ,
for example, [b-u]
A caret ^ at the start of a character class negates/complements it,
that is, it matches any single character that is not from within the class,
for example, [^01a-z]
"/[bc][b-e][^bcd]/"
matches "abacdecdc"
'/[][1-9]\.[0-9]/'
matches "$2.1"
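The following sketch tries out a plain class, a range and a negated class. The target strings 'abacus' and 'a0b1Xc' are invented for the illustration.

```php
<?php
// First position where [bc], then [b-e], then a character outside bcd follow each other.
preg_match('/[bc][b-e][^bcd]/', 'abacdecdc', $m1);

// Every single character in the range b to u.
preg_match_all('/[b-u]/', 'abacus', $m2);

// First character that is neither 0, 1 nor a lower-case letter.
preg_match('/[^01a-z]/', 'a0b1Xc', $m3);

echo $m1[0], "\n";                 // cde
echo implode(',', $m2[0]), "\n";   // b,c,u,s
echo $m3[0], "\n";                 // X
```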
Regular expressions: Anchors / assertions
Anchors (assertions) allow us to fix where a match has to start or end
\A Match only at string start
^ Match only at string start (default)
Match only at a line start (in multi-line matching)
\Z Match only at string end modulo a preceding \n
\z Match only at string end
$ Match only at string end modulo a preceding \n
Match only at a line end (in multi-line matching)
\b Match word boundary (between \w and \W)
\B Match except at word boundary
Target string "one two\nthree four\nfive\n"
Do match:
'/\Aone/' (at string start)
'/^three/m' (at line start)
'/four$/m' (at line end)
'/five\Z/' (at string end mod \n)
'/\bone\b/' (a word in string)
Do not match:
'/^three/' (not at string start)
'/four\Z/' (not at string end mod \n)
'/five\z/' (not at string end)
'/\bour\b/' (not a word in string)
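The anchor examples can be checked with a short loop over the patterns. This is a sketch only; the array name $results is an arbitrary choice.

```php
<?php
$target = "one two\nthree four\nfive\n";

$patterns = [
    '/\Aone/', '/^three/', '/^three/m', '/four\Z/', '/four$/m',
    '/five\z/', '/five\Z/', '/\bone\b/', '/\bour\b/',
];

$results = [];
foreach ($patterns as $p) {
    // preg_match returns 1 for a match and 0 for no match
    $results[$p] = preg_match($p, $target);
    printf("%-12s %s\n", $p, $results[$p] ? 'matches' : 'does not match');
}
```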
Quantifiers
The constructs for regular expressions that we have seen so far are not
sufficient to match, for example, natural numbers of arbitrary size
Also, writing a regular expression for, say, a nine digit number
would be tedious
Quantifiers solve both problems
regexpr* (Greedily) Match regexpr 0 or more times
regexpr+ (Greedily) Match regexpr 1 or more times
regexpr? (Greedily) Match regexpr 1 or 0 times
regexpr{n} Match regexpr exactly n times
regexpr{n,} (Greedily) Match regexpr at least n times
regexpr{n,m} (Greedily) Match regexpr between n and m times
Quantifiers are greedy by default and match the longest leftmost sequence
of characters possible
"/\d+\.\d+/"
matches "Delay 10.486 sec" (matched substring: 10.486)
'/\d+/'
matches "No 12 345 6789" (matched substring: 12)
'/[A-Z]0{2}\d{6}/'
matches "ID E004813709" (matched substring: E00481370)
'/\d+/'
matches "ID E004813709" (matched substring: 004813709)
These examples illustrate that the regular expression \d+
matches as early as possible
matches as many digits as possible
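Combined with the anchors \A and \z, quantifiers can be used to validate a whole string, for instance a natural number of arbitrary size or a nine digit number. The two function names below are hypothetical helpers written for this sketch; they are not PHP built-ins.

```php
<?php
// Hypothetical helper names, used only for this illustration.
function isNaturalNumber(string $s): bool {
    // \A and \z tie the match to the whole string, so only
    // one or more digits are accepted.
    return preg_match('/\A\d+\z/', $s) === 1;
}

function isNineDigitNumber(string $s): bool {
    // {9} avoids writing \d nine times.
    return preg_match('/\A\d{9}\z/', $s) === 1;
}

var_dump(isNaturalNumber('20191228'));     // bool(true)
var_dump(isNaturalNumber('12a'));          // bool(false)
var_dump(isNineDigitNumber('004813709'));  // bool(true)
var_dump(isNineDigitNumber('4813709'));    // bool(false)
```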
There are situations where greedy matching is not appropriate,
for example, matching a single comment in a program
'=/\*.*\*/='
matches "x = /* one */ y*z; /* three */" (matched substring: /* one */ y*z; /* three */)
By adding ? to a quantifier, we can make it match lazily, that is,
it will match the shortest leftmost sequence of characters possible
'=/\*.*?\*/='
matches "x = /* one */ y*z; /* three */" (matched substring: /* one */)
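The difference between the greedy and the lazy pattern for comments can be made visible as follows. The use of preg_match_all at the end is an addition to the example, included to show that the lazy pattern finds each comment separately.

```php
<?php
$code = 'x = /* one */ y*z; /* three */';

preg_match('=/\*.*\*/=',  $code, $greedy);   // .* takes as much as possible
preg_match('=/\*.*?\*/=', $code, $lazy);     // .*? takes as little as possible
preg_match_all('=/\*.*?\*/=', $code, $all);  // all lazy matches

echo $greedy[0], "\n";              // /* one */ y*z; /* three */
echo $lazy[0], "\n";                // /* one */
echo implode(' | ', $all[0]), "\n"; // /* one */ | /* three */
```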
(Lazy) Quantifiers
regexpr*? Lazily match regexpr 0 or more times
regexpr+? Lazily match regexpr 1 or more times
regexpr?? Lazily match regexpr 1 or 0 times
regexpr{n,}? Lazily match regexpr at least n times
regexpr{n,m}? Lazily match regexpr between n and m times
"/\d+?\.\d+?/"
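matches "Delay 10.486 sec" (matched substring: 10.4)
'/\d+?/'
matches "No 12 345 6789" (matched substring: 1)
'/[A-Z]0{2}\d{6}/'
matches "ID E004813709" (matched substring: E00481370)
'/\d+?/'
matches "ID E004813709" (matched substring: 0)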
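A side-by-side comparison of greedy and lazy quantifiers on the same target strings might look as follows. It is a sketch with arbitrary variable names.

```php
<?php
$delay = 'Delay 10.486 sec';

preg_match('/\d+\.\d+/',   $delay, $greedy);
// The first \d+? still has to reach the dot, so it takes "10";
// the second \d+? is satisfied with a single digit.
preg_match('/\d+?\.\d+?/', $delay, $lazy);

preg_match('/\d+/',  'No 12 345 6789', $greedyDigits);
preg_match('/\d+?/', 'No 12 345 6789', $lazyDigits);

echo $greedy[0], ' vs ', $lazy[0], "\n";              // 10.486 vs 10.4
echo $greedyDigits[0], ' vs ', $lazyDigits[0], "\n";  // 12 vs 1
```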
Regular expressions: Capturing subpatterns and
backreferences
We often encounter situations where we want to identify the repetition
of the same or similar text, for example, in HTML markup:
<strong> ... </strong>
<h1> ... </h1>
We might also not just be interested in the repeating text itself,
but the text between or outside the repetition
We can characterise each individual example above
using regular expressions:
'=<strong>.*?</strong>='
'=<h1>.*?</h1>='
but we cannot characterise both without losing fidelity, for example:
'=<\w+>.*?</\w+>='
does not capture the ‘pairing’ of HTML tags
The solution is to use capturing subpatterns and backreferences
(regexpr) creates a capturing subpattern
(opening parentheses are counted starting from 1
to give an index number)
(?<name>regexpr) creates a named capturing subpattern
(?:regexpr) creates a non-capturing subpattern
\N, \gN, \g{N} backreference to a capturing subpattern N
(where N is a natural number)
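\g{name} backreference to a named capture group
'=<(\w+)>.*?</\1>='
matches "<li><b>item</b></li>"
'=<(?<c1>\w+)>.*?</\g{c1}>='
matches "<li><b>item</b></li>"
'=<(\w+)>.*?</\1>='
does not match "<li>item</em>"
(single quotes are used here because in a double-quoted PHP string
\1 is an octal escape sequence, not a backreference)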
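A minimal sketch of capturing subpatterns and backreferences in use is given below. The second capture group for the text between the tags and the group names tag and body are additions made for this example.

```php
<?php
// Single quotes: in a double-quoted PHP string \1 would be turned into
// the character with octal code 1 before PCRE sees the pattern.
$rx = '=<(\w+)>(.*?)</\1>=';

preg_match($rx, '<li><b>item</b></li>', $m);
$noMatch = preg_match($rx, '<li>item</em>');

// Same idea with named groups (the names are assumptions for this sketch).
$named = '=<(?<tag>\w+)>(?<body>.*?)</\g{tag}>=';
preg_match($named, '<h1>Title</h1>', $n);

echo $m[1], ': ', $m[2], "\n";            // li: <b>item</b>
echo $noMatch, "\n";                      // 0
echo $n['tag'], ': ', $n['body'], "\n";   // h1: Title
```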
Regular expressions: Alternations
The regular expression regexpr1|regexpr2 matches
if either regexpr1 or regexpr2 matches
This type of regular expression is called an alternation
Within a larger regular expression we need to enclose alternations
in a subpattern or capturing subpattern
(regexpr1|regexpr2) or (?:regexpr1|regexpr2)
to indicate where regexpr1 and regexpr2 start/end
'/(Mr|Ms|Mrs|Dr).*?([\w\-]+)/'
matches "Dr Michele Zito"
matches "Mr Dave Shield"
matches "Mrs Judith Birtall"
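Printing the capture groups for the three names shows what the pattern actually matches. The reordered pattern $longFirst is an assumption added for comparison; it is not taken from the lecture.

```php
<?php
$names = ['Dr Michele Zito', 'Mr Dave Shield', 'Mrs Judith Birtall'];

$shortFirst = '/(Mr|Ms|Mrs|Dr).*?([\w\-]+)/';   // pattern from the example
$longFirst  = '/(Mrs|Mr|Ms|Dr).*?([\w\-]+)/';   // reordered variant (assumption)

$found = [];
foreach ($names as $name) {
    preg_match($shortFirst, $name, $m);
    $found[$name] = $m;
    echo "$name: title '$m[1]', name '$m[2]'\n";
}

preg_match($longFirst, 'Mrs Judith Birtall', $fixed);
echo "Reordered: title '$fixed[1]', name '$fixed[2]'\n";
```

For "Mrs Judith Birtall" the original pattern picks Mr as the title and s as the name, because Mr is tried before Mrs and already leads to a successful match. With Mrs listed first, the title is Mrs and the name is Judith.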
The order of expressions in an alternation only matters
if one expression matches a sub-expression of another
'/cat|dog|bird/'
matches "cats and dogs" (matched substring: cat)
'/dog|cat|bird/'
matches "cats and dogs" (matched substring: cat)
[Figure: state diagram of a matching automaton for the alternatives dog, cat and bird]
'/dog|dogs/'
matches "cats and dogs" (matched substring: dog)
[Figure: state diagram of a matching automaton for dog|dogs]
'/dogs|dog/'
matches "cats and dogs" (matched substring: dogs)
[Figure: state diagram of a matching automaton for dogs|dog]
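The effect of the order of alternatives can be observed directly with preg_match. This is a short sketch with arbitrary variable names.

```php
<?php
$target = 'cats and dogs';

preg_match('/cat|dog|bird/', $target, $a);
preg_match('/dog|cat|bird/', $target, $b);
preg_match('/dog|dogs/',     $target, $c);
preg_match('/dogs|dog/',     $target, $d);

// The leftmost match wins, so the order of cat and dog is irrelevant.
echo $a[0], ' ', $b[0], "\n";   // cat cat
// dog is a prefix of dogs, so here the order decides what is matched.
echo $c[0], ' ', $d[0], "\n";   // dog dogs
```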
Regular expressions: Modifiers
Modifiers change the interpretation of certain characters in a regular
expression or the way in which PHP finds a match for a regular expression
/ / Default
. matches any character except \n
^ matches only at string start
$ matches only at string end modulo preceding \n
/ /i perform a case-insensitive match
Target string "Hillary\nClinton"
/clinton/ does not match
/clinton/i does match
/ /s Treat string as a single long line
. matches any character including \n
^ matches only at string start
$ matches only at string end modulo preceding \n
/ /m Treat string as a set of multiple lines
. matches any character except \n
^ matches at a line start
$ matches at a line end
Target string "Hillary\nClinton"
/(bill|hillary).clinton/mi does not match
/(bill|hillary).clinton/si does match
/(bill|hillary).^clinton/mi does not match
/(bill|hillary).^clinton/si does not match
/ /sm Treat string as a single long line, but detect multiple lines
. matches any character including \n
^ matches at a line start
$ matches at a line end
Target string "Hillary\nClinton"
/(bill|hillary).^clinton/smi does match
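The modifier examples can be reproduced with a loop over the patterns. This is a sketch; $results is an arbitrary name.

```php
<?php
$target = "Hillary\nClinton";

$patterns = [
    '/clinton/',
    '/clinton/i',
    '/(bill|hillary).clinton/mi',
    '/(bill|hillary).clinton/si',
    '/(bill|hillary).^clinton/mi',
    '/(bill|hillary).^clinton/si',
    '/(bill|hillary).^clinton/smi',
];

$results = [];
foreach ($patterns as $p) {
    $results[$p] = preg_match($p, $target);
    printf("%-30s %s\n", $p, $results[$p] ? 'does match' : 'does not match');
}
```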
PCRE functions
PCRE functions: preg_match (1)
int preg_match(rx, str [,&$matches [,flags [,offset]]])
Attempts to match the regular expression rx against the string str
starting at offset
Returns 1 if there is a match, 0 if there is not, and FALSE in case of an error
$matches is an array containing at index 0 the full match and at the
remaining indices the matches for any capture groups
flags modify the behaviour of the function
$txt = "Yabba dabba doo";
if (preg_match('/^[a-z\s]*$/i', $txt)) {
  echo "'$txt' only consists of letters and spaces";
}
'Yabba dabba doo' only consists of letters and spaces
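The optional arguments of preg_match can be tried out as in the sketch below. The pattern and the offset 7 are chosen only for illustration. PREG_OFFSET_CAPTURE is one of the flags documented in the PHP manual; with it, every entry of $m becomes a pair of matched text and byte offset.

```php
<?php
$txt = "Yabba dabba doo";

// Start searching at offset 7, that is, after the 'd' of "dabba".
$result = preg_match('/d\w+/', $txt, $m, PREG_OFFSET_CAPTURE, 7);

// Compare with === to tell 0 (no match) apart from false (error).
if ($result === false) {
    echo "error in regular expression\n";
} elseif ($result === 1) {
    echo "found '{$m[0][0]}' at offset {$m[0][1]}\n";   // found 'doo' at offset 12
} else {
    echo "no match\n";
}
```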
PCRE functions: preg_match (2)
$t = "Yabba dadda doo";
if (preg_match('/((?<c1>\w)(?<c2>\w)\g{c2}\g{c1})/', $t, $m)) {
  foreach ($m as $key => $value) {
    printf("%-2s => %s\n", $key, $value);
  }
}
0  => abba
1  => abba
c1 => a
2  => a
c2 => b
3  => b
PCRE functions: preg_match_all
int preg_match_all(rx, str [,&$matches [,flags [,offs]]])
Retrieves all matches of the regular expression rx against the string
str starting at offs
Returns the number of matches; FALSE in case of error
$matches is a multi-dimensional array containing the matches found
flags modify the behaviour of the function
$txt = "Yabba dadda doo";
if (preg_match_all('/((?<c1>\w)(?<c2>\w)\g{c2}\g{c1})/',
                   $txt, $m, PREG_SET_ORDER)) {
  print_r($m); }
Array ( [0] =>
  Array (
    [0] => abba
    [1] => abba
    [c1] => a
    [2] => a
    [c2] => b
    [3] => b )
[1] =>
  Array (
    [0] => adda
    [1] => adda
    [c1] => a
    [2] => a
    [c2] => d
    [3] => d ))
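The flag decides how $matches is organised. The sketch below contrasts PREG_SET_ORDER, as used above, with the default PREG_PATTERN_ORDER. The simplified pattern with numbered groups is an assumption made for this example.

```php
<?php
$txt = "Yabba dadda doo";
$rx  = '/(\w)(\w)\2\1/';   // simplified pattern with numbered groups only

// Default (PREG_PATTERN_ORDER): one sub-array per capture group.
$n = preg_match_all($rx, $txt, $byGroup);

// PREG_SET_ORDER: one sub-array per match.
preg_match_all($rx, $txt, $byMatch, PREG_SET_ORDER);

echo $n, "\n";                           // 2
echo implode(',', $byGroup[0]), "\n";    // abba,adda
echo implode(',', $byGroup[2]), "\n";    // b,d
echo implode(',', $byMatch[1]), "\n";    // adda,a,d
```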
PCRE functions: preg_replace
mixed preg_replace(rx, rpl, str [, lmt [, &$num]])
Returns the result of replacing matches of rx in str by rpl
lmt specifies the maximum number of replacements
On completion, $num contains the number of replacements
performed
rpl can refer back to capturing subpatterns in rx via $N where N is
a natural number
$old = "Dr Ullrich Hustadt";
$new = preg_replace('/(Mr|Ms|Mrs|Dr)?\s*(\w+)\s+(\w+)/',
                    '$3, $2', $old);
echo $new;
Hustadt, Ullrich
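The optional arguments lmt and $num might be used as in the following sketch. The target string is invented for the illustration; a limit of -1 stands for "no limit".

```php
<?php
$txt = "a1b22c333";

// Replace at most two runs of digits; $num receives the number of replacements.
$new = preg_replace('/\d+/', '#', $txt, 2, $num);

// A limit of -1 means no limit.
$all = preg_replace('/\d+/', '#', $txt, -1, $total);

echo "$new ($num)\n";     // a#b#c333 (2)
echo "$all ($total)\n";   // a#b#c# (3)
```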
PCRE functions: preg_replace_callback
mixed preg_replace_callback(rx,fun,str [,lmt [,&$num]])
Returns the result of replacing matches of rx in str by the return
values of the application of fun to each match
lmt specifies the maximum number of replacements
On completion, $num contains the number of replacements
performed
$old = "105 degrees Fahrenheit is quite warm";
$new = preg_replace_callback('/(\d+) degrees Fahrenheit/',
         function ($match) {
           return round(($match[1] - 32) * 5 / 9) .
                  " degrees Celsius";
         }, $old);
echo $new;
41 degrees Celsius is quite warm
PCRE functions: preg_split
array preg_split(rx, str [,lmt [,flags]])
Splits str by the regular expression rx and returns the result as array
lmt specifies the maximum number of split components
flags modify the behaviour of the function
# Establish the frequency of words in a string
$string = "peter paul mary paul jim mary paul";
$count = [];
foreach (preg_split('/\s+/', $string) as $word)
  if (array_key_exists($word, $count))
    $count[$word]++;
  else
    $count[$word] = 1;
foreach ($count as $key => $value)
  print("$key => $value; ");
peter => 1; paul => 3; mary => 2; jim => 1;
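The lmt and flags arguments of preg_split can be explored with the sketch below. The input strings are invented; PREG_SPLIT_NO_EMPTY is one of the flags listed in the PHP manual and suppresses empty components.

```php
<?php
$line = "  peter, paul;mary ,  jim ";

// Separators at the start and end of the string yield empty components.
$plain = preg_split('/[\s,;]+/', $line);

// -1 means no limit; PREG_SPLIT_NO_EMPTY drops the empty components.
$clean = preg_split('/[\s,;]+/', $line, -1, PREG_SPLIT_NO_EMPTY);

// A limit of 2 leaves the rest of the string in the second component.
$limited = preg_split('/\s+/', 'peter paul mary jim', 2);

echo count($plain), "\n";             // 6
echo implode('|', $clean), "\n";      // peter|paul|mary|jim
echo implode('|', $limited), "\n";    // peter|paul mary jim
```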
Revision
Read
R. Cox: Regular Expression Matching Can Be Simple And Fast (but is slow
in Java, Perl, PHP, Python, Ruby, ...). swtchboard, Jan 2007.
https://swtch.com/~rsc/regexp/regexp1.html [accessed 28 Dec 2019]
Read
PHP Manual: Regular Expressions (Perl-Compatible)
http://uk.php.net/manual/en/book.pcre.php
of P. Cowburn (ed.): PHP Manual. The PHP Group, 24 Dec 2019.
http://uk.php.net/manual/en [accessed 29 Dec 2019]