Regular expressions describe text patterns.
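Characters that have a special meaning in a regular expression, instead of matching themselves, are called metacharacters.
=~ is the binding operator: it applies the regular expression on its right to the string on its left.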
Characters written between [ and ] form a character class. Exactly one character from the class must match at that position for evaluation of the rest of the regular expression to continue.
Inside a character class, - indicates a range and a leading ^ negates the class.
Perl has shortcuts for the most common character classes.
For ASCII text, [a-zA-Z0-9_] can be written as \w and [^a-zA-Z0-9_] as \W.
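To make character classes concrete, here is a small sketch. The sample strings and variable names are made up for illustration; only the class syntax comes from the notes above.

```perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only for this illustration.
my @samples = ('cat', 'cot', 'cut', 'c_t');

# [aeiou] accepts exactly one vowel between "c" and "t".
my @vowel_hits = grep { /c[aeiou]t/ } @samples;

# [a-o] is a range: any single letter from "a" through "o".
my @range_hits = grep { /c[a-o]t/ } @samples;

# A leading ^ negates the class: any single character that is not a vowel.
my @other_hits = grep { /c[^aeiou]t/ } @samples;

# \w covers letters, digits and the underscore, so all four strings match.
my @word_hits = grep { /c\wt/ } @samples;

print "vowel: @vowel_hits\n";   # vowel: cat cot cut
print "range: @range_hits\n";   # range: cat cot
print "other: @other_hits\n";   # other: c_t
print "word:  @word_hits\n";    # word:  cat cot cut c_t
```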
Metacharacters.
- . means match any character except a newline
- \w means match any alphanumeric character or the underscore
- \W means match any character that is not alphanumeric or the underscore
- \d means match any character that is a digit
- \D means match any character that is not a digit
- \s means match any character that is a whitespace such as a space, newline or a tab
- \S means match any character that is not a whitespace
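- ^ means match the beginning of the string (or of each line when the m modifier is used)
- $ means match the end of the string or just before a final newline (or the end of each line when the m modifier is used)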
^ and $ are called anchor metacharacters. They’re also sometimes called assertions.
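A quick way to get a feel for these metacharacters is to test a few patterns against one string. This is only a sketch; the sample text and variable names are hypothetical.

```perl
use strict;
use warnings;

my $line = "order 42 shipped";   # hypothetical sample text

my $has_digit    = $line =~ /\d/       ? 1 : 0;   # a digit appears somewhere
my $starts_order = $line =~ /^order/   ? 1 : 0;   # "order" at the very start
my $ends_order   = $line =~ /order$/   ? 1 : 0;   # "order" at the very end
my $spaced_num   = $line =~ /\s\d\d\s/ ? 1 : 0;   # whitespace, two digits, whitespace
my $five_then_ws = $line =~ /^.....\s/ ? 1 : 0;   # any five characters, then whitespace

print "has digit:     $has_digit\n";      # has digit:     1
print "starts order:  $starts_order\n";   # starts order:  1
print "ends order:    $ends_order\n";     # ends order:    0
print "spaced number: $spaced_num\n";     # spaced number: 1
print "five then ws:  $five_then_ws\n";   # five then ws:  1
```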
Quantifiers describe how many times the preceding character or group may repeat.
- * means zero or more
- + means one or more
- ? means zero or one time
- {n} means n times where n is an integer
- {n,m} means at least n and at most m times
- {n,} means n or more times
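The quantifiers are easier to compare side by side. In this sketch each pattern is anchored with ^ and $ so the whole string has to match; the word list is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical inputs: "c", then zero to three "a" characters, then "t".
my @words = ('ct', 'cat', 'caat', 'caaat');

my @star  = grep { /^ca*t$/ }     @words;   # zero or more "a"
my @plus  = grep { /^ca+t$/ }     @words;   # one or more "a"
my @opt   = grep { /^ca?t$/ }     @words;   # zero or one "a"
my @two   = grep { /^ca{2}t$/ }   @words;   # exactly two "a"
my @range = grep { /^ca{1,2}t$/ } @words;   # one or two "a"
my @min   = grep { /^ca{2,}t$/ }  @words;   # two or more "a"

print "*     : @star\n";    # *     : ct cat caat caaat
print "+     : @plus\n";    # +     : cat caat caaat
print "?     : @opt\n";     # ?     : ct cat
print "{2}   : @two\n";     # {2}   : caat
print "{1,2} : @range\n";   # {1,2} : cat caat
print "{2,}  : @min\n";     # {2,}  : caat caaat
```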
Modifiers.
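- i (Ignore case): letters match regardless of case
- s (Single line): . also matches a newline
- u (Unicode): match using Unicode rules (Perl 5.14 and later)
- m (Multiline): ^ and $ match at the start and end of each line
- x (Extended, also called verbose): whitespace and # comments inside the pattern are ignored
- l (Locale): match using the rules of the current locale (Perl 5.14 and later)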
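Here is a rough sketch of what the i, s, m and x modifiers change, using one two-line string. The string and the variable names are assumptions made for the example; u and l are left out because their effect depends on the data and the locale.

```perl
use strict;
use warnings;

my $text = "Pizza\npasta";   # hypothetical two-line string

my $plain    = $text =~ /pizza/        ? 1 : 0;   # case differs, so no match
my $with_i   = $text =~ /pizza/i       ? 1 : 0;   # i ignores case
my $dot      = $text =~ /Pizza.pasta/  ? 1 : 0;   # . does not match the newline
my $with_s   = $text =~ /Pizza.pasta/s ? 1 : 0;   # s lets . match the newline
my $caret    = $text =~ /^pasta/       ? 1 : 0;   # ^ only matches at the string start
my $with_m   = $text =~ /^pasta/m      ? 1 : 0;   # m lets ^ match after the newline

# x lets the pattern be spread out and commented.
my $with_x = $text =~ /
    Pizza   # the first word
    \n      # then a newline
    pasta   # then the second word
/x ? 1 : 0;

print "plain: $plain, i: $with_i\n";   # plain: 0, i: 1
print "dot: $dot, s: $with_s\n";       # dot: 0, s: 1
print "caret: $caret, m: $with_m\n";   # caret: 0, m: 1
print "x: $with_x\n";                  # x: 1
```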
m/regular expression here/ is the same as /regular expression here/. Used with =~, it checks whether the string on the left matches the pattern.
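A short sketch of the two spellings, with a made-up string. It also shows one thing that goes slightly beyond the notes above: with an explicit m, Perl lets you pick a different delimiter, which helps when the pattern itself contains slashes.

```perl
use strict;
use warnings;

my $path = "/usr/bin/perl";   # hypothetical sample string

# These two lines are the same match; the slashes inside must be escaped.
my $bare   = $path =~ /\/usr\/bin/  ? 1 : 0;
my $with_m = $path =~ m/\/usr\/bin/ ? 1 : 0;

# With m, another delimiter such as {} avoids the escaping.
my $braces = $path =~ m{/usr/bin} ? 1 : 0;

if ($path =~ /perl$/) {
    print "The path ends in perl\n";   # this branch runs
}

print "bare: $bare, m: $with_m, braces: $braces\n";   # bare: 1, m: 1, braces: 1
```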
The substitution operator finds text that matches a regular expression and replaces it with other text:
    s/find this regular expression/replace with this text/
The following example substitutes spaghetti with pizza:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $sentence = "I love eating spaghetti.";
    $sentence =~ s/spaghetti/pizza/;
    print $sentence, "\n";
This example changes every slice count to 4:
    my $order = "3 slices of plain pizza 5 slices of pepperoni pizza";
    $order =~ s/\d+/4/g;
    print "Your order has been changed to:\n", $order, "\n";
The g modifier makes the substitution global, so every run of digits in the string is replaced with 4, not just the first one.
The program prints this on the screen:
    Your order has been changed to:
    4 slices of plain pizza 4 slices of pepperoni pizza
When you want to extract part of a string with a regular expression, put parentheses around each part of the pattern whose match you want to keep. The text matched by the first pair of parentheses is stored in $1, the text matched by the second pair in $2, and so on. This is called capturing.
perlrequick has this example:
    ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
The match on the right is evaluated first:
    ($time =~ /(\d\d):(\d\d):(\d\d)/) # in list context, returns the values of $1, $2, $3
Those values are then assigned to ($hours, $minutes, $second).
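The parentheses around the match on the right are optional. =~ binds more tightly than =, so the match is always evaluated before the assignment, and they are there only for readability. The parentheses that matter are the ones on the left: they put the match in list context, so it returns the captured values. Without them, $hours = $time =~ /.../ would only store whether the match succeeded.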
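The perlrequick line can be tried out as a complete script. The time value and the extra $matched variable are my own additions for illustration; the pattern is the one quoted above.

```perl
use strict;
use warnings;

my $time = "09:41:07";   # hypothetical input in hh:mm:ss form

# List context: the match returns the three captured values.
my ($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

# Scalar context: the same match only reports whether it succeeded.
my $matched = ($time =~ /(\d\d):(\d\d):(\d\d)/);

print "$hours h, $minutes m, $second s\n";           # 09 h, 41 m, 07 s
print "matched: $matched, first group: $1\n";        # matched: 1, first group: 09
```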
Note: Programming Perl points out that an easy mistake is to think that \w matches a word. It matches a single word character; use \w+ to match a whole word.
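The difference is easy to see with captures. This is a small sketch with an invented phrase and variable names.

```perl
use strict;
use warnings;

my $phrase = "love eating spaghetti";   # hypothetical sample phrase

# \w captures a single word character.
my ($one_char) = ($phrase =~ /(\w)/);

# \w+ captures a whole run of word characters, i.e. a word.
my ($first_word) = ($phrase =~ /(\w+)/);

# With the g modifier in list context, every word is returned.
my @all_words = ($phrase =~ /(\w+)/g);

print "one char:   $one_char\n";     # one char:   l
print "first word: $first_word\n";   # first word: love
print "all words:  @all_words\n";    # all words:  love eating spaghetti
```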
When I was learning to write regular expressions, I found this tool very useful: http://gskinner.com/RegExr/