Regular Expressions: Tutorial

The absolute minimum amount of regex to learn

Regular Expressions: Tutorial

posted in productivity on

The minimum Regex one should know and still be fairly productive.

Almost all of this stuff should work in most regex implementations.
Notable exceptions are: \< and \>, [a-Z], \Q\E. Your mileage may vary.

Basic Syntax

Regex Matches Remarks
Literals    
abc Literals match themselves. Here: abc  
     
Metacharacters    
^s s but only at the start of the input  
e$ e but only at the end. ^ and $ match a position.
\$ Escape character \: Match any $  
. Match any one character Matches \r\n with dotall (s) flag
\<e Word that starts with e Word boundaries: [a-zA-Z0-9_]
e\> That ends with e  
\b Word boundary Between \w and \W
     
Character Classes    
[az] a or z  
[0-9] A single digit \d
[a-zA-Z]    
[a-Z] This might also include []\\^_ and a literal `  
[\n$^-] A newline, $, ^ or a hyphen (-) A ^ at the start negates the class
[^a\b] Everything but a and ‘backspace’ [\b] (backspace) vs \b (word boundary)
     
Quantifiers    
a? Zero or one  
a* Zero or more  
a+ One or more  
a{2,5} Two to five  
a{4} Exactly four  
a{5,} Five or more a{,5} for 0 to 5
     
Grouping    
(ab|yz) Literal ab or yz. Alternation
(?:ab) ab but non-capturing group  
(\w+) \1 \1 matches the same thing as (\w+) Backreferences (also \2 etc)

Shorthands

Shorthand Meaning Remarks
\d [0-9] \D -> [^0-9]
\w [a-zA-Z0-9_]  
\s Whitespace  
\t Tab  
\r Carriage return  
\n Newline  
\b Word boundary \A and \Z: Start/End of string
\Q…\E Literal sequence Ex: \Qlite[]ral\E

Shorthands can be inversed by capitalizing them: \D (not a digit)

Also \v (vertical tab), \f (form feed).
Inside a character clas \b matches backspace.

Modifiers

Modifier Description
g(lobal) JS: Match more than once
i(nsensitive) Case insensitive matching
m(ultiline) ^$ match every line in the string (vs \A and \Z)
s (dotall) . matches \r|\n

Replacement

Replacement Description
$1 First captured group
$2 Second captured group
$$ A literal $
$& Entire match
$` Before matched string
$’ After matched string
$+ Last matched string

Some implementations use \ instead of $.

Use non-capturing groups (?:) to keep your backreferences ($1, $2, …) in check.
Or use named groups if supported in your regex implementation.

Looking around

The last & thoughest feature each developer should definitely know: lookahead & lookbehind!

Lookahead vs Lookbehind syntax

Example:

A(?=B)   – Literal A followed by B
A(?!B)   – Literal A not followed by B

Lookarounds do not match anything in themselves which make them very handy when you want to replace some text but only when it is (not) preceded/succeeded by something else.


Other interesting reads