Regular Expressions

Regular expressions, also known as regex or regexp, are magical. They provide a compact syntax for describing patterns in strings, of the sort used in find-and-replace operations. Here are some of the metacharacters we will be using in various text mining applications:
. | ( ) [ { ^ $ * + ? -

Help & practice

Regexp cheatsheet https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
Online regular expression checker https://spannbaueradam.shinyapps.io/r_regex_tester/

Quotes and brackets

“[regexp]” Square brackets enclose a character class: a set of characters, any one of which may match at that position. In R the whole pattern is also wrapped in quotes, so make sure to honor both the brackets and the quotes.
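
As a minimal sketch of the brackets-inside-quotes form, the example below uses a made-up vector called `words` (an assumption for illustration, not part of the exercise data) and keeps the elements that contain any one of the characters listed in the class.

```r
# Hypothetical example vector, used only for illustration
words <- c("sky", "tree", "rhythm", "ox")

# "[aeiou]" is a character class inside a quoted R string:
# it matches any single lowercase vowel
has_vowel <- grepl("[aeiou]", words)

# Keep only the elements that contain at least one vowel
with_vowel <- words[has_vowel]
print(with_vowel)
```

grepl() returns one TRUE/FALSE per element, which makes it convenient for subsetting; grep(..., value = T) returns the matching elements directly.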

Metacharacters & anchors

- “[a-d]” denotes a range (any one character from a through d)
. “b.d” denotes b followed by ‘any single character’ then d
| “a|B” denotes a ‘or’ B
^ “[^x]” denotes ‘not’ x when inside the brackets; “[^x-y]” denotes any character not in the range x-y
^ “^[x]” denotes ‘starts with’ x when outside the brackets
\b word boundary (written “\\b” inside an R string)
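
The metacharacters above can be sketched on a small test vector. The strings in `x` and `phrases` are hypothetical, chosen only so that each pattern matches some elements and not others.

```r
# Hypothetical test strings, chosen only to exercise each metacharacter
x <- c("bad", "bed", "bd", "Bob", "cab")
phrases <- c("cab", "cabin", "a cab ride")

dot_match   <- grepl("b.d", x)            # b, any one character, then d
or_match    <- grepl("a|B", x)            # contains a or B
not_class   <- grepl("b[^a]d", x)         # b, one character that is not a, then d
starts_with <- grepl("^[bB]", x)          # begins with b or B
word_edge   <- grepl("\\bcab\\b", phrases)  # "cab" as a whole word

print(data.frame(x, dot_match, or_match, not_class, starts_with))
print(word_edge)
```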

Character Classes, POSIX Classes

Specify entire classes of characters (e.g., digits). Here are some character classes:
1. [[:digit:]] or \\d or [0-9]; digits from 0-9
2. \\D or [^0-9]; not digits
3. [[:lower:]] or [a-z]; lowercase letters
4. [[:upper:]] or [A-Z]; Uppercase letters
5. [[:alpha:]] or [A-Za-z]; Alphabetic characters
6. [[:alnum:]] or [A-Za-z0-9]; Alphanumeric characters
7. [[:punct:]]; Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
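
A short sketch of how these classes behave, using a hypothetical vector `x` made up for illustration. It assumes base R's default regex engine, where the POSIX class and the backslash shorthand for digits should agree.

```r
# Hypothetical example vector, used only for illustration
x <- c("abc", "ABC", "a1", "42", "wow!")

has_digit_posix <- grepl("[[:digit:]]", x)  # POSIX class for digits
has_digit_short <- grepl("\\d", x)          # shorthand, backslash doubled in an R string
has_non_digit   <- grepl("\\D", x)          # at least one character that is not a digit
has_upper       <- grepl("[[:upper:]]", x)  # at least one uppercase letter
has_punct       <- grepl("[[:punct:]]", x)  # at least one punctuation character

print(data.frame(x, has_digit_posix, has_non_digit, has_upper, has_punct))
```

Note that “\\D” asks whether a string contains some non-digit, not whether it is free of digits, so “a1” matches both “\\d” and “\\D”.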

Examples

[ajc] matches a, j, or c
[a-z] matches any single lowercase character from a to z
grep("i.", pan, value=T)
grep("[[:digit:]]", pan, value=T)
grep("t.e|o.e", pan, value=T)

Let’s work with regular expressions on a familiar phrase (a pangram) with some extra junk in it. Read this phrase into an object called ‘pan’ so that each element separated by a space gets its own row in a character vector. Hint – you need to unlist and/or split the string if you read it in raw.

The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d’og .

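library(stringr)  ##provides str_split() and the %>% pipe
pan <- c("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .")
pan <- str_split(pan, " ") %>% unlist()  ##split the string on spaces, then flatten the list to a vector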
class(pan)

## [1] "character"

print(pan)

##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "the"   
##  [9] "Laz7y"  "d'og"   "."
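
For readers without stringr installed, the same split can be sketched in base R. The names `raw` and `pan_base` are introduced here only for illustration.

```r
# The raw phrase as a single string
raw <- "The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og ."

# strsplit() returns a list with one element per input string;
# unlist() flattens it into a plain character vector.
# fixed = TRUE treats " " as a literal space rather than a regex.
pan_base <- unlist(strsplit(raw, " ", fixed = TRUE))

class(pan_base)
length(pan_base)
```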

Our goal is to clean this up and eventually tokenize it. There are misspellings, errant punctuation, extra white space and a mix of upper and lowercase letters. Let’s first check some of the observations using metacharacters and regular expressions.

Meta1: “.” any

Let’s grep our way to retrieving observations with some particular patterns where the form is grep(pattern, string, value=T). If you want to search for or extract a character like “.” that is also reserved as a metacharacter, you must escape the character with a double backslash (\\), for example “\\.”
1. return all observations with an “i” followed by any character
2. return all observations with a “T” followed by any character
3. return all observations with a J followed by any character then an m
4. return all observations where “^” is preceded by any character

# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("i.", pan, value = T)

## [1] "qUiccK"

grep("T.", pan, value = T)

## [1] "The"

grep("J.m", pan, value = T)

## [1] "Jump;s"

grep(".\\^", pan, value = T)

## [1] "&^"
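
The escaping rule described above can be sketched by searching for a literal period. The vector is rebuilt here in base R so the snippet runs on its own; `fixed = TRUE` is shown as an alternative that switches off regex interpretation altogether.

```r
# Rebuild the example vector so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " ", fixed = TRUE))

# Unescaped: "." means "any character", so every non-empty element matches
unescaped <- grep(".", pan, value = TRUE)

# Escaped: "\\." means a literal period
escaped <- grep("\\.", pan, value = TRUE)

# fixed = TRUE treats the whole pattern as plain text, no escaping needed
literal <- grep(".", pan, value = TRUE, fixed = TRUE)

length(unescaped)
escaped
literal
```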

Meta2: “|” or

return all observations with an “i” or a “z” in them
return all observations with an “o”, “&”, or “j”
return all observations with a digit or an “F”
return all observations of the pattern t _ e or o _ e
return all observations with a “z” and a “y” in them

grep("i|z", pan, value = T)  #returns all observations with an i or a z

## [1] "qUiccK" "Laz7y"

grep("o|&|j", pan, value = T)

## [1] "Fox"   "&^"    "ove_r" "d'og"

grep("\\d|F", pan, value = T)

## [1] "Fox"   "Laz7y"

grep("t.e|o.e", pan, value = T)

## [1] "ove_r" "the"
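
The last item in the list above (a “z” and a “y” in the same observation) has no worked answer. Because “|” only expresses ‘or’, one possible sketch is either to spell out both orderings with “.*” (any run of characters, a quantifier covered later) or to combine two grepl() calls with a logical AND. Both are assumptions about the intended answer.

```r
# Rebuild the example vector so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " ", fixed = TRUE))

# Option 1: z somewhere before y, or y somewhere before z
both_regex <- grep("z.*y|y.*z", pan, value = TRUE)

# Option 2: test each letter separately and keep elements where both are TRUE
both_logic <- pan[grepl("z", pan) & grepl("y", pan)]

both_regex
both_logic
```

The second form tends to scale better when more than two conditions must hold at once.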

[…] specifies permitted characters

Return all observations with “the” or “The”
Return all observations that are not “The” or “the”

# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("[Tt]he", pan, value = T)  #returns any instances of The or the

## [1] "The" "the"

grep("[Tt]he", pan, value = T, invert = T)  #invert = T returns every element that does not contain The or the

## [1] "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "Laz7y"  "d'og"  
## [9] "."

Let’s create a new vector to test sequence matching vs character matching
bananas are yellow but sometimes brown

bananas <- c("bananas are yellow but sometimes brown")
bananas2 <- str_split(bananas, " ") %>% unlist()
print(bananas2)

## [1] "bananas"   "are"       "yellow"    "but"       "sometimes" "brown"

str(bananas2)

##  chr [1:6] "bananas" "are" "yellow" "but" "sometimes" "brown"

Look for the exact pattern -ana anywhere in “bananas” versus any character string that has either an a or n anywhere in it. We will try this three ways.

# bananas are yellow but sometimes brown
grep("[ana]", bananas2, value = T)  #return entries containing a or n (the repeated a is redundant)

## [1] "bananas" "are"     "brown"

grep("ana", bananas2, value = T)  #return entries with letter sequence -ana

## [1] "bananas"

grep("a|n|a", bananas2, value = T)  #return entries containing a or n (same result as [ana])

## [1] "bananas" "are"     "brown"

^ the dreaded caret: begins with vs not permitted characters [^...]

This can mean ‘starts with’ or ‘not’ depending on whether it sits outside or inside the brackets
[^x] means one character that is not x, so a string matches if it contains at least one such character

# bananas are yellow but sometimes brown
grep("^[ba]", bananas2, value = T)  #starts with b or a -^ is outside the bracket

## [1] "bananas" "are"     "but"     "brown"

grep("[^are]", bananas2, value = T)  #matches strings with at least one character other than a, r, or e - caret inside bracket

## [1] "bananas"   "yellow"    "but"       "sometimes" "brown"
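
A negated class is easy to misread, so here is a small sketch on a hypothetical vector `x` contrasting “[^are]” (contains some character other than a, r, or e) with excluding strings that contain the sequence “are”, which is done with invert = TRUE.

```r
# Hypothetical example vector, used only for illustration
x <- c("are", "area", "bare", "yellow")

# TRUE when the string has at least one character that is not a, r, or e.
# "are" and "area" are built only from a, r, e, so they do not match.
any_other_char <- grepl("[^are]", x)

# To drop strings containing the sequence "are", match it and invert
without_are <- grep("are", x, value = TRUE, invert = TRUE)

any_other_char
without_are
```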

Another example big, bog, beg, byg, bfg

big <- c("big bog beg byg bfg")
big <- str_split(big, " ") %>% unlist()
print(big)

## [1] "big" "bog" "beg" "byg" "bfg"

grep("b[io]g", big, value = T)  #returns big or bog

## [1] "big" "bog"

grep("b[^io]g", big, value = T)  #returns b, then one character that is not i or o, then g

## [1] "beg" "byg" "bfg"

[a-z] specifies a range of characters

Return all observations with letters a,b,c, or d in them
Return all observations with the numbers 7-9 in them
Return all observations that contain at least one character outside the range 7-9

grep("[a-d]", pan, value = T)  #will return any observation with the lowercase letters a-d in it

## [1] "qUiccK" "brOwn"  "Laz7y"  "d'og"

grep("[7-9]", pan, value = T)

## [1] "Laz7y"

grep("[^7-9]", pan, value = T)  #^inside means not: any element with at least one character that is not 7-9, so everything is returned

##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "the"   
##  [9] "Laz7y"  "d'og"   "."
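
Because “[^7-9]” matches any element with at least one character outside 7-9, it does not remove “Laz7y”. A sketch of two ways to return only the elements that contain no 7, 8, or 9 at all follows; the second uses the anchors and the “*” quantifier introduced later.

```r
# Rebuild the example vector so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " ", fixed = TRUE))

# Option 1: match the digits, then invert the selection
no_789_invert <- grep("[7-9]", pan, value = TRUE, invert = TRUE)

# Option 2: require every character from start (^) to end ($) to be outside 7-9
no_789_anchor <- grep("^[^7-9]*$", pan, value = TRUE)

no_789_invert
identical(no_789_invert, no_789_anchor)
```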

Character Classes

Return all entries with digits in them
Return all entries with punctuation in them
Use gsub to replace punctuation with a single space (the form is gsub(pattern, replacement, vec)) and write the result to pan2

grep("[[:digit:]]", pan, value = T)

## [1] "Laz7y"

grep("[[:punct:]]", pan, value = T)

## [1] "&^"     "Jump;s" "ove_r"  "d'og"   "."

pan2 <- gsub("[[:punct:]]", " ", pan)
pan2

##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "  "     "Jump s" "ove r"  "the"   
##  [9] "Laz7y"  "d og"   " "
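
Toward the stated goal of cleaning and tokenizing, one possible next step is sketched below. It is an assumption about where the exercise is heading, not the author's pipeline: punctuation is deleted rather than replaced with a space, empty strings are dropped, and case is folded. The names `no_punct` and `tokens` are illustrative.

```r
# Rebuild the example vector so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " ", fixed = TRUE))

# Delete punctuation outright by replacing it with the empty string
no_punct <- gsub("[[:punct:]]", "", pan)

# Elements that were pure punctuation are now "", so drop them
no_punct <- no_punct[no_punct != ""]

# Fold everything to lowercase
tokens <- tolower(no_punct)
print(tokens)
```

Misspellings (“quicck”) and the stray digit in “laz7y” are left untouched here; they need separate handling.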

Anchors

^ Start of the string
$ End of the string
\\b Empty string at either edge of a word
\\B NOT the edge of a word
\\< Beginning of a word
\\> End of a word

Let’s give some of these a shot
1. Return words in pan that start with a T or a t
2. Return words in pan that don’t start with a T or a t

grep("^[Tt]", pan, value = T)

## [1] "The" "the"

grep("^[Tt]", pan, value = T, invert = T)  #starts with not T or t

## [1] "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "Laz7y"  "d'og"  
## [9] "."
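
The end-of-string anchor is not exercised above, so here is a brief sketch of “$”, together with a pattern that combines both meanings of the caret: “^” outside the brackets anchors to the start, and “^” inside negates the class. The variable names are illustrative.

```r
# Rebuild the example vector so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " ", fixed = TRUE))

# $ anchors the match to the end of the string: elements ending in e
ends_in_e <- grep("e$", pan, value = TRUE)

# ^ outside the brackets = start of string; ^ inside = not T or t.
# This gives the same elements as the invert = T call above.
not_t_start <- grep("^[^Tt]", pan, value = TRUE)

ends_in_e
not_t_start
```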

Quantifiers

* Matches at least 0 times
+ Matches at least 1 time
? Matches at most 1 time; optional string
{n} Matches exactly n times
{n,} Matches at least n times
{,n} Matches at most n times
{n,m} Matches between n and m times

Let’s try some quantifiers
1. Return words with at least two consecutive c’s in them

grep("c{2}", pan, value = T)

## [1] "qUiccK"
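
To compare the quantifiers side by side, the sketch below applies several of them to a hypothetical vector `x` whose elements differ only in the number of o’s. The patterns are anchored with ^ and $ so that the count applies to the whole string rather than to a substring.

```r
# Hypothetical example vector: b, then 0 to 3 o's, then g
x <- c("bg", "bog", "boog", "booog")

zero_or_more <- grepl("^bo*g$", x)      # any number of o's, including none
one_or_more  <- grepl("^bo+g$", x)      # at least one o
optional     <- grepl("^bo?g$", x)      # zero or one o
exactly_two  <- grepl("^bo{2}g$", x)    # exactly two o's
two_or_more  <- grepl("^bo{2,}g$", x)   # at least two o's
one_to_two   <- grepl("^bo{1,2}g$", x)  # between one and two o's

print(data.frame(x, zero_or_more, one_or_more, optional,
                 exactly_two, two_or_more, one_to_two))
```

Without anchors, “c{2}” succeeds on any string containing two c’s in a row, which is why it reads as ‘at least two’ in the exercise above.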

Regular Expressions in R

Jamie Reilly, Ph.D.

September 25, 2020