Regular Expressions

Regular expressions, also known as regex or regexp, are a syntax for matching patterns in strings, of the sort used in find-and-replace tools, and they can feel magical. Here are some of the metacharacters we will be using in various text mining applications:
. | ( ) [ { ^ $ * + ? -

Quotes and brackets

“[regexp]” brackets a character class: a set of characters, any one of which you hope to match. Make sure to honor both the brackets and the quotes, since R expects the pattern as a quoted string.

Metacharacters & anchors

  1. - [a-d] denotes a range of characters (a through d)
  2. . “b.d” denotes b followed by ‘any character’ then d
  3. | “a|B” denotes a ‘or’ B
  4. ^ [^x] denotes ‘not’ x when inside the brackets, [^x-y] denotes not in the range of x-y
  5. ^ ^[x] denotes ‘starts with’ x when outside the brackets
  6. \b word boundary (written \\b inside an R string)

Character Classes, POSIX Classes

Specify entire classes of characters (e.g., digits). Here are some character classes:
1. [[:digit:]] or \\d or [0-9]; digits from 0-9
2. \\D or [^0-9]; not digits
3. [[:lower:]] or [a-z]; lowercase letters
4. [[:upper:]] or [A-Z]; Uppercase letters
5. [[:alpha:]] or [A-Za-z]; Alphabetic characters
6. [[:alnum:]] or [A-Za-z0-9]; Alphanumeric characters
7. [[:punct:]]; Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
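As a quick check of how these classes behave, here is a minimal sketch using a small made-up vector. The vector x and the result names are hypothetical and are not part of the exercise data; grep() with value = TRUE returns the matching elements themselves.

```r
# Hypothetical test vector, for illustration only
x <- c("abc", "ABC", "a1", "42", "?!")

digits    <- grep("[[:digit:]]", x, value = TRUE)  # entries containing a digit
nondigits <- grep("\\D", x, value = TRUE)          # entries containing a non-digit
uppers    <- grep("[[:upper:]]", x, value = TRUE)  # entries containing an uppercase letter
puncts    <- grep("[[:punct:]]", x, value = TRUE)  # entries containing punctuation

print(digits)
print(nondigits)
print(uppers)
print(puncts)
```

Note that each pattern asks whether an entry contains at least one character of the class, so "a1" counts as both a digit match and a non-digit match.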

Examples

  1. [ajc] matches a, j, or c
  2. [a-z] matches any one lowercase letter from a to z
  3. grep("i.", pan, value=T)
  4. grep("[[:digit:]]", pan, value=T)
  5. grep("t.e|o.e", pan, value=T)

Let’s work with regular expressions on a familiar phrase (a pangram) with some extra junk in it. Read this phrase into an object called ‘pan’ so that each element separated by a space gets its own row in a character vector. Hint – you need to unlist and/or split the string if you read it in raw.

The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d’og .

library(stringr)  # provides str_split() and re-exports the %>% pipe
pan <- c("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .")
pan <- str_split(pan, " ") %>% unlist()  ## split the string into one element per word
class(pan)
## [1] "character"
print(pan)
##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "the"   
##  [9] "Laz7y"  "d'og"   "."
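If stringr is not available, the same split can be sketched in base R. This is only an alternative under the assumption that splitting on a single literal space is all that is needed; the names sentence and pan_base are hypothetical.

```r
# Base R alternative to str_split() %>% unlist(); no packages needed
sentence <- "The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og ."

# strsplit() returns a list, so unlist() flattens it to a character vector;
# fixed = TRUE treats " " as a literal space rather than a regex
pan_base <- unlist(strsplit(sentence, " ", fixed = TRUE))

print(class(pan_base))
print(length(pan_base))
```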

Our goal is to clean this up and eventually tokenize it. There are misspellings, errant punctuation, extra white space and a mix of upper and lowercase letters. Let’s first check some of the observations using metacharacters and regular expressions.

Meta1: “.” any

Let’s grep our way to retrieving observations with some particular patterns, where the form is grep(pattern, string, value=T). If you want to search for or extract a character like “.” that is also reserved as a metacharacter, you must escape it with a double backslash (\\), as in "\\."
1. return all observations with an “i” followed by any character
2. return all observations with a “T” followed by any character
3. return all observations with a J followed by any character then an m
4. return all observations where “^” is preceded by any character

# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("i.", pan, value = T)
## [1] "qUiccK"
grep("T.", pan, value = T)
## [1] "The"
grep("J.m", pan, value = T)
## [1] "Jump;s"
grep(".\\^", pan, value = T)
## [1] "&^"
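To see why escaping matters, the sketch below compares an unescaped ".", an escaped "\\.", and the fixed = TRUE argument of grep(), which treats the whole pattern as plain text. The vector is rebuilt in base R so the block runs on its own; the result names are hypothetical.

```r
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .",
                       " ", fixed = TRUE))

any_char    <- grep(".", pan, value = TRUE)                # unescaped: "." matches any character
literal_dot <- grep("\\.", pan, value = TRUE)              # escaped: only a literal period matches
fixed_dot   <- grep(".", pan, value = TRUE, fixed = TRUE)  # pattern taken literally, no regex

print(length(any_char))  # every entry has at least one character
print(literal_dot)
print(fixed_dot)
```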

Meta2: “|” or

  1. return all observations with an “i” or a “z” in them
  2. return all observations with an “o”, “&”, or “j”
  3. return all observations with a digit or an “F”
  4. return all observations of the pattern t _ e or o _ e
  5. return all observations with a “z” and a “y” in them

grep("i|z", pan, value = T)  #returns all observations with an i or a z
## [1] "qUiccK" "Laz7y"
grep("o|&|j", pan, value = T)
## [1] "Fox"   "&^"    "ove_r" "d'og"
grep("\\d|F", pan, value = T)
## [1] "Fox"   "Laz7y"
grep("t.e|o.e", pan, value = T)
## [1] "ove_r" "the"
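The “z” and “y” task needs ‘and’ rather than ‘or’, which the | metacharacter does not express directly. One possible approach, sketched here rather than given as the intended answer, is to combine two grepl() calls, or to spell out both orderings in a single pattern. The result names are hypothetical.

```r
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .",
                       " ", fixed = TRUE))

# grepl() returns TRUE/FALSE per entry, so two tests can be combined with &
z_and_y <- pan[grepl("z", pan) & grepl("y", pan)]

# Same idea in one pattern: z then y, or y then z (.* allows anything in between)
z_and_y_alt <- grep("z.*y|y.*z", pan, value = TRUE)

print(z_and_y)
print(z_and_y_alt)
```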

[…] specifies permitted characters

  1. Return all observations with “the” or “The”
  2. Return all observations that do not contain “The” or “the”

# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("[Tt]he", pan, value = T)  #returns any instances of The or the
## [1] "The" "the"
grep("[Tt]he", pan, value = T, invert = T)  #invert = T returns the entries that do not contain The or the
## [1] "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "Laz7y"  "d'og"  
## [9] "."

Let’s create a new vector to test sequence matching vs character matching
bananas are yellow but sometimes brown

bananas <- c("bananas are yellow but sometimes brown")
bananas2 <- str_split(bananas, " ") %>% unlist()
print(bananas2)
## [1] "bananas"   "are"       "yellow"    "but"       "sometimes" "brown"
str(bananas2)
##  chr [1:6] "bananas" "are" "yellow" "but" "sometimes" "brown"

Look for the exact pattern -ana anywhere in “bananas” versus any character string that has either an a or n anywhere in it. We will try this three ways.

# bananas are yellow but sometimes brown
grep("[ana]", bananas2, value = T)  #return entries with an a or an n (the repeated a adds nothing)
## [1] "bananas" "are"     "brown"
grep("ana", bananas2, value = T)  #return entries with letter sequence -ana
## [1] "bananas"
grep("a|n|a", bananas2, value = T)  #return entries with an a or an n (same as "a|n")
## [1] "bananas" "are"     "brown"

^ the dreaded caret: begins with vs not permitted characters [^...]

The caret can mean ‘starts with’ or ‘not’ depending on whether it sits outside or inside the brackets.
[^x] means one character that is not x

# bananas are yellow but sometimes brown
grep("^[ba]", bananas2, value = T)  #starts with b or a -^ is outside the bracket
## [1] "bananas" "are"     "but"     "brown"
grep("[^are]", bananas2, value = T)  #caret inside bracket: matches strings with at least one character other than a, r, or e
## [1] "bananas"   "yellow"    "but"       "sometimes" "brown"
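Because [^are] only asks for one character outside the set, it is easy to misread. The sketch below contrasts it with two related patterns; the vector is typed out directly so the block runs on its own, and the result names are hypothetical.

```r
bananas2 <- c("bananas", "are", "yellow", "but", "sometimes", "brown")

# At least one character that is not a, r, or e
has_other <- grep("[^are]", bananas2, value = TRUE)

# No a, r, or e anywhere: invert a positive match instead
none_of <- grep("[are]", bananas2, value = TRUE, invert = TRUE)

# Made up only of the characters a, r, and e
only_are <- grep("^[are]+$", bananas2, value = TRUE)

print(has_other)
print(none_of)
print(only_are)
```

Only "but" contains none of the three letters, and only "are" is built entirely from them.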

Another example: big, bog, beg, byg, bfg

big <- c("big bog beg byg bfg")
big <- str_split(big, " ") %>% unlist()
print(big)
## [1] "big" "bog" "beg" "byg" "bfg"
grep("b[io]g", big, value = T)  #returns big or bog
## [1] "big" "bog"
grep("b[^io]g", big, value = T)  #b, then any one character except i or o, then g
## [1] "beg" "byg" "bfg"

[a-z] specifies a range of characters

  1. Return all observations with letters a,b,c, or d in them
  2. Return all observations with the numbers 7-9 in them
  3. Return all observations with at least one character that is not 7-9

grep("[a-d]", pan, value = T)  #will return any observation with the lowercase letters a-d in it
## [1] "qUiccK" "brOwn"  "Laz7y"  "d'og"
grep("[7-9]", pan, value = T)
## [1] "Laz7y"
grep("[^7-9]", pan, value = T)  #^ inside brackets means not: any entry with at least one character outside 7-9 matches, so every entry is returned
##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "the"   
##  [9] "Laz7y"  "d'og"   "."

Character Classes

  1. Return all entries with digits in them
  2. Return all entries with punctuation in them
  3. gsub out punctuation for a single space; the form is gsub(pattern, replacement, vec). Write the result to pan2

grep("[[:digit:]]", pan, value = T)
## [1] "Laz7y"
grep("[[:punct:]]", pan, value = T)
## [1] "&^"     "Jump;s" "ove_r"  "d'og"   "."
pan2 <- gsub("[[:punct:]]", " ", pan)
pan2
##  [1] "The"    "qUiccK" "brOwn"  "Fox"    "  "     "Jump s" "ove r"  "the"   
##  [9] "Laz7y"  "d og"   " "
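Replacing punctuation with a space leaves stray blanks inside and between words. As one possible next step toward the cleanup goal stated earlier (a sketch, not necessarily the cleanup the exercise intends), punctuation and digits can be deleted outright and the empty entries dropped. The name clean is hypothetical.

```r
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .",
                       " ", fixed = TRUE))

clean <- gsub("[[:punct:]]", "", pan)    # delete punctuation instead of replacing it with a space
clean <- gsub("[[:digit:]]", "", clean)  # delete stray digits
clean <- tolower(clean)                  # normalize case
clean <- clean[clean != ""]              # drop entries that were pure punctuation

print(clean)
```

The misspelling "quicck" survives, since these patterns only remove character classes and know nothing about spelling.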

Anchors

  1. ^ Start of the string
  2. $ End of the string
  3. \\b Empty string at either edge of a word
  4. \\B NOT the edge of a word
  5. \\< Beginning of a word
  6. \\> End of a word

Let’s give some of these a shot
1. Return words in pan that start with a T or a t
2. Return words in pan that don’t start with a T or a t

grep("^[Tt]", pan, value = T)
## [1] "The" "the"
grep("^[Tt]", pan, value = T, invert = T)  #invert = T returns the words that do not start with T or t
## [1] "qUiccK" "brOwn"  "Fox"    "&^"     "Jump;s" "ove_r"  "Laz7y"  "d'og"  
## [9] "."
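The $ and \\b anchors listed above can be tried the same way. This is a small sketch; the vector words is made up for illustration and the result names are hypothetical.

```r
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .",
                       " ", fixed = TRUE))

# $ anchors the match to the end of the string: entries ending in n or y
ends_ny <- grep("[ny]$", pan, value = TRUE)

# \\b marks a word edge, so "the" must stand alone as a word
words     <- c("the", "then", "bathe", "see the dog")  # hypothetical examples
whole_the <- grepl("\\bthe\\b", words)

print(ends_ny)
print(whole_the)
```

Without the word boundaries, the pattern "the" would also match inside "then" and "bathe".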

Quantifiers

  1. * Matches at least 0 times (zero or more)
  2. + Matches at least 1 time
  3. ? Matches at most 1 time; optional string
  4. {n} Matches exactly n times
  5. {n,} Matches at least n times
  6. {,n} Matches at most n times ({0,n} is the more widely supported spelling)
  7. {n,m} Matches between n and m times

Let’s try some quantifiers
1. Return words with two c’s in a row

grep("c{2}", pan, value = T)
## [1] "qUiccK"
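A few more quantifiers in action, as a minimal sketch. The vectors spellings and words are made up for illustration, and the result names are hypothetical.

```r
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .",
                       " ", fixed = TRUE))

# + : one or more lowercase o's
one_plus_o <- grep("o+", pan, value = TRUE)

# ? : the u is optional, so both spellings match
spellings <- c("color", "colour", "colr", "colouur")  # hypothetical examples
opt_u     <- grepl("^colou?r$", spellings)

# {n,m} : whole strings of 3 to 5 lowercase letters
words   <- c("be", "bog", "brown", "bananas")          # hypothetical examples
len_3_5 <- grep("^[[:lower:]]{3,5}$", words, value = TRUE)

print(one_plus_o)
print(opt_u)
print(len_3_5)
```

The ^ and $ anchors matter for the last two patterns: without them, a quantified piece could match inside a longer string.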