Regular expressions, also known as regex or regexp, provide a compact syntax for matching patterns in strings, of the sort used in find-and-replace operations. Here are some of the metacharacters we will be using in various text mining applications:
. | ( ) [ { ^ $ * + ? -
Regexp cheatsheet https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
Online regular expression checker https://spannbaueradam.shinyapps.io/r_regex_tester/
Square brackets, as in “[regexp]”, define a character class: a set of characters, any one of which may match at that position. Make sure to keep both the brackets and the quotes.
Specify entire classes of characters (e.g., digits). Here are some character classes:
1. [[:digit:]] or \\d or [0-9]; digits from 0-9
2. \\D or [^0-9]; not digits
3. [[:lower:]] or [a-z]; lowercase letters
4. [[:upper:]] or [A-Z]; Uppercase letters
5. [[:alpha:]] or [A-Za-z]; Alphabetic characters
6. [[:alnum:]] or [A-Za-z0-9]; Alphanumeric characters
7. [[:punct:]]; Punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
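As a quick check of the list above, the following base R sketch compares the equivalent ways of writing a class. The vector `x` is a made-up example, not part of the original exercise.

```r
# Hypothetical test strings (not from the exercise)
x <- c("abc", "a1", "42", "Hi!")

# Three equivalent ways to ask "does this string contain a digit?"
posix_digit <- grepl("[[:digit:]]", x)
short_digit <- grepl("\\d", x)
range_digit <- grepl("[0-9]", x)
print(posix_digit)   # FALSE TRUE TRUE FALSE

# \\D asks for at least one character that is NOT a digit
non_digit <- grepl("\\D", x)
print(non_digit)     # TRUE TRUE FALSE TRUE

# Uppercase letters and punctuation
has_upper <- grepl("[[:upper:]]", x)
has_punct <- grepl("[[:punct:]]", x)
print(has_upper)     # FALSE FALSE FALSE TRUE
print(has_punct)     # FALSE FALSE FALSE TRUE
```

Note that the backslash is doubled ("\\d") because R string literals use the backslash as their own escape character.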
Let’s work with regular expressions on a familiar phrase (a pangram) with some extra junk in it. Read this phrase into an object called ‘pan’ so that each space-separated piece becomes its own element of a character vector. Hint: if you read it in as a single raw string, you need to split it and then unlist the result.
The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d’og .
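library(stringr) ##provides str_split() and the %>% pipe
pan <- c("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .")
pan <- str_split(pan, " ") %>% unlist() ##split the string on spaces, then flatten the list to a vector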
class(pan)
## [1] "character"
print(pan)
## [1] "The" "qUiccK" "brOwn" "Fox" "&^" "Jump;s" "ove_r" "the"
## [9] "Laz7y" "d'og" "."
Our goal is to clean this up and eventually tokenize it. There are misspellings, errant punctuation, extra white space and a mix of upper and lowercase letters. Let’s first check some of the observations using metacharacters and regular expressions.
Let’s grep our way to retrieving observations with some particular patterns, where the form is grep(pattern, string, value=T). If you want to search for or extract a character like “.” that is also reserved as a metacharacter, you must escape the character with a double backslash (\\)
1. return all observations with an “i” followed by any character
2. return all observations with a “T” followed by any character
3. return all observations with a J followed by any character then an m
4. return all observations where “^” is preceded by any character
# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("i.", pan, value = T)
## [1] "qUiccK"
grep("T.", pan, value = T)
## [1] "The"
grep("J.m", pan, value = T)
## [1] "Jump;s"
grep(".\\^", pan, value = T)
## [1] "&^"
grep("i|z", pan, value = T) #returns all observations containing i or z
## [1] "qUiccK" "Laz7y"
grep("o|&|j", pan, value = T)
## [1] "Fox" "&^" "ove_r" "d'og"
grep("\\d|F", pan, value = T)
## [1] "Fox" "Laz7y"
grep("t.e|o.e", pan, value = T)
## [1] "ove_r" "the"
# The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .
grep("[Tt]he", pan, value = T) #returns any instances of The or the
## [1] "The" "the"
grep("[Tt]he", pan, value = T, invert = T) #invert = T returns every observation that does not contain The or the
## [1] "qUiccK" "brOwn" "Fox" "&^" "Jump;s" "ove_r" "Laz7y" "d'og"
## [9] "."
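To make the escaping rule concrete, here is a small sketch of searching for a literal period in pan. It rebuilds pan with base R's strsplit() (swapped in for str_split() only so the snippet runs on its own) and shows three ways to treat “.” as an ordinary character.

```r
# Rebuild pan with base R so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " "))

# Unescaped, "." is a metacharacter meaning "any character",
# so every non-empty element matches
any_char <- grep(".", pan, value = TRUE)
print(length(any_char))   # 11

# Escaped with a double backslash, "." matches only a literal period
literal_dot <- grep("\\.", pan, value = TRUE)
print(literal_dot)        # "."

# Two alternatives: put the period inside brackets, or turn regex off with fixed = TRUE
bracket_dot <- grep("[.]", pan, value = TRUE)
fixed_dot <- grep(".", pan, value = TRUE, fixed = TRUE)
```

The fixed = TRUE option treats the whole pattern as plain text, which can be handy when no metacharacters are needed at all.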
Let’s create a new vector to test sequence matching vs character matching
bananas are yellow but sometimes brown
bananas <- c("bananas are yellow but sometimes brown")
bananas2 <- str_split(bananas, " ") %>% unlist()
print(bananas2)
## [1] "bananas" "are" "yellow" "but" "sometimes" "brown"
str(bananas2)
## chr [1:6] "bananas" "are" "yellow" "but" "sometimes" "brown"
Look for the exact letter sequence “ana” anywhere in a string (as in “bananas”) versus any character string that has either an a or an n anywhere in it. We will try this three ways.
# bananas are yellow but sometimes brown
grep("[ana]", bananas2, value = T) #return entries with an a or an n (repeating a inside the brackets changes nothing)
## [1] "bananas" "are" "brown"
grep("ana", bananas2, value = T) #return entries with letter sequence -ana
## [1] "bananas"
grep("a|n|a", bananas2, value = T) #return entries with a, n, or a
## [1] "bananas" "are" "brown"
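One way to see the difference between a character class and a sequence is to pull out the pieces that actually matched. The sketch below uses base R's gregexpr() and regmatches() on the single word “bananas”; the object names are illustrative only.

```r
word <- "bananas"

# Character class: each match is ONE character that is an a or an n
class_hits <- regmatches(word, gregexpr("[ana]", word))[[1]]
print(class_hits)   # "a" "n" "a" "n" "a"

# Sequence: each match is the three-letter run "ana";
# matches do not overlap, so only one is found in "bananas"
seq_hits <- regmatches(word, gregexpr("ana", word))[[1]]
print(seq_hits)     # "ana"
```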
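The caret (^) can mean ‘starts with’ or ‘not’ depending on where it is: outside the brackets it anchors the match to the start of the string, and as the first character inside the brackets it negates the class
[^x] means one character that is not x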
# bananas are yellow but sometimes brown
grep("^[ba]", bananas2, value = T) #starts with b or a; the ^ is outside the brackets
## [1] "bananas" "are" "but" "brown"
grep("[^are]", bananas2, value = T) #matches strings with at least one character that is not a, r, or e; the caret is inside the brackets
## [1] "bananas" "yellow" "but" "sometimes" "brown"
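Because a negated class is easy to misread, here is a sketch contrasting three different questions. The extra word “ear” is a hypothetical addition to the vector, included only because it separates the first two cases.

```r
# bananas2 rebuilt with base R, plus one hypothetical extra word ("ear")
words <- c(unlist(strsplit("bananas are yellow but sometimes brown", " ")), "ear")

# 1. At least one character that is not a, r, or e
has_other <- grep("[^are]", words, value = TRUE)
print(has_other)    # "bananas" "yellow" "but" "sometimes" "brown"

# 2. Does not contain the letter sequence "are"
no_are_seq <- grep("are", words, value = TRUE, invert = TRUE)
print(no_are_seq)   # "bananas" "yellow" "but" "sometimes" "brown" "ear"

# 3. Contains no a, r, or e anywhere (every character is outside the class)
none_of <- grep("^[^are]+$", words, value = TRUE)
print(none_of)      # "but"
```

“ear” drops out of the first result because all of its characters are in the class, yet it stays in the second because the sequence “are” never appears in it.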
Another example: big, bog, beg, byg, bfg
big <- c("big bog beg byg bfg")
big <- str_split(big, " ") %>% unlist()
print(big)
## [1] "big" "bog" "beg" "byg" "bfg"
grep("b[io]g", big, value = T) #returns big or bog
## [1] "big" "bog"
grep("b[^io]g", big, value = T) #returns b, then any one character other than i or o, then g
## [1] "beg" "byg" "bfg"
grep("[a-d]", pan, value = T) #will return any observation with the lowercase letters a-d in it
## [1] "qUiccK" "brOwn" "Laz7y" "d'og"
grep("[7-9]", pan, value = T)
## [1] "Laz7y"
grep("[^7-9]", pan, value = T) #^ inside the brackets means not; every element has at least one character other than 7, 8, or 9, so all are returned
## [1] "The" "qUiccK" "brOwn" "Fox" "&^" "Jump;s" "ove_r" "the"
## [9] "Laz7y" "d'og" "."
grep("[[:digit:]]", pan, value = T)
## [1] "Laz7y"
grep("[[:punct:]]", pan, value = T)
## [1] "&^" "Jump;s" "ove_r" "d'og" "."
pan2 <- gsub("[[:punct:]]", " ", pan)
pan2
## [1] "The" "qUiccK" "brOwn" "Fox" "  " "Jump s" "ove r" "the"
## [9] "Laz7y" "d og" " "
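Replacing punctuation with a space leaves gaps inside words (“Jump s”, “d og”). As one possible next step toward the cleanup goal stated earlier, the sketch below deletes punctuation instead, drops the elements that end up empty, and lowercases the rest. The object names no_punct and cleaned are illustrative and not part of the original exercise.

```r
# Rebuild pan with base R so this snippet is self-contained
pan <- unlist(strsplit("The qUiccK brOwn Fox &^ Jump;s ove_r the Laz7y d'og .", " "))

# Delete punctuation outright by replacing it with the empty string
no_punct <- gsub("[[:punct:]]", "", pan)

# Elements that were pure punctuation ("&^" and ".") are now empty; drop them
no_punct <- no_punct[no_punct != ""]

# Normalize case
cleaned <- tolower(no_punct)
print(cleaned)
# "the" "quicck" "brown" "fox" "jumps" "over" "the" "laz7y" "dog"
```

The misspelling (“quicck”) and the stray digit (“laz7y”) are still there; those need separate patterns.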
Let’s give some of these a shot
1. Return words in pan that start with a T or a t
2. Return words in pan that don’t start with a T or a t
grep("^[Tt]", pan, value = T)
## [1] "The" "the"
grep("^[Tt]", pan, value = T, invert = T) #does not start with T or t
## [1] "qUiccK" "brOwn" "Fox" "&^" "Jump;s" "ove_r" "Laz7y" "d'og"
## [9] "."
Let’s try some quantifiers
1. Return words that contain at least two consecutive c’s
grep("c{2}", pan, value = T)
## [1] "qUiccK"
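The {2} quantifier is one of several. The sketch below runs the common ones against a small made-up vector (`x` is hypothetical) so their behavior can be compared side by side. The ^ and $ anchors pin the pattern to the whole string, so the counts are exact rather than "somewhere inside".

```r
# Hypothetical strings: an "a" followed by zero to three c's
x <- c("a", "ac", "acc", "accc")

# Unanchored: "cc" appears somewhere, so three c's also match
two_anywhere <- grepl("c{2}", x)
print(two_anywhere)    # FALSE FALSE TRUE TRUE

# Anchored quantifiers
exactly_two  <- grepl("^ac{2}$", x)    # exactly two c's
two_or_more  <- grepl("^ac{2,}$", x)   # two or more
one_to_two   <- grepl("^ac{1,2}$", x)  # between one and two
zero_or_one  <- grepl("^ac?$", x)      # ? means zero or one
zero_or_more <- grepl("^ac*$", x)      # * means zero or more
one_or_more  <- grepl("^ac+$", x)      # + means one or more

print(exactly_two)     # FALSE FALSE TRUE FALSE
print(one_or_more)     # FALSE TRUE TRUE TRUE
```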