Our aim is to evaluate whether a machine can detect a recurring sequential pattern within a univariate time series (i.e., a single vector of observations that are temporally ordered). The time series we are working with are univariate – they look just like a long vector of numbers. The observations could reflect continuous measurement of air temperature, pupil size, heart rate, reaction time, etc. Most of these data are autocorrelated – i.e., where you are today predicts where you will be tomorrow.
# load some libraries
library(zoo)
library(tidyverse)
library(TSMining)
library(ngram)
We’ll start by simulating some data in which there exists a repeating pattern such as a specific number sequence. Here it’s 1-10 repeated 10 times. We’ll call that signal1. We’ll also create a different number pattern (21,22,23,24,25,24, 23,22,21) repeated 10x and call that signal2. Each of these signals constitute their own unique little pattern or motif. Time series researchers call these snippets ‘subsequences’. A motif is a special case of a repeating subsequence. For example, we might expect to see a few years of cooling temperatures with a gradual return to baseline in the air temperature after a giant volcano explodes. This motif is a meaningful pattern nested within climate time series.
signal1 <- data.frame(rep(seq(1, 10), 10))
colnames(signal1) <- "obs"
signal2 <- data.frame(rep(c(21, 22, 23, 24, 25, 24, 22, 21), 10))
colnames(signal2) <- "obs"
What’s a signal withut noise? Let’s embed our two motifs into a noise distribution. We’ll create this by randomly sampling integers 1-40 with replacement (200 observations) then combine signal1, signal2, and noise into a single vector. Convert that to a time series object, and plot it.
set.seed(1234)
noise <- data.frame(sample(c(1:40), 200, replace = T))
colnames(noise) <- "obs"
both <- rbind(signal1, noise, signal2)
both.ts <- as.ts(both)
plot.ts(both.ts, col = "red")
We have reached the critical point where we need to see if an agnostic (unsupervised) machine learning algoirthm can detect and extract the two motifs we embedded in there (signal1 and signal2) using univariate motif detection. We won’t tell it what to look for (that’s the unsupervised part), but we know the ground truth in advance. It needs to detect two motifs….and so it does.
both.look <- Func.motif(ts = both.ts, global.norm = T, local.norm = F, window.size = 20,
overlap = 0, w = 6, a = 5, mask.size = 5, max.dist.ratio = 1.2, count.ratio.1 = 1.1,
count.ratio.2 = 1.1)
length(both.look$Indices)
## [1] 2
Next step – get ggplot2 to identify and mark (color) the two motifs within the original time series.
col.motifs <- Func.visual.SingleMotif(single.ts = both.ts, window.size = 20,
motif.indices = both.look$Indices)
# Determine the total number of motifs discovered
n <- length(unique(col.motifs$data.1$Y))
# plot it
ggplot(data = col.motifs$data.1) + geom_line(aes(x = 1:dim(col.motifs$data.1)[1],
y = obs)) + geom_point(aes(x = 1:dim(col.motifs$data.1)[1], y = obs, color = Y,
shape = Y)) + scale_shape_manual(values = seq(1:n))
That worked great. The algorithm detected and colored the two motifs. Now let’s get ggplot to extract and plot the individual motifs from the original time series. This code was adapted from https://cran.r-project.org/web/packages/TSMining/vignettes/MotifDiscovery.html
for (i in 1:length(col.motifs$data.2)) {
data.temp <- col.motifs$data.2[[i]]
print(ggplot(data = data.temp) + geom_line(aes(x = Time, y = Value, color = Instance,
linetype = Instance)) + ggtitle(paste0("WCC Motif ", i)) + scale_y_continuous(limits = c(0,
max(data.temp$Value))) + theme(panel.background = element_rect(fill = "white",
colour = "black"), legend.position = "none", legend.title = element_blank()))
}
Now let’s check and make sure that the motifs are not baseline dependent. That is, 1,2,3,4,5 is recognized as the same motif as 36,37,38,39,40. Create and plot this distribution.
signal3 <- data.frame(rep(seq(1, 5), 10))
colnames(signal3) <- "obs"
signal22 <- data.frame(rep(seq(36, 40), 10))
colnames(signal22) <- "obs"
new.dist <- as.ts(rbind(signal3, noise, signal22))
plot(new.dist, col = "blue")
matters <- Func.motif(ts = new.dist, global.norm = T, local.norm = F, window.size = 20,
overlap = 0, w = 6, a = 5, mask.size = 5, max.dist.ratio = 1.2, count.ratio.1 = 1.1,
count.ratio.2 = 1.1)
length(matters$Indices)
## [1] 2
Next step – get ggplot2 to identify and color the two motifs within the original time series.
mat.mot <- Func.visual.SingleMotif(single.ts = new.dist, window.size = 20, motif.indices = matters$Indices)
# Determine the total number of motifs discovered
n <- length(unique(mat.mot$data.1$Y))
# plot it
ggplot(data = mat.mot$data.1) + geom_line(aes(x = 1:dim(mat.mot$data.1)[1],
y = obs)) + geom_point(aes(x = 1:dim(mat.mot$data.1)[1], y = obs, color = Y,
shape = Y)) + scale_shape_manual(values = seq(1:n)) + theme(panel.background = element_rect(fill = "white",
colour = "black"), legend.position = "none", legend.title = element_blank())
Well I’ll be damned. Two identical sequences in absolute difference are recognized as different motifs. That means we need to do some baseline correction.
This normalizes for changes in the baseline by subtracting the previous event from the current event. We’ll see then if the motifs are recognized as the same for the lagged time series. Here are the original and the normalized time series. We can then check if func.motif recognizes that there is just one motif now.
lag.dist <- new.dist %>% data.frame() %>% mutate(obs.l = obs - lag(obs, 1)) %>%
as.ts()
plot(lag.dist, col = "green")
lag.clean <- lag.dist %>% data.frame() %>% select(2) %>% filter(complete.cases(obs.l)) %>%
as.ts()
maybe <- Func.motif(ts = lag.clean, global.norm = T, local.norm = F, window.size = 20,
overlap = 0, w = 6, a = 5, mask.size = 5, max.dist.ratio = 1.2, count.ratio.1 = 1.1,
count.ratio.2 = 1.1)
length(maybe$Indices)
## [1] 1
Yes. it says there is just one. Let’s check and make sure.
onlyone <- Func.visual.SingleMotif(single.ts = lag.clean, window.size = 20,
motif.indices = maybe$Indices)
# Determine the total number of motifs discovered
n <- length(unique(onlyone$data.1$Y))
# plot it
ggplot(onlyone$data.1) + geom_line(aes(x = 1:dim(onlyone$data.1)[1], y = obs.l)) +
geom_point(aes(x = 1:dim(onlyone$data.1)[1], y = obs.l, color = Y, shape = Y)) +
scale_shape_manual(values = seq(1:n)) + theme(panel.background = element_rect(fill = "white",
colour = "black"), legend.position = "none", legend.title = element_blank())
This musical score was trnascribed courtesy of Professor Bob Weisberg - each note was assigned a numeric value beginning with 1 and ascending to the highest point in the scale (33). The note changes reflect a time series where we can look for recurring patterns. First let’s read in the data and set it up for time series analysis.
lady <- read.csv("ladybegood.csv", header = F)
lady.ts <- as.ts(lady)
plot.ts(lady.ts, col = "red")
Check for motifs
test <- Func.motif(ts = lady.ts, global.norm = T, local.norm = F, window.size = 5,
overlap = 0, w = 3, a = 3, mask.size = 2)
length(test$Indices)
## [1] 12
solo <- Func.visual.SingleMotif(single.ts = lady.ts, window.size = 5, motif.indices = test$Indices)
# Determine the total number of motifs discovered
n <- length(unique(solo$data.1$Y))
# plot it
ggplot(solo$data.1) + geom_line(aes(x = 1:dim(solo$data.1)[1], y = V1)) + geom_point(aes(x = 1:dim(solo$data.1)[1],
y = V1, color = Y, shape = Y)) + scale_shape_manual(values = seq(1:n))
Here are the individual motifs
for (i in 1:length(solo$data.2)) {
data.temp <- solo$data.2[[i]]
print(ggplot(data = data.temp) + geom_line(aes(x = Time, y = Value, color = Instance,
linetype = Instance)) + ggtitle(paste0("WCC Motif ", i)) + scale_y_continuous(limits = c(0,
max(data.temp$Value))) + theme(panel.background = element_rect(fill = "white",
colour = "black"), legend.position = "none", legend.title = element_blank()))
}
Ngrams work on character strings (numbers, letters, punctuation, spaces, etc.). For example, “old black frock coat” would be considered a 4-gram. You could scrape zillions of texts searching for this 4-gram and plot its frequency by year. Ngram analyses can be very useful for brute force pattern recognition. If a motif is a loose theme, then an ngram is an exact match. Let’s take a look at how ngrams work. Here’s a letter string where we know that there is a repeating 4-gram (a,b,c,d).
The ngram package produces a phrase table with all ngrams we specified at a particular length (n=4). You could see how this would get unwieldy of you were looking for repetitive ngrams in a Jane Austin novel. It’s important to threshold for frequency of n-grams or you’ll get all of them.
chars <- "a b c d e r d v h k a q e d u r s q p a b c d y a x v"
checking <- ngram(chars, n = 4)
get.phrasetable(checking)
## ngrams freq prop
## 1 a b c d 2 0.08333333
## 2 v h k a 1 0.04166667
## 3 k a q e 1 0.04166667
## 4 b c d e 1 0.04166667
## 5 d v h k 1 0.04166667
## 6 h k a q 1 0.04166667
## 7 e r d v 1 0.04166667
## 8 q p a b 1 0.04166667
## 9 b c d y 1 0.04166667
## 10 r s q p 1 0.04166667
## 11 q e d u 1 0.04166667
## 12 u r s q 1 0.04166667
## 13 c d e r 1 0.04166667
## 14 d e r d 1 0.04166667
## 15 s q p a 1 0.04166667
## 16 r d v h 1 0.04166667
## 17 a q e d 1 0.04166667
## 18 y a x v 1 0.04166667
## 19 d u r s 1 0.04166667
## 20 d y a x 1 0.04166667
## 21 c d y a 1 0.04166667
## 22 p a b c 1 0.04166667
## 23 e d u r 1 0.04166667
Here’s another way to get n-grams from this character string. The ngram_asweka (God bless you!) function allows the user to specify a minimum and maximum ngram range and export the results as a character string — not an n-gram object
range <- ngram_asweka(chars, min = 3, max = 4)
str(range)
## chr [1:49] "a b c d" "b c d e" "c d e r" "d e r d" "e r d v" ...
table(range)
## range
## a b c a b c d a q e a q e d a x v b c d b c d e b c d y c d e
## 2 2 1 1 1 2 1 1 1
## c d e r c d y c d y a d e r d e r d d u r d u r s d v h d v h k
## 1 1 1 1 1 1 1 1 1
## d y a d y a x e d u e d u r e r d e r d v h k a h k a q k a q
## 1 1 1 1 1 1 1 1 1
## k a q e p a b p a b c q e d q e d u q p a q p a b r d v r d v h
## 1 1 1 1 1 1 1 1 1
## r s q r s q p s q p s q p a u r s u r s q v h k v h k a y a x
## 1 1 1 1 1 1 1 1 1
## y a x v
## 1
maybe <- concatenate(as.character(rep(seq(1:5), 5))) numgram <- ngram_asweka(maybe, min = 3, max = 5) str(numgram)