1 Introduction

We often need to create new variables that involve weird linear (or nonlinear) transformations of other variables. Sometimes these transformations are driven by the assumptions of particular statistical tests (e.g., normality). There are many different ways of transforming data and/or applying functions within specific grouping variables. Loops, lapply, mutate, if, else if: all of these do the job in different ways.

2 Functions

Many R functions work automagically under the hood. However, sometimes it’s quite useful to develop your own functions. Functions take the generic form:

function(arguments) {body}

The body of a function is where the magic happens. Arguments are fed into the function, the body is executed, and the value of the last expression is returned. Here’s a function that simply adds two numbers together. The nice thing about functions is that you can call them whenever you need them rather than retyping a repetitive computation and/or transformation. The irony in the example to follow is that it takes longer to type funk.it(9,3) than just 9+3. However, you get the idea.

funk.it <- function(a, b) {
    a + b
}  #convention is to drop the curly bracket on its own line. You'll see why later.
funk.it(9, 3)
## [1] 12
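To make the reuse point a little more concrete, here is a minimal sketch of a function that wraps a small transformation (mean-centering) so it can be applied to any vector. The name center.it is hypothetical and only for illustration; it is not part of the original examples.

```r
# center.it is a hypothetical helper: subtract the mean from every element
center.it <- function(x) {
    x - mean(x)  # the last evaluated expression is the return value
}
center.it(c(2, 4, 6))
## [1] -2  0  2
center.it(c(10, 20, 30, 40))
## [1] -15  -5   5  15
```

Once the function is defined, the same transformation is a single call away for every new vector.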

Let’s do something crazy like create a custom function (called i.mean.it) that generates the mean from a specified vector. Will this work?

complicated.vec <- c(10, 20, 45, NA)  #play data where we know the mean in advance, X=25.
i.mean.it <- function(x) {
    sum(x)/length(x)  #no NA handling: sum() returns NA, and length() counts every element, NAs included
}
i.mean.it(complicated.vec)
## [1] NA

Yikes. No. It doesn’t work. We need the function to account for missing values in both the numerator and the denominator. If you sum an NA in the numerator, you get NA. If you count NAs in the denominator, you’ll inflate the number of observations. Missing observations are a double whammy for this function. What to do?

I’m so glad you asked. Here’s a function called ‘i.really.mean.it’ that calculates a mean with the all-important caveat that it ignores missing observations – dropping all NA’s from the sum() and length() commands.

i.really.mean.it <- function(x) {
    sum(x, na.rm = TRUE)/length(x[!is.na(x)])  #sum() skips NAs; length() counts only the observations that are not NA
}
i.really.mean.it(complicated.vec)  #run the 'mean' function on the test vector
## [1] 25

Voila! Now you can apply ‘i.really.mean.it’ to calculate the mean of any vector of numbers, even when there are missing values. Alternatively you could just skip all that and use the base R mean() function with na.rm = TRUE, but then you wouldn’t have the satisfaction of writing your own function. If you are interested in the global application of functions, check out ‘scoping’.
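As a quick sanity check, the hand-rolled function can be compared against base R. This sketch repeats the definitions from above so it runs on its own; note that mean() also returns NA unless na.rm = TRUE is supplied.

```r
complicated.vec <- c(10, 20, 45, NA)
i.really.mean.it <- function(x) {
    sum(x, na.rm = TRUE)/length(x[!is.na(x)])
}
mean(complicated.vec)  #na.rm defaults to FALSE, so the NA propagates
## [1] NA
mean(complicated.vec, na.rm = TRUE)
## [1] 25
isTRUE(all.equal(i.really.mean.it(complicated.vec), mean(complicated.vec, na.rm = TRUE)))
## [1] TRUE
```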

2.1 if function

tbd

2.2 else if function

tbd

3 Loops

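A loop executes a block of code over and over. A ‘for’ loop stops when it reaches the end of its sequence; a ‘while’ loop stops when its condition becomes false (or runs forever if you’re not careful in specifying a stopping condition or a break). R has several different types of loop (for, while, repeat).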
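To make the point about termination concrete, here is a minimal sketch of three ways a loop can end. The variable names (counter, n) are placeholders chosen for illustration, and repeat is included as a third base R loop type alongside for and while.

```r
# for: stops on its own after the last element of the sequence
for (i in 1:3) {
    print(i)
}

# while: keeps running until the condition is FALSE
counter <- 1
while (counter <= 3) {
    print(counter)
    counter <- counter + 1  #without this update the loop would never end
}

# repeat: runs forever unless you break out explicitly
n <- 0
repeat {
    n <- n + 1
    if (n >= 3) break
}
```

Each of the first two loops prints 1, 2, 3; the repeat loop exits once n reaches 3.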

The most basic, generic form of a ‘for’ loop is:

for (i in dat) {
    # code that repeats once for each element of dat
}

3.1 ‘for’ loops

Starting super simple with ‘for’ loops: generate a sequence of numbers from 1 to 5, square each number, and print the output.

for (i in 1:5) {
    print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25

Now let’s loop over the elements of a predetermined vector with a ‘for’ loop, again just squaring each original value. Here i takes on the values of myvec themselves, not their positions.

myvec <- c(-3, 2, 1, 7, 5)
for (i in myvec) {
    print(i^2)
}
## [1] 9
## [1] 4
## [1] 1
## [1] 49
## [1] 25

Create an empty vector and populate it with numeric input from a simple ‘for’ loop. This process of setting up an empty basket is known as pre-allocation. In this loop, we’ll just multiply a sequence of five numbers by 4, dumping each result into the ‘storage’ vector.

storage <- numeric(5)  #pre-allocates a numeric vector of 5 zeros, to be overwritten by the loop
for (i in 1:5) {
    storage[i] <- i * 4  #fills the storage vector, the i-th element of storage is filled with i*4
}
print(storage)
## [1]  4  8 12 16 20
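Pre-allocation also works when looping over an existing vector. The sketch below assumes the myvec values from earlier and uses seq_along() to loop over positions rather than values, so each result lands in the matching slot. The name squares is just an illustrative choice.

```r
myvec <- c(-3, 2, 1, 7, 5)
squares <- numeric(length(myvec))  #pre-allocate one slot per element of myvec
for (i in seq_along(myvec)) {      #i is the position (1 to 5), not the value
    squares[i] <- myvec[i]^2
}
print(squares)
## [1]  9  4  1 49 25
```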

Vectors are pretty easy. Here’s pre-allocation for a matrix. It’s not much harder. Use a nested for loop to create a multiplication table. This clever little loop first involves pre-allocating an empty 8 x 8 matrix that will receive loop output. Matrix notation here is i=row index, j=column index.

TT <- matrix(data = NA, nrow = 8, ncol = 8)  #creates an 8 x 8 matrix filled with missing values
for (i in 1:8) {  #row sequence 1:8
    for (j in 1:8) {  #column sequence 1:8
        TT[i, j] <- i * j  #1*1, 1*2, .... 8*8
    }
}
print(TT)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    2    3    4    5    6    7    8
## [2,]    2    4    6    8   10   12   14   16
## [3,]    3    6    9   12   15   18   21   24
## [4,]    4    8   12   16   20   24   28   32
## [5,]    5   10   15   20   25   30   35   40
## [6,]    6   12   18   24   30   36   42   48
## [7,]    7   14   21   28   35   42   49   56
## [8,]    8   16   24   32   40   48   56   64
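For comparison, base R’s outer() can build the same table without an explicit loop. This is a sketch of an alternative, not a replacement for the nested loop above; the name TT2 is illustrative, and the loop is repeated so the check runs on its own.

```r
TT <- matrix(data = NA, nrow = 8, ncol = 8)
for (i in 1:8) {
    for (j in 1:8) {
        TT[i, j] <- i * j
    }
}
TT2 <- outer(1:8, 1:8)  #every pairwise product of 1:8 with 1:8
all(TT == TT2)          #element-by-element comparison of the two tables
## [1] TRUE
```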


3.2 ‘while’ loops

TBD



4 Mutate for data transformations

4.1 add a new variable to an existing dataframe

Here’s dplyr’s powerful mutate function as an alternative to a ‘for’ loop. We’ll generate one new variable: first generate a sequence from 1 to 5, save it as a dataframe, then mutate it.

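library(dplyr)  #provides %>%, mutate(), and as_tibble()
vec1 <- data.frame(a = 1:5)
vec2 <- vec1 %>% as_tibble() %>% mutate(b = a^2)  #as_tibble() replaces the deprecated as.tibble()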
print(vec2)
## # A tibble: 5 x 2
##       a     b
##   <int> <dbl>
## 1     1     1
## 2     2     4
## 3     3     9
## 4     4    16
## 5     5    25
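To see what mutate is saving you, here is a sketch of the same transformation written as a base R ‘for’ loop with a pre-allocated column. It is one of several possible loop versions and needs no packages.

```r
vec1 <- data.frame(a = 1:5)
vec1$b <- numeric(nrow(vec1))  #pre-allocate the new column with zeros
for (i in seq_len(nrow(vec1))) {
    vec1$b[i] <- vec1$a[i]^2   #fill row i of b with the square of row i of a
}
print(vec1)
##   a  b
## 1 1  1
## 2 2  4
## 3 3  9
## 4 4 16
## 5 5 25
```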

4.2 add more than one variable to an existing dataframe

A single mutate call can create several variables at once. Here’s an example that adds a cubed variable alongside the squared one.

vec1 <- data.frame(a = 1:5)
vec2 <- vec1 %>% as_tibble() %>% mutate(b = a^2, c = a^3)  #as_tibble() replaces the deprecated as.tibble()
print(vec2)
## # A tibble: 5 x 3
##       a     b     c
##   <int> <dbl> <dbl>
## 1     1     1     1
## 2     2     4     8
## 3     3     9    27
## 4     4    16    64
## 5     5    25   125
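One detail worth knowing: mutate builds its new variables in order, so a later variable can refer to one created earlier in the same call. The sketch below assumes dplyr is installed; vec3 is an illustrative name, and c is computed from b rather than directly from a.

```r
library(dplyr)
vec1 <- data.frame(a = 1:5)
vec3 <- vec1 %>% mutate(b = a^2, c = b * a)  #c uses the freshly created b
all(vec3$c == vec3$a^3)  #b * a is the same as a cubed
## [1] TRUE
```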