Start with a small dataset

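Grouping and summarizing are where things get tricky. Two things that have helped me:
1. call group_by with the dplyr namespace spelled out (dplyr::group_by), so a same-named function from another loaded package cannot mask it
2. remember to ungroup when you are done!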
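
To see why the second point matters, here is a minimal sketch on a hypothetical three-row data frame (`toy`, with made-up columns `g` and `x`; it is not the data used below). A grouped data frame stays grouped after mutate, so later calculations are silently done per group until ungroup is called.

```r
library(dplyr)

# Hypothetical toy data, not the fakeiq data set
toy <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 10))

grouped <- toy %>%
    dplyr::group_by(g) %>%
    mutate(group_mean = mean(x))   # the result is still grouped by g

# Same mutate call, with and without the grouping in place
still_grouped <- grouped %>% mutate(total = sum(x))                # sum within each group
after_ungroup <- grouped %>% ungroup() %>% mutate(total = sum(x))  # sum over all rows

is_grouped_df(grouped)   # TRUE
still_grouped$total      # 3 3 10
after_ungroup$total      # 13 13 13
```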

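library(dplyr)  # provides %>%, mutate, group_by, distinct_at, summarise, across

fakeiq <- read.csv("data/state_fake_iq.csv")
# add a random normal variable (mean 10, sd 2), one draw per row
fakeiq <- fakeiq %>%
    mutate(myrandvar = rnorm(n(), 10, 2))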
print(fakeiq)
##    Shoes        State         City Person Height Day myrandvar
## 1     92      GEORGIA      Atlanta      A   Tall   M 10.082926
## 2     90      GEORGIA      Atlanta      A  Short   T 10.138773
## 3     93      GEORGIA      Atlanta      B   Tall   M  7.268743
## 4     94      GEORGIA      Atlanta      B  Short   T 10.997517
## 5    115   CALIFORNIA        Davis      C   Tall   M  8.602848
## 6    117   CALIFORNIA        Davis      C  Short   T  9.846509
## 7    114   CALIFORNIA        Davis      D   Tall   M  9.078669
## 8    116   CALIFORNIA        Davis      D  Short   T 12.269994
## 9    185 PENNSYLVANIA Philadelphia      E   Tall   M  4.951502
## 10   187 PENNSYLVANIA Philadelphia      E  Short   T 10.132920
## 11   188 PENNSYLVANIA Philadelphia      F   Tall   M 10.426704
## 12   186 PENNSYLVANIA Philadelphia      F  Short   T 14.260135
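
If you do not have the CSV file, the printed rows above are enough to rebuild the data by hand. The sketch below does that under a few assumptions: the object name `fakeiq_demo` is made up, Shoes is stored as numeric rather than integer, and the seed is arbitrary (the original code sets none), so the myrandvar values will not match the ones printed here.

```r
library(dplyr)

set.seed(42)  # assumption: arbitrary seed, only so reruns give the same draws

# Rebuild the 12 rows shown in the printed output
fakeiq_demo <- data.frame(
    Shoes  = c(92, 90, 93, 94, 115, 117, 114, 116, 185, 187, 188, 186),
    State  = rep(c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"), each = 4),
    City   = rep(c("Atlanta", "Davis", "Philadelphia"), each = 4),
    Person = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
    Height = rep(c("Tall", "Short"), times = 6),
    Day    = rep(c("M", "T"), times = 6)
) %>%
    mutate(myrandvar = rnorm(n(), mean = 10, sd = 2))  # one draw per row

dim(fakeiq_demo)  # 12 rows, 7 columns
```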

Convert multiple columns to factor

Use lapply to apply factor() to several columns at once (here columns 2 through 6)

myfacs <- 2:6  # column positions of State, City, Person, Height, Day
fakeiq[, myfacs] <- lapply(fakeiq[, myfacs], factor)
str(fakeiq)
## 'data.frame':    12 obs. of  7 variables:
##  $ Shoes    : int  92 90 93 94 115 117 114 116 185 187 ...
##  $ State    : Factor w/ 3 levels "CALIFORNIA","GEORGIA",..: 2 2 2 2 1 1 1 1 3 3 ...
##  $ City     : Factor w/ 3 levels "Atlanta","Davis",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Person   : Factor w/ 6 levels "A","B","C","D",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Height   : Factor w/ 2 levels "Short","Tall": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Day      : Factor w/ 2 levels "M","T": 1 2 1 2 1 2 1 2 1 2 ...
##  $ myrandvar: num  10.08 10.14 7.27 11 8.6 ...
print(fakeiq)
##    Shoes        State         City Person Height Day myrandvar
## 1     92      GEORGIA      Atlanta      A   Tall   M 10.082926
## 2     90      GEORGIA      Atlanta      A  Short   T 10.138773
## 3     93      GEORGIA      Atlanta      B   Tall   M  7.268743
## 4     94      GEORGIA      Atlanta      B  Short   T 10.997517
## 5    115   CALIFORNIA        Davis      C   Tall   M  8.602848
## 6    117   CALIFORNIA        Davis      C  Short   T  9.846509
## 7    114   CALIFORNIA        Davis      D   Tall   M  9.078669
## 8    116   CALIFORNIA        Davis      D  Short   T 12.269994
## 9    185 PENNSYLVANIA Philadelphia      E   Tall   M  4.951502
## 10   187 PENNSYLVANIA Philadelphia      E  Short   T 10.132920
## 11   188 PENNSYLVANIA Philadelphia      F   Tall   M 10.426704
## 12   186 PENNSYLVANIA Philadelphia      F  Short   T 14.260135
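
The same conversion can be written inside a dplyr pipeline by selecting columns by name instead of position. This is a minimal sketch on a hypothetical three-row `demo` data frame, assuming a dplyr version that has across() (1.0.0 or later).

```r
library(dplyr)

# Hypothetical cut-down data with two character columns
demo <- data.frame(
    Shoes = c(92, 115, 185),
    State = c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"),
    City  = c("Atlanta", "Davis", "Philadelphia")
)

# Convert the named columns to factors; Shoes is left untouched
demo <- demo %>%
    mutate(across(c(State, City), factor))

sapply(demo, class)
```

Selecting by name is a little more robust than `2:6` if the column order of the file ever changes.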

Distinct_at

One variable distinct_at

distinct_at keeps the first row for each distinct value of the chosen variable and drops the others. Here we keep the first row for each State

try1 <- distinct_at(fakeiq, vars(State), .keep_all = TRUE) %>%
    data.frame()
try1
Shoes  State         City          Person  Height  Day  myrandvar
92     GEORGIA       Atlanta       A       Tall    M    10.08
115    CALIFORNIA    Davis         C       Tall    M    8.60
185    PENNSYLVANIA  Philadelphia  E       Tall    M    4.95
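
In recent dplyr releases the scoped verbs such as distinct_at are marked as superseded, and plain distinct() accepts the column names directly. The sketch below compares the two on a hypothetical four-row `demo` data frame; it is meant as an equivalent spelling, not a replacement for the code above.

```r
library(dplyr)

# Hypothetical cut-down data: two rows per State
demo <- data.frame(
    Shoes = c(92, 90, 115, 117),
    State = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
    Day   = c("M", "T", "M", "T")
)

via_at       <- distinct_at(demo, vars(State), .keep_all = TRUE)
via_distinct <- distinct(demo, State, .keep_all = TRUE)

via_distinct
# Both calls keep the first row per State (Shoes 92 and 115)
all.equal(via_at, via_distinct, check.attributes = FALSE)
```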

Two variables distinct_at

With two variables, distinct_at keeps the first row for each combination and drops the others. Here the combination is State and Person: the first row for each State/Person pair is retained, with all other columns carried along, and the remaining rows for that pair are dropped

try2 <- distinct_at(fakeiq, vars(State, Person), .keep_all = TRUE) %>%
    data.frame()
try2
Shoes  State         City          Person  Height  Day  myrandvar
92     GEORGIA       Atlanta       A       Tall    M    10.08
93     GEORGIA       Atlanta       B       Tall    M    7.27
115    CALIFORNIA    Davis         C       Tall    M    8.60
114    CALIFORNIA    Davis         D       Tall    M    9.08
185    PENNSYLVANIA  Philadelphia  E       Tall    M    4.95
188    PENNSYLVANIA  Philadelphia  F       Tall    M    10.43

Drop other columns distinct_at

Change .keep_all to FALSE and you drop every column other than the two specified

try3 <- distinct_at(fakeiq, vars(State, City), .keep_all = FALSE) %>%
    data.frame()
try3
State         City
GEORGIA       Atlanta
CALIFORNIA    Davis
PENNSYLVANIA  Philadelphia
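
A quick way to convince yourself what .keep_all changes is to compare the column names of the two results. This sketch uses a hypothetical four-row `demo` data frame: the number of rows is the same either way, only the columns differ.

```r
library(dplyr)

# Hypothetical cut-down data: two rows per State/City pair
demo <- data.frame(
    Shoes = c(92, 90, 115, 117),
    State = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
    City  = c("Atlanta", "Atlanta", "Davis", "Davis")
)

kept    <- distinct_at(demo, vars(State, City), .keep_all = TRUE)
dropped <- distinct_at(demo, vars(State, City), .keep_all = FALSE)

names(kept)     # "Shoes" "State" "City"
names(dropped)  # "State" "City"
nrow(kept) == nrow(dropped)  # TRUE: same rows, fewer columns
```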

Group_By, Summarize_at

1-variable, summarize to one row per group

Now let’s create a summary statistic (mean) for “Shoes” by State, then drop the grouping

# var to summarize in quotes
try4 <- fakeiq %>%
    dplyr::group_by(State) %>%
    summarize_at("Shoes", mean, na.rm = TRUE) %>%
    ungroup()
try4
State         Shoes
CALIFORNIA    115.50
GEORGIA       92.25
PENNSYLVANIA  186.50
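
summarize_at belongs to the same family of superseded scoped verbs, and the same summary can be written with summarise() plus across(). The sketch below runs both on a hypothetical `demo` data frame; the NA is added on purpose to show what na.rm = TRUE is doing, and is not in the real data.

```r
library(dplyr)

# Hypothetical data with one missing Shoes value
demo <- data.frame(
    State = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
    Shoes = c(92, NA, 115, 117)
)

via_at <- demo %>%
    dplyr::group_by(State) %>%
    summarize_at("Shoes", mean, na.rm = TRUE) %>%
    ungroup()

via_across <- demo %>%
    dplyr::group_by(State) %>%
    summarise(across(Shoes, ~ mean(.x, na.rm = TRUE))) %>%
    ungroup()

via_across
# CALIFORNIA 116, GEORGIA 92 (the NA is ignored rather than turning the mean into NA)
```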

2-variable, summarize to one row per combination

try5 <- fakeiq %>%
    dplyr::group_by(State, Person) %>%
    summarize_at("Shoes", mean, na.rm = TRUE) %>%
    ungroup()
try5
State         Person  Shoes
CALIFORNIA    C       116.0
CALIFORNIA    D       115.0
GEORGIA       A       91.0
GEORGIA       B       93.5
PENNSYLVANIA  E       186.0
PENNSYLVANIA  F       187.0
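
With two grouping variables the ungroup() step is doing real work: summarise only peels off the last grouping variable, so the result is still grouped by State. A minimal sketch on a hypothetical `demo` data frame, assuming dplyr 1.0.0 or later for the .groups argument:

```r
library(dplyr)

# Hypothetical cut-down data
demo <- data.frame(
    State  = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
    Person = c("A", "B", "C", "C"),
    Shoes  = c(92, 93, 115, 117)
)

# Without ungroup(): the result keeps State as a grouping variable
still <- demo %>%
    dplyr::group_by(State, Person) %>%
    summarise(Shoes = mean(Shoes))
group_vars(still)    # "State"

# .groups = "drop" removes all grouping, like a trailing ungroup()
dropped <- demo %>%
    dplyr::group_by(State, Person) %>%
    summarise(Shoes = mean(Shoes), .groups = "drop")
group_vars(dropped)  # character(0)
```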

Group_By and Summarize_across

all numeric vars

Compute the sum of every numeric variable, by State

try6 <- fakeiq %>%
    dplyr::group_by(State) %>%
    summarise(across(where(is.numeric), sum)) %>%
    ungroup()
try6
State         Shoes  myrandvar
CALIFORNIA    462    39.80
GEORGIA       369    38.49
PENNSYLVANIA  746    39.77
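
across() can also take a named list of functions, which gives one output column per variable and function. The sketch below is a small extension of the example above on a hypothetical `demo` data frame with made-up myrandvar values; the column names follow across()'s default variable_function pattern.

```r
library(dplyr)

# Hypothetical cut-down data
demo <- data.frame(
    State     = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
    Shoes     = c(92, 90, 115, 117),
    myrandvar = c(10, 12, 8, 9)
)

out <- demo %>%
    dplyr::group_by(State) %>%
    summarise(
        across(where(is.numeric), list(sum = sum, mean = mean)),
        .groups = "drop"
    )

names(out)
# "State" "Shoes_sum" "Shoes_mean" "myrandvar_sum" "myrandvar_mean"
```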