Grouping and summarizing can get tricky. Two things that have
helped me:
1. call group_by with the dplyr prefix: dplyr::group_by
2. remember to ungroup!
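To see why the second tip matters, here is a minimal sketch. The toy data frame is re-typed from the Shoes and State columns printed further down, and the names toy, grouped, still_grouped and ungrouped are made up for this illustration. A data frame that is still grouped keeps computing per group, which is easy to miss.

```r
library(dplyr)

# Toy data (hypothetical name), re-typed from the Shoes and State columns in this post
toy <- data.frame(
  State = rep(c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"), each = 4),
  Shoes = c(92, 90, 93, 94, 115, 117, 114, 116, 185, 187, 188, 186)
)

grouped <- toy %>%
  dplyr::group_by(State) %>%
  mutate(state_mean = mean(Shoes))    # mean within each state

# Without ungroup(): this "overall" mean is silently computed per state
still_grouped <- grouped %>%
  mutate(overall_mean = mean(Shoes))

# With ungroup(): the mean is taken over all 12 rows
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(overall_mean = mean(Shoes))

is_grouped_df(still_grouped)  # TRUE
is_grouped_df(ungrouped)      # FALSE
```

In still_grouped the overall_mean column just repeats state_mean, which is probably not what was intended.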
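library(dplyr)  # provides %>%, mutate(), group_by(), distinct_at(), summarise()
fakeiq <- read.csv("data/state_fake_iq.csv")
# add a random normal variable (mean 10, sd 2), one draw for each of the 12 rows
fakeiq <- fakeiq %>%
  mutate(myrandvar = rnorm(12, 10, 2))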
print(fakeiq)
## Shoes State City Person Height Day myrandvar
## 1 92 GEORGIA Atlanta A Tall M 10.082926
## 2 90 GEORGIA Atlanta A Short T 10.138773
## 3 93 GEORGIA Atlanta B Tall M 7.268743
## 4 94 GEORGIA Atlanta B Short T 10.997517
## 5 115 CALIFORNIA Davis C Tall M 8.602848
## 6 117 CALIFORNIA Davis C Short T 9.846509
## 7 114 CALIFORNIA Davis D Tall M 9.078669
## 8 116 CALIFORNIA Davis D Short T 12.269994
## 9 185 PENNSYLVANIA Philadelphia E Tall M 4.951502
## 10 187 PENNSYLVANIA Philadelphia E Short T 10.132920
## 11 188 PENNSYLVANIA Philadelphia F Tall M 10.426704
## 12 186 PENNSYLVANIA Philadelphia F Short T 14.260135
Use lapply to convert several columns to factors in one step
myfacs <- 2:6  # column positions, State through Day
fakeiq[, myfacs] <- lapply(fakeiq[, myfacs], factor)
str(fakeiq)
## 'data.frame': 12 obs. of 7 variables:
## $ Shoes : int 92 90 93 94 115 117 114 116 185 187 ...
## $ State : Factor w/ 3 levels "CALIFORNIA","GEORGIA",..: 2 2 2 2 1 1 1 1 3 3 ...
## $ City : Factor w/ 3 levels "Atlanta","Davis",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Person : Factor w/ 6 levels "A","B","C","D",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ Height : Factor w/ 2 levels "Short","Tall": 2 1 2 1 2 1 2 1 2 1 ...
## $ Day : Factor w/ 2 levels "M","T": 1 2 1 2 1 2 1 2 1 2 ...
## $ myrandvar: num 10.08 10.14 7.27 11 8.6 ...
print(fakeiq)
## Shoes State City Person Height Day myrandvar
## 1 92 GEORGIA Atlanta A Tall M 10.082926
## 2 90 GEORGIA Atlanta A Short T 10.138773
## 3 93 GEORGIA Atlanta B Tall M 7.268743
## 4 94 GEORGIA Atlanta B Short T 10.997517
## 5 115 CALIFORNIA Davis C Tall M 8.602848
## 6 117 CALIFORNIA Davis C Short T 9.846509
## 7 114 CALIFORNIA Davis D Tall M 9.078669
## 8 116 CALIFORNIA Davis D Short T 12.269994
## 9 185 PENNSYLVANIA Philadelphia E Tall M 4.951502
## 10 187 PENNSYLVANIA Philadelphia E Short T 10.132920
## 11 188 PENNSYLVANIA Philadelphia F Tall M 10.426704
## 12 186 PENNSYLVANIA Philadelphia F Short T 14.260135
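As an alternative to indexing by position, the same conversion can be sketched inside a pipe with mutate() and across(). This is only an illustration on a small stand-in data frame (df is a made-up name and holds just a few of the rows above). It assumes the columns to convert are exactly the character ones, which is what read.csv returns by default in R 4.0 and later.

```r
library(dplyr)

# Small stand-in for fakeiq (hypothetical); values copied from the printout above
df <- data.frame(
  Shoes  = c(92, 90, 115, 117),
  State  = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
  Person = c("A", "A", "C", "C"),
  Day    = c("M", "T", "M", "T")
)

# Convert every character column to a factor; numeric columns are left alone
df <- df %>%
  mutate(across(where(is.character), factor))

str(df)
```

Selecting by type avoids hard-coding column positions, at the cost of converting any character column you might have wanted to keep as text.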
distinct_at() keeps the first row for each distinct value (or combination of values) of the variables you name and drops the rest. Let's keep the first row for each state
try1 <- distinct_at(fakeiq, vars(State), .keep_all = TRUE) %>%
  data.frame()
try1
Shoes | State | City | Person | Height | Day | myrandvar |
---|---|---|---|---|---|---|
92 | GEORGIA | Atlanta | A | Tall | M | 10.08 |
115 | CALIFORNIA | Davis | C | Tall | M | 8.60 |
185 | PENNSYLVANIA | Philadelphia | E | Tall | M | 4.95 |
The same works for combinations of variables. Let's keep the first row for each distinct combination of State and Person: every later row with the same State and Person is dropped, and all other columns are retained
try2 <- distinct_at(fakeiq, vars(State, Person), .keep_all = TRUE) %>%
  data.frame()
try2
Shoes | State | City | Person | Height | Day | myrandvar |
---|---|---|---|---|---|---|
92 | GEORGIA | Atlanta | A | Tall | M | 10.08 |
93 | GEORGIA | Atlanta | B | Tall | M | 7.27 |
115 | CALIFORNIA | Davis | C | Tall | M | 8.60 |
114 | CALIFORNIA | Davis | D | Tall | M | 9.08 |
185 | PENNSYLVANIA | Philadelphia | E | Tall | M | 4.95 |
188 | PENNSYLVANIA | Philadelphia | F | Tall | M | 10.43 |
Change .keep_all to FALSE and every column other than the ones specified is dropped
try3 <- distinct_at(fakeiq, vars(State, City), .keep_all = FALSE) %>%
  data.frame()
try3
State | City |
---|---|
GEORGIA | Atlanta |
CALIFORNIA | Davis |
PENNSYLVANIA | Philadelphia |
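In recent dplyr releases the scoped verbs such as distinct_at() are marked as superseded, and plain distinct() accepts the column names directly. A minimal sketch of the three calls above in that form follows. The toy data frame and the result names are made up for this example, with values copied from the first rows of fakeiq.

```r
library(dplyr)

# Toy subset (hypothetical name), values copied from the fakeiq printout
toy <- data.frame(
  Shoes  = c(92, 90, 93, 94, 115, 117),
  State  = c("GEORGIA", "GEORGIA", "GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
  Person = c("A", "A", "B", "B", "C", "C")
)

# First row per state, keeping all columns
first_per_state <- distinct(toy, State, .keep_all = TRUE)

# First row per State/Person combination, keeping all columns
first_per_pair <- distinct(toy, State, Person, .keep_all = TRUE)

# Only the distinct State/Person combinations (.keep_all defaults to FALSE)
pairs_only <- distinct(toy, State, Person)
```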
Now let's compute a summary statistic (the mean) of Shoes by State, then drop the grouping with ungroup()
# summarize_at() takes the variable to summarize as a quoted name
try4 <- fakeiq %>%
  dplyr::group_by(State) %>%
  summarize_at("Shoes", mean, na.rm = TRUE) %>%
  ungroup()
try4
State | Shoes |
---|---|
CALIFORNIA | 115.50 |
GEORGIA | 92.25 |
PENNSYLVANIA | 186.50 |
The same summary, grouped by both State and Person
try5 <- fakeiq %>%
  dplyr::group_by(State, Person) %>%
  summarize_at("Shoes", mean, na.rm = TRUE) %>%
  ungroup()
try5
State | Person | Shoes |
---|---|---|
CALIFORNIA | C | 116.0 |
CALIFORNIA | D | 115.0 |
GEORGIA | A | 91.0 |
GEORGIA | B | 93.5 |
PENNSYLVANIA | E | 186.0 |
PENNSYLVANIA | F | 187.0 |
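The group-then-ungroup pattern can also be written with summarise() and its .groups argument (available since dplyr 1.0.0), which drops the grouping in the same call. This is a sketch on re-typed toy data, not a replacement for the code above. The names toy and by_person are made up for the example.

```r
library(dplyr)

# Toy data (hypothetical name), re-typed from the Shoes, State and Person columns
toy <- data.frame(
  Shoes  = c(92, 90, 93, 94, 115, 117, 114, 116, 185, 187, 188, 186),
  State  = rep(c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"), each = 4),
  Person = rep(c("A", "B", "C", "D", "E", "F"), each = 2)
)

by_person <- toy %>%
  dplyr::group_by(State, Person) %>%
  # .groups = "drop" returns an ungrouped result, so no separate ungroup() is needed
  summarise(Shoes = mean(Shoes, na.rm = TRUE), .groups = "drop")

by_person
```

Here .groups = "drop" does the job of the trailing ungroup(), and the values should match the try5 table.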
Compute the sum of every numeric variable within each state
try6 <- fakeiq %>%
  dplyr::group_by(State) %>%
  summarise(across(where(is.numeric), sum)) %>%
  ungroup()
try6
State | Shoes | myrandvar |
---|---|---|
CALIFORNIA | 462 | 39.80 |
GEORGIA | 369 | 38.49 |
PENNSYLVANIA | 746 | 39.77 |
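across() also accepts a named list of functions, which gives several summaries per column in one pass. The sketch below is an illustration only. The toy data frame and the name multi are made up, and the output columns follow the default across() naming of column name, underscore, function name.

```r
library(dplyr)

# Toy data (hypothetical name), re-typed from the Shoes and State columns
toy <- data.frame(
  Shoes = c(92, 90, 93, 94, 115, 117, 114, 116, 185, 187, 188, 186),
  State = rep(c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"), each = 4)
)

multi <- toy %>%
  dplyr::group_by(State) %>%
  # one output column per (variable, function) pair: Shoes_mean, Shoes_sum
  summarise(across(where(is.numeric), list(mean = mean, sum = sum))) %>%
  ungroup()

multi
```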