Start with a small dataset

Grouping and summarizing can get tricky. Two habits that have helped me:
1. call group_by with the dplyr namespace spelled out – dplyr::group_by
2. remember to ungroup!
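Here is a minimal sketch of why the second habit matters. The `toy` data frame and its column names are made up for illustration and are not part of the dataset used below. The grouping set by group_by stays attached to the data, so a later mutate still works within groups until ungroup is called.

```r
library(dplyr)

# Hypothetical toy data, not the tutorial's dataset
toy <- data.frame(
  grp = c("a", "a", "b", "b"),
  val = c(1, 2, 10, 20)
)

# Grouping persists after mutate(), so later steps still run per group
still_grouped <- toy %>%
  dplyr::group_by(grp) %>%
  mutate(grp_mean = mean(val))

# mean(val) is computed within each group here, not over all four rows
per_group <- still_grouped %>%
  mutate(overall_mean = mean(val))

# After ungroup(), mean(val) uses the whole column
whole_table <- still_grouped %>%
  ungroup() %>%
  mutate(overall_mean = mean(val))

per_group$overall_mean    # 1.5 1.5 15 15
whole_table$overall_mean  # 8.25 8.25 8.25 8.25
```

The two results differ only because one pipeline forgot to ungroup, which is the kind of silent error that is easy to miss.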

library(dplyr)  # provides %>% and mutate()

fakeiq <- read.csv("data/state_fake_iq.csv")
fakeiq <- fakeiq %>%
    mutate(myrandvar = rnorm(n(), 10, 2))  # one random draw per row (mean 10, sd 2)
print(fakeiq)

##    Shoes        State         City Person Height Day myrandvar
## 1     92      GEORGIA      Atlanta      A   Tall   M 10.082926
## 2     90      GEORGIA      Atlanta      A  Short   T 10.138773
## 3     93      GEORGIA      Atlanta      B   Tall   M  7.268743
## 4     94      GEORGIA      Atlanta      B  Short   T 10.997517
## 5    115   CALIFORNIA        Davis      C   Tall   M  8.602848
## 6    117   CALIFORNIA        Davis      C  Short   T  9.846509
## 7    114   CALIFORNIA        Davis      D   Tall   M  9.078669
## 8    116   CALIFORNIA        Davis      D  Short   T 12.269994
## 9    185 PENNSYLVANIA Philadelphia      E   Tall   M  4.951502
## 10   187 PENNSYLVANIA Philadelphia      E  Short   T 10.132920
## 11   188 PENNSYLVANIA Philadelphia      F   Tall   M 10.426704
## 12   186 PENNSYLVANIA Philadelphia      F  Short   T 14.260135
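If the CSV file is not at hand, the table printed above can be rebuilt by hand. This is only a sketch: the name `fakeiq_manual` and the seed value are my own assumptions, and the myrandvar column will not match the printed values because the original seed is unknown.

```r
library(dplyr)

set.seed(2024)  # assumed seed, only so reruns give the same random column

# Columns typed in from the printed table; fakeiq_manual is a hypothetical name
fakeiq_manual <- data.frame(
  Shoes  = as.integer(c(92, 90, 93, 94, 115, 117, 114, 116, 185, 187, 188, 186)),
  State  = rep(c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"), each = 4),
  City   = rep(c("Atlanta", "Davis", "Philadelphia"), each = 4),
  Person = rep(c("A", "B", "C", "D", "E", "F"), each = 2),
  Height = rep(c("Tall", "Short"), times = 6),
  Day    = rep(c("M", "T"), times = 6)
) %>%
  mutate(myrandvar = rnorm(n(), mean = 10, sd = 2))  # differs from the printed values

print(fakeiq_manual)
```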

Convert multiple columns to factor

Use lapply to convert several columns to factors in one step

myfacs <- 2:6  # column positions of State, City, Person, Height, Day
fakeiq[, myfacs] <- lapply(fakeiq[, myfacs], factor)  # apply factor() to each selected column
str(fakeiq)

## 'data.frame':    12 obs. of  7 variables:
##  $ Shoes    : int  92 90 93 94 115 117 114 116 185 187 ...
##  $ State    : Factor w/ 3 levels "CALIFORNIA","GEORGIA",..: 2 2 2 2 1 1 1 1 3 3 ...
##  $ City     : Factor w/ 3 levels "Atlanta","Davis",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ Person   : Factor w/ 6 levels "A","B","C","D",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Height   : Factor w/ 2 levels "Short","Tall": 2 1 2 1 2 1 2 1 2 1 ...
##  $ Day      : Factor w/ 2 levels "M","T": 1 2 1 2 1 2 1 2 1 2 ...
##  $ myrandvar: num  10.08 10.14 7.27 11 8.6 ...

print(fakeiq)

##    Shoes        State         City Person Height Day myrandvar
## 1     92      GEORGIA      Atlanta      A   Tall   M 10.082926
## 2     90      GEORGIA      Atlanta      A  Short   T 10.138773
## 3     93      GEORGIA      Atlanta      B   Tall   M  7.268743
## 4     94      GEORGIA      Atlanta      B  Short   T 10.997517
## 5    115   CALIFORNIA        Davis      C   Tall   M  8.602848
## 6    117   CALIFORNIA        Davis      C  Short   T  9.846509
## 7    114   CALIFORNIA        Davis      D   Tall   M  9.078669
## 8    116   CALIFORNIA        Davis      D  Short   T 12.269994
## 9    185 PENNSYLVANIA Philadelphia      E   Tall   M  4.951502
## 10   187 PENNSYLVANIA Philadelphia      E  Short   T 10.132920
## 11   188 PENNSYLVANIA Philadelphia      F   Tall   M 10.426704
## 12   186 PENNSYLVANIA Philadelphia      F  Short   T 14.260135
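The same conversion can also be written in dplyr with mutate and across, which selects columns by name rather than by position. This is a sketch on a made-up three-row frame (`toy` is hypothetical), not a replacement for the lapply approach above.

```r
library(dplyr)

# Hypothetical toy frame standing in for fakeiq
toy <- data.frame(
  Shoes  = c(92L, 115L, 185L),
  State  = c("GEORGIA", "CALIFORNIA", "PENNSYLVANIA"),
  Person = c("A", "C", "E"),
  stringsAsFactors = FALSE
)

# Convert every column from State through Person to a factor
toy_fac <- toy %>%
  mutate(across(State:Person, factor))

str(toy_fac)
```

Selecting by name keeps working if columns are later reordered, which a positional index such as 2:6 would not.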

Distinct_at

One variable distinct_at

Takes the first instance of a given combination, drops others. Let’s keep the first row for each State

try1 <- distinct_at(fakeiq, vars(State), .keep_all = TRUE) %>%
    data.frame()
try1

Shoes	State	City	Person	Height	Day	myrandvar
92	GEORGIA	Atlanta	A	Tall	M	10.08
115	CALIFORNIA	Davis	C	Tall	M	8.60
185	PENNSYLVANIA	Philadelphia	E	Tall	M	4.95
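Because the first row per value is the one that survives, row order decides which row is kept. A small sketch with made-up data (`toy` is hypothetical) shows how sorting beforehand changes the result.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  State = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
  Shoes = c(92, 94, 117, 114)
)

# Default order: the first row encountered for each State is kept
first_seen <- distinct_at(toy, vars(State), .keep_all = TRUE)

# Sort by Shoes (descending) first, so the kept row is the largest per State
largest <- toy %>%
  arrange(desc(Shoes)) %>%
  distinct_at(vars(State), .keep_all = TRUE)

first_seen$Shoes  # 92 117
largest$Shoes     # 117 94
```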

Two variables distinct_at

Takes the first instance of a given combination, drops others. Let’s keep the first row for each distinct combination of State and Person. Later rows that repeat a State-Person combination are dropped, and all other columns are retained

try2 <- distinct_at(fakeiq, vars(State, Person), .keep_all = TRUE) %>%
    data.frame()
try2

Shoes	State	City	Person	Height	Day	myrandvar
92	GEORGIA	Atlanta	A	Tall	M	10.08
93	GEORGIA	Atlanta	B	Tall	M	7.27
115	CALIFORNIA	Davis	C	Tall	M	8.60
114	CALIFORNIA	Davis	D	Tall	M	9.08
185	PENNSYLVANIA	Philadelphia	E	Tall	M	4.95
188	PENNSYLVANIA	Philadelphia	F	Tall	M	10.43

drop groups distinct_at

Change .keep_all to FALSE and you drop every column other than the two specified

try3 <- distinct_at(fakeiq, vars(State, City), .keep_all = FALSE) %>%
    data.frame()
try3

State	City
GEORGIA	Atlanta
CALIFORNIA	Davis
PENNSYLVANIA	Philadelphia
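In dplyr 1.0.0 and later the scoped verbs such as distinct_at are marked as superseded, and plain distinct accepts the column names directly. The sketch below uses a made-up frame (`toy` is hypothetical) to show both settings of .keep_all.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  State = c("GEORGIA", "GEORGIA", "CALIFORNIA"),
  City  = c("Atlanta", "Atlanta", "Davis"),
  Shoes = c(92, 90, 115)
)

# Keep the first row per State-City pair along with every other column
keep_cols <- distinct(toy, State, City, .keep_all = TRUE)

# Default (.keep_all = FALSE): only the State and City columns are returned
drop_cols <- distinct(toy, State, City)

dim(keep_cols)  # 2 3
dim(drop_cols)  # 2 2
```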

Group_By, Summarize_at

1-variable, summarize to one row per level

Now let’s compute a summary statistic (the mean of “Shoes”) for each State, then drop the grouping

# var to summarize in quotes
try4 <- fakeiq %>%
    dplyr::group_by(State) %>%
    summarize_at("Shoes", mean, na.rm = TRUE) %>%
    ungroup()
try4

State	Shoes
CALIFORNIA	115.50
GEORGIA	92.25
PENNSYLVANIA	186.50

2-variable, summarize to one row per combination

try5 <- fakeiq %>%
    dplyr::group_by(State, Person) %>%
    summarize_at("Shoes", mean, na.rm = TRUE) %>%
    ungroup()
try5

State	Person	Shoes
CALIFORNIA	C	116.0
CALIFORNIA	D	115.0
GEORGIA	A	91.0
GEORGIA	B	93.5
PENNSYLVANIA	E	186.0
PENNSYLVANIA	F	187.0
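With two grouping variables, summarise only peels off the last one by default, so the result is still grouped by State unless it is ungrouped. As a sketch on made-up data (`toy` is hypothetical), the .groups argument of summarise can drop the grouping in the same call instead of a separate ungroup step.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  State  = c("GEORGIA", "GEORGIA", "GEORGIA", "GEORGIA"),
  Person = c("A", "A", "B", "B"),
  Shoes  = c(92, 90, 93, 94)
)

# Default behaviour written out: only the last grouping variable is dropped
left_grouped <- toy %>%
  dplyr::group_by(State, Person) %>%
  summarise(Shoes = mean(Shoes, na.rm = TRUE), .groups = "drop_last")

# .groups = "drop" returns an ungrouped result in one step
by_person <- toy %>%
  dplyr::group_by(State, Person) %>%
  summarise(Shoes = mean(Shoes, na.rm = TRUE), .groups = "drop")

group_vars(left_grouped)  # "State"
by_person$Shoes           # 91.0 93.5
```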

Group_By and Summarize_across

all numeric vars

Compute the sum of every numeric variable within each State

try6 <- fakeiq %>%
    dplyr::group_by(State) %>%
    summarise(across(where(is.numeric), sum)) %>%
    ungroup()
try6

State	Shoes	myrandvar
CALIFORNIA	462	39.80
GEORGIA	369	38.49
PENNSYLVANIA	746	39.77
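across also accepts a named list of functions, which gives one output column per variable and function. The sketch below uses made-up data (`toy` and its Score column are hypothetical) and relies on the default naming scheme of variable name, underscore, function name.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  State = c("GEORGIA", "GEORGIA", "CALIFORNIA", "CALIFORNIA"),
  Shoes = c(92, 90, 115, 117),
  Score = c(10, 12, 8, 10)
)

# Sum and mean of every numeric column, one row per State
multi <- toy %>%
  dplyr::group_by(State) %>%
  summarise(
    across(where(is.numeric), list(sum = sum, mean = mean)),
    .groups = "drop"
  )

names(multi)  # "State" "Shoes_sum" "Shoes_mean" "Score_sum" "Score_mean"
multi
```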

Group_By, Summarize, Distinct Data

Jamie Reilly, Ph.D.

June 2024