1 Libraries and whatnot
2 Summary Statistics
3 Coarse data visualization
4 Summary statistics by levels of a grouping variable
- 4.1 describeBy from the Psych package
- 4.2 Summary statistics using group_by

1 Libraries and whatnot

Our goal here is to explore data visualization, descruptive statistics, and assumption checks. Here are the libraries and whatnot that we’ll be using

library(reshape2)
library(tidyverse)
library(stats)
library(psych)
library(tibble)
library(mvoutlier)
library(gvlma)
library(tibble)
jamie.theme <- theme_bw() + theme(axis.line = element_line(colour = "black"), 
    panel.grid.minor = element_blank(), panel.grid.major = element_blank(), 
    panel.border = element_blank(), panel.background = element_blank(), legend.title = element_blank())

2 Summary Statistics

Generating basic descriptive statistics from the mtcars dataset from base R. Inspect the dataset

head(mtcars, n = 15)  # List the data in 'mtcars' dataset

##                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4          21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag      21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710         22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive     21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant            18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360         14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D          24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230           22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280           19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C          17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE         16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL         17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC        15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

2.1 Interquartile range

Show the interquartile range for a particular variable, mean and SD

summary(mtcars$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

2.2 mean

mean(mtcars$mpg)  # Calculate the average mpg

## [1] 20.09062

2.3 standard deviation

sd(mtcars$mpg)

## [1] 6.026948

2.4 range

range(mtcars$mpg)

## [1] 10.4 33.9

3 Coarse data visualization

Nothing fancy here. Just brute force eyeballing of trends

3.1 Scatterplot

Let’s say, for example, we are interested in MPG as a function of total number of carborators. We could treat carborators as a continuous variable and plot them.

plot(mtcars$carb ~ mtcars$mpg, col = "red")

3.2 Run a simple OLS regression, take beta weights and annotate scatterplot

hmmm… maybe we want to run a quick regression and plot the slope and intercept (MPG as a linear function of HP)

maybe <- lm(mtcars$mpg ~ mtcars$hp)
summary(maybe)

## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## mtcars$hp   -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

plot(maybe)

3.3 boxplot

or we could convert carborators to a factor variable and get summary statistics on mpg by group. Sometimes it’s useful to look at a boxplot first

boxplot(mtcars$mpg ~ mtcars$carb, col = "red")

4 Summary statistics by levels of a grouping variable

4.1 describeBy from the Psych package

Use the Describe_By function from the Psych package. This particular function is of the form describeBy(dat, group) where dat is the dataframe and group is the grouping variable. In this instance, if you just feed in a dataframe name, describeBy will yield descriptives for all the variables in the dataframe. Sometimes this doesn’t make sense (e.g., when you have a nominal predictor)

mtcars$vs <- as.factor(mtcars$vs)
describeBy(mtcars, mtcars$vs)

## 
##  Descriptive statistics by group 
## group: 0
##      vars  n   mean     sd median trimmed   mad    min    max  range  skew
## mpg     1 18  16.62   3.86  15.65   16.42  2.97  10.40  26.00  15.60  0.48
## cyl     2 18   7.44   1.15   8.00    7.62  0.00   4.00   8.00   4.00 -1.74
## disp    3 18 307.15 106.77 311.00  308.52 72.65 120.30 472.00 351.70 -0.26
## hp      4 18 189.72  60.28 180.00  186.81 48.18  91.00 335.00 244.00  0.45
## drat    5 18   3.39   0.47   3.18    3.37  0.32   2.76   4.43   1.67  0.74
## wt      6 18   3.69   0.90   3.57    3.68  0.50   2.14   5.42   3.28  0.54
## qsec    7 18  16.69   1.09  17.02   16.75  0.85  14.50  18.00   3.50 -0.71
## vs*     8 18   1.00   0.00   1.00    1.00  0.00   1.00   1.00   0.00   NaN
## am      9 18   0.33   0.49   0.00    0.31  0.00   0.00   1.00   1.00  0.65
## gear   10 18   3.56   0.86   3.00    3.50  0.00   3.00   5.00   2.00  0.90
## carb   11 18   3.61   1.54   4.00    3.44  1.48   2.00   8.00   6.00  1.17
##      kurtosis    se
## mpg     -0.05  0.91
## cyl      1.94  0.27
## disp    -1.06 25.16
## hp      -0.15 14.21
## drat    -0.73  0.11
## wt      -0.43  0.21
## qsec    -0.80  0.26
## vs*       NaN  0.00
## am      -1.66  0.11
## gear    -1.07  0.20
## carb     1.33  0.36
## -------------------------------------------------------- 
## group: 1
##      vars  n   mean    sd median trimmed   mad   min    max  range  skew
## mpg     1 14  24.56  5.38  22.80   24.34  6.00 17.80  33.90  16.10  0.41
## cyl     2 14   4.57  0.94   4.00    4.50  0.00  4.00   6.00   2.00  0.85
## disp    3 14 132.46 56.89 120.55  127.11 61.82 71.10 258.00 186.90  0.80
## hp      4 14  91.36 24.42  96.00   92.00 32.62 52.00 123.00  71.00 -0.24
## drat    5 14   3.86  0.51   3.92    3.86  0.26  2.76   4.93   2.17 -0.28
## wt      6 14   2.61  0.72   2.62    2.63  0.95  1.51   3.46   1.95 -0.17
## qsec    7 14  19.33  1.35  19.17   19.24  1.02 16.90  22.90   6.00  0.86
## vs*     8 14   2.00  0.00   2.00    2.00  0.00  2.00   2.00   0.00   NaN
## am      9 14   0.50  0.52   0.50    0.50  0.74  0.00   1.00   1.00  0.00
## gear   10 14   3.86  0.53   4.00    3.83  0.00  3.00   5.00   2.00 -0.17
## carb   11 14   1.79  1.05   1.50    1.67  0.74  1.00   4.00   3.00  1.13
##      kurtosis    se
## mpg     -1.40  1.44
## cyl     -1.36  0.25
## disp    -0.49 15.21
## hp      -1.61  6.53
## drat     0.46  0.14
## wt      -1.68  0.19
## qsec     1.25  0.36
## vs*       NaN  0.00
## am      -2.14  0.14
## gear    -0.09  0.14
## carb    -0.03  0.28

###Use describeBy to generate stats for levsls of one variable Here’s how to use describeBy to get descriptives for just one variable

describeBy(mtcars$mpg, mtcars$carb)  #just constrain what you want by using the $ character

## 
##  Descriptive statistics by group 
## group: 1
##    vars n  mean sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 7 25.34  6   22.8   25.34 6.67 18.1 33.9  15.8 0.31    -1.78 2.27
## -------------------------------------------------------- 
## group: 2
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis
## X1    1 10 22.4 5.47   22.1    22.3 5.41 15.2 30.4  15.2 0.18    -1.48
##      se
## X1 1.73
## -------------------------------------------------------- 
## group: 3
##    vars n mean   sd median trimmed  mad  min  max range  skew kurtosis
## X1    1 3 16.3 1.05   16.4    16.3 1.33 15.2 17.3   2.1 -0.09    -2.33
##      se
## X1 0.61
## -------------------------------------------------------- 
## group: 4
##    vars  n  mean   sd median trimmed  mad  min max range skew kurtosis
## X1    1 10 15.79 3.91  15.25   15.81 4.82 10.4  21  10.6    0    -1.56
##      se
## X1 1.24
## -------------------------------------------------------- 
## group: 6
##    vars n mean sd median trimmed mad  min  max range skew kurtosis se
## X1    1 1 19.7 NA   19.7    19.7   0 19.7 19.7     0   NA       NA NA
## -------------------------------------------------------- 
## group: 8
##    vars n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 1   15 NA     15      15   0  15  15     0   NA       NA NA

4.2 Summary statistics using group_by

Sometimes you need to pass summary statistics to a dataframe of their own. For example, if you need to collapse across items within an individual, you would need to “group_by” item and summarize means. Herer’s how to do that with the pipe and the summarize_at function. One of the keys here is that summarize_at takes as input a tibble not a conventional datafram

mt.try <- as.tibble(mtcars)
who.knows <- mt.try %>% group_by(carb) %>% summarize_at(vars(mpg), funs(mean_of_MPG = mean))
who.knows

## # A tibble: 6 x 2
##   carb  mean_of_MPG
##   <fct>       <dbl>
## 1 1            25.3
## 2 2            22.4
## 3 3            16.3
## 4 4            15.8
## 5 6            19.7
## 6 8            15

Testing outliers & Normality

# outliers <- aq.plot(mtcars[c('mpg','disp','hp','drat','wt','qsec')])
# outliers # show list of outliers

# plot(mtcars$wt, which = 2) plot(mtcars$mpg, which = 2)

# mpgbywt<-lm(mtcars$mpg~mtcars$wt) assump <- gvlma(mpgbywt) summary(assump)

Regression assumption checks: Is each individual variable normally distributed? (Check qqplots, skewness, kurtosis) To assume linearity, you need to check the link function, skewness, kurtosis and heteroscadacity of the relations between the two (+) variables you’re regressing.

Descriptive Statistics, Checking Assumptions

Jamie Reilly, Ph.D.

2019-02-10