Our goal here is to explore data visualization, descruptive statistics, and assumption checks. Here are the libraries and whatnot that we’ll be using
library(reshape2)
library(tidyverse)
library(stats)
library(psych)
library(tibble)
library(mvoutlier)
library(gvlma)
library(tibble)
jamie.theme <- theme_bw() + theme(axis.line = element_line(colour = "black"),
panel.grid.minor = element_blank(), panel.grid.major = element_blank(),
panel.border = element_blank(), panel.background = element_blank(), legend.title = element_blank())
Generating basic descriptive statistics from the mtcars dataset from base R. Inspect the dataset
head(mtcars, n = 15) # List the data in 'mtcars' dataset
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Show the interquartile range for a particular variable, mean and SD
summary(mtcars$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
mean(mtcars$mpg) # Calculate the average mpg
## [1] 20.09062
sd(mtcars$mpg)
## [1] 6.026948
range(mtcars$mpg)
## [1] 10.4 33.9
Nothing fancy here. Just brute force eyeballing of trends
Let’s say, for example, we are interested in MPG as a function of total number of carborators. We could treat carborators as a continuous variable and plot them.
plot(mtcars$carb ~ mtcars$mpg, col = "red")
hmmm… maybe we want to run a quick regression and plot the slope and intercept (MPG as a linear function of HP)
maybe <- lm(mtcars$mpg ~ mtcars$hp)
summary(maybe)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## mtcars$hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
plot(maybe)
or we could convert carborators to a factor variable and get summary statistics on mpg by group. Sometimes it’s useful to look at a boxplot first
boxplot(mtcars$mpg ~ mtcars$carb, col = "red")
Use the Describe_By function from the Psych package. This particular function is of the form describeBy(dat, group) where dat is the dataframe and group is the grouping variable. In this instance, if you just feed in a dataframe name, describeBy will yield descriptives for all the variables in the dataframe. Sometimes this doesn’t make sense (e.g., when you have a nominal predictor)
mtcars$vs <- as.factor(mtcars$vs)
describeBy(mtcars, mtcars$vs)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew
## mpg 1 18 16.62 3.86 15.65 16.42 2.97 10.40 26.00 15.60 0.48
## cyl 2 18 7.44 1.15 8.00 7.62 0.00 4.00 8.00 4.00 -1.74
## disp 3 18 307.15 106.77 311.00 308.52 72.65 120.30 472.00 351.70 -0.26
## hp 4 18 189.72 60.28 180.00 186.81 48.18 91.00 335.00 244.00 0.45
## drat 5 18 3.39 0.47 3.18 3.37 0.32 2.76 4.43 1.67 0.74
## wt 6 18 3.69 0.90 3.57 3.68 0.50 2.14 5.42 3.28 0.54
## qsec 7 18 16.69 1.09 17.02 16.75 0.85 14.50 18.00 3.50 -0.71
## vs* 8 18 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.00 NaN
## am 9 18 0.33 0.49 0.00 0.31 0.00 0.00 1.00 1.00 0.65
## gear 10 18 3.56 0.86 3.00 3.50 0.00 3.00 5.00 2.00 0.90
## carb 11 18 3.61 1.54 4.00 3.44 1.48 2.00 8.00 6.00 1.17
## kurtosis se
## mpg -0.05 0.91
## cyl 1.94 0.27
## disp -1.06 25.16
## hp -0.15 14.21
## drat -0.73 0.11
## wt -0.43 0.21
## qsec -0.80 0.26
## vs* NaN 0.00
## am -1.66 0.11
## gear -1.07 0.20
## carb 1.33 0.36
## --------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew
## mpg 1 14 24.56 5.38 22.80 24.34 6.00 17.80 33.90 16.10 0.41
## cyl 2 14 4.57 0.94 4.00 4.50 0.00 4.00 6.00 2.00 0.85
## disp 3 14 132.46 56.89 120.55 127.11 61.82 71.10 258.00 186.90 0.80
## hp 4 14 91.36 24.42 96.00 92.00 32.62 52.00 123.00 71.00 -0.24
## drat 5 14 3.86 0.51 3.92 3.86 0.26 2.76 4.93 2.17 -0.28
## wt 6 14 2.61 0.72 2.62 2.63 0.95 1.51 3.46 1.95 -0.17
## qsec 7 14 19.33 1.35 19.17 19.24 1.02 16.90 22.90 6.00 0.86
## vs* 8 14 2.00 0.00 2.00 2.00 0.00 2.00 2.00 0.00 NaN
## am 9 14 0.50 0.52 0.50 0.50 0.74 0.00 1.00 1.00 0.00
## gear 10 14 3.86 0.53 4.00 3.83 0.00 3.00 5.00 2.00 -0.17
## carb 11 14 1.79 1.05 1.50 1.67 0.74 1.00 4.00 3.00 1.13
## kurtosis se
## mpg -1.40 1.44
## cyl -1.36 0.25
## disp -0.49 15.21
## hp -1.61 6.53
## drat 0.46 0.14
## wt -1.68 0.19
## qsec 1.25 0.36
## vs* NaN 0.00
## am -2.14 0.14
## gear -0.09 0.14
## carb -0.03 0.28
###Use describeBy to generate stats for levsls of one variable Here’s how to use describeBy to get descriptives for just one variable
describeBy(mtcars$mpg, mtcars$carb) #just constrain what you want by using the $ character
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 7 25.34 6 22.8 25.34 6.67 18.1 33.9 15.8 0.31 -1.78 2.27
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 22.4 5.47 22.1 22.3 5.41 15.2 30.4 15.2 0.18 -1.48
## se
## X1 1.73
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 3 16.3 1.05 16.4 16.3 1.33 15.2 17.3 2.1 -0.09 -2.33
## se
## X1 0.61
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 10 15.79 3.91 15.25 15.81 4.82 10.4 21 10.6 0 -1.56
## se
## X1 1.24
## --------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1 19.7 NA 19.7 19.7 0 19.7 19.7 0 NA NA NA
## --------------------------------------------------------
## group: 8
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1 15 NA 15 15 0 15 15 0 NA NA NA
Sometimes you need to pass summary statistics to a dataframe of their own. For example, if you need to collapse across items within an individual, you would need to “group_by” item and summarize means. Herer’s how to do that with the pipe and the summarize_at function. One of the keys here is that summarize_at takes as input a tibble not a conventional datafram
mt.try <- as.tibble(mtcars)
who.knows <- mt.try %>% group_by(carb) %>% summarize_at(vars(mpg), funs(mean_of_MPG = mean))
who.knows
## # A tibble: 6 x 2
## carb mean_of_MPG
## <fct> <dbl>
## 1 1 25.3
## 2 2 22.4
## 3 3 16.3
## 4 4 15.8
## 5 6 19.7
## 6 8 15
Testing outliers & Normality
# outliers <- aq.plot(mtcars[c('mpg','disp','hp','drat','wt','qsec')])
# outliers # show list of outliers
# plot(mtcars$wt, which = 2) plot(mtcars$mpg, which = 2)
# mpgbywt<-lm(mtcars$mpg~mtcars$wt) assump <- gvlma(mpgbywt) summary(assump)
Regression assumption checks: Is each individual variable normally distributed? (Check qqplots, skewness, kurtosis) To assume linearity, you need to check the link function, skewness, kurtosis and heteroscadacity of the relations between the two (+) variables you’re regressing.