Chapter 4 Exploratory Data Analysis (EDA)
In this chapter, we will start with data analysis. One crucial step when analyzing data is Exploratory Data Analysis. It describes the process of analyzing data sets to summarize their main characteristics. This step can help understanding the data, checking its quality, early detecting patterns and trends and gain first insights!
Since R is a software specifically designed for statistics, we have a lot of high value libraries to perform EDA. The dataset we will use for this part is called palmerpenguins. This is a dataset about penguin species and their attributes.
pacman::p_load("summarytools", "SmartEDA", "skimr",
"naniar", "gtsummary", "dlookr",
"DataExplorer", "psych", "ggplot2",
"palmerpenguins", "dplyr", "tidyr", "corrplot")
penguins <- na.omit(penguins)
penguins_raw <- penguins_raw4.1 Standard Descriptive Statistics
4.1.1 Measures of Central Tendency
As the name suggests, measures of central tendency are helping us to understand the probability distribution of the data, its center and typical values. The three most common measures of central tendency are the arithmetic mean, the median and the mode.
Mode: The most frequent number
Mean: The sum of all values divided by the total number of values
Median: The middle number in an ordered dataset
4.1.1.1 Mode
The mode is probably the easiest measure out of all measures: It is defined as the most frequent number of all observations. We cannot directly calculate the mode, but there is a way to it. First we look at all unique values of or observations with the unique() function, then we count the occurrences of each unique value with tabulate(), and lastly we use the which.max() function to get the most frequent unique value:
uniq_vals <- unique(penguins$bill_length_mm) # Get unique values
freqs <- tabulate(match(penguins$bill_length_mm, uniq_vals)) # Count occurrences
uniq_vals[which.max(freqs)] # Getting the unique value with the most occurrences## [1] 41.1
4.1.1.2 Mean
Let us start by looking at the formula to calculate a mean:
\[ \bar{x} = \frac{\sum{x_i}}{n} \]
whereas:
\(\bar{x}\) is our mean
\(\sum{x_i}\) is the sum of all our observations.
\(n\) is the number of all our observations
We could do that by hand or we just use the built-in mean() function:
## [1] 43.99279
4.1.1.3 Median
Image sorting all your data from the lowest to highest and then pointing at the value, which has exactly 50% of all values to its left and the other 50% to its right, this would be the median value. Well, at least you will point at a value if your distribution has an yeven number of observations. But you can also calculate the value for an uneven number of observations, let us have a look at both formulas:
\(X_{(\frac{n+1}{2})}\) for an even number of n
\(\frac{1}{2}X_{(\frac{n}{2})} + X_{(\frac{n}{2} + 1)}\) for an uneven number of n
Again we could calculate that by hand or we just use the median() function:
## [1] 44.5
4.1.2 Measures of Dispersion
In statistics, measures of dispersion describe the extent to which a distribution of a variable is stretched or squeezed. In other words, they help to gauge the spread of our distributions.
4.1.2.1 Interquartile Range (IQR)
You remember the boxplot from the data visualization chapter? It is supposed to show the so-called interquartile range (IQR). It is defined as the difference between the 75th percentile (or third quartile) and the 25th percentile (or the first quartile). Basically the distribution is spread into four equally big areas, which are separated by three points, the first quartile denoted by \(Q_1\) (also called the lower quartile), the second quartile is the median and denoted as \(Q_2\), and the third quartile is measures of dispersiondenoted by \(Q_3\) (also called the upper quartile), thus the formula is:
\[ IQR = Q_3 - Q_1 \]
In R, we can calculate the Interquartile Range (IQR) using the IQR() function. By hand, we would:
Sort the data in ascending order.
Split the data into four equal parts (quartiles).
Identify Q1 (first quartile, 25th percentile) and Q3 (third quartile, 75th percentile).
Note that in some cases the data cannot be divided into four even parts ddue to their size. In such cases, different statistical methods (e.g. Tukey’s Hinges) approximate quartiles in such cases.
## [1] 9.1
4.1.2.2 Variance
In statistics, the variance is the expected value of the squared deviation from the mean of our random variable. The concept of the mean gets clear if we break down its formula:
\[ s² = \frac{\sum(x_i - \bar{x})²}{n-1} \]
where
the index i runs over the observations (Respondents, Countries,…), i = 1,…,n
\(x_i\) are our observations
\(\bar{x}\) is the mean of our distribution
\(n\) is the number of observations
Especially the nominator of the formula \(\sum(x_i - \bar{x})²\) is quite interesting because it uses an interesting technique:
Image our data is aligned on one dimension and somewhere in the middle there is our mean:
Now, image we would calculate the differences and sum them up without squaring.
You see that distance 1 would be -3 and distance 2 would be 4 an in sum that makes 0,8, but that is for sure not the distance between those two points. Here comes the squared part into action, every squared number is positive, this ensures that -3 becomes 9 and 4 becomes 16, thus the differences can be summed and result in 24.
In R, we can implement this by simply calling the var() function:
## [1] 29.90633
Well we get 29.9 and the problem with variance is, that the squaring technique I showed you earlier leads to a problem: We cannot really interpret the data because it loses its unit, for example if our data is in meter, the variance would be in square meters. But we can solve this problem with the next measure: standard deviation.
4.1.2.3 Standard Deviation
The standard deviation is the square root of the variance. It describes the amount of variation of the values of a variables about its mean. A low standard deviation indicate closeness of the values to the mean, and a high standard deviation indicates that the values are spread out over a wider range.
The standard deviation is the square root of the variance:
\[ s = \sqrt{s²} = \sqrt\frac{\sum(x_i - \bar{x})}{n-1} \]
Remember our example?:
We calculated the variance the distance of the squared sums to be 24 (9 + 16). However, what happens if we take the square root of the squared values before adding them? Well, we get 3 for distance 1 and 4 for distance 2. For distance 2 nothing changed, it came back to its original value 4, but distance 1 has become positive from -3 to 3. Now we can add it together and get the distance 7.
Let us apply it in R, with the function sd():
## [1] 5.468668
The huge advantage of the standard deviation in comparison to the variance is that it can be interpreted in the original unit and is thus a more intuitive measure of dispersion to work with.
4.1.3 Relationships between Variables
It is necessary to understand one key difference in EDA. There are values you simply look at and interpret such as the measures shown before. But on the other hand there are measures which look at the relationship between variables, thus analyzing how they interact which each other. In the following, we will look at two methods to do so.
4.1.3.1 Crosstables / Contingency Tables
##
## Biscoe Dream Torgersen
## Adelie 44 55 47
## Chinstrap 0 68 0
## Gentoo 119 0 0
## Cross-Tabulation, Row Proportions
## species * island
## Data Frame: penguins
##
## ----------- -------- -------------- -------------- ------------ --------------
## island Biscoe Dream Torgersen Total
## species
## Adelie 44 ( 30.1%) 55 ( 37.7%) 47 (32.2%) 146 (100.0%)
## Chinstrap 0 ( 0.0%) 68 (100.0%) 0 ( 0.0%) 68 (100.0%)
## Gentoo 119 (100.0%) 0 ( 0.0%) 0 ( 0.0%) 119 (100.0%)
## Total 163 ( 48.9%) 123 ( 36.9%) 47 (14.1%) 333 (100.0%)
## ----------- -------- -------------- -------------- ------------ --------------
|
island
|
Total | |||
|---|---|---|---|---|
| Biscoe | Dream | Torgersen | ||
| species | ||||
| Adelie | 44 | 55 | 47 | 146 |
| Chinstrap | 0 | 68 | 0 | 68 |
| Gentoo | 119 | 0 | 0 | 119 |
| Total | 163 | 123 | 47 | 333 |
4.1.3.2 Correlation
Correlation is an umbrella term for any statistical relationship, whether causal or not, between two random variables or bivariate data. The three main measures of correlations are named after their inventors: Pearsons Correlation (or simply Pearson’s r), Spearman’s Rank Correlation (or simply Spearman’s rho) and Kendall’s Tau.
Before going on let us make clear what a linear relationship means. It means that if an observation increases or decreases the corresponding variables changes in a proportional and predictable way.
4.1.3.2.1 Pearsons Correlation
Pearsons Correlation measures the linear relationship between two continous variables. To do so it assumes a normal distribution between two variables, a linear relationship and no major outliers.
Normally, you should test those before running a correlation, but we skip that part and look at the implementation in R:
## [1] 0.5894511
The interpretation of pearson’s r is quite straightforward: The result always ranges between +1 and -1, where +1 means a perfect linear relationship, 0 means no relationship at all, -1 means a perfect linear relationship.
In our case there is a strong positive, linear relationship of both variables with a r = 0.59.
4.1.3.2.2 Spearman’s Rank Correlation
Spearman’s Rank Correlation, or simply Spearman’s Rho shows if ordinal or continuous variables have a Monotonic Relationship. This means that if one variables increases, the other always increases or decreases but not necessarily at a constant way, as it would be with a linear relationship.
It has different assumptions like outliers are allowed, the relationship is non-linear, the data contains rank and the variables are not normally distributed.
Let us have a look, how it is calculated in R:
## [1] 0.5764804
The interpretation is analog to Pearson’s R, +1 means a perfect positive, monotonic relationship, 0 means no monotonic relationship and -1 means perfect negative, monotonic relationship.
A spearman’s rho of 0.58 indicates a strong, positive, monotonic relationship between the two variables.
4.1.3.2.3 Kendalls Tau
Kendalls Tau measures the strength and direction of association between two ranked variables. We differentiate between two types of relationships: Concordant and Discordant relationships.
A concordant relationship means both data points move in the same direction
A discordant relationship if one data point increases, the other decreases (or vice versa).
Let us calculate it:
## [1] 0.4277598
Kendalls Tau can be interpreted wiht +1 as a perfect rank agreement, 0 means no association at all, and -1 means a perfect rank reversal.
Our kendall’s tau indicates a high rank agreement between both our variables.
4.1.3.3 Correlation Graphically
The part before was quite theoretical, but there are also nice approaches to look at correlations graphically. we start with a simple scatterplot:
ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
geom_point(color = "#0077b6") +
labs(x = "Length in mm",
y = "Body Mass in g",
title = "Relationship between Length (in mm) and Body Mass (in g)") +
theme_bw()When looking at relationships between two variables, scatterplots are the standard visualization to do so. As we can see, there is a clear trend that the length in mm could be positively related to the body mass in g. In the end of the day, there are three types of linear relationships and the scatterplots always look accordingly:
set.seed(123)
n <- 100
df_cor <- data.frame(
x = rep(1:n, 3),
relationship = rep(c("Positive", "Negative", "None"),
each = n),
y = c(
(1:n) + rnorm(n, sd = 15), # strong positive correlation
(n:1) + rnorm(n, sd = 15), # strong negative correlation
rnorm(n, mean = 50, sd = 20) # no correlation
)
)
# Reorder factor levels
df_cor$relationship <- factor(df_cor$relationship,
levels = c("Positive",
"None", "Negative"),
labels = c("Positive",
"No Correlation",
"Negative"))
# Plot
ggplot(df_cor, aes(x = x, y = y)) +
geom_point(color = "steelblue", size = 2) +
facet_wrap(~relationship, nrow = 1) +
labs(title = "Strong Positive, Negative, and No Correlation",
x = "X", y = "Y") +
theme_bw(base_size = 18)With scatterplots in combination with facet_wrap() you can show several correlations graphically, but there is a way to calculate the correlation coefficient and to show it graphically with a correlation plot. A correlation plot combines the logic of contingency tables, heat maps and the correlation coefficients.
First a correlation matrix is created. A table that shows the pairwise correlation coefficients (typically Pearson) between several numerical variables. Each cell in the matrix represents the strength and direction of the linear relationship between two variables. A correlation plot is then a visual representation of this matrix, often using color gradients or circle sizes to show the strength and direction of correlations, making it easier to spot patterns.
Let us compute it in R: We will use the corrplot package in R (There are other ways to compute it, which I will show later).
First, we cut down our dataset to the variables you want to correlate with each other
Second, compute a correlation matrix with the cor() command
Third, call the corrplot variable, and take in the dataset, define the method (we will use the color to display the strength of the correlation), the type (
'full'(default),'upper'or'lower', display full matrix, lower triangular or upper triangular matrix).
# Step 1: Prepare numeric data
penguins_numeric <- penguins %>%
select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
drop_na()
# Step 2: Compute correlation matrix
corr_matrix <- cor(penguins_numeric)
# Step 3: Plot the correlation matrix
corrplot(corr_matrix, method = "color")Now, we can see in an elegant way the correlation between the four selected variables, which are displayed through the color. Every cell displays the correlation coefficient of the variable on its respective column and on its respective row.
The corrplot() function allows for further aesthetics:
corrplot(corr_matrix, method = "color", type = "upper",
addCoef.col = "black", tl.col = "black", tl.srt = 45)And finally, we can change the method to circular. This changes the plot insofar that it does not fill the cells with the color of the correlation coefficient, but makes a circle around the numbers and fills it then with the color. The size of the circles, thus the radius is determined but the strength of the correlation, meaning that the closer the coefficient to zero the smaller the circle:
corrplot(corr_matrix, method = "circle", type = "upper",
addCoef.col = "black", tl.col = "black", tl.srt = 45)If I use the circular method, I like to display it without the numbers, looks better in my opinion, and is more intuitive then filling the whole row:
4.2 Working with EDA packages
There are several other measures for EDA, and of course you do not have to calculate every single measure by hand, although you could. In the following, I will introduce you to the most popular packages for EDA. From my point of view, they are all basically the same, with some nuances and when talking to other R-users I noticed that everyone somehow established his or her own routine of EDA. Therefore I suggest that you get an overview of all those packages and find your own “EDA-Routine” so to speak.
4.2.1 psych
The psych package is a crucial package for psychologists and I love to use its “describe()” function, which shows a bunch of descriptive statistics with one line of code:
## vars n mean sd median trimmed
## species* 1 333 1.92 0.89 2.0 1.90
## island* 2 333 1.65 0.71 2.0 1.57
## bill_length_mm 3 333 43.99 5.47 44.5 43.98
## bill_depth_mm 4 333 17.16 1.97 17.3 17.19
## flipper_length_mm 5 333 200.97 14.02 197.0 200.36
## body_mass_g 6 333 4207.06 805.22 4050.0 4159.46
## sex* 7 333 1.50 0.50 2.0 1.51
## year 8 333 2008.04 0.81 2008.0 2008.05
## mad min max range skew kurtosis
## species* 1.48 1.0 3.0 2.0 0.16 -1.72
## island* 1.48 1.0 3.0 2.0 0.62 -0.85
## bill_length_mm 6.97 32.1 59.6 27.5 0.04 -0.90
## bill_depth_mm 2.22 13.1 21.5 8.4 -0.15 -0.91
## flipper_length_mm 16.31 172.0 231.0 59.0 0.36 -0.98
## body_mass_g 889.56 2700.0 6300.0 3600.0 0.47 -0.75
## sex* 0.00 1.0 2.0 1.0 -0.02 -2.01
## year 1.48 2007.0 2009.0 2.0 -0.08 -1.49
## se
## species* 0.05
## island* 0.04
## bill_length_mm 0.30
## bill_depth_mm 0.11
## flipper_length_mm 0.77
## body_mass_g 44.13
## sex* 0.03
## year 0.04
The describe() function output is a summary table of the basic summary statistics as mean, standard deviation, median, and a lot more and it is a shortcut, so you do not have to calculate every measure on its own.
## Call:corr.test(x = penguins_numeric)
## Correlation matrix
## bill_length_mm bill_depth_mm
## bill_length_mm 1.00 -0.23
## bill_depth_mm -0.23 1.00
## flipper_length_mm 0.65 -0.58
## body_mass_g 0.59 -0.47
## flipper_length_mm body_mass_g
## bill_length_mm 0.65 0.59
## bill_depth_mm -0.58 -0.47
## flipper_length_mm 1.00 0.87
## body_mass_g 0.87 1.00
## Sample Size
## [1] 333
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## bill_length_mm bill_depth_mm
## bill_length_mm 0 0
## bill_depth_mm 0 0
## flipper_length_mm 0 0
## body_mass_g 0 0
## flipper_length_mm body_mass_g
## bill_length_mm 0 0
## bill_depth_mm 0 0
## flipper_length_mm 0 0
## body_mass_g 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
In the next step, I want to show you the pairs.panel() function. It displays a really interesting visualization regarding correlations between the variables:
So this is an interesting grid consisting of different types of data visualization, all related to correlations. Let us start with the easiest one, the diagonal.
- The diagonal shows us a histogram of the distributions of our input variables, additionally it includes a line, so we can check if the distributions look right.
- The upper right visualizations display the correlation coefficients of the respective two variables, like a correlation matrix.
- The bottom left visualizations are scatterplots between the two variables, where a line is fitted between the data points.
- In addition, the shape of the red line (a loess smoother) can reveal whether the relationship between the two variables is linear or more complex, such as curved or s-shaped.
- The scatterplots also include correlation ellipses, which visually represent the strength and direction of the relationship: narrow, tilted ellipses indicate strong correlations, while rounder shapes indicate weaker or no correlations.
Together, this grid gives a comprehensive overview of both the distributions of individual variables and the pairwise relationships between them.
4.2.2 skimr
The skimr package is a wonderful way to get an overview of our datasets structure and basic statistics, even with a visualization of the distribution of the variable. The main function is skim():
| Name | penguins |
| Number of rows | 333 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1 | FALSE | 3 | Ade: 146, Gen: 119, Chi: 68 |
| island | 0 | 1 | FALSE | 3 | Bis: 163, Dre: 123, Tor: 47 |
| sex | 0 | 1 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 0 | 1 | 43.99 | 5.47 | 32.1 | 39.5 | 44.5 | 48.6 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 0 | 1 | 17.16 | 1.97 | 13.1 | 15.6 | 17.3 | 18.7 | 21.5 | ▅▆▇▇▂ |
| flipper_length_mm | 0 | 1 | 200.97 | 14.02 | 172.0 | 190.0 | 197.0 | 213.0 | 231.0 | ▂▇▃▅▃ |
| body_mass_g | 0 | 1 | 4207.06 | 805.22 | 2700.0 | 3550.0 | 4050.0 | 4775.0 | 6300.0 | ▃▇▅▃▂ |
| year | 0 | 1 | 2008.04 | 0.81 | 2007.0 | 2007.0 | 2008.0 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The advantage of the skimr package is that it directly calculates descriptive statistics and shows the missing values for every variable in your dataset. I often use this command to have a look at datasets I am not familiar with. I think it is more a function for exploring the dataset rather than EDA it can be a useful first step for conducting EDA.
4.2.3 summarytools
We will dive into the summarytools package with the classic dfSummary command which summarizes the structure of our dataset:
## Data Frame Summary
## penguins
## Dimensions: 333 x 8
## Duplicates: 0
##
## ----------------------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ------------------- ---------------------------- --------------------- --------------------- ---------- ---------
## 1 species 1. Adelie 146 (43.8%) IIIIIIII 333 0
## [factor] 2. Chinstrap 68 (20.4%) IIII (100.0%) (0.0%)
## 3. Gentoo 119 (35.7%) IIIIIII
##
## 2 island 1. Biscoe 163 (48.9%) IIIIIIIII 333 0
## [factor] 2. Dream 123 (36.9%) IIIIIII (100.0%) (0.0%)
## 3. Torgersen 47 (14.1%) II
##
## 3 bill_length_mm Mean (sd) : 44 (5.5) 163 distinct values . . : 333 0
## [numeric] min < med < max: . : : : : : (100.0%) (0.0%)
## 32.1 < 44.5 < 59.6 : : : : : :
## IQR (CV) : 9.1 (0.1) : : : : : : .
## . : : : : : : : .
##
## 4 bill_depth_mm Mean (sd) : 17.2 (2) 79 distinct values : 333 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 13.1 < 17.3 < 21.5 : . : : : .
## IQR (CV) : 3.1 (0.1) . : : : : : :
## : : : : : : : . .
##
## 5 flipper_length_mm Mean (sd) : 201 (14) 54 distinct values : 333 0
## [integer] min < med < max: . : (100.0%) (0.0%)
## 172 < 197 < 231 : : : . .
## IQR (CV) : 23 (0.1) . : : : : : :
## : : : : : : : : :
##
## 6 body_mass_g Mean (sd) : 4207.1 (805.2) 93 distinct values : 333 0
## [integer] min < med < max: . : (100.0%) (0.0%)
## 2700 < 4050 < 6300 : : : :
## IQR (CV) : 1225 (0.2) : : : : : .
## . : : : : : :
##
## 7 sex 1. female 165 (49.5%) IIIIIIIII 333 0
## [factor] 2. male 168 (50.5%) IIIIIIIIII (100.0%) (0.0%)
##
## 8 year Mean (sd) : 2008 (0.8) 2007 : 103 (30.9%) IIIIII 333 0
## [integer] min < med < max: 2008 : 113 (33.9%) IIIIII (100.0%) (0.0%)
## 2007 < 2008 < 2009 2009 : 117 (35.1%) IIIIIII
## IQR (CV) : 2 (0)
## ----------------------------------------------------------------------------------------------------------------------
The huge advantage of this command is that if we wrap it around the dfSummary function and then it gives us a formatted in a nice table:
## Output file written: /tmp/Rtmp5cSF0B/fileeb6b77fef100.html
The package also includes nice ways of showing frequency tables for single variables with more information than the standard table() command:
## Frequencies
## penguins$species
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## --------------- ------ --------- -------------- --------- --------------
## Adelie 146 43.84 43.84 43.84 43.84
## Chinstrap 68 20.42 64.26 20.42 64.26
## Gentoo 119 35.74 100.00 35.74 100.00
## <NA> 0 0.00 100.00
## Total 333 100.00 100.00 100.00 100.00
We can also use the descr() function rom the package which shows us the most common descriptive statistics of our variables in our dataset:
## Non-numerical variable(s) ignored: species, island, sex
## Descriptive Statistics
## penguins
## N: 333
##
## bill_depth_mm bill_length_mm body_mass_g flipper_length_mm year
## ----------------- --------------- ---------------- ------------- ------------------- ---------
## Mean 17.16 43.99 4207.06 200.97 2008.04
## Std.Dev 1.97 5.47 805.22 14.02 0.81
## Min 13.10 32.10 2700.00 172.00 2007.00
## Q1 15.60 39.50 3550.00 190.00 2007.00
## Median 17.30 44.50 4050.00 197.00 2008.00
## Q3 18.70 48.60 4775.00 213.00 2009.00
## Max 21.50 59.60 6300.00 231.00 2009.00
## MAD 2.22 6.97 889.56 16.31 1.48
## IQR 3.10 9.10 1225.00 23.00 2.00
## CV 0.11 0.12 0.19 0.07 0.00
## Skewness -0.15 0.04 0.47 0.36 -0.08
## SE.Skewness 0.13 0.13 0.13 0.13 0.13
## Kurtosis -0.91 -0.90 -0.75 -0.98 -1.49
## N.Valid 333.00 333.00 333.00 333.00 333.00
## N 333.00 333.00 333.00 333.00 333.00
## Pct.Valid 100.00 100.00 100.00 100.00 100.00
Lastly, I want to show you how you can make cross tables with the summarytools package (you already saw it above):
## Cross-Tabulation, Row Proportions
## species * island
## Data Frame: penguins
##
## ----------- -------- -------------- -------------- ------------ --------------
## island Biscoe Dream Torgersen Total
## species
## Adelie 44 ( 30.1%) 55 ( 37.7%) 47 (32.2%) 146 (100.0%)
## Chinstrap 0 ( 0.0%) 68 (100.0%) 0 ( 0.0%) 68 (100.0%)
## Gentoo 119 (100.0%) 0 ( 0.0%) 0 ( 0.0%) 119 (100.0%)
## Total 163 ( 48.9%) 123 ( 36.9%) 47 (14.1%) 333 (100.0%)
## ----------- -------- -------------- -------------- ------------ --------------
4.2.4 naniar
Naniar is one of the most powerful packages for working with missing data. At first glance, dealing with missing values may seem straightforward — as covered in the “Data Manipulation” chapter, it’s common to simply remove rows with missing values using functions like na.omit() or drop_na().
However, as you progress in data analysis, handling missing data becomes much more important and nuanced. Here’s why:
Dropping missing values can lead to a small sample size (
n), reducing statistical power.If a large portion of data is missing, removing it may introduce bias, especially if the missingness is not random.
In such situations, advanced techniques like multiple imputation become valuable. These methods estimate missing values using mathematical models that consider patterns in the data.
But there’s a catch:
These models have assumptions (e.g., data are Missing at Random — MAR). Violating these assumptions can lead to misleading results. That’s why it’s crucial to explore and understand the structure of missingness before choosing a strategy.
Let us start with the basic miss_var_summary() function, it shows the number of missing values, and calculates the percentages of missing values:
## # A tibble: 17 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 Comments 290 84.3
## 2 Delta 15 N (o/oo) 14 4.07
## 3 Delta 13 C (o/oo) 13 3.78
## 4 Sex 11 3.20
## 5 Culmen Length (mm) 2 0.581
## 6 Culmen Depth (mm) 2 0.581
## 7 Flipper Length (mm) 2 0.581
## 8 Body Mass (g) 2 0.581
## 9 studyName 0 0
## 10 Sample Number 0 0
## 11 Species 0 0
## 12 Region 0 0
## 13 Island 0 0
## 14 Stage 0 0
## 15 Individual ID 0 0
## 16 Clutch Completion 0 0
## 17 Date Egg 0 0
The function that made naniar famous is the gg_miss_upset() function, which shows us the structure of the missing values graphically:
4.2.5 gtsummary
The gtsummary package is the package for data reporting, because it automatically creates data tables ready for publication with one line of code. Let us start with the tbl_summary() function:
| Characteristic | N = 3331 |
|---|---|
| species | |
| Adelie | 146 (44%) |
| Chinstrap | 68 (20%) |
| Gentoo | 119 (36%) |
| island | |
| Biscoe | 163 (49%) |
| Dream | 123 (37%) |
| Torgersen | 47 (14%) |
| bill_length_mm | 44.5 (39.5, 48.6) |
| bill_depth_mm | 17.30 (15.60, 18.70) |
| flipper_length_mm | 197 (190, 213) |
| body_mass_g | 4,050 (3,550, 4,775) |
| sex | |
| female | 165 (50%) |
| male | 168 (50%) |
| year | |
| 2007 | 103 (31%) |
| 2008 | 113 (34%) |
| 2009 | 117 (35%) |
| 1 n (%); Median (Q1, Q3) | |
We get a nice table which splits up the categorical data in its categories and displays the absolute and relative frequencies (in the brackets) and for numeric variables we get the median value with first quartile and the third quartile.
We can also group by certain variables to get a more detailed overview:
| Characteristic | female N = 1651 |
male N = 1681 |
|---|---|---|
| species | ||
| Adelie | 73 (44%) | 73 (43%) |
| Chinstrap | 34 (21%) | 34 (20%) |
| Gentoo | 58 (35%) | 61 (36%) |
| island | ||
| Biscoe | 80 (48%) | 83 (49%) |
| Dream | 61 (37%) | 62 (37%) |
| Torgersen | 24 (15%) | 23 (14%) |
| bill_length_mm | 42.8 (37.6, 46.2) | 46.8 (41.0, 50.4) |
| bill_depth_mm | 17.00 (14.50, 17.80) | 18.45 (16.05, 19.30) |
| flipper_length_mm | 193 (187, 210) | 201 (193, 219) |
| body_mass_g | 3,650 (3,350, 4,550) | 4,300 (3,900, 5,325) |
| year | ||
| 2007 | 51 (31%) | 52 (31%) |
| 2008 | 56 (34%) | 57 (34%) |
| 2009 | 58 (35%) | 59 (35%) |
| 1 n (%); Median (Q1, Q3) | ||
The gtsummary package gives you many options to customize your table, which can be regarding the content (mean instead of median, displaying p-value…) but you can also customize its appearance for example customizing the font.
It also includes a nice option to compute publish-ready cross tables:
|
island
|
Total | |||
|---|---|---|---|---|
| Biscoe | Dream | Torgersen | ||
| species | ||||
| Adelie | 44 | 55 | 47 | 146 |
| Chinstrap | 0 | 68 | 0 | 68 |
| Gentoo | 119 | 0 | 0 | 119 |
| Total | 163 | 123 | 47 | 333 |
4.2.6 dlookr
dlookr is a nice package with different functions that can support us with EDA. Let us start with the diagnose() function, which is a gelps us identify missing values and unique observations in the dataset:
## # A tibble: 8 × 6
## variables types missing_count missing_percent unique_count
## <chr> <chr> <int> <dbl> <int>
## 1 species fact… 0 0 3
## 2 island fact… 0 0 3
## 3 bill_lengt… nume… 0 0 163
## 4 bill_depth… nume… 0 0 79
## 5 flipper_le… inte… 0 0 54
## 6 body_mass_g inte… 0 0 93
## 7 sex fact… 0 0 2
## 8 year inte… 0 0 3
## # ℹ 1 more variable: unique_rate <dbl>
We can also generate an output for summary statistics with an old friend, the describe() function, but this time from the dlookr package:
## # A tibble: 5 × 26
## described_variables n na mean sd se_mean
## <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 bill_length_mm 333 0 44.0 5.47 0.300
## 2 bill_depth_mm 333 0 17.2 1.97 0.108
## 3 flipper_length_mm 333 0 201. 14.0 0.768
## 4 body_mass_g 333 0 4207. 805. 44.1
## 5 year 333 0 2008. 0.813 0.0445
## # ℹ 20 more variables: IQR <dbl>, skewness <dbl>,
## # kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## # p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>,
## # p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>,
## # p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
The dlookr package has one special feature: It can generate an EDA report with one line of code, I introduce you the eda_report() function:
dlookr::eda_paged_report(penguins, output_format = "html")
4.2.7 DataExplorer
DataExplorer is a powerful all-in-one EDA package, that helps us to explore our data with a few line of code. It also includes a function that generates an EDA report.
But let us start by getting basic information about our data with the introduce() function:
## # A tibble: 1 × 9
## rows columns discrete_columns continuous_columns
## <int> <int> <int> <int>
## 1 333 8 3 5
## # ℹ 5 more variables: all_missing_columns <int>,
## # total_missing_values <int>, complete_rows <int>,
## # total_observations <int>, memory_usage <dbl>
We can also plot missing values with DataExplorer by using the plot_missing() function:
Further, we can plot correlations with DataExplorer:
And finally, we can use DataExplorer to generate an automated Data Report:
create_report(penguins)
4.2.8 smartEDA
The last package in this chapter is smartEDA. It is a powerful package designed to quickly create descriptive statistics and visualizations for numeric and categorical data.
The first function is ExpData(), it gives us the structure, missing values and variable types:
## Descriptions
## 1 Sample size (nrow)
## 2 No. of variables (ncol)
## 3 No. of numeric/interger variables
## 4 No. of factor variables
## 5 No. of text variables
## 6 No. of logical variables
## 7 No. of identifier variables
## 8 No. of date variables
## 9 No. of zero variance variables (uniform)
## 10 %. of variables having complete cases
## 11 %. of variables having >0% and <50% missing cases
## 12 %. of variables having >=50% and <90% missing cases
## 13 %. of variables having >=90% missing cases
## Value
## 1 333
## 2 8
## 3 5
## 4 3
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 100% (8)
## 11 0% (0)
## 12 0% (0)
## 13 0% (0)
We can also let smartEDA calculate summary statistics such as mean, standard deviation, skewness, etc.
## Vname Group TN nNeg nZero nPos NegInf PosInf
## 2 bill_depth_mm All 333 0 0 333 0 0
## 1 bill_length_mm All 333 0 0 333 0 0
## 4 body_mass_g All 333 0 0 333 0 0
## 3 flipper_length_mm All 333 0 0 333 0 0
## NA_Value Per_of_Missing sum min max mean
## 2 0 0 5715.9 13.1 21.5 17.165
## 1 0 0 14649.6 32.1 59.6 43.993
## 4 0 0 1400950.0 2700.0 6300.0 4207.057
## 3 0 0 66922.0 172.0 231.0 200.967
## median SD CV IQR Skewness Kurtosis
## 2 17.3 1.969 0.115 3.1 -0.149 -0.897
## 1 44.5 5.469 0.124 9.1 0.045 -0.888
## 4 4050.0 805.216 0.191 1225.0 0.470 -0.740
## 3 197.0 14.016 0.070 23.0 0.359 -0.965
Lastly, we can again generate an automatized EDA report with ExpReport():
ExpReport(data = penguins, Target = "species", label = "Penguin Species", op_file="Samp1.html", Rc=3 )
4.3 Conclusion
And that is it - at least for now - the possibilities and functions of the packages presented could fill easily an own course and as I already said, at one point everyone gets its own EDA routine and has its own packages they want to work with. The important point is to always get an overview over your data and to always check for interesting patterns in your data before conducting substantial analysis.
4.4 Exercises EDA
4.4.1 Exercise 1: Standard Descriptive Statistics
In this exercise we will work with the built-in iris package in R:
a. Calculate the mode, mean and the median for the iris$Sepal.Length variable
b. Calculate the interquartile range, variance and the standard deviation for iris$Sepal.Length
c. Calculate all five measures at once by using a function that does so (Choose by yourself, which one you want to use)
4.4.2 Exercise 2: Contingency Tables and Correlations
a. Make a Contingency Table for esoph$agegp and esoph$alcgp
b. Cut down the iris dataset to Sepal.Length, Sepal.Width, Petal.Length and Petal.Width and save it in an object called iris_numeric.
c. Make a correlation matrix with iris_numeric
d. Make the correlation matrix prettyChapter 4: Exploratory Data Analysis