Chapter 4 Exploratory Data Analysis (EDA)

In this chapter, we will start with data analysis. One crucial step when analyzing data is Exploratory Data Analysis. It describes the process of analyzing data sets to summarize their main characteristics. This step can help understanding the data, checking its quality, early detecting patterns and trends and gain first insights!

Since R is a software specifically designed for statistics, we have a lot of high value libraries to perform EDA. The dataset we will use for this part is called palmerpenguins. This is a dataset about penguin species and their attributes.

pacman::p_load("summarytools", "SmartEDA", "skimr", 
               "naniar", "gtsummary", "dlookr",
               "DataExplorer", "psych", "ggplot2",
               "palmerpenguins", "dplyr", "tidyr", "corrplot")

penguins <- na.omit(penguins)
penguins_raw <- penguins_raw

4.1 Standard Descriptive Statistics

4.1.1 Measures of Central Tendency

As the name suggests, measures of central tendency are helping us to understand the probability distribution of the data, its center and typical values. The three most common measures of central tendency are the arithmetic mean, the median and the mode.

Mode: The most frequent number
Mean: The sum of all values divided by the total number of values
Median: The middle number in an ordered dataset

4.1.1.1 Mode

The mode is probably the easiest measure out of all measures: It is defined as the most frequent number of all observations. We cannot directly calculate the mode, but there is a way to it. First we look at all unique values of or observations with the unique() function, then we count the occurrences of each unique value with tabulate(), and lastly we use the which.max() function to get the most frequent unique value:

uniq_vals <- unique(penguins$bill_length_mm)  # Get unique values
freqs <- tabulate(match(penguins$bill_length_mm, uniq_vals))  # Count occurrences
uniq_vals[which.max(freqs)] # Getting the unique value with the most occurrences

## [1] 41.1

4.1.1.2 Mean

Let us start by looking at the formula to calculate a mean:

\[ \bar{x} = \frac{\sum{x_i}}{n} \]

whereas:

$\bar{x}$ is our mean
$\sum{x_i}$ is the sum of all our observations.
$n$ is the number of all our observations

We could do that by hand or we just use the built-in mean() function:

mean(penguins$bill_length_mm)

## [1] 43.99279

4.1.1.3 Median

Image sorting all your data from the lowest to highest and then pointing at the value, which has exactly 50% of all values to its left and the other 50% to its right, this would be the median value. Well, at least you will point at a value if your distribution has an yeven number of observations. But you can also calculate the value for an uneven number of observations, let us have a look at both formulas:

$X_{(\frac{n+1}{2})}$ for an even number of n
$\frac{1}{2}X_{(\frac{n}{2})} + X_{(\frac{n}{2} + 1)}$ for an uneven number of n

Again we could calculate that by hand or we just use the median() function:

median(penguins$bill_length_mm)

## [1] 44.5

4.1.2 Measures of Dispersion

In statistics, measures of dispersion describe the extent to which a distribution of a variable is stretched or squeezed. In other words, they help to gauge the spread of our distributions.

4.1.2.1 Interquartile Range (IQR)

You remember the boxplot from the data visualization chapter? It is supposed to show the so-called interquartile range (IQR). It is defined as the difference between the 75th percentile (or third quartile) and the 25th percentile (or the first quartile). Basically the distribution is spread into four equally big areas, which are separated by three points, the first quartile denoted by $Q_1$ (also called the lower quartile), the second quartile is the median and denoted as $Q_2$, and the third quartile is measures of dispersiondenoted by $Q_3$ (also called the upper quartile), thus the formula is:

\[ IQR = Q_3 - Q_1 \]

In R, we can calculate the Interquartile Range (IQR) using the IQR() function. By hand, we would:

Sort the data in ascending order.
Split the data into four equal parts (quartiles).
Identify Q1 (first quartile, 25th percentile) and Q3 (third quartile, 75th percentile).

Note that in some cases the data cannot be divided into four even parts ddue to their size. In such cases, different statistical methods (e.g. Tukey’s Hinges) approximate quartiles in such cases.

IQR(penguins$bill_length_mm)

## [1] 9.1

4.1.2.2 Variance

In statistics, the variance is the expected value of the squared deviation from the mean of our random variable. The concept of the mean gets clear if we break down its formula:

\[ s² = \frac{\sum(x_i - \bar{x})²}{n-1} \]

where

the index i runs over the observations (Respondents, Countries,…), i = 1,…,n
$x_i$ are our observations
$\bar{x}$ is the mean of our distribution
$n$ is the number of observations

Especially the nominator of the formula $\sum(x_i - \bar{x})²$ is quite interesting because it uses an interesting technique:

Image our data is aligned on one dimension and somewhere in the middle there is our mean:

Now, image we would calculate the differences and sum them up without squaring.

You see that distance 1 would be -3 and distance 2 would be 4 an in sum that makes 0,8, but that is for sure not the distance between those two points. Here comes the squared part into action, every squared number is positive, this ensures that -3 becomes 9 and 4 becomes 16, thus the differences can be summed and result in 24.

In R, we can implement this by simply calling the var() function:

var(penguins$bill_length_mm)

## [1] 29.90633

Well we get 29.9 and the problem with variance is, that the squaring technique I showed you earlier leads to a problem: We cannot really interpret the data because it loses its unit, for example if our data is in meter, the variance would be in square meters. But we can solve this problem with the next measure: standard deviation.

4.1.2.3 Standard Deviation

The standard deviation is the square root of the variance. It describes the amount of variation of the values of a variables about its mean. A low standard deviation indicate closeness of the values to the mean, and a high standard deviation indicates that the values are spread out over a wider range.

The standard deviation is the square root of the variance:

\[ s = \sqrt{s²} = \sqrt\frac{\sum(x_i - \bar{x})}{n-1} \]

Remember our example?:

We calculated the variance the distance of the squared sums to be 24 (9 + 16). However, what happens if we take the square root of the squared values before adding them? Well, we get 3 for distance 1 and 4 for distance 2. For distance 2 nothing changed, it came back to its original value 4, but distance 1 has become positive from -3 to 3. Now we can add it together and get the distance 7.

Let us apply it in R, with the function sd():

sd(penguins$bill_length_mm)

## [1] 5.468668

The huge advantage of the standard deviation in comparison to the variance is that it can be interpreted in the original unit and is thus a more intuitive measure of dispersion to work with.

4.1.3 Relationships between Variables

It is necessary to understand one key difference in EDA. There are values you simply look at and interpret such as the measures shown before. But on the other hand there are measures which look at the relationship between variables, thus analyzing how they interact which each other. In the following, we will look at two methods to do so.

4.1.3.1 Crosstables / Contingency Tables

table(penguins$species, penguins$island)

##            
##             Biscoe Dream Torgersen
##   Adelie        44    55        47
##   Chinstrap      0    68         0
##   Gentoo       119     0         0

summarytools::ctable(penguins$species, penguins$island)

## Cross-Tabulation, Row Proportions  
## species * island  
## Data Frame: penguins  
## 
## ----------- -------- -------------- -------------- ------------ --------------
##               island         Biscoe          Dream    Torgersen          Total
##     species                                                                   
##      Adelie             44 ( 30.1%)    55 ( 37.7%)   47 (32.2%)   146 (100.0%)
##   Chinstrap              0 (  0.0%)    68 (100.0%)    0 ( 0.0%)    68 (100.0%)
##      Gentoo            119 (100.0%)     0 (  0.0%)    0 ( 0.0%)   119 (100.0%)
##       Total            163 ( 48.9%)   123 ( 36.9%)   47 (14.1%)   333 (100.0%)
## ----------- -------- -------------- -------------- ------------ --------------

gtsummary::tbl_cross(data = penguins, 
                     row = species, 
                     col = island)

	island			Total
	Biscoe	Dream	Torgersen	Total
species
Adelie	44	55	47	146
Chinstrap	0	68	0	68
Gentoo	119	0	0	119
Total	163	123	47	333

4.1.3.2 Correlation

Correlation is an umbrella term for any statistical relationship, whether causal or not, between two random variables or bivariate data. The three main measures of correlations are named after their inventors: Pearsons Correlation (or simply Pearson’s r), Spearman’s Rank Correlation (or simply Spearman’s rho) and Kendall’s Tau.

Before going on let us make clear what a linear relationship means. It means that if an observation increases or decreases the corresponding variables changes in a proportional and predictable way.

4.1.3.2.1 Pearsons Correlation

Pearsons Correlation measures the linear relationship between two continous variables. To do so it assumes a normal distribution between two variables, a linear relationship and no major outliers.

Normally, you should test those before running a correlation, but we skip that part and look at the implementation in R:

cor(penguins$bill_length_mm, penguins$body_mass_g,
    method = "pearson")

## [1] 0.5894511

The interpretation of pearson’s r is quite straightforward: The result always ranges between +1 and -1, where +1 means a perfect linear relationship, 0 means no relationship at all, -1 means a perfect linear relationship.

In our case there is a strong positive, linear relationship of both variables with a r = 0.59.

4.1.3.2.2 Spearman’s Rank Correlation

Spearman’s Rank Correlation, or simply Spearman’s Rho shows if ordinal or continuous variables have a Monotonic Relationship. This means that if one variables increases, the other always increases or decreases but not necessarily at a constant way, as it would be with a linear relationship.

It has different assumptions like outliers are allowed, the relationship is non-linear, the data contains rank and the variables are not normally distributed.

Let us have a look, how it is calculated in R:

cor(penguins$bill_length_mm, penguins$body_mass_g,
    method = "spearman")

## [1] 0.5764804

The interpretation is analog to Pearson’s R, +1 means a perfect positive, monotonic relationship, 0 means no monotonic relationship and -1 means perfect negative, monotonic relationship.

A spearman’s rho of 0.58 indicates a strong, positive, monotonic relationship between the two variables.

4.1.3.2.3 Kendalls Tau

Kendalls Tau measures the strength and direction of association between two ranked variables. We differentiate between two types of relationships: Concordant and Discordant relationships.

A concordant relationship means both data points move in the same direction
A discordant relationship if one data point increases, the other decreases (or vice versa).

Let us calculate it:

cor(penguins$bill_length_mm, penguins$body_mass_g,
    method = "kendall")

## [1] 0.4277598

Kendalls Tau can be interpreted wiht +1 as a perfect rank agreement, 0 means no association at all, and -1 means a perfect rank reversal.

Our kendall’s tau indicates a high rank agreement between both our variables.

4.1.3.3 Correlation Graphically

The part before was quite theoretical, but there are also nice approaches to look at correlations graphically. we start with a simple scatterplot:

ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point(color = "#0077b6") +
  labs(x = "Length in mm",
       y = "Body Mass in g",
       title = "Relationship between Length (in mm) and Body Mass (in g)") + 
  theme_bw()

When looking at relationships between two variables, scatterplots are the standard visualization to do so. As we can see, there is a clear trend that the length in mm could be positively related to the body mass in g. In the end of the day, there are three types of linear relationships and the scatterplots always look accordingly:

set.seed(123)

n <- 100

df_cor <- data.frame(
  x = rep(1:n, 3),
  relationship = rep(c("Positive", "Negative", "None"), 
                     each = n),
  y = c(
    (1:n) + rnorm(n, sd = 15),    # strong positive correlation
    (n:1) + rnorm(n, sd = 15),    # strong negative correlation
    rnorm(n, mean = 50, sd = 20) # no correlation
  )
)

# Reorder factor levels
df_cor$relationship <- factor(df_cor$relationship, 
                          levels = c("Positive", 
                                     "None", "Negative"),
                          labels = c("Positive", 
                                     "No Correlation", 
                                     "Negative"))

# Plot
ggplot(df_cor, aes(x = x, y = y)) +
  geom_point(color = "steelblue", size = 2) +
  facet_wrap(~relationship, nrow = 1) +
  labs(title = "Strong Positive, Negative, and No Correlation",
       x = "X", y = "Y") +
  theme_bw(base_size = 18)

With scatterplots in combination with facet_wrap() you can show several correlations graphically, but there is a way to calculate the correlation coefficient and to show it graphically with a correlation plot. A correlation plot combines the logic of contingency tables, heat maps and the correlation coefficients.

First a correlation matrix is created. A table that shows the pairwise correlation coefficients (typically Pearson) between several numerical variables. Each cell in the matrix represents the strength and direction of the linear relationship between two variables. A correlation plot is then a visual representation of this matrix, often using color gradients or circle sizes to show the strength and direction of correlations, making it easier to spot patterns.

Let us compute it in R: We will use the corrplot package in R (There are other ways to compute it, which I will show later).

First, we cut down our dataset to the variables you want to correlate with each other
Second, compute a correlation matrix with the cor() command
Third, call the corrplot variable, and take in the dataset, define the method (we will use the color to display the strength of the correlation), the type ('full' (default), 'upper' or 'lower', display full matrix, lower triangular or upper triangular matrix).

# Step 1: Prepare numeric data
penguins_numeric <- penguins %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()

# Step 2: Compute correlation matrix
corr_matrix <- cor(penguins_numeric)

# Step 3: Plot the correlation matrix
corrplot(corr_matrix, method = "color")

Now, we can see in an elegant way the correlation between the four selected variables, which are displayed through the color. Every cell displays the correlation coefficient of the variable on its respective column and on its respective row.

The corrplot() function allows for further aesthetics:

corrplot(corr_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black", tl.srt = 45)

And finally, we can change the method to circular. This changes the plot insofar that it does not fill the cells with the color of the correlation coefficient, but makes a circle around the numbers and fills it then with the color. The size of the circles, thus the radius is determined but the strength of the correlation, meaning that the closer the coefficient to zero the smaller the circle:

corrplot(corr_matrix, method = "circle", type = "upper", 
         addCoef.col = "black", tl.col = "black", tl.srt = 45)

If I use the circular method, I like to display it without the numbers, looks better in my opinion, and is more intuitive then filling the whole row:

corrplot(corr_matrix, method = "circle", type = "upper", 
         tl.col = "black", tl.srt = 45)

4.2 Working with EDA packages

There are several other measures for EDA, and of course you do not have to calculate every single measure by hand, although you could. In the following, I will introduce you to the most popular packages for EDA. From my point of view, they are all basically the same, with some nuances and when talking to other R-users I noticed that everyone somehow established his or her own routine of EDA. Therefore I suggest that you get an overview of all those packages and find your own “EDA-Routine” so to speak.

4.2.1 psych

The psych package is a crucial package for psychologists and I love to use its “describe()” function, which shows a bunch of descriptive statistics with one line of code:

#Applying describe to the whole dataset
psych::describe(penguins)

##                   vars   n    mean     sd median trimmed
## species*             1 333    1.92   0.89    2.0    1.90
## island*              2 333    1.65   0.71    2.0    1.57
## bill_length_mm       3 333   43.99   5.47   44.5   43.98
## bill_depth_mm        4 333   17.16   1.97   17.3   17.19
## flipper_length_mm    5 333  200.97  14.02  197.0  200.36
## body_mass_g          6 333 4207.06 805.22 4050.0 4159.46
## sex*                 7 333    1.50   0.50    2.0    1.51
## year                 8 333 2008.04   0.81 2008.0 2008.05
##                      mad    min    max  range  skew kurtosis
## species*            1.48    1.0    3.0    2.0  0.16    -1.72
## island*             1.48    1.0    3.0    2.0  0.62    -0.85
## bill_length_mm      6.97   32.1   59.6   27.5  0.04    -0.90
## bill_depth_mm       2.22   13.1   21.5    8.4 -0.15    -0.91
## flipper_length_mm  16.31  172.0  231.0   59.0  0.36    -0.98
## body_mass_g       889.56 2700.0 6300.0 3600.0  0.47    -0.75
## sex*                0.00    1.0    2.0    1.0 -0.02    -2.01
## year                1.48 2007.0 2009.0    2.0 -0.08    -1.49
##                      se
## species*           0.05
## island*            0.04
## bill_length_mm     0.30
## bill_depth_mm      0.11
## flipper_length_mm  0.77
## body_mass_g       44.13
## sex*               0.03
## year               0.04

#You can apply describe() also for single variables
#psych::describe(penguins$bill_length_mm)

The describe() function output is a summary table of the basic summary statistics as mean, standard deviation, median, and a lot more and it is a shortcut, so you do not have to calculate every measure on its own.

corr.test(penguins_numeric)

## Call:corr.test(x = penguins_numeric)
## Correlation matrix 
##                   bill_length_mm bill_depth_mm
## bill_length_mm              1.00         -0.23
## bill_depth_mm              -0.23          1.00
## flipper_length_mm           0.65         -0.58
## body_mass_g                 0.59         -0.47
##                   flipper_length_mm body_mass_g
## bill_length_mm                 0.65        0.59
## bill_depth_mm                 -0.58       -0.47
## flipper_length_mm              1.00        0.87
## body_mass_g                    0.87        1.00
## Sample Size 
## [1] 333
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                   bill_length_mm bill_depth_mm
## bill_length_mm                 0             0
## bill_depth_mm                  0             0
## flipper_length_mm              0             0
## body_mass_g                    0             0
##                   flipper_length_mm body_mass_g
## bill_length_mm                    0           0
## bill_depth_mm                     0           0
## flipper_length_mm                 0           0
## body_mass_g                       0           0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

In the next step, I want to show you the pairs.panel() function. It displays a really interesting visualization regarding correlations between the variables:

pairs.panels(penguins_numeric)

So this is an interesting grid consisting of different types of data visualization, all related to correlations. Let us start with the easiest one, the diagonal.

The diagonal shows us a histogram of the distributions of our input variables, additionally it includes a line, so we can check if the distributions look right.
The upper right visualizations display the correlation coefficients of the respective two variables, like a correlation matrix.
The bottom left visualizations are scatterplots between the two variables, where a line is fitted between the data points.
In addition, the shape of the red line (a loess smoother) can reveal whether the relationship between the two variables is linear or more complex, such as curved or s-shaped.
The scatterplots also include correlation ellipses, which visually represent the strength and direction of the relationship: narrow, tilted ellipses indicate strong correlations, while rounder shapes indicate weaker or no correlations.

Together, this grid gives a comprehensive overview of both the distributions of individual variables and the pairwise relationships between them.

4.2.2 skimr

The skimr package is a wonderful way to get an overview of our datasets structure and basic statistics, even with a visualization of the distribution of the variable. The main function is skim():

skimr::skim(penguins)

Table 4.1: Data summary
Name	penguins
Number of rows	333
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
species	1	FALSE	3	Ade: 146, Gen: 119, Chi: 68
island	1	FALSE	3	Bis: 163, Dre: 123, Tor: 47
sex	1	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	1	43.99	5.47	32.1	39.5	44.5	48.6	59.6	▃▇▇▆▁
bill_depth_mm	1	17.16	1.97	13.1	15.6	17.3	18.7	21.5	▅▆▇▇▂
flipper_length_mm	1	200.97	14.02	172.0	190.0	197.0	213.0	231.0	▂▇▃▅▃
body_mass_g	1	4207.06	805.22	2700.0	3550.0	4050.0	4775.0	6300.0	▃▇▅▃▂
year	1	2008.04	0.81	2007.0	2007.0	2008.0	2009.0	2009.0	▇▁▇▁▇

#If you do not want to see the distribution
#skimr::skim_without_charts(penguins)

The advantage of the skimr package is that it directly calculates descriptive statistics and shows the missing values for every variable in your dataset. I often use this command to have a look at datasets I am not familiar with. I think it is more a function for exploring the dataset rather than EDA it can be a useful first step for conducting EDA.

4.2.3 summarytools

We will dive into the summarytools package with the classic dfSummary command which summarizes the structure of our dataset:

dfSummary(penguins)

## Data Frame Summary  
## penguins  
## Dimensions: 333 x 8  
## Duplicates: 0  
## 
## ----------------------------------------------------------------------------------------------------------------------
## No   Variable            Stats / Values               Freqs (% of Valid)    Graph                 Valid      Missing  
## ---- ------------------- ---------------------------- --------------------- --------------------- ---------- ---------
## 1    species             1. Adelie                    146 (43.8%)           IIIIIIII              333        0        
##      [factor]            2. Chinstrap                  68 (20.4%)           IIII                  (100.0%)   (0.0%)   
##                          3. Gentoo                    119 (35.7%)           IIIIIII                                   
## 
## 2    island              1. Biscoe                    163 (48.9%)           IIIIIIIII             333        0        
##      [factor]            2. Dream                     123 (36.9%)           IIIIIII               (100.0%)   (0.0%)   
##                          3. Torgersen                  47 (14.1%)           II                                        
## 
## 3    bill_length_mm      Mean (sd) : 44 (5.5)         163 distinct values       .     . :         333        0        
##      [numeric]           min < med < max:                                     . : : : : :         (100.0%)   (0.0%)   
##                          32.1 < 44.5 < 59.6                                   : : : : : :                             
##                          IQR (CV) : 9.1 (0.1)                                 : : : : : : .                           
##                                                                             . : : : : : : : .                         
## 
## 4    bill_depth_mm       Mean (sd) : 17.2 (2)         79 distinct values              :           333        0        
##      [numeric]           min < med < max:                                           : :           (100.0%)   (0.0%)   
##                          13.1 < 17.3 < 21.5                                   : . : : : .                             
##                          IQR (CV) : 3.1 (0.1)                               . : : : : : :                             
##                                                                             : : : : : : : . .                         
## 
## 5    flipper_length_mm   Mean (sd) : 201 (14)         54 distinct values          :               333        0        
##      [integer]           min < med < max:                                       . :               (100.0%)   (0.0%)   
##                          172 < 197 < 231                                        : : :   . .                           
##                          IQR (CV) : 23 (0.1)                                  . : : :   : : :                         
##                                                                               : : : : : : : : :                       
## 
## 6    body_mass_g         Mean (sd) : 4207.1 (805.2)   93 distinct values        :                 333        0        
##      [integer]           min < med < max:                                     . :                 (100.0%)   (0.0%)   
##                          2700 < 4050 < 6300                                   : : : :                                 
##                          IQR (CV) : 1225 (0.2)                                : : : : : .                             
##                                                                             . : : : : : :                             
## 
## 7    sex                 1. female                    165 (49.5%)           IIIIIIIII             333        0        
##      [factor]            2. male                      168 (50.5%)           IIIIIIIIII            (100.0%)   (0.0%)   
## 
## 8    year                Mean (sd) : 2008 (0.8)       2007 : 103 (30.9%)    IIIIII                333        0        
##      [integer]           min < med < max:             2008 : 113 (33.9%)    IIIIII                (100.0%)   (0.0%)   
##                          2007 < 2008 < 2009           2009 : 117 (35.1%)    IIIIIII                                   
##                          IQR (CV) : 2 (0)                                                                             
## ----------------------------------------------------------------------------------------------------------------------

The huge advantage of this command is that if we wrap it around the dfSummary function and then it gives us a formatted in a nice table:

view(dfSummary(penguins))

## Output file written: /tmp/Rtmp5cSF0B/fileeb6b77fef100.html

# Or if you are a good R user, then you can use also a pipe
# dfSummary(penguins) %>%
# view()

The package also includes nice ways of showing frequency tables for single variables with more information than the standard table() command:

freq(penguins$species)

## Frequencies  
## penguins$species  
## Type: Factor  
## 
##                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## --------------- ------ --------- -------------- --------- --------------
##          Adelie    146     43.84          43.84     43.84          43.84
##       Chinstrap     68     20.42          64.26     20.42          64.26
##          Gentoo    119     35.74         100.00     35.74         100.00
##            <NA>      0                               0.00         100.00
##           Total    333    100.00         100.00    100.00         100.00

We can also use the descr() function rom the package which shows us the most common descriptive statistics of our variables in our dataset:

descr(penguins)

## Non-numerical variable(s) ignored: species, island, sex

## Descriptive Statistics  
## penguins  
## N: 333  
## 
##                     bill_depth_mm   bill_length_mm   body_mass_g   flipper_length_mm      year
## ----------------- --------------- ---------------- ------------- ------------------- ---------
##              Mean           17.16            43.99       4207.06              200.97   2008.04
##           Std.Dev            1.97             5.47        805.22               14.02      0.81
##               Min           13.10            32.10       2700.00              172.00   2007.00
##                Q1           15.60            39.50       3550.00              190.00   2007.00
##            Median           17.30            44.50       4050.00              197.00   2008.00
##                Q3           18.70            48.60       4775.00              213.00   2009.00
##               Max           21.50            59.60       6300.00              231.00   2009.00
##               MAD            2.22             6.97        889.56               16.31      1.48
##               IQR            3.10             9.10       1225.00               23.00      2.00
##                CV            0.11             0.12          0.19                0.07      0.00
##          Skewness           -0.15             0.04          0.47                0.36     -0.08
##       SE.Skewness            0.13             0.13          0.13                0.13      0.13
##          Kurtosis           -0.91            -0.90         -0.75               -0.98     -1.49
##           N.Valid          333.00           333.00        333.00              333.00    333.00
##                 N          333.00           333.00        333.00              333.00    333.00
##         Pct.Valid          100.00           100.00        100.00              100.00    100.00

Lastly, I want to show you how you can make cross tables with the summarytools package (you already saw it above):

ctable(penguins$species, penguins$island)

## Cross-Tabulation, Row Proportions  
## species * island  
## Data Frame: penguins  
## 
## ----------- -------- -------------- -------------- ------------ --------------
##               island         Biscoe          Dream    Torgersen          Total
##     species                                                                   
##      Adelie             44 ( 30.1%)    55 ( 37.7%)   47 (32.2%)   146 (100.0%)
##   Chinstrap              0 (  0.0%)    68 (100.0%)    0 ( 0.0%)    68 (100.0%)
##      Gentoo            119 (100.0%)     0 (  0.0%)    0 ( 0.0%)   119 (100.0%)
##       Total            163 ( 48.9%)   123 ( 36.9%)   47 (14.1%)   333 (100.0%)
## ----------- -------- -------------- -------------- ------------ --------------

4.2.4 naniar

Naniar is one of the most powerful packages for working with missing data. At first glance, dealing with missing values may seem straightforward — as covered in the “Data Manipulation” chapter, it’s common to simply remove rows with missing values using functions like na.omit() or drop_na().

However, as you progress in data analysis, handling missing data becomes much more important and nuanced. Here’s why:

Dropping missing values can lead to a small sample size (n), reducing statistical power.
If a large portion of data is missing, removing it may introduce bias, especially if the missingness is not random.

In such situations, advanced techniques like multiple imputation become valuable. These methods estimate missing values using mathematical models that consider patterns in the data.

But there’s a catch:

These models have assumptions (e.g., data are Missing at Random — MAR). Violating these assumptions can lead to misleading results. That’s why it’s crucial to explore and understand the structure of missingness before choosing a strategy.

Let us start with the basic miss_var_summary() function, it shows the number of missing values, and calculates the percentages of missing values:

naniar::miss_var_summary(penguins_raw)

## # A tibble: 17 × 3
##    variable            n_miss pct_miss
##    <chr>                <int>    <num>
##  1 Comments               290   84.3  
##  2 Delta 15 N (o/oo)       14    4.07 
##  3 Delta 13 C (o/oo)       13    3.78 
##  4 Sex                     11    3.20 
##  5 Culmen Length (mm)       2    0.581
##  6 Culmen Depth (mm)        2    0.581
##  7 Flipper Length (mm)      2    0.581
##  8 Body Mass (g)            2    0.581
##  9 studyName                0    0    
## 10 Sample Number            0    0    
## 11 Species                  0    0    
## 12 Region                   0    0    
## 13 Island                   0    0    
## 14 Stage                    0    0    
## 15 Individual ID            0    0    
## 16 Clutch Completion        0    0    
## 17 Date Egg                 0    0

The function that made naniar famous is the gg_miss_upset() function, which shows us the structure of the missing values graphically:

naniar::gg_miss_upset(penguins_raw)

naniar::vis_miss(penguins_raw)

4.2.5 gtsummary

The gtsummary package is the package for data reporting, because it automatically creates data tables ready for publication with one line of code. Let us start with the tbl_summary() function:

gtsummary::tbl_summary(penguins)

Characteristic	N = 333¹
species
Adelie	146 (44%)
Chinstrap	68 (20%)
Gentoo	119 (36%)
island
Biscoe	163 (49%)
Dream	123 (37%)
Torgersen	47 (14%)
bill_length_mm	44.5 (39.5, 48.6)
bill_depth_mm	17.30 (15.60, 18.70)
flipper_length_mm	197 (190, 213)
body_mass_g	4,050 (3,550, 4,775)
sex
female	165 (50%)
male	168 (50%)
year
2007	103 (31%)
2008	113 (34%)
2009	117 (35%)
¹ n (%); Median (Q1, Q3)

We get a nice table which splits up the categorical data in its categories and displays the absolute and relative frequencies (in the brackets) and for numeric variables we get the median value with first quartile and the third quartile.

We can also group by certain variables to get a more detailed overview:

penguins %>%
  tbl_summary(by = sex)

Characteristic	female N = 165¹	male N = 168¹
species
Adelie	73 (44%)	73 (43%)
Chinstrap	34 (21%)	34 (20%)
Gentoo	58 (35%)	61 (36%)
island
Biscoe	80 (48%)	83 (49%)
Dream	61 (37%)	62 (37%)
Torgersen	24 (15%)	23 (14%)
bill_length_mm	42.8 (37.6, 46.2)	46.8 (41.0, 50.4)
bill_depth_mm	17.00 (14.50, 17.80)	18.45 (16.05, 19.30)
flipper_length_mm	193 (187, 210)	201 (193, 219)
body_mass_g	3,650 (3,350, 4,550)	4,300 (3,900, 5,325)
year
2007	51 (31%)	52 (31%)
2008	56 (34%)	57 (34%)
2009	58 (35%)	59 (35%)
¹ n (%); Median (Q1, Q3)

The gtsummary package gives you many options to customize your table, which can be regarding the content (mean instead of median, displaying p-value…) but you can also customize its appearance for example customizing the font.

It also includes a nice option to compute publish-ready cross tables:

penguins %>%
  tbl_cross(
    row = species,
    col = island
  )

	island			Total
	Biscoe	Dream	Torgersen	Total
species
Adelie	44	55	47	146
Chinstrap	0	68	0	68
Gentoo	119	0	0	119
Total	163	123	47	333

4.2.6 dlookr

dlookr is a nice package with different functions that can support us with EDA. Let us start with the diagnose() function, which is a gelps us identify missing values and unique observations in the dataset:

diagnose(penguins) %>%
  print()

## # A tibble: 8 × 6
##   variables   types missing_count missing_percent unique_count
##   <chr>       <chr>         <int>           <dbl>        <int>
## 1 species     fact…             0               0            3
## 2 island      fact…             0               0            3
## 3 bill_lengt… nume…             0               0          163
## 4 bill_depth… nume…             0               0           79
## 5 flipper_le… inte…             0               0           54
## 6 body_mass_g inte…             0               0           93
## 7 sex         fact…             0               0            2
## 8 year        inte…             0               0            3
## # ℹ 1 more variable: unique_rate <dbl>

We can also generate an output for summary statistics with an old friend, the describe() function, but this time from the dlookr package:

dlookr::describe(penguins)

## # A tibble: 5 × 26
##   described_variables     n    na   mean      sd se_mean
##   <chr>               <int> <int>  <dbl>   <dbl>   <dbl>
## 1 bill_length_mm        333     0   44.0   5.47   0.300 
## 2 bill_depth_mm         333     0   17.2   1.97   0.108 
## 3 flipper_length_mm     333     0  201.   14.0    0.768 
## 4 body_mass_g           333     0 4207.  805.    44.1   
## 5 year                  333     0 2008.    0.813  0.0445
## # ℹ 20 more variables: IQR <dbl>, skewness <dbl>,
## #   kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## #   p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>,
## #   p50 <dbl>, p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>,
## #   p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>

The dlookr package has one special feature: It can generate an EDA report with one line of code, I introduce you the eda_report() function:

dlookr::eda_paged_report(penguins, output_format = "html")

4.2.7 DataExplorer

DataExplorer is a powerful all-in-one EDA package, that helps us to explore our data with a few line of code. It also includes a function that generates an EDA report.

But let us start by getting basic information about our data with the introduce() function:

introduce(penguins)

## # A tibble: 1 × 9
##    rows columns discrete_columns continuous_columns
##   <int>   <int>            <int>              <int>
## 1   333       8                3                  5
## # ℹ 5 more variables: all_missing_columns <int>,
## #   total_missing_values <int>, complete_rows <int>,
## #   total_observations <int>, memory_usage <dbl>

We can also plot missing values with DataExplorer by using the plot_missing() function:

plot_missing(penguins_raw)

Further, we can plot correlations with DataExplorer:

plot_correlation(penguins_numeric)

And finally, we can use DataExplorer to generate an automated Data Report:

create_report(penguins)

4.2.8 smartEDA

The last package in this chapter is smartEDA. It is a powerful package designed to quickly create descriptive statistics and visualizations for numeric and categorical data.

The first function is ExpData(), it gives us the structure, missing values and variable types:

ExpData(penguins, type = 1)

##                                           Descriptions
## 1                                   Sample size (nrow)
## 2                              No. of variables (ncol)
## 3                    No. of numeric/interger variables
## 4                              No. of factor variables
## 5                                No. of text variables
## 6                             No. of logical variables
## 7                          No. of identifier variables
## 8                                No. of date variables
## 9             No. of zero variance variables (uniform)
## 10               %. of variables having complete cases
## 11   %. of variables having >0% and <50% missing cases
## 12 %. of variables having >=50% and <90% missing cases
## 13          %. of variables having >=90% missing cases
##       Value
## 1       333
## 2         8
## 3         5
## 4         3
## 5         0
## 6         0
## 7         0
## 8         0
## 9         0
## 10 100% (8)
## 11   0% (0)
## 12   0% (0)
## 13   0% (0)

We can also let smartEDA calculate summary statistics such as mean, standard deviation, skewness, etc.

ExpNumStat(penguins)

##               Vname Group  TN nNeg nZero nPos NegInf PosInf
## 2     bill_depth_mm   All 333    0     0  333      0      0
## 1    bill_length_mm   All 333    0     0  333      0      0
## 4       body_mass_g   All 333    0     0  333      0      0
## 3 flipper_length_mm   All 333    0     0  333      0      0
##   NA_Value Per_of_Missing       sum    min    max     mean
## 2        0              0    5715.9   13.1   21.5   17.165
## 1        0              0   14649.6   32.1   59.6   43.993
## 4        0              0 1400950.0 2700.0 6300.0 4207.057
## 3        0              0   66922.0  172.0  231.0  200.967
##   median      SD    CV    IQR Skewness Kurtosis
## 2   17.3   1.969 0.115    3.1   -0.149   -0.897
## 1   44.5   5.469 0.124    9.1    0.045   -0.888
## 4 4050.0 805.216 0.191 1225.0    0.470   -0.740
## 3  197.0  14.016 0.070   23.0    0.359   -0.965

Lastly, we can again generate an automatized EDA report with ExpReport():

ExpReport(data = penguins, Target = "species", label = "Penguin Species", op_file="Samp1.html", Rc=3 )

4.3 Conclusion

And that is it - at least for now - the possibilities and functions of the packages presented could fill easily an own course and as I already said, at one point everyone gets its own EDA routine and has its own packages they want to work with. The important point is to always get an overview over your data and to always check for interesting patterns in your data before conducting substantial analysis.

4.4 Exercises EDA

4.4.1 Exercise 1: Standard Descriptive Statistics

In this exercise we will work with the built-in iris package in R:

a. Calculate the mode, mean and the median for the iris$Sepal.Length variable

b. Calculate the interquartile range, variance and the standard deviation for iris$Sepal.Length

c. Calculate all five measures at once by using a function that does so (Choose by yourself, which one you want to use)

4.4.2 Exercise 2: Contingency Tables and Correlations

a. Make a Contingency Table for esoph$agegp and esoph$alcgp

b. Cut down the iris dataset to Sepal.Length, Sepal.Width, Petal.Length and Petal.Width and save it in an object called iris_numeric.

c. Make a correlation matrix with iris_numeric

d. Make the correlation matrix prettyChapter 4: Exploratory Data Analysis

4.4.3 Exercise 3: Working with packages

a. Use a function to get an overview of the dataset mtcars

b. Have a look at the structure of the missing values in mtcars

c. Make an automatized EDA report for mtcars!