Chapter 8 Solutions Exercises

pacman::p_load("tidyverse", "gapminder", "babynames", "sjPlot", "ggridges")

8.1 Chapter 1: Fundamentals

8.1.1 Exercise 1: Making your first Vector

Create a vector called my_vector with the values 1,2,3 and check is class.

#create the vector  
my_vector <- c(1,2,3)

#check the class 
class(my_vector)

8.1.2 Exercise 2: Making your first matrix

Create a Matrix called student. This should contain information about the name, age and major. Make three vectors with three entries and bind them together to a the matrix student. Print the matrix. Choose the three names, age, and major by yourself:

#Create the vectors   
name <- c("James", "Elif", "Jonas")
age <- c("19", "17", "24")
major <- c("Political Science", "Data Science", "Physics")
  
#Create the matrix 
student <- cbind(name, age, major)

#Or with the matrix command
student <- matrix(name, age, major)
  
#Print the matrix 
print(student)

8.1.3 Exercise 3: ifelse function

Write an ifelse statement that checks if a given number is positive or negative. If the number is positive or 0, print “Number is positive”, otherwise print “Number is negative”. Feel free to decide if you want to use the ifelse function or the ifelse condition.

#Assigning the number to the object "number" 
number <- 0

#With ifelse() function
ifelse(number >= 0, "Number is positive", "Number is negative")

#With ifelse condition

if (number > 0) {
  print("Positive Number")
} else {
  print("Negative Number")
}

8.1.4 Exercise 4: ifelse ladders

Write an if-else ladder that categorizes a student’s grade based on their score. The grading criteria are as follows:

Score >= 90: “A” Score >= 80 and < 90: “B” Score >= 70 and < 80: “C” Score >= 60 and < 70: “D” Score < 60: “F”.

#Defining the vector score 
Score <- 90

#Defing ifelse ladder with ifelse() function
ifelse(Score >= 90, "A", 
       ifelse(Score >= 80 & Score < 90, "B",
              ifelse(Score >= 70 & Score < 80, "C",
                     ifelse(Score >= 60 & Score < 70, "D", 
                            ifelse(Score < 60, "F")))))

#Defining it with a ifelse condition
if (score >= 90) {
  print("A") 
} else if (score >= 80 & score < 90) {
  print("B")
} else if (score >= 70 & score < 80) {
  print("C")
} else if (score >= 60 & score < 70) {
  print("D") 
} else if (score < 60) {
  print("F")
}

8.2 Chapter 2: Data Manipulation

pacman::p_load("tidyverse")

8.2.1 Exercise 1: Let’s wrangle kid

You are interested in discrimination and the perception of the judicial. More specifically, you want to know if people, who fell discriminated evaluate courts differently. Below you see a table with all variables you want to include in your analysis:

Variable Description Scales
idnt Respondent’s identification number unique number from 1-9000
year The year when the survey was conducted only 2020
cntry Country BE, BG, CH, CZ, EE, FI, FR,GB, GR, HR, HU, IE, IS, IT, LT,NL, NO, PT, SI, SK
agea Age of the Respondent, calculated

Number of Age = 15-90

999 = Not available

gndr Gender

1 = Male;

2 = Female;

9 = No answer

happy How happy are you

0 (Extremly unhappy) - 10 (Extremly happy);

77 = Refusal;

88 = Don’t Know;

99 = No answer

eisced Highest level of education, ES - ISCED

0 = Not possible to harmonise into ES-ISCED;

1 (ES-ISCED I , less than lower secondary) - 7 (ES-ISCED V2, higher tertiary education, => MA level;

55 = Other;

77 = Refusal;

88 = Don’t know;

99 = No answer

netusoft Internet use, how often

1 (Never) - 5 (Every day);

7 = Refusal;

8 = Don’t know;

9 = No answer

trstprl Most people can be trusted or you can’t be too careful

0 (You can’t be too careful) - 10 (Most people can be trusted);

77 = Refusal;

88 = Don’t Know;

99 = No answer

lrscale Left-Right Placement

0 (Left) - 10 (Right);

77 = Refusal;

88 = Don’t know;

99 = No answer

  1. Wrangle the data, and assign it to an object called ess.
  2. Select the variables you need
  3. Filter for Austria, Belgium, Denmark, Georgia, Iceland and the Russian Federation
  4. Have a look at the codebook and code all irrelevant values as missing. If you have binary variables recode them from 1, 2 to 0 to 1
  5. You want to build an extremism variable: You do so by subtracting 5 from the from the variable and squaring it afterwards. Call it extremism
  6. Rename the variables to more intuitive names, don’t forget to name binary varaibles after the category which is on 1
  7. drop all missing values
  8. Check out your new dataset
ess <- d1 %>% 
  select(cntry, dscrgrp, agea, gndr, eisced, lrscale) %>% #a
  filter(cntry %in% c("AT", "BE", "DK", "GE", "IS","RU")) %>% #b
  mutate(dscrgrp = case_when( #c
    dscrgrp == 1 ~ 0, 
    dscrgrp == 2 ~ 1,
    dscrgrp %in% c(7, 8, 9) ~ NA_real_, 
    TRUE ~ dscrgrp),
    agea = case_when(
    agea == 999 ~ NA_real_, 
    TRUE ~ agea),
    gndr = case_when(
      gndr == 1 ~ 0, 
      gndr == 2 ~ 1,
      gndr == 9 ~ NA_real_
    ),
    eisced = case_when( 
      eisced %in% c(55, 77, 88, 99) ~ NA_real_,
      TRUE ~ eisced),
    lrscale = case_when(
      lrscale %in% c(77, 88, 99) ~ NA_real_,
      TRUE ~ lrscale) #d, you could do this step also in a separate mutate function if you think that is more intuitive 
  ) %>% 
  mutate(extremism = (lrscale - 5)^2) %>% 
  rename(discriminated = dscrgrp,  
         court = cttresa,
         age = agea, 
         female = gndr, 
         education = eisced, 
         lrscale = lrscale, 
         extremism = extremism
         ) %>% 
  drop_na()

#Checking the dataset 
head(ess)

8.3 Chapter 3: Data Visualisation

In this exercise Section, we will work with the babynames and the iris package. This is a classic built-in package in R, which contains data from the Ronald Fisher’s 1936 Study “The use of multiple measurements in taxonomic problems”. It contains three plant species and four measured features for each species. Let us get an overview of the package:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length  
##  Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median :5.800   Median :3.000   Median :4.350  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   Petal.Width          Species  
##  Min.   :0.100   setosa    :50  
##  1st Qu.:0.300   versicolor:50  
##  Median :1.300   virginica :50  
##  Mean   :1.199                  
##  3rd Qu.:1.800                  
##  Max.   :2.500

8.3.1 Exercise 1: Distributions

a. Plot a Chart, which shows the distribution of Sepal.Length over the setosa Species. Choose the type of distribution chart for yourself. HINT: Prepare the data first and then plot it.

#Filtering for setosa
d1 <- iris %>%
  filter(Species == "setosa")

#Histogram
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#Density Plot
ggplot(iris, aes(x = Sepal.Length)) +
  geom_density()

#Boxplot
ggplot(iris, aes(x = Sepal.Length)) +
  geom_boxplot()

# EDIT: You could also do everything in one step as shown below

#Histogram
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_histogram()

#Density Plot
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_density()

#Boxplot
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_boxplot()

b. Now I want you to add the two other Species to the Plot. Make Sure, that every Species has a unique color.

#Histogram
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#Density Plot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density()

#Boxplot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_boxplot()

#Violin Plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin()

#Ridgeline Plot
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
  geom_density_ridges()
## Picking joint bandwidth of 0.181

c. Make a nice Plot! Give the Plot a meaningful title, meaningful labels for the x-axis and the y-axis and play around with the colors.

#Histogram
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram() +
  labs(
    x = "Species",
    y = "Sepal Length",
    title = "Distribution of Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Dark2")
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#Density Plot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density() +
  labs(
    x = "Species",
    y = "Sepal Length",
    title = "Distribution of Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Dark2")

#Boxplot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(
    x = "Species",
    y = "Sepal Length",
    title = "Distribution of Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Dark2")

#Ridgeline Plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin() +
  labs(
    x = "Species",
    y = "Sepal Length",
    title = "Distribution of Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Dark2")

#Ridgeline Plot
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
  geom_density_ridges() +
  labs(
    x = "Species",
    y = "Sepal Length",
    title = "Distribution of Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Dark2")
## Picking joint bandwidth of 0.181

d. Interpret the Plot!

8.3.2 Exercise 2: Rankings

a. Calculate the average Petal.Length and plot it for every Species in a nice Barplot. HINT: You have to prepare the data again before you plot it. Decide for yourself if you want to do it vertically or horizontally.

#Prepare the data 
d1 <- iris %>%
  group_by(Species) %>%
  dplyr::summarize(PL_average = mean(Petal.Length))

#Check it
head(d1)  
## # A tibble: 3 × 2
##   Species    PL_average
##   <fct>           <dbl>
## 1 setosa           1.46
## 2 versicolor       4.26
## 3 virginica        5.55
#Plotting a horizontal barplot
ggplot(d1, aes(x = Species, y = PL_average, fill  = Species)) +
  geom_bar(stat = "identity")

#Plotting a vertical barplot
ggplot(d1, aes(x = PL_average, y = Species, fill = Species)) +
  geom_bar(stat = "identity") 

b. Make a nice Plot! Give the Plot a meaningful title, meaningful labels for the x-axis and the y-axis and play around with the colors.

#Plotting a horizontal barplot
ggplot(d1, aes(x = Species, y = PL_average, fill  = Species)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Species",
    y = "Average Sepal Length",
    title = "Average Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Set1")

#Plotting a vertical barplot
ggplot(d1, aes(x = PL_average, y = Species, fill = Species)) +
  geom_bar(stat = "identity") +
    labs(
    x = "Average Sepal Length",
    y = "Species",
    title = "Average Sepal Length across Species"
  ) +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Accent")

8.3.3 Exercise 3: Correlation

a. Make a scatter plot where you plot Sepal.length on the x-axis and Sepal.width on the y-axis. Make the plot for the species virginica

#Preparing the data
d1 <- iris %>%
  filter(Species == "virginica")

#Making the plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

b. Now I want you to add the species versicolor to the plot. The dots of this species should have a different color AND a different form.

#Preparing the Data 
d1 <- iris %>%
  filter(Species %in% c("virginica", "versicolor")) 

#Making the Plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width, 
               shape = Species, color = Species)) +
  geom_point() 

c. Make a nice plot! Add a theme, labels, and a nice title, increase the size of the forms and make play around with the colors.

#Making the Plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width, 
               shape = Species, color = Species)) +
  geom_point() +
  labs(
    title = "Relationship between Sepal Length and Sepal Width",
    x = "Sepal Length", 
    y = "Sepal Width"
  ) + 
  scale_color_brewer(palette = "Set1") +
  theme_classic()

8.4 Chapter 4: Exploratory Data Analysis

8.4.1 Exercise 1: Standard Descriptive Statistics

In this exercise we will work with the built-in iris package in R:

a. Calculate the mode, mean and the median for the iris$Sepal.Length variable

mean(iris$Sepal.Length) #Calculating the mean
## [1] 5.843333
median(iris$Sepal.Length) # Calculating the median
## [1] 5.8
#Calculating the mode
uniq_vals <- unique(iris$Sepal.Length)
freqs <- tabulate(iris$Sepal.Length)
uniq_vals[which.max(freqs)]
## [1] 5
# You could also define a function for the mode

#mode <- function (x) {
#  uniq_vals <- unique(x, na.rm = TRUE)
#  freqs <- tabulate(match(x, uniq_vals))
#  uniq_vals[which.max(freqs)]
#}
#mode(iris$Sepal.Length)

b. Calculate the interquartile range, variance and the standard deviation for iris$Sepal.Length

IQR(iris$Sepal.Length)
## [1] 1.3
var(iris$Sepal.Length) #Variance
## [1] 0.6856935
sd(iris$Sepal.Length) #Standard deviatio
## [1] 0.8280661
#Or we take just the squareroot of the variance to get the sd
#var(iris$Sepal.Length) %>%
#  sqrt()

c. Calculate all five measures at once by using a function that does so (Choose by yourself, which one you want to use)

psych::describe(iris$Sepal.Length) #psych package
##    vars   n mean   sd median trimmed  mad min max range skew
## X1    1 150 5.84 0.83    5.8    5.81 1.04 4.3 7.9   3.6 0.31
##    kurtosis   se
## X1    -0.61 0.07
skimr::skim(iris$Sepal.Length) #skim package
(#tab:chapter 4 exercise 1c)Data summary
Name iris$Sepal.Length
Number of rows 150
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 5.84 0.83 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
summarytools::descr(iris$Sepal.Length) #summarytools package
## Descriptive Statistics  
## iris$Sepal.Length  
## N: 150  
## 
##                     Sepal.Length
## ----------------- --------------
##              Mean           5.84
##           Std.Dev           0.83
##               Min           4.30
##                Q1           5.10
##            Median           5.80
##                Q3           6.40
##               Max           7.90
##               MAD           1.04
##               IQR           1.30
##                CV           0.14
##          Skewness           0.31
##       SE.Skewness           0.20
##          Kurtosis          -0.61
##           N.Valid         150.00
##                 N         150.00
##         Pct.Valid         100.00

8.4.2 Exercise 2: Contingency Tables and Correlations

a. Make a Contingency Table for esoph$agegp and esoph$alcgp

table(esoph$agegp, esoph$alcgp) #with base R
##        
##         0-39g/day 40-79 80-119 120+
##   25-34         4     4      3    4
##   35-44         4     4      4    3
##   45-54         4     4      4    4
##   55-64         4     4      4    4
##   65-74         4     3      4    4
##   75+           3     4      2    2
summarytools::ctable(esoph$agegp, esoph$alcgp) #with summarytools
## Cross-Tabulation, Row Proportions  
## agegp * alcgp  
## Data Frame: esoph  
## 
## ------- ------- ------------ ------------ ------------ ------------ -------------
##           alcgp    0-39g/day        40-79       80-119         120+         Total
##   agegp                                                                          
##   25-34            4 (26.7%)    4 (26.7%)    3 (20.0%)    4 (26.7%)   15 (100.0%)
##   35-44            4 (26.7%)    4 (26.7%)    4 (26.7%)    3 (20.0%)   15 (100.0%)
##   45-54            4 (25.0%)    4 (25.0%)    4 (25.0%)    4 (25.0%)   16 (100.0%)
##   55-64            4 (25.0%)    4 (25.0%)    4 (25.0%)    4 (25.0%)   16 (100.0%)
##   65-74            4 (26.7%)    3 (20.0%)    4 (26.7%)    4 (26.7%)   15 (100.0%)
##     75+            3 (27.3%)    4 (36.4%)    2 (18.2%)    2 (18.2%)   11 (100.0%)
##   Total           23 (26.1%)   23 (26.1%)   21 (23.9%)   21 (23.9%)   88 (100.0%)
## ------- ------- ------------ ------------ ------------ ------------ -------------
gtsummary::tbl_cross(data = esoph,
                     row = agegp, 
                     col = alcgp) #with gtsummary
alcgp
Total
0-39g/day 40-79 80-119 120+
agegp




    25-34 4 4 3 4 15
    35-44 4 4 4 3 15
    45-54 4 4 4 4 16
    55-64 4 4 4 4 16
    65-74 4 3 4 4 15
    75+ 3 4 2 2 11
Total 23 23 21 21 88

b. Cut down the iris dataset to Sepal.Length, Sepal.Width, Petal.Length and Petal.Width and save it in an object called iris_numeric.

iris_numeric <- iris %>% 
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)

c. Make a correlation matrix with iris_numeric

#With corrplot package
cor_iris <- cor(iris_numeric)
corrplot::corrplot(cor_iris, method = "color")

#With DataExplorer
DataExplorer::plot_correlation(iris_numeric)

d. Make the correlation matrix pretty

# With Corrplot
corrplot::corrplot(cor_iris, method = "circle", type = "upper", 
         tl.col = "black", tl.srt = 45)

# With DataExplorer
DataExplorer::plot_correlation(
  iris_numeric,
  ggtheme = theme_minimal(base_size = 14),
  title = "Correlation Matrix of Iris Numeric Variables"
)

8.4.3 Exercise 3: Working with packages

a. Use a function to get an overview of the dataset mtcars

skimr::skim(mtcars)
(#tab:chapter 4 exercise 3a)Data summary
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁
summarytools::dfSummary(mtcars)
## Data Frame Summary  
## mtcars  
## Dimensions: 32 x 11  
## Duplicates: 0  
## 
## ----------------------------------------------------------------------------------------------------------
## No   Variable    Stats / Values              Freqs (% of Valid)   Graph               Valid      Missing  
## ---- ----------- --------------------------- -------------------- ------------------- ---------- ---------
## 1    mpg         Mean (sd) : 20.1 (6)        25 distinct values     :                 32         0        
##      [numeric]   min < med < max:                                   : .               (100.0%)   (0.0%)   
##                  10.4 < 19.2 < 33.9                               . : :                                   
##                  IQR (CV) : 7.4 (0.3)                             : : :   .                               
##                                                                   : : : : :                               
## 
## 2    cyl         Mean (sd) : 6.2 (1.8)       4 : 11 (34.4%)       IIIIII              32         0        
##      [numeric]   min < med < max:            6 :  7 (21.9%)       IIII                (100.0%)   (0.0%)   
##                  4 < 6 < 8                   8 : 14 (43.8%)       IIIIIIII                                
##                  IQR (CV) : 4 (0.3)                                                                       
## 
## 3    disp        Mean (sd) : 230.7 (123.9)   27 distinct values     :                 32         0        
##      [numeric]   min < med < max:                                 . :                 (100.0%)   (0.0%)   
##                  71.1 < 196.3 < 472                               : : :   : : :                           
##                  IQR (CV) : 205.2 (0.5)                           : : :   : : :   .                       
##                                                                   : : : . : : : . :                       
## 
## 4    hp          Mean (sd) : 146.7 (68.6)    22 distinct values   . :                 32         0        
##      [numeric]   min < med < max:                                 : :                 (100.0%)   (0.0%)   
##                  52 < 123 < 335                                   : : : .                                 
##                  IQR (CV) : 83.5 (0.5)                            : : : :                                 
##                                                                   : : : : . .                             
## 
## 5    drat        Mean (sd) : 3.6 (0.5)       22 distinct values       :               32         0        
##      [numeric]   min < med < max:                                   : :               (100.0%)   (0.0%)   
##                  2.8 < 3.7 < 4.9                                    : : .                                 
##                  IQR (CV) : 0.8 (0.1)                             . : : :                                 
##                                                                   : : : : .                               
## 
## 6    wt          Mean (sd) : 3.2 (1)         29 distinct values         :             32         0        
##      [numeric]   min < med < max:                                       : :           (100.0%)   (0.0%)   
##                  1.5 < 3.3 < 5.4                                        : :                               
##                  IQR (CV) : 1 (0.3)                               : : : : :     .                         
##                                                                   : : : : : .   :                         
## 
## 7    qsec        Mean (sd) : 17.8 (1.8)      30 distinct values         :             32         0        
##      [numeric]   min < med < max:                                       :             (100.0%)   (0.0%)   
##                  14.5 < 17.7 < 22.9                                     : :                               
##                  IQR (CV) : 2 (0.1)                                 . : : : :                             
##                                                                   : : : : : : :   .                       
## 
## 8    vs          Min  : 0                    0 : 18 (56.2%)       IIIIIIIIIII         32         0        
##      [numeric]   Mean : 0.4                  1 : 14 (43.8%)       IIIIIIII            (100.0%)   (0.0%)   
##                  Max  : 1                                                                                 
## 
## 9    am          Min  : 0                    0 : 19 (59.4%)       IIIIIIIIIII         32         0        
##      [numeric]   Mean : 0.4                  1 : 13 (40.6%)       IIIIIIII            (100.0%)   (0.0%)   
##                  Max  : 1                                                                                 
## 
## 10   gear        Mean (sd) : 3.7 (0.7)       3 : 15 (46.9%)       IIIIIIIII           32         0        
##      [numeric]   min < med < max:            4 : 12 (37.5%)       IIIIIII             (100.0%)   (0.0%)   
##                  3 < 4 < 5                   5 :  5 (15.6%)       III                                     
##                  IQR (CV) : 1 (0.2)                                                                       
## 
## 11   carb        Mean (sd) : 2.8 (1.6)       1 :  7 (21.9%)       IIII                32         0        
##      [numeric]   min < med < max:            2 : 10 (31.2%)       IIIIII              (100.0%)   (0.0%)   
##                  1 < 2 < 8                   3 :  3 ( 9.4%)       I                                       
##                  IQR (CV) : 2 (0.6)          4 : 10 (31.2%)       IIIIII                                  
##                                              6 :  1 ( 3.1%)                                               
##                                              8 :  1 ( 3.1%)                                               
## ----------------------------------------------------------------------------------------------------------
gtsummary::tbl_summary(mtcars)
Characteristic N = 321
mpg 19.2 (15.4, 22.8)
cyl
    4 11 (34%)
    6 7 (22%)
    8 14 (44%)
disp 196 (121, 334)
hp 123 (96, 180)
drat 3.70 (3.08, 3.92)
wt 3.33 (2.54, 3.65)
qsec 17.71 (16.89, 18.90)
vs 14 (44%)
am 13 (41%)
gear
    3 15 (47%)
    4 12 (38%)
    5 5 (16%)
carb
    1 7 (22%)
    2 10 (31%)
    3 3 (9.4%)
    4 10 (31%)
    6 1 (3.1%)
    8 1 (3.1%)
1 Median (Q1, Q3); n (%)

b. Have a look at the structure of the missing values in mtcars

# With a table
naniar::miss_var_summary(mtcars)
## # A tibble: 11 × 3
##    variable n_miss pct_miss
##    <chr>     <int>    <num>
##  1 mpg           0        0
##  2 cyl           0        0
##  3 disp          0        0
##  4 hp            0        0
##  5 drat          0        0
##  6 wt            0        0
##  7 qsec          0        0
##  8 vs            0        0
##  9 am            0        0
## 10 gear          0        0
## 11 carb          0        0
DataExplorer::introduce(mtcars)
##   rows columns discrete_columns continuous_columns
## 1   32      11                0                 11
##   all_missing_columns total_missing_values complete_rows
## 1                   0                    0            32
##   total_observations memory_usage
## 1                352         5928
SmartEDA::ExpData(mtcars, type = 1)
##                                           Descriptions
## 1                                   Sample size (nrow)
## 2                              No. of variables (ncol)
## 3                    No. of numeric/interger variables
## 4                              No. of factor variables
## 5                                No. of text variables
## 6                             No. of logical variables
## 7                          No. of identifier variables
## 8                                No. of date variables
## 9             No. of zero variance variables (uniform)
## 10               %. of variables having complete cases
## 11   %. of variables having >0% and <50% missing cases
## 12 %. of variables having >=50% and <90% missing cases
## 13          %. of variables having >=90% missing cases
##        Value
## 1         32
## 2         11
## 3         11
## 4          0
## 5          0
## 6          0
## 7          0
## 8          0
## 9          0
## 10 100% (11)
## 11    0% (0)
## 12    0% (0)
## 13    0% (0)
# Graphically
naniar::vis_miss(mtcars)

DataExplorer::plot_missing(mtcars)

c. Make an automized EDA report for mtcars!

dlookr::describe(mtcars)

DataExplorer::create_report(mtcars)

8.5 Chapter 5: Data Analysis

To understand linear regression, we do not have to load any complicated data. Let us assume, you are a market analyst and your customer is the production company of a series called “Breaking Thrones”. The production company wants to know how the viewers of the series judge the final episode. You conduct a survey and ask people on how satisfied they were with the season final and some social demographics. Here is your codebook:

Variable Description
id The id of the respondent
satisfaction The answer to the question “How satisfied were you with the final episode of Breaking Throne?”, where 0 is completely dissatisfied and 10 completely satisfied
age The age of the respondent
female The Gender of the respondent, where 0 = Male, 1 = Female

Let us generate the data:

#setting seed for reproduciability 
set.seed(123)  

#Generating the data 
final_BT <- data.frame(
  id = c(1:10),    
  satisfaction = round(rnorm(10, mean = 6, sd = 2.5)),    
  age = round(rnorm(10, mean = 25, sd = 5)),    
  female = rbinom(10, 1, 0.5)
  )  

#Print the Data Frame
print(final_BT)
##    id satisfaction age female
## 1   1            5  31      0
## 2   2            5  27      0
## 3   3           10  27      0
## 4   4            6  26      0
## 5   5            6  22      0
## 6   6           10  34      0
## 7   7            7  27      0
## 8   8            3  15      0
## 9   9            4  29      0
## 10 10            5  23      1

8.5.1 Exercise 1: Linear Regression with two variables

You want to know if age has an impact on the satisfaction with the last episode. You want to conduct a linear regression.

a. Calculate \(\beta_0\) and \(\beta_1\) by hand

#Calculating Covariance
cov <- sum((final_BT$age - mean(final_BT$age)) * (final_BT$satisfaction - mean(final_BT$satisfaction)))
  
#Calculating Variance
var <- (sum((final_BT$age - mean(final_BT$age))^2))

#Calculating beta_1
beta_1 <- cov/var

#printing beta_1
print(beta_1)
## [1] 0.2466586
#calculating beta 1
beta_0 <- mean(final_BT$satisfaction) - (beta_1 * mean(final_BT$age))

#Print both
print(beta_0)
## [1] -0.3377886
print(beta_1)
## [1] 0.2466586

b. Calculate \(\beta_0\) and \(\beta_1\) automatically with R

#Calculating it automatically
summary(m1 <- lm(satisfaction ~ age, 
                 data = final_BT))
## 
## Call:
## lm(formula = satisfaction ~ age, data = final_BT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8153 -1.0820 -0.2053  0.8530  3.6780 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.3378     3.4796  -0.097   0.9251  
## age           0.2467     0.1310   1.883   0.0964 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.058 on 8 degrees of freedom
## Multiple R-squared:  0.3072, Adjusted R-squared:  0.2206 
## F-statistic: 3.547 on 1 and 8 DF,  p-value: 0.0964

c. Interpret all quantities of your result: Standard Error, t-statistic, p-value, confidence intervals and the \(R^2\).

d. Check for influential outliers

#Cooks Distance can be calculated with a built-in function
final_BT$cooks_distance <- cooks.distance(m1) 

#Plotting it
ggplot(final_BT, aes(x = age, y = cooks_distance)) + 
  geom_point(colour = "darkgreen", size = 3, alpha = 0.5) + 
  labs(y = "Cook's Distance", x = "Independent Variable") + 
  geom_hline(yintercept = 1, linetype = "dashed") + 
  theme_bw()

8.5.2 Exercise 2: Multivariate Regression

a. Add the variable female to your regression

summary(m1 <- lm(satisfaction ~ age + female, 
                 data = final_BT))
## 
## Call:
## lm(formula = satisfaction ~ age + female, data = final_BT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8401 -1.1312 -0.0574  0.8001  3.6435 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.1712     3.8481  -0.044    0.966
## age           0.2418     0.1429   1.692    0.134
## female       -0.3895     2.3662  -0.165    0.874
## 
## Residual standard error: 2.196 on 7 degrees of freedom
## Multiple R-squared:  0.3099, Adjusted R-squared:  0.1127 
## F-statistic: 1.571 on 2 and 7 DF,  p-value: 0.2731

b. Interpret the Output. What has changed? What stood the same?

c. Make an interaction effect between age and female and interpret it!

summary(m2 <- lm(satisfaction ~ age + female + age:female, 
                 data = final_BT))
## 
## Call:
## lm(formula = satisfaction ~ age + female + age:female, data = final_BT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8401 -1.1312 -0.0574  0.8001  3.6435 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.1712     3.8481  -0.044    0.966
## age           0.2418     0.1429   1.692    0.134
## female       -0.3895     2.3662  -0.165    0.874
## age:female        NA         NA      NA       NA
## 
## Residual standard error: 2.196 on 7 degrees of freedom
## Multiple R-squared:  0.3099, Adjusted R-squared:  0.1127 
## F-statistic: 1.571 on 2 and 7 DF,  p-value: 0.2731

d. Plot the interaction and make the plot nice

plot_model(m2, type = "int") +
  scale_x_continuous(breaks = seq(0,60, 1)) + 
  labs(title = "Relationship of Age and Gender on Satisfaction with BT.",
       x = "Age", 
       y = "Satisfaction") +
  scale_color_manual(
    values = c("red", "blue"),
    labels = c("Female", "Male")
  ) +
  theme_sjplot() +
  theme(legend.position = "bottom", 
        legend.title = element_blank())
## Warning in predict.lm(model, newdata = data_grid, type =
## "response", se.fit = se, : prediction from rank-deficient
## fit; attr(*, "non-estim") has doubtful cases
## Scale for colour is already present.
## Adding another scale for colour, which will replace the
## existing scale.

8.6 Chapter 6: Loops and Functions Exercises

8.6.1 Exercise 1: Writing a loop

Write a for loop that prints the square of each number from 1 to 10

#Assigning an object for a better workflow 
number <- 10

#correct loop
for (i in 1:number) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100

8.6.2 Exercise 2: Writing a function

Write a function that takes the input x and squares it:

#Defining a function for squaring  
sq <- function (x) {          
  x_squared <- x^2
  return(x_squared)
}  

#Defining a vector containing a vector from 1 to 10  
numbers <- c(1:10)   

#Applying the number  
sq(numbers)

8.6.3 Exercise 3: The midnight Formula

This is the midnight formula separated in two equations:

\(x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\)

Make one function for the midnight formula, so the output are \(x_1\) a d \(x_2\). Test it with a = 2, b = -6, c = -8

Hint: You need two split up the formula into two equations with two outputs.

mnf <- function(a, b, c) {
  
  x_1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  
  print(x_1)
  
  x_2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  
  print(x_2)
  
}   
  
#Test it with these numbers (other numbers might throw an error if the number under the square root is negative)
mnf(2, 20, 8)