Hypothesis Testing

bil_data <-
  read_csv("../data/billionaire_gdp.csv") |> 
  distinct(year, full_name, .keep_all = TRUE)

bil_13_23 <-
  bil_data |> 
  filter(year == 2013 | year == 2023)

Paired t-test: analyze the wealth variation among billionaires between 2013 and 2023

We focused on billionaires who appeared on both the 2013 and 2023 lists, aiming to assess changes in their net worth over the span of ten years.

\(H_0: \mu_{2023} - \mu_{2013} = 0\)

\(H_1: \mu_{2023} - \mu_{2013} \neq 0\)

bil_name_13 <-
  bil_data |> 
  filter(year == 2013) |> 
  select(full_name) |> 
  unique()

bil_name_23 <-
  bil_data |> 
  filter(year == 2023) |> 
  select(full_name) |> 
  unique()

bil_diff_13_23 <-
  inner_join(bil_name_13, bil_name_23) |> 
  inner_join(bil_13_23, by = "full_name") |> 
  select(full_name, year, net_worth) |> 
  pivot_wider(
    names_from = year,
    values_from = net_worth
  )

t.test(bil_diff_13_23 |> pull(`2023`), bil_diff_13_23 |> pull(`2013`), paired = T) |> 
  broom::tidy() |> 
  knitr::kable(digits = 3)

estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
3.45	7.856	0	787	2.588	4.312	Paired t-test	two.sided

With the result shown above, the p-value is 0, which is sufficiently small to reject the null. We conclude that there is a difference in the net worth of billionaires on the list in both 2013 and 2023. Furthermore, the 95% confidence interval suggests a discernible increase in billionaires’ wealth over the past decade, estimated to range between 2.588 and 4.312 billion dollars.

T-test: compare the wealth distribution between gender groups among billionaires in 2023

When examining the data, we noticed that there is a considerable discrepancy in the numbers of female and male billionaires, with a notably smaller count among female billionaires. This led us to investigate whether a substantial difference exists in the actual wealth between these two groups. Our analysis specifically focused on billionaires in the year 2023 to conduct the test.

Test for equal variance

We first perform the F test to see if the variances between the two groups differ.

\(H_0: \sigma^2_{male} = \sigma^2_{female}\)

\(H_1: \sigma^2_{male} \neq \sigma^2_{female}\)

bil_gender <-
  bil_data |> 
  filter(year == 2013) |> 
  drop_na(gender)

var.test(filter(bil_gender, gender == "Male") |> pull(net_worth), 
       filter(bil_gender, gender == "Female") |> pull(net_worth)) |> 
  broom::tidy() |> 
  knitr::kable(digits = 3)

estimate	num.df	den.df	statistic	p.value	conf.low	conf.high	method	alternative
1.207	1281	137	1.207	0.159	0.928	1.53	F test to compare two variances	two.sided

Given the p-value of 0.159, we fail to reject the null under a 0.05 significance level. Thus, we conclude that the variances between the two groups are equal.

Perform the t-test

\(H_0: \mu_{male} = \mu_{female}\)

\(H_1: \mu_{male} \neq \mu_{female}\)

t.test(filter(bil_gender, gender == "Male") |> pull(net_worth), 
       filter(bil_gender, gender == "Female") |> pull(net_worth),
       var.equal = TRUE) |> 
  broom::tidy() |> 
  knitr::kable(digits = 3)

estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.224	3.79	4.014	-0.464	0.643	1418	-1.171	0.723	Two Sample t-test	two.sided

From the result table, we got a p-value of 0.643, which is much greater than 0.05. Thus, with a significance level of 0.05, we fail to reject the null and conclude that in 2023, the wealth of female and male billionaires is the same.

ANOVA: compare the wealth among billionaires in different age groups in 2023

Diagnostics

Before conducting the ANOVA test, we wanted to check if all assumptions of the test are met. Hence, we created two diagnostic plots: the distribution of residuals and the boxplot.

bil_age <-
  bil_data |> 
  drop_na(age) |> 
  filter(year == 2023) |> 
  mutate(
    age_group = 
      case_when(
        age <= 55 ~ "55 and below",
        age > 55 & age <= 65 ~ "55~65",
        age > 65 & age <= 75 ~ "65~75",
        age > 75 ~ "over 75"),
    age_group = 
      fct_relevel(age_group, c("55 and below", "55~65", "65~75", "over 75")))
# distribution of residuals
model_ori = lm(net_worth ~ age_group, data = bil_age)

model_ori[[2]] |> 
  as.data.frame() |> 
  rename("residuals" = "model_ori[[2]]") |> 
  plot_ly(x = ~residuals, type = "histogram", 
          marker = list(color = "rgba(68, 1, 84, 0.6)", 
          line = list(color = "rgb(68, 1, 84)", width = 2)), nbinsx = 50) |> 
  layout(title = "Distribution of residuals",
         xaxis = list(title = "Residuals"),
         yaxis = list(title = "Count"),
         font = list(family = "Helvetica"))

# boxplot
bil_age |> 
  plot_ly(x = ~age_group, y =~net_worth, color = ~age_group,
          type = "box", colors = "viridis") |>
  layout(title = "Net worth of different age groups",
         xaxis = list(title = "Age of billionaires"),
         yaxis = list(title = "Net worth (billion dollars)"),
         font = list(family = "Helvetica"))

From the boxplot above, we can tell that the distribution of net_worth in each age group is heavily right-skewed, which violates the normality and homoscedasticity assumptions of ANOVA. Hence, we have to transform the data before performing the ANOVA test.

Data transformation

L1 = boxcox(model_ori ,objective.name = "Shapiro-Wilk",optimize = TRUE)$lambda

bil_age <-
  bil_age |> 
  mutate(
    net_worth_t = (net_worth^(L1)-1)/L1)

To make the data distribution closer to a normal distribution, we used the Box-Cox method to transform the data. We found the \(\lambda\) = -0.59 that provides the best approximation for the normal distribution of our response variable, net_worth. We then transformed our data using the formula:

\(Y_{transformed} = \frac{Y^{\lambda}-1}{\lambda}\)

The transformed data is used for the subsequent analysis.

# distribution of residuals
model_t = lm(net_worth_t ~ age_group, data = bil_age)

model_t[[2]] |> 
  as.data.frame() |> 
  rename("residuals" = "model_t[[2]]") |> 
  plot_ly(x = ~residuals, type = "histogram", 
          marker = list(color = "rgba(68, 1, 84, 0.6)", 
          line = list(color = "rgb(68, 1, 84)", width = 2)), nbinsx = 30) |> 
  layout(title = "Distribution of residuals after transformation",
         xaxis = list(title = "Residuals (transformed)"),
         yaxis = list(title = "Count"),
         font = list(family = "Helvetica"))

# boxplot
bil_age |> 
  plot_ly(x = ~age_group, y =~net_worth_t, color = ~age_group,
          type = "box", colors = "viridis") |>
  layout(title = "Net worth of different age groups after transformation",
         xaxis = list(title = "Age of billionaires"),
         yaxis = list(title = "Net worth (transformed)"),
         font = list(family = "Helvetica"))

After transforming the data, we re-created the two diagnostic plots. We can tell that the transformed data have a distribution closer to normal and the variances among age groups are approximately equal. Thus, we may safely proceed and conduct the ANOVA test.

ANOVA

\(H_0: \mu_{55\space and \space below} = \mu_{55\sim65} = \mu_{65\sim75} = \mu_{over \space 75}\)

\(H_1:\) at least one \(\mu\) differs

summary(model_t) |> broom::glance() |> knitr::kable(digits = 3)

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual	nobs
0.018	0.017	0.388	15.509	0	3	2569	2573

The result table shows a p-value of 0, which is sufficiently small to reject the null. Thus, we conclude that at least one age group has a transformed average wealth differing from the rest age groups.

Post-hoc analysis

Since the ANOVA result suggests that not all age groups have the same mean wealth, we now want to know which group(s) differ in mean wealth. Hence, we conducted Tukey HSD (Honestly Significant Difference) test.

res <- aov(net_worth_t ~ age_group, data = bil_age)
mydf <- 
  TukeyHSD(res) |> 
  broom::tidy() |> 
  mutate(contrast = recode(contrast, "65~75-55~65" = "65\\~75-55~65"))
mydf |> knitr::kable(digits = 3)

term	contrast	estimate	conf.low	conf.high	adj.p.value
age_group	55~65-55 and below	0.055	0.000	0.110	0.049
age_group	65~75-55 and below	0.094	0.038	0.150	0.000
age_group	over 75-55 and below	0.147	0.089	0.204	0.000
age_group	65~75-55~65	0.038	-0.016	0.092	0.262
age_group	over 75-55~65	0.091	0.036	0.147	0.000
age_group	over 75-65~75	0.053	-0.003	0.109	0.073

The result table shows that several comparisons are significant under the 0.05 significance level. The group of age 55 and below differs from the 65-75 and the over 75 years old groups, the group of age 55-65 differs from the over 75 years old group as well. Furthermore, distinctions in wealth were observed between the 65-75 and over 75 years old groups. Recognizing the complexity of interpreting the tabulated data, we further annotated the boxplot with the significance letters for better visualization, where different letters imply statistically significant differences in wealth between the respective age groups.

letter <- multcompLetters4(res, TukeyHSD(res))
letter_df <- as.data.frame.list(letter[[1]])
dt <- bil_age |> 
  group_by(age_group) |> 
  summarize(pos = quantile(net_worth_t)[4] + 0.06) |>  
  mutate(letter = letter_df |> pull(Letters))

a <- list(
  x = dt$age_group,
  y = dt$pos,
  text = dt$letter,
  showarrow = FALSE,
  xanchor = 'left',
  font = list(size = 16)
)

bil_age |> 
  plot_ly(x = ~age_group, y =~net_worth_t, color = ~age_group,
          type = "box", colors = "viridis") |>
  layout(title = "Net worth of different age groups after transformation",
         xaxis = list(title = "Age of billionaires"),
         yaxis = list(title = "Net worth (transformed)"),
         annotations = a,
         font = list(family = "Helvetica"))

Summary

The wealth of billionaires has increased from 2013 to 2023, with a 95% confidence interval from 2.588 to 4.312 billion dollars.
There is no significant difference in the net worth between female and male billionaires in 2023.
Not all billionaires in four age groups have the same wealth amount. Among the six pairwise comparisons, four of them show a significant difference in wealth.