bil_data <-
read_csv("../data/billionaire_gdp.csv") |>
distinct(year, full_name, .keep_all = TRUE)
bil_13_23 <-
bil_data |>
filter(year == 2013 | year == 2023)
We focused on billionaires who appeared on both the 2013 and 2023 lists, aiming to assess changes in their net worth over the span of ten years.
\(H_0: \mu_{2023} - \mu_{2013} = 0\)
\(H_1: \mu_{2023} - \mu_{2013} \neq 0\)
bil_name_13 <-
bil_data |>
filter(year == 2013) |>
select(full_name) |>
unique()
bil_name_23 <-
bil_data |>
filter(year == 2023) |>
select(full_name) |>
unique()
bil_diff_13_23 <-
inner_join(bil_name_13, bil_name_23) |>
inner_join(bil_13_23, by = "full_name") |>
select(full_name, year, net_worth) |>
pivot_wider(
names_from = year,
values_from = net_worth
)
t.test(bil_diff_13_23 |> pull(`2023`), bil_diff_13_23 |> pull(`2013`), paired = T) |>
broom::tidy() |>
knitr::kable(digits = 3)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
3.45 | 7.856 | 0 | 787 | 2.588 | 4.312 | Paired t-test | two.sided |
With the result shown above, the p-value is 0, which is sufficiently small to reject the null. We conclude that there is a difference in the net worth of billionaires on the list in both 2013 and 2023. Furthermore, the 95% confidence interval suggests a discernible increase in billionaires’ wealth over the past decade, estimated to range between 2.588 and 4.312 billion dollars.
When examining the data, we noticed that there is a considerable discrepancy in the numbers of female and male billionaires, with a notably smaller count among female billionaires. This led us to investigate whether a substantial difference exists in the actual wealth between these two groups. Our analysis specifically focused on billionaires in the year 2023 to conduct the test.
We first perform the F test to see if the variances between the two groups differ.
\(H_0: \sigma^2_{male} = \sigma^2_{female}\)
\(H_1: \sigma^2_{male} \neq \sigma^2_{female}\)
bil_gender <-
bil_data |>
filter(year == 2013) |>
drop_na(gender)
var.test(filter(bil_gender, gender == "Male") |> pull(net_worth),
filter(bil_gender, gender == "Female") |> pull(net_worth)) |>
broom::tidy() |>
knitr::kable(digits = 3)
estimate | num.df | den.df | statistic | p.value | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|
1.207 | 1281 | 137 | 1.207 | 0.159 | 0.928 | 1.53 | F test to compare two variances | two.sided |
Given the p-value of 0.159, we fail to reject the null under a 0.05 significance level. Thus, we conclude that the variances between the two groups are equal.
\(H_0: \mu_{male} = \mu_{female}\)
\(H_1: \mu_{male} \neq \mu_{female}\)
t.test(filter(bil_gender, gender == "Male") |> pull(net_worth),
filter(bil_gender, gender == "Female") |> pull(net_worth),
var.equal = TRUE) |>
broom::tidy() |>
knitr::kable(digits = 3)
estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|---|
-0.224 | 3.79 | 4.014 | -0.464 | 0.643 | 1418 | -1.171 | 0.723 | Two Sample t-test | two.sided |
From the result table, we got a p-value of 0.643, which is much greater than 0.05. Thus, with a significance level of 0.05, we fail to reject the null and conclude that in 2023, the wealth of female and male billionaires is the same.
Before conducting the ANOVA test, we wanted to check if all assumptions of the test are met. Hence, we created two diagnostic plots: the distribution of residuals and the boxplot.
bil_age <-
bil_data |>
drop_na(age) |>
filter(year == 2023) |>
mutate(
age_group =
case_when(
age <= 55 ~ "55 and below",
age > 55 & age <= 65 ~ "55~65",
age > 65 & age <= 75 ~ "65~75",
age > 75 ~ "over 75"),
age_group =
fct_relevel(age_group, c("55 and below", "55~65", "65~75", "over 75")))
# distribution of residuals
model_ori = lm(net_worth ~ age_group, data = bil_age)
model_ori[[2]] |>
as.data.frame() |>
rename("residuals" = "model_ori[[2]]") |>
plot_ly(x = ~residuals, type = "histogram",
marker = list(color = "rgba(68, 1, 84, 0.6)",
line = list(color = "rgb(68, 1, 84)", width = 2)), nbinsx = 50) |>
layout(title = "Distribution of residuals",
xaxis = list(title = "Residuals"),
yaxis = list(title = "Count"),
font = list(family = "Helvetica"))
# boxplot
bil_age |>
plot_ly(x = ~age_group, y =~net_worth, color = ~age_group,
type = "box", colors = "viridis") |>
layout(title = "Net worth of different age groups",
xaxis = list(title = "Age of billionaires"),
yaxis = list(title = "Net worth (billion dollars)"),
font = list(family = "Helvetica"))
From the boxplot above, we can tell that the distribution of
net_worth
in each age group is heavily right-skewed, which
violates the normality and homoscedasticity assumptions of ANOVA. Hence,
we have to transform the data before performing the ANOVA test.
L1 = boxcox(model_ori ,objective.name = "Shapiro-Wilk",optimize = TRUE)$lambda
bil_age <-
bil_age |>
mutate(
net_worth_t = (net_worth^(L1)-1)/L1)
To make the data distribution closer to a normal distribution, we
used the Box-Cox method to transform the data. We found the \(\lambda\) = -0.59 that provides the best
approximation for the normal distribution of our response variable,
net_worth
. We then transformed our data using the
formula:
\(Y_{transformed} = \frac{Y^{\lambda}-1}{\lambda}\)
The transformed data is used for the subsequent analysis.
# distribution of residuals
model_t = lm(net_worth_t ~ age_group, data = bil_age)
model_t[[2]] |>
as.data.frame() |>
rename("residuals" = "model_t[[2]]") |>
plot_ly(x = ~residuals, type = "histogram",
marker = list(color = "rgba(68, 1, 84, 0.6)",
line = list(color = "rgb(68, 1, 84)", width = 2)), nbinsx = 30) |>
layout(title = "Distribution of residuals after transformation",
xaxis = list(title = "Residuals (transformed)"),
yaxis = list(title = "Count"),
font = list(family = "Helvetica"))
# boxplot
bil_age |>
plot_ly(x = ~age_group, y =~net_worth_t, color = ~age_group,
type = "box", colors = "viridis") |>
layout(title = "Net worth of different age groups after transformation",
xaxis = list(title = "Age of billionaires"),
yaxis = list(title = "Net worth (transformed)"),
font = list(family = "Helvetica"))
After transforming the data, we re-created the two diagnostic plots. We can tell that the transformed data have a distribution closer to normal and the variances among age groups are approximately equal. Thus, we may safely proceed and conduct the ANOVA test.
\(H_0: \mu_{55\space and \space below} = \mu_{55\sim65} = \mu_{65\sim75} = \mu_{over \space 75}\)
\(H_1:\) at least one \(\mu\) differs
summary(model_t) |> broom::glance() |> knitr::kable(digits = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual | nobs |
---|---|---|---|---|---|---|---|
0.018 | 0.017 | 0.388 | 15.509 | 0 | 3 | 2569 | 2573 |
The result table shows a p-value of 0, which is sufficiently small to reject the null. Thus, we conclude that at least one age group has a transformed average wealth differing from the rest age groups.
Since the ANOVA result suggests that not all age groups have the same mean wealth, we now want to know which group(s) differ in mean wealth. Hence, we conducted Tukey HSD (Honestly Significant Difference) test.
res <- aov(net_worth_t ~ age_group, data = bil_age)
mydf <-
TukeyHSD(res) |>
broom::tidy() |>
mutate(contrast = recode(contrast, "65~75-55~65" = "65\\~75-55~65"))
mydf |> knitr::kable(digits = 3)
term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
---|---|---|---|---|---|---|
age_group | 55~65-55 and below | 0 | 0.055 | 0.000 | 0.110 | 0.049 |
age_group | 65~75-55 and below | 0 | 0.094 | 0.038 | 0.150 | 0.000 |
age_group | over 75-55 and below | 0 | 0.147 | 0.089 | 0.204 | 0.000 |
age_group | 65~75-55~65 | 0 | 0.038 | -0.016 | 0.092 | 0.262 |
age_group | over 75-55~65 | 0 | 0.091 | 0.036 | 0.147 | 0.000 |
age_group | over 75-65~75 | 0 | 0.053 | -0.003 | 0.109 | 0.073 |
The result table shows that several comparisons are significant under the 0.05 significance level. The group of age 55 and below differs from the 65-75 and the over 75 years old groups, the group of age 55-65 differs from the over 75 years old group as well. Furthermore, distinctions in wealth were observed between the 65-75 and over 75 years old groups. Recognizing the complexity of interpreting the tabulated data, we further annotated the boxplot with the significance letters for better visualization, where different letters imply statistically significant differences in wealth between the respective age groups.
letter <- multcompLetters4(res, TukeyHSD(res))
letter_df <- as.data.frame.list(letter[[1]])
dt <- bil_age |>
group_by(age_group) |>
summarize(pos = quantile(net_worth_t)[4] + 0.06) |>
mutate(letter = letter_df |> pull(Letters))
a <- list(
x = dt$age_group,
y = dt$pos,
text = dt$letter,
showarrow = FALSE,
xanchor = 'left',
font = list(size = 16)
)
bil_age |>
plot_ly(x = ~age_group, y =~net_worth_t, color = ~age_group,
type = "box", colors = "viridis") |>
layout(title = "Net worth of different age groups after transformation",
xaxis = list(title = "Age of billionaires"),
yaxis = list(title = "Net worth (transformed)"),
annotations = a,
font = list(family = "Helvetica"))
The wealth of billionaires has increased from 2013 to 2023, with a 95% confidence interval from 2.588 to 4.312 billion dollars.
There is no significant difference in the net worth between female and male billionaires in 2023.
Not all billionaires in four age groups have the same wealth amount. Among the six pairwise comparisons, four of them show a significant difference in wealth.