The topic of billionaires often sparks curiosity about the factors contributing to their wealth. To explore this, we adapt data from a specific year 2023 and attempt to fit the data through a multiple linear regression model.
The 2023 dataset on billionaires comprises billionaire demographics:
their name, age, gender, nationality, details of their residence, their
industry, and country-specific information: the country’s CPI and GDP,
the life expectancy of the country’s population, and other related
details such as a marker for self-made status, etc. Here,
net_worth
is regarded as the response variable,
age
, gender
, gdp_country
,
life_expectancy_country
and self_made
are
predictors. The reason for choosing these five variables as predictors
is that in the available data, these five variables are themselves
numerical data or categorical data that can be transformed into
indicator variables. Some of the numerical data, such as the country’s
CPI and GDP, have a correlation between them and their simultaneous use
would lead to the problem of multicollinearity, so only one of them is
chosen. Some of the categorical data, such as the country where the
millionaire is located and the industry from which the millionaire comes
from, have too many categories by themselves, and even if we can
transform them into indicator variables for the regression, this would
make our model too complicated, which is not conducive to the
presentation and interpretation of the results. presentation and
interpretation of the results. Meanwhile, for those factors not included
in the existing model, we have explored them more in other parts of this
website.
Based on the analysis in the data explanation and factors selection, the formula for the multiple linear regression model will be:
\[\text{net_worth} = \beta_0 + \beta_1 \times \text{age}+ \beta_2 \times \text{gender} + \beta_3 \times \text{gdp_country} + \beta_4 \times \text{life_expectancy_country} +\beta_5 \times \text{self_made} + \varepsilon\]
net_worth
: Net worth of the individual in
billions.age
: Age of the billionaire.gender
: Gender of the billionaire.gdp_country
: Gross Domestic Product in trillions of the
country they reside in.life_expectancy_country
: Life expectancy in the country
they reside in.self_made
: Whether or not they started from
nothing.According to the multiple linear regression outcome, among the five
predictors used, only age
and gdp_country
are
significant at the 0.05 significance level.
The coefficient for age is 0.0485, which suggests that with each additional year of age, a billionaire’s net worth increases by approximately 0.0485 billion, holding other factors constant.
The coefficient for country’s gdp is 0.0565 which indicates that a unit increase in the GDP of the country is associated with an increase in net worth by 0.0565 billion, giving other factors fixed.
# read the data
billionaires23 = read_csv("../data/billionaire_2023.csv")
# do the regression
lrmodel = lm(net_worth ~ age + gender + gdp_country + life_expectancy_country + self_made, data = billionaires23)
lrmodel |>
broom::tidy() |>
knitr::kable(digits = 4) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -3.5415 | 4.5554 | -0.7774 | 0.4370 |
age | 0.0485 | 0.0159 | 3.0484 | 0.0023 |
genderMale | 0.2595 | 0.6870 | 0.3777 | 0.7057 |
gdp_country | 0.0565 | 0.0225 | 2.5141 | 0.0120 |
life_expectancy_country | 0.0632 | 0.0556 | 1.1352 | 0.2564 |
self_madeTRUE | -0.9702 | 0.4990 | -1.9444 | 0.0520 |
The effectiveness of the model’s fit: Adjusted R-squared value is 0.0057, indicating that only 0.6% of the variance in the net worth of billionaires can be explained by the variables in the model. This suggests that the factors we selected, have limited predictive power for a billionaire’s net worth, there might be other factors that influence a billionaire’s net worth that are not captured in this model.
Multicollinearity check: Due to the low adjusted R-squared value,
we checked for multicollinearity among the indicators by calculating
their VIF (Variance Inflation Factor) values. All the indicators have
VIF values near 1, indicating no significant correlation between any
given indicator and the others. There is no multicollinearity issue in
the model. In detail,age
, gender
,
gdp_country
, life_expectancy_country
and
self_made
have VIF values 1.011, 1.141, 1.076, 1.010 and
1.208 respectively.
vif_table <- vif(lrmodel)
vif_showing = data.frame(
term = c("age", "gender", "gdp_country", "life_expectancy_country", "self_made"),
VIF_value = c(1.011, 1.141, 1.076, 1.010, 1.208) )
vif_showing |> knitr::kable() |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
term | VIF_value |
---|---|
age | 1.011 |
gender | 1.141 |
gdp_country | 1.076 |
life_expectancy_country | 1.010 |
self_made | 1.208 |
For multiple linear regression models, one assumption is that the error term follows a normal distribution with zero mean and equal variance. In the following part, three different graphs related to the residuals are created in order to validate whether the model satisfies this assumption.
# Calculate residuals
residuals <- resid(lrmodel)
# Q-Q Plot
ggplot(data.frame(Residuals = residuals), aes(sample = Residuals)) +
stat_qq_line(colour = "blue") +
stat_qq_point() +
labs(title = "Q-Q plot of residuals", x = "Theoretical quantiles", y = "Sample quantiles")
# Density Plot of Residuals
ggplot(data.frame(Residuals = residuals), aes(x = Residuals)) +
geom_density(fill = "blue", alpha = 0.5) +
labs(x = "Residuals", y = "Density", title = "Density plot of residuals")
# Residual Plot
ggplot(data.frame(Fitted = fitted(lrmodel), Residuals = residuals), aes(x = Fitted, y = Residuals)) +
geom_point() +
geom_hline(yintercept = 0, color = "blue") +
labs(x = "Fitted values", y = "Residuals", title = "Residual vs. fitted values")
The plot of residuals versus fitted values reflects several problems:
Ideally, the residuals should be randomly dispersed around the horizontal line (y = 0). In this plot, while there seems to be some level of randomness, there is a noticeable pattern where the residuals are not evenly distributed across the range of fitted values. This suggests that the model may not be capturing all the relevant patterns in the data.
Residuals clustering: The clustering of residuals around certain ranges of fitted values indicates potential issues with the model. It might be missing some key predictive information or nonlinear relationships.
Outliers: There seem to be several points that stand out from the general scatter of the residuals. These outliers can have a significant impact on the regression model, potentially skewing the results.
Homoscedasticity check: The homoscedasticity assumption of linear regression means the residuals should have constant variance across all levels of fitted values. The spread of residuals in this plot does not appear to be uniform, suggesting the presence of heteroscedasticity, which violates this assumption.
The analysis of the residual density plot, Q-Q plot, and residuals vs. fitted values suggests that the linear regression model previously built may not be a good fit for the data.
In this regression session of our entire work, we attempt to use
the multiple linear regression to fit the data. The response variable is
net_worth
, and the predictors include age
,
gender
, gdp_country
,
life_expectancy_country
and self_made
.
According to the model results, while age and the GDP of a billionaire’s
country appear to have a significant association with their net worth,
the overall model explains very little of the variation in net
worth,i.e.,with a low adjusted R-squared value 0.0057, indicating that
there are likely many other factors at play. At the same time, the
analysis of residuals through residuals density plot, Q-Q plot and
residuals vs. fitted values, the residuals do not have a zero mean and
equal variance with several outliers, which violates the assumption of
linear regression.
In summary, the multiple linear regression model, as configured in our analysis, is not a good way to fit the data and predict the net worth of billionaires. This inadequacy could be attributed to the simplicity of our linear model, suggesting that there are numerous avenues for refinement and enhancement. Additionally, it may be the case that the net worth does not have a linear relationship with the predictors used in our model. Exploring alternative modeling methods that better capture the underlying relationships in the data is a topic for future research.