Delivering dataset preprocessing to ensure the feasibility of merging different datasets for downstream analysis.
library(tidyverse)
library(rvest)
library(httr)
library(readxl)
library(janitor)
library(fuzzyjoin)
The country code dataset contains country names along with their corresponding standard two-letter, three-letter, and numeric codes. The dataset can be accessed here.
Key steps in cleaning this dataset
country_code <-
read_csv("../data/raw/country_code.csv") |>
clean_names() |>
rename(
"country_name" = "english_short_name_lower_case"
) |>
mutate(
country_name = recode(country_name,
"United States Of America" = "United States",
"Virgin Islands, British" = "British Virgin Islands",
"Korea, Republic of (South Korea)" = "South Korea",
"Virgin Islands, U.S." = "U.S. Virgin Islands",
"Tanzania, United Republic of" = "Tanzania",
"Turks and Caicos" = "Turks and Caicos Islands",
"Macao" = "Macau")) |>
select(country_name, alpha_3_code)
The raw dataset contains information on global billionaires from 1997 to 2023. It offers a glimpse into the distribution of wealth, industries of operation, and demographic profiles of billionaires on a global scale. The dataset is available here.
Key steps in cleaning this dataset
net_worth
to numerical valuecountry_code
dataset for further
analysisbil_2010_2023 <-
read_csv("../data/raw/billionaires_1997_2023.csv") |>
filter(year >= 2010)
bil_2010_2023_clean <-
bil_2010_2023 |>
mutate(
net_worth = str_replace_all(net_worth, " B", ""),
net_worth = as.numeric(net_worth),
industries = str_replace_all(business_industries, "[\\['\\]]", ""),
country_of_residence =
recode(country_of_residence,
"Eswatini (Swaziland)" = "Swaziland",
"Scotland" = "United Kingdom",
"Czechia" = "Czech Republic",
"Hong Kong SAR" = "Hong Kong"),
industries =
recode(industries,
"Fashion and Retail" = "Fashion & Retail",
"Finance and Investments" = "Finance & Investments",
"Food and Beverage" = "Food & Beverage",
"Healthcare" = "Health care",
"Media" = "Media & Entertainment")) |>
left_join(country_code, c("country_of_residence" = "country_name")) |>
select(-c(month, rank, last_name, first_name, birth_date, business_category,
business_industries, organization_name, position_in_organization))
The billionaires 2023 dataset includes statistics on global billionaires, such as information about their wealth, industries, and personal details. This dataset also contains more detailed country information of which each billionaire resides, which is useful for the following analysis. The dataset is available here.
Key steps in cleaning this dataset
final_worth
to
billiongdp_country
to
trilliongender
variablebil_gdp_2023 <-
read_csv("../data/raw/billionaires_2023.csv")
bil_gdp_2023_clean <-
bil_gdp_2023 |>
clean_names() |>
mutate(
net_worth = final_worth / 1000,
gdp_country = str_replace_all(gdp_country, "[$,]", ""),
gdp_country = as.numeric(gdp_country) / 1e12,
gender =
case_match(
gender,
"F" ~ "Female",
"M" ~ "Male")) |>
select(net_worth, full_name = person_name, age, gender,
country_of_citizenship, country_of_residence = country,
city_of_residence = city, industries, self_made, cpi_country,
cpi_change_country, gdp_country, life_expectancy_country)
The GDP dataset encompasses GDP information spanning 262 distinct countries or regions from 1960 to 2022. The dataset can be downloaded here.
Key steps in cleaning this dataset
country_gdp <-
read_csv("../data/raw/country_gdp.csv")
country_gdp_clean <-
country_gdp |>
filter(year >= 2010) |>
mutate(gdp = value / 1e12,
name = country_name,
code = country_code) |>
select(name, code, year, gdp)
Given the issue that the region Taiwan
is missing GDP
data in raw dataset, we used a supplementary dataset to remedy it.
The Taiwan GDP data can be accessed here. This table contains Taiwan GDP data from 1960 to 2022.
Key steps in cleaning this dataset
fetch_tw_gdp = function(){
url = "https://countryeconomy.com/gdp/taiwan"
tw_gdp_html = read_html(url)
mydata <- tw_gdp_html|> html_table()
return(mydata[[1]])
}
extract_gdp = function(string){
str_vec = str_extract(string, "\\d*,\\d*") |>
str_split(",") |>
nth(1)
s = ""
for(e in str_vec){
s = str_c(s, e)
}
return(s)
}
tw_gdp = fetch_tw_gdp() |>
janitor::clean_names() |>
mutate(gdp = map(annual_gdp_2, extract_gdp) |> as.numeric(),
year = date |> as.numeric(),
name = "Taiwan, China",
code = "TWN",
gdp = gdp/1e6) |>
filter(year >= 2010) |>
select(name, code, year, gdp)
country_gdp_clean = bind_rows(country_gdp_clean, tw_gdp)
This dataset contains GDP data for each industry in US from 2017 to 2022. The dataset can be downloaded here.
Key steps in cleaning this dataset
indus_gdp <-
read_excel("../data/raw/usa_industry_gdp.xlsx", sheet = 18, skip = 4)
indus_gdp_clean <-
indus_gdp |>
slice(3:30) |>
rename(
"industries" = "...2",
"2017" = "...3",
"2018" = "...4",
"2019" = "...5",
"2020" = "...6",
"2021" = "...7",
"2022" = "2020...8") |>
filter(!(industries %in% c("Finance, insurance, real estate, rental, and leasing",
"Educational services, health care, and social assistance",
"Arts, entertainment, recreation, accommodation, and food services"))) |>
select(industries, `2017`:`2022`) |>
pivot_longer(`2017`:`2022`, names_to = "year", values_to = "industry_gdp") |>
mutate(year = as.numeric(year),
industry_gdp = industry_gdp / 1000)
# extract key words to merge two datasets by industries variable
#
# industry names from industry_gdp_clean
df1 <- indus_gdp_clean |> select(industries) |> unique()
# industry names from bil_2013_2023_clean (only for USA from 2017 to 2022!)
df2 <-
bil_2010_2023_clean |>
filter(country_of_residence == "United States" & year >= 2017 & year <= 2022) |>
select(industries) |>
drop_na() |>
unique() |>
mutate(
categories = industries,
# rename some industries for better match results
categories = recode(categories,
"Technology" = "Information",
"Logistics" = "Transportation and warehousing")) |>
# extract keywords for each industry
separate(categories, into = c("word1", "word2"), sep = " & ") |>
pivot_longer(
word1:word2,
names_to = "order",
values_to = "keywords"
) |>
drop_na(keywords)
# use regex to perform inexact matching
reg_match <-
regex_inner_join(df1, df2, by=c("industries" = "keywords"), ignore_case = TRUE) |>
distinct(industries.y, .keep_all = TRUE) |>
# removed service industry bc it's too general
filter(industries.y != "Service") |>
select(industries.x, industries.y)
bil_gdp_2010_2023 <-
left_join(bil_2010_2023_clean, country_gdp_clean,
by = join_by(alpha_3_code == code, year == year), multiple = "all") |>
mutate(region_gdp = gdp, region_code = alpha_3_code) |>
select(-c(gdp, alpha_3_code, name))
bil_gdp_indus_usa <-
bil_gdp_2010_2023 |>
filter(country_of_residence == "United States" & year >= 2017 & year <= 2022) |>
left_join(reg_match, c("industries" = "industries.y")) |>
left_join(indus_gdp_clean, c("industries.x" = "industries", "year" = "year")) |>
select(-industries.x, -region_code)
### save useful files & add some description
write_csv(bil_gdp_2010_2023, "../data/tidy/billionaire_gdp.csv")
write_csv(bil_gdp_indus_usa, "../data/tidy/billionaire_gdp_indus_usa.csv")
write_csv(country_gdp_clean, "../data/tidy/gdp.csv")
write_csv(bil_gdp_2023_clean, "../data/tidy/billionaire_2023.csv")
Describe dataset saved for downstream analysis.
billionaire_gdp.csv
This dataset covers global billionaires from 2010 to 2023, detailing their wealth, personal information, and country-specific GDP.
year
: The year in which the data was collected
net_worth
: The net worth of the billionaire (in billions
of dollars)
full_name
, age
, and gender
:
The name, age, and gender of the billionaire, respectively
country_of_citizenship
,
country_of_residence
, and city_of_residence
:
The residence details spanning country, city, and citizenship of the
billionaire
self_made
: Whether or not the billionaire’s wealth is
self-made or inherited
wealth_status
: Current status of the billionaire’s
wealth, whether it’s rising, stable, or declining
industries
: The industries associated with the
billionaire’s business interests
region_gdp
: The real GDP (in trillions of dollars) for
each country or region the billionaire resides
region_code
: The three-letter code of the corresponding
country or region the billionaire resides
billionaire_2023.csv
This dataset contains information for global billionaires in 2023, including their wealth, personal details, and country-specific economic status.
year
: The year in which the data was collected
net_worth
: The net worth of the billionaire (in billions
dollars)
full_name
, age
, and gender
:
The name, age, and gender of the billionaire, respectively
country_of_citizenship
,
country_of_residence
, and city_of_residence
:
The residence details spanning country, city, and citizenship of the
billionaire
self_made
: Whether or not the billionaire is
self-made
cpi_country
and cpi_change_country
:
Consumer Price Index (CPI) and its change for the billionaire’s
country
gdp_country
: The real GDP (in trillions dollars) for
each country the billionaire resides
life_expectancy_country
: Life expectancy in the country
the billionaire reside
billionaire_gdp_indus_usa.csv
This dataset combines information on billionaires residing in the United States from 2017 to 2022, with the annual GDP and the GDP for various industries in the United States.
year
: The year in which the data was collected
net_worth
: The net worth of the billionaire (in billions
dollars)
full_name
, age
, and gender
:
The name, age, and gender of the billionaire, respectively
country_of_citizenship
,
country_of_residence
, and city_of_residence
:
The residence details spanning country, city, and citizenship of the
billionaire
self_made
: Whether or not the billionaire’s wealth is
self-made or inherited
wealth_status
: Current status of the billionaire’s
wealth
industries
: The industries associated with the
billionaire’s business interests
region_gdp
: The real GDP (in trillions dollars) for each
country or region the billionaire resides
region_code
: The three-letter code of the corresponding
country or region the billionaire resides
industry_gdp
: The annual GDP (in trillions dollars) for
the industry
gdp.csv
This dataset includes information of GDP data of each region.
name
: region name (country or SAR(special administration
area))
code
: region code
year
: year
gdp
: GDP value (trillion as unit)