Data cleaning

Delivering dataset preprocessing to ensure the feasibility of merging different datasets for downstream analysis.

library(tidyverse)
library(rvest)
library(httr)
library(readxl)
library(janitor)
library(fuzzyjoin)

Country code

The country code dataset contains country names along with their corresponding standard two-letter, three-letter, and numeric codes. The dataset can be accessed here.

Key steps in cleaning this dataset

Renaming some countries
Variable selection

country_code <-
  read_csv("../data/raw/country_code.csv") |> 
  clean_names() |> 
  rename(
    "country_name" = "english_short_name_lower_case"
  ) |> 
  mutate(
    country_name = recode(country_name, 
                          "United States Of America" = "United States",
                          "Virgin Islands, British" = "British Virgin Islands",
                          "Korea, Republic of (South Korea)" = "South Korea",
                          "Virgin Islands, U.S." = "U.S. Virgin Islands",
                          "Tanzania, United Republic of" = "Tanzania",
                          "Turks and Caicos" = "Turks and Caicos Islands",
                          "Macao" = "Macau")) |> 
  select(country_name, alpha_3_code)

Billionaires 2010-2023

The raw dataset contains information on global billionaires from 1997 to 2023. It offers a glimpse into the distribution of wealth, industries of operation, and demographic profiles of billionaires on a global scale. The dataset is available here.

Key steps in cleaning this dataset

Filtering for data from 2010 to 2023
Converting net_worth to numerical value
Variable selection
Merging with country_code dataset for further analysis

bil_2010_2023 <-
  read_csv("../data/raw/billionaires_1997_2023.csv") |> 
  filter(year >= 2010)

bil_2010_2023_clean <-
  bil_2010_2023 |> 
  mutate(
    net_worth = str_replace_all(net_worth, " B", ""),
    net_worth = as.numeric(net_worth), 
    industries = str_replace_all(business_industries, "[\\['\\]]", ""),
    country_of_residence = 
      recode(country_of_residence, 
             "Eswatini (Swaziland)" = "Swaziland",
             "Scotland" = "United Kingdom",
             "Czechia" = "Czech Republic",
             "Hong Kong SAR" = "Hong Kong"),
    industries = 
      recode(industries,
             "Fashion and Retail" = "Fashion & Retail",
             "Finance and Investments" = "Finance & Investments",
             "Food and Beverage" = "Food & Beverage",
             "Healthcare" = "Health care",
             "Media" = "Media & Entertainment")) |> 
  left_join(country_code, c("country_of_residence" = "country_name")) |> 
  select(-c(month, rank, last_name, first_name, birth_date, business_category,
            business_industries, organization_name, position_in_organization))

Billionaires 2023 with GDP data

The billionaires 2023 dataset includes statistics on global billionaires, such as information about their wealth, industries, and personal details. This dataset also contains more detailed country information of which each billionaire resides, which is useful for the following analysis. The dataset is available here.

Key steps in cleaning this dataset

Cleaning and changing the unit of final_worth to billion
Cleaning and changing the unit of gdp_country to trillion
Recoding gender variable
Variable selection

bil_gdp_2023 <-
  read_csv("../data/raw/billionaires_2023.csv")

bil_gdp_2023_clean <-
  bil_gdp_2023 |> 
  clean_names() |> 
  mutate(
    net_worth = final_worth / 1000,
    gdp_country = str_replace_all(gdp_country, "[$,]", ""),
    gdp_country = as.numeric(gdp_country) / 1e12,
    gender = 
      case_match(
        gender,
        "F" ~ "Female",
        "M" ~ "Male")) |>
  select(net_worth, full_name = person_name, age, gender, 
         country_of_citizenship, country_of_residence = country, 
         city_of_residence = city, industries, self_made, cpi_country, 
         cpi_change_country, gdp_country, life_expectancy_country)

Region-level GDP

Basic cleaning

The GDP dataset encompasses GDP information spanning 262 distinct countries or regions from 1960 to 2022. The dataset can be downloaded here.

Key steps in cleaning this dataset

Filtering for data starts from 2010
Changing the unit of GDP to trillion
Variable renaming and selection

country_gdp <-
  read_csv("../data/raw/country_gdp.csv")

country_gdp_clean <-
  country_gdp |> 
  filter(year >= 2010) |> 
  mutate(gdp = value / 1e12,
         name = country_name, 
         code = country_code) |> 
  select(name, code, year, gdp)

Supplementary dataset

Given the issue that the region Taiwan is missing GDP data in raw dataset, we used a supplementary dataset to remedy it.

Taiwan

The Taiwan GDP data can be accessed here. This table contains Taiwan GDP data from 1960 to 2022.

Key steps in cleaning this dataset

Scapping GDP data from specific website
Extracting and filtering GDP data

fetch_tw_gdp = function(){
  url = "https://countryeconomy.com/gdp/taiwan"
  tw_gdp_html = read_html(url)
  mydata <- tw_gdp_html|> html_table()
  return(mydata[[1]])
}

extract_gdp = function(string){
  str_vec = str_extract(string, "\\d*,\\d*") |> 
    str_split(",") |> 
    nth(1)
  s = ""
  for(e in str_vec){
    s = str_c(s, e)
  }
  return(s)
}

tw_gdp = fetch_tw_gdp() |>
  janitor::clean_names() |>
  mutate(gdp = map(annual_gdp_2, extract_gdp) |> as.numeric(),
         year = date |> as.numeric(), 
         name = "Taiwan, China",
         code = "TWN",
         gdp = gdp/1e6) |>
  filter(year >= 2010) |>
  select(name, code, year, gdp)

Merge GDP data

Combining Taiwan GDP data with the main GDP dataset

country_gdp_clean = bind_rows(country_gdp_clean, tw_gdp)

Industry-level GDP for US

Basic cleaning

This dataset contains GDP data for each industry in US from 2017 to 2022. The dataset can be downloaded here.

Key steps in cleaning this dataset

Renaming variable names
Tidying the data

indus_gdp <-
  read_excel("../data/raw/usa_industry_gdp.xlsx", sheet = 18, skip = 4)

indus_gdp_clean <-
  indus_gdp |> 
  slice(3:30) |> 
  rename(
    "industries" = "...2",
    "2017" = "...3",
    "2018" = "...4",
    "2019" = "...5",
    "2020" = "...6",
    "2021" = "...7",
    "2022" = "2020...8") |>
  filter(!(industries %in% c("Finance, insurance, real estate, rental, and leasing",
                            "Educational services, health care, and social assistance",
                      "Arts, entertainment, recreation, accommodation, and food services"))) |> 
  select(industries, `2017`:`2022`) |> 
  pivot_longer(`2017`:`2022`, names_to = "year", values_to = "industry_gdp") |> 
  mutate(year = as.numeric(year),
         industry_gdp = industry_gdp / 1000)

Merge industry-level GDP & billionaire

Creating two summary tables with industry names
Extracting keywords for each industry
Using regex to perform inexact matches for industries

# extract key words to merge two datasets by industries variable
#
# industry names from industry_gdp_clean
df1 <- indus_gdp_clean |> select(industries) |> unique()
# industry names from bil_2013_2023_clean (only for USA from 2017 to 2022!)
df2 <- 
  bil_2010_2023_clean |> 
  filter(country_of_residence == "United States" & year >= 2017 & year <= 2022) |> 
  select(industries) |> 
  drop_na() |> 
  unique() |> 
  mutate(
    categories = industries,
    # rename some industries for better match results
    categories = recode(categories, 
                        "Technology" = "Information",
                        "Logistics" = "Transportation and warehousing")) |> 
  # extract keywords for each industry
  separate(categories, into = c("word1", "word2"), sep = " & ") |> 
  pivot_longer(
    word1:word2,
    names_to = "order",
    values_to = "keywords"
  ) |> 
  drop_na(keywords)

# use regex to perform inexact matching
reg_match <-
  regex_inner_join(df1, df2, by=c("industries" = "keywords"), ignore_case = TRUE) |> 
  distinct(industries.y, .keep_all = TRUE) |>
  # removed service industry bc it's too general
  filter(industries.y != "Service") |> 
  select(industries.x, industries.y)

Merge GDP & billionaire

Joining the dataset for billionaires from 2010 to 2023 with the country GDP dataset
Variable renaming and selection

bil_gdp_2010_2023 <-
  left_join(bil_2010_2023_clean, country_gdp_clean, 
            by = join_by(alpha_3_code == code, year == year), multiple = "all") |>
  mutate(region_gdp = gdp, region_code = alpha_3_code) |>
  select(-c(gdp, alpha_3_code, name))

USA industry dataset

Filtering the billionaires dataset for only billionaires reside in US from year 2017 to 2022
Merging the filtered dataset with industry-level GDP dataset
Variable selection

bil_gdp_indus_usa <-
  bil_gdp_2010_2023 |> 
  filter(country_of_residence == "United States" & year >= 2017 & year <= 2022) |> 
  left_join(reg_match, c("industries" = "industries.y")) |> 
  left_join(indus_gdp_clean, c("industries.x" = "industries", "year" = "year")) |> 
  select(-industries.x, -region_code)

Save the cleaned datasets

### save useful files & add some description
write_csv(bil_gdp_2010_2023, "../data/tidy/billionaire_gdp.csv")
write_csv(bil_gdp_indus_usa, "../data/tidy/billionaire_gdp_indus_usa.csv")
write_csv(country_gdp_clean, "../data/tidy/gdp.csv")
write_csv(bil_gdp_2023_clean, "../data/tidy/billionaire_2023.csv")

Description for tidy data

Describe dataset saved for downstream analysis.

billionaire_gdp.csv

This dataset covers global billionaires from 2010 to 2023, detailing their wealth, personal information, and country-specific GDP.

year: The year in which the data was collected

net_worth: The net worth of the billionaire (in billions of dollars)

full_name, age, and gender: The name, age, and gender of the billionaire, respectively

country_of_citizenship, country_of_residence, and city_of_residence: The residence details spanning country, city, and citizenship of the billionaire

self_made: Whether or not the billionaire’s wealth is self-made or inherited

wealth_status: Current status of the billionaire’s wealth, whether it’s rising, stable, or declining

industries: The industries associated with the billionaire’s business interests

region_gdp: The real GDP (in trillions of dollars) for each country or region the billionaire resides

region_code: The three-letter code of the corresponding country or region the billionaire resides

billionaire_2023.csv

This dataset contains information for global billionaires in 2023, including their wealth, personal details, and country-specific economic status.

year: The year in which the data was collected

net_worth: The net worth of the billionaire (in billions dollars)

full_name, age, and gender: The name, age, and gender of the billionaire, respectively

country_of_citizenship, country_of_residence, and city_of_residence: The residence details spanning country, city, and citizenship of the billionaire

self_made: Whether or not the billionaire is self-made

cpi_country and cpi_change_country: Consumer Price Index (CPI) and its change for the billionaire’s country

gdp_country: The real GDP (in trillions dollars) for each country the billionaire resides

life_expectancy_country: Life expectancy in the country the billionaire reside

billionaire_gdp_indus_usa.csv

This dataset combines information on billionaires residing in the United States from 2017 to 2022, with the annual GDP and the GDP for various industries in the United States.

year: The year in which the data was collected

net_worth: The net worth of the billionaire (in billions dollars)

full_name, age, and gender: The name, age, and gender of the billionaire, respectively

country_of_citizenship, country_of_residence, and city_of_residence: The residence details spanning country, city, and citizenship of the billionaire

self_made: Whether or not the billionaire’s wealth is self-made or inherited

wealth_status: Current status of the billionaire’s wealth

industries: The industries associated with the billionaire’s business interests

region_gdp: The real GDP (in trillions dollars) for each country or region the billionaire resides

region_code: The three-letter code of the corresponding country or region the billionaire resides

industry_gdp: The annual GDP (in trillions dollars) for the industry

gdp.csv

This dataset includes information of GDP data of each region.

name: region name (country or SAR(special administration area))

code: region code

year: year

gdp: GDP value (trillion as unit)

Data cleaning

Sitian Zhou, Pei Tian

2023-12-07

Country code

Billionaires 2010-2023

Billionaires 2023 with GDP data

Region-level GDP

Basic cleaning

Supplementary dataset

Taiwan

Merge GDP data

Industry-level GDP for US

Basic cleaning

Merge industry-level GDP & billionaire

Merge GDP & billionaire

USA industry dataset

Description for tidy data