Data Preparation
Overview
Before the data can be processed, it needs to be prepared. After the data are imported, the features will be inspected and any required changes to the structure of the data will be noted. Feature names will be changed where necessary.
The tidyverse libraries and the janitor package will be used throughout the analysis, so they are loaded now.
library(tidyverse)
library(janitor)
Preparation
To begin, we will load the data and take a quick look at it with glimpse.
df_counties <- read_csv("data/WA_demographic_data.csv")
glimpse(df_counties)
Rows: 7,040
Columns: 7
$ Year <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20…
$ Geography <chr> "Washington State", "Yakima", "Whitman", "Wh…
$ `Selection Filter` <chr> "All", "All", "All", "All", "All", "All", "A…
$ `Selection Value` <chr> "All", "All", "All", "All", "All", "All", "A…
$ `Max. % Total Population` <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100,…
$ `Max. Sub-Population` <dbl> 6773171.98, 244632.25, 44414.34, 201996.58, …
$ `Max. Total Population` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Renaming features
From the glimpse output, it’s clear that many of the feature names will be awkward to work with as is: they contain spaces and punctuation, so they have to be wrapped in backticks. We can use the clean_names function from the janitor package to generate snake case feature names.
df_counties <- df_counties |>
  clean_names()
glimpse(df_counties)
Rows: 7,040
Columns: 7
$ year <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ geography <chr> "Washington State", "Yakima", "Whitman", …
$ selection_filter <chr> "All", "All", "All", "All", "All", "All",…
$ selection_value <chr> "All", "All", "All", "All", "All", "All",…
$ max_percent_total_population <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 1…
$ max_sub_population <dbl> 6773171.98, 244632.25, 44414.34, 201996.5…
$ max_total_population <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
These names will be much easier to work with. The feature data types look reasonable for this stage of the analysis. year and geography will be converted to factors during the processing phase. The other features need to be investigated in more detail before their final data types are decided. Every max_total_population value shown in the glimpse is NA, so the feature may contain many missing values and will need more scrutiny.
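The two follow-up steps mentioned above, checking for missing values and converting year and geography to factors, could be sketched as follows. This is only an illustration under stated assumptions: toy is a small made-up tibble that borrows the cleaned column names, its values are invented, and the names na_counts, na_share, and toy_factored are hypothetical. The same calls should work on df_counties, but the actual approach used in the processing phase may differ.

```r
library(tidyverse)

# Hypothetical toy data that mimics the cleaned column names.
# The values are invented for illustration, not taken from the real file.
toy <- tibble(
  year = c(2011, 2011, 2012, 2012),
  geography = c("Washington State", "Yakima", "Washington State", "Yakima"),
  max_sub_population = c(100, 10, 110, 11),
  max_total_population = c(NA, NA, 110, NA)
)

# Number of missing values in each column
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Share of missing values in each column (0 = none missing, 1 = all missing)
na_share <- toy |>
  summarise(across(everything(), ~ mean(is.na(.x))))

# Convert year and geography to factors, leaving the other columns untouched
toy_factored <- toy |>
  mutate(across(c(year, geography), as.factor))

print(na_counts)
print(na_share)
glimpse(toy_factored)
```

Comparing the count and share side by side helps decide whether a feature like max_total_population is sparse enough to drop or whether the missing values are confined to particular rows.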