Data Processing

Overview

Now that the data have been imported and the features renamed, the data can be cleaned and examined to ensure that it is accurate, complete, and consistent.

library(tidyverse)
library(janitor)
library(tidylog)
theme_set(theme_minimal())

Feature Inspection

Explore the individual features in the dataset, looking for outliers, bad data points, and unusual or surprising patterns in the data. For categorical features, look for miscoded values or anything else that might interfere with the analysis.

`year`

We know from the county dashboard that the data should be from 2011 to 2021. This feature is really categorical, not numeric. There is also a distinct order, so it will be converted into an ordinal factor.

df_counties |> 
  distinct(year) |> pull()

distinct: removed 7,029 rows (>99%), 11 rows remaining

 [1] 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021

There are 10 distinct year values, which is expected. Given that the data are correct, we can now convert to a factor because year is not really a numeric feature. Instead, it is an ordinal categorical feature.

df_counties <- df_counties |> 
  mutate(
    year = fct_inseq(factor(year, ordered = TRUE))
  )

mutate: converted 'year' from double to ordered factor (0 new NA)

head(df_counties$year)

[1] 2011 2011 2011 2011 2011 2011
11 Levels: 2011 < 2012 < 2013 < 2014 < 2015 < 2016 < 2017 < 2018 < ... < 2021

Now year is a factor ordered sequentially.

`geography`

The geography feature contains the state and county names. There are 39 counties in Washington, so we expect that there will be 40 distinct values in this feature. This feature is also categorical in nature, but nominal rather than ordinal.

df_counties |> 
  distinct(geography) |> pull()

distinct: removed 7,000 rows (99%), 40 rows remaining

 [1] "Washington State" "Yakima"           "Whitman"          "Whatcom"         
 [5] "Walla Walla"      "Wahkiakum"        "Thurston"         "Stevens"         
 [9] "Spokane"          "Snohomish"        "Skamania"         "Skagit"          
[13] "San Juan"         "Pierce"           "Pend Oreille"     "Pacific"         
[17] "Okanogan"         "Mason"            "Lincoln"          "Lewis"           
[21] "Klickitat"        "Kittitas"         "Kitsap"           "King"            
[25] "Jefferson"        "Island"           "Grays Harbor"     "Grant"           
[29] "Garfield"         "Franklin"         "Ferry"            "Douglas"         
[33] "Cowlitz"          "Columbia"         "Clark"            "Clallam"         
[37] "Chelan"           "Benton"           "Asotin"           "Adams"

It looks like the data are correct. There are 40 distinct values, as expected.

df_counties <- df_counties |> 
  mutate(
    geography = factor(geography)
  )

mutate: converted 'geography' from character to factor (0 new NA)

levels(df_counties$geography)

 [1] "Adams"            "Asotin"           "Benton"           "Chelan"          
 [5] "Clallam"          "Clark"            "Columbia"         "Cowlitz"         
 [9] "Douglas"          "Ferry"            "Franklin"         "Garfield"        
[13] "Grant"            "Grays Harbor"     "Island"           "Jefferson"       
[17] "King"             "Kitsap"           "Kittitas"         "Klickitat"       
[21] "Lewis"            "Lincoln"          "Mason"            "Okanogan"        
[25] "Pacific"          "Pend Oreille"     "Pierce"           "San Juan"        
[29] "Skagit"           "Skamania"         "Snohomish"        "Spokane"         
[33] "Stevens"          "Thurston"         "Wahkiakum"        "Walla Walla"     
[37] "Washington State" "Whatcom"          "Whitman"          "Yakima"

The levels have been recorded alphabetically, which makes sense for the most part. Since “Washington State” isn’t a county, it can be repositioned at the front, to separate it from the counties.

df_counties$geography <- fct_relevel(df_counties$geography, "Washington State", after=0)
levels(df_counties$geography)

 [1] "Washington State" "Adams"            "Asotin"           "Benton"          
 [5] "Chelan"           "Clallam"          "Clark"            "Columbia"        
 [9] "Cowlitz"          "Douglas"          "Ferry"            "Franklin"        
[13] "Garfield"         "Grant"            "Grays Harbor"     "Island"          
[17] "Jefferson"        "King"             "Kitsap"           "Kittitas"        
[21] "Klickitat"        "Lewis"            "Lincoln"          "Mason"           
[25] "Okanogan"         "Pacific"          "Pend Oreille"     "Pierce"          
[29] "San Juan"         "Skagit"           "Skamania"         "Snohomish"       
[33] "Spokane"          "Stevens"          "Thurston"         "Wahkiakum"       
[37] "Walla Walla"      "Whatcom"          "Whitman"          "Yakima"

Now the factor is set up properly.

`selection_filter`

selection_filter should be a categorical feature. First, determine what the set of possible values is.

df_counties |> 
  distinct(selection_filter) |> pull()

distinct: removed 7,036 rows (>99%), 4 rows remaining

[1] "All"            "Sex"            "Age"            "Race/Ethnicity"

These are all reasonable. The only one that we are interested in is the Age filter so we will only keep observations from the dataset with Age as the selection_filter value.

df_counties <- df_counties |>
  filter(selection_filter == "Age")

filter: removed 4,400 rows (62%), 2,640 rows remaining

`selection_value`

The selection values should now only pertain to Age.

df_counties |> 
  distinct(selection_value) |> pull()

distinct: removed 2,634 rows (>99%), 6 rows remaining

[1] "65+"   "45-64" "25-44" "15-24" "1-14"  "<1"

These values look correct. Nothing more needs to be done with this variable.

`max_percent_total_population`

This feature pertains to what percentage of the population a given filter applies to for the given geography. It’s a percentage, so all values should be between 0 and 100.

summary(df_counties$max_percent_total_population)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.4691 11.1973 16.9423 16.6667 24.2378 43.3292

There are no missing values and all values are positive and range between 0 and 100, so this data is valid.

`max_sub_population`

This feature pertains to the number of people that fall into the given selection_filter/selection_value combination for the given geography. All values should be positive and no larger than the total population of the state (at most).

summary(df_counties$max_sub_population)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
     12.7    2112.5    9173.4   60373.6   25492.6 2190675.0

This is the case. There are no missing values.

`max_total_population`

Without a codebook, it is not clear what this feature represents. Furthermore, most of the observations are missing data for this variable. Given that this field won’t be required for our purposes, it will be removed from the dataset.

summary(df_counties$max_total_population)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2286   17835   45835   70978   86918  542100    2232

df_counties <- df_counties |> 
  select(-max_total_population)

select: dropped one variable (max_total_population)

Dataframe pivot and new feature creation

Now, we need to pivot all of the age data into a single observation so that we can perform different dependency ratio calculations for each year/geography combination. We will need to first filter out all of the non Age selection filters, then drop the variables that we don’t need for our calculations so that we get a single observation for each year/geography combination.

df_counties <- df_counties |> 
  select(-c(max_percent_total_population, selection_filter)) |> 
  pivot_wider(names_from=selection_value, values_from=max_sub_population, names_prefix="age_") |> 
  clean_names()

select: dropped 2 variables (selection_filter, max_percent_total_population)

pivot_wider: reorganized (selection_value, max_sub_population) into (age_65+, age_45-64, age_25-44, age_15-24, age_1-14, …) [was 2640x4, now 440x8]

df_counties <- df_counties |> 
  mutate(
    total_dep_ratio = round((age_65 + age_1_14 + age_1)/(age_45_64 + age_25_44 + age_15_24)*100,2),
    child_dep_ratio = round((age_1_14 + age_1)/(age_45_64 + age_25_44 + age_15_24)*100,2),
    aged_dep_ratio = round((age_65)/(age_45_64 + age_25_44 + age_15_24)*100,2)
  )

mutate: new variable 'total_dep_ratio' (double) with 415 unique values and 0% NA

        new variable 'child_dep_ratio' (double) with 371 unique values and 0% NA

        new variable 'aged_dep_ratio' (double) with 418 unique values and 0% NA

Univariate checks on new features

Now that new dependency ratio features have been created, they need to be checked to ensure that the values are reasonable. tidylog output indicates that there are no NA values in the new features. We now need to plot data to ensure that the values are reasonable.

`total_dep_ratio`

The total dependency ratio should be positive. A histogram, faceted by year shows that the values for all years are plausible. There are no negative values and the distributions do not show any extreme outliers.

df_counties |> 
  ggplot(aes(x=total_dep_ratio)) +
  geom_histogram(colour="white") +
  labs(
    x="Total dependency ratio",
    y="Count",
    title="Total dependency ratio distrution"
  ) +
  facet_wrap(~year)

`child_dep_ratio`

The child dependency ratio should always be positive. A histogram, faceted by year shows that the values for all years are plausible. There are no negative values and the distributions do not show any extreme outliers.

df_counties |> 
  ggplot(aes(x=child_dep_ratio)) +
  geom_histogram(colour="white") +
  labs(
    x="Child dependency ratio",
    y="Count",
    title="Child dependency ratio distrution"
  ) +
  facet_wrap(~year)

`aged_dep_ratio`

The aged dependency ratio should always be positive. A histogram, faceted by year shows that the values for all years are plausible. There are no negative values and the distributions do not show any extreme outliers.

df_counties |> 
  ggplot(aes(x=aged_dep_ratio)) +
  geom_histogram(colour="white") +
  labs(
    x="Aged dependency ratio",
    y="Count",
    title="Aged dependency ratio distrution"
  ) +
  facet_wrap(~year)

Continue to Data Analysis