Data Processing

Data processing involves investigating the data looking for:

missing data
inconsistent data
typos and other errors in character data
errors in numeric data

Once found, errors need to be corrected or otherwise acknowledged so that later analysis is not impacted.

Understanding missingness

It is important to understand what data is missing from each dataset. The R package naniar is used to identify records which have a least one field that is missing data.

2016-2017 school year

The only record with data missing in the 2016-2017 dataset is information for the Karval RE-23 school district. Fields with missing data are absentee_students, total_students, and chronically_absent_rate. This is the result of this district having been missing in the original data and being added to the dataset during data preparation. Rather than attempt to impute the missing data, it will be left missing.

naniar::miss_case_summary(df_2016) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2016 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 1 × 11
   case n_miss pct_miss school_year county_code county_name district_code
  <int>  <int>    <dbl> <chr>       <chr>       <chr>       <chr>        
1   101      3     37.5 2016-2017   37          LINCOLN     1810         
# ℹ 4 more variables: district_name <chr>, absentee_students <dbl>,
#   total_students <dbl>, chronically_absent_rate <dbl>

2017-2018 school year

There is no data missing from the 2017-2018 dataset.

naniar::miss_case_summary(df_2017) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2017 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 0 × 11
# ℹ 11 variables: case <int>, n_miss <int>, pct_miss <dbl>, school_year <chr>,
#   county_code <chr>, county_name <chr>, district_code <chr>,
#   district_name <chr>, absentee_students <dbl>, total_students <dbl>,
#   chronically_absent_rate <dbl>

2018-2019 school year

The only record with data missing in the 2018-2019 dataset is information for the Cheraw 31 school district. Fields with missing data are absentee_students, total_students, and chronically_absent_rate. This is the result of this district having been missing in the original data and being added to the dataset during data preparation. Rather than attempt to impute the missing data, it will be left missing.

naniar::miss_case_summary(df_2018) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2018 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 1 × 11
   case n_miss pct_miss school_year county_code county_name district_code
  <int>  <int>    <dbl> <chr>       <chr>       <chr>       <chr>        
1   124      3     37.5 2018-2019   45          OTERO       2560         
# ℹ 4 more variables: district_name <chr>, absentee_students <dbl>,
#   total_students <dbl>, chronically_absent_rate <dbl>

2019-2020 school year

There is no data missing from the 2019-2020 dataset.

naniar::miss_case_summary(df_2019) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2019 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 0 × 11
# ℹ 11 variables: case <int>, n_miss <int>, pct_miss <dbl>, school_year <chr>,
#   county_code <chr>, county_name <chr>, district_code <chr>,
#   district_name <chr>, absentee_students <dbl>, total_students <dbl>,
#   chronically_absent_rate <dbl>

2020-2021 school year

The only record with data missing in the 2020-2021 dataset is information for the Holly RE-3 school district. Fields with missing data are absentee_students, total_students, and chronically_absent_rate. This is the result of this district having been missing in the original data and being added to the dataset during data preparation. Rather than attempt to impute the missing data, it will be left missing.

naniar::miss_case_summary(df_2020) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2020 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 1 × 11
   case n_miss pct_miss school_year county_code county_name district_code
  <int>  <int>    <dbl> <chr>       <chr>       <chr>       <chr>        
1   135      3     37.5 2020-2021   50          PROWERS     2670         
# ℹ 4 more variables: district_name <chr>, absentee_students <dbl>,
#   total_students <dbl>, chronically_absent_rate <dbl>

2021-2022 school year

There is no data missing from the 2021-2022 dataset.

naniar::miss_case_summary(df_2021) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2021 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 0 × 11
# ℹ 11 variables: case <int>, n_miss <int>, pct_miss <dbl>, school_year <chr>,
#   county_code <chr>, county_name <chr>, district_code <chr>,
#   district_name <chr>, absentee_students <dbl>, total_students <dbl>,
#   chronically_absent_rate <dbl>

2022-2023 school year

There is no data missing from the 2022-2023 dataset.

naniar::miss_case_summary(df_2022) |> 
  filter(n_miss > 0) |> 
  inner_join(df_2022 |> 
  rowid_to_column("case"), by=c("case"))

# A tibble: 0 × 11
# ℹ 11 variables: case <int>, n_miss <int>, pct_miss <dbl>, school_year <chr>,
#   county_code <chr>, county_name <chr>, district_code <chr>,
#   district_name <chr>, absentee_students <dbl>, total_students <dbl>,
#   chronically_absent_rate <dbl>

Merging the datasets

To facilitate further analysis, the individual school year datasets will be merged into a single master dataset.

df_master <- bind_rows(df_2016,
                       df_2017,
                       df_2018,
                       df_2019,
                       df_2020,
                       df_2021,
                       df_2022)

Examination of unvariates

There are 3 numeric fields that need to be reviewed: chronically_absent_rate, absentee_students, and total_students. All of these fields should have positive values. The first, chronically_absent_rate should range between 0 and 1. In addition to these numeric fields, the school_districts need to be examined to ensure that all datasets include the same 178 school districts. Additionally, each dataset should contain 178 observations for the same school year.

Verify counties

df_master |> 
  distinct(county_name)

# A tibble: 63 × 1
   county_name
   <chr>      
 1 ADAMS      
 2 ALAMOSA    
 3 ARAPAHOE   
 4 ARCHULETA  
 5 BACA       
 6 BENT       
 7 BOULDER    
 8 CHAFFEE    
 9 CHEYENNE   
10 CLEAR CREEK
# ℹ 53 more rows

There are 63 counties present. This is good. There are 64 counties in the state, but Broomfield county is missing because it does not have any school districts of its own. It is served by existing districts.

The names are in call caps, for aesthetics, these will be converted to proper case.

df_master <- df_master |> 
  mutate(
    county_name = str_to_title(county_name)
  )

df_master |> 
  distinct(county_name)

# A tibble: 63 × 1
   county_name
   <chr>      
 1 Adams      
 2 Alamosa    
 3 Arapahoe   
 4 Archuleta  
 5 Baca       
 6 Bent       
 7 Boulder    
 8 Chaffee    
 9 Cheyenne   
10 Clear Creek
# ℹ 53 more rows

Now we need to verify the county codes.

df_master |> 
  distinct(county_code) |> 
  pull()

 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59" "60"
[61] "61" "62" "63"

There are 63 county codes. The data are consistent.

Verify school districts

Each dataset has 178 school district records. They should all be unique and identical between school years.

district_count <- df_master |>
  distinct(district_code) |> 
  count() |> 
  pull()

district_count

[1] 178

This confirms that there are only 178 districts in the dataset.

district_name_count <- df_master |>
  distinct(district_name) |> 
  count() |> 
  pull()

district_name_count

[1] 361

We have an issue with the district_name field, however. There are 361 distinct values rather 178. We need to investigate the field to see if it is possible to normalise the district_name field.

df_master |>
  distinct(district_name)

# A tibble: 361 × 1
   district_name             
   <chr>                     
 1 MAPLETON 1                
 2 ADAMS 12 FIVE STAR SCHOOLS
 3 ADAMS COUNTY 14           
 4 SCHOOL DISTRICT 27J       
 5 BENNETT 29J               
 6 STRASBURG 31J             
 7 WESTMINSTER PUBLIC SCHOOLS
 8 ALAMOSA RE-11J            
 9 SANGRE DE CRISTO RE-22J   
10 ENGLEWOOD 1               
# ℹ 351 more rows

After pouring through the data, it looks like the districts have both an all caps version and a proper case version.

distinct_districts <- df_master |> 
  mutate(
    district_name = str_to_title(district_name)
  ) |> 
  distinct(district_name) |> 
  count() |> 
  pull()
distinct_districts

[1] 187

This gets us a lot closer. There are now only 187 distinct district_name values, but we still have 9 to normalise.

Since we know that there are only 178 distinct district_code values, we can group by the district_code and find all instances where the count() is greater than 1. There should be 9 such instances.

df_master |> 
  mutate(
    district_name = str_to_title(district_name)
  ) |> 
  distinct(district_name, district_code) |>
  group_by(district_code) |> 
  count() |> 
  filter(n > 1)

# A tibble: 9 × 2
# Groups:   district_code [9]
  district_code     n
  <chr>         <int>
1 0310              2
2 0470              2
3 0920              2
4 1110              2
5 2020              2
6 2710              2
7 2730              2
8 3100              2
9 3140              2

These are the 9 districts. We can retrieve them to see what we need to do to normalise them. Before we do, we will make permanent the proper casing of the district names.

df_master <- df_master |> 
  mutate(
    district_name = str_to_title(district_name)
  )

dupe_district_ids <- df_master |> 
  mutate(
    district_name = str_to_title(district_name)
  ) |> 
  distinct(district_name, district_code) |>
  group_by(district_code) |> 
  count() |> 
  filter(n > 1) |> 
  select(district_code) |> 
  pull()

df_master |> 
  filter(district_code %in% dupe_district_ids) |> 
  select(district_code, district_name) |> 
  distinct() |> 
  arrange(district_code)

# A tibble: 18 × 2
   district_code district_name                       
   <chr>         <chr>                               
 1 0310          Mc Clave Re-2                       
 2 0310          Mcclave Re-2                        
 3 0470          St Vrain Valley Re 1j               
 4 0470          St Vrain Valley Re1j                
 5 0920          Elizabeth C-1                       
 6 0920          Elizabeth School District           
 7 1110          Falcon 49                           
 8 1110          District 49                         
 9 2020          Moffat County Re:No 1               
10 2020          Moffat County Re: No 1              
11 2710          Meeker Re1                          
12 2710          Meeker Re-1                         
13 2730          Del Norte C-7                       
14 2730          Upper Rio Grande School District C-7
15 3100          Windsor Re-4                        
16 3100          Weld Re-4                           
17 3140          Weld County S/D Re-8                
18 3140          Weld Re-8 Schools

Most of these issues are minor differences that are easily fixed. The only significant differences are with district codes 0920, 1110, 2730, and 3100. These records will need to be modified individually as there are no patterns to the discrepancies. When the issue is that there are different names in use, the one being used more recently will be adopted.

df_master <- df_master |> 
  mutate(
    district_name = case_when(district_name %in% c("Mc Clave Re-2","Mcclave Re-2") ~ "McClave Re-2",
                              str_starts(district_name, "St Vrain Valley Re") ~ "St Vrain Valley Re-1j",
                              # district 0920 is called "Elizabeth School District" most recently
                              district_name == "Elizabeth C-1" ~ "Elizabeth School District",
                              # district 1110 is called "District 49" most recently
                              district_name == "Falcon 49" ~ "District 49",
                              str_starts(district_name, "Moffat County") ~ "Moffat County Re-1",
                              district_name == "Meeker Re1" ~ "Meeker Re-1",
                              # district 2730 is called "Upper Rio Grande School District C-7" most recently
                              district_name == "Del Norte C-7" ~ "Upper Rio Grande School District C-7",
                              district_name == "Windsor Re-4" ~ "Weld Re-4",
                              # district 3140 is called "Weld Re-8 Schools" most recently
                              district_name == "Weld County S/D Re-8" ~ "Weld Re-8 Schools",
                              TRUE ~ district_name)
  )

This confirms that the district_name field has been normalised. Now there are only 178 distinct values.

df_master |>
  distinct(district_name) |> 
  count()

# A tibble: 1 × 1
      n
  <int>
1   178

Verify school year information

Each dataset should have a single value in the school_year field. If this is not the case, determine what the discrepancy is and make the correction.

df_master |> 
  group_by(school_year) |> 
  count()

# A tibble: 7 × 2
# Groups:   school_year [7]
  school_year     n
  <chr>       <int>
1 2016-2017     178
2 2017-2018     178
3 2018-2019     178
4 2019-2020     178
5 2020-2021     178
6 2021-2022     178
7 2022-2023     178

The datasets all contain a single value for the school_year.

Now that this data is confirmed, the school_year field should be converted to an ordinal factor from a character field.

df_master <- df_master |> 
  mutate(
    school_year = fct_relevel(school_year, "2016-2017", "2017-2018", "2018-2019",
                              "2019-2020","2020-2021","2021-2022","2022-2023")
  )

Chronic absenteeism rate

This field represents the proportion of K-12 students who are chronically absent. Values should range from 0 to 1 inclusive.

df_master |> 
  select(chronically_absent_rate) |> 
  head() |> 
  pull()

[1] 0.273985680 0.226889195 0.368673647 0.351188322 0.007774538 0.149413020

This field has way too many significant figures. Most of these are meaningless and later years in the dataset have fewer significant figures, so the additional significant figures will be removed by rounding.

df_master <- df_master |> 
  mutate(
    chronically_absent_rate = round(chronically_absent_rate, 3)
  )

Plot histograms to look for unusual or invalid values.

df_master |> 
  ggplot(aes(x=chronically_absent_rate)) +
  geom_histogram(binwidth = 0.05, colour="white", fill=standard_chart_colour) +
  facet_wrap(~school_year) +
  labs(
    title = "Distribution of district chronic absenteeism",
    subtitle = "School years 2016-2023",
    x="Chronically absent rate",
    y="Count"
  )

The data look valid. There is a high occurrence of zero values in the 2016-2017 school year. This suggests that either different school districts may have begun record keeping at different times or that these smaller districts were actually not seeing this phenomenon until later. Of these, the former seems more plausible. Chronic absenteeism could be cross-referenced with truancy rates to provide a check on this. The truancy data may allow us to rule out the second hypothesis. This could be done in a follow-up analysis.

There are 20 districts that reported no chronically absent students in 2016-2017. The number steadily decreases and by 2021-2022 there are no districts reporting no students as chronically absent.

df_master |> 
  filter(chronically_absent_rate == 0) |>
  ggplot(aes(x=total_students)) + 
  geom_histogram(bins=20, colour="white", fill=standard_chart_colour) +
  facet_wrap(~school_year) +
  labs(
    title = "Smaller districts are less likely to report chronic absenteeism",
    subtitle = "The number of schools not reporting declines significantly over time",
    x="Total studends enrolled",
    y="Count"
  )

Total student enrollment

Based on a 5-number summary of the number of students enrolled in each district varies widely from less than 10 to more than 85,000.

df_master |> 
  group_by(school_year) |> 
  summarise(
    min_count = min(total_students, na.rm = TRUE),
    quartile_1 = quantile(total_students, 0.25, na.rm = TRUE),
    median = median(total_students, na.rm = TRUE),
    quartile_3 = quantile(total_students, 0.75, na.rm = TRUE),
    max_count = max(total_students, na.rm = TRUE),
  ) |> 
  gt(id="total-enrollment-tukey") |> 
  format_table("5-number summary of undifferentiated enrollment data")

school_year	min_count	quartile_1	median	quartile_3	max_count
5-number summary of undifferentiated enrollment data
2016-2017	5	212	596	2,339	86,139
2017-2018	4	210.5	584	2,325.5	86,863
2018-2019	37	209	598	2,299	86,887
2019-2020	16	227.2	658.5	2,475.2	96,030
2020-2021	44	219	622	2,193	93,385
2021-2022	32	225.5	619.5	2,369.8	92,170
2022-2023	31	222	599	2,322	87,225

The numbers are consistent and there are no obviously invalid values in the dataset. The minimum counts for the first two years look suspect. The district is the same in both cases, the Agate 300 school district.

df_master |> 
  filter(district_code == "0960") |> 
  select(school_year,absentee_students,total_students,chronically_absent_rate) |> 
  gt(id="agate-300-enrollment") |> 
  format_table("Agate 300 School District enrollment data 2016-2023")

school_year	absentee_students	total_students	chronically_absent_rate
Agate 300 School District enrollment data 2016-2023
2016-2017	0	5	0
2017-2018	0	4	0
2018-2019	0	47	0
2019-2020	4	34	0.1
2020-2021	16	73	0.2
2021-2022	22	95	0.2
2022-2023	12	78	0.2

The total_students data for Agate 300 school district is quite variable, but 5 and 4 seem like extremely low counts given the other 5 years of data available. It will not affect the outcome of this study, but a follow-up to determine the true values for the 2016-2017 and 2017-2018 school years would be useful.

To look for more school districts that might have large discrepancies in enrollment, year on year percent change in total enrollment can be computed and outliers investigated. After the field is created, a 5-number summary will give rough information about how the values are distributed.

df_master_tmp <- df_master |> 
  group_by(district_code) |> 
  arrange(school_year) |> 
  mutate(
    pct_enroll_diff = 100*(total_students - lag(total_students))/lag(total_students)
  ) |> 
  ungroup()

df_master_tmp |>
  filter(school_year != "2016-2017") |> 
  group_by(school_year) |> 
  summarise(
    min_diff = min(pct_enroll_diff, na.rm = TRUE),
    quartile_1 = quantile(pct_enroll_diff, 0.25, na.rm = TRUE),
    median = median(pct_enroll_diff, na.rm = TRUE),
    quartile_3 = quantile(pct_enroll_diff, 0.75, na.rm = TRUE),
    max_diff = max(pct_enroll_diff, na.rm = TRUE),
  ) |> 
  gt(id="colorado-pct-diff-enrollment-tukey5") |> 
  format_table("% change in enrollment 5-number summary by school year")

school_year	min_diff	quartile_1	median	quartile_3	max_diff
% change in enrollment 5-number summary by school year
2017-2018	−24.1	−2.4	0.1	2.2	47.1
2018-2019	−16.9	−2.1	0.2	2.5	1,075
2019-2020	−90.4	4.7	9.6	14.7	75.6
2020-2021	−81	−8.9	−3.9	0	868.8
2021-2022	−66.3	−3.4	1.5	7.2	395.5
2022-2023	−46.6	−9	−4.3	−0.2	217.8

This table shows that there are some very significant changes in district enrollments.

df_master_tmp |>
  filter(school_year != "2016-2017" & abs(pct_enroll_diff) > 50) |>
  select(school_year, district_code, district_name, absentee_students, total_students, pct_enroll_diff) |> 
  arrange(desc(abs(pct_enroll_diff))) |> 
  gt(id="colorado-pct-diff-enrollment") |> 
  format_table("Percent change of enrollment outliers 2016-2023",
               "Outliers are defined as having a change in enrollment greater than 50%")

school_year	district_code	district_name	absentee_students	total_students	pct_enroll_diff
Percent change of enrollment outliers 2016-2023
Outliers are defined as having a change in enrollment greater than 50%
2018-2019	0960	Agate 300	0	47	1,075
2020-2021	1410	North Park R-1	35	155	868.8
2021-2022	0100	Alamosa Re-11j	685	2,294	395.5
2020-2021	0230	Walsh Re-1	31	160	255.6
2022-2023	1440	Plainview Re-2	6	410	217.8
2020-2021	0260	Vilas Re-5	6	200	177.8
2022-2023	0310	McClave Re-2	42	236	168.2
2021-2022	1440	Plainview Re-2	12	129	158
2021-2022	0940	Big Sandy 100j	56	354	139.2
2020-2021	3220	Idalia Rj-3	21	185	134.2
2020-2021	0960	Agate 300	16	73	114.7
2021-2022	0740	Sierra Grande R-30	87	286	110.3
2020-2021	1590	Primero Reorganized 2	29	223	106.5
2021-2022	0580	South Conejos Re-10	32	155	98.7
2019-2020	1410	North Park R-1	2	16	−90.4
2021-2022	3110	Johnstown-Milliken Re-5j	947	3,906	83.5
2020-2021	0100	Alamosa Re-11j	127	463	−81
2021-2022	2862	Julesburg Re-1	328	742	80.5
2019-2020	0260	Vilas Re-5	12	72	75.6
2020-2021	0190	Byers 32j	2,115	7,027	73.5
2019-2020	0230	Walsh Re-1	3	45	−69.8
2021-2022	0290	Las Animas Re-1	350	859	−66.3
2022-2023	1580	Trinidad 1	238	856	64.6
2021-2022	0310	McClave Re-2	13	88	−59.8
2019-2020	3220	Idalia Rj-3	79	79	−57.8
2020-2021	1440	Plainview Re-2	10	50	56.2
2019-2020	2862	Julesburg Re-1	167	802	55.4
2019-2020	0170	Deer Trail 26j	61	283	53.8
2020-2021	0580	South Conejos Re-10	11	78	−51.2

Without knowing the history if these districts it is hard to say if the numbers contain errors. Like Agate 300, mentioned earlier, there are some very large changes year on year in micro to intermediate sized districts. Some of this may be explained by small districts be subject to more size volatility than larger ones, but some of these districts are having order of magnitude size changes in total enrollment which seems likely to be a data entry error than a real fluctuation in student enrollment.

Examples of districts with at least 1 suspicious `total_students` record

0960 - Agate 300
0230 - Walsh RE-1
1410 - North Park R-1
0100 - Alamosa Re-11j
0580 - South Conejos Re-10 (?)

The first two years for Agate show enrollments of 5 & 4 students, then enrollment jumps to 37. It is likely that the 5 and 4 should be 35 and 34, but without checking, it is hard to say what the correct values are.

For Walsh RE-1, there is a clear typo in the 2019-2020 information. The listed value is 45 but this should probably be 145 based on the other years.

The same situation exists for North Park R-1 and in the same year. The listed value is 16, but it should probably be 160 or similar based on data from the other school years.

Alamosa RE-11j has a similar issue in the 2020-2021 school year where, most likely, a leading “2” was dropped from the total enrollment number. It should probably be 2463 rather than the 463 that is listed.

There is a less clear case for the South Conejos Re-10 school district in 2021-2022. The total enrollment drops from 160 to 78 before rebounding to 155. It is possible that there is a leading “1” that is missing, but so many districts show a drop in this year, that it is possible that the data are accurate.

Chronically absent student counts

The number of chronically absent students in each district varies widely from 0 to more than 40,000. The numbers are consistent and there are no invalid values in the dataset.

df_master |> 
  group_by(school_year) |> 
  summarise(
    min_count = min(absentee_students, na.rm = TRUE),
    quartile_1 = quantile(absentee_students, 0.25, na.rm = TRUE),
    median = median(absentee_students, na.rm = TRUE),
    quartile_3 = quantile(absentee_students, 0.75, na.rm = TRUE),
    max_count = max(absentee_students, na.rm = TRUE),
  )

# A tibble: 7 × 6
  school_year min_count quartile_1 median quartile_3 max_count
  <fct>           <dbl>      <dbl>  <dbl>      <dbl>     <dbl>
1 2016-2017           0        6     49         290      30483
2 2017-2018           0       33    144.        526.     31469
3 2018-2019           0       37    106         428      34596
4 2019-2020           0       26.2   90.5       506      27873
5 2020-2021           0       32    134         554      27870
6 2021-2022           4       67    173         782.     39781
7 2022-2023           1       61.5  170.        681      35835

df_master |>
  filter(!is.na(absentee_students)) |> 
  ggplot(aes(x=absentee_students)) +
  geom_histogram(colour="white", fill=standard_chart_colour) +
  facet_wrap(~school_year) +
  labs(
      title = "Distribution of district chronic absenteeism",
      x="Absentee students",
      y="Count"
    )

Feature creation

Three new features will be made from the current data: district_size, pct_diff, and pct_enrollment_diff. The first feature is a ordered categorical variable that describes the size of the district. There are five levels. From smallest to largest they are

Micro - Up to 100 students
Small - 101 to 750 students
Intermediate - 751 to 3000 students
Medium - 3001 to 15000 students
Large - More than 15000 students

The second feature is the change in chronic absenteeism rate from year to year. The values for all districts in the 2016-2017 school year are NA as this is the first year in the dataset.

df_master <- df_master |> 
  mutate(
    district_size = case_when(
      total_students > 15000 ~ "Large",
      total_students > 3000 ~ "Medium",
      total_students > 750 ~ "Intermediate",
      total_students > 100 ~ "Small",
      TRUE ~ "Micro"
    )
  ) |> 
  mutate(
    district_size = fct_relevel(district_size, "Micro", "Small", "Intermediate", "Medium", "Large")
  ) |> 
  group_by(district_code) |> 
  arrange(school_year) |> 
  mutate(
    pct_diff = chronically_absent_rate - lag(chronically_absent_rate, n=1, order_by = district_code)
  ) |>
  ungroup(district_code) |> 
  relocate(district_size, .before=chronically_absent_rate)

The number of districts in each group is consistent from year to year. The smaller school districts, those in the Small and Micro categories, have the most variability. This is not surprising as small changes in student population can move a district from one category to another.

df_master |> 
  ggplot(aes(x=district_size)) +
  geom_bar(fill=standard_chart_colour) +
  facet_wrap(~school_year) +
  coord_flip() +
  labs(
    x="District size",
    y="Count"
  )

The pct_diff data looks to be consistent with what we know about the rest of the data in this dataset. The 5 number summary shows how volatile the absenteeism rates can be, particularly around the first COVID-19 school year 2019-2020.

df_master |>
  filter(school_year != "2016-2017") |> 
  group_by(school_year) |> 
  summarise(
    min_count = min(pct_diff, na.rm = TRUE),
    quartile_1 = quantile(pct_diff, 0.25, na.rm = TRUE),
    median = median(pct_diff, na.rm = TRUE),
    quartile_3 = quantile(pct_diff, 0.75, na.rm = TRUE),
    max_count = max(pct_diff, na.rm = TRUE),
  )

# A tibble: 6 × 6
  school_year min_count quartile_1   median quartile_3 max_count
  <fct>           <dbl>      <dbl>    <dbl>      <dbl>     <dbl>
1 2017-2018      -0.255   -0.00600  0.0250      0.147      0.478
2 2018-2019      -0.361   -0.074   -0.00200     0.0240     0.394
3 2019-2020      -0.27    -0.074   -0.00100     0.061      1    
4 2020-2021      -0.886   -0.00900  0.0420      0.095      0.62 
5 2021-2022      -0.229    0.0210   0.084       0.131      0.332
6 2022-2023      -0.251   -0.0678  -0.0245      0.0100     0.406

The histograms show that the pct_diff data is normally distributed but with significant outliers in the 2019-2020 and 2020-2021 school years.

df_master |> 
  filter(!is.na(pct_diff)) |> 
  ggplot(aes(x=pct_diff)) +
  geom_histogram(binwidth=0.1, colour="white", fill=standard_chart_colour) +
  facet_wrap(~school_year)

Create master data file

Now that the data preparation and processing have been completed, the complete dataset can be saved to disk.

write_csv(df_master, file = "output/2016-2023-master_chronic_absenteeism.csv")
write_rds(df_master, file = "intermediates/df_master.rds")

Write CSV files to output folder for each year’s dataset so that they can be reused without having to reapply the processing steps.

write_csv(df_2016, file="output/2016-2017_chronic_absenteeism.csv")
write_csv(df_2017, file="output/2017-2018_chronic_absenteeism.csv")
write_csv(df_2018, file="output/2018-2019_chronic_absenteeism.csv")
write_csv(df_2019, file="output/2019-2020_chronic_absenteeism.csv")
write_csv(df_2020, file="output/2020-2021_chronic_absenteeism.csv")
write_csv(df_2021, file="output/2021-2022_chronic_absenteeism.csv")
write_csv(df_2022, file="output/2022-2023_chronic_absenteeism.csv")

Continue to Analysis