Data Preparation

Preparing the database

The majority of the data preparation for this analysis was done using SQL scripts. The process can be broken down into 5 steps: initial data import, year by year cleanup, creating master data table, creating views, creating a cleanup stored procedure to automate that process.

Initial data import

The initial data import consisted of creating tables for each year’s data and then loading data from comma separated values (CSV) files into the respective tables. The offense codes were loaded into a separate table to be joined with the crime data when necessary.

Year by year cleanup

Each year of data was initially stored in its own table in the database. It was then inspected to ensure that the data was as complete and coherent as possible. The clean-up scripts can be viewed on github from the links below.

The master data table

After the clean-up process was completed, the data where copied into a single table, crimes and the original yearly tables were dropped.

Create master table script

Creating views

For convenience, SQL views were created to supply aggregate data.

Create views

Automation with stored procedure

In order to make the data preparation process repeatable and reproducible, all of the cleaning and processing steps were placed into a stored procedure that cn be run from the PostgreSQL admin application.

Data cleaning procedure

Creating datasets for analysis

Create 5 datasets, one for each of the crime types of interest:

  • Burglary of a residence with forced entry
  • Burglary of a business with forced entry
  • Theft of parts from a vehicle
  • Theft of a vehicle
  • Robberies that fall under the following offence codes:
    • 1204 - Robbery of a person in the open using a gun
    • 1205 - Robbery of a person in the open
    • 1206 - Robbery of a person in the open using bodily force
    • 1210 - Forcible purse snatching
    • 1212 - Carjacking, armed

Offense codes

The offense codes are critical to understanding the crime data. They categorize crimes stored in the crimes table, allowing for the 5 crime datasets listed above to be created.

df_offense_codes <- db_make_query(
  "SELECT
    *
  FROM
    offense_codes;"
)
glimpse(df_offense_codes)
Rows: 294
Columns: 6
$ offense_code           <chr> "2804", "2804", "2901", "2902", "2903", "2999",…
$ offense_code_extension <int> 1, 2, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,…
$ offense_type_id        <chr> "stolen-property-possession", "fraud-possess-fi…
$ offense_type_name      <chr> "Possession of stolen property", "Possession of…
$ offense_category_id    <chr> "all-other-crimes", "all-other-crimes", "public…
$ offense_category_name  <chr> "All Other Crimes", "All Other Crimes", "Public…

Residential Burglary Dataset

This dataset contains residential burglary data for 2010-2021.

# create the datasets
df_burglary_res <- db_make_query(
  "SELECT
    *
  FROM
    crimes
  WHERE
    offense_code = '2204' AND
    offense_code_extension = 0;"
)
glimpse(df_burglary_res)
Rows: 20,500
Columns: 20
$ incident_id            <int64> 2012266, 2012346, 2012681, 2012861, 2014347, …
$ offense_id             <int64> 2012266220400, 2012346220400, 2012681220400, …
$ offense_code           <chr> "2204", "2204", "2204", "2204", "2204", "2204",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date  <dttm> 2012-01-01 00:00:00, 2011-12-31 20:30:00, 2011…
$ last_occurrence_date   <dttm> 2012-01-01 00:10:00, 2012-01-01 02:22:00, 2011…
$ reported_date          <dttm> 2012-01-01 03:08:00, 2012-01-01 02:22:00, 2012…
$ incident_address       <chr> "3635 N LAFAYETTE ST", "4095 N ANDES CT", "761 …
$ geo_x                  <dbl> 3148644, 3205088, 3151414, 3146974, 3129141, 31…
$ geo_y                  <dbl> 1704917, 1707035, 1690673, 1698995, 1679505, 17…
$ geo_lon                <dbl> -104.9712, -104.7704, -104.9617, -104.9773, -10…
$ geo_lat                <dbl> 39.76762, 39.77236, 39.72847, 39.75139, 39.6981…
$ district_id            <chr> "2", "5", "3", "2", "4", "1", "1", "3", "6", "3…
$ precinct_id            <int> 211, 523, 311, 211, 412, 111, 113, 322, 622, 31…
$ neighborhood_id        <chr> "cole", "gateway-green-valley-ranch", "country-…
$ victim_count           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year                 <int> 2012, 2011, 2011, 2012, 2013, 2013, 2013, 2014,…
$ c_month                <int> 1, 12, 12, 1, 12, 12, 12, 12, 12, 1, 1, 1, 12, …
$ c_day                  <int> 1, 31, 27, 1, 31, 31, 31, 28, 29, 1, 1, 1, 31, …
$ c_dow                  <int> 7, 6, 2, 7, 2, 2, 2, 7, 1, 5, 7, 7, 7, 1, 2, 1,…

Business Burglary Dataset

This dataset contains business burglary data for 2010-2021.

df_burglary_biz <- db_make_query(
  "SELECT
    *
  FROM
    crimes
  WHERE
    offense_code = '2203' AND
    offense_code_extension = 0;"
)
glimpse(df_burglary_biz)
Rows: 11,893
Columns: 20
$ incident_id            <int64> 2010451, 2011967, 2012855, 2013663, 2013823, …
$ offense_id             <int64> 2010451220300, 2011967220300, 2012855220300, …
$ offense_code           <chr> "2203", "2203", "2203", "2203", "2203", "2203",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date  <dttm> 2010-01-01 02:00:00, 2010-12-31 17:00:00, 2011…
$ last_occurrence_date   <dttm> 2010-01-01 02:15:00, 2011-01-01 11:20:00, 2011…
$ reported_date          <dttm> 2010-01-01 02:20:00, 2011-01-01 12:34:00, 2012…
$ incident_address       <chr> "1410 N XAVIER ST", "424 21ST ST", "1600 N IVY …
$ geo_x                  <dbl> 3126627.0, 3145033.0, 3163221.0, 3164055.0, 316…
$ geo_y                  <dbl> 1694313.0, 1697804.0, 1695703.0, 1693846.0, 169…
$ geo_lon                <dbl> -105.0497, -104.9842, -104.9196, -104.9167, -10…
$ geo_lat                <dbl> 39.73883, 39.74815, 39.74208, 39.73697, 39.7403…
$ district_id            <chr> "1", "6", "2", "2", "2", "3", "6", "3", "3", "3…
$ precinct_id            <int> 122, 621, 222, 222, 223, 322, 611, 322, 322, 31…
$ neighborhood_id        <chr> "west-colfax", "five-points", "south-park-hill"…
$ victim_count           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year                 <int> 2010, 2010, 2011, 2013, 2012, 2012, 2015, 2017,…
$ c_month                <int> 1, 12, 12, 1, 12, 12, 1, 1, 1, 1, 12, 1, 1, 1, …
$ c_day                  <int> 1, 31, 31, 1, 31, 28, 1, 1, 1, 1, 31, 2, 3, 3, …
$ c_dow                  <int> 5, 5, 6, 2, 1, 5, 4, 7, 1, 5, 4, 6, 7, 7, 7, 1,…

Theft of Parts from a Vehicle Dataset

This dataset contains theft of parts from a vehicle data for 2010-2021.

df_theft_parts <- db_make_query(
  "SELECT
    *
  FROM
    crimes
  WHERE
    offense_code = '2304' AND
    offense_code_extension = 0;"
)

glimpse(df_theft_parts)
Rows: 27,068
Columns: 20
$ incident_id            <int64> 2017879, 2022753, 2022824, 20105223, 20108710…
$ offense_id             <int64> 2017879230400, 2022753230400, 2022824230400, …
$ offense_code           <chr> "2304", "2304", "2304", "2304", "2304", "2304",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date  <dttm> 2016-12-28 21:00:00, 2021-12-30 18:00:00, 2021…
$ last_occurrence_date   <dttm> 2016-12-29 09:00:00, 2022-01-01 09:30:00, 2021…
$ reported_date          <dttm> 2017-01-01 11:13:00, 2022-01-01 12:22:00, 2022…
$ incident_address       <chr> "5620 E MINNESOTA DR", "3870 N QUEBEC ST", "130…
$ geo_x                  <dbl> 3162734, 3168332, 3145914, 3212056, 3128393, 31…
$ geo_y                  <dbl> 1678389, 1706036, 1677387, 1706436, 1685588, 17…
$ geo_lon                <dbl> -104.9217, -104.9012, -104.9815, -104.7456, -10…
$ geo_lat                <dbl> 39.69456, 39.77035, 39.69208, 39.77056, 39.7148…
$ district_id            <chr> "3", "5", "3", "5", "4", "5", "5", "5", "6", "4…
$ precinct_id            <int> 322, 512, 313, 523, 411, 521, 521, 511, 611, 41…
$ neighborhood_id        <chr> "virginia-village", "central-park", "platt-park…
$ victim_count           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year                 <int> 2016, 2021, 2021, 2010, 2010, 2010, 2011, 2011,…
$ c_month                <int> 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12,…
$ c_day                  <int> 28, 30, 29, 4, 6, 5, 1, 1, 2, 3, 3, 25, 29, 30,…
$ c_dow                  <int> 3, 4, 3, 1, 3, 2, 6, 6, 7, 1, 1, 6, 3, 4, 3, 4,…

Theft of a Vehicle Dataset

This dataset contains vehicle theft data for 2010-2021.

df_theft_autos <- db_make_query(
  "SELECT
    *
  FROM
    crimes
  WHERE
    offense_code = '2404' AND
    offense_code_extension = 0;"
)
glimpse(df_theft_autos)
Rows: 63,652
Columns: 20
$ incident_id            <int64> 2010811, 2011712, 2011892, 2011969, 2012480, …
$ offense_id             <int64> 2010811240400, 2011712240400, 2011892240400, …
$ offense_code           <chr> "2404", "2404", "2404", "2404", "2404", "2404",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date  <dttm> 2010-01-01 06:20:00, 2010-12-31 16:00:00, 2011…
$ last_occurrence_date   <dttm> 2010-01-01 06:25:00, 2011-01-01 04:30:00, 2011…
$ reported_date          <dttm> 2010-01-01 06:51:00, 2011-01-01 05:25:00, 2011…
$ incident_address       <chr> "1250 S KNOX CT", "4750 N PEORIA ST", "1023 N D…
$ geo_x                  <dbl> 3131775, 3183596, 3134465, 3167113, 3131900, 31…
$ geo_y                  <dbl> 1678035, 1710963, 1692454, 1705365, 1683082, 16…
$ geo_lon                <dbl> -105.0317, -104.8467, -105.0219, -104.9055, -10…
$ geo_lat                <dbl> 39.69407, 39.78359, 39.73362, 39.76853, 39.7079…
$ district_id            <chr> "4", "5", "1", "2", "4", "4", "1", "1", "4", "1…
$ precinct_id            <int> 421, 521, 122, 221, 412, 423, 122, 111, 411, 11…
$ neighborhood_id        <chr> "mar-lee", "montbello", "sun-valley", "northeas…
$ victim_count           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year                 <int> 2010, 2010, 2011, 2011, 2012, 2011, 2011, 2013,…
$ c_month                <int> 1, 12, 1, 1, 1, 12, 12, 1, 1, 12, 12, 1, 1, 12,…
$ c_day                  <int> 1, 31, 1, 1, 1, 31, 30, 1, 1, 31, 31, 1, 1, 31,…
$ c_dow                  <int> 5, 5, 6, 6, 7, 6, 5, 2, 2, 1, 1, 2, 2, 1, 1, 4,…

Robberies Dataset

This dataset contains robbery data for 2010-2021. The criteria for robbery are listed at the beginning of this section.

df_robbery <- db_make_query(
  "SELECT
    *
  FROM
    crimes
  WHERE
    offense_code IN ('1204', '1205', '1206', '1210', '1212') AND
    offense_code_extension = 0;"
)
glimpse(df_robbery)
Rows: 9,487
Columns: 20
$ incident_id            <int64> 201475, 202092, 2012380, 2012670, 2012876, 20…
$ offense_id             <int64> 201475120500, 202092120500, 2012380120500, 20…
$ offense_code           <chr> "1205", "1205", "1205", "1205", "1205", "1205",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date  <dttm> 2014-01-01 00:15:00, 2020-01-01 00:19:00, 2012…
$ last_occurrence_date   <dttm> 2014-01-01 00:25:00, NA, NA, 2012-01-01 06:50:…
$ reported_date          <dttm> 2014-01-01 00:39:00, 2020-01-01 02:10:00, 2012…
$ incident_address       <chr> "26TH ST / WELTON ST", "3900 BLK N PEORIA WAY",…
$ geo_x                  <dbl> 3146414, 3183349, 3128566, 3142790, 3141412, 31…
$ geo_y                  <dbl> 1699893, 1706918, 1681624, 1696716, 1698759, 16…
$ geo_lon                <dbl> -104.9793, -104.8477, -105.0431, -104.9922, -10…
$ geo_lat                <dbl> 39.75386, 39.77249, 39.70397, 39.74519, 39.7508…
$ district_id            <chr> "2", "5", "4", "6", "6", "1", "2", "6", "6", "6…
$ precinct_id            <int> 211, 512, 412, 611, 612, 123, 223, 612, 612, 62…
$ neighborhood_id        <chr> "five-points", "central-park", "westwood", "cbd…
$ victim_count           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year                 <int> 2014, 2020, 2012, 2012, 2012, 2013, 2014, 2014,…
$ c_month                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_day                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4,…
$ c_dow                  <int> 3, 3, 7, 7, 7, 2, 3, 3, 4, 2, 3, 3, 5, 5, 6, 1,…