Rows: 294
Columns: 6
$ offense_code <chr> "2804", "2804", "2901", "2902", "2903", "2999",…
$ offense_code_extension <int> 1, 2, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0,…
$ offense_type_id <chr> "stolen-property-possession", "fraud-possess-fi…
$ offense_type_name <chr> "Possession of stolen property", "Possession of…
$ offense_category_id <chr> "all-other-crimes", "all-other-crimes", "public…
$ offense_category_name <chr> "All Other Crimes", "All Other Crimes", "Public…
Data Preparation
Preparing the database
The majority of the data preparation for this analysis was done using SQL scripts. The process can be broken down into 5 steps: initial data import, year by year cleanup, creating master data table, creating views, creating a cleanup stored procedure to automate that process.
Initial data import
The initial data import consisted of creating tables for each year’s data and then loading data from comma separated values (CSV) files into the respective tables. The offense codes were loaded into a separate table to be joined with the crime data when necessary.
Year by year cleanup
Each year of data was initially stored in its own table in the database. It was then inspected to ensure that the data was as complete and coherent as possible. The clean-up scripts can be viewed on github from the links below.
The master data table
After the clean-up process was completed, the data where copied into a single table, crimes and the original yearly tables were dropped.
Creating views
For convenience, SQL views were created to supply aggregate data.
Automation with stored procedure
In order to make the data preparation process repeatable and reproducible, all of the cleaning and processing steps were placed into a stored procedure that cn be run from the PostgreSQL admin application.
Creating datasets for analysis
Create 5 datasets, one for each of the crime types of interest:
- Burglary of a residence with forced entry
- Burglary of a business with forced entry
- Theft of parts from a vehicle
- Theft of a vehicle
- Robberies that fall under the following offence codes:
- 1204 - Robbery of a person in the open using a gun
- 1205 - Robbery of a person in the open
- 1206 - Robbery of a person in the open using bodily force
- 1210 - Forcible purse snatching
- 1212 - Carjacking, armed
Offense codes
The offense codes are critical to understanding the crime data. They categorize crimes stored in the crimes table, allowing for the 5 crime datasets listed above to be created.
Residential Burglary Dataset
This dataset contains residential burglary data for 2010-2021.
# create the datasets
df_burglary_res <- db_make_query(
"SELECT
*
FROM
crimes
WHERE
offense_code = '2204' AND
offense_code_extension = 0;"
)
glimpse(df_burglary_res)Rows: 20,500
Columns: 20
$ incident_id <int64> 2012266, 2012346, 2012681, 2012861, 2014347, …
$ offense_id <int64> 2012266220400, 2012346220400, 2012681220400, …
$ offense_code <chr> "2204", "2204", "2204", "2204", "2204", "2204",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date <dttm> 2012-01-01 00:00:00, 2011-12-31 20:30:00, 2011…
$ last_occurrence_date <dttm> 2012-01-01 00:10:00, 2012-01-01 02:22:00, 2011…
$ reported_date <dttm> 2012-01-01 03:08:00, 2012-01-01 02:22:00, 2012…
$ incident_address <chr> "3635 N LAFAYETTE ST", "4095 N ANDES CT", "761 …
$ geo_x <dbl> 3148644, 3205088, 3151414, 3146974, 3129141, 31…
$ geo_y <dbl> 1704917, 1707035, 1690673, 1698995, 1679505, 17…
$ geo_lon <dbl> -104.9712, -104.7704, -104.9617, -104.9773, -10…
$ geo_lat <dbl> 39.76762, 39.77236, 39.72847, 39.75139, 39.6981…
$ district_id <chr> "2", "5", "3", "2", "4", "1", "1", "3", "6", "3…
$ precinct_id <int> 211, 523, 311, 211, 412, 111, 113, 322, 622, 31…
$ neighborhood_id <chr> "cole", "gateway-green-valley-ranch", "country-…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year <int> 2012, 2011, 2011, 2012, 2013, 2013, 2013, 2014,…
$ c_month <int> 1, 12, 12, 1, 12, 12, 12, 12, 12, 1, 1, 1, 12, …
$ c_day <int> 1, 31, 27, 1, 31, 31, 31, 28, 29, 1, 1, 1, 31, …
$ c_dow <int> 7, 6, 2, 7, 2, 2, 2, 7, 1, 5, 7, 7, 7, 1, 2, 1,…
Business Burglary Dataset
This dataset contains business burglary data for 2010-2021.
df_burglary_biz <- db_make_query(
"SELECT
*
FROM
crimes
WHERE
offense_code = '2203' AND
offense_code_extension = 0;"
)
glimpse(df_burglary_biz)Rows: 11,893
Columns: 20
$ incident_id <int64> 2010451, 2011967, 2012855, 2013663, 2013823, …
$ offense_id <int64> 2010451220300, 2011967220300, 2012855220300, …
$ offense_code <chr> "2203", "2203", "2203", "2203", "2203", "2203",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date <dttm> 2010-01-01 02:00:00, 2010-12-31 17:00:00, 2011…
$ last_occurrence_date <dttm> 2010-01-01 02:15:00, 2011-01-01 11:20:00, 2011…
$ reported_date <dttm> 2010-01-01 02:20:00, 2011-01-01 12:34:00, 2012…
$ incident_address <chr> "1410 N XAVIER ST", "424 21ST ST", "1600 N IVY …
$ geo_x <dbl> 3126627.0, 3145033.0, 3163221.0, 3164055.0, 316…
$ geo_y <dbl> 1694313.0, 1697804.0, 1695703.0, 1693846.0, 169…
$ geo_lon <dbl> -105.0497, -104.9842, -104.9196, -104.9167, -10…
$ geo_lat <dbl> 39.73883, 39.74815, 39.74208, 39.73697, 39.7403…
$ district_id <chr> "1", "6", "2", "2", "2", "3", "6", "3", "3", "3…
$ precinct_id <int> 122, 621, 222, 222, 223, 322, 611, 322, 322, 31…
$ neighborhood_id <chr> "west-colfax", "five-points", "south-park-hill"…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year <int> 2010, 2010, 2011, 2013, 2012, 2012, 2015, 2017,…
$ c_month <int> 1, 12, 12, 1, 12, 12, 1, 1, 1, 1, 12, 1, 1, 1, …
$ c_day <int> 1, 31, 31, 1, 31, 28, 1, 1, 1, 1, 31, 2, 3, 3, …
$ c_dow <int> 5, 5, 6, 2, 1, 5, 4, 7, 1, 5, 4, 6, 7, 7, 7, 1,…
Theft of Parts from a Vehicle Dataset
This dataset contains theft of parts from a vehicle data for 2010-2021.
df_theft_parts <- db_make_query(
"SELECT
*
FROM
crimes
WHERE
offense_code = '2304' AND
offense_code_extension = 0;"
)
glimpse(df_theft_parts)Rows: 27,068
Columns: 20
$ incident_id <int64> 2017879, 2022753, 2022824, 20105223, 20108710…
$ offense_id <int64> 2017879230400, 2022753230400, 2022824230400, …
$ offense_code <chr> "2304", "2304", "2304", "2304", "2304", "2304",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date <dttm> 2016-12-28 21:00:00, 2021-12-30 18:00:00, 2021…
$ last_occurrence_date <dttm> 2016-12-29 09:00:00, 2022-01-01 09:30:00, 2021…
$ reported_date <dttm> 2017-01-01 11:13:00, 2022-01-01 12:22:00, 2022…
$ incident_address <chr> "5620 E MINNESOTA DR", "3870 N QUEBEC ST", "130…
$ geo_x <dbl> 3162734, 3168332, 3145914, 3212056, 3128393, 31…
$ geo_y <dbl> 1678389, 1706036, 1677387, 1706436, 1685588, 17…
$ geo_lon <dbl> -104.9217, -104.9012, -104.9815, -104.7456, -10…
$ geo_lat <dbl> 39.69456, 39.77035, 39.69208, 39.77056, 39.7148…
$ district_id <chr> "3", "5", "3", "5", "4", "5", "5", "5", "6", "4…
$ precinct_id <int> 322, 512, 313, 523, 411, 521, 521, 511, 611, 41…
$ neighborhood_id <chr> "virginia-village", "central-park", "platt-park…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year <int> 2016, 2021, 2021, 2010, 2010, 2010, 2011, 2011,…
$ c_month <int> 12, 12, 12, 1, 1, 1, 1, 1, 1, 1, 1, 12, 12, 12,…
$ c_day <int> 28, 30, 29, 4, 6, 5, 1, 1, 2, 3, 3, 25, 29, 30,…
$ c_dow <int> 3, 4, 3, 1, 3, 2, 6, 6, 7, 1, 1, 6, 3, 4, 3, 4,…
Theft of a Vehicle Dataset
This dataset contains vehicle theft data for 2010-2021.
df_theft_autos <- db_make_query(
"SELECT
*
FROM
crimes
WHERE
offense_code = '2404' AND
offense_code_extension = 0;"
)
glimpse(df_theft_autos)Rows: 63,652
Columns: 20
$ incident_id <int64> 2010811, 2011712, 2011892, 2011969, 2012480, …
$ offense_id <int64> 2010811240400, 2011712240400, 2011892240400, …
$ offense_code <chr> "2404", "2404", "2404", "2404", "2404", "2404",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date <dttm> 2010-01-01 06:20:00, 2010-12-31 16:00:00, 2011…
$ last_occurrence_date <dttm> 2010-01-01 06:25:00, 2011-01-01 04:30:00, 2011…
$ reported_date <dttm> 2010-01-01 06:51:00, 2011-01-01 05:25:00, 2011…
$ incident_address <chr> "1250 S KNOX CT", "4750 N PEORIA ST", "1023 N D…
$ geo_x <dbl> 3131775, 3183596, 3134465, 3167113, 3131900, 31…
$ geo_y <dbl> 1678035, 1710963, 1692454, 1705365, 1683082, 16…
$ geo_lon <dbl> -105.0317, -104.8467, -105.0219, -104.9055, -10…
$ geo_lat <dbl> 39.69407, 39.78359, 39.73362, 39.76853, 39.7079…
$ district_id <chr> "4", "5", "1", "2", "4", "4", "1", "1", "4", "1…
$ precinct_id <int> 421, 521, 122, 221, 412, 423, 122, 111, 411, 11…
$ neighborhood_id <chr> "mar-lee", "montbello", "sun-valley", "northeas…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year <int> 2010, 2010, 2011, 2011, 2012, 2011, 2011, 2013,…
$ c_month <int> 1, 12, 1, 1, 1, 12, 12, 1, 1, 12, 12, 1, 1, 12,…
$ c_day <int> 1, 31, 1, 1, 1, 31, 30, 1, 1, 31, 31, 1, 1, 31,…
$ c_dow <int> 5, 5, 6, 6, 7, 6, 5, 2, 2, 1, 1, 2, 2, 1, 1, 4,…
Robberies Dataset
This dataset contains robbery data for 2010-2021. The criteria for robbery are listed at the beginning of this section.
df_robbery <- db_make_query(
"SELECT
*
FROM
crimes
WHERE
offense_code IN ('1204', '1205', '1206', '1210', '1212') AND
offense_code_extension = 0;"
)
glimpse(df_robbery)Rows: 9,487
Columns: 20
$ incident_id <int64> 201475, 202092, 2012380, 2012670, 2012876, 20…
$ offense_id <int64> 201475120500, 202092120500, 2012380120500, 20…
$ offense_code <chr> "1205", "1205", "1205", "1205", "1205", "1205",…
$ offense_code_extension <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ first_occurrence_date <dttm> 2014-01-01 00:15:00, 2020-01-01 00:19:00, 2012…
$ last_occurrence_date <dttm> 2014-01-01 00:25:00, NA, NA, 2012-01-01 06:50:…
$ reported_date <dttm> 2014-01-01 00:39:00, 2020-01-01 02:10:00, 2012…
$ incident_address <chr> "26TH ST / WELTON ST", "3900 BLK N PEORIA WAY",…
$ geo_x <dbl> 3146414, 3183349, 3128566, 3142790, 3141412, 31…
$ geo_y <dbl> 1699893, 1706918, 1681624, 1696716, 1698759, 16…
$ geo_lon <dbl> -104.9793, -104.8477, -105.0431, -104.9922, -10…
$ geo_lat <dbl> 39.75386, 39.77249, 39.70397, 39.74519, 39.7508…
$ district_id <chr> "2", "5", "4", "6", "6", "1", "2", "6", "6", "6…
$ precinct_id <int> 211, 512, 412, 611, 612, 123, 223, 612, 612, 62…
$ neighborhood_id <chr> "five-points", "central-park", "westwood", "cbd…
$ victim_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_year <int> 2014, 2020, 2012, 2012, 2012, 2013, 2014, 2014,…
$ c_month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ c_day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4,…
$ c_dow <int> 3, 3, 7, 7, 7, 2, 3, 3, 4, 2, 3, 3, 5, 5, 6, 1,…