Chicago Homicides 2017, a Follow-Up Look with R

Guest blog post by Steve Miller.
A year ago, I posted an article on the disturbing 57% increase in Chicago homicides for 2016. There's been no shortage of loaded commentary since, including strong statements by the POTUS. A somewhat more balanced perspective was provided by FiveThirtyEight.
There can be no denying the off-the-charts Chicago homicide numbers: 493 in 2015, 777 in 2016. But while Chicago had far more murders than other U.S. cities, New York and LA included, its murder rate (homicides per 100,000 population) was exceeded by several other metropolises, including Detroit, New Orleans, St. Louis, and Baltimore.
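For reference, the murder rate is just homicides scaled per 100,000 residents. A minimal sketch of the arithmetic, using a rough 2.7 million for Chicago's population (an approximate figure, not from the data set):

# murder rate = homicides per 100,000 population
murder_rate <- function(homicides, population) 100000 * homicides / population

# Chicago 2016: 777 homicides, population assumed ~2.7 million
murder_rate(777, 2700000)   # about 28.8 per 100,000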

[Image: Al Capone]
With daily access to Chicago data (after a 7-day lag), I kept up with Chicago crime throughout 2017, looking for rays of hope in the data. Yet while there was a decline of over 100 homicides from 2016 to 2017, I was in no way inclined to claim even a minor victory. After all, the final 2017 number was still 175 more than in 2015, so the decline could simply be regression to the mean.
What follows is a look at the 2001 through January 2018 Chicago homicide data, embellished by additional numbers from Wikipedia.
The technologies deployed are JupyterLab with an R 3.4 kernel. The scripts are driven primarily by the R data.table and tidyverse packages. Hopefully, readers will see just how powerful these tools are in collaboration. Notably, neither data.table nor tidyverse is part of "core" R; each is an add-on maintained by the energetic R ecosystem.
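For readers who don't already have the add-ons, a minimal one-time setup sketch (package names as loaded below; CRAN assumed as the source; splines ships with R itself):

# one-time install of the non-core packages used in this notebook
install.packages(c("data.table", "tidyverse", "dtplyr", "lubridate",
                   "fst", "tidyquant", "zoo", "xts", "forecast", "smooth"))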

Set a few R options, load some libraries, change the working directory, and assign input file names.

In [122]:
options(warn=-1)
options(scipen=10)

suppressMessages(library(data.table))
suppressMessages(library(dplyr))
suppressMessages(library(dtplyr))
suppressMessages(library(lubridate))
suppressMessages(library(purrr))
suppressMessages(library(tidyverse))
suppressMessages(library(forcats))
suppressMessages(library(fst))
suppressMessages(library(tidyquant))
suppressMessages(library(zoo))
suppressMessages(library(xts))
suppressMessages(library(splines))
suppressMessages(library(forecast))
suppressMessages(library(smooth))

setwd("c:/bigdata/raw/chicago")

cname <- "crimeplus.fst"
fname <- "fbicode.txt"
wname <- "homicidefromwicki.csv"


Define two useful frequencies functions. We don't use allfreqs here, but it computes individual frequencies for all attributes in an R data.table.

In [123]:
frequenciesdyn <- function(DTstr, xstr) {
    # frequencies and percents for expression xstr in the data.table named DTstr
    return(eval(parse(text=sprintf('%s[,.(count=.N),.(%s)]', DTstr, xstr))) %>%
           arrange(desc(count)) %>%
           mutate(percent=100*count/sum(count)))
}

In [124]:
allfreqs <- function(dtn) {
    # individual frequencies for every attribute of the data.table named dtn
    dt <- eval(parse(text=paste("data.table(", dtn, ")", sep="")))
    nmes <- names(dt)
    lst <- map(nmes, function(nm) frequenciesdyn(dtn, nm))
    names(lst) <- nmes
    return(lst)
}
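To see what frequenciesdyn produces, here's a hypothetical illustration on a toy data.table (not part of the analysis):

# toy data.table for illustration only
toy <- data.table(grp = c("a", "a", "b"))
frequenciesdyn("toy", "grp")
# returns grp/count/percent: a -> 2 (~66.7%), b -> 1 (~33.3%)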


Read the enhanced Chicago crime data set into an R data.table, chicagocrime, projecting the attributes of interest. The underlying raw data are initially loaded and enriched in another JupyterLab process. This data.table is keyed on fbicodedesc.

In [125]:
varsout <- c("date","fbicode","fbicodedesc","communityarea","name","longitude","latitude")
chicagocrime <- setkey(read_fst(cname,varsout,as.data.table=TRUE),fbicodedesc)
str(chicagocrime)

Classes 'data.table' and 'data.frame': 6532900 obs. of 7 variables:
 $ date         : POSIXct, format: "2006-03-11 02:00:00" "2006-03-20 15:40:00" ...
 $ fbicode      : Factor w/ 26 levels "01A","01B","02",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ fbicodedesc  : Factor w/ 26 levels "Aggravated Assault",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ communityarea: num 22 23 29 22 61 30 54 75 49 38 ...
 $ name         : Factor w/ 77 levels "Albany Park",..: 41 33 52 41 50 64 59 45 61 29 ...
 $ longitude    : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ latitude     : num 41.9 41.9 41.9 41.9 41.8 ...
 - attr(*, ".internal.selfref")=<externalptr>
 - attr(*, "sorted")= chr "fbicodedesc"


The last date for this iteration of the crime data just so happens to be Jan 31, 2018.

In [126]:
max(chicagocrime$date)

[1] “2018-01-31 23:55:00 UTC”


Include an enrichment to the Chicago crime data from here. fbicode is in the Chicago crime data set; I built the (code, description) file from this, then joined it with the raw Chicago data. Note the various "dimensions" of crime.

In [127]:
fbicodes <- fread(fname)

idx <- c("01A")
homicide <- fbicodes[fbicode %in% idx,fbicodedesc]
nhomicide <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("01A","01B","02","03","04A","04B")
violentcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nviolentcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("05","06","07","09")
propertycrime <- fbicodes[fbicode %in% idx,fbicodedesc]
npropertycrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]

idx <- c("01A","02","03","04A","04B","05","06","07","09")
indexcrime <- fbicodes[fbicode %in% idx,fbicodedesc]
nindexcrime <- fbicodes[fbicode %in% setdiff(fbicode,idx),fbicodedesc]
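The enrichment join itself happens upstream in another JupyterLab process. A hypothetical sketch of what it might look like, assuming rawcrime is the name of the raw Chicago extract with an fbicode column:

# attach the (fbicode, fbicodedesc) lookup to the raw crime data
# rawcrime is an assumed name, not defined in this notebook
enriched <- fbicodes[rawcrime, on = "fbicode"]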


Consider homicide frequencies by year from the chicagocrime data, 2001 through Jan 2018. For the plots that immediately follow, we'll include only 2001-2017. The frequenciesdyn function is quite handy. Again, note the collaboration of data.table and tidyverse.

In [128]:
murders <- frequenciesdyn("chicagocrime[homicide]","year=year(date)") %>%
           arrange(desc(year)) %>%
           select(count,year) %>%
           mutate(where="D")
murders

count  year  where
   39  2018  D
  668  2017  D
  777  2016  D
  493  2015  D
  424  2014  D
  421  2013  D
  504  2012  D
  436  2011  D
  438  2010  D
  460  2009  D
  513  2008  D
  447  2007  D
  471  2006  D
  451  2005  D
  453  2004  D
  601  2003  D
  656  2002  D
  667  2001  D


Graph homicides by year, 2001-2017, from the murders data.table. Note the huge spike from 2015 to 2016.

In [129]:
ggplot(murders[year<2018],aes(x=year,y=count)) +
    geom_point() +
    geom_line() +
    ylim(0,1000) +
    labs(title="Chicago Homicides, 2001-2017", x="Year", y="Homicides")


Add more years of Chicago homicides from Wikipedia. Ultimately, there are murder counts for 1974-2017. For 2001-2017, I use the chicagocrime figures; for 1974-2000, the numbers come from Wikipedia.

In [130]:
wickimurders <- setnames(fread(wname),c("year","count")) %>% mutate(where="W")
murderscombined <- rbind(murders,wickimurders) %>% filter(year<2018)


As disturbing as the 2016 homicide figure is, it's unfortunately not an outlier in Chicago history. Chicago's population in the 70s and 80s was about 25% higher than it is now, so the murder rates (homicides per 100,000 population) back then were correspondingly lower than the same homicide counts would produce today. Still, Chicago has a long history of violence.

In [131]:
ggplot(murderscombined[(between(year,2001,2017) & where=="D") | (year<2001)],
       aes(x=year,y=count,color=where)) +
    geom_point() +
    geom_line() +
    ylim(0,1000) +
    xlim(1973,2018) +
    labs(title="Chicago Homicides, 1974-2017", x="Year", y="Homicides")
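A back-of-envelope sketch of that population adjustment, using rough figures that are assumptions rather than values from the data set (Chicago's population was about 3.2 million in the mid-1970s vs roughly 2.7 million now; Wikipedia puts 1974 homicides at 970):

# homicides per 100,000 population, then vs now
# all population figures below are rough assumptions
100000 * 970 / 3200000   # 1974: ~30.3 per 100,000
100000 * 777 / 2700000   # 2016: ~28.8 per 100,000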


Violent crime in Chicago declined steadily from 2001 to 2014, with an upturn after that. These figures are in line with violent crime decreases across the country. Is the 2016 rise in Chicago just a blip?

In [132]:
violent <- frequenciesdyn("chicagocrime[violentcrime][year(date)<2018]", "year=year(date)")

In [133]:
ggplot(violent,aes(x=year,y=count)) +
    geom_point() +
    geom_line() +
    ylim(0,50000) +
    geom_smooth(method="loess",se=FALSE) +
    labs(title="Chicago Violent Crime, 2001-2017", x="Year", y="# Violent Crimes")


Now contrast homicides in 2017 with 2016 and with the average of 2013-2015. After the more than 50% increase in 2016, the 100-plus decline in 2017 could just be regression to the mean, and it's still a long way up from the 446 average of 2013-2015. Note that as of the end of July 2017, 2017 homicides were actually running higher than the disturbing 2016 figures.

In [134]:
nmurders <- frequenciesdyn("chicagocrime[homicide][between(year(date),2013,2017)]",
                           "year=year(date),month=month(date)")[
    ,ryear:=factor("2013-2015")][year==2016,ryear:="2016"][year==2017,ryear:="2017"][
    ,rmonth:=factor("Jan-Jul")][month>7,rmonth:="Aug-Dec"][
    order(year,month)] %>% select(-percent)
str(nmurders)

Classes 'data.table' and 'data.frame': 60 obs. of 5 variables:
 $ year  : num 2013 2013 2013 2013 2013 ...
 $ month : num 1 2 3 4 5 6 7 8 9 10 ...
 $ count : int 41 13 16 25 45 44 47 51 42 32 ...
 $ ryear : Factor w/ 3 levels "2013-2015","2016",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ rmonth: Factor w/ 2 levels "Jan-Jul","Aug-Dec": 1 1 1 1 1 1 1 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

In [135]:
ggwork <- nmurders[,.(count=round(mean(count))),.(ryear,month)]
Y <- ggwork[,.(cumcount=max(cumsum(count))),by=.(ryear)]$cumcount
X <- max(ggwork$month)

ggplot(ggwork[,.(cumcount=cumsum(count),month),.(ryear)],
       aes(x=month, y=cumcount, color=ryear)) +
    geom_point(size=2) +
    geom_line(size=1) +
    ylim(c(0,800)) +
    scale_x_discrete(limits=(1:12), labels=levels(month(Sys.Date(),label=TRUE))) +
    annotate("text", x = X + .5, y = Y, label = Y) +
    labs(title="Chicago Homicides, 2013-2017", x="Month", y="Cumulative Homicides")


Let's dig a little deeper into these numbers, contrasting January-July with August-December. In this breakdown, the last five months of 2017 look like quite an improvement over 2016, resembling 2013-2015 more than 2016. There's also anecdotal evidence that new analytics and community policing initiatives supported the declines. Alas, call me skeptically hopeful rather than optimistic.

In [136]:
ggwork <- nmurders[,.(count=round(mean(count))),.(ryear,rmonth,month)]

ggplot(ggwork[,.(month,cumcount=cumsum(count)),.(ryear,rmonth)],
       aes(x=month, y=cumcount, color=ryear)) +
    geom_point(size=2) +
    geom_line(size=1) +
    facet_grid( ~ rmonth) +
    scale_x_discrete(limits=(1:12), labels=levels(month(Sys.Date(),label=TRUE))) +
    theme(axis.text.x = element_text(size=5, angle = 45)) +
    labs(title="Chicago Homicides, 2013-2017", x="Month", y="Cumulative Homicides")


So even being a week behind with the chicagocrime data, what can we say about 2018 so far? Let's build a little data.table to compare January homicides with annual homicides from 2001-2018. Notice that there were 39 homicides in January 2018, vs 54 in January of both 2016 and 2017. Is this welcome decline a continuation of the last five months of 2017?

In [137]:
dt <- chicagocrime[, max(date)]        # last timestamp in the data
dtmax <- date(dt)+days(1)-seconds(1)   # end of that final day, 23:59:59
dtmax
date(dtmax)
numdays <- yday(dtmax)                 # day of year of the final date
numdays
month(dt)
month(dt)

[1] “2018-01-31 23:59:59 UTC”

2018-01-31

31

1

In [138]:
emurders <- cbind(
    frequenciesdyn("chicagocrime[homicide][yday(date)<=numdays]","year=year(date)") %>%
        arrange(desc(year)) %>% select(-percent) %>%
        mutate(numdays,month=month(dt,label=TRUE,abbr=TRUE)),
    frequenciesdyn("chicagocrime[homicide][month(date)<=month(dt)]","year=year(date)") %>%
        arrange(desc(year)) %>% select(-percent,-year),
    frequenciesdyn("chicagocrime[homicide]","year=year(date)") %>%
        arrange(desc(year)) %>% select(-percent,-year)
    ) %>% setnames(c("year","sofar","numdays","month","monthend","yearend"))

In [139]:
emurders

year  sofar  numdays  month  monthend  yearend
2018     39       31    Jan        39       39
2017     54       31    Jan        54      668
2016     54       31    Jan        54      777
2015     30       31    Jan        30      493
2014     20       31    Jan        20      424
2013     41       31    Jan        41      421
2012     39       31    Jan        39      504
2011     28       31    Jan        28      436
2010     22       31    Jan        22      438
2009     25       31    Jan        25      460
2008     34       31    Jan        34      513
2007     26       31    Jan        26      447
2006     27       31    Jan        27      471
2005     20       31    Jan        20      451
2004     29       31    Jan        29      453
2003     39       31    Jan        39      601
2002     45       31    Jan        45      656
2001     42       31    Jan        42      667


Graph the relationship between January and annual homicides. From the points and the loess smooth, there's clearly an increasing relationship, though the fit is much better at lower values.

In [140]:
ggplot(melt(emurders[-1],id.vars=c("year","month","numdays","yearend"),
            variable.name="counts", value.name="value"),
       aes(x=value,y=yearend,col=counts)) +
    geom_point() +
    geom_line() +
    geom_smooth(method="loess",se=FALSE) +
    facet_grid( ~ counts) +
    xlim(0,60) +
    ylim(0,1000) +
    theme(legend.position="none") +
    labs(title="Chicago Homicides, 2001-2017", x="January Homicides", y="Annual Homicides")
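One quick way to put a number on that relationship, as a sketch: the Pearson correlation between January and full-year counts, reusing the emurders table from above (emurders[-1] drops the incomplete 2018 row, as in the cells above).

# correlation between January and full-year homicide counts, 2001-2017
cor(emurders[-1]$sofar, emurders[-1]$yearend)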


Can we use this data to forecast Chicago's 2018 homicides? Why not! I average the predictions of a quick-and-dirty spline regression model and two ARIMA time series models to come up with an estimate of 577 Chicago homicides for 2018. That figure certainly passes the smell test: a second consecutive decline of about 100 after the off-the-charts increase of almost 300 in 2016. I'm hopeful.

In [141]:
mod0 <- lm(yearend~ns(sofar,2),emurders[-1])                  # spline regression
p0 <- round(as.numeric(predict(mod0,newdata=emurders[1])))

amod <- auto.arima(emurders[-1]$yearend)                      # ARIMA, no regressor
p1 <- round(as.numeric(forecast(amod,h=1)$mean))

bmod <- auto.arima(emurders[-1]$yearend,xreg=emurders[-1]$sofar)  # ARIMA with January regressor
p2 <- round(as.numeric(forecast(bmod,h=1,xreg=emurders[1]$sofar)$mean))

p0
p1
p2

cat("\n\n")
round((p0+p1+p2)/3)

545

621

566

577


Finally, compare January 2018 homicides to January 2017, 2016, and the 2013-2015 average. That the January 2018 number is closer to the 2013-2015 average than to the 2017 and 2016 figures is heartening. The violent crime numbers show a similar decline.

In [142]:
frequenciesdyn("chicagocrime[homicide][between(year(date),2013,2018)][yday(date)<=numdays]",
               "year=year(date)")[
    ,ryear:=factor("2013-2015")][year==2016,ryear:="2016"][year==2017,ryear:="2017"][year==2018,ryear:="2018"][
    ,.(numdays,homicidessofar=round(mean(count))),.(ryear)] %>% arrange(desc(ryear))

ryear      numdays  homicidessofar
2018            31              39
2017            31              54
2016            31              54
2013-2015       31              30

In [143]:
frequenciesdyn("chicagocrime[violentcrime][between(year(date),2013,2018)][yday(date)<=numdays]",
               "year=year(date)")[
    ,ryear:=factor("2013-2015")][year==2016,ryear:="2016"][year==2017,ryear:="2017"][year==2018,ryear:="2018"][
    ,.(numdays,violentcrimesofar=round(mean(count))),.(ryear)] %>% arrange(desc(ryear))

ryear      numdays  violentcrimesofar
2018            31               1964
2017            31               2263
2016            31               2124
2013-2015       31               1774


I plan to stay close to the data in 2018, hopeful but not quite confident of a return to pre-2016 homicide and violence levels.


