San Francisco has seen an explosion of open drug offenses in recent years. Enraged residents have complained of stepping over people injecting heroin in train stations and of walking past endless lines of encampments along sidewalks and bike paths. The piles of discarded needles ending up on the street have made this a major problem for public health officials as well. According to the San Francisco Department of Public Health, there are about 22,000 intravenous drug users in the city, or about 470 per square mile.
Federal policies, along with written and unwritten local rules surrounding redlining and mortgage discrimination, have led to low-income minorities settling into segregated neighborhoods, many of which show a pattern of blight and disrepair. Many drug offenses go unobserved, or worse, police officers may selectively choose to enforce drug crimes more fervently in low-income communities than in others, which can lead to over-reporting and hence over-policing in these communities.
In this assignment we model the latent risk of drug offenses in San Francisco based on various risk factors, including the number of 311 requests (for encampments, medical waste, graffiti, etc.), the presence of nightlife amenities, transit stations, and alcohol stores, and nearest neighbor distances to each of these. Many of these variables, such as encampments, medical waste, and graffiti, may be highly clustered in specific neighborhoods or racial communities, where drug offenses may in turn be over-reported; all of these variables therefore carry a high likelihood of selection bias. Hence, we also look at the performance of our model across different neighborhoods and racial groups. Finally, we compare our risk model to the more traditionally used kernel density model to see where it stands.
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(lubridate)
library(osmdata)
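# Helper functions (e.g., mapTheme and plotTheme, used throughout) from the Public Policy Analytics book repository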
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
We start the analysis by gathering the data we need: neighborhood boundaries, 2018 census tract data, and drug offense data for 2018 and 2019 for the city of San Francisco.
We focus on building a drug offense risk prediction model using 2018 crime data and then test the goodness of the model on 2019 crime data. In addition, we check the generalizability of the model across neighborhoods and racial groups.
# CRS
crs <- st_crs("+proj=lcc +lat_1=40 +lat_2=41.66666666666666 +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000 +y_0=500000 +ellps=GRS80 +units=m +no_defs")
# Getting the neighborhoods
neighborhoods <-
st_read("https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON") %>%
st_transform(crs=crs) %>%
dplyr::select(name)
# Getting the census tracts information
tracts18 <- get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"),
year = 2018, state='California', county='San Francisco' , geometry=T, output = "wide", progress_bar=FALSE) %>%
st_transform(crs=crs) %>%
filter(!GEOID %in% c("06075980401")) %>%
rename(TotalPop = B01001_001E,
NumberWhites = B01001A_001E) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White")
) %>%
.[neighborhoods,]
# Creating the San Francisco boundary
sfBoundary <- st_union(tracts18) %>% st_sf()
# Making a fishnet using the boundary
fishnet <-
st_make_grid(sfBoundary, cellsize = 350) %>%
.[sfBoundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
# Getting drug offense crime data for year 2018
drugOffense18 <- read.socrata("https://data.sfgov.org/resource/wg3w-h783.json?incident_year=2018&incident_category=Drug Offense") %>%
dplyr::select(latitude, longitude) %>%
mutate(year = 2018) %>%
drop_na() %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant")%>%
st_transform(crs=crs) %>%
distinct()
# Getting drug offense crime data for year 2019
drugOffense19 <- read.socrata("https://data.sfgov.org/resource/wg3w-h783.json?incident_year=2019&incident_category=Drug Offense") %>%
dplyr::select(latitude, longitude) %>%
mutate(year = 2019) %>%
drop_na() %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant")%>%
st_transform(crs=crs) %>%
distinct()
After getting the crime data, we visualize it in point and density form. As can be seen from Figure 1, both plots show a clear hotspot of drug crimes in the north-eastern part of the city.
We discuss the effects of selection bias in Section 1 of this analysis.
# Mapping outcome of interest
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = sfBoundary) +
geom_sf(data = drugOffense18, colour="red", size=0.1, show.legend = "point") +
labs(title= "Drug Offense",
subtitle = "SF-2018") +
mapTheme(title_size = 14),
ggplot() +
geom_sf(data = sfBoundary, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(drugOffense18)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Drug Offenses",
subtitle = "SF-2018") +
mapTheme(title_size = 14) + theme(legend.position = "none"), bottom="Figure 1: Map of drug offenses in point and density form")
In order to model drug offense risk not as a phenomenon that varies across administrative units, but as something that varies smoothly across the landscape, we lay a fishnet over the San Francisco boundary, with each cell measuring 350 m x 350 m (the units of the projection above are meters), and then aggregate the crimes over these cells. Figure 2 shows the map of the drug offenses joined to the fishnet. Consistent with the observation in the section above, a clustered pattern of drug offenses can be seen in the north-eastern part of the city.
# Aggregate point data on the fishnet
crime_net <-
dplyr::select(drugOffense18) %>%
mutate(countDrugOffenses = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countDrugOffenses = replace_na(countDrugOffenses, 0),
uniqueID = rownames(.),
# Randomly assign each cell to one of ~nrow(fishnet)/24 groups for k-fold CV
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
# Plotting
ggplot() +
geom_sf(data = crime_net, aes(fill = countDrugOffenses)) +
scale_fill_viridis() +
labs(title = "Count of Drug Offenses for the Fishnet",
subtitle = "SF-2018",
caption = "Figure 2: Map of drug offenses joined to fishnet") +
mapTheme()
Next, we engineer features to model the risk of drug crimes as a function of exposure to a series of geospatial risk and protective factors. Specifically, we consider four types of features, described in the following sections.
These features correspond to the count of occurrences of a risk factor within each fishnet cell. The risk factors include:
311 calls reporting problems like medical waste (needles), encampments, illegal postings, etc.
Geospatial locations of alcohol stores, nightlife amenities (bars, pubs, and nightclubs - potential drug offense sites), and transit stations.
It is interesting to note from Figure 3 that the counts of 311 calls complaining about encampments and medical/human waste are high in regions where the count of drug offenses is high.
# Uncomment the code below to generate the features in real time
# # Graffiti
#
# graffiti <- read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Graffiti") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Graffiti") %>%
# .[sfBoundary, ]
#
# save(graffiti, file="graffiti.RData")
#
# # Medical Waste
#
# medicalWaste <- read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Street and Sidewalk Cleaning&service_subtype=Medical Waste") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Medical_Waste") %>%
# .[sfBoundary, ]
#
# save(medicalWaste, file="medicalWaste.RData")
#
#
# # Human and animal waste
#
# humanWaste <- read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Street and Sidewalk Cleaning&service_subtype=Human or Animal Waste") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Human_Animal_Waste") %>%
# .[sfBoundary, ]
#
# save(humanWaste, file="humanWaste.RData")
#
#
# # Illegal postings
#
# illegalPostings <-
# read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Illegal Postings") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Illegal_Postings") %>%
# .[sfBoundary, ]
#
# save(illegalPostings, file="illegalPostings.RData")
#
#
# # Entertainment Noise
#
# entertainmentNoise <-
# read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Noise Report&service_subtype=entertainment") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Entertainment_Noise") %>%
# .[sfBoundary, ]
#
# save(entertainmentNoise, file="entertainmentNoise.RData")
#
#
# # Encampments
#
# encampments <-
# read.socrata("https://data.sfgov.org/resource/vw6y-z8j6.json?service_name=Encampments") %>%
# mutate(year = year(requested_datetime)) %>%
# filter(year == 2018) %>%
# dplyr::select(Y = point.latitude, X = point.longitude) %>%
# na.omit() %>%
# filter(X < 0.0) %>%
# st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
# st_transform(st_crs(fishnet)) %>%
# mutate(Legend = "Encampments") %>%
# .[sfBoundary, ]
#
# save(encampments, file="encampments.RData")
#
# # Nightlife
#
# sfBoundary.4326 <- sfBoundary %>% st_transform(crs=4326)
#
# xmin <- st_bbox(sfBoundary.4326)[[1]]
# ymin <- st_bbox(sfBoundary.4326)[[2]]
# xmax <- st_bbox(sfBoundary.4326)[[3]]
# ymax <- st_bbox(sfBoundary.4326)[[4]]
#
# bbox <- c(xmin, ymin, xmax, ymax)
#
# nightlife <- opq(bbox = bbox) %>%
# add_osm_feature(key = 'amenity', value = c("bar", "nightclub", "pub")) %>%
# osmdata_sf()
#
# nightlife <-
# nightlife$osm_points %>%
# .[sfBoundary.4326,] %>%
# st_transform(crs=crs) %>%
# mutate(Legend = "Nightlife") %>%
# dplyr::select(Legend)
#
# save(nightlife, file="nightlife.RData")
#
# # Alcohol Stores
#
# alcoholStores <- opq(bbox = bbox) %>%
# add_osm_feature(key = 'shop', value = "alcohol") %>%
# osmdata_sf()
#
# alcoholStores <-
# alcoholStores$osm_points %>%
# .[sfBoundary.4326,] %>%
# st_transform(crs=crs) %>%
# mutate(Legend = "Alcohol_Store") %>%
# dplyr::select(Legend)
#
# save(alcoholStores, file="alcoholStores.RData")
#
#
# # Station
#
# transit_station <- opq(bbox = bbox) %>%
# add_osm_feature(key = 'railway', value = c("station","subway")) %>%
# osmdata_sf()
#
# transit_station <-
# transit_station$osm_points %>%
# .[sfBoundary.4326,] %>%
# st_transform(crs=crs) %>%
# mutate(Legend = "Transit_Station") %>%
# dplyr::select(Legend)
#
# save(transit_station, file="transit_station.RData")
# Load the features
load("graffiti.RData")
load("medicalWaste.RData")
load("humanWaste.RData")
load("illegalPostings.RData")
load("entertainmentNoise.RData")
load("encampments.RData")
load("nightlife.RData")
load("alcoholStores.RData")
load("transit_station.RData")
# Combine all features
vars_net <-
rbind(graffiti, medicalWaste, humanWaste, illegalPostings, entertainmentNoise, encampments,
nightlife, alcoholStores, transit_station) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
# Recode transit stations to a binary indicator of whether the cell contains a station
vars_net$Transit_Station[vars_net$Transit_Station > 0] <- 1
# Function to compute average nearest neighbor distance features
nn_function <- function(measureFrom, measureTo, k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
# Distances from each origin point to its k nearest target points
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
# Average the k distances for each origin point
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
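As a quick sanity check of this helper (a toy illustration only, not part of the analysis), consider a single query point at the origin and three candidate points on the x-axis at x = 1, 2, and 4: the average distance to the k = 2 nearest neighbors should be mean(c(1, 2)) = 1.5.
# Toy example: average distance from (0, 0) to its 2 nearest of 3 points
toy_from <- matrix(c(0, 0), ncol = 2)
toy_to <- matrix(c(1, 0,
2, 0,
4, 0), ncol = 2, byrow = TRUE)
nn_function(toy_from, toy_to, k = 2) # returns 1.5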
st_c <- st_coordinates
st_coid <- st_centroid
# Nearest features
vars_net <-
vars_net %>%
mutate(
Graffiti.nn =
nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
MedicalWaste.nn =
nn_function(st_c(st_coid(vars_net)), st_c(medicalWaste),3),
HumanWaste.nn =
nn_function(st_c(st_coid(vars_net)), st_c(humanWaste),3),
IllegalPostings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(illegalPostings),3),
EntertainmentNoise.nn =
nn_function(st_c(st_coid(vars_net)), st_c(entertainmentNoise),3),
Encampments.nn =
nn_function(st_c(st_coid(vars_net)), st_c(encampments),3),
Nightlife.nn =
nn_function(st_c(st_coid(vars_net)), st_c(nightlife),3),
AlcoholStores.nn =
nn_function(st_c(st_coid(vars_net)), st_c(alcoholStores),3),
TransitStation.nn =
nn_function(st_c(st_coid(vars_net)), st_c(transit_station),3)
)
# Distance features
tenderloin <-
filter(neighborhoods, name == "Tenderloin") %>%
st_centroid()
vars_net$TenderloinDistance =
st_distance(st_centroid(vars_net),tenderloin) %>%
as.numeric()
vars_net <- vars_net %>%
mutate(WithinTenderloin = ifelse(TenderloinDistance < 1000, 1, 0))
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
# Multimap of count features
mapList <- list()
for(i in vars[1:9]){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size=12)}
do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors by Fishnet", bottom="Figure 3: Multimap of risk factors (counts) in fishnet"))
The average nearest neighbor distance features allow us to model a smoother exposure relationship across space, as compared to the hard, discrete exposure captured by counts. We build these features by computing, for each fishnet cell, the average distance from the cell centroid to the k = 3 nearest instances of each risk factor described in the section above. Figure 4 below maps these features on the fishnet.
# Multimap of nearest neighbor features
mapList <- list()
for(i in vars[10:18]){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size=12)}
do.call(grid.arrange,c(mapList, ncol =3, top = "Risk Factors by Fishnet", bottom="Figure 4: Multimap of risk factors (nearest neighbor) in fishnet"))
In addition to the count and nearest neighbor based features, we also consider a feature capturing the distance of each cell from the Tenderloin, a high-crime San Francisco neighborhood known for drug-related activity. We add a further binary feature flagging cells that fall within 1,000 m of the center of the Tenderloin. Figure 5 shows the map of these two features.
# Multimap of distance and withinness features
mapList <- list()
for(i in vars[19:20]){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size=12)}
do.call(grid.arrange,c(mapList, ncol =2, top = "Risk Factors by Fishnet", bottom="Figure 5: Multimap of risk factors (distance and withinness) in fishnet"))
Finally, we identify hotspots of drug offenses using the Local Moran's I statistic for each cell in the fishnet. In Figure 6 below, relatively high values of I represent strong and statistically significant evidence of local clustering, with further evidence visible in the p-value and significant-hotspot maps. We use a p-value threshold of 1e-7 to identify significant hotspots.
Based on these, we create two new features: whether a cell lies in a hotspot region, and the distance from each cell to the nearest hotspot.
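For reference, the Local Moran's I statistic follows the standard form (Anselin, 1995), up to the exact variance scaling used by spdep's localmoran:

$$I_i = \frac{x_i - \bar{x}}{m_2} \sum_{j} w_{ij}\,(x_j - \bar{x}), \qquad m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

where $x_i$ is the drug offense count in cell $i$ and $w_{ij}$ are the row-standardized queen-contiguity weights built below. A large, significant positive $I_i$ flags a cell whose count resembles its neighbors' counts, i.e., local clustering.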
# Join crime data with features till now
final_net <-
left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID")
# Associate each cell to neighborhood
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
# Calculate local moran's I
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
final_net.localMorans <-
cbind(
as.data.frame(localmoran(final_net$countDrugOffenses, final_net.weights)),
as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Drug_Offense_Count = countDrugOffenses,
Local_Morans_I = Ii,
P_Value = `Pr(z > 0)`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.0000001, 1, 0)) %>%
gather(Variable, Value, -geometry)
# Add distance to hotspot and whether cell falls in hotspot as a feature
final_net <-
final_net %>%
mutate(drugOffense.isSig =
ifelse(localmoran(final_net$countDrugOffenses,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(drugOffense.isSig.dist =
nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, drugOffense.isSig == 1))), 1))
# Plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size=12)}
do.call(grid.arrange,c(varList, ncol = 2, top = "Local Morans I statistics, Drug Offenses", bottom="Figure 6"))
Next, we conduct some EDA by looking at the correlation of the features extracted so far with the count of drug offenses in each cell. As can be seen from Figure 7, the hotspot features extracted in Section 4.4 are highly correlated with the count of drug offenses in the fishnet cells. A similar trend is seen for count-based features like encampments, human/animal waste, and medical waste. Our nearest neighbor features, on the other hand, show only a slight correlation with the drug offense counts.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name) %>%
gather(Variable, Value, -countDrugOffenses)
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countDrugOffenses, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countDrugOffenses)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Drug Offense count as a function of risk factors",
caption = "Figure 7: Scatterplot with correlations") +
plotTheme()
Next, we look at the histogram of drug offense counts per cell. As can be seen from Figure 8, the distribution is heavily right-skewed, with most cells having no crime events at all.
# Histogram of your dependent variable.
final_net %>%
ggplot() +
geom_histogram(aes(x = countDrugOffenses)) +
labs(title = "Drug Offense Distribution",
x = "Count of Drug Offenses in Cell",
caption = "Figure 8") +
plotTheme()
Given the skewness of the drug offense counts found in Section 5.2, we use Poisson regression to model the counts. For modeling, we choose the nearest neighbor features, as they express a smoother relationship across space, along with the distance, withinness, and hotspot features described above.
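Concretely, a Poisson regression with the canonical log link models the expected count in cell $i$ as

$$\text{count}_i \sim \text{Poisson}(\lambda_i), \qquad \log(\lambda_i) = \beta_0 + \sum_{k} \beta_k x_{ik}$$

so fitted values $\lambda_i = \exp(\beta_0 + \sum_k \beta_k x_{ik})$ are guaranteed to be non-negative, which suits a zero-heavy, right-skewed count outcome better than OLS.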
We check the generalizability of our model using two types of cross-validation: (i) random k-fold CV, performed on randomly assigned groups of fishnet cells, and (ii) spatial leave-one-group-out (LOGO) CV, which holds out one neighborhood at a time. LOGO-CV assumes that the experience of all other neighborhoods generalizes to the hold-out, which is a fairly rigid assumption.
From Figure 9, while our model generalizes quite well under random k-fold CV, the spatial LOGO-CV reveals it to be somewhat less robust, with some neighborhoods reporting a mean absolute error close to 3.
reg.ss.vars <- c("Graffiti.nn", "MedicalWaste.nn", "HumanWaste.nn", "IllegalPostings.nn", "EntertainmentNoise.nn", "Encampments.nn" ,"Nightlife.nn", "AlcoholStores.nn", "TransitStation.nn","TenderloinDistance", "WithinTenderloin", "drugOffense.isSig", "drugOffense.isSig.dist")
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
# Build the model formula from the supplied dependent variable
regression <-
glm(as.formula(paste(dependentVariable, "~ .")), family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
# Random k-fold CV
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countDrugOffenses",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countDrugOffenses, Prediction, geometry)
# Spatial CV
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countDrugOffenses",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countDrugOffenses, Prediction, geometry)
# Regression Summary
reg.summary <-
rbind(
mutate(reg.ss.cv, Error = Prediction - countDrugOffenses,
Regression = "Random k-fold CV"),
mutate(reg.ss.spatialCV, Error = Prediction - countDrugOffenses,
Regression = "Spatial LOGO-CV")) %>%
st_sf()
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
# Fold-level mean error and its absolute value (the fold-level MAE)
summarize(Mean_Error = mean(Prediction - countDrugOffenses, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
# Multiple map of model errors by random k-fold and spatial cross validation.
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Drug Offense errors by Cross Validation",
caption="Figure 9: Multiple map of model errors by random k-fold and spatial cross validation.") +
mapTheme() + theme(legend.position="bottom")
The numbers in Table 1 back this finding, with spatial LOGO-CV giving higher errors than random k-fold CV. Comparing the mean MAE of the spatial LOGO-CV (0.42) to the mean count of drug offenses per neighborhood (7.07), however, shows that this error is not severe and that the model does a decent job of generalizing across neighborhoods.
# Table of MAE and standard deviation MAE by regression.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = "Table 1: MAE by regression") %>%
kable_styling("striped", full_width = F)
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV | 0.18 | 0.12 |
| Spatial LOGO-CV | 0.42 | 0.51 |
We next check the generalizability of the model across different racial contexts (majority white and majority non-white). As can be seen from Figure 10, the city of San Francisco shows substantial racial segregation.
# Visualizing different race contexts
tracts18 %>%
ggplot() +
geom_sf(aes(fill = raceContext)) +
scale_fill_manual(values = c("Majority Non-White"="#440154", "Majority White"="#FDE725")) +
labs(title="Race Context",
caption="Figure 10") +
theme(legend.position="bottom") +
mapTheme()
From Table 2 it can be seen that, on average, our model under-predicts in majority non-white neighborhoods and over-predicts in majority white neighborhoods. The errors are small in magnitude, and the difference in errors across racial contexts is modest, indicating that the model generalizes reasonably well across the considered racial groups.
# Table of errors
reg.summary %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Table 2: Mean Error by neighborhood racial context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority Non-White | Majority White |
|---|---|---|
| Random k-fold CV | -0.0916696 | 0.0896803 |
| Spatial LOGO-CV | -0.1018317 | 0.0909264 |
Next, we compare our risk prediction model with the traditional kernel density hotspot mapping tool used by police departments. If the risk predictions capture more of the observed 2019 drug offenses than the kernel density estimate, we can argue that the risk prediction model provides a more robust targeting tool for allocating police resources.
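For reference, with spatstat's default Gaussian kernel and bandwidth sigma = 1,000 m (the second argument to density.ppp below), the estimated intensity surface is, ignoring the edge correction density.ppp applies by default, approximately

$$\hat{\lambda}(s) = \sum_{i=1}^{n} \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{\lVert s - s_i \rVert^2}{2\sigma^2}\right)$$

where the $s_i$ are the observed 2018 drug offense locations; each fishnet cell then inherits the mean density of the pixels falling within it.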
Figure 11 compares the kernel density predictions with the predictions from our model.
# Kernel density estimate of the 2018 drug offense points
drug_ppp <- as.ppp(st_coordinates(drugOffense18), W = st_bbox(final_net))
drug_KD <- spatstat::density.ppp(drug_ppp, 1000)
# Bin kernel density values into risk categories and join the 2019 offenses
drug_KDE_sf <- as.data.frame(drug_KD) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(drugOffense19) %>% mutate(drugOffenseCount = 1), ., sum) %>%
mutate(drugOffenseCount = replace_na(drugOffenseCount, 0))) %>%
dplyr::select(label, Risk_Category, drugOffenseCount)
# Bin risk predictions into risk categories and join the 2019 offenses
drug_risk_sf <-
filter(reg.summary, Regression == "Spatial LOGO-CV") %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(drugOffense19) %>% mutate(drugOffenseCount = 1), ., sum) %>%
mutate(drugOffenseCount = replace_na(drugOffenseCount, 0))) %>%
dplyr::select(label,Risk_Category, drugOffenseCount)
# The map comparing kernel density to risk predictions for the next year's crime.
rbind(drug_KDE_sf, drug_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(drugOffense19, 750), size = .5, colour = "black") +
facet_wrap(~label) +
scale_fill_viridis(discrete = TRUE) +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2018 drug offense risk predictions; 2019 drug offenses",
caption="Figure 11: Map comparing kernel density to risk predictions for the next year's crime.") +
mapTheme()
The bar plot in Figure 12 shows that the risk prediction model captures a marginally larger share of observed 2019 drug offenses in the moderate (50%-69% and 70%-89%) and highest (90% to 100%) risk categories relative to the kernel density method.
# Bar plot comparing kernel density and risk predictions
rbind(drug_KDE_sf, drug_risk_sf) %>%
st_set_geometry(NULL) %>% na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countDrugOffenses = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countDrugOffenses / sum(countDrugOffenses)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Risk prediction vs. Kernel density, 2019 Drug Offenses",
x = "Risk Category",
y = "Rate of Test Set Crimes",
caption = "Figure 11: Bar plot making kernel density and risk prediction comparisons.") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
plotTheme()
From the above analysis we can see that our geospatial risk model generalizes reasonably well across neighborhoods and racial groups and has a low mean MAE. In addition, it does marginally better than the currently used kernel density model at predicting potential drug offenses in moderate-to-high-risk areas. At first glance, both facts make the model an attractive candidate for production. However, generalizing well across neighborhoods and racial groups does not rule out selection bias in the model. It is possible that the model would fail to generalize across income groups, a criterion we did not take into account. Such selection bias, if present, could lead the model to predict high drug offense risk in low-income areas, which could in turn perpetuate over-reporting and over-policing there.
Given (i) good generalizability across neighborhoods and racial groups, (ii) on-par or marginally better performance than the kernel density estimate in predicted high-risk areas, but (iii) selection bias that is not accounted for across all relevant dimensions, we recommend using this model only as an aid, not as a tool to be relied on entirely for policing decisions. Further improvements, such as introducing additional risk, protective, and spatial features, could reduce these biases and achieve better generalizability.