1 Introduction

California wildfires have always posed a hazardous and harmful phenomenon to the people and the environment. California recorded its first gigafire after the expansive August Complex this year scorched more than a million acres of land, an area larger than the size of Rhode Island. Although more than 3000 firefighters and personnel were deployed to the fire, it took almost three months to fully contain the fire. It left the country with a cost of more than $264 million to deal with the fire and in losses. The cause of the wildfire was lightning strikes, and the rugged terrain combined with consistent high winds and record heat complicated the firefighting efforts. The 2020 fire season in California is the culmination of three main factors: weather & climate, topology and human activity.

Having knowledge in advance of these and many other factors from the past could prove to provide a precious tool for addressing Forest Management Practices and Wildfire Risk Assessment in avoiding large wildfires from happening in future years.

This motivated us to provide analysis, visualizations and a model to assess the likelihood of potential wildfires in the upcoming season. Modeling wildfire risk based on the intelligence we gather from various topological, anthropogenic, vegetation, weather and climate features could result in a modern instrument that would help forest officials in their forest protection plans and address fuel management strategies such as targeted prescribed burns and clearing of brushes in order to avoid potential wildfires in future.

Please find a narrated presentation of our Wildfire Risk Assessment App here.

2 Setup

Before getting started with the analysis, we first import the required libraries and functions, and initialize values for our color palettes and CRS.

# Import required libraries
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(gifski)
library(tidycensus)
library(viridis)
library(osmdata)
library(FNN)
library(RANN)
library(elevatr)
library(sp)
library(caret)
library(ggcorrplot)
library(stargazer)

options(tigris_class = "sf")

# Import required functions
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Initialize color pallets
palette6 <- c("#ffffcc", "#c7e9b4", "#7fcdbb", "#41b6c4", "#2c7fb8", "#253494")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#ffffb2","#fecc5c","#fd8d3c","#e31a1c")
palette3 <- c("#eff3ff","#6baed6","#08519c")
palette2 <- c("#ffffb2","#f03b20")

# Set the CRS
crs <- st_crs("+proj=lcc +lat_1=40 +lat_2=41.66666666666666 +lat_0=39.33333333333334 +lon_0=-122 +x_0=2000000 +y_0=500000 +ellps=GRS80 +units=m no_defs")

3 Area of Study

To narrow down on the scope of this analysis, we decided to focus on the the southern side of California as our area of study for wildfire risk modeling. This area comprises of a total of 10 counties as shown in Figure 1.

# Getting counties in California

cali_county_bdry <- st_read("https://opendata.arcgis.com/datasets/8713ced9b78a4abb97dc130a691a8695_0.geojson") %>% 
  st_transform(crs=crs)

# Filtering out counties in Southern California

south_cali_counties <- c("San Luis Obispo", "Kern", "San Bernardino", "Santa Barbara",
                         "Ventura", "Orange", "Riverside", "San Diego", "Imperial", "Los Angeles")

south_cali_county_boundary <- cali_county_bdry %>% 
  filter(COUNTY_NAME %in% south_cali_counties) %>% 
  select(OBJECTID, COUNTY_NAME)

# Generating the Southern California boundary

X <- st_coordinates(st_centroid(south_cali_county_boundary))[,1][1:10] 
Y <- st_coordinates(st_centroid(south_cali_county_boundary))[,2][1:10]
label <- south_cali_county_boundary %>% select(COUNTY_NAME) %>% st_drop_geometry() %>% head(10) %>% unlist() %>% unname()

south_cali_boundary <- st_union(south_cali_county_boundary) %>% st_sf()

# Plotting Study Area

south_cali_county_boundary %>%
  ggplot() +
  geom_sf(aes(fill = COUNTY_NAME)) + 
  geom_text(data = south_cali_county_boundary %>% head(10), aes(x = X, y = Y, label = label)) + 
  labs(title = "Study Area",
       subtitle = "Southern California",
       caption = "Figure 1") + 
  mapTheme() + 
  theme(legend.position = "none")

4 Fire Perimeter

Our analysis begins with extracting the fire perimeters data, providing us with the boundaries of wildfires. This analysis focuses on wildfires that occurred from 2014-19 in Southern California, although we also extract the data for the year 2013 to deal with lag feature values as discussed in Section 5.9. Figure 2 below shows the wildfire perimeters for the years we considered. As can be seen, most of the wildfires seem to be concentrated on the western side of Southern California, with some of the major fires occurring near the coast of Ventura and Santa Barbara counties.

# Uncomment to get the data in real-time

# fire_perimeter <- st_read("https://opendata.arcgis.com/datasets/e3802d2abf8741a187e73a9db49d68fe_0.geojson") %>% 
#   st_transform(crs=crs) %>% 
#   filter(YEAR_ >= 2013) %>% 
#   st_intersection(south_cali_boundary) %>% 
#   mutate(month = as.factor(month(ALARM_DATE))) %>% 
#   mutate(season = ifelse(month %in% c(5,6,7,8), "summer", ifelse(month %in% c(9,10,11,12), "fall", "winter"))) %>% 
#   mutate(season = as.factor(season))

# Load fire perimeter data

load(file = "../rdata/fire_perimeter.RData")

# Plot fire perimeters 

ggplot() + 
  geom_sf(data = south_cali_boundary) + 
  geom_sf(data = fire_perimeter, fill ="red") + 
  mapTheme() + 
  labs(title = "Fire Perimeter",
       subtitle = "Southern California, 2013-19",
       caption = "Figure 2")

4.1 Creating Fishnet and Mapping Fires

We next create a fishnet over the entire Southern California region, with each fishnet cell covering an area of almost 2000 acres. Having done that, we map the fire perimeter data for each year and season to the fishnet cells, indicating whether/not a given fishnet cell was burned during a particular year and season. Season is an important aspect of our analysis, and provides the forest management officials with a time frame of almost 4 months to analyze the risk of wildfire in a given region and take appropriate actions.

Figure 3 below shows an example of mapping the wildfires that occurred in the year 2016 onto the fishnet cells. As can be seen, summer is the season having high likelihoods of a wildfire outbreak.

# Uncomment this to create fishnet in real-time 

# fishnet <-
#   st_make_grid(south_cali_boundary, cellsize = 9333) %>%
#   .[south_cali_boundary] %>%
#   st_sf() %>%
#   mutate(uniqueID = rownames(.))

# Load the fishnet data

load(file = "../rdata/south_cali_fishnet_2000_acres.RData")

# For each fire perimeter get fishnet cells associate with it 

fire_incidents_fishnet <- fire_perimeter %>% 
  st_intersection(fishnet) %>% 
  st_drop_geometry() %>% 
  select(uniqueID, year=YEAR_, season) %>% 
  distinct() %>% 
  mutate(Burned = 1)

fishnet_year_season_expand <- expand.grid(uniqueID = unique(fishnet$uniqueID),
            year = as.character(seq(2013, 2019)),
            season = c("summer", "fall", "winter"))

fire_incidents_fishnet <- fire_incidents_fishnet %>% 
  right_join(fishnet_year_season_expand) %>% 
  mutate(Burned = ifelse(is.na(Burned), 0, 1)) %>% 
  inner_join(fishnet %>% select(uniqueID)) %>% 
  st_sf()

# Map fires for year 2016 on fishnet by season

fire_incidents_fishnet %>%
  filter(year == 2016) %>% 
  mutate(season = ifelse(season == "fall", "Fall", ifelse(season == "winter", "Winter", "Summer"))) %>% 
  mutate(Burned = ifelse(Burned==0, "No", "Yes")) %>%
  ggplot() +
  geom_sf(aes(fill = Burned), color="transparent") +
  scale_fill_manual(values = palette2) + 
  mapTheme() + 
  labs(title = "Map of Fires on Fishnet by Season",
       subtitle = "Southern California, 2016",
       caption = "Figure 3") + 
  facet_wrap(~season) + 
  theme(legend.position="bottom")

Figure 4 below strengths the claim we made above for summer being the most prone month for wildfire outbreak. In addition, it also shows that winter (2014 and 2018) and fall months also show some wildfire activity, hence making it appropriate to seasonally model this phenomenon.

# Count of wildfires across year and month

fire_perimeter %>% 
  st_drop_geometry() %>% 
  mutate(month = as.factor(month(ALARM_DATE))) %>%
  count(YEAR_, month) %>%
  filter(YEAR_ > 2013) %>% 
  drop_na() %>% 
  ggplot(aes(x= month, y = n)) + 
  geom_bar(stat = "identity", fill = "steelblue") + 
  facet_wrap(~YEAR_, scales = "free_y") + 
  plotTheme() + 
  labs(
    title = "Count of Wildfires Across Years and Month",
    subtitle = "Southern California",
    x = "Month",
    y = "Count",
    caption = "Figure 4"
  )

5 Potential Signals for Wildfire Risk

Modeling wildfires can get tricky, with a number of factors playing role in its outbreak. Some of these factors include the weather, anthropogenic features (like nearness to settlements and roads) and topological features (like vegetation type, presence of lakes and elevation of land). In the following section we focus of extracting some of these signals and building features around them that would help us in modeling wildfire risk (Section 8).

5.1 Weather Data

Weather plays a key factor in increasing the risk and extent of wildfires. Warmer temperatures and lower relative humidity makes the forest fuels more receptive to ignition, with stronger winds providing the required oxygen supply and transporting embers ahead of the flaming front. Such weather signals can prove to be of utmost importance to forest officials for assessing the wildfire risk in different regions of interest.

5.1.1 Associating Fishnet Cell to Weather Station

For extracting the weather data, we consider the data from 81 weather stations spread across the region of Southern California. In order to get the weather conditions for a given fishnet cell, we first associate each fishnet cell with the nearest weather station, as shown in Figure 5. This modeling strategy assumes that the weather stays more of less uniform in vicinity of the weather station.

# Associating each fishnet cell with a weather station

# Getting weather station data
weather_stations <- riem_stations(network = "CA_ASOS") %>% 
  st_as_sf(coords = c("lon", "lat"), crs = 4326, agr = "constant")%>%
  st_transform(crs=crs) %>% 
  .[south_cali_boundary, ] %>% 
  mutate(weather_station_rid = row_number())

weather_station_mapping <- st_drop_geometry(weather_stations) %>% 
  select(id, weather_station_rid)

# Mapping each fishnet to nearest weather station

weather_station_coords <- st_coordinates(st_centroid(weather_stations))
fishnet_coordinates <- st_coordinates(st_centroid(fishnet))

closest_weather_station_to_fishnet <- nn2(weather_station_coords, fishnet_coordinates, k = 1 )$nn.idx

fishnet$weather_station_rid <- closest_weather_station_to_fishnet

fishnet <- fishnet %>% 
  inner_join(weather_station_mapping) %>% 
  select(-weather_station_rid) %>% 
  rename(weather_station_id = id)

# Plotting

ggplot() + 
  geom_sf(data = fishnet, aes(fill = weather_station_id)) + 
  geom_sf(data = weather_stations, fill = "black") + 
  theme(legend.position = "none") + 
  mapTheme() + 
  labs(title = "Fishnet Associated with their Nearest Weather Station",
       subtitle = "Black dots represent the weather station",
       caption = "Figure 5")

5.1.2 Extracting Weather Data

Next, given that we are dealing with seasonal wildfires, we extract the weather data for each weather station for each season (winter, summer and fall). This is done by aggregating the daily level weather data from the station at the seasonal level. Specific weather features that we considered include mean and standard deviations of temperature, dew point temperature, relative humidity, wind direction, wind speed and total precipitation.

Figure 6 below shows an example of the mean temperature at each weather station in the summer of 2016.

# Get weather data for each time period for each weather station

season_date_map <- list(
  "summer" = list(
    "start" = "-05-01",
    "end" = "-08-31"
  ),
  "fall" = list(
    "start" = "-09-01",
    "end" = "-12-31"
  ),
  "winter" = list(
    "start" = "-01-01",
    "end" = "-04-30"
  )
)


# Function to get weather data 
get_weather_features_by_station <- function(weather_station_ids, start_year, end_year, season_date_map){
  
  year_vec <- seq(start_year, end_year)

  i <- 1
  weather_data_list <- list()
  
  for(station_id in weather_station_ids){
    
    print(paste("Processing station", station_id))
    
    for(year in year_vec){
      for(season in names(season_date_map)){
        
        start_date = paste0(year, season_date_map[[season]][["start"]])
        end_date = paste0(year, season_date_map[[season]][["end"]])
        
        weather_data <- riem_measures(station = station_id, date_start = start_date, date_end = end_date) %>% 
          summarise(weather_station_id = station_id,
                    year = year,
                    season = season,
                    mean_tmpf = mean(tmpf, na.rm = TRUE),
                    sd_tmpf = sd(tmpf, na.rm = TRUE),
                    mean_dwpf = mean(dwpf, na.rm = TRUE),
                    sd_dwpf = sd(dwpf, na.rm = TRUE),
                    mean_relh = mean(relh, na.rm = TRUE),
                    sd_relh = sd(relh, na.rm = TRUE),
                    mean_drct = mean(drct, na.rm = TRUE),
                    sd_drct = sd(drct, na.rm = TRUE),
                    mean_sknt = mean(sknt, na.rm = TRUE),
                    sd_sknt = sd(sknt, na.rm = TRUE),
                    sum_p01i = sum(p01i, na.rm = TRUE)
                    ) 
        
        weather_data_list[[i]] <- weather_data
        i <- i + 1
        
      }
    }
  }


  do.call("rbind", weather_data_list) 
}

weather_station_ids <- unique(fishnet$weather_station_id)

# Uncomment to get weather data in rea-time
# weather_data <- get_weather_features_by_station(weather_station_ids, 2013, 2019, season_date_map)

# Load weather data
load(file = "../rdata/weather_data_13_19.RData")

# Plotting 
weather_data %>% 
  filter(year == 2016, season == "summer") %>% 
  inner_join(weather_stations, by = c("weather_station_id" = "id")) %>% 
  st_sf() %>% 
  ggplot() + 
  geom_sf(data = south_cali_boundary) +  
  geom_sf(aes(size=mean_tmpf), color = "red", alpha = 0.5) + 
  mapTheme() + 
  labs(title = "Temperature at each Weather Station",
       subtitle = "Southern California, Summer 2016",
       caption = "Figure 6",
       size = "Mean\nTemperature") + 
  theme(legend.position = "bottom")

5.1.3 Dealing with Missing Values

For some of the weather station, extracting the weather data for certain time frames results in missing weather values. To overcome this, we employ the following strategy to intelligently fill in these values:

For stations with missing weather values, we look at the next nearest weather station to fill in the missing values, again assuming that the weather conditions remain more or less uniform going one from station to its nearest station.
If the nearest station also has missing values, we resort to relying on the nearest weather station of the neighboring station to fill in the missing values.

Fortunately, this two hop neighbor missing value filling process results in filling in all the missing values. Given that we had mapped each fishnet cell to its nearest weather station in Section 5.1.1, we next associate the extracted weather data to each fishnet using this mapping.

# Get weather station coordinates
weather_station_coords <- st_coordinates(st_centroid(weather_stations))

# Find the closest weather station to a given station
closest_weather_station_to_station <- nn2(weather_station_coords, weather_station_coords, k = 2 )$nn.idx[,2]

weather_station_mapping$closeset_weather_station_id <- closest_weather_station_to_station

# Filter our values that are not NA
not_nas_weather_data <- weather_data %>% 
  filter(!is.na(mean_tmpf))

# Filter out values that are NA
nas_weather_data <- weather_data %>%  
  filter(is.na(mean_tmpf)) %>% 
  select(weather_station_id, year, season)

# Deal with NAs by looking at neighboring station's data
nas_weather_data_clean_1 <- nas_weather_data %>% 
  inner_join(weather_station_mapping, by = c("weather_station_id"="id")) %>% 
  select(-weather_station_rid) %>% 
  inner_join(weather_station_mapping, by = c("closeset_weather_station_id"="weather_station_rid")) %>% 
  select(weather_station_id, year, season, closest_weather_station_id = id) %>% 
  inner_join(weather_data, by = c("closest_weather_station_id"="weather_station_id", "year"="year", "season"="season")) %>% 
  select(-closest_weather_station_id) 

# Filter out not NAs after first pass
not_nas_weather_data_1 <- nas_weather_data_clean_1 %>% 
  filter(!is.na(mean_tmpf))

# Filter out NAs after first pass 
nas_weather_data_1 <- nas_weather_data_clean_1 %>%  
  filter(is.na(mean_tmpf)) %>% 
  select(weather_station_id, year, season) 
  
# Deal with NAs by looking at neighbour's of neighbouring station's data
nas_weather_data_clean_2 <- nas_weather_data_1 %>% 
  inner_join(weather_station_mapping, by = c("weather_station_id"="id")) %>% 
  select(-weather_station_rid) %>% 
  inner_join(weather_station_mapping, by = c("closeset_weather_station_id"="weather_station_rid")) %>% 
  select(-closeset_weather_station_id, -id) %>% 
  inner_join(weather_station_mapping, by = c("closeset_weather_station_id.y"="weather_station_rid")) %>% 
  select(weather_station_id, year, season, closest_weather_station_id = id) %>% 
  inner_join(weather_data, by = c("closest_weather_station_id"="weather_station_id", "year"="year", "season"="season")) %>% 
  select(-closest_weather_station_id) 
  
# Bind all data together 
weather_data_clean <- rbind(not_nas_weather_data, not_nas_weather_data_1, nas_weather_data_clean_2) %>% 
  mutate(year = as.character(year))

5.2 Incorporated Cities

Human settlements can also prove to be an important signal for modeling wildfire risk. As per a study*, almost 85% of the wildfires in the Untied States are caused by humans, either due to the leaving the campfires unattended or due to burning of debris, to name a few causes. This makes analyzing proximity to human settlements important.

Figure 7 below shows the human settlement areas (in blue) along with the wildfire (in red) that have occurred from 2013-19. As can be seen, most of the wildfires have occurred in regions that are in close proximity to a human settlement. Given such a pattern, we extract two signals from this data for each fishnet cell, (i) the distance of a fishnet cell to the human settlement area and (ii) whether/not a fishnet cell lies within the human settlement area.

*Source: 2000-2017 data based on Wildland Fire Management Information (WFMI) and U.S. Forest Service Research Data Archive.

# Distance to incorporated cities / presence within incorporated cities 

south_cali_incorporated_cities <- 
  st_read("https://opendata.arcgis.com/datasets/8322505e8f1741c7b0de85684594e32a_0.geojson") %>% 
  st_transform(crs=crs) %>% 
  st_intersection(south_cali_boundary)

# Plotting 

ggplot() + 
  geom_sf(data = south_cali_boundary, fill = "#d9d9d9") + 
  geom_sf(data = fire_perimeter, aes(fill = "Wildfire"), color = "transparent", alpha=0.6) + 
  geom_sf(data = south_cali_incorporated_cities,aes(fill = "City Area"), color = "transparent", alpha=0.6) + 
  labs(
    title = "Cities and Wildfire",
    subtitle = "Southern California, 2013-19",
    fill = "",
    caption = "Figure 7"
  ) + 
  scale_fill_manual(values = c("Wildfire" = "#ef3b2c", "City Area"= "#4292c6")) + 
  mapTheme() + 
  theme(legend.position="bottom")

# Distance of fishnet cell to city

incorporated_cities_coords <- st_coordinates(st_centroid(south_cali_incorporated_cities))
fishnet_coordinates <- st_coordinates(st_centroid(fishnet))

cell_dist_to_nearest_settlement <- nn2(incorporated_cities_coords, fishnet_coordinates, k = 1 )$nn.dists

fishnet$dist_to_nearest_settlement <- cell_dist_to_nearest_settlement

# Presence/Absence of fishnet cell in city 

fishnet <- fishnet %>% 
  st_intersection(south_cali_incorporated_cities) %>% 
  st_drop_geometry() %>% 
  select(uniqueID) %>% 
  distinct() %>%
  mutate(fall_within_settlement = 1) %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(fall_within_settlement = ifelse(is.na(fall_within_settlement), 0, 1))

5.3 Lakes

Lakes act as natural “firebreaks” stopping wildfires from spreading and providing refuge to plant and animal species. One of the many attributes of a lake is its size, and one might hypothesize that areas with large water bodies are less prone to wildfire activity. To capture this signal, we categorize the size of each lake in Southern California from 1-5 (5 being the highest) and associate each fishnet cell with the size of lake it lies in (with 0 indicating absence of lake). Figure 8 below shows the fishnet cells colored according to the size of the contained lake.

# South california lakes 

# Uncomment to get data in real-time
# south_cali_lakes <- st_read("https://opendata.arcgis.com/datasets/4d7024209b3d4b7796d4dd4119764d3e_5.geojson") %>% 
#   st_transform(crs=crs) %>% 
#   st_intersection(south_cali_boundary)

# Load South California lakes data
load(file = "../rdata/south_cali_lakes.RData")

# Assign each cell a number based on the size of lake that lies within it 

south_cali_lakes$lake_size_quantile <- as.integer(as.character(q5(south_cali_lakes$sfc_acres)))

fishnet <- fishnet %>% 
  st_intersection(south_cali_lakes) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>% 
  summarise(
    lake_size_quantile = max(lake_size_quantile)
  ) %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(lake_size_quantile = ifelse(is.na(lake_size_quantile), 0, lake_size_quantile))

# Plotting 

fishnet %>% 
  ggplot() + 
  geom_sf(aes(fill = as.factor(lake_size_quantile))) + 
  scale_fill_manual(values = palette6) +
  labs(title = "Contained Lake Size Within Fishnet Cell",
       subtitle = "Southern California",
       caption = "Figure 8",
       fill = "Lake Body Size") + 
  mapTheme() + 
  theme(legend.position = "bottom")

5.4 Fire Facilities

Next we consider data from facilities that have been set up for fire suppression, including state and local funded fire stations, stations providing wildland firefighting service, conservation camps, etc. To model this signal, we consider to count of these facilities within each fishnet cell as can be seen in Figure 9.

# Fire facilities 
fire_facilities <- st_read("https://opendata.arcgis.com/datasets/1c8a93cac92f418e98a8fa6a2eaf4265_0.geojson") %>% 
  st_transform(crs=crs) %>% 
  .[south_cali_boundary, ]

# Find number within each fishnet cell

fishnet <- fishnet %>% 
  st_intersection(fire_facilities) %>% 
  st_drop_geometry() %>% 
  count(uniqueID) %>%
  rename(n_fire_facilites = n) %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(n_fire_facilites = ifelse(is.na(n_fire_facilites), 0, n_fire_facilites)) %>% 
  mutate(n_fire_facilites = case_when(
    n_fire_facilites == 0 ~ "0",
    n_fire_facilites > 0 & n_fire_facilites <= 3 ~ "1-3",
    n_fire_facilites > 3 & n_fire_facilites <= 6 ~ "4-6",
    n_fire_facilites > 6 ~ "7+"
  ))

# Plotting 

fishnet %>% 
  ggplot() + 
  geom_sf(aes(fill = n_fire_facilites)) + 
  scale_fill_manual(values = palette4) + 
  mapTheme() + 
  labs(title = "Count of Fire Facilites in Fishnet Cell",
       subtitle = "Southern California",
       caption = "Figure 9",
       fill = "Count Fire\nFacilities") +
  theme(legend.position = "bottom")

5.5 Hazard Zones

Hazard zones are areas of tree mortality that hold the greatest potential to result in wildfire and/or falling tree and threaten people and property in these areas. These zones are divided into different tiers, with tier 1 zones having the highest potential of being a safety issue to people, followed by tier 2. Given that these hazard zones can indicate potential areas that have high risk of catching wildfire and also areas that need to be prioritized by forest officials to remove dead and dying tree, we model this signal by mapping each hazard zone to the fishnet cells, capturing whether/not the cell lies within the hazard zone. Figure 10 shows the map of hazard zones across the two tiers.

# High hazard zone 2

# Uncomment to get hazard zone in real-time
# hazard_zone_t2 <- st_read("https://opendata.arcgis.com/datasets/e50b7577426c4367a518b80b38e9b5d8_0.geojson") %>%
#   st_make_valid() %>% 
#   st_transform(crs=crs) %>%
#   st_intersection(south_cali_boundary)

# Load hazard zone 2 data
load(file = "../rdata/hazard_zone_t2.RData")

# Associate fishnet cell with hazard zone
fishnet <- fishnet %>% 
  st_intersection(hazard_zone_t2) %>% 
  st_drop_geometry() %>% 
  select(uniqueID) %>% 
  distinct() %>%
  mutate(fall_within_harard_zone_t2 = 1) %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(fall_within_harard_zone_t2 = ifelse(is.na(fall_within_harard_zone_t2), 0, 1)) 

p1 <- fishnet %>% 
  mutate(fall_within_harard_zone_t2 = ifelse(fall_within_harard_zone_t2 == 0, "No", "Yes")) %>% 
  ggplot() + 
  geom_sf(aes(fill = as.factor(fall_within_harard_zone_t2))) +
  scale_fill_manual(values = palette2) + 
  labs(title = "Hazard Zone Tier 2",
       fill = "Falls in Hazard Zone T2") + 
  mapTheme() + 
  theme(legend.position = "bottom")


# High hazard zone 1 -- Area filter 1500000

# Uncomment to get hazard zone in real-time
# hazard_zone_t1 <- st_read("https://opendata.arcgis.com/datasets/a71a85136b0b414ea734fdfbe3d7674a_0.geojson") %>%
#   filter(SHAPEAREA > 10000) %>% 
#   st_make_valid() %>% 
#   st_transform(crs=crs) %>%
#   st_intersection(south_cali_boundary)

# Load hazard zone 1 data
load(file = "../rdata/hazard_zone_t1.RData")

# Associate fishnet cell with hazard zone
fishnet <- fishnet %>% 
  st_intersection(hazard_zone_t1) %>% 
  st_drop_geometry() %>% 
  select(uniqueID) %>% 
  distinct() %>%
  mutate(fall_within_harard_zone_t1 = 1) %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(fall_within_harard_zone_t1 = ifelse(is.na(fall_within_harard_zone_t1), 0, 1)) 

p2 <- fishnet %>% 
  mutate(fall_within_harard_zone_t1 = ifelse(fall_within_harard_zone_t1 == 0, "No", "Yes")) %>% 
  ggplot() + 
  geom_sf(aes(fill = as.factor(fall_within_harard_zone_t1))) +
  scale_fill_manual(values = palette2) + 
  labs(title = "Hazard Zone Tier 1",
       fill = "Falls in Hazard Zone T1") + 
  mapTheme() + 
  theme(legend.position = "bottom")

# Plotting 

grid.arrange(p2, p1, bottom="Figure 10", ncol=2)

5.6 Elevation

Elevation is know to effect fire behavior, influencing the amount and timing of precipitation, exposure to prevailing wind and affecting the seasonal dying of fuel. All these factors change the risk of wildfire in one way or the other and hence, to capture this signal, we calculate the point elevation for each of the fishnet cells. Figure 11 below shows a map of the same and also indicates that areas in the vicinity of high elevation areas might have higher risk of wildfire.

# Getting the elevation for each cell

fishnet_center_coords_df <- fishnet %>% 
  st_transform(crs=4326) %>% 
  st_centroid() %>% 
  st_coordinates() %>% 
  as.data.frame() %>% 
  select(long = X, lat = Y) 

fishnet_center_coords_df$uniqueID <- fishnet$uniqueID

prj <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"

# Uncomment this to get data in real-time
# fishnet_cell_point_elevations <- get_elev_point(fishnet_center_coords_df, prj = prj, src = "epqs")

# Load elevation data
load(file = "../rdata/fishnet_cell_point_elevations.RData")

elevation_df <- data.frame(fishnet_cell_point_elevations) %>% 
  select(uniqueID, elevation) %>% 
  drop_na()
  
fishnet <- fishnet %>% 
  left_join(elevation_df) %>%
  mutate(elevation = ifelse(is.na(elevation), 0.00, elevation))

# Plotting 

fishnet %>% 
  ggplot() + 
  geom_sf(aes(fill = elevation)) +
  geom_sf(data = fire_perimeter, fill = "red", color = "transparent", alpha=0.7) + 
  labs(title = "Point Elevation of each Fishnet Cell, Southern California",
       subtitle = "Map is overlayed with Wildfire from 2013-19",
       caption = "Figure 11",
       fill = "Elevation") +  
  scale_fill_gradient(low = "#f6e8c3", high = "#8c510a") + 
  mapTheme()

5.7 Tree Seed Zone

Tree seed zones (having 3-digit codes) are regions that tend to have similar topographic, climatic and edaphic conditions, and hence indicate the suitability of seed for a planting area. Figure 12 below shows the different seed zones in Southern California overlayed with the wildfires that have occurred in this region from 2013-19. As can be seen certain tree seed zones are more likely to catch wildfire outbreak than than others. To capture this signal, we associate each fishnet cell with the tree zone it lies in.

For details on the seed zone code and their associated topographic conditions, please refer to https://www.fs.fed.us/psw/publications/conkle/psw_1970_conkle001_buck.pdf

# Cali Tree Seed Zone
# https://www.fs.fed.us/psw/publications/conkle/psw_1970_conkle001_buck.pdf

# Tree Zone Data
tree_seed_zones <- st_read("../gdbs/seedzones02_1.gdb/") %>% 
  st_transform(crs=crs) %>% 
  st_intersection(south_cali_boundary)

# Associating each fishnet cell with a treezone  
fishnet <- fishnet %>% 
  st_intersection(tree_seed_zones %>% select(SEED_ZONE)) %>% 
  st_drop_geometry() %>% 
  group_by(uniqueID) %>% 
  slice(1) %>% 
  ungroup() %>% 
  right_join(fishnet) %>% 
  st_sf() %>% 
  mutate(SEED_ZONE = ifelse(is.na(SEED_ZONE), "150", SEED_ZONE)) 

# Plotting

ggplot() + 
  geom_sf(data = tree_seed_zones, aes(fill = SEED_ZONE)) +
  geom_sf(data = fire_perimeter, fill = "red", color = "transparent", alpha=0.7) + 
  mapTheme() + 
  labs(title = "Tree Seed Zones, Southern California",
       subtitle = "Map is overlayed with Wildfire from 2013-19",
       caption = "Figure 12",
       fill = "Seed Zone Code")

  #theme(legend.position = "none")

5.8 Vegetation

In close lines with the tree seed zones discussed in Section 5.7, the vegetation demographics of a region (including the type of tree cover, distribution of life forms, density of cover, etc.) also play an important roles in informing the risk of wildfire outbreak. Given the resolution of data that was available, multiple vegetation demographics were present within a single fishnet cell, and hence to tackle this issue, we resorted to the associating each fishnet cell with the vegetation demographic that majorly dominated it.

Figure 13 below shows the map of tree cover type across Southern California and overlayed with wildfires that occurred from 2013-19. This map provides an interesting insight of shrubs and hardwood being more prone to wildfire outbreaks, a vital piece of information for the forest officials!

# Uncomment below code to get vegegation features in real-time

### South Sierra Vegetation Extraction ###

# south_sierra_veg<-st_read("../../../../Downloads/S_USA.EVMid_R05_SouthSierra.gdb/S_USA.EVMid_R05_SouthSierra.gdb", layer = "EVMid_R05_SouthSierra")%>%
#   st_transform(crs=crs) 
# save(south_sierra_veg, file = "../rdata/south_sierra_veg.RData")


### South Interior Vegetation Extraction ###

# south_interior_veg<-st_read("../../../../Downloads/S_USA.EVMid_R05_SouthInterior.gdb/S_USA.EVMid_R05_SouthInterior.gdb", layer ="EVMid_R05_SouthInterior") %>% 
#   st_transform(crs=crs)
# save(south_interior_veg, file = "../rdata/south_interior_veg.RData")

### South Coast Vegetation Extraction ###

# south_coast_veg<-st_read("../../../../Downloads/S_USA.EVMid_R05_SouCoast.gdb/S_USA.EVMid_R05_SouCoast.gdb", layer ="EVMid_R05_SouCoast") %>% 
#   st_transform(crs=crs)
# save(south_coast_veg, file = "../rdata/south_coast_veg.RData")

### Central Valley Vegetation Extraction ###

# central_valley_veg<-st_read("../../../../Downloads/S_USA.EVMid_R05_CentralValley.gdb/S_USA.EVMid_R05_CentralValley.gdb", layer ="EVMid_R05_CentralValley") %>% 
#   st_transform(crs=crs)
# save(central_valley_veg, file = "../rdata/central_valley_veg.RData")

### Central Coast Vegetation Extraction ###

# central_coast_veg<-st_read("../../../../Downloads/S_USA.EVMid_R05_CentralCoast.gdb/S_USA.EVMid_R05_CentralCoast.gdb", layer ="EVMid_R05_CentralCoast") %>% 
#   st_transform(crs=crs)
# save(central_coast_veg, file = "../rdata/central_coast_veg.RData")

### Keeping only Relevant Columns ###
cols_to_keep=c("USGS_ANDERSON_1","SAF_COVER_TYPE",
               "SRM_COVER_TYPE","NVCS_CLASS",
               "AGGREGATION_TYPE","TREE_CFA_CLASS","SHRUB_CFA_CLASS",
               "HARDWOOD_CFA_CLASS","CONIFER_CFA_CLASS","ECOREGION_DIVISION",
               "COVERTYPE","CWHR_LIFEFORM", "CWHR_TYPE", "CWHR_SIZE","CWHR_DENSITY")


# south_coast_veg<-south_coast_veg %>% select(cols_to_keep)
# south_interior_veg<-south_interior_veg %>% select(cols_to_keep)
# south_sierra_veg<-south_sierra_veg %>% select(cols_to_keep)
# central_coast_veg<-central_coast_veg %>% select(cols_to_keep)
# central_valley_veg<-central_valley_veg %>% select(cols_to_keep)

### Finding South Sierra Vegetation that only Comes under South California ###

# load(file = "../rdata/south_sierra_veg.RData")
# south_sierra_veg<-south_sierra_veg %>% select(cols_to_keep) %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_cali_boundary))

### Finding South Interior Vegetation that only comes under South California ###

# load(file = "../rdata/south_interior_veg.RData")
# south_interior_veg<-south_interior_veg %>% select(cols_to_keep) %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_cali_boundary))

### Finding Central Valley Vegetation that only comes under South California ###

# load(file = "../rdata/central_valley_veg.RData")
# central_valley_veg<-central_valley_veg %>% select(cols_to_keep) %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_cali_boundary))


### Finding Central Coast Vegetation that only comes under South California ###

# load(file = "../rdata/central_coast_veg.RData")
# central_coast_veg<-central_coast_veg %>% select(cols_to_keep) %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_cali_boundary))

### Finding South Coast Vegetation that only comes under South California ###

# load(file = "../rdata/south_coast_veg.RData")
# south_coast_veg<-south_coast_veg %>% select(cols_to_keep)


### Saving All the Bounded Vegetation areas within South California ###

#save(south_sierra_veg, file = "../rdata/bounded_south_sierra_veg.RData")
#save(south_interior_veg, file = "../rdata/bounded_south_interior_veg.RData")
#save(central_coast_veg, file = "../rdata/bounded_central_coast_veg.RData")
#save(central_valley_veg, file = "../rdata/bounded_central_valley_veg.RData")


### Load all the Bounded Vegetations ###

# load(file = "../rdata/south_coast_veg.RData")
# south_coast_veg<-south_coast_veg %>% select(cols_to_keep)
# 
# load(file = "../rdata/bounded_south_sierra_veg.RData")
# load(file= "../rdata/bounded_south_interior_veg.RData")
# load(file = "../rdata/bounded_central_coast_veg.RData")
# load(file = "../rdata/bounded_central_valley_veg.RData")


### The final vegetation across Southern California ###

# final_bounded_veg=rbind(south_coast_veg,south_sierra_veg,south_interior_veg,central_coast_veg, central_valley_veg)
#unioned_buffer<-st_union(st_buffer(final_bounded_veg,26400)) %>% st_sf()

### Function to get the majority categorical value for Vegetation Features from each fishnet ###

find_max<-function(column) {

  if(sum(is.na(column))!=length(column)){
  freq=as.data.frame(table(column))
  # print(freq)
  max_value <- max(freq$Freq)
  get_x=as.character(freq$column[freq$Freq==max_value][1])
  }
  else{
    get_x=NA
  }
  # print(get_x)
  return(get_x)
}

### Mapping South Coast Vegetation to Fishnets ###

# fishnet_south_coast<-fishnet %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_coast_veg)) %>% 
#   st_drop_geometry() %>% 
#   group_by(uniqueID) %>% 
#   summarise(usgs=find_max(USGS_ANDERSON_1),saf=find_max(SAF_COVER_TYPE),
#             srm=find_max(SRM_COVER_TYPE), nvcs=find_max(NVCS_CLASS),
#             agg_type=find_max(AGGREGATION_TYPE), tree_cfa=find_max(TREE_CFA_CLASS),
#             shrub_cfa=find_max(SHRUB_CFA_CLASS), hard_cfa=find_max(HARDWOOD_CFA_CLASS),
#             conifer_cfa=find_max(CONIFER_CFA_CLASS),eco_div=find_max(ECOREGION_DIVISION),
#             covertype=find_max(COVERTYPE), life_form=find_max(CWHR_LIFEFORM),
#             life_type=find_max(CWHR_TYPE), cover_size=find_max(CWHR_SIZE),
#             cover_density=find_max(CWHR_DENSITY)) %>% 
#   select(uniqueID, usgs, saf, srm, nvcs, agg_type, tree_cfa, shrub_cfa, hard_cfa, 
#          conifer_cfa, eco_div, covertype, life_form, life_type, cover_size, 
#          cover_density) %>% 
#   left_join(fishnet) %>% 
#   st_sf()

### Mapping Central Coast Vegetation to Fishnet ###

# fishnet_central_coast<-fishnet %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(central_coast_veg)) %>% 
#   st_drop_geometry() %>% 
#   group_by(uniqueID) %>% 
#   summarise(usgs=find_max(USGS_ANDERSON_1),saf=find_max(SAF_COVER_TYPE),
#             srm=find_max(SRM_COVER_TYPE), nvcs=find_max(NVCS_CLASS),
#             agg_type=find_max(AGGREGATION_TYPE), tree_cfa=find_max(TREE_CFA_CLASS),
#             shrub_cfa=find_max(SHRUB_CFA_CLASS), hard_cfa=find_max(HARDWOOD_CFA_CLASS),
#             conifer_cfa=find_max(CONIFER_CFA_CLASS),eco_div=find_max(ECOREGION_DIVISION),
#             covertype=find_max(COVERTYPE), life_form=find_max(CWHR_LIFEFORM),
#             life_type=find_max(CWHR_TYPE), cover_size=find_max(CWHR_SIZE),
#             cover_density=find_max(CWHR_DENSITY)) %>% 
#   select(uniqueID, usgs, saf, srm, nvcs, agg_type, tree_cfa, shrub_cfa, hard_cfa, 
#          conifer_cfa, eco_div, covertype, life_form, life_type, cover_size, 
#          cover_density) %>% 
#   left_join(fishnet) %>% 
#   st_sf()

### Mapping South Interior Vegetation to Fishnets ###

# fishnet_south_interior<-fishnet %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_interior_veg)) %>% 
#   st_drop_geometry() %>% 
#   group_by(uniqueID) %>% 
#   summarise(usgs=find_max(USGS_ANDERSON_1),saf=find_max(SAF_COVER_TYPE),
#             srm=find_max(SRM_COVER_TYPE), nvcs=find_max(NVCS_CLASS),
#             agg_type=find_max(AGGREGATION_TYPE), tree_cfa=find_max(TREE_CFA_CLASS),
#             shrub_cfa=find_max(SHRUB_CFA_CLASS), hard_cfa=find_max(HARDWOOD_CFA_CLASS),
#             conifer_cfa=find_max(CONIFER_CFA_CLASS),eco_div=find_max(ECOREGION_DIVISION),
#             covertype=find_max(COVERTYPE), life_form=find_max(CWHR_LIFEFORM),
#             life_type=find_max(CWHR_TYPE), cover_size=find_max(CWHR_SIZE),
#             cover_density=find_max(CWHR_DENSITY)) %>% 
#   select(uniqueID, usgs, saf, srm, nvcs, agg_type, tree_cfa, shrub_cfa, hard_cfa, 
#          conifer_cfa, eco_div, covertype, life_form, life_type, cover_size, 
#          cover_density) %>% 
#   left_join(fishnet) %>% 
#   st_sf()

### Mapping Central Valley Vegetation to Fishnets ###

# fishnet_central_valley<-fishnet %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(central_valley_veg)) %>% 
#   st_drop_geometry() %>% 
#   group_by(uniqueID) %>% 
#   summarise(usgs=find_max(USGS_ANDERSON_1),saf=find_max(SAF_COVER_TYPE),
#             srm=find_max(SRM_COVER_TYPE), nvcs=find_max(NVCS_CLASS),
#             agg_type=find_max(AGGREGATION_TYPE), tree_cfa=find_max(TREE_CFA_CLASS),
#             shrub_cfa=find_max(SHRUB_CFA_CLASS), hard_cfa=find_max(HARDWOOD_CFA_CLASS),
#             conifer_cfa=find_max(CONIFER_CFA_CLASS),eco_div=find_max(ECOREGION_DIVISION),
#             covertype=find_max(COVERTYPE), life_form=find_max(CWHR_LIFEFORM),
#             life_type=find_max(CWHR_TYPE), cover_size=find_max(CWHR_SIZE),
#             cover_density=find_max(CWHR_DENSITY)) %>% 
#   select(uniqueID, usgs, saf, srm, nvcs, agg_type, tree_cfa, shrub_cfa, hard_cfa, 
#          conifer_cfa, eco_div, covertype, life_form, life_type, cover_size, 
#          cover_density) %>% 
#   left_join(fishnet) %>% 
#   st_sf()

### Mapping South Sierra Vegetations to Fishnets ###

# fishnet_south_sierra<-fishnet %>% 
#   st_make_valid() %>% 
#   st_intersection(st_make_valid(south_sierra_veg)) %>% 
#   st_drop_geometry() %>% 
#   group_by(uniqueID) %>% 
#   summarise(usgs=find_max(USGS_ANDERSON_1),saf=find_max(SAF_COVER_TYPE),
#             srm=find_max(SRM_COVER_TYPE), nvcs=find_max(NVCS_CLASS),
#             agg_type=find_max(AGGREGATION_TYPE), tree_cfa=find_max(TREE_CFA_CLASS),
#             shrub_cfa=find_max(SHRUB_CFA_CLASS), hard_cfa=find_max(HARDWOOD_CFA_CLASS),
#             conifer_cfa=find_max(CONIFER_CFA_CLASS),eco_div=find_max(ECOREGION_DIVISION),
#             covertype=find_max(COVERTYPE), life_form=find_max(CWHR_LIFEFORM),
#             life_type=find_max(CWHR_TYPE), cover_size=find_max(CWHR_SIZE),
#             cover_density=find_max(CWHR_DENSITY)) %>% 
#   select(uniqueID, usgs, saf, srm, nvcs, agg_type, tree_cfa, shrub_cfa, hard_cfa, 
#          conifer_cfa, eco_div, covertype, life_form, life_type, cover_size, 
#          cover_density) %>% 
#   left_join(fishnet) %>% 
#   st_sf()


### Combining all the Vegetation Mapped Fishnets Together ###

# final_vegetation<-rbind(fishnet_south_coast, fishnet_central_coast, fishnet_central_valley,
#                         fishnet_south_interior, fishnet_south_sierra)

# final_vegetation<-final_vegetation %>% distinct(uniqueID, .keep_all = TRUE)

# final_data<-final_vegetation %>% 
#   st_drop_geometry() %>% 
#   right_join(fishnet) %>%
#   st_sf() %>% 
#   mutate(usgs_new = ifelse(is.na(usgs), "UNK", usgs),
#          saf_new = ifelse(is.na(saf), "UNK", saf),
#          srm_new = ifelse(is.na(srm), "UNK", srm),
#          nvcs_new = ifelse(is.na(nvcs), "UNK", nvcs),
#          agg_type_new = ifelse(is.na(agg_type), "UNK", agg_type),
#          tree_cfa_new = ifelse(is.na(tree_cfa), "UNK", tree_cfa),
#          shrub_cfa_new = ifelse(is.na(shrub_cfa), "UNK", shrub_cfa),
#          hard_cfa_new = ifelse(is.na(hard_cfa), "UNK", hard_cfa),
#          conifer_cfa_new = ifelse(is.na(conifer_cfa), "UNK", conifer_cfa),
#          eco_div_new = ifelse(is.na(eco_div), "UNK", eco_div),
#          covertype_new = ifelse(is.na(covertype), "UNK", covertype),
#          life_form_new = ifelse(is.na(life_form), "UNK", life_form),
#          life_type_new = ifelse(is.na(life_type), "UNK", life_type),
#          cover_size_new = ifelse(is.na(cover_size), "UNK", cover_size),
#          cover_density_new = ifelse(is.na(cover_density), "UNK", cover_density)) %>% 
#   select(uniqueID, usgs_new, saf_new, srm_new, nvcs_new, agg_type_new, tree_cfa_new,
#          shrub_cfa_new, hard_cfa_new, conifer_cfa_new, eco_div_new, covertype_new,
#          life_form_new, life_type_new, cover_size_new, cover_density_new)
#   

# final_data<-final_data %>% st_drop_geometry()

# save(final_data,file = "../rdata/fishnet_vegetation_data.RData")

# Load the vegetation data

load(file = "../rdata/fishnet_vegetation_data.RData")

fishnet <- final_data %>% 
  right_join(fishnet) %>% 
  st_sf()

# Plotting 

fishnet %>% 
  ggplot() +
  geom_sf(aes(fill = covertype_new), color = "transparent") + 
  geom_sf(data = fire_perimeter , fill = "red", alpha=0.9, color="transparent") + 
  mapTheme() + 
  labs(title = "Vegetation Cover Across Southern California",
       subtitle = "Map is overlayed with fire perimeters in red",
       caption = "Figure 13",
       fill = "Cover Type")

5.9 Lag Fires

One can also hypothesize that an area that was burned due to wildfire is prone to initiating a fire again in the future. To capture this, for each fishnet cell, we consider two lag features, capturing whether a cell had witnessed wildfire in the past season and the past year.

Note: Taking into account the lag features results in missing data for 2013, hence we would only be considering wildfires from year 2014-19 in Section 8

# Lag features -- did the same fire occur last year in the same season? or last season ? 

fire_incidents_fishnet_with_lags <- fire_incidents_fishnet %>% 
  st_drop_geometry() %>% 
  group_by(uniqueID, season) %>% 
  arrange(year) %>% 
  mutate(burned_last_year = lag(Burned)) %>% 
  ungroup() %>% 
  mutate(season = factor(season, levels = c("winter", "summer", "fall"))) %>%
  group_by(uniqueID) %>% 
  arrange(year, season) %>% 
  mutate(burned_last_season = lag(Burned)) %>% 
  ungroup()

5.10 Combining Data

To conclude this section on potential signals for wildfire risk, we combine data from all the signals that we extracted from this section, associating each signal to the respective fishnet cell.

# Remove the first year from incidents and lag dataframes 
fire_incidents_fishnet <- fire_incidents_fishnet %>% 
  filter(year != 2013)

fire_incidents_fishnet_with_lags <- fire_incidents_fishnet_with_lags %>% 
  filter(year != 2013)

# Combine fire incidents with lag features 
fire_incidents_fishnet <- fire_incidents_fishnet %>% 
  left_join(fire_incidents_fishnet_with_lags) %>% 
  st_sf()

# get combined features of incidents and weather 
fire_incidents_fishnet <- fire_incidents_fishnet %>% 
  st_drop_geometry() %>% 
  left_join(fishnet) %>%
  left_join(weather_data_clean) 

# save(fire_incidents_fishnet, file = "../rdata/fire_incident_fishnet_14_19.RData")

6 Additional Exploratory Data Analysis

Having extracted a number of features for each fishnet cell in Section 5, we next conduct some additional analysis geared towards coming up with better binning of these features and removing features that might not be that important for modeling.

6.1 Correlation

First, we look at the correlation among the numeric features that we extracted. As can be seen from Figure 14, some of our features are highly correlated with each other, for example, mean speed of wind and standard deviation of speed of wind, and standard deviation of temperature and standard deviation of dew point temperature. We would need to get rid of such features prior to modeling to ensure stability of the model feature coefficients.

# Filtering out numeric variables 
numericVars <- select_if(fire_incidents_fishnet, is.numeric) %>% na.omit() 

# Building the correlation plot
ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation Across Numeric Variables",
       caption = "Figure 14")

6.2 Density Distributions

Next, we look at density distribution of the numeric features (excluding those that were correlated in Section 6.1). As can be seen from Figure 15, closeness to settlement makes an area more prone to catch wildfire, and so does an high mean dew point temperature and being in a low elevation area. Surprisingly, higher relative humidity and lower wind speed tend to be more associated with chances of wildfire, a bit counter-intuitive of what one might generally expect.

# Density Distribution of Burned vs Not Burned classes for Different Continuous Variables 

fire_incidents_fishnet %>%
  dplyr::select(Burned,dist_to_nearest_settlement, elevation, mean_tmpf,
                mean_dwpf,mean_relh, sd_relh,
                mean_drct, sd_drct, mean_sknt, sd_sknt, sum_p01i) %>%
    rename(`Dist Settlement`=dist_to_nearest_settlement,
         `Elevation`=elevation,
         `Mean Temp`=mean_tmpf,
         `Mean Dwpt Tmp`=mean_dwpf,
         `Mean Rel Hum`=mean_relh,
         `SD Rel Hum`=sd_relh,
         `Mean Wind Dir`=mean_drct,
         `SD Wind Dir`=sd_drct,
         `Mean Wind Spd`=mean_sknt,
         `SD Wind Spd`=sd_sknt,
         `Sum Pptn`=sum_p01i) %>% 
  mutate(Burned=ifelse(Burned==1, "Burned", "Not Burned")) %>% 
  gather(Variable, value, -Burned) %>%
  ggplot() + 
  geom_density(aes(value, color=as.factor(Burned)), fill = "transparent") + 
  facet_wrap(~Variable, scales = "free") +
  scale_fill_manual(values = palette2) +
  labs(title = "Feature Distributions of Burned vs Not Burned",
       x = "Feature Value",
       y = "Density",
       subtitle = "(continous outcomes)",
       color = "",
       caption="Figure 15") +
  plotTheme() + 
  theme(legend.position = "bottom")

6.3 Vegetation Distribution

Following on from our discussion in Section 5.8, we next look at the distribution of vegetation type on a fishnet cell that did/did not get burned due to wildfire. As can be seen from Figure 16, two of the vegetation types are most likely to burn are shrubs and hardwood. In addition to this, hardwood majority covered areas seem to be more prone to wildfires compared to shrubs majority covered areas, and looking from the map we can observe that most of the wildfires have been captured at the intersection of shrub and hardwood covered areas. Shrubs and dying/fallen trees are one of main sources of fuel for wildfire. Hence this information can be useful for forest officials to do prescribed burns, and selective clearing of brushes and dying plants on and around the intersection we discussed.

fire_incidents_fishnet %>%
  select(usgs_new, saf_new, srm_new, nvcs_new, agg_type_new,
         tree_cfa_new, shrub_cfa_new, hard_cfa_new, conifer_cfa_new,
         eco_div_new, covertype_new, life_form_new, life_type_new,
         cover_size_new, cover_density_new, fall_within_harard_zone_t1,
         fall_within_harard_zone_t2, n_fire_facilites, lake_size_quantile,
         fall_within_settlement,Burned, season, year) %>%
  filter(season=="summer") %>% 
  select(covertype_new, Burned) %>% 
  count(covertype_new, Burned) %>%
  filter(covertype_new!="UNK") %>% 
  group_by(Burned) %>% 
  mutate(n=n/sum(n)) %>% 
  ungroup() %>% 
  ggplot(aes(as.factor(Burned), n, fill = covertype_new)) +   
  geom_bar(position = "dodge", stat="identity") +
  scale_x_discrete(labels= c("Not Burned","Burned"))+
  labs(x="", y="Proportion",
       title = "Distribution of Vegetation in Southern California",
       subtitle = "For wildfires that occurred in the summer season, 2014-19",
       caption="Figure 16",
       fill = "Cover Type"
  ) + 
  plotTheme()

6.4 Feature Binning & Cleaning

Based on the findings in Section 6.1 and 6.2, we next carry out the following three features binning/cleaning steps:

Remove correlated features that were identified using the correlation plot.
Bin features into categories based on density distribution.
For the remaining numeric features that were not binned, standard normalize the feature values.

fire_feats <- fire_incidents_fishnet %>% 
  mutate(
    dist_to_nearest_settlement = ifelse(dist_to_nearest_settlement < 25000, "<25000", ">=25000"),
    elevation = ifelse(elevation < 500, "low", "high"),
    mean_drct = ifelse(mean_drct < 150, "low", "high"),
    mean_dwpf = ifelse(mean_dwpf < 41, "low", "high"),
    mean_relh = ifelse(mean_relh < 50, "low", "high"),
    mean_tmpf = ifelse(mean_tmpf >= 60 & mean_tmpf <= 85, "likelyBurn", "likelyNotBurn"),
    sd_sknt = ifelse(sd_sknt <= 5, "low", "high")
  ) %>% 
  select(-mean_sknt, -sd_dwpf, -sd_tmpf, -sum_p01i) %>% 
  mutate(
    burned_last_year = as.factor(burned_last_year),
    burned_last_season =  as.factor(burned_last_season),
    fall_within_harard_zone_t1 = as.factor(fall_within_harard_zone_t1),
    fall_within_harard_zone_t2 = as.factor(fall_within_harard_zone_t2),
    lake_size_quantile = as.factor(lake_size_quantile),
    fall_within_settlement = as.factor(fall_within_settlement)
  ) %>% 
  mutate(
    sd_relh = (sd_relh-mean(sd_relh))/sd(sd_relh),
    sd_drct = (sd_drct-mean(sd_drct))/sd(sd_drct)
  )

7 Wildfire Risk Modeling

Having built a number of features by now that signal the susceptibility of an area to be involved in a wildfire outbreak, we next move on to building a wildfire risk prediction model, which given an input of these signals, can output a map of risk zones over the area of study, thus helping the forest officials to take appropriate actions from there after.

7.1 Creating Train/Test Set

We first create the train and test split from the data we have. For our modeling purpose we would treat the wildfire data that we have gathered for 2019 as testing set, and intelligently sample the remaining data from 2014-19 (5 years) as our training set.

This intelligent sampling is required due the fact that wildfires are a rare event, and given the 5 years of data that we have gathered (across 3 different seasons), only a small proportion of fishnet cells were actually involved in a wildfire outbreak. Given such a case, we sample the training data as follows:

Include all the fishnet cells that were burned from 2014-19 as part of the training set (positive sample).
Intelligently sample from the remaining (not burned) fishnet cells ensuring that the resulting sample is able to capture all the different values a feature can take (thus avoiding new levels while predicting on the test set). The size of this negative sample is kept such that it more or less equates the size of the positive sample.

set.seed(1520)

# Use the 2019 data as test set
fire_test_data <- fire_feats %>% 
  filter(year == "2019") %>% 
  select(-year, -weather_station_id, -geometry)

# Sample from the remaining data to create the training set 

fire_feats_not_test <- fire_feats %>% 
  filter(year != "2019") %>% 
  select(-year, -weather_station_id, -geometry)

train_index <- createDataPartition(fire_feats_not_test$Burned,
                    y = paste(fire_feats_not_test$usgs_new, fire_feats_not_test$saf_new, fire_feats_not_test$srm_new,
                              fire_feats_not_test$hard_cfa_new, fire_feats_not_test$life_type_new, 
                              fire_feats_not_test$eco_div_new, fire_feats_not_test$SEED_ZONE),
                    p = 0.04,
                    list = FALSE,
                    times = 1
                    )

fire_feats_sample_1 <- fire_feats_not_test[train_index, ]
fire_feats_sample_2  <- fire_feats_not_test[-train_index, ]

fire_train_data <- rbind(fire_feats_sample_1,  
      fire_feats_sample_2 %>% filter(Burned == 1))

7.2 Fitting Logistic Regression on Train Set

Having got the training and the test set, we next fit a Logistic Regression model on our training data, owing to the fact that the outputs of these model always lie between 0 and 1 and can be interpreted as risk values.

Below we summarize the model coefficients we get after fitting the model to the training data. Some of the highly significant features include lag wildfire features, presence within human settlements, lake size (quantile 3), a number of different seed zones, weather features (like low relative humidity and wind direction) and the season being considered.

# Fit logistic regression

fire_risk_model <- glm(Burned ~ .,
                      data=fire_train_data %>% select(-uniqueID),
                      family="binomial" (link="logit"))

# Analyze variable coefficients

stargazer(fire_risk_model,
          type = "text",
          align=TRUE,
          no.space = TRUE,
          single.row = TRUE)

## 
## ==============================================================
##                                        Dependent variable:    
##                                    ---------------------------
##                                              Burned           
## --------------------------------------------------------------
## seasonsummer                            2.162*** (0.268)      
## seasonwinter                            -0.642*** (0.237)     
## burned_last_year1                        0.543** (0.249)      
## burned_last_season1                     0.978*** (0.268)      
## usgs_new2                                -0.061 (1.378)       
## usgs_new3                                 0.348 (0.617)       
## usgs_new4                                -0.154 (0.602)       
## usgs_new5                                -1.768 (1.458)       
## usgs_new6                                -1.613 (1.738)       
## usgs_new7                               0.668 (4,697.304)     
## usgs_newUNK                           -21.071 (10,754.020)    
## saf_new211                               -0.077 (1.178)       
## saf_new229                                0.979 (1.293)       
## saf_new237                               -1.770 (2.214)       
## saf_new239                               -1.030 (0.846)       
## saf_new243                               -0.378 (1.077)       
## saf_new245                            -19.914 (10,754.010)    
## saf_new246                                1.054 (1.158)       
## saf_new247                             -16.955 (7,142.104)    
## saf_new248                                1.228 (0.965)       
## saf_new249                              -2.800** (1.102)      
## saf_new250                                0.452 (0.523)       
## saf_new255                               -0.071 (0.329)       
## saf_new260                             -17.092 (5,029.740)    
## saf_new261                             -18.400 (5,029.740)    
## saf_new262                                0.283 (0.328)       
## saf_new265                              0.104 (8,810.285)     
## saf_newUNK                                                    
## srm_new001                               -0.598 (1.227)       
## srm_new002                               -1.380 (2.224)       
## srm_new004                               -0.522 (1.206)       
## srm_new005                                0.278 (1.180)       
## srm_new007                               -1.077 (1.331)       
## srm_new205                               -0.346 (1.278)       
## srm_new206                               -2.352 (1.458)       
## srm_new207                               -1.319 (1.333)       
## srm_new211                               -1.564 (1.412)       
## srm_new212                               -1.754 (1.918)       
## srm_new215                               -1.623 (1.323)       
## srm_new401                             0.195 (12,242.730)     
## srm_new414                             -17.063 (6,791.369)    
## srm_new421                             -2.692 (12,242.730)    
## srm_newUNK                                                    
## nvcs_newC02                               0.234 (0.342)       
## nvcs_newC03                              -0.932 (0.866)       
## nvcs_newUNK                               0.103 (2.165)       
## agg_type_newH                             0.510 (1.910)       
## agg_type_newUNK                                               
## tree_cfa_new20                            0.849 (1.063)       
## tree_cfa_new40                            0.763 (1.013)       
## tree_cfa_new80                           -0.453 (1.162)       
## tree_cfa_newUNK                        21.544 (2,349.980)     
## shrub_cfa_new20                          -0.579 (1.460)       
## shrub_cfa_new40                          -0.865 (1.354)       
## shrub_cfa_new80                          -0.826 (1.366)       
## shrub_cfa_newUNK                         -1.687 (1.314)       
## hard_cfa_new20                         17.143 (7,564.765)     
## hard_cfa_new40                         17.229 (7,564.765)     
## hard_cfa_new80                         18.352 (7,564.765)     
## hard_cfa_newUNK                        -0.827 (7,921.368)     
## conifer_cfa_new20                         0.198 (0.351)       
## conifer_cfa_new40                         0.131 (0.536)       
## conifer_cfa_new80                         0.445 (0.861)       
## conifer_cfa_newUNK                        0.050 (0.377)       
## eco_div_new320                           -1.311 (1.550)       
## eco_div_newUNK                                                
## covertype_newBAR                      -18.684 (10,754.010)    
## covertype_newCON                      -20.455 (10,754.010)    
## covertype_newHDW                      -20.822 (10,754.010)    
## covertype_newHEB                      -18.750 (10,754.010)    
## covertype_newMIX                      -20.769 (10,754.010)    
## covertype_newSHB                      -20.421 (10,754.010)    
## covertype_newUNK                                              
## covertype_newURB                      -17.274 (12,154.500)    
## covertype_newWAT                      -18.306 (10,754.010)    
## life_form_newWHR_CON                      0.213 (1.554)       
## life_form_newWHR_HDW                      0.447 (0.633)       
## life_form_newWHR_HEB                     -1.938 (1.836)       
## life_form_newWHR_NFO                     -2.606* (1.521)      
## life_form_newWHR_SHB                                          
## life_type_newASC                         -1.110 (1.261)       
## life_type_newBAR                         -0.108 (1.086)       
## life_type_newBOP                         -1.388 (1.337)       
## life_type_newBOW                         -0.486 (1.014)       
## life_type_newCOW                         -0.002 (0.982)       
## life_type_newCPC                       18.680 (5,029.740)     
## life_type_newCRC                         -0.248 (1.213)       
## life_type_newCRP                      -14.318 (15,212.340)    
## life_type_newCSC                         -0.596 (1.182)       
## life_type_newDOR                       -17.486 (4,726.721)    
## life_type_newDSC                         -0.768 (1.326)       
## life_type_newEPN                       -19.844 (4,943.436)    
## life_type_newEUC                        0.216 (5,537.635)     
## life_type_newJPN                       -3.891 (9,669.338)     
## life_type_newJUN                       18.638 (2,349.979)     
## life_type_newLAC                          0.424 (1.285)       
## life_type_newMCH                         -0.458 (0.972)       
## life_type_newMCP                       16.811 (7,142.105)     
## life_type_newMHC                         -1.601 (2.033)       
## life_type_newMHW                         -1.515 (1.073)       
## life_type_newMRI                          0.562 (1.989)       
## life_type_newPJN                         -0.605 (1.249)       
## life_type_newRIV                          1.479 (1.843)       
## life_type_newSGB                       -18.231 (5,851.117)    
## life_type_newSMC                         -1.397 (1.925)       
## life_type_newUNK                                              
## life_type_newURB                         -0.467 (0.973)       
## life_type_newVOW                          0.983 (1.522)       
## life_type_newVRI                         -0.544 (1.012)       
## cover_size_new1                           1.001 (1.888)       
## cover_size_new2                           0.539 (1.130)       
## cover_size_new3                          -0.067 (0.703)       
## cover_size_new4                           0.387 (0.673)       
## cover_size_new5                           1.443 (1.263)       
## cover_size_newUNK                        -2.918 (1.815)       
## cover_density_newM                       -0.486 (0.385)       
## cover_density_newP                       -0.516 (0.385)       
## cover_density_newS                       -1.408* (0.819)      
## cover_density_newUNK                      0.190 (0.974)       
## cover_density_newX                      -1.354*** (0.461)     
## fall_within_harard_zone_t11              -0.224 (0.216)       
## fall_within_harard_zone_t21             19.972 (342.998)      
## n_fire_facilites1-3                      -0.051 (0.189)       
## n_fire_facilites4-6                      -0.350 (0.476)       
## n_fire_facilites7+                       -0.507 (0.923)       
## lake_size_quantile1                       0.483 (0.361)       
## lake_size_quantile2                       0.356 (0.349)       
## lake_size_quantile3                      0.902** (0.354)      
## lake_size_quantile4                       0.264 (0.273)       
## lake_size_quantile5                       0.133 (0.215)       
## fall_within_settlement1                  0.516** (0.226)      
## dist_to_nearest_settlement> =25000       -0.109 (0.259)       
## elevationlow                              0.320 (0.276)       
## SEED_ZONE140                              0.671 (0.538)       
## SEED_ZONE150                              0.749 (1.422)       
## SEED_ZONE164                              1.059 (0.743)       
## SEED_ZONE165                              0.489 (0.521)       
## SEED_ZONE540                              1.084 (0.736)       
## SEED_ZONE550                             2.085** (0.814)      
## SEED_ZONE560                             1.487** (0.738)      
## SEED_ZONE570                            2.222*** (0.764)      
## SEED_ZONE785                           -15.724 (7,212.472)    
## SEED_ZONE963                             1.608* (0.963)       
## SEED_ZONE982                             1.498** (0.693)      
## SEED_ZONE983                              1.085 (1.113)       
## SEED_ZONE984                              0.127 (1.401)       
## SEED_ZONE985                             2.749** (1.166)      
## SEED_ZONE986                              0.994 (1.458)       
## SEED_ZONE991                             0.993* (0.536)       
## SEED_ZONE992                            1.590*** (0.552)      
## SEED_ZONE993                             1.168* (0.636)       
## SEED_ZONE994                              1.022 (0.640)       
## SEED_ZONE995                            1.973*** (0.588)      
## SEED_ZONE996                              0.847 (0.574)       
## SEED_ZONE997                              0.152 (0.788)       
## SEED_ZONE998                              0.397 (0.537)       
## mean_tmpflikelyNotBurn                    0.006 (0.214)       
## mean_dwpflow                             -0.512* (0.263)      
## mean_relhlow                             0.608** (0.249)      
## sd_relh                                   0.087 (0.107)       
## mean_drctlow                             0.438** (0.213)      
## sd_drct                                 0.395*** (0.121)      
## sd_skntlow                               -0.309 (0.237)       
## Constant                              -17.680 (13,152.650)    
## --------------------------------------------------------------
## Observations                                  2,089           
## Log Likelihood                              -633.176          
## Akaike Inf. Crit.                           1,580.353         
## ==============================================================
## Note:                              *p<0.1; **p<0.05; ***p<0.01

# tab_model(fire_risk_model)

7.3 Visualizing Risk Predictions on Test Set

Next, we evaluate the performance of the model trained by getting the risk predictions on the test set. Figure 17 below shows a visualization of the risk predictions for wildfires in the year 2019 across the different seasons. As can be seen from the figure, the model is able to capture the risk differences across the seasons, with summer being with more vulnerable for wildfire outbreak. It has also captured the pattern of the western parts of Southern California being more prone to wildfires, as was also noted in Section 4.

It is important to note here that the changes that are visible across seasons are not just due to difference of season in the model input, but also the difference in the weather conditions that existed during the time, which were also fed as input to the model. Hence, one can say that the model relies on effective forecast of weather in the future to get a reasonable risk forecast of wildfires.

# Get risk predictions on test set

test_risk_preds <- predict(fire_risk_model, fire_test_data %>% select(-uniqueID), type = "response")

fire_test_data$fire_risk_pred <- test_risk_preds

# Assign Risk Category
fire_test_data_rc <- fire_test_data %>% 
  mutate(Risk_Category = ntile(fire_risk_pred, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%"))

# Plotting 

fire_test_data_rc %>% 
  left_join(fishnet %>% select(uniqueID)) %>% 
  mutate(season = ifelse(season == "fall", "Fall", ifelse(season == "summer", "Summer", "Winter"))) %>% 
  st_sf() %>% 
  ggplot() + 
  geom_sf(aes(fill = Risk_Category), color="transparent") + 
  scale_fill_manual(values = c("#1a9641", "#a6d96a", "#ffffbf", "#fdae61", "#d7191c")) + 
  facet_wrap(~season) + 
  labs(title = "Wildfire Risk Across Seasons",
       subtitle = "Southern California, 2019",
       fill = "Risk Category",
       caption = "Figure 17") +
  mapTheme() + 
  theme(legend.position="bottom")

7.4 Performance on Test Set

To get a clear picture of the efficacy of the model, we next look at how good our model was in capturing the wildfires that broke out in 2019.

Table 1 below indicates the number of true fires that were captured in the different risk categories predicted by the model. For example, in summer 2019, 60% of the true wildfires that broke out were captured in the 90-100% risk category prediction by the model. In other words, if the forest officials just prioritize on areas that were predicted by the model in the high risk category (90-10% ) in summer, they would be able to curb 60% of the potential wildfire outbreaks.

Having understood that, our model does substantially well in the fall and the summer months, being able to capture almost 84% of the wildfires that broke out in the >70% risk category. Although the same is not true for the winter season, with the model being able to capture only 43% of the true fires in the >70% risk category.

Considering the occurrence wildfire in a fishnet cell as 1, the risk predictions from our model result in an RMSE value of 0.34, considerably less than a chance.

# Calculate RMSE
# caret::RMSE(fire_test_data$fire_risk_pred, fire_test_data$Burned)

# Get performance in capturing true fires 
fire_test_data_rc_perf <- fire_test_data_rc %>% 
   mutate(season = ifelse(season == "fall", "Fall", ifelse(season == "summer", "Summer", "Winter"))) %>%
   group_by(season, Risk_Category) %>% 
   summarise(
     count_fires = sum(Burned)
   ) %>% 
   group_by(season) %>% 
   mutate(
     pct_fires = count_fires/sum(count_fires)
   ) %>% 
   select(-count_fires) %>% 
   spread(key = Risk_Category, value = pct_fires, fill = 0) 

# Tabulate Results
fire_test_data_rc_perf %>% 
  kable(caption = "Table 1: Accuracy in Capturing True Wildfires, 2019", digits = 2) %>%
  kable_styling("striped", full_width = F)

Table 1: Accuracy in Capturing True Wildfires, 2019
season	1% to 29%	30% to 49%	50% to 69%	70% to 89%	90% to 100%
Fall	0.06	0.07	0.03	0.54	0.3
Summer	0.04	0.05	0.07	0.24	0.6
Winter	0.14	0.29	0.14	0.43	0.0

It is easy to get fooled by values reported in Table 1, without looking at the percentage of fishnet cells that fall within a risk category. For example, a naive model could just predict high risk (90-100%) for all the fishnet cells and would achieve 100% accuracy in capturing true wildfire outbreaks. Given limited amount of resources that can be allocated for suppression of wildfire and doing prescribed burns, such a model would not be viable.

To further support our model, Table 2 presents the percentage of fishnet cells that fall within a risk category within a given season. For example, in summer 2019, 25% of the fishnet cells were in the 90-100% risk category prediction by the model.

Having understood that, our model again does substantially well for fall and summer seasons. For fall, the 84% of the wildfires that broke out in the >70% risk category comprise of almost 32% of the entire area of Southern California, while for summer it comprises 35% of the entire area. For the winter season, one can capture around 57% of the wildfire that broke our in the >50% risk category by focusing on 36% of the entire area of Southern California. All these values reported from the model indicate that the risk prediction are quite targeted.

# Calculate area captured by each risk category
fire_test_data_rc_area <- fire_test_data_rc %>% 
  mutate(season = ifelse(season == "fall", "Fall", ifelse(season == "summer", "Summer", "Winter"))) %>%
  group_by(season, Risk_Category) %>% 
  summarise(
    n_fishnet_cells = n()
  ) %>% 
  group_by(season) %>% 
  mutate(
    pct_fishnet_cells = n_fishnet_cells/sum(n_fishnet_cells)
  ) %>% 
  select(-n_fishnet_cells) %>% 
  spread(key = Risk_Category, value = pct_fishnet_cells, fill = 0) 

# Tabulate Results
fire_test_data_rc_area %>% 
  kable(caption = "Table 2: Proportion of Area in a Risk Category. 2019", digits = 2) %>%
  kable_styling("striped", full_width = F)

Table 2: Proportion of Area in a Risk Category. 2019
season	1% to 29%	30% to 49%	50% to 69%	70% to 89%	90% to 100%
Fall	0.30	0.25	0.13	0.27	0.05
Summer	0.05	0.25	0.35	0.10	0.25
Winter	0.53	0.10	0.11	0.23	0.02

7.5 Generalization

7.5.1 Assigning CV Group to each Cell

Next, we check the generalization of our prediction model across space. Given the magnitude of our study area and the varying differences in feature values across different regions, we decided to create our own groups for cross validation, rather that relying on existing groups like counties or the fishnet cells. To create these new groups, we laid another fishnet with much larger cell area (57K acres) over the existing fishnet cells and then associated each of the smaller fishnet cells to the larger one, each larger one being one group. Figure 18 depicts this process in more detail.

# Make groupings based on new fishnet

fishnet_cv <-
  st_make_grid(south_cali_boundary, cellsize = 50000) %>%
  .[south_cali_boundary] %>%
  st_sf() %>%
  mutate(cvID = rownames(.))

fishnet_cv %>% 
  ggplot() + 
  geom_sf(alpha = 0.3, color = "#f03b20") +
  geom_sf(data = fishnet, alpha = 0.3) + 
  mapTheme() + 
  labs(title = "Assining Spatial CV Group to each Fishnet Cell",
       subtitle = "Each small fishnet gets associated with the overlayed big fishnet cell",
       caption = "Figure 18")

fishnet_cell_cv_mapping <- fishnet %>% 
  st_intersection(fishnet_cv) %>%
  st_drop_geometry() %>% 
  group_by(uniqueID) %>% 
  slice(1) %>% 
  select(uniqueID, cvID)
  
# Create cross validation data

cv_data <- rbind(fire_train_data, fire_test_data %>% filter(Burned == 1) %>% select(-fire_risk_pred))

cv_data <- cv_data %>% left_join(fishnet_cell_cv_mapping)

7.5.2 Cross Validation

Having got the cross validation groups, we then use LOGO cross validation on each of the new groups. Figure 19 below shows the distribution of the RMSE values across different cross validation groups. As can be seen from the figure, although a part of the distribution is clustered on one side, there are a couple of groups having RMSE values >0.3, indicating inadequacy of the our model in generalizing across space.

## LOGO CV

crossValidate <- function(dataset, id) {

  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  new_levels <- 0
  
  for (i in cvID_list) {
  
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
  
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() 
    
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame()
    
    fire_risk_model <- glm(Burned ~ .,
                       data=fold.train %>% select(-uniqueID, -cvID),
                       family="binomial" (link="logit"))
    
    tryCatch(
        {thisPrediction <- 
        mutate(fold.test, 
               Prediction = predict(fire_risk_model, fold.test %>% select(-uniqueID, -cvID), type = "response")
               )
        
        allPredictions <- rbind(allPredictions, thisPrediction)
        },
        error = function(e){
          print('Fold had new levels')
          new_levels <- new_levels + 1
        }
      )

  }
  
  print(paste('Total CVs with new levels:', new_levels))
  return(allPredictions)
}


# reg.spatialCV <- crossValidate(dataset = cv_data, id = "cvID")
# save(reg.spatialCV, file = "regSpatialCV.RData")

load("regSpatialCV.RData")

# Get Errors by fold
error_by_reg_and_fold <-
  reg.spatialCV %>%
    group_by(cvID) %>%
    summarize(Mean_Error = mean(Prediction - Burned, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T),
              RMSE = sqrt(Mean_Error**2)) %>%
  ungroup()

# Distribution of RMSE

error_by_reg_and_fold %>%
  ggplot(aes(RMSE)) +
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    labs(title="Distribution of RMSE", subtitle = "Spatial CV by Assigned Spatial Groups",
         x="RMSE", y="Count",
         caption = "Figure 19") +
    plotTheme()

Figure 20 shows a map of the RMSE errors across different groups. As can be seen, the relatively higher RMSE errors are on the western side of Southern California, the side which is more prone to wildfire outbreaks.

# Spatial Visualization of Errors

fishnet_cell_cv_mapping %>% 
  inner_join(error_by_reg_and_fold) %>% 
  right_join(fishnet %>% select(uniqueID)) %>% 
  st_sf() %>% 
  ggplot() + 
  geom_sf(aes(fill = RMSE)) +
  scale_fill_viridis() +
  labs(title = "Wildfire Errors Across Space, Southern California",
       subtitle = "Grayed groups resulted in new levels for features while prediction",
       fill = "RMSE",
       caption = "Figure 20") + 
  #geom_sf(data = fishnet_cv, fill = "transparent") + 
  mapTheme()

8 How Can this Risk Help a Forest Official?

We believe that our model has all over shown a decent performance in assessing the likelihood of a wildfire in a particular area and a particular season. From the above analysis, we found out that our model was able to capture 60% of the true wildfires in the 90-100% risk category for summer 2019. Given the resources available to the forest officials and the large scale of wildfires, these results are a good start in aiding forest officials in prioritizing their resources on areas that were predicted by the model in the high risk category (90-100%) and they could have potentially curbed the likelihood 60% of the wildfire outbreaks in summer 2019. We also admit that our modeling for the winter season has not been up to mark, since our model was only able to capture 43% of true fires in moderate and high risk categories in winter 2019. But this should not deter the forest officials from using this model for addressing their Risk Assessment Plan and Fuel Management Strategies. They can use our model to identify the areas coming under high risk categories for a particular season and accordingly take steps to optimize their Risk Management Plan and Fuel Management Strategies as follows:

Conducting targeted prescribed burns and selective clearing of brushes(dry/dead grasses, leaves and shrubs) and hardwoods(dying/fallen trees) especially on and around the intersection of majority shrubs and majority hardwood covered areas thus creating fuel breaks.
Strengthen their Forest Protection Plan and Community Defensible Spaces for particular human settlements that are near to forests that are more prone to wildfires, since we observed in our analysis that human activities are one of the main causes for Wildfire.
Promoting Upper Watershed Health for areas that lie in Moderate and High Risk Categories to mitigate the catastrophic fires.

As future direction, the analysis conducted can be benefited by adding additional features to the model, for example, one can also consider the different aspects of slope like northness and eastness. Also, the models considered in this analysis only touch on the geospatial features while modeling wildfire risk, but one can also integrate features from physics-based behavior of fire to model this risk better.

Forecasting Wildfire Risk in Southern California

Shubham Annadate & Aayush Dua

December 15, 2020