1 Introduction

The aim of this project is to build a strong predictive model of home prices for Miami, Florida. Purchasing a house is an important decision in a person's life and requires a lot of thought and research. A better predictive model of home prices can help people who plan to purchase a house in Miami understand what a house is worth in today's market. At the same time, it can provide homeowners in Miami with an independent, unbiased assessment of what their home might currently be worth. Both groups can then plan their finances and investments accordingly.

Many factors come into play when predicting home prices. We have to research what these factors are and understand the context behind them. Moreover, when building predictive models we make assumptions that ignore non-quantifiable factors such as consumer confidence and sudden shocks such as the COVID-19 outbreak. Together, these issues make this exercise difficult.

In this project, we identify three major types of features that we believe have a strong influence on home prices in Miami, Florida. These include:

  • Internal Characteristics: Internal features of a house, such as its size, whether or not it has a pool, the number of bedrooms, and so on.

  • Amenities/Public Services: Features measuring the distance from a house to the amenities and public services that Miami has to offer, for example bars, pubs, schools, hospitals, and police stations.

  • Spatial Structure: Features that are geographically relevant to a house, for example the driving distance from the house to a beach, and the neighborhood in which the house lies.

Our final model does a respectable job of predicting home prices for almost all houses except some luxury homes, achieving a mean absolute error of around $200K for houses priced below $5 million.

2 Setup

Before proceeding with the analysis, we perform some initial setup, importing the required libraries and defining helper functions.

library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(osmdata)
library(geosphere)
library(kableExtra)
library(tidycensus)
library(scales)
library(mapboxapi)
library(leaflet)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(stargazer)
library(pander)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}


palette5 <- c("#FA7800","#C48C04","#8FA108","#5AB60C","#25CB10")
palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

# Quantile break labels for map legends; rnd controls whether values are rounded
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}
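
# Note: q5 and qBr are used together in the maps below -- q5 bins a variable
# into quintiles for the color aesthetic, while qBr supplies the matching
# legend labels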

# Mean distance from each point in measureFrom to its k nearest neighbors in measureTo
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}
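
# Illustrative (hypothetical) call: the mean distance from each house to its
# three nearest schools, mirroring how get_n_nn uses nn_function below:
#   nn_function(st_coordinates(miami_houses.sf), st_coordinates(schools), 3)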

st_c <- st_coordinates  # shorthand for extracting coordinate matrices

############################## HELPER FUNCTIONS ##############################

# Function to extract amenity from a passed string 
get_amenity <- function(x){
  
  amenities <- c("Pool", "Patio", "Fence", "Carport", "Wall", "Dock", "Gazebo", "Elevator",
                 "Enclosure", "Paving", "Jacuzzi", "Tiki Hut", "Tennis Court")
  
  for(am in amenities){
    if(grepl(am, x, ignore.case = TRUE)){
      return(am)
    }
  }
  # Non-empty strings that match no known amenity are labelled "Other"
  ifelse(!is.na(x) && nchar(str_trim(x)) > 1, "Other", NA)
}
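
# e.g. get_amenity("POOL - ABOVE GROUND") returns "Pool" (hypothetical input
# string; matching is case-insensitive), while an empty string returns NA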

# Function to get one-hot-encoded vectors for amenities
get_one_hot_encoded_amenities <- function(x){
  
  amenities <- c("Pool", "Patio", "Fence", "Carport", "Wall", "Dock", "Gazebo", "Elevator",
                 "Enclosure", "Paving", "Jacuzzi", "Tiki Hut", "Tennis Court")
  
  oh_amenities <- rep(0, length(amenities))
  names(oh_amenities) <- amenities
  
  for(n in names(x)){
    if(!is.na(x[n])){
      oh_amenities[x[n]] <- 1
    }
  }
  oh_amenities
}

# Function to extract the zone code from a ZoneCode string
get_zone_code <- function(x) {str_split(x, " ")[[1]][1]}
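
# e.g. get_zone_code("T3-R SUB-URBAN TRANSECT ZONE") returns "T3-R"
# (hypothetical input string)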

# Function to fetch OSM points for a given amenity key/value within the bounding box
get_amenity_data <- function(bbox, key, value, crs, base.osm){
  
  am_osm_data <- opq(bbox = bbox) %>% 
    add_osm_feature(key = key, value = value) %>%
    osmdata_sf()
  
  points <- 
    am_osm_data$osm_points %>%
    .[base.osm,] 
  
  points <- points %>% st_transform(crs=crs)
  
  points
  
}

# Function to add, for each i in 1..n, the average distance to the i nearest amenities
get_n_nn <- function(df, points, n, var_name){
  
  for(i in 1:n){
    col = paste0(var_name, "_nn", i)
    df[[col]] <- nn_function(st_c(df), st_c(points), i)
  }
  
  df 
}

# Function to get the driving distance from a house to South Beach
get_driving_dist_to_south_beach <- function(prop_addr){
  
  tryCatch({
    prop_geo_code <- mb_geocode(prop_addr)
    
    route <- mb_directions(origin = prop_geo_code,
                           destination = "South Beach, MIAMI",
                           profile = "driving")
    
    route$distance  
  }, error = function(cond){
      return(NA)
  })
  
}
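
# e.g. get_driving_dist_to_south_beach("123 Main St, Miami, 33101") would
# geocode this (hypothetical) address via Mapbox and return the driving route
# distance to South Beach, or NA if geocoding or routing fails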

3 Data

3.1 Data Gathering Methodology

For this prediction task, we used data from five sources as described below:

  • Housing Data: This data was provided to us as part of the analysis. It contains one row per house along with some of its features like living area, number of bedrooms and so on.

  • Neighborhood Data: An open dataset that we use to find the boundaries of the various neighborhoods in the Miami region.

  • Municipality Data: The neighborhood data discussed above does not contain a boundary for Miami Beach, hence we use its municipal boundary as a proxy for the neighborhood.

  • OpenStreetMap (OSM) Data: We use this to access geodata of interest, such as the Miami coastline and the locations of amenities like nightclubs and transit stops in the Miami region.

  • Census Data: We use this data to check the generalizability of the predictive model we build across race and income groups.

# CRS (Lambert conformal conic, units in meters)
crs = st_crs("+proj=lcc +lat_1=41.71666666666667 +lat_2=42.68333333333333 +lat_0=41 +lon_0=-71.5 +x_0=200000 +y_0=750000 +ellps=GRS80 +units=m +no_defs")

# Loading the given house data
miami_houses.sf <- st_read("studentsData.geojson") %>% 
  st_transform(crs=crs) %>% 
  st_centroid() %>% 
  mutate(PricePerSq = SalePrice/ActualSqFt)

# Loading neighborhood data and filtering out some unneeded neighborhoods
nbhood <- st_read("https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson") %>%
  st_transform(crs=crs) %>% 
  select(LABEL) %>% 
  rename(nbhood = LABEL) %>% 
  filter(!nbhood %in% c("Virginia Key", " ", "Port of Miami", "Watson Island", "Pelican Harbor"))

# Use municipal data to find Miami Beach boundary
miami_beach_boundary <- st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
  st_transform(crs=crs) %>%
  filter(NAME == 'MIAMI BEACH') %>%
  select(NAME) %>% 
  st_union() %>% 
  st_sf() %>% 
  mutate(nbhood = "Miami Beach")

# Adding Miami Beach to neighborhood data
nbhood <- rbind(nbhood, miami_beach_boundary)

# Correcting for a house that lies on two neighborhood boundaries
miami_houses.sf <- miami_houses.sf %>% 
  st_join(nbhood) %>% 
  mutate(nbhood = ifelse(is.na(nbhood), "Belle Meade", nbhood))

# Loading the base map for Miami
miami.base <- 
  st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>% 
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>% 
  st_transform(crs=crs) %>% 
  st_union()

3.2 Feature Engineering

After collecting the data, we focused on cleaning some of the existing features and on using the open data sources described in the section above to engineer new features of interest that might help us predict house prices. Below we briefly describe the features we engineered:

  • Using Provided Data
    • Utilities in the house: Using the columns in the existing data, we came up with features capturing the presence/absence of a utility within a house, for example, presence of pool, patio, elevator and so on.
    • Categorizing Effective Year Built: Every house was bucketed into 3 categories, namely, Before 1960, 1960-2010 and After 2010.
    • Age of the house: Calculated as the difference between the sale year and the year the house was built.
  • Using Open Source Data
    • Distance to coastline: For every house, we found its distance to the Miami coastline, and also bucketed it based on whether or not the house lies within 0.5 miles of the coastline.
    • Amenities/Public Services: For every house, we found the distance to the nearest amenities/public services that we felt were important in determining the price of a house, such as eateries, schools, police stations, public transport, and healthcare.
    • Driving Distance to South Beach: South Beach is Miami's most famous beach, with plenty of tourist activity and retail. We thought it would be interesting to see whether the ease of getting from a house to this beach affects its selling price. This feature was also bucketed based on whether or not the house lies within 7.5 miles of driving distance to South Beach.
################ Feature Engineering with the given set of features ################

# Generating one-hot-encoded vectors for amenities 
miami_houses.sf$Amenity1 <- unname(sapply(miami_houses.sf$XF1, get_amenity))
miami_houses.sf$Amenity2 <- unname(sapply(miami_houses.sf$XF2, get_amenity))
miami_houses.sf$Amenity3 <- unname(sapply(miami_houses.sf$XF3, get_amenity))

oh_amenities_df <- st_drop_geometry(miami_houses.sf) %>% 
  select(Amenity1, Amenity2, Amenity3) %>% 
  apply(1, get_one_hot_encoded_amenities) %>% 
  as.matrix() %>% 
  t() %>% 
  as.data.frame()

# Cleaning the ZoneCode column 
# zonecode.clean <- unname(sapply(miami_houses.sf$Zoning, get_zone_code))
# miami_houses.sf$ZoneCode <- zonecode.clean

# Cleaning the Year Build and Effective Year Build column
miami_houses.sf <- miami_houses.sf %>% 
  mutate(
    EffectiveYearBuilt.cat = case_when(
      EffectiveYearBuilt < 1960 ~ "Before 1960",
      EffectiveYearBuilt >= 1960 & EffectiveYearBuilt <= 2010 ~ "1960 - 2010",
      EffectiveYearBuilt > 2010 ~ "After 2010"
    )
  ) 
  # mutate(
  #   YearBuilt.cat = case_when(
  #     YearBuilt < 1990 ~ "Before 1990",
  #     YearBuilt >= 1990 & YearBuilt <= 2010 ~ "1990 - 2010",
  #     YearBuilt > 2010 ~ "After 2010"
  #   )
  # )


# Adding the age of the building 
miami_houses.sf <- miami_houses.sf %>% 
  mutate(Age = as.numeric(saleYear) - YearBuilt)

################ Engineering new features ################

#################################
##### Distance to Coastline ##### 
#################################

# Get base map for Miami to use with the OSM data
miami.base.osm <- 
  st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
  filter(NAME == "MIAMI BEACH" | NAME == "MIAMI") %>%
  st_union()

# Generate the bounding box around Miami region we want
xmin = st_bbox(miami.base.osm)[[1]]
ymin = st_bbox(miami.base.osm)[[2]]
xmax = st_bbox(miami.base.osm)[[3]]  
ymax = st_bbox(miami.base.osm)[[4]]

bbox <- c(xmin, ymin, xmax, ymax)

# Coastline Data
Coastline<-opq(bbox = bbox) %>% 
  add_osm_feature("natural", "coastline") %>%
  osmdata_sf() 

# Uncomment the code below to calculate the shoreline distance
# This takes some time, therefore we load and use precomputed values

# miami_houses.sf <-
#   miami_houses.sf %>%
#   mutate(CoastDist=(geosphere::dist2Line(p=st_coordinates(miami_houses.sf %>% st_transform(crs=4326)),
#                                          line=st_coordinates(Coastline$osm_lines)[,1:2])*0.00062137)[,1])

# Load precomputed values
load("CoastDist.RData")
miami_houses.sf$CoastDist <- CoastDist 

# Creating categories for distance to shoreline
miami_houses.sf <- 
  miami_houses.sf %>% 
  mutate(
    CoastDist.cat = ifelse(CoastDist < 0.5, "LT 0.5", "GT 0.5")
  )

#############################
##### Various Amenities ##### 
#############################

# Eateries
eateries <- get_amenity_data(bbox, 'amenity', c("restaurant", "cafe", "fast_food"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, eateries, 1, "eateries")

### Nightclub
nightlife <- get_amenity_data(bbox, 'amenity', c("bar", "nightclub", "pub", "casino", "gambling"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, nightlife, 1, "nightlife")

### Schools
schools <- get_amenity_data(bbox, 'amenity', c("school", "university", "kindergarten","college"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, schools, 3, "school")

### Police stations 
police_stations <- get_amenity_data(bbox, 'amenity', "police", crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, police_stations, 1, "police_station")

### Healthcare
healthcare <- get_amenity_data(bbox, 'amenity', c("clinic", "hospital"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, healthcare, 1, "healthcare")

### Public transport 
public_transport <- get_amenity_data(bbox, 'public_transport', c("stop_position"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, public_transport, 1, "public_transport")

### Supermarket
supermarket <- get_amenity_data(bbox, 'shop', c("supermarket"), crs, miami.base.osm)
miami_houses.sf <- get_n_nn(miami_houses.sf, supermarket, 1, "supermarket")

###########################################
##### Driving Distance to South Beach ##### 
###########################################

# Get a column for complete house address 
miami_houses.sf <- miami_houses.sf %>% 
  mutate(prop_comp_addr = paste0(Property.Address, ", ",
                                 Property.City, ", ",
                                 Property.Zip)) 

# Uncomment the code below to calculate the driving distance in real time
# This takes some time, hence we use precomputed values

# miami_beach_dists <- unname(sapply(miami_houses.sf$prop_comp_addr, get_driving_dist_to_south_beach))
# miami_beach_dists[sapply(miami_beach_dists, is.null)] <- mean(unlist(miami_beach_dists), na.rm = TRUE)
# miami_beach_dists <- unlist(miami_beach_dists)

# Load the precomputed driving distance to south beach
load("miami_beach_dists_prop.RData")

# Add column to data
miami_houses.sf$south_beach_drive_dist <- miami_beach_dists

# Create a categorical variable for the driving distance to south beach variable 
miami_houses.sf <- 
  miami_houses.sf %>% 
  mutate(
    south_beach_drive_dist.cat = ifelse(south_beach_drive_dist < 7.5, "LT 7.5", "GT 7.5")
  )

##################################
##### Filtering out features ##### 
##################################

# Selecting the final features we will use for the analysis
miami_houses_feats.sf <- cbind(miami_houses.sf, oh_amenities_df) %>%
  select(toPredict, SalePrice, saleYear, AdjustedSqFt, LotSize, LivingSqFt, ActualSqFt, Bed, Bath, Stories,
         EffectiveYearBuilt.cat, Pool, Patio, Fence, Carport, Wall, Dock, Gazebo, Elevator,Enclosure,
         Paving, Jacuzzi, Tiki.Hut, Tennis.Court,
         CoastDist, CoastDist.cat,
         eateries_nn1, nightlife_nn1, school_nn1,
         police_station_nn1, healthcare_nn1, public_transport_nn1, supermarket_nn1,
         nbhood,
         south_beach_drive_dist, south_beach_drive_dist.cat,
         Age)

3.3 Potential Features

Below we present summary statistics for all the potential features we considered for this analysis, grouped by category (internal characteristics, amenities/public services, and spatial structure).

stargazer(st_drop_geometry(miami_houses_feats.sf) %>% select("AdjustedSqFt","LotSize","LivingSqFt","ActualSqFt","Bed","Bath","Stories","Pool","Patio","Fence","Carport","Wall","Dock","Gazebo","Elevator","Enclosure","Paving","Jacuzzi","Tiki.Hut","Tennis.Court","Age"), type = "text",
          title="Descriptive statistics- Internal Characteristics", digits=1,median=TRUE,iqr=FALSE,
          covariate.labels=c("Adjusted Square Feet","Lot Size","Living Square Feet","Actual Square Feet","Number of Beds","Number of Baths","Number of Stories","Presence of Pool","Presence of Patio","Presence of Fence","Presence of Carport","Presence of Wall","Presence of Dock","Presence of Gazebo","Presence of Elevator","Presence of Enclosure","Presence of Paving","Presence of Jacuzzi","Presence of Tikki Hut","Presence of Tennis Court"))
## 
## Descriptive statistics- Internal Characteristics
## ======================================================================================
## Statistic                  N    Mean   St. Dev.  Min  Pctl(25) Median  Pctl(75)  Max  
## --------------------------------------------------------------------------------------
## Adjusted Square Feet     3,503 2,082.2 1,426.0   331   1,274    1,690   2,379   18,006
## Lot Size                 3,503 7,657.8 4,401.4  1,250  5,500   6,693.8  8,025   80,664
## Living Square Feet       3,503 2,011.4 1,369.2   288  1,220.5   1,632   2,321   18,006
## Actual Square Feet       3,503 2,367.7 1,732.3   388   1,412    1,884  2,676.5  20,192
## Number of Beds           3,503   3.0     1.1      0      2        3       4       13  
## Number of Baths          3,503   2.1     1.3      0      1        2       3       12  
## Number of Stories        3,503   1.2     0.4      0      1        1       1       4   
## Presence of Pool         3,503   0.3     0.5      0      0        0       1       1   
## Presence of Patio        3,503   0.6     0.5      0      0        1       1       1   
## Presence of Fence        3,503   0.6     0.5      0      0        1       1       1   
## Presence of Carport      3,503   0.1     0.2      0      0        0       0       1   
## Presence of Wall         3,503   0.1     0.3      0      0        0       0       1   
## Presence of Dock         3,503  0.04     0.2      0      0        0       0       1   
## Presence of Gazebo       3,503  0.01     0.1      0      0        0       0       1   
## Presence of Elevator     3,503  0.01     0.1      0      0        0       0       1   
## Presence of Enclosure    3,503  0.01     0.1      0      0        0       0       1   
## Presence of Paving       3,503  0.004    0.1      0      0        0       0       1   
## Presence of Jacuzzi      3,503  0.01     0.1      0      0        0       0       1   
## Presence of Tiki Hut     3,503  0.003    0.1      0      0        0       0       1   
## Presence of Tennis Court 3,503  0.001    0.02     0      0        0       0       1   
## Age                      3,503  67.9     22.5    -1      66      72       80     115  
## --------------------------------------------------------------------------------------
stargazer(st_drop_geometry(miami_houses_feats.sf) %>% select("eateries_nn1","nightlife_nn1","school_nn1","police_station_nn1","healthcare_nn1","public_transport_nn1","supermarket_nn1"), type = "text",
          title="Descriptive statistics- Public Services/Amenities", digits=1,median=TRUE,iqr=FALSE,
          covariate.labels=c("Dist to nearest eatery","Dist to nearest nightlife","Dist to nearest school","Dist to nearest police station","Dist to nearest hospital","Dist to nearest transit stop","Dist to nearest supermarket"))
## 
## Descriptive statistics- Public Services/Amenities
## ============================================================================================
## Statistic                        N    Mean   St. Dev. Min  Pctl(25) Median  Pctl(75)   Max  
## --------------------------------------------------------------------------------------------
## Dist to nearest eatery         3,503  788.6   558.7   35.6  374.4    639.3  1,056.4  2,856.6
## Dist to nearest nightlife      3,503 1,564.0  942.2   52.7  945.6   1,362.9 1,967.7  5,984.2
## Dist to nearest school         3,503  388.7   357.1   13.0  191.9    306.5   459.0   2,423.9
## Dist to nearest police station 3,503 1,662.7 1,106.8  13.0  864.0   1,384.7 2,078.8  5,346.6
## Dist to nearest hospital       3,503 1,826.2 1,116.3  28.1 1,081.2  1,652.3 2,262.1  6,391.8
## Dist to nearest transit stop   3,503  682.7   684.2   22.5  210.2    408.5   920.7   3,551.2
## Dist to nearest supermarket    3,503 1,460.0  865.9   83.0  778.4   1,263.1 2,019.2  4,070.0
## --------------------------------------------------------------------------------------------
stargazer(st_drop_geometry(miami_houses_feats.sf) %>% select("CoastDist","south_beach_drive_dist"), type = "text",
          title="Descriptive statistics- Public Services/Amenities", digits=1,median=TRUE,iqr=FALSE,
          covariate.labels=c("Distance to Coastline","Driving Distance to Miami South Beach"))
## 
## Descriptive statistics- Spatial Structure
## =============================================================================================
## Statistic                               N   Mean St. Dev.  Min  Pctl(25) Median Pctl(75) Max 
## ---------------------------------------------------------------------------------------------
## Distance to Coastline                 3,503 1.1    1.3    0.000   0.1     0.5     1.6    5.4 
## Driving Distance to Miami South Beach 3,503 13.9   5.2     1.3    10.8    14.7    17.5   25.4
## ---------------------------------------------------------------------------------------------
other <- data.frame(
  "Internal Characteristics" = c("Sale Year", "Effective Year Built", ""),
  "Spatial Characteristics" = c("Whether distance to coastline >= 0.5 miles or not",
                                "Whether driving distance to South Beach >= 7.5 miles or not",
                                "Which neighborhood the house belongs to"))
other %>% pander("Categorical Variables")

Categorical Variables

Internal Characteristics   Spatial Characteristics
------------------------   ------------------------------------------------------------
Sale Year                  Whether distance to coastline >= 0.5 miles or not
Effective Year Built       Whether driving distance to South Beach >= 7.5 miles or not
                           Which neighborhood the house belongs to

3.4 Exploring the Data

3.4.1 Correlation Matrix

Next we present a correlation matrix (Figure 1) of all the numeric potential features gathered so far that could help us predict house prices. Key takeaways from this plot include:

  • Many features are highly correlated with sale price, for example the adjusted square footage of the house, the presence of a pool or elevator, the distance to the coastline, and so on.

  • Many features are correlated with one another, for example the features accounting for the size of the house, or the number of bedrooms and bathrooms. We will have to deal with these features to ensure the stability of our predictive model.

  • Some features, like the presence of a wall, enclosure, or tiki hut, are not correlated with the price of the house, and hence we may drop them.

# Filter out the numerical variables 
numericVars <- 
  select_if(st_drop_geometry(miami_houses_feats.sf), is.numeric) %>% na.omit() %>% 
  filter(toPredict == 0) %>% 
  select(-toPredict)

# Create a correlation plot
ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
  labs(title = "Correlation Across Numeric Variables",
       caption = "Figure 1") 

3.4.2 Correlation Scatterplots

Below we present home price correlation scatterplots for 4 variables (Figure 2): the age of the house, the distance to the coastline, the distance to the nearest nightlife amenity, and the distance to the nearest police station. Key takeaways from this plot include:

  • Newer houses tend to have higher prices.
  • Houses near the coastline tend to be more costly.
  • Houses near nightlife amenities like pubs, bars, and nightclubs tend to be more costly.
  • Interestingly, houses near police stations are much cheaper.
# Correlation Scatterplots
st_drop_geometry(miami_houses_feats.sf) %>%
  filter(toPredict==0) %>%
  dplyr::select(SalePrice, police_station_nn1, Age, nightlife_nn1, CoastDist) %>%
  rename(
    `Nearest Police Station` = police_station_nn1,
    `Age` = Age,
    `Nearest Nightlife` = nightlife_nn1,
    `Distance to Coast` = CoastDist
  ) %>% 
  gather(Variable, Value, -SalePrice) %>%
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a Function of Continuous Variables",
       caption = "Figure 2") +
  plotTheme()

3.4.3 Map of Dependent Variable

Below we present a map of the sale prices of the houses (Figure 3). A quick glance at the map reveals that houses near the coastline tend to be priced higher, and that prices tend to fall as one moves away from the coastline. This is consistent with the correlation plot we saw in the section above for the distance-to-coast variable (Figure 2).

# Map of Sale Price 
ggplot() +
  geom_sf(data = miami.base, fill = "grey40") +
  geom_sf(data = miami_houses.sf %>% filter(toPredict == 0), aes(colour = q5(SalePrice)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_houses.sf %>% filter(toPredict == 0),"SalePrice"),
                      name="Sale Price\n(Quantile Breaks)") +
  labs(title="Sale Price of Houses, Miami",
       caption = "Figure 3") +
  mapTheme()

3.4.4 Map of Independent Variables

Below we map three of the independent variables (Figure 4): whether or not the house has a pool, the driving distance from the house to South Beach, and the distance to the nearest public transport stop. An interesting observation is that pools are generally associated with houses that are close to the coastline or lie in the Miami Beach region.

# Map of 3 independent variables

# Plot 1
ggplot() +
  geom_sf(data = miami.base, fill = "grey40") +
  geom_sf(data = miami_houses_feats.sf, aes(color=factor(Pool)), 
          show.legend = "point", size = .75) +
  scale_color_manual(values=c("#fc8d59", "#91cf60"))+
  labs(title="Houses having Pool, Miami",
       color = "Pool Present"
       ) +
  mapTheme()

# Plot 2
ggplot() +
  geom_sf(data = miami.base, fill = "grey40") +
  geom_sf(data = miami_houses_feats.sf, aes(colour = q5(south_beach_drive_dist)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_houses_feats.sf,"south_beach_drive_dist"),
                      name="Distance\n(Quantile Breaks)") +
  labs(title="Driving Distance to South Beach, Miami") +
  mapTheme()

# Plot 3
ggplot() +
  geom_sf(data = miami.base, fill = "grey40") +
  geom_sf(data = miami_houses_feats.sf, aes(colour = q5(public_transport_nn1)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_houses_feats.sf,"public_transport_nn1"),
                      name="Distance\n(Quantile Breaks)") +
  labs(title="Distance to Nearest Public Transport Stop, Miami",
       caption = "Figure 4") +
  mapTheme()

4 Method

To come up with a model for predicting home prices, we use the following methodology:

  • Given our data, we divide it into two sets: one on which we build our prediction model, and a second on which we test the model and analyze its performance on unseen data.

  • To build the model, we use ordinary least squares (OLS) regression, which, given a set of house features, learns the relationship between those features and the sale price of the house (a minimal sketch of this pipeline follows this list).

  • Next, we analyze the errors of our model to check where it goes wrong and whether we are missing some spatial information.

  • Lastly, we check whether the model we built is generalizable across neighborhoods, race, and income groups.
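
The following is a minimal sketch of this pipeline, assuming a hypothetical data frame houses of labelled sales (the full implementation, with the complete feature set and stratified splitting, appears in Section 5):

# Minimal sketch of the modeling pipeline (houses is a hypothetical data frame)
library(caret)

# 1. Split the labelled data into training and test sets
idx   <- createDataPartition(y = houses$SalePrice, p = 0.80, list = FALSE)
train <- houses[idx, ]
test  <- houses[-idx, ]

# 2. Fit an OLS model of sale price on all features
fit <- lm(SalePrice ~ ., data = train)

# 3. Evaluate errors on the unseen test set
pred <- predict(fit, test)
mae  <- mean(abs(pred - test$SalePrice))
mape <- mean(abs(pred - test$SalePrice) / test$SalePrice)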

5 Results

5.1 Summary Results

The following table shows the results of fitting an OLS model to predict house prices. For some of the features we considered, the results can be interpreted as follows:

  • An increase of 1 square foot in the adjusted square footage of the house results in an increase of approximately $282 in sale price.
  • Having a pool results in an increase of approximately $150,000 in the sale price.
  • A house within the Miami Beach neighborhood sells for approximately $380,000 more (relative to the baseline neighborhood).
  • Our model, given this set of features, explains roughly 84% of the variation in the sale price of the homes in the training data (R2 = 0.843; adjusted R2 = 0.835).

Note: We drop some of the features discussed in the sections above to make the model more stable and to improve its fit and generalizability.
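
For illustration, the dollar figures quoted above can be read directly off the fitted coefficients; the sketch below assumes the reg.training model fitted later in this section:

# Illustrative only: the interpretations above come straight from the
# coefficients of the reg.training model fitted below
coef(reg.training)[c("AdjustedSqFt", "Pool", "nbhoodMiami Beach")]
# approximately 282, 150349 and 379532, matching the regression table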

# Remove features we do not need
miami_houses_feats.sf <- miami_houses_feats.sf %>% 
  select(-c(LotSize, LivingSqFt, ActualSqFt, Bath, 
            Wall, Gazebo, Paving, Jacuzzi, Tiki.Hut, Tennis.Court, CoastDist, south_beach_drive_dist))

# Set seed for reproducibility
set.seed(2010)

# Separate out the data available for building the model
miami_test.sf <- miami_houses_feats.sf %>% filter(toPredict == 1) %>% select(-toPredict)
miami_train.sf <- miami_houses_feats.sf %>% filter(toPredict == 0) %>% select(-toPredict)

# Create partition of the available data into training and testing 
inTrain <- createDataPartition(
  y = paste(miami_train.sf$saleYear, miami_train.sf$EffectiveYearBuilt.cat,  miami_train.sf$nbhood),
  p = .80, list = FALSE)

miami.training <- miami_train.sf[inTrain,] 
miami.test <- miami_train.sf[-inTrain,]  

# Remove outliers (top and bottom 5% of sale prices) from the training data
upper_bound <- quantile(miami.training$SalePrice, 0.95)
lower_bound <- quantile(miami.training$SalePrice, 0.05)

to_keep_in_training <- miami.training$SalePrice < upper_bound & miami.training$SalePrice > lower_bound 

miami.training <- miami.training[to_keep_in_training,]

# Train OLS model and output trained model parameters and fitness 
reg.training <- lm(SalePrice ~ ., data = st_drop_geometry(miami.training))

tab_model(reg.training)
  SalePrice
Predictors Estimates CI p
(Intercept) -191898.63 -621242.39 – 237445.13 0.381
saleYear [2019] 11050.87 -10026.76 – 32128.49 0.304
saleYear [2020] 56510.96 29422.64 – 83599.28 <0.001
AdjustedSqFt 282.44 260.38 – 304.51 <0.001
Bed -12635.47 -26897.65 – 1626.71 0.082
Stories 19996.91 -11550.76 – 51544.58 0.214
EffectiveYearBuilt.cat [After 2010] 237810.02 181639.00 – 293981.04 <0.001
EffectiveYearBuilt.cat [Before 1960] -50724.40 -72413.95 – -29034.84 <0.001
Pool 150349.06 119535.20 – 181162.92 <0.001
Patio 6562.51 -14762.69 – 27887.72 0.546
Fence -11250.84 -33484.79 – 10983.11 0.321
Carport 2357.48 -37329.06 – 42044.03 0.907
Dock 439774.46 355811.52 – 523737.40 <0.001
Elevator 150378.80 -66961.86 – 367719.46 0.175
Enclosure -82269.90 -167496.08 – 2956.27 0.058
CoastDist.cat [LT 0.5] 27232.63 -13419.10 – 67884.36 0.189
eateries_nn1 22.18 -23.01 – 67.38 0.336
nightlife_nn1 20.34 -4.43 – 45.12 0.107
school_nn1 49.33 -0.92 – 99.57 0.054
police_station_nn1 -19.21 -39.83 – 1.42 0.068
healthcare_nn1 29.53 6.30 – 52.75 0.013
public_transport_nn1 -75.10 -123.50 – -26.71 0.002
supermarket_nn1 -43.10 -69.83 – -16.37 0.002
nbhood [Allapattah Industrial District] 108296.90 -383154.63 – 599748.44 0.666
nbhood [Auburndale] 48360.30 -381568.60 – 478289.21 0.825
nbhood [Bay Heights] 444518.50 6431.29 – 882605.70 0.047
nbhood [Baypoint] 936637.66 505786.42 – 1367488.89 <0.001
nbhood [Bayside] 178418.87 -251073.45 – 607911.19 0.415
nbhood [Belle Island] 720494.90 235543.90 – 1205445.90 0.004
nbhood [Belle Meade] 220395.96 -204972.17 – 645764.10 0.310
nbhood [Belle Meade West] 162846.60 -269872.06 – 595565.26 0.461
nbhood [Bird Grove East] 270622.66 -172402.18 – 713647.49 0.231
nbhood [Bird Grove West] 284222.70 -157910.01 – 726355.40 0.208
nbhood [Biscayne Island] 34390.72 -582374.23 – 651155.67 0.913
nbhood [Biscayne Plaza] -224660.71 -822170.96 – 372849.55 0.461
nbhood [Brentwood] 132602.78 -334587.69 – 599793.25 0.578
nbhood [Buena Vista Heights] 223362.06 -208153.87 – 654877.99 0.310
nbhood [Buena Vista West] 189832.23 -237455.93 – 617120.39 0.384
nbhood [Citrus Grove] 4858.59 -424458.74 – 434175.91 0.982
nbhood [Civic Center] 124235.97 -368911.69 – 617383.64 0.621
nbhood [Coral Gate] 121620.19 -310792.97 – 554033.35 0.581
nbhood [Curtis Park] -11317.31 -456381.42 – 433746.80 0.960
nbhood [Douglas Park] 97104.14 -331338.65 – 525546.93 0.657
nbhood [East Grove] 682616.04 257502.82 – 1107729.26 0.002
nbhood [East Little Havana] 21960.85 -413807.82 – 457729.53 0.921
nbhood [Edgewater] 620393.12 135645.06 – 1105141.17 0.012
nbhood [Edison] 153445.69 -274190.62 – 581082.00 0.482
nbhood [Fair Isle] 468282.90 27254.81 – 909311.00 0.037
nbhood [Flagami] 27851.30 -399194.51 – 454897.11 0.898
nbhood [Flora Park] 5481.55 -433489.17 – 444452.27 0.980
nbhood [Grove Center] 457759.66 739.41 – 914779.90 0.050
nbhood [Hadley Park] 84281.42 -345777.42 – 514340.26 0.701
nbhood [Highland Park] 662956.21 188556.16 – 1137356.27 0.006
nbhood [Historic Buena Vista East] 216180.67 -223586.67 – 655948.01 0.335
nbhood [King Heights] 135689.97 -307314.05 – 578693.99 0.548
nbhood [La Pastorita] 188513.64 -267317.23 – 644344.51 0.417
nbhood [Latin Quarter] -52084.35 -521715.14 – 417546.43 0.828
nbhood [Le Jeune Gardens] 88635.71 -510731.53 – 688002.95 0.772
nbhood [Lemon City/Little Haiti] 198802.98 -235351.96 – 632957.92 0.369
nbhood [Liberty Square] 197016.75 -256812.05 – 650845.56 0.395
nbhood [Little River Central] 142107.73 -304195.36 – 588410.81 0.532
nbhood [Little River Gardens] 113694.43 -398640.68 – 626029.55 0.663
nbhood [Melrose] 67938.66 -369908.42 – 505785.73 0.761
nbhood [Miami Avenue] 545621.14 109977.47 – 981264.80 0.014
nbhood [Miami Beach] 379531.85 -54222.03 – 813285.72 0.086
nbhood [Morningside] 363824.80 -61027.73 – 788677.34 0.093
nbhood [North Grapeland Heights] 34662.93 -395790.69 – 465116.55 0.875
nbhood [North Grove] 408295.55 -18958.66 – 835549.76 0.061
nbhood [North Sewell Park] 55459.38 -387639.45 – 498558.21 0.806
nbhood [Northeast Overtown] 17154.26 -453669.43 – 487977.96 0.943
nbhood [Northwestern Estates] 211070.53 -236508.34 – 658649.39 0.355
nbhood [Oakland Grove] -111425.54 -702786.60 – 479935.52 0.712
nbhood [Old San Juan] 499146.47 67765.10 – 930527.85 0.023
nbhood [Orange Bowl] -123292.15 -610406.90 – 363822.60 0.620
nbhood [Orchard Villa] 105152.74 -335108.94 – 545414.41 0.640
nbhood [Palm Grove] 227708.40 -242356.25 – 697773.06 0.342
nbhood [Parkdale North] 57500.73 -382334.08 – 497335.53 0.798
nbhood [Parkdale South] 71794.97 -365189.58 – 508779.52 0.747
nbhood [Roads] 178273.83 -246231.10 – 602778.76 0.410
nbhood [San Marco Island] 642796.55 22430.48 – 1263162.63 0.042
nbhood [Santa Clara] 27880.24 -404620.36 – 460380.84 0.899
nbhood [Shenandoah North] 70597.10 -354579.20 – 495773.41 0.745
nbhood [Shenandoah South] 89764.70 -336914.73 – 516444.14 0.680
nbhood [Shorecrest] 121438.92 -302131.52 – 545009.36 0.574
nbhood [Silver Bluff] 110407.01 -314848.46 – 535662.48 0.611
nbhood [South Grapeland Heights] 11280.67 -424003.63 – 446564.97 0.959
nbhood [South Grove] 510433.07 81030.68 – 939835.47 0.020
nbhood [South Grove Bayside] 682577.59 241943.85 – 1123211.32 0.002
nbhood [South Sewell Park] 19274.43 -418134.94 – 456683.80 0.931
nbhood [Spring Garden] 153948.75 -335163.52 – 643061.01 0.537
nbhood [West Grapeland Heights] 13060.71 -431557.27 – 457678.68 0.954
nbhood [West Grove] 183167.39 -248744.00 – 615078.78 0.406
south_beach_drive_dist.cat [LT 7.5] 344619.33 232522.95 – 456715.70 <0.001
Age 886.89 266.21 – 1507.57 0.005
Observations 2015
R2 / R2 adjusted 0.843 / 0.835

5.2 Performance on Test Set

To get a sense of how good our fitted model is, we evaluate it on the test set. From Table 1 it can be seen that our model makes a mean absolute error of around $438,000 (mean absolute percentage error, MAPE, of 35.7%), which is considerably high given that the average sale price of a home in the test set is around $1.1 million. (Note that, as computed here, the APE uses the predicted rather than the observed price as its denominator.)

# Find median price 
median_training_sale_price <- median(miami.training$SalePrice)

# Find predictions and errors on test set 
miami.test <- miami.test %>%
  mutate(SalePrice.Predict = predict(reg.training, miami.test),
         SalePrice.Predict = ifelse(SalePrice.Predict < 0, median_training_sale_price, SalePrice.Predict),
         SalePrice.Error = SalePrice.Predict - SalePrice,
         SalePrice.AbsError = abs(SalePrice.Predict - SalePrice),
         SalePrice.APE = abs((SalePrice.Predict - SalePrice) / SalePrice.Predict))

# Remove houses with extremely high sale prices (over $5M) to get a
# better estimate of the mean error later
miami_errors <- miami.test %>% 
  filter(SalePrice < 5000000)

# Model performance on test set 
st_drop_geometry(miami.test) %>% 
  summarise(
    `Mean Absolute Error` = mean(SalePrice.AbsError),
    `MAPE` = mean(SalePrice.APE)
  )%>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  footnote(general_title = "\n",
           general = "Table 1")
Mean Absolute Error   MAPE
438546.2              0.3570333

Table 1

If we remove houses with sale prices above $5 million, the mean absolute error falls to around $195,000 (MAPE of 29.7%), as shown in Table 2. This indicates that our model may be failing to accurately predict prices for high-priced houses.

# Model performance on test set 
st_drop_geometry(miami_errors) %>% 
  summarise(
    `Mean Absolute Error` = mean(SalePrice.AbsError),
    `MAPE` = mean(SalePrice.APE)
  )%>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  footnote(general_title = "\n",
           general = "Table 2")
Mean Absolute Error   MAPE
195597.2              0.2969766

Table 2

5.3 Cross-Validation

To check whether our model generalizes to new data, we employ 100-fold cross-validation, essentially repeating the above training/testing process 100 times on different splits and analyzing the errors of the 100 resulting models. The mean MAE across the 100 folds is around $502,000, with a standard deviation of around $140,000 (Table 3), indicating substantial variation in error across folds. This can also be seen in the distribution of MAE over the 100 folds (Figure 5), where some folds show substantially higher errors. These two findings indicate that our model might not generalize well to new data.

# Set seed for reproducibility
set.seed(2010)

# Configure 100 fold cross validation
fitControl <- trainControl(method = "cv", number = 100)

# Perform 100 fold cross validation 
reg.cv <- train(SalePrice ~ ., data = st_drop_geometry(miami_train.sf), 
          method = "lm", trControl = fitControl, na.action = na.pass)

# Mean and Stddev MAE
reg.cv$resample %>% 
  summarise(
    `Mean MAE` = mean(MAE),
    `Standard Deviation MAE` = sd(MAE)
  )%>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  footnote(general_title = "\n",
           general = "Table 3")
Mean MAE   Standard Deviation MAE
502538.8   139973.2

Table 3
# Distribution of MAE
reg.cv$resample %>% 
  ggplot() +
  geom_histogram(aes(x=MAE), bins=100, fill = "#FA7800", color="white") + 
  labs(
    title = "Distribution of MAE",
    subtitle = "K-fold cross validation, k=100",
    x = "Mean Absolute Error",
    caption = "Figure 5"
  ) + 
  plotTheme()

5.4 Predicted Price as function of Observed Price

Next, we take a deeper dive into where our prediction model might be going wrong. As can be seen in Figure 6, the model does quite well in predicting prices for houses valued below roughly $2 million, but under-predicts the value of homes above this threshold. This explains why we were getting high absolute errors and MAPE on the test set.

# Plot of predicted price as a function of observed price 
miami_errors %>%
  select(SalePrice.Predict, SalePrice) %>%
  ggplot(aes(SalePrice, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(SalePrice, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, SalePrice), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") + 
  labs(title="Predicted sale price as a function of observed price",
       x = "Observed Sale Price",
       y = "Predicted Sale Price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction",
       caption="Figure 6") +
  plotTheme() + theme(plot.title = element_text(size = 18, colour = "black"))

5.5 Map of Residuals

Following the discussion above, we next examine how the errors are spatially distributed. Mapping the residuals (Figure 7) reveals that our model makes substantial errors on houses that either lie in the Miami Beach region or are close to the coastline. We also saw from the map of sale prices that these areas tend to contain the highest-priced houses.

# Map of residuals on test set
miami_errors %>% 
  ggplot() + 
  geom_sf(data = miami.base, fill = "grey40") + 
  geom_sf(data = miami_errors, aes(colour = q5(SalePrice.AbsError)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_errors,"SalePrice.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Sale Price Errors on Test Set",
       caption = "Figure 7") +
  mapTheme()

5.6 Moran’s I test

Next we conduct a Moran's I test to check whether our model is missing spatial information. The test yields a Moran's I of 0.153 with a significant p-value, indicating that the errors from the current model cluster more than one would expect by random chance (Figure 8). This can also be seen in Figure 9, which plots the mean error among a house's five nearest neighbors (the spatial lag) against the error of the house itself. This relation shows a significant correlation of 0.27, indicating that the model is missing some spatial information.
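
For reference, the statistic simulated below is Moran's I of the prediction errors under row-standardized k = 5 nearest-neighbor weights:

$$ I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2} $$

where e_i is the sale price error for house i and w_ij is the spatial weight between houses i and j. A value near zero indicates spatially random errors, while positive values indicate clustering.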

# Filtering extreme errors that affect plotting
miami.test.filtered <- miami.test %>% 
  filter(SalePrice < 1500000)

# Calculating Spatial Lag in Errors
coords.test <-  st_coordinates(miami.test.filtered) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

miami.test.filtered$lagPriceError <- lag.listw(spatialWeights.test, miami.test.filtered$SalePrice.Error)

# Moran's I test
moranTest <- moran.mc(miami.test.filtered$SalePrice.Error, 
                      spatialWeights.test, nsim = 999)

moranTest 
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  miami.test.filtered$SalePrice.Error 
## weights: spatialWeights.test  
## number of simulations + 1: 1000 
## 
## statistic = 0.15321, observed rank = 1000, p-value = 0.001
## alternative hypothesis: greater
# Plotting 
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count",
       caption = "Figure 8") +
  plotTheme()

# Plotting the spatial lag of errors as a function of sale price error
miami.test.filtered %>% 
  ggplot(aes(x = SalePrice.Error, y = lagPriceError)) + 
  geom_point(color = 'orange') + 
  geom_smooth(method = "lm", color="green", se=FALSE) + 
  labs(
    title = "Spatial Lag as a function of Sale Price",
    x = "Sale Price Error",
    y = "Lag Price Error",
    caption = "Figure 9"
  ) + 
  plotTheme()

5.7 Map of Predicted Values

Next, we use our model to generate predictions on the entire training set (toPredict = 0) and the secret hold-out set (toPredict = 1). Comparing these two maps (Figure 10) with the map of the dependent variable in Figure 3, we can see that our model captures the broad spatial process: higher prices are associated with houses close to the coastline or in the Miami Beach region, and prices tend to fall as one moves farther away.

# Calculate prediction on both training and test set
miami_houses_feats.sf <- miami_houses_feats.sf %>%
  mutate(SalePrice.Predict = predict(reg.training, miami_houses_feats.sf))

miami_houses_feats.sf %>% 
  filter(toPredict == 0) %>% 
  ggplot() + 
  geom_sf(data = miami.base, fill = "grey40") + 
  geom_sf(aes(colour = q5(SalePrice.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_houses_feats.sf,"SalePrice.Predict"),
                      name="Sale Price Predict\n(Quintile Breaks)") +
  labs(title="Predictions on Training set (toPredict == 0)") +
  mapTheme()

miami_houses_feats.sf %>% 
  filter(toPredict == 1) %>% 
  ggplot() + 
  geom_sf(data = miami.base, fill = "grey40") + 
  geom_sf(aes(colour = q5(SalePrice.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                      labels=qBr(miami_houses_feats.sf,"SalePrice.Predict"),
                      name="Sale Price Predict\n(Quintile Breaks)") +
  labs(title="Predictions on Testing set (toPredict == 1)",
       caption="Figure 10") +
  mapTheme()

5.8 Map of MAPE by Neighborhood

Next we look at whether our model generalizes well across neighborhoods. From Figure 11, it can be seen that the model does a fairly decent job of predicting house values across neighborhoods, although there are a few neighborhoods, like East Little Havana, Santa Clara and Lemon City, where the MAPE values are on the higher side. But given that these areas contain very few houses, the MAPE estimates there might not be that reliable.

# Map of mean absolute percentage error (MAPE) by neighborhood
st_drop_geometry(miami_errors) %>%
  group_by(nbhood) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  right_join(nbhood) %>%
  st_sf() %>%
  ggplot() + 
  geom_sf(aes(fill = mean.MAPE)) +
  geom_sf(data = miami_errors, colour = "black", size = .5) +
  scale_fill_gradient(low = palette5[1], high = palette5[5],
                      name = "MAPE", limits = c(0,1)) +
  labs(title = "Mean test set MAPE by neighborhood",
       caption = "Figure 11") +
  mapTheme()

5.9 MAPE by Neighborhood as Function of Mean Price by Neighborhood

To further examine the generalizability of the model across neighborhoods, we look at the relationship between the mean house price in a neighborhood and the model's MAPE in that neighborhood. Ideally, one would expect a constant, low MAPE across all neighborhoods; but as Figure 12 shows, some neighborhoods, as also discussed in the section above, have high MAPE values, calling the model's generalizability into question.

# Scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood
st_drop_geometry(miami_errors) %>%
  group_by(nbhood) %>%
  summarize(
    MAPE = mean(SalePrice.APE, na.rm = T),
    mean.SalePrice = mean(SalePrice)
    ) %>% 
  ggplot(aes(x = mean.SalePrice, y = MAPE)) +
  geom_point(color="orange") + 
  geom_smooth(method = "lm", color="green", se=FALSE)+
  labs(
    title = "MAPE by neighborhood as a function of mean sale price",
    x = "Mean Sale Price",
    caption = "Figure 12"
  ) + 
  plotTheme()

5.10 Generalizability by Race and Income

Finally, we look into whether our model is generalizable across different race and income groups in Miami. For this we rely on census data, labeling each census tract as either Majority White or Majority Non-White for race, and as High Income or Low Income for income, as shown in Figure 13. The plot reveals a crucial spatial process: the Non-White population is concentrated in the north, while the high-income groups are generally situated along the coastline and in the Miami Beach neighborhood.

# Download tracts data for Miami and divide the city into two groups (by race or income) 
tracts18 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
          year = 2018, state=12, county=086, geometry=T, output = "wide", progress_bar=FALSE) %>%
  st_transform(crs=crs)  %>%
  rename(TotalPop = B01001_001E,
         NumberWhites = B01001A_001E,
         Median_Income = B06011_001E) %>% 
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(Median_Income > 29943, "High Income", "Low Income"))

# Only keep tracts that belong to one of the neighborhoods
tracts18 <- tracts18[nbhood, ]

# Mapping groups 
grid.arrange(ncol = 2,
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = raceContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
               labs(title = "Race Context") +
               mapTheme() + theme(legend.position="bottom"), 
             ggplot() + geom_sf(data = na.omit(tracts18), aes(fill = incomeContext)) +
               scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
               labs(title = "IncomeContext", caption="Figure 13") +
               mapTheme() + theme(legend.position="bottom"))

As can be seen from Table 4 and Table 5, our model has relatively similar MAPE values across groups (35% vs. 28% by race; 27% vs. 31% by income), indicating that it is reasonably generalizable across racial and income groups.

# Generalizability across race
st_join(miami_errors, tracts18) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  footnote(general_title = "\n",
           general = "Table 4")
Majority Non-White   Majority White
35%                  28%

Table 4
# Generalizability across income 
st_join(miami_errors, tracts18) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) %>% 
  footnote(general_title = "\n",
           general = "Table 5")
High Income   Low Income
27%           31%

Table 5

6 Discussion

The analysis reveals that our model is not fully effective in predicting house prices. On a deeper dive, we found that the model performs poorly for homes with sale prices greater than $2M, while doing quite well for the rest. These are the houses spatially located along the coastline and in the Miami Beach neighborhood. We did incorporate features like the distance to the coastline and presence in the Miami Beach neighborhood into the model, but it seems we were unable to capture other spatial processes affecting the prices of these homes. Although our model did not generalize well across different folds of the data, we suspect this may be because the fold-level errors are skewed by inaccurate predictions on high-priced homes. That said, the model did generalize quite well across racial and household income groups.

Some of the more interesting variables we investigated were the driving distance from a house to Miami's famous South Beach and the distance to nightlife amenities like pubs and nightclubs. One notable observation was that, in our model, the distance to the nearest police station is negatively correlated with the sale price of a house. Overall, the model captures roughly 84% of the variance in sale price (R2 of 0.843), with the more important features being the adjusted square footage of the house; the presence of a pool or dock; the distances to healthcare, public transport and supermarkets; the age of the house; and location in neighborhoods such as Miami Avenue and Highland Park.

7 Conclusion

Our model gives a respectable performance in predicting home sale prices. However, before recommending this model to Zillow, we need to work on capturing more of the spatial processes that may be at play in the prices of highly expensive homes, where our model currently falls short. In addition, further improvements could be made by incorporating census variables such as household income, the racial composition of a neighborhood, the share of residents with bachelor's degrees, and so on. During the analysis, we chose to drop features such as the presence of a tennis court or jacuzzi; more research is required to determine whether these features are really needed in the model, since they are important attributes of luxury homes.