1 Introduction

The Tax Repair Credit Program, introduced by the Department of Housing and Community Development (HCD) around 20 years ago, has played an integral role in subsidizing the improvement and repair of houses, which in turn benefits society by boosting the housing economy. The approach HCD has taken is to randomly reach out to eligible homeowners every year and encourage them to take the home repair tax credit. This approach has proven insufficient: only 11% of the eligible homeowners contacted take the credit. As a result, HCD may accrue high costs while delivering low benefits to society.

For this assignment we focus on reaching out to eligible homeowners in Emil City. We plan to apply a targeted approach, instead of a random one, to reach the homeowners who are most likely to take the credit. Given homeowner data collected from previous campaigns, we intend to model this data to support better decisions about allocating limited outreach resources to targeted homeowners. We will also conduct a cost/benefit analysis to see how well our model performs in reducing costs and increasing benefits. We hypothesize that our targeted approach will outperform the random approach in cost/benefit terms.

2 Setup

We start the analysis by loading the required libraries, helper functions, and the marketing campaign data.

options(scipen=10000000)

# Import libraries
library(tidyverse)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(scales)
library(kableExtra)
library(memisc)
library(stargazer)

# Get helper functions
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

# Set color palettes
palette5 <- c("#981FAC","#CB0F8B","#FF006A","#FE4C35","#FE9900")
palette4 <- c("#981FAC","#FF006A","#FE4C35","#FE9900")
palette2 <- c("#981FAC","#FF006A")

# Path for the data
data_path <- "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter6/housingSubsidy.csv"

# Read in the data
credit_df <- read.csv(data_path) %>% 
  dplyr::rename(credit = y, 
         credit_numeric = y_numeric) %>% 
  dplyr::select(-X)

3 Exploratory Data Analysis

Next, we perform Exploratory Data Analysis (EDA) on the various features provided in the data. This involves looking at the mean and density distributions of continuous features, and counts for categorical features. Before proceeding with the EDA, it is important to note that only 11% of the observations have a value of 1 (homeowners who took the credit), indicating an imbalance in the class labels.
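The imbalance can be confirmed with a quick tabulation (a minimal check using the renamed columns from Section 2):

# Share of homeowners who did/did not take the credit
credit_df %>%
  count(credit) %>%
  mutate(share = percent(n / sum(n)))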

3.1 Continuous Features

3.1.1 Mean

First, we look at differences in the means of the continuous features. From Figure 1 it can be seen that the features capturing the inflation rate (inflation_rate), the number of contacts before this campaign (previous), and the unemployment rate (unemploy_rate) show differences between homeowners who do and do not end up taking the credit. This behavior can be summarized as follows:

  • Homeowners tend to take the credit when the inflation rate is relatively low.

  • Homeowners who have been contacted more times before this campaign tend to take the credit.

  • Homeowners tend to take the credit when the unemployment rate is low.

# Continuous features 
credit_df %>%
  dplyr::select(credit, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
  gather(Variable, value, -credit) %>%
  ggplot(aes(credit, value, fill=credit)) + 
    geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
    facet_wrap(~Variable, scales = "free") +
    scale_fill_manual(values = palette2) +
    labs(x="Credit Acceptance", y="Mean", 
         title = "Feature associations with the likelihood of Credit Acceptance",
         subtitle = "(continous outcomes)",
         caption = "Figure 1: Mean value for continous features of homeowners") +
    plotTheme() + theme(legend.position = "none")

3.1.2 Density

Figure 1 compresses the behavioral differences between homeowners who do and do not take the credit into a single number (the mean), so for these features we also look at the density distributions. Figure 2 below strengthens our observations from the previous section, and also brings forth some new observations that can be exploited to differentiate between homeowners who would and would not take the credit. For example, looking at the feature capturing the amount of money a homeowner spends on repairs (spent_on_repairs), homeowners who take the credit generally tend to spend less than those who do not (a difference the mean hides, given the shape of the distribution). Similar observations can be made for some bucketed ranges of the age and consumer confidence index (cons.conf.idx) features.

# Density wise 
credit_df %>%
    dplyr::select(credit, age, previous, unemploy_rate, cons.price.idx, cons.conf.idx, inflation_rate, spent_on_repairs) %>%
    gather(Variable, value, -credit) %>%
    ggplot() + 
    geom_density(aes(value, color=credit), fill = "transparent") + 
    facet_wrap(~Variable, scales = "free") +
    scale_colour_manual(values = palette2) +  # colour (not fill) is the mapped aesthetic here
    labs(title = "Feature distributions Credit Acceptance vs. No Credit Acceptance",
         subtitle = "(continous outcomes)",
         caption="Figure 2: Density distribution for continous features of homeowners") +
    plotTheme()

3.2 Yes/No Features

Next, we look at two of the yes/no features in our data: whether the homeowner carries a mortgage (mortgage) and whether the homeowner's full-time residence is not in Philadelphia (taxbill_in_phl). Looking at Figure 3, it seems that homeowners who have a mortgage, or whose full-time residence is not in Philadelphia, tend not to take the credit; but given that the data contains considerably more homeowners who said no to the credit, it is hard to reason about homeowner behavior from this plot alone. More specifically, around 11% of homeowners with a mortgage, and 8% of homeowners whose residence is not in Philadelphia, end up taking the credit. These percentages are close to the overall acceptance rate in the data (~11%), making it difficult to draw conclusions about homeowner behavior from these features.
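As a sanity check on these percentages, the acceptance rate within each group can be computed directly (a small sketch using the columns above):

# Credit acceptance rate by mortgage and residence status
credit_df %>%
  dplyr::select(credit_numeric, mortgage, taxbill_in_phl) %>%
  gather(Variable, value, -credit_numeric) %>%
  group_by(Variable, value) %>%
  summarize(acceptance_rate = percent(mean(credit_numeric)), .groups = "drop")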

# Yes/No features 
credit_df %>%
  dplyr::select(credit, mortgage, taxbill_in_phl) %>%
  gather(Variable, value, -credit) %>%
  count(Variable, value, credit) %>%
  filter(value == "yes") %>%
    ggplot(aes(credit, n, fill = credit)) +   
      geom_bar(position = "dodge", stat="identity") +
      facet_wrap(~Variable, scales = "free", ncol=4) +
      scale_fill_manual(values = palette2) +
      labs(x="Credit Acceptance", y="Count",
           title = "Feature associations with the likelihood of Credit Acceptance",
           subtitle = "Two category features (Yes and No)",
           caption="Figure 3: Count of credit acceptance for homeowner who responded 'yes' to the shown categories") +
      plotTheme() + theme(legend.position = "none")

3.3 Categorical Features

Next, we look at the counts of credit acceptance across different categorical features. In Figure 4 and Figure 5, a common theme appears in all plots: the count for no credit acceptance is considerably higher than for yes, which reflects the class distribution of the original data. Nevertheless, some interesting takeaways from these plots are:

  • There is not much variation in credit acceptance across the weekday (day_of_week) on which the homeowner was last contacted.

  • If a homeowner accepted the credit during the previous marketing campaign (poutcome), it is likely that the homeowner will accept it this time as well (quantified in the quick check after this list).

  • A homeowner who has been contacted more than four times (campaign) is likely to respond no.
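To put a number on the poutcome observation, here is a quick check of acceptance rates by previous campaign outcome (a sketch, separate from the plotting pipeline below):

# Acceptance rate by outcome of the previous campaign
credit_df %>%
  group_by(poutcome) %>%
  summarize(acceptance_rate = percent(mean(credit_numeric)),
            homeowners = n())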

# Categorical features 
credit_df %>%
  dplyr::select(credit, marital, contact, day_of_week, poutcome, taxLien) %>%
  gather(Variable, value, -credit) %>%
  count(Variable, value, credit) %>%
  ggplot(aes(value, n, fill = credit)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Credit Acceptance", y="Count",
         title = "Feature associations with the likelihood of credit acceptance",
         caption="Figure 4: Counts for categorical features of the homeowner"
        ) +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Categorical features
credit_df %>%
  dplyr::select(credit, job, education, month, campaign, pdays) %>%
  gather(Variable, value, -credit) %>%
  count(Variable, value, credit) %>%
  ggplot(aes(value, n, fill = credit)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Credit Acceptance", y="Count",
         title = "Feature associations with the likelihood of credit acceptance",
         caption="Figure 5: Counts for categorical features of the homeowner"
        ) +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

4 Modeling Credit Acceptance

Following our EDA, which provided insights into the features that may help differentiate homeowners who would and would not take the credit, we now model a homeowner's credit acceptance as a function of these features.

4.1 Splitting the Data

We begin by splitting the given data into training and testing sets (65:35 ratio). Doing so allows us to quantify the performance of our models on unseen data and comment more reliably on their behavior.

# Set seed for reproducibility
set.seed(3456)

# Composite grouping so that rare levels of taxLien, education, and pdays
# appear in both the training and testing splits
groupings <- paste(credit_df$taxLien, credit_df$education, credit_df$pdays)

# Splitting the data into train and test (createDataPartition stratifies on y)
trainIndex <- createDataPartition(y = groupings, 
                                  p = .65,
                                  list = FALSE,
                                  times = 1)

creditTrain <- credit_df[ trainIndex,]
creditTest  <- credit_df[-trainIndex,]

4.2 Kitchen Sink Model

We start off by building a kitchen sink model, which uses all the features we have about the homeowners, as given, to model their credit acceptance behavior. This model gives a McFadden R-squared of 0.235 and a Sensitivity (True Positive Rate) of 0.172 on the test set (using a naive threshold of 0.5), meaning that of all the homeowners who actually took the credit, our model identifies only about 17%.

# Building the model
creditreg <- glm(credit_numeric ~ .,
                  data=creditTrain %>% dplyr::select(-credit),
                  family="binomial" (link="logit"))

# McFadden R Squared
# pR2(creditreg)[4]

# Predictions on test set
testProbs <- data.frame(Outcome = as.factor(creditTest$credit_numeric),
                        Probs = predict(creditreg, creditTest, type= "response"))

# Thresholding 
testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

# Confusion matrix
conf_mat <- caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")

#conf_mat$byClass['Sensitivity']

4.3 Feature Engineering

Given that our kitchen sink model does not do well at identifying homeowners who are likely to take the credit, we next engineer some features to improve upon it. On a broader level, we perform the following transformations based on our observations in Section 3 (EDA):

  • We bucket a number of the continuous features (campaign, pdays, age, inflation_rate, unemploy_rate, cons.conf.idx and spent_on_repairs) so that the model can capture the differences in credit acceptance behavior observed in the EDA.

  • We remove features (previous and day_of_week) for which there was little evidence of differentiation in credit acceptance behavior.

# Function to engineer new features 
get_feature_engineered_df <- function(df){
  
  df.fe <- df %>%
    mutate(campaign = case_when(
      campaign == 1 ~ "once",
      campaign == 2 ~ "twice",
      campaign == 3 ~ "thrice",
      campaign > 3 ~ "more than three times"
    )) %>%
    mutate(pdays = case_when(
      pdays >= 0 & pdays < 7 ~ "past week",
      pdays >= 7 & pdays < 14 ~ "1-2 weeks ago",
      pdays >= 14 & pdays < 21 ~ "2-3 weeks ago",
      pdays >= 21 & pdays < 999 ~ "more than 3 weeks ago",
      pdays == 999 ~ "never"
    )) %>%
    mutate(age = case_when(
      age <= 30 ~ "<=30",
      age >30 & age < 50 ~ "31-50",
      age >=50 ~ ">=50"
    )) %>% 
    mutate(
      inflation_rate = ifelse(inflation_rate < 1.25, "low_inflation", "high_inflation")
    ) %>% 
    mutate(
      unemploy_rate = ifelse(unemploy_rate < -0.5, "less unemp", "more unemp")
    ) %>% 
    mutate(cons.conf.idx = case_when(
      cons.conf.idx < -47.5 ~ 'bin1',
      cons.conf.idx >= -47.5 &  cons.conf.idx < -45~ 'bin2',
      cons.conf.idx >= -45 &  cons.conf.idx < -44~ 'bin3',
      cons.conf.idx >= -44 &  cons.conf.idx < -41~ 'bin4',
      cons.conf.idx >= -41 &  cons.conf.idx < -37.5~ 'bin5',
      cons.conf.idx >= -37.5 &  cons.conf.idx < -35 ~ 'bin6',
      cons.conf.idx >= -35 ~ 'bin7'
    )) %>% 
    mutate(
      spent_on_repairs = ifelse(spent_on_repairs < 5080, "low spend", "high spend")
    ) %>% 
    dplyr::select(-previous, -day_of_week )

  df.fe
}

4.4 Engineered Model

Keeping the data splits the same, we engineer the features as described above and repeat the process of model training and evaluation on the test split. Our engineered model gives a McFadden R-squared of 0.237 and a Sensitivity (True Positive Rate) of 0.234 on the test set (using a naive threshold of 0.5), an improvement of roughly 6 percentage points. We also note that the highly imbalanced distribution, with just 11% of homeowners in the data accepting the credit, makes the task of correctly identifying likely acceptors quite difficult.

Below we show the regression summaries for both the kitchen sink and the engineered models.

# Get engineered train and test splits 
creditTrain.fe <- get_feature_engineered_df(creditTrain)
creditTest.fe <- get_feature_engineered_df(creditTest)

# Fit logistic regression
creditreg.fe <- glm(credit_numeric ~ .,
                  data=creditTrain.fe %>% dplyr::select(-credit),
                  family="binomial" (link="logit"))

# Comparing regression summaries
stargazer(creditreg, creditreg.fe, type = "text",
          align=TRUE,
          no.space = TRUE,
          single.row = TRUE,
          column.labels=c("Kitchen Sink Model","Engineered Model"))
## 
## ==================================================================
##                                       Dependent variable:         
##                              -------------------------------------
##                                         credit_numeric            
##                              Kitchen Sink Model  Engineered Model 
##                                      (1)                (2)       
## ------------------------------------------------------------------
## age                             0.013 (0.008)                     
## age> =50                                          0.561** (0.271) 
## age31-50                                           0.227 (0.211)  
## jobblue-collar                -0.564** (0.275)    -0.505* (0.276) 
## jobentrepreneur                -0.877 (0.544)     -0.856 (0.543)  
## jobhousemaid                   -0.264 (0.473)     -0.288 (0.469)  
## jobmanagement                 -0.646** (0.309)   -0.641** (0.313) 
## jobretired                     -0.270 (0.367)     -0.281 (0.352)  
## jobself-employed               -0.849* (0.448)    -0.859* (0.448) 
## jobservices                    -0.404 (0.295)     -0.392 (0.297)  
## jobstudent                     -0.025 (0.404)     -0.104 (0.412)  
## jobtechnician                  -0.169 (0.234)     -0.145 (0.236)  
## jobunemployed                  -0.013 (0.398)     -0.073 (0.399)  
## jobunknown                     -0.897 (0.729)     -0.898 (0.734)  
## maritalmarried                 -0.083 (0.242)     -0.152 (0.239)  
## maritalsingle                   0.057 (0.277)      0.076 (0.271)  
## maritalunknown                  0.664 (1.111)      0.649 (1.110)  
## educationbasic.6y               0.472 (0.411)      0.458 (0.411)  
## educationbasic.9y               0.239 (0.342)      0.219 (0.340)  
## educationhigh.school            0.222 (0.324)      0.196 (0.321)  
## educationilliterate           -12.409 (535.411)  -12.076 (535.411)
## educationprofessional.course    0.285 (0.348)      0.224 (0.348)  
## educationuniversity.degree      0.222 (0.322)      0.204 (0.318)  
## educationunknown                0.360 (0.423)      0.325 (0.423)  
## taxLienunknown                  0.226 (0.210)      0.237 (0.211)  
## taxLienyes                    -10.289 (535.411)  -10.518 (535.411)
## mortgageunknown                -0.331 (0.574)     -0.267 (0.559)  
## mortgageyes                    -0.071 (0.144)     -0.052 (0.145)  
## taxbill_in_phlyes               0.052 (0.192)      0.070 (0.194)  
## contacttelephone              -1.022*** (0.305)  -0.858*** (0.285)
## monthaug                        0.026 (0.435)     -0.710 (0.505)  
## monthdec                        0.646 (0.709)      0.239 (0.750)  
## monthjul                       -0.305 (0.377)     -0.923* (0.543) 
## monthjun                       -0.150 (0.451)     -0.270 (0.513)  
## monthmar                      2.036*** (0.566)     0.315 (0.801)  
## monthmay                       -0.275 (0.309)     -0.423 (0.321)  
## monthnov                       -0.433 (0.428)     -0.397 (0.606)  
## monthoct                        0.268 (0.537)      0.167 (0.596)  
## monthsep                        0.205 (0.632)      0.056 (0.691)  
## day_of_weekmon                 -0.041 (0.227)                     
## day_of_weekthu                  0.023 (0.225)                     
## day_of_weektue                 -0.158 (0.232)                     
## day_of_weekwed                  0.281 (0.225)                     
## campaign                       -0.066* (0.040)                    
## pdays                          0.0004 (0.001)                     
## previous                        0.187 (0.192)                     
## campaignonce                                       0.184 (0.223)  
## campaignthrice                                     0.376 (0.271)  
## campaigntwice                                      0.154 (0.241)  
## pdays2-3 weeks ago                                -0.548 (0.930)  
## pdaysmore than 3 weeks ago                       13.465 (535.412) 
## pdaysnever                                         0.126 (0.828)  
## pdayspast week                                     0.393 (0.550)  
## poutcomenonexistent            0.670** (0.322)    0.438* (0.226)  
## poutcomesuccess               2.128*** (0.735)    1.485* (0.847)  
## unemploy_rate                 -1.247*** (0.483)                   
## unemploy_ratemore unemp                           -2.548* (1.420) 
## cons.price.idx                 1.963** (0.847)     0.522 (0.338)  
## cons.conf.idx                  0.055* (0.028)                     
## inflation_rate                 -0.040 (0.445)                     
## spent_on_repairs                0.009 (0.010)                     
## cons.conf.idxbin2                                 -1.570 (0.955)  
## cons.conf.idxbin4                                  0.246 (0.876)  
## cons.conf.idxbin5                                 1.375** (0.657) 
## cons.conf.idxbin6                                  0.012 (0.847)  
## cons.conf.idxbin7                                  1.369 (0.915)  
## inflation_ratelow_inflation                        0.057 (0.494)  
## spent_on_repairslow spend                         -1.368 (1.238)  
## Constant                     -233.126* (127.740) -49.593 (31.355) 
## ------------------------------------------------------------------
## Observations                        2,708              2,708      
## Log Likelihood                    -730.415           -728.422     
## Akaike Inf. Crit.                 1,564.829          1,570.844    
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01
# McFadden R-Squared 
# pR2(creditreg.fe)[4]

# Predictions on test set
testProbs.fe <- data.frame(Outcome = as.factor(creditTest.fe$credit_numeric),
                        Probs = predict(creditreg.fe, creditTest.fe, type= "response"))

# Thresholding 
testProbs.fe <- 
  testProbs.fe %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs.fe$Probs > 0.5 , 1, 0)))

# Confusion matrix
conf_mat.fe <- caret::confusionMatrix(testProbs.fe$predOutcome, testProbs.fe$Outcome, 
                       positive = "1")

# conf_mat.fe$byClass['Sensitivity']
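For a quick side-by-side view of the headline metric, the two confusion matrices computed above can be compared directly (a small sketch; tibble comes with the tidyverse):

# Sensitivity of the kitchen sink model vs. the engineered model
tibble(Model = c("Kitchen Sink", "Engineered"),
       Sensitivity = c(conf_mat$byClass["Sensitivity"],
                       conf_mat.fe$byClass["Sensitivity"]))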

4.5 Cross Validation

Next, we cross-validate both the kitchen sink and the engineered models using 100-fold cross-validation. This gives us a better idea of whether our models generalize to new data.

Note: The cross-validation below uses the default threshold value of 0.5.

4.5.1 Kitchen Sink Model

Looking at the distribution of the metric values around their means in Figure 6, the kitchen sink model generalizes well with respect to Specificity (True Negative Rate, identifying homeowners who would not take the credit) but less well with respect to Sensitivity (True Positive Rate, identifying homeowners who would potentially take the credit) and ROC.

# Relevel credit so "yes" comes first; caret's twoClassSummary treats the first factor level as the positive class
credit_df$credit <- factor(credit_df$credit, levels = c("yes", "no"))

# Perform Cross Validation
ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- caret::train(credit ~ ., data = credit_df %>% 
                                   dplyr::select(-credit_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

# Plotting CV
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 6: Distribution of ROC, Sencificity and Specitivity of the kitchen sink model across folds") +
    plotTheme()

4.5.2 Engineered Model

Turning to our engineered model, Figure 7 shows that it performs more or less like the kitchen sink model in terms of Specificity, but with somewhat more density around the mean values of Sensitivity and ROC than in Figure 6 (though the improvement is modest and the bounds around the mean are no tighter). This indicates that, although our engineered model demonstrated improved Sensitivity on the test set (Section 4.4), it still might not be an adequate model for targeting homeowners in the marketing campaign, at least at the current threshold of 0.5.

# Get the engineered feature on the complete data
credit_df.fe <- get_feature_engineered_df(credit_df)

# Perform Cross Validation
cvFit <- caret::train(credit ~ ., data = credit_df.fe %>% 
                                   dplyr::select(-credit_numeric), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

# Plotting CV
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines",
         caption = "Figure 7: Figure 6: Distribution of ROC, Sencificity and Specitivity of the engineered model across folds") +
    plotTheme()

4.6 ROC Curve

As seen in the section above, a threshold of 0.5 does not allow the model to achieve a particularly high Sensitivity, which it needs in order to target homeowners who are likely to take the credit. In Figure 8 below we plot the ROC curve which, in simple terms, shows the trade-off between the True Positive Rate (homeowners the model identifies as likely takers who actually take the credit) and the False Positive Rate (homeowners the model identifies as likely takers who do not take the credit), and allows us to choose an appropriate threshold based on the cost/benefit associated with this trade-off.

From Figure 8, it can also be seen that there are thresholds at which we can achieve a better True Positive Rate at little expense in the False Positive Rate. In the following section we choose this threshold by examining the business process associated with the marketing campaign. The ROC curve gives an AUC (Area Under the Curve) of 0.758, which indicates that the engineered model is a reasonably useful fit and clearly does better than randomly selecting homeowners to target in this campaign.

# ROC curve
ggplot(testProbs.fe, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Engineered Credit Model",
       caption = "Figure 8") + 
  plotTheme()

# AUC
# pROC::auc(testProbs.fe$Outcome, testProbs.fe$Probs)
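Because a single AUC estimate from one test split carries sampling noise, it can be worth attaching a confidence interval (a sketch using pROC, loaded in Section 2):

# 95% confidence interval for the AUC (DeLong method by default)
pROC::ci.auc(testProbs.fe$Outcome, testProbs.fe$Probs)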

5 Cost Benefit Analysis

Building on our engineered model, and carrying forward the discussion of how different thresholds trade off True Positives against False Positives on the ROC curve, we next examine the business process underlying the HCD marketing campaign and use it to optimize our model and find an appropriate threshold.

Before proceeding with the Cost/Benefit analysis, let us first look at the finances that are involved with this marketing campaign.

  1. Of all the homeowners that our model predicts as willing to take the credit, the HCD is willing to allocate marketing resources worth $2850 per homeowner (which includes expenses for staff, mailers, counseling sessions, etc.)

  2. On average, only 25% of the eligible homeowners contacted end up taking the home repair credit; the tax credit allocated per homeowner is $5,000.

  3. Homeowners who take the credit typically sell their house at an average premium of $10,000.

  4. The houses surrounding a repaired house generate an aggregate premium of $56,000 on average.

5.1 Mathematical Formulation

Next, we formulate this Cost/Benefit analysis in terms of the outcomes of the confusion matrix.

  1. True Positive (TP) Profit: These are the homeowners our model correctly predicts as likely to take the credit, so HCD allocates marketing resources to them. However, only 25% of these homeowners end up taking the credit and being subsidized by HCD for repairing and improving their house. In return, HCD can expect a benefit to the housing economy when these houses are sold at an average premium of $10,000, plus a longer-term benefit from the houses surrounding each repaired house, which see an aggregate premium of $56,000 on average. Hence the equation for the True Positive profit (evaluated per homeowner after this list) would be:

\[\begin{equation}\label{eq. 1} \begin{gathered} 0.25 * TP * (-2850-5000+10000+56000) + 0.75 * TP * (-2850) \end{gathered} \end{equation}\]

  2. True Negative (TN) Profit: These are the homeowners for whom we correctly predicted that they would not take the credit. HCD incurs no cost in this case, since it neither allocates marketing resources nor gives out credit to these homeowners. Hence, the True Negative profit is 0.

  3. False Positive (FP) Profit: These are the homeowners for whom we incorrectly predict that they would take the credit. HCD allocates marketing resources to these homeowners, but they do not take the credit, resulting in a loss of $2,850 per homeowner. The equation for the False Positive profit is:

\[\begin{equation}\label{eq. 2} \begin{gathered} FP * (-2850) \end{gathered} \end{equation}\]

  4. False Negative (FN) Profit: These are the homeowners we incorrectly predicted would not take the credit but who ended up doing so, perhaps for reasons unrelated to the marketing campaign. As given, for the purpose of this analysis we assume the False Negative profit is 0. This also makes intuitive sense: these homeowners were not influenced by the marketing campaign, so it would be incorrect to account for them in the Cost/Benefit analysis.
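To make the stakes concrete, the True Positive expression evaluates, per correctly targeted homeowner, to:

\[\begin{equation}\label{eq. 3} \begin{gathered} 0.25 * (-2850 - 5000 + 10000 + 56000) + 0.75 * (-2850) = 14537.50 - 2137.50 = 12400 \end{gathered} \end{equation}\]

That is, each True Positive is worth $12,400 in expected revenue, which is consistent with Table 1 below (34 True Positives × $12,400 = $421,600).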

5.2 Intermediate Result

Table 1 shows the Cost/Benefit analysis for the engineered model on the test set (using a threshold of 0.5). The True Positive count is 34, of which we expect around 8-9 homeowners (25%) to take the credit and generate the positive revenue. On the other hand, the False Positive count is 18; HCD incurs marketing costs reaching out to these homeowners but receives no benefit in return, hence the negative revenue. Using the equations from the section above with a threshold of 0.5, this model generates a total revenue of $375,160.

# Getting the cost benefit table
cost_benefit_table <-
   testProbs.fe %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ 0,
                         Variable == "True_Positive"  ~ Count * ((0.25*(-2850-5000+66000)) + (0.75*(-2850))),
                         Variable == "False_Negative" ~ 0,
                         Variable == "False_Positive" ~ Count * (-2580))
              ) %>%
    bind_cols(data.frame(Description = c(
              "Predicted correctly that these homeowners will not take the credit",
              "Predicted correctly that these homeowners would take the credit",
              "Predicted incorrectly that these homeowners will not take the credit",
              "Predicted incorrectly that these homeowners will take the credit")))

kable(cost_benefit_table) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "Table 1: Cost/Benefit table using a threshold of 0.5")
Variable Count Revenue Description
True_Negative 1248 0 Predicted correctly that these homeowners will not take the credit
True_Positive 34 421600 Predicted correctly that these homeowners would take the credit
False_Negative 111 0 Predicted incorrectly that these homeowners will not take the credit
False_Positive 18 -46440 Predicted incorrectly that these homeowners will take the credit

Table 1: Cost/Benefit table using a threshold of 0.5

5.3 Choosing the Optimal Threshold

Next, we look for thresholds that yield a better Cost/Benefit outcome. Figure 9 below shows how the revenue generated by each confusion matrix outcome varies as the model threshold moves from 0 to 1. The trends can be summarized as follows:

  • For True Positives: at a low threshold, most homeowners are predicted as willing to take the credit, many of whom actually would, generating the positive revenue seen. As the threshold increases from 0 to 1, the model becomes more and more conservative in predicting a homeowner as willing to take the credit, so the number of True Positives decreases and revenue falls.

  • For False Positives: at a low threshold, most homeowners are predicted as willing to take the credit, including many who had no intention of doing so. Given this large number of incorrect positive predictions, a lot of money is lost on marketing allocation with no benefit in return. As the threshold increases, the model gets more conservative and avoids more False Positives, reducing the negative revenue.

  • False Negatives and True Negatives do not generate any revenue.

  • Given the above points, we need to find the threshold sweet spot that maximizes overall revenue.

# Iterating Thresholds
iterateThresholds <- function(data, observedClass, predictedProbs, group) {
# This function takes as inputs a data frame with an observed binomial class
# (1 or 0), a vector of predicted probabilities, and optionally a group
# indicator. For each threshold from 0.01 to 1 it returns accuracy plus the
# counts and rates of the confusion matrix outcomes. The if (missing(group))
# branches handle the optional group parameter.
  observedClass <- enquo(observedClass)
  predictedProbs <- enquo(predictedProbs)
  group <- enquo(group)
  x = .01
  all_prediction <- data.frame()
  
  if (missing(group)) {
  
    while (x <= 1) {
    this_prediction <- data.frame()
    
    this_prediction <-
      data %>%
      mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
      count(predclass, !!observedClass) %>%
      summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
                Count_TP = sum(n[predclass==1 & !!observedClass==1]),
                Count_FN = sum(n[predclass==0 & !!observedClass==1]),
                Count_FP = sum(n[predclass==1 & !!observedClass==0]),
                Rate_TP = Count_TP / (Count_TP + Count_FN),
                Rate_FP = Count_FP / (Count_FP + Count_TN),
                Rate_FN = Count_FN / (Count_FN + Count_TP),
                Rate_TN = Count_TN / (Count_TN + Count_FP),
                Accuracy = (Count_TP + Count_TN) / 
                           (Count_TP + Count_TN + Count_FN + Count_FP)) %>%
      mutate(Threshold = round(x,2))
    
    all_prediction <- rbind(all_prediction,this_prediction)
    x <- x + .01
  }
  return(all_prediction)
  }
  else if (!missing(group)) { 
   while (x <= 1) {
    this_prediction <- data.frame()
    
    this_prediction <-
      data %>%
      mutate(predclass = ifelse(!!predictedProbs > x, 1,0)) %>%
      group_by(!!group) %>%
      count(predclass, !!observedClass) %>%
      summarize(Count_TN = sum(n[predclass==0 & !!observedClass==0]),
                Count_TP = sum(n[predclass==1 & !!observedClass==1]),
                Count_FN = sum(n[predclass==0 & !!observedClass==1]),
                Count_FP = sum(n[predclass==1 & !!observedClass==0]),
                Rate_TP = Count_TP / (Count_TP + Count_FN),
                Rate_FP = Count_FP / (Count_FP + Count_TN),
                Rate_FN = Count_FN / (Count_FN + Count_TP),
                Rate_TN = Count_TN / (Count_TN + Count_FP),
                Accuracy = (Count_TP + Count_TN) / 
                           (Count_TP + Count_TN + Count_FN + Count_FP)) %>%
      mutate(Threshold = round(x,2))
    
    all_prediction <- rbind(all_prediction,this_prediction)
    x <- x + .01
  }
  return(all_prediction)
  }
}

# Iterating through thresholds and getting confusion metrics
whichThreshold <- 
  iterateThresholds(
     data=testProbs.fe, observedClass = Outcome, predictedProbs = Probs)

# Calculating revenue associated with each confusion metric
whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Revenue =
                 case_when(Variable == "Count_TN"  ~ 0,
                           Variable == "Count_TP"  ~ Count * ((0.25*(-2850-5000+66000)) + (0.75*(-2850))),
                           Variable == "Count_FN"  ~ 0,
                           Variable == "Count_FP"  ~ Count * (-2580))
    )

# Plotting
whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Profit by confusion matrix type and threshold",
       y = "Profit",
       caption = "Figure 9") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

In Figure 10 we observe the patterns discussed above. As we increase the threshold, the model becomes stricter about whom to contact and offer the credit, so we lose potential homeowners who might have taken it, hence the decreasing trend in the count-of-credits plot. In the total revenue plot, revenue is negative at very low thresholds (the effect of False Positives overpowers that of True Positives), increases up to a threshold of 0.18 (the mix of True and False Positives that generates maximum revenue), and decreases thereafter as the model gets stricter about predicting positives (both True and False).

Note: For the count of credits we do not include homeowners predicted as False Negatives. Since we did not count any benefit from these homeowners in the equations defined above, it would be incorrect to include them in a Cost/Benefit analysis of the marketing campaign, as they were likely not influenced by it.

# Calculating Total Revenue and Credits
whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(Credits = ifelse(Variable == "Count_TP", Count * .25, 0)) %>%  # only 25% of targeted True Positives take the credit
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_Credits = round(sum(Credits))
              ) 
# Plotting
whichThreshold_revenue %>%
  dplyr::select(Threshold, Total_Revenue, Total_Count_Credits) %>%
  gather(Variable, Value, -Threshold) %>%
  ggplot(aes(Threshold, Value) ) +
  geom_point(color=palette2[2]) +
  geom_vline(xintercept = pull(arrange(whichThreshold_revenue, -Total_Revenue)[1,1])) + 
  facet_wrap(~Variable, scales = "free")+
  plotTheme() +
  labs(title = "Threshold as a function of Revenue and Credits",
       subtitle = "Total count credits are rounded to nearest interger value",
       caption = "Figure 10")

5.4 Result

From Table 2 we can observe that the total revenue generated by the optimal-threshold model is almost twice that generated by the 50%-threshold model. In addition, the optimal-threshold model correctly targets more homeowners who end up taking the credit than the 50%-threshold model does.

# Threshold that maximizes total revenue (pull() extracts a plain value rather than a 1x1 tibble)
selected_threshold <- pull(arrange(whichThreshold_revenue, -Total_Revenue)[1, 1])

# Comparing optimal and 50% threshold
whichThreshold_revenue %>% 
  filter(Threshold %in% c(0.5, selected_threshold)) %>% 
  mutate(Threshold = ifelse(Threshold == 0.5, "50% Threshold", "Optimal_Threshold")) %>% 
  kable() %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "Table 2: Cost/Benefit table comparing optimal and 50% threshold")
Threshold Total_Revenue Total_Count_Credits
Optimal_Threshold 709200 20
50% Threshold 375160 8

Table 2: Cost/Benefit table comparing optimal and 50% threshold

6 Conclusion

From the above analysis, we believe the model developed here is a good first step toward a targeted marketing campaign, compared with the previous random approach. By engineering new features and carefully choosing the threshold, it is capable of generating a gross positive revenue of $709,200. However, we should note that the Cost/Benefit analysis of this campaign does not take into account non-monetary and hard-to-quantify benefits, such as homeowner satisfaction and healthier living conditions. At the same time, we need to make sure that the model does not disproportionately affect low-income and minority racial groups.

A number of things can be done to improve this model. First, we need to make sure the model generalizes better on the Sensitivity metric. This could be done either by adding more relevant features, such as the homeowner's income or gender, or by augmenting the data (upsampling) to balance the proportions of homeowners who do and do not accept the credit, as sketched below. Second, we can further validate the model with a domain expert who can introduce more context into it.
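As an illustrative sketch of the upsampling idea (using caret's upSample on the objects from Section 4; a starting point rather than part of the reported results):

# Balance the engineered training set by resampling the minority class
set.seed(3456)
upTrain <- caret::upSample(
  x = creditTrain.fe %>% dplyr::select(-credit, -credit_numeric),
  y = factor(creditTrain.fe$credit_numeric),
  yname = "credit_numeric")

# Refit the engineered model; glm treats the second factor level ("1") as the positive class
creditreg.up <- glm(credit_numeric ~ ., data = upTrain, family = "binomial")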

Finally, we feel that after this marketing campaign ends, it is important to reach out to homeowners who were contacted but did not take the credit, whether the 75% of True Positives who declined or the False Positives, to understand their reasons for not opting in, and to use this intelligence (in the predictive model or in general) to improve response rates in future campaigns.