[return to overview page]

In this section, we will generate our third and last hybrid human-computer model: a neural network. The general training-testing and model-building procedure will be exactly the same as for our earlier neural network model, again with the one exception that our hybrid model will also take human predictions into account as a feature. The performance of this model will be compared to the performance of a non-hybrid model (which uses only textual features) trained on the same statements.

As with our original neural network models, our new hybrid and non-hybrid neural network models will be "feed-forward" neural networks with a single hidden layer.

Packages

Again, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization
library(caret) # modeling
library(nnet) # for neural networks specifically (caret "nnet" wraps these functions)

Load Data

Next, I will load the data file we created earlier, which has both the cleaned and processed textual features and the human predictions. Note, again, that this data file consists of a total of 3,663 statements.

# load df of combined human predictions, processed textual features, and ground truth
load("stats_combo.Rda")

# For rendering, I'm going to cheat here and load results created when this model was first run
# For some reason, chunks that were supposed to be cached when originally run are rerunning
load("results_HYB_neural.Rda")
# change the specific names (renamed at end), back to generic name
results_HYB_neural -> results

# print
stats_combo
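Before moving on, here is a quick structural check (a small sketch; it assumes the column names used throughout this analysis, such as grd_truth and predict) to confirm the number of statements and that both the ground truth and the human prediction columns are present.

# sketch: confirm dimensions and key columns of the combined data frame
dim(stats_combo)              # expected: 3663 rows
table(stats_combo$grd_truth)  # counts of 'lie' vs 'truth' statements
table(stats_combo$predict)    # counts of human 'lie' vs 'truth' guesses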

EXAMPLE (Neural Network)

Let’s begin with a single example. As with the earlier two hybrid model analyses, this will entail training and testing two neural network models: one hybrid model that takes into account both human predictions and textual features, and one non-hybrid model that only takes into account textual features.

Split Sample Into Training and Testing Sets

As before, we are going to conduct an exact 50-50 split, randomly allocating one half of the statements to the training set and the other half to the testing set (using the createDataPartition function in the caret package; Kuhn, 2008; Kuhn, 2019). The exact same training set will be used to build the hybrid and non-hybrid models, both of which will then also be tested on the exact same testing set.

# set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# partition data in 50-50 lgocv split (create index for test set)
index_train_ex <- 
  createDataPartition(y = stats_combo$stat_id,
                      p = 0.50,
                      times = 1,
                      list = FALSE)

# actually create data frame with training set (predictors and outcome together)
train_set_ex <- stats_combo[index_train_ex, ]

# actually create data frame with test set (predictors and outcome together)
test_set_ex <- stats_combo[-index_train_ex, ]

Build Model (on Training Set)

Now that the data are split, we can fit neural network models to the training data. Again, we will fit two neural network models: one, a hybrid model that uses both textual features and human guesses as predictors; the other, a non-hybrid model that uses only textual features as predictors.

As with our earlier neural network models, and unlike with our logistic regression models, this type of model has two tuning parameters that we need to set: (1) the number of nodes in the hidden layer and (2) the value of the “weight decay” parameter (for more discussion of these, see the introduction and “Build Model (on Training Set)” section of the original neural networks page).

As before, I will cycle through the following possible values for the number of nodes in the hidden layer: 1, 2, 3, 4, 5, and 10. And I will cycle through the following possible decay rate values: 0, 0.05, 0.1, 1, and 2. This leads to a total of 30 parameter value combinations to consider (6 candidate hidden layer sizes times 5 candidate decay rates). As usual, parameter values will be selected based on which combination leads to the highest overall accuracy across 3 repeated 50-50 training-testing splits within the training set (for more discussion of this sub-training, see the “Build Model (on Training Set)” section of the basic SVM page).

When we go through this process, we end up with two trained models with the following tuning parameter values:

Model Type    Nodes    Weight Decay
Non-Hybrid    2        2
Hybrid        10       2

Although the number of nodes in the hybrid and non-hybrid models differs considerably (10 vs. 2), meaning the models differ in ways other than just the set of predictors they incorporate (introducing what might be considered a confound), the analysis can be thought of as comparing the best non-hybrid neural network model we can construct to the best hybrid neural network model we can construct. (It would nevertheless be useful to also examine the performance of these two model types when the tuning parameters are held constant; a sketch of how this might be done is shown below.)
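Here is a rough sketch of that follow-up check (not run as part of this analysis; it assumes the train_set_ex object created above, and the fixed values size = 2 and decay = 2 are chosen purely for illustration). Both models are refit with a single, shared tuning parameter combination, so that the only remaining difference between them is whether the human "predict" column is included as a predictor.

# sketch: refit both models with tuning parameters held constant
tune_grid_fixed <- expand.grid(size = 2, decay = 2)

model_hybrid_fixed <-
  train(form = grd_truth ~ . - stat_id,
        data = train_set_ex,
        method = "nnet",
        tuneGrid = tune_grid_fixed,
        trControl = trainControl(method = "LGOCV", number = 3, p = 0.50),
        metric = "Accuracy",
        trace = FALSE,
        maxit = 100)

model_non_hybrid_fixed <-
  train(form = grd_truth ~ . - stat_id - predict,
        data = train_set_ex,
        method = "nnet",
        tuneGrid = tune_grid_fixed,
        trControl = trainControl(method = "LGOCV", number = 3, p = 0.50),
        metric = "Accuracy",
        trace = FALSE,
        maxit = 100)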

# note: these setting chunks are separated for reuse later

# set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# -----------------------------------------------------------------------------
# STEP 1: SELECT TUNING PARAMETERS

# part a: set range of tuning parameters (layer size and weight decay)
tune_grid_neural <- expand.grid(size = c(1:5, 10),
                                decay = c(0, 0.05, 0.1, 1, 2))


# part b: set some other constraints to be imposed on the network (to keep computation manageable)
# see: p. 361 of Kuhn & Johnson (2013)
max_size_neural <- max(tune_grid_neural$size)
max_weights_neural <- max_size_neural*(nrow(train_set_ex) + 1) + max_size_neural + 1

# -----------------------------------------------------------------------------
# STEP 2: SELECT TUNING METHOD
# set up train control object, which specifies training/testing technique
train_control_neural <- trainControl(method = "LGOCV",
                                     number = 3,
                                     p = 0.50)
# set seed, so that statistics don't keep changing for every analysis
# (applies for models which might have random parameters)
set.seed(2019)

# start timer
start_time <- Sys.time()

# -----------------------------------------------------------------------------
# STEP 3: TRAIN MODEL

# use caret "train" function to train hybrid and non-hybrid neural network models
model_hybrid <- 
   train(form = grd_truth ~ . - stat_id,
        data = train_set_ex,
        method = "nnet",
        tuneGrid = tune_grid_neural,
        trControl = train_control_neural,
        metric = "Accuracy", # how to select among models
        trace = FALSE,
        maxit = 100,
        MaxNWts = max_weights_neural) # don't print output along the way

model_non_hybrid <- 
   train(form = grd_truth ~ . -stat_id -predict,
        data = train_set_ex,
        method = "nnet",
        tuneGrid = tune_grid_neural,
        trControl = train_control_neural,
        metric = "Accuracy", # how to select among models
        trace = FALSE,
        maxit = 100,
        MaxNWts = max_weights_neural) # don't print output along the way

# end timer
total_time <- Sys.time() - start_time
# examine tuning parameters selected for non-hybrid model
model_non_hybrid
## Neural Network 
## 
## 1832 samples
##   92 predictor
##    2 classes: 'lie', 'truth' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (3 reps, 50%) 
## Summary of sample sizes: 916, 916, 916 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa     
##    1    0.00   0.5691412  0.13828239
##    1    0.05   0.5553130  0.11062591
##    1    0.10   0.5454876  0.09097525
##    1    1.00   0.5742358  0.14847162
##    1    2.00   0.5793304  0.15866084
##    2    0.00   0.5440320  0.08806405
##    2    0.05   0.5800582  0.16011645
##    2    0.10   0.5724163  0.14483261
##    2    1.00   0.5738719  0.14774381
##    2    2.00   0.5804221  0.16084425
##    3    0.00   0.5629549  0.12590975
##    3    0.05   0.5331150  0.06622999
##    3    0.10   0.5443959  0.08879185
##    3    1.00   0.5509461  0.10189229
##    3    2.00   0.5727802  0.14556041
##    4    0.00   0.5564047  0.11280932
##    4    0.05   0.5578603  0.11572052
##    4    0.10   0.5433042  0.08660844
##    4    1.00   0.5382096  0.07641921
##    4    2.00   0.5534934  0.10698690
##    5    0.00   0.5633188  0.12663755
##    5    0.05   0.5393013  0.07860262
##    5    0.10   0.5516739  0.10334789
##    5    1.00   0.5618632  0.12372635
##    5    2.00   0.5582242  0.11644833
##   10    0.00   0.5600437  0.12008734
##   10    0.05   0.5680495  0.13609898
##   10    0.10   0.5640466  0.12809316
##   10    1.00   0.5655022  0.13100437
##   10    2.00   0.5764192  0.15283843
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 2 and decay = 2.
# examine tuning parameters selected for hybrid model
model_hybrid
## Neural Network 
## 
## 1832 samples
##   92 predictor
##    2 classes: 'lie', 'truth' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (3 reps, 50%) 
## Summary of sample sizes: 916, 916, 916 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa     
##    1    0.00   0.5669578  0.13391557
##    1    0.05   0.5611354  0.12227074
##    1    0.10   0.5614993  0.12299854
##    1    1.00   0.5658661  0.13173217
##    1    2.00   0.5735080  0.14701601
##    2    0.00   0.5582242  0.11644833
##    2    0.05   0.5564047  0.11280932
##    2    0.10   0.5564047  0.11280932
##    2    1.00   0.5553130  0.11062591
##    2    2.00   0.5735080  0.14701601
##    3    0.00   0.5425764  0.08515284
##    3    0.05   0.5378457  0.07569141
##    3    0.10   0.5473071  0.09461426
##    3    1.00   0.5662300  0.13245997
##    3    2.00   0.5571325  0.11426492
##    4    0.00   0.5443959  0.08879185
##    4    0.05   0.5473071  0.09461426
##    4    0.10   0.5534934  0.10698690
##    4    1.00   0.5625910  0.12518195
##    4    2.00   0.5593159  0.11863173
##    5    0.00   0.5516739  0.10334789
##    5    0.05   0.5520378  0.10407569
##    5    0.10   0.5527656  0.10553130
##    5    1.00   0.5469432  0.09388646
##    5    2.00   0.5760553  0.15211063
##   10    0.00   0.5393013  0.07860262
##   10    0.05   0.5520378  0.10407569
##   10    0.10   0.5614993  0.12299854
##   10    1.00   0.5695051  0.13901019
##   10    2.00   0.5775109  0.15502183
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 10 and decay = 2.
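For convenience, the selected tuning parameters summarized in the table above can also be pulled directly from each fitted caret object via its bestTune element (a small sketch).

# sketch: extract the selected tuning parameters from each trained model
model_non_hybrid$bestTune   # expected: size = 2, decay = 2
model_hybrid$bestTune       # expected: size = 10, decay = 2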

Evaluate Model (on Testing Set)

Now, let’s examine and compare the performance of the two models. As before, we will do this by using the models to generate predictions on the remaining ~50% of the data (the 1,831 statements in the testing set). As usual, performance statistics are derived using the confusionMatrix function from the caret package (Kuhn, 2008).

When we do this, we see that both models have an overall accuracy level that is again above chance:

  • hybrid model: 59.1% [95% CI: 56.8, 61.4%]
  • non-hybrid model: 62.3% [95% CI: 60.1, 64.5%]

Unlike with our hybrid and non-hybrid logistic regression models and our hybrid and non-hybrid support vector machine models, the overall accuracy of the hybrid model is worse than the overall accuracy of the non-hybrid model. Worse yet, this is the only time that the difference is significant, chi-squared = 3.851, p = 0.0497 (although just barely, suggesting the result might not prove robust). This is especially surprising to me. My best guess is that this result reflects random variation between the models, exacerbated by the additional noise we introduced by allowing the models to have different tuning parameters (i.e. 10 nodes for the hybrid model and 2 for the non-hybrid model).

Another surprise comes when comparing sensitivity scores between the models. The non-hybrid model has a significantly higher sensitivity score (71.9%) than the hybrid model does (59.5%), chi-squared = 30.956, p < 0.001. This result is particularly puzzling to me because, as we saw earlier, the one performance area in which humans particularly excelled was sensitivity (correctly identifying truths as truths). So, if the sensitivity scores were to differ between models, I would expect the hybrid model to perform better on this metric (although perhaps that is an overly simplistic expectation about how human predictions would be incorporated by the hybrid model).

On the other hand, there is some evidence that the hybrid model has higher specificity (58.7%) than the non-hybrid model (52.7%), chi-squared = 6.452, p = 0.011.

Precision scores are within a percentage point of each other (hybrid = 59.2% vs. non-hybrid = 60.3%), a non-significant difference, chi-squared = 0.304, p = 0.582. Negative predictive value, however, is higher for the non-hybrid model (65.3%) than for the hybrid model (59.2%), a significant difference, chi-squared = 6.149, p = 0.013. I’m not quite sure what to make of that result.

# generate predictions for hybrid model 
preds_hybrid <-
  predict(object = model_hybrid,
                 newdata = test_set_ex,
                 type = "raw")

preds_non_hybrid <-
  predict(object = model_non_hybrid,
                 newdata = test_set_ex,
                 type = "raw")
# examine hybrid model performance
conf_hybrid <-
  confusionMatrix(data = preds_hybrid,
                  reference = test_set_ex$grd_truth,
                  positive = "truth")

# print
conf_hybrid
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction lie truth
##      lie   538   371
##      truth 378   544
##                                          
##                Accuracy : 0.5909         
##                  95% CI : (0.568, 0.6136)
##     No Information Rate : 0.5003         
##     P-Value [Acc > NIR] : 4.361e-15      
##                                          
##                   Kappa : 0.1819         
##  Mcnemar's Test P-Value : 0.8265         
##                                          
##             Sensitivity : 0.5945         
##             Specificity : 0.5873         
##          Pos Pred Value : 0.5900         
##          Neg Pred Value : 0.5919         
##              Prevalence : 0.4997         
##          Detection Rate : 0.2971         
##    Detection Prevalence : 0.5035         
##       Balanced Accuracy : 0.5909         
##                                          
##        'Positive' Class : truth          
## 
# examine non-hybrid model performance
conf_non_hybrid <-
  confusionMatrix(data = preds_non_hybrid,
                  reference = test_set_ex$grd_truth,
                  positive = "truth")

# print
conf_non_hybrid
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction lie truth
##      lie   483   257
##      truth 433   658
##                                           
##                Accuracy : 0.6232          
##                  95% CI : (0.6005, 0.6454)
##     No Information Rate : 0.5003          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2464          
##  Mcnemar's Test P-Value : 2.699e-11       
##                                           
##             Sensitivity : 0.7191          
##             Specificity : 0.5273          
##          Pos Pred Value : 0.6031          
##          Neg Pred Value : 0.6527          
##              Prevalence : 0.4997          
##          Detection Rate : 0.3594          
##    Detection Prevalence : 0.5958          
##       Balanced Accuracy : 0.6232          
##                                           
##        'Positive' Class : truth           
## 
# calculate values relevant for test (and some extras, in case I want to compare sensitivity, etc.)
hybrid_truth_correct = conf_hybrid$table[4]
hybrid_truth_total = conf_hybrid$table[3] + conf_hybrid$table[4]
hybrid_lie_correct = conf_hybrid$table[1]
hybrid_lie_total = conf_hybrid$table[1] + conf_hybrid$table[2]
hybrid_total_correct = hybrid_truth_correct + hybrid_lie_correct
hybrid_total_truth_guesses = conf_hybrid$table[2] + conf_hybrid$table[4]
hybrid_total_lie_guesses = conf_hybrid$table[1] + conf_hybrid$table[3]
hybrid_total = sum(conf_hybrid$table)

non_hybrid_truth_correct = conf_non_hybrid$table[4]
non_hybrid_truth_total = conf_non_hybrid$table[3] + conf_non_hybrid$table[4]
non_hybrid_lie_correct = conf_non_hybrid$table[1]
non_hybrid_lie_total = conf_non_hybrid$table[1] + conf_non_hybrid$table[2]
non_hybrid_total_correct = non_hybrid_truth_correct + non_hybrid_lie_correct
non_hybrid_total_truth_guesses = conf_non_hybrid$table[2] + conf_non_hybrid$table[4]
non_hybrid_total_lie_guesses = conf_non_hybrid$table[1] + conf_non_hybrid$table[3]
non_hybrid_total = sum(conf_non_hybrid$table)
# conduct actual test (chi-square on overall accuracy)
prop.test(x = c(hybrid_total_correct, non_hybrid_total_correct),
          n = c(hybrid_total, non_hybrid_total),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_total_correct, non_hybrid_total_correct) out of c(hybrid_total, non_hybrid_total)
## X-squared = 3.851, df = 1, p-value = 0.04972
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -6.438915e-02 -5.650366e-05
## sample estimates:
##    prop 1    prop 2 
## 0.5909339 0.6231567
# examine z value, using two proportion z-test just for comparison

# function to compute z-value
# from: https://www.r-bloggers.com/comparison-of-two-proportions-parametric-z-test-and-non-parametric-chi-squared-methods/
z.prop = function(x1,x2,n1,n2){
  # compute z-value 
  numerator = (x1/n1) - (x2/n2)
  p.common = (x1+x2) / (n1+n2)
  denominator = sqrt(p.common * (1-p.common) * (1/n1 + 1/n2))
  z.prop.ris = numerator / denominator
  
  # compute p-value
  p_value <- 2*pnorm(-abs(z.prop.ris))
  
  # combine results in string vector
  test_result <- c(paste("z=", round(z.prop.ris, 3),
                         ", p=", round(p_value, 3),
                         sep = ""))
  
  # return z and p-value
  return(test_result)
}

# get actual two proportion z-test results
z.prop(x1 = hybrid_total_correct,
         x2 = non_hybrid_total_correct,
         n1 = hybrid_total, 
         n2 = non_hybrid_total)
## [1] "z=-1.996, p=0.046"
# chi-square on sensitivity scores
prop.test(x = c(hybrid_truth_correct, non_hybrid_truth_correct),
          n = c(hybrid_truth_total, non_hybrid_truth_total),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_truth_correct, non_hybrid_truth_correct) out of c(hybrid_truth_total, non_hybrid_truth_total)
## X-squared = 30.956, df = 1, p-value = 2.64e-08
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.16881137 -0.08036896
## sample estimates:
##    prop 1    prop 2 
## 0.5945355 0.7191257
# chi-square on specificity scores
prop.test(x = c(hybrid_lie_correct, non_hybrid_lie_correct),
          n = c(hybrid_lie_total, non_hybrid_lie_total),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_lie_correct, non_hybrid_lie_correct) out of c(hybrid_lie_total, non_hybrid_lie_total)
## X-squared = 6.4516, df = 1, p-value = 0.01109
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.01354542 0.10654191
## sample estimates:
##    prop 1    prop 2 
## 0.5873362 0.5272926
# chi-square on precision scores
prop.test(x = c(hybrid_truth_correct, non_hybrid_truth_correct),
          n = c(hybrid_total_truth_guesses, non_hybrid_total_truth_guesses),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_truth_correct, non_hybrid_truth_correct) out of c(hybrid_total_truth_guesses, non_hybrid_total_truth_guesses)
## X-squared = 0.30382, df = 1, p-value = 0.5815
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.05711470  0.03092527
## sample estimates:
##    prop 1    prop 2 
## 0.5900217 0.6031164
# chi-square on negative predictive value
prop.test(x = c(hybrid_lie_correct, non_hybrid_lie_correct),
          n = c(hybrid_total_lie_guesses, non_hybrid_total_lie_guesses),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_lie_correct, non_hybrid_lie_correct) out of c(hybrid_total_lie_guesses, non_hybrid_total_lie_guesses)
## X-squared = 6.1487, df = 1, p-value = 0.01315
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.10894770 -0.01273933
## sample estimates:
##    prop 1    prop 2 
## 0.5918592 0.6527027

Change in predictions

As we did before, let’s try to get some insight into how incorporating human predictions changed the predictions made by the hybrid model (as compared to the predictions made by the non-hybrid model).

To do this, I have tabulated the total number of times each of the eight possible changes in predictions have occurred, when the non-hybrid model’s predictions are compared to the hybrid model’s predictions. As a reminder, these eight possible changes are:

  • nothing changes
    • 1: non-hybrid prediction of true positive stays true positive prediction in hybrid model
    • 2: true negative prediction in non-hybrid model stays true negative prediction in hybrid model
    • 3: false positive prediction in non-hybrid model stays false positive prediction in hybrid model
    • 4: false negative prediction in non-hybrid model stays false negative prediction in hybrid model
  • correct prediction changed to incorrect prediction
    • 5: true positive prediction in non-hybrid model changes to false positive prediction in hybrid model
    • 6: true negative prediction in non-hybrid model changes to false negative prediction in hybrid model
  • incorrect prediction changed to correct prediction
    • 7: false positive prediction in non-hybrid model changes to true negative prediction in hybrid model
    • 8: false negative prediction in non-hybrid model changes to true positive prediction in hybrid model

(Again, see the “Change in predictions” section in the earlier hybrid logistic regression analysis.)

As with the other two types of hybrid models (logistic regression and svm), although to a notably lesser extent, most of the predictions stayed the same between the non-hybrid and hybrid models (1,246 of 1,831 predictions, 68.1%, were the same; a number that hovered above 90% in both the logistic regression and svm analyses). Unlike before, when moving from the non-hybrid to the hybrid model, the number of predictions that went from incorrect to correct (251 cases, 13.7%, were either false negatives that became true positives or false positives that became true negatives) was lower than the number of predictions that went from correct to incorrect (313 cases, 17.1%, were either true positives that became false negatives or true negatives that became false positives).

# combine: grd_truth, non-hybrid predictions, hybrid predictions
preds_comp <-
  cbind(as.data.frame(test_set_ex$grd_truth),
        as.data.frame(preds_non_hybrid),
        as.data.frame(preds_hybrid))

# label columns
colnames(preds_comp) <- c("grd_truth", "non_hyb", "hyb")

# print
# preds_comp

# generate additional information
preds_tally <-
  preds_comp %>%
  group_by(grd_truth, non_hyb, hyb) %>%
  summarize(n = n()) %>%
  arrange(desc(grd_truth), desc(non_hyb), desc(hyb)) %>%
  mutate(change_present = case_when(non_hyb == hyb ~ "no",
                            non_hyb != hyb ~ "yes")) %>%
  mutate(change_type = case_when(grd_truth == "truth" & non_hyb == "truth" & hyb == "truth" ~ "TP -> TP",
                            grd_truth == "truth" & non_hyb == "truth" & hyb == "lie" ~ "TP -> FN",
                            grd_truth == "truth" & non_hyb == "lie" & hyb == "truth" ~ "FN -> TP",
                            grd_truth == "truth" & non_hyb == "lie" & hyb == "lie" ~ "FN -> FN",
                            grd_truth == "lie" & non_hyb == "truth" & hyb == "truth" ~ "FP -> FP",
                            grd_truth == "lie" & non_hyb == "truth" & hyb == "lie" ~ "FP -> TN",
                            grd_truth == "lie" & non_hyb == "lie" & hyb == "truth" ~ "TN -> FP",
                            grd_truth == "lie" & non_hyb == "lie" & hyb == "lie" ~ "TN -> TN")) %>%
  mutate(change_effect = case_when(change_type == "TP -> TP" ~ "nothing",
                                   change_type == "TN -> TN" ~ "nothing",
                                   change_type == "FP -> FP" ~ "nothing",
                                   change_type == "FN -> FN" ~ "nothing",
                                   change_type == "FN -> TP" ~ "gain",
                                   change_type == "FP -> TN" ~ "gain",
                                   change_type == "TP -> FN" ~ "loss",
                                   change_type == "TN -> FP" ~ "loss")) %>%
  arrange(change_effect)

# print
preds_tally
# tally the total number of times each type of "change_effect" occurred
preds_tally %>%
  group_by(change_effect) %>%
  summarize(n = sum(n))
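To put those tallies on the same scale as the percentages reported above, here is a small follow-up sketch that expresses each change effect as a proportion of the 1,831 test-set predictions.

# sketch: convert the change-effect tallies into proportions
preds_tally %>%
  group_by(change_effect) %>%
  summarize(n = sum(n)) %>%
  mutate(prop = n / sum(n))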

FULL (Hybrid and Non-Hybrid Neural Network Models)

To get more reliable estimates of the performance of non-hybrid and hybrid neural network models, let’s repeat the exact model training and testing procedure from above 10 times, storing and aggregating the results.

Run 10 models

Below is the code that runs through this modeling process 10 different times and saves the result from each round.

# -----------------------------------------------------------------------------
# STEP 0: set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# -----------------------------------------------------------------------------
# STEP 1: decide how many times to run the model
rounds <- 10

# -----------------------------------------------------------------------------
# STEP 2: set up object to store results
# part a: create names of results to store
result_cols <- c("model_type", "hyb_type", "round", "accuracy", "accuracy_LL", "accuracy_UL",
                 "sensitivity", "specificity", "precision", "npv", "n",
                 "total_truths", "total_lies", "total_truth_guesses", "total_lie_guesses")

# part b: create matrix
results <-
  matrix(nrow = rounds,
         ncol = length(result_cols))

# part c: actually name columns in results matrix
colnames(results) <- result_cols

# part d: convert to df (so multiple variables of different types can be stored)
results <- data.frame(results)

# -----------------------------------------------------------------------------
# STEP 3: start timer
start_time <- Sys.time()

# -----------------------------------------------------------------------------
# STEP 4: create rounds number of models, and store results each time

for (i in 1:rounds){
  
  # part a: partition data in 50-50 lgocv split (create index for test set)
  index_train <- 
    createDataPartition(y = stats_combo$stat_id,
                        p = 0.50,
                        times = 1,
                        list = FALSE)
  
  # part b: create testing and training data sets
  train_set <- stats_combo[index_train, ]
  test_set <- stats_combo[-index_train, ]
  
  # part c: use caret "train" function to train hybrid and non-hybrid neural network models
  model_hybrid_i <- 
     train(form = grd_truth ~ . - stat_id,
        data = train_set,
        method = "nnet",
        tuneGrid = tune_grid_neural,
        trControl = train_control_neural,
        metric = "Accuracy", # how to select among models
        trace = FALSE,
        maxit = 100,
        MaxNWts = max_weights_neural) # don't print output along the way
    

  model_non_hybrid_i <-
    train(form = grd_truth ~ . -stat_id -predict,
        data = train_set,
        method = "nnet",
        tuneGrid = tune_grid_neural,
        trControl = train_control_neural,
        metric = "Accuracy", # how to select among models
        trace = FALSE,
        maxit = 100,
        MaxNWts = max_weights_neural) # don't print output along the way
    
  # part d: make predictions, using both hybrid and non-hybrid models
  preds_hyb_i <-
    predict(object = model_hybrid_i,
            newdata = test_set,
            type = "raw")
  
  preds_non_hyb_i <-
    predict(object = model_non_hybrid_i,
            newdata = test_set,
            type = "raw")
  
  # part e: store model performance of both hybrid and non-hybrid model
  conf_hyb_i <-
    confusionMatrix(data = preds_hyb_i,
                    reference = test_set$grd_truth,
                    positive = "truth")
  
  conf_non_hyb_i <-
    confusionMatrix(data = preds_non_hyb_i,
                    reference = test_set$grd_truth,
                    positive = "truth")
  
  # part f: store hybrid and non-hybrid model results
  # model type
  results[2*i-1, 1] <- "neural"
  results[2*i, 1] <- "neural"
  # hybrid or non-hybrid
  results[2*i-1, 2] <- "hybrid"
  results[2*i, 2] <- "non_hybrid"
  # round
  results[2*i-1, 3] <- i
  results[2*i, 3] <- i
  # accuracy
  results[2*i-1, 4] <- conf_hyb_i$overall[1]
  results[2*i, 4] <- conf_non_hyb_i$overall[1]
  # accuracy LL
  results[2*i-1, 5] <- conf_hyb_i$overall[3]
  results[2*i, 5] <- conf_non_hyb_i$overall[3]
  # accuracy UL
  results[2*i-1, 6] <- conf_hyb_i$overall[4]
  results[2*i, 6] <- conf_non_hyb_i$overall[4]
  # sensitivity
  results[2*i-1, 7] <- conf_hyb_i$byClass[1]
  results[2*i, 7] <- conf_non_hyb_i$byClass[1]
  # specificity
  results[2*i-1, 8] <- conf_hyb_i$byClass[2]
  results[2*i, 8] <- conf_non_hyb_i$byClass[2]
  # precision
  results[2*i-1, 9] <- conf_hyb_i$byClass[3]
  results[2*i, 9] <- conf_non_hyb_i$byClass[3]
  # negative predictive value
  results[2*i-1, 10] <- conf_hyb_i$byClass[4]
  results[2*i, 10] <- conf_non_hyb_i$byClass[4]
  # sample size (total)
  results[2*i-1, 11] <- sum(conf_hyb_i$table)
  results[2*i, 11] <- sum(conf_non_hyb_i$table)
  # total truths
  results[2*i-1, 12] <- conf_hyb_i$table[3] + conf_hyb_i$table[4]
  results[2*i, 12] <- conf_non_hyb_i$table[3] + conf_non_hyb_i$table[4]
  # total lies
  results[2*i-1, 13] <- conf_hyb_i$table[1] + conf_hyb_i$table[2]
  results[2*i, 13] <- conf_non_hyb_i$table[1] + conf_non_hyb_i$table[2]
  # total truth guesses
  results[2*i-1, 14] <- conf_hyb_i$table[2] + conf_hyb_i$table[4]
  results[2*i, 14] <- conf_non_hyb_i$table[2] + conf_non_hyb_i$table[4]
  # total lie guesses
  results[2*i-1, 15] <- conf_hyb_i$table[1] + conf_hyb_i$table[3]
  results[2*i, 15] <- conf_non_hyb_i$table[1] + conf_non_hyb_i$table[3]
  
  # part g: print round and total elapsed time so far
  cumul_time <- difftime(Sys.time(), start_time, units = "mins")
  print(paste("round #", i, ": cumulative time ", round(cumul_time, 2), " mins",
              sep = ""))
  print("--------------------------------------")

}

View Results (Tabular)

Directly below, I’ve displayed a raw tabular summary of the results. We have 20 rows, which result from the 2 models (hybrid and non-hybrid) created in each of 10 rounds.

Here we can examine whether the better performance of the non-hybrid model over the hybrid model is a consistent trend. It seems that it is not. Focusing on overall accuracy, the hybrid model actually had a higher overall accuracy for 7 out of 10 rounds. This does not suggest that one type of model (hybrid vs. non-hybrid) clearly outperforms the other. When I conduct a simple sign test, which uses a binomial distribution to examine how likely it is that one model would have a higher accuracy score than the other for 7 out of 10 rounds (under the null assumption that each model should have a higher accuracy score an equal number of times), this pattern does not diverge significantly from the null hypothesis, p = 0.344. Similarly, when I conduct a Wilcoxon signed-rank test (a non-parametric test of whether there is sufficient evidence to reject the null hypothesis that two dependent samples are drawn from the same population), the difference does not appear to be significant, V = 38, p = 0.322. With more computational power and time, it would be worth re-running these models more than 10 times, to see if any consistent pattern of dominance emerges.

Looking across the 30 total rounds from our three model types (i.e. logistic regression, svm, and neural network), a hybrid model outperformed a non-hybrid model 25 out of 30 times (10/10 for logistic regression, 8/10 for svm, and 7/10 for neural network). This is not the result we would expect if each type of model were equally likely (i.e. 50%) to perform better than the other, p < 0.001 (by exact binomial test, i.e. a “sign” test).

# print results (long form)
results
# get data in wide format (with accuracy for hybrid and non_hybrid models next to each other)
results_acc_wide <-
  results %>%
    select(hyb_type, round, accuracy) %>%
    spread(key = hyb_type,
           value = accuracy)

# print
results_acc_wide %>%
  mutate(winner = case_when(hybrid > non_hybrid ~ "hybrid",
                            hybrid < non_hybrid ~ "non-hybrid",
                            hybrid == non_hybrid ~ "tie"))
# sign test, via binomial test
binom.test(x = 7,
           n = 10,
           p = 0.5)
## 
##  Exact binomial test
## 
## data:  7 and 10
## number of successes = 7, number of trials = 10, p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3475471 0.9332605
## sample estimates:
## probability of success 
##                    0.7
# conduct Wilcoxon signed-rank test 
wilcox.test(x = results_acc_wide$hybrid,
            y = results_acc_wide$non_hybrid,
            paired = TRUE,
            conf.int = 0.95)
## 
##  Wilcoxon signed rank test
## 
## data:  results_acc_wide$hybrid and results_acc_wide$non_hybrid
## V = 38, p-value = 0.3223
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -0.01501912  0.02348443
## sample estimates:
## (pseudo)median 
##    0.009557619
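For a closer look at what the sign test and Wilcoxon test are operating on, here is a small sketch (assuming the results_acc_wide data frame created above) that computes the per-round accuracy differences directly.

# sketch: per-round accuracy differences (hybrid minus non-hybrid)
results_acc_wide %>%
  mutate(diff = hybrid - non_hybrid) %>%
  summarize(mean_diff = mean(diff),
            rounds_hybrid_better = sum(diff > 0),
            rounds_non_hybrid_better = sum(diff < 0))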
# sign test across all three model types (logistic, svm, neural)
binom.test(x = 25,
           n = 30,
           p = 0.5)
## 
##  Exact binomial test
## 
## data:  25 and 30
## number of successes = 25, number of trials = 30, p-value =
## 0.0003249
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6527883 0.9435783
## sample estimates:
## probability of success 
##              0.8333333

View Results (Graphically; Overall Accuracy)

Finally, let’s quickly examine, visually, the average overall accuracy rate of the non-hybrid and hybrid models across the 10 rounds. When we do this, we see that the hybrid model has a slightly higher average overall accuracy than the non-hybrid model (although, as we can see, the confidence intervals for the two models’ overall accuracy rates overlap). Perhaps with more observations (i.e. predictions on the full set of 5,004 statements), this pattern might become stronger.

# average n in test set for hybrid and non-hybrid models
total_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$n)
total_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$n)

# vector of denominators to use, when creating error bars
denoms_overall <- c(total_hyb,
                    total_non_hyb)

# visualize overall accuracy
results %>%
  group_by(hyb_type) %>%
  summarize(accuracy = mean(accuracy)) %>%
  ggplot(aes(x = hyb_type,
             y = accuracy)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = accuracy - 1.96*sqrt(accuracy*(1-accuracy)/denoms_overall),
                     ymax = accuracy + 1.96*sqrt(accuracy*(1-accuracy)/denoms_overall)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.49, to = 0.70, by = 0.01),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("non_hybrid", "hybrid")) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        panel.background = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Accuracy by Model Type",
       x = "Model Type",
       y = "Overall Accuracy")

View Results (Graphically; Sensitivity, Specificity, Precision, Negative Predictive Value)

We see a similar pattern of results when we examine the other four major performance statistics averaged across the 10 rounds. The hybrid model has a higher average level of sensitivity, precision, and negative predictive value (although again the confidence intervals overlap). The (directional) exception is specificity, where the hybrid model performs worse. This is more in line with what I would have expected (as explained in the example case I went through earlier), given that a particular weakness of human lie detection performance is indeed specificity (correctly identifying lies as lies).

# average n in test set for hybrid and non-hybrid models
total_truths_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_truths)
total_truths_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_truths)
total_lies_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_lies)
total_lies_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_lies)
total_truth_guesses_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_truth_guesses)
total_truth_guesses_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_truth_guesses)
total_lie_guesses_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_lie_guesses)
total_lie_guesses_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_lie_guesses)

# vector of denominators to use, when creating error bars
denoms_perf <- c(total_truths_non_hyb,
                 total_truths_hyb,
                 total_lies_non_hyb,
                 total_lies_hyb,
                 total_truth_guesses_non_hyb,
                 total_truth_guesses_hyb,
                 total_lie_guesses_non_hyb,
                 total_lie_guesses_hyb)

# visualize results
results %>%
  select(hyb_type, round, sensitivity, specificity, precision, npv) %>%
  gather(key = "metric",
         value = "value",
         sensitivity, specificity, precision, npv) %>%
  group_by(hyb_type, metric) %>%
  summarize(value = mean(value)) %>%
  ungroup() %>%
  mutate(metric = factor(metric,
                            levels = c("sensitivity", "specificity", "precision", "npv"))) %>%
  ggplot(aes(x = hyb_type,
             y = value)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = value - 1.96*sqrt(value*(1-value)/denoms_perf),
                     ymax = value + 1.96*sqrt(value*(1-value)/denoms_perf)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.50, to = 0.70, by = 0.05),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("non_hybrid", "hybrid")) +
  facet_grid(metric ~ .) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        plot.background = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Metrics by Model Type",
       x = "Model Type",
       y = "Proportion")

Save Results

For now, this concludes our analysis. As our last step, let’s save our results.

# rename results df, to be particular to this model type (for disambiguation later)
results_HYB_neural <- results

# clear results variable
rm(results)

# save results in Rda file
save(results_HYB_neural,
     file = "results_HYB_neural.Rda")

Citations
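Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26.

Kuhn, M. (2019). caret: Classification and Regression Training. R package. https://CRAN.R-project.org/package=caret

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer.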

END