[return to overview page]

In this section, we will generate our first hybrid human-computer model. This will be a logistic regression model. The general training-testing and model-building procedure will be exactly the same as for our earlier logistic regression model, with the exception, of course, that our hybrid model will also take human predictions into account as a feature (in addition to the textual features used earlier). The performance of this model will be compared to the performance of a non-hybrid model (which uses only textual features) trained on the same statements.

Packages

Again, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization
library(caret) # modeling

Load Data

Next, I will load the data file we created earlier, which has both the cleaned and processed textual features and the human predictions. Note, again, that this data file consists of a total of 3,663 statements.

# load df of combined human predictions, processed textual features, and ground truth
load("stats_combo.Rda")

# For rendering, I'm going to cheat here and load results created when this model was first run
# For some reason, chunks that were supposed to be cached when originally run are rerunning
load("results_HYB_log.Rda")
# change the specific names (renamed at end), back to generic name
results_HYB_log -> results
# model_hybrid_log -> model_hybrid
# model_non_hybrid_log -> model_non_hybrid

# print combined file
stats_combo

EXAMPLE (Single Predictive Logistic Regression Model)

As usual, let’s begin with a single example. Here we will simply train and test two logistic regression models: one hybrid model that takes into account both human predictions and textual features, and one that takes into account only textual features.

Split Sample Into Training and Testing Sets

As before, we are going to conduct an exact 50-50 split, randomly allocating one half of the statements to the training set, and the other half to the testing set (using the createDataPartition function in the caret package; Kuhn, 2008; Kuhn, 2019). The exact same training set will be used to build the hybrid and non-hybrid models, both of which will then also be tested on the exact same testing set. (By “exactly the same”, I mean that the statements in the training set for the hybrid model will be exactly the same as the statements in the training set for the non-hybrid model, and likewise for the testing sets. Of course, the features used in the two models will differ, in the way stated above.)

# set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# partition data in 50-50 lgocv split (create index for test set)
index_train_ex <- 
  createDataPartition(y = stats_combo$stat_id,
                      p = 0.50,
                      times = 1,
                      list = FALSE)

# actually create data frame with training set (predictors and outcome together)
train_set_ex <- stats_combo[index_train_ex, ]

# actually create data frame with test set (predictors and outcome together)
test_set_ex <- stats_combo[-index_train_ex, ]

Build Model (on Training Set)

Now that the data are split, we can fit the logistic regression models to the training data. Again, we will fit two logistic regression models: a hybrid model that uses both textual features and human guesses as predictors, and a non-hybrid model that uses only textual features as predictors. These two models are fit to the training data below.

# set seed, so that statistics don't keep changing for every analysis
# (applies for models which might have random parameters)
set.seed(2019)

# start timer
start_time <- Sys.time()

# use caret "train" function to train logistic regression model
model_hybrid <- 
  train(form = grd_truth ~ . - stat_id,
        data = train_set_ex,
        method = "glm",
        family = "binomial")

model_non_hybrid <-
  train(form = grd_truth ~ . -stat_id -predict,
        data = train_set_ex,
        method = "glm",
        family = "binomial")

# end timer
total_time <- Sys.time() - start_time

Evaluate Model (on Testing Set)

Finally, let’s examine and compare the performance of the two models. As before, this is done by using the models to generate predictions on the remaining ~50% of the data (1831 statements) in the testing set, and then examining the performance statistics using the confusionMatrix function from the caret package (Kuhn, 2008).

Both models have an overall accuracy level that is above chance:

  • hybrid model: 64.1% [95% CI: 61.9, 66.3%]
  • non-hybrid model: 63.2% [95% CI: 60.9, 65.4%]

And the overall accuracy of the hybrid model is better than the overall accuracy of the non-hybrid model. Disappointingly, however, this difference is not significant, chi-squared = 0.30, p = 0.583. Not even close, in fact. This is both disappointing and surprising. Earlier, we saw that human predictions alone achieved an accuracy of over 55%, which was well above chance. And a model trained only on textual features here also performs well above chance (overall accuracy > 63%, as seen just above). It would stand to reason that adding in human predictions would serve to significantly improve the model. This is of course not what happened. I will need to conduct further analyses to examine why.

One possible hypothesis at the moment is that (1) human predictions are somehow redundant with some existing feature or set of features. (Note: this is not to say that such a feature or set of features is the actual basis of human judgments, just that there might be, for whatever reason, an association.) A priori, however, this seems unlikely to me, knowing the features we have extracted (i.e. it seems unlikely that human judgments track something as clinical as the number of pronouns in a sentence, or even more “human” features like sentiment). Another hypothesis relates to (2) the different patterns of guessing by which humans and computer models achieve their accuracy. As we saw in our previous analyses, computer models had roughly comparable levels of sensitivity (rate of truth detection) and specificity (rate of lie detection), whereas humans had much higher sensitivity than specificity. I am not sure of the exact effect of this on the hybrid models, but a loose speculation is that any gains in accuracy from adding in human predictions would come from correctly identifying a larger portion of truths as truths, and these gains would be “offset” by an increase in the portion of lies that are incorrectly identified.
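One quick way to probe hypothesis (1) would be to see how well the textual features can predict the human guesses themselves. Below is a minimal sketch of such a check; it assumes the human-prediction column is the two-level factor predict used in the models above, and would need adjusting if that column is stored differently (e.g., as a numeric proportion of “truth” guesses).

# SKETCH (assumptions noted above): predict the human guesses from the textual features;
# if this model is accurate well above chance, human predictions add little unique signal
set.seed(2019)

model_pred_humans <-
  train(form = predict ~ . - stat_id - grd_truth,
        data = train_set_ex,
        method = "glm",
        family = "binomial")

# accuracy of textual features at reproducing human guesses (on the held-out test set)
confusionMatrix(data = predict(object = model_pred_humans,
                               newdata = test_set_ex,
                               type = "raw"),
                reference = test_set_ex$predict,
                positive = "truth")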

With regard to other performance metrics, I have not yet conducted formal statistical tests for each of them, but sensitivity, specificity, precision, and negative predictive value look roughly the same for the hybrid and non-hybrid models. (A sketch of how such a test might look for sensitivity appears after the accuracy comparison below.)

# generate predictions for hybrid model 
preds_hybrid <-
  predict(object = model_hybrid,
                 newdata = test_set_ex,
                 type = "raw")

# generate predictions for non-hybrid model
preds_non_hybrid <-
  predict(object = model_non_hybrid,
                 newdata = test_set_ex,
                 type = "raw")
# examine hybrid model performance
conf_hybrid <-
  confusionMatrix(data = preds_hybrid,
                  reference = test_set_ex$grd_truth,
                  positive = "truth")

# print
conf_hybrid
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction lie truth
##      lie   583   324
##      truth 333   591
##                                           
##                Accuracy : 0.6412          
##                  95% CI : (0.6187, 0.6632)
##     No Information Rate : 0.5003          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.2824          
##  Mcnemar's Test P-Value : 0.755           
##                                           
##             Sensitivity : 0.6459          
##             Specificity : 0.6365          
##          Pos Pred Value : 0.6396          
##          Neg Pred Value : 0.6428          
##              Prevalence : 0.4997          
##          Detection Rate : 0.3228          
##    Detection Prevalence : 0.5046          
##       Balanced Accuracy : 0.6412          
##                                           
##        'Positive' Class : truth           
## 
# examine non-hybrid model performance
conf_non_hybrid <-
  confusionMatrix(data = preds_non_hybrid,
                  reference = test_set_ex$grd_truth,
                  positive = "truth")

# print
conf_non_hybrid
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction lie truth
##      lie   563   321
##      truth 353   594
##                                          
##                Accuracy : 0.6319         
##                  95% CI : (0.6093, 0.654)
##     No Information Rate : 0.5003         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.2638         
##  Mcnemar's Test P-Value : 0.2324         
##                                          
##             Sensitivity : 0.6492         
##             Specificity : 0.6146         
##          Pos Pred Value : 0.6272         
##          Neg Pred Value : 0.6369         
##              Prevalence : 0.4997         
##          Detection Rate : 0.3244         
##    Detection Prevalence : 0.5172         
##       Balanced Accuracy : 0.6319         
##                                          
##        'Positive' Class : truth          
## 
# calculate values relevant for test (and some extras, in case i want to compare sensitivity, etc)
hybrid_truth_correct = conf_hybrid$table[4]
hybrid_truth_total = conf_hybrid$table[3] + conf_hybrid$table[4]
hybrid_lie_correct = conf_hybrid$table[1]
hybrid_lie_total = conf_hybrid$table[1] + conf_hybrid$table[2]
hybrid_total_correct = hybrid_truth_correct + hybrid_lie_correct
hybrid_total_truth_guesses = conf_hybrid$table[2] + conf_hybrid$table[4]
hybrid_total_lie_guesses = conf_hybrid$table[1] + conf_hybrid$table[3]
hybrid_total = sum(conf_hybrid$table)

non_hybrid_truth_correct = conf_non_hybrid$table[4]
non_hybrid_truth_total = conf_non_hybrid$table[3] + conf_non_hybrid$table[4]
non_hybrid_lie_correct = conf_non_hybrid$table[1]
non_hybrid_lie_total = conf_non_hybrid$table[1] + conf_non_hybrid$table[2]
non_hybrid_total_correct = non_hybrid_truth_correct + non_hybrid_lie_correct
non_hybrid_total_truth_guesses = conf_non_hybrid$table[2] + conf_non_hybrid$table[4]
non_hybrid_total_lie_guesses = conf_non_hybrid$table[1] + conf_non_hybrid$table[3]
non_hybrid_total = sum(conf_non_hybrid$table)
# conduct actual test
prop.test(x = c(hybrid_total_correct, non_hybrid_total_correct),
          n = c(hybrid_total, non_hybrid_total),
          alternative = "two.sided",
          conf.level = 0.95)
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(hybrid_total_correct, non_hybrid_total_correct) out of c(hybrid_total, non_hybrid_total)
## X-squared = 0.30216, df = 1, p-value = 0.5825
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.02241752  0.04098661
## sample estimates:
##    prop 1    prop 2 
## 0.6411797 0.6318951
# examine z value, using two proportion z-test just for comparison

# function to compute z-value
# from: https://www.r-bloggers.com/comparison-of-two-proportions-parametric-z-test-and-non-parametric-chi-squared-methods/
z.prop = function(x1,x2,n1,n2){
  numerator = (x1/n1) - (x2/n2)
  p.common = (x1+x2) / (n1+n2)
  denominator = sqrt(p.common * (1-p.common) * (1/n1 + 1/n2))
  z.prop.ris = numerator / denominator
  return(z.prop.ris)
}

# actual z prop (z = 0.58)
z_value <-
  z.prop(x1 = hybrid_total_correct,
         x2 = non_hybrid_total_correct,
         n1 = hybrid_total, 
         n2 = non_hybrid_total)

# get corresponding p-value (two-sided) (p = 0.559)
p_value <- 2*pnorm(-abs(z_value))

# print z and p
print(paste("z=", round(z_value, 3),
            ", p=", round(p_value, 3),
            sep = ""))
## [1] "z=0.584, p=0.559"
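As a minimal sketch of the formal tests mentioned above (mirroring the accuracy comparison just run), sensitivity could be compared between the two models with the same two-proportion test, built from the confusion-matrix cell counts; the same pattern would work for specificity, precision, and negative predictive value by swapping in the relevant cells.

# sketch: two-proportion test for sensitivity
# (correct "truth" calls out of all statements that are actually truths)
prop.test(x = c(conf_hybrid$table["truth", "truth"],
                conf_non_hybrid$table["truth", "truth"]),
          n = c(sum(conf_hybrid$table[, "truth"]),
                sum(conf_non_hybrid$table[, "truth"])),
          alternative = "two.sided",
          conf.level = 0.95)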

Change in predictions

Let’s try to get some insight into how incorporating human predictions changed the predictions made by the hybrid model (as compared to the predictions made by the non-hybrid model). As we have discussed before, there are only four possible outcomes when the predictions of a model are compared to reality: true positive, true negative, false positive, false negative. All of the model’s performance depends on these four quantities. We can think of the non-hybrid model as a “baseline”, upon which the hybrid model might improve. The hybrid model can improve upon the non-hybrid model in only one of two ways: a statement that was incorrectly predicted as a false negative by the non-hybrid model is correctly predicted as a true positive by the hybrid model, or a statement that was incorrectly predicted as a false positive by the non-hybrid model is correctly predicted as a true negative by the hybrid model. The rest of the changes either have no effect (a true positive stays a true positive from the non-hybrid to the hybrid model, a true negative stays a true negative, a false positive stays a false positive, or a false negative stays a false negative) or are detrimental (a true positive becomes a false negative, or a true negative becomes a false positive).

Below, I have tabulated the total number of times these eight possible changes occurred, when the non-hybrid model’s predictions are compared to the hybrid model’s predictions. What we see is that most of the predictions stayed the same between the non-hybrid and hybrid model (1698 of 1831 predictions, 92.7%, were the same). And only a slightly higher number of predictions went from incorrect to correct (75 cases, 4.1%, were either false negatives that became true positives or false positives that became true negatives) than the number of predictions that went from correct to incorrect (58 cases, 3.2%, were either true positives that became false negatives or true negatives that became false positives). (A sketch of a simple test on these gain/loss counts follows the tally below.)

More detailed analyses should examine how the human predictions relate to the other features used for prediction.

# combine: grd_truth, non-hybrid predictions, hybrid predictions
preds_comp <-
  cbind(as.data.frame(test_set_ex$grd_truth),
        as.data.frame(preds_non_hybrid),
        as.data.frame(preds_hybrid))

# label columns
colnames(preds_comp) <- c("grd_truth", "non_hyb", "hyb")

# print
# preds_comp

# generate additional information
preds_tally <-
  preds_comp %>%
  group_by(grd_truth, non_hyb, hyb) %>%
  summarize(n = n()) %>%
  arrange(desc(grd_truth), desc(non_hyb), desc(hyb)) %>%
  mutate(change_present = case_when(non_hyb == hyb ~ "no",
                            non_hyb != hyb ~ "yes")) %>%
  mutate(change_type = case_when(grd_truth == "truth" & non_hyb == "truth" & hyb == "truth" ~ "TP -> TP",
                            grd_truth == "truth" & non_hyb == "truth" & hyb == "lie" ~ "TP -> FN",
                            grd_truth == "truth" & non_hyb == "lie" & hyb == "truth" ~ "FN -> TP",
                            grd_truth == "truth" & non_hyb == "lie" & hyb == "lie" ~ "FN -> FN",
                            grd_truth == "lie" & non_hyb == "truth" & hyb == "truth" ~ "FP -> FP",
                            grd_truth == "lie" & non_hyb == "truth" & hyb == "lie" ~ "FP -> TN",
                            grd_truth == "lie" & non_hyb == "lie" & hyb == "truth" ~ "TN -> FP",
                            grd_truth == "lie" & non_hyb == "lie" & hyb == "lie" ~ "TN -> TN")) %>%
  mutate(change_effect = case_when(change_type == "TP -> TP" ~ "nothing",
                                   change_type == "TN -> TN" ~ "nothing",
                                   change_type == "FP -> FP" ~ "nothing",
                                   change_type == "FN -> FN" ~ "nothing",
                                   change_type == "FN -> TP" ~ "gain",
                                   change_type == "FP -> TN" ~ "gain",
                                   change_type == "TP -> FN" ~ "loss",
                                   change_type == "TN -> FP" ~ "loss")) %>%
  arrange(change_effect)

# print
preds_tally
# tally the total number of times each type of "change_effect" occurred
preds_tally %>%
  group_by(change_effect) %>%
  summarize(n = sum(n))
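One way to ask whether the “gain” switches reliably outnumber the “loss” switches would be an exact binomial test restricted to the discordant predictions only (essentially a McNemar-style comparison of the two models). A minimal sketch, using the tally just computed:

# sketch: exact binomial test on the discordant predictions only
gains  <- sum(preds_tally$n[preds_tally$change_effect == "gain"])
losses <- sum(preds_tally$n[preds_tally$change_effect == "loss"])

binom.test(x = gains,
           n = gains + losses,
           p = 0.5)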

FULL (Hybrid and Non-Hybrid Logistic Regression Models)

To get more reliable estimates of the performance of non-hybrid and hybrid models, let’s repeat the exact model training and testing procedure from above 10 times, storing and aggregating the results.

Run 10 models

Below is the code that runs through this modeling process 10 different times and saves the result from each round.

# -----------------------------------------------------------------------------
# STEP 0: set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# -----------------------------------------------------------------------------
# STEP 1: decide how many times to run the model
rounds <- 10

# -----------------------------------------------------------------------------
# STEP 2: set up object to store results
# part a: create names of results to store
result_cols <- c("model_type", "hyb_type", "round", "accuracy", "accuracy_LL", "accuracy_UL",
                 "sensitivity", "specificity", "precision", "npv", "n",
                 "total_truths", "total_lies", "total_truth_guesses", "total_lie_guesses")

# part b: create matrix
results <-
  matrix(nrow = rounds,
         ncol = length(result_cols))

# part c: actually name columns in results matrix
colnames(results) <- result_cols

# part d: convert to df (so multiple variables of different types can be stored)
results <- data.frame(results)

# -----------------------------------------------------------------------------
# STEP 3: start timer
start_time <- Sys.time()

# -----------------------------------------------------------------------------
# STEP 4: create rounds number of models, and store results each time

for (i in 1:rounds){
  
  # part a: partition data in 50-50 lgocv split (create index for test set)
  index_train <- 
    createDataPartition(y = stats_combo$stat_id,
                        p = 0.50,
                        times = 1,
                        list = FALSE)
  
  # part b: create testing and training data sets
  train_set <- stats_combo[index_train, ]
  test_set <- stats_combo[-index_train, ]
  
  
  # part c: use caret "train" function to train hybrid and non-hybrid logistic regression model
  model_hybrid_i <- 
  train(form = grd_truth ~ . - stat_id,
        data = train_set,
        method = "glm",
        family = "binomial")

  model_non_hybrid_i <-
    train(form = grd_truth ~ . -stat_id -predict,
          data = train_set,
          method = "glm",
          family = "binomial")
  
  
  # part d: make predictions, using both hybrid and non-hybrid models
  preds_hyb_i <-
    predict(object = model_hybrid_i,
            newdata = test_set,
            type = "raw")
  
  preds_non_hyb_i <-
    predict(object = model_non_hybrid_i,
            newdata = test_set,
            type = "raw")
  
  # part e: store model performance of both hybrid and non-hybrid model
  conf_hyb_i <-
    confusionMatrix(data = preds_hyb_i,
                    reference = test_set$grd_truth,
                    positive = "truth")
  
  conf_non_hyb_i <-
    confusionMatrix(data = preds_non_hyb_i,
                    reference = test_set$grd_truth,
                    positive = "truth")
  
  # part f: store hybrid and non-hybrid model results
  # model type
  results[2*i-1, 1] <- "logistic"
  results[2*i, 1] <- "logistic"
  # hybrid or non-hybrid
  results[2*i-1, 2] <- "hybrid"
  results[2*i, 2] <- "non_hybrid"
  # round
  results[2*i-1, 3] <- i
  results[2*i, 3] <- i
  # accuracy
  results[2*i-1, 4] <- conf_hyb_i$overall[1]
  results[2*i, 4] <- conf_non_hyb_i$overall[1]
  # accuracy LL
  results[2*i-1, 5] <- conf_hyb_i$overall[3]
  results[2*i, 5] <- conf_non_hyb_i$overall[3]
  # accuracy UL
  results[2*i-1, 6] <- conf_hyb_i$overall[4]
  results[2*i, 6] <- conf_non_hyb_i$overall[4]
  # sensitivity
  results[2*i-1, 7] <- conf_hyb_i$byClass[1]
  results[2*i, 7] <- conf_non_hyb_i$byClass[1]
  # specificity
  results[2*i-1, 8] <- conf_hyb_i$byClass[2]
  results[2*i, 8] <- conf_non_hyb_i$byClass[2]
  # precision
  results[2*i-1, 9] <- conf_hyb_i$byClass[3]
  results[2*i, 9] <- conf_non_hyb_i$byClass[3]
  # negative predictive value
  results[2*i-1, 10] <- conf_hyb_i$byClass[4]
  results[2*i, 10] <- conf_non_hyb_i$byClass[4]
  # sample size (total)
  results[2*i-1, 11] <- sum(conf_hyb_i$table)
  results[2*i, 11] <- sum(conf_non_hyb_i$table)
  # total truths
  results[2*i-1, 12] <- conf_hyb_i$table[3] + conf_hyb_i$table[4]
  results[2*i, 12] <- conf_non_hyb_i$table[3] + conf_non_hyb_i$table[4]
  # total lies
  results[2*i-1, 13] <- conf_hyb_i$table[1] + conf_hyb_i$table[2]
  results[2*i, 13] <- conf_non_hyb_i$table[1] + conf_non_hyb_i$table[2]
  # total truth guesses
  results[2*i-1, 14] <- conf_hyb_i$table[2] + conf_hyb_i$table[4]
  results[2*i, 14] <- conf_non_hyb_i$table[2] + conf_non_hyb_i$table[4]
  # total lie guesses
  results[2*i-1, 15] <- conf_hyb_i$table[1] + conf_hyb_i$table[3]
  results[2*i, 15] <- conf_non_hyb_i$table[1] + conf_non_hyb_i$table[3]
  
  # part g: print round and total elapsed time so far
  cumul_time <- difftime(Sys.time(), start_time, units = "mins")
  print(paste("round #", i, ": cumulative time ", round(cumul_time, 2), " mins",
              sep = ""))
  print("--------------------------------------")

}

View Results (Tabular)

Directly below, I’ve displayed a raw tabular summary of the results. We have 20 rows, which result from the 2 models (hybrid and non-hybrid) created in each of 10 rounds. One pattern that stands out is that, although, like before, the hybrid model has only a slightly higher overall accuracy than the non-hybrid model, this holds in every round. In the second chunk below, I have displayed this directly. For each round, I print the overall accuracy rate of the hybrid model next to the overall accuracy rate of the non-hybrid model. We see that in literally every single round, the hybrid model performed better than the non-hybrid model. This pattern of results seems unlikely if the hybrid and non-hybrid models truly were equivalent in performance. Indeed, when I conduct a simple sign test, which uses a binomial distribution to examine how likely it is that one model would have a higher accuracy score than the other for 10 out of 10 rounds, under the null assumption that each model should have a higher accuracy score an equal number of times, this pattern diverges significantly from that null hypothesis, p = 0.002. Similarly, when I conduct a Wilcoxon signed-rank test (a non-parametric test of whether two dependent samples are drawn from the same population), this difference appears to be significant, V = 55, p = 0.002. This is some evidence to suggest that hybrid human-computer models can perform better than non-hybrid computer models, which do not incorporate human judgments. (There may be some statistical artifact or error that a more statistically savvy person may point out that I am failing to account for, but I cannot think of one. Perhaps there is some form of non-independence that I have not accounted for.)

# print results (long form)
results
# get data in wide format (with accuracy for hybrid and non_hybrid models next to each other)
results_acc_wide <-
  results %>%
    select(hyb_type, round, accuracy) %>%
    spread(key = hyb_type,
           value = accuracy)

# print
results_acc_wide %>%
  mutate(winner = case_when(hybrid > non_hybrid ~ "hybrid",
                            hybrid < non_hybrid ~ "non-hybrid",
                            hybrid == non_hybrid ~ "tie"))
# sign test, via binomial test
binom.test(x = 10,
           n = 10,
           p = 0.5)
## 
##  Exact binomial test
## 
## data:  10 and 10
## number of successes = 10, number of trials = 10, p-value =
## 0.001953
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6915029 1.0000000
## sample estimates:
## probability of success 
##                      1
# conduct Wilcoxon signed-rank test 
wilcox.test(x = results_acc_wide$hybrid,
            y = results_acc_wide$non_hybrid,
            paired = TRUE,
            conf.int = 0.95)
## 
##  Wilcoxon signed rank test
## 
## data:  results_acc_wide$hybrid and results_acc_wide$non_hybrid
## V = 55, p-value = 0.001953
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  0.006280721 0.015838340
## sample estimates:
## (pseudo)median 
##     0.01037684

View Results (Graphically; Overall Accuracy)

When we visually examine the average overall accuracy rate of non-hybrid and hybrid models across 10 rounds, we see much the same pattern of results we saw when we compared these two model types only once in the example case with which we led. The hybrid model has a slightly higher level of overall accuracy than the non-hybrid model (although, as we can see, the confidence intervals for the two models’ overall accuracy rates overlap).

# average n in test set for hybrid and non-hybrid models
total_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$n)
total_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$n)

# vector of denominators to use, when creating error bars
denoms_overall <- c(total_hyb,
                    total_non_hyb)

# visualize overall accuracy
results %>%
  group_by(hyb_type) %>%
  summarize(accuracy = mean(accuracy)) %>%
  ggplot(aes(x = hyb_type,
             y = accuracy)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = accuracy - 1.96*sqrt(accuracy*(1-accuracy)/denoms_overall),
                     ymax = accuracy + 1.96*sqrt(accuracy*(1-accuracy)/denoms_overall)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.49, to = 0.70, by = 0.01),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("non_hybrid", "hybrid")) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        panel.background = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Accuracy by Model Type",
       x = "Model Type",
       y = "Overall Accuracy")

View Results (Graphically; Sensitivity, Specificity, Precision, Negative Predictive Value)

We see a similar pattern of results when we examine the other four major performance statistics averaged across the 10 rounds. The hybrid model has a higher average level of sensitivity, specificity, precision, and negative predictive value (although, again, the confidence intervals overlap).

# average n in test set for hybrid and non-hybrid models
total_truths_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_truths)
total_truths_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_truths)
total_lies_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_lies)
total_lies_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_lies)
total_truth_guesses_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_truth_guesses)
total_truth_guesses_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_truth_guesses)
total_lie_guesses_hyb <- mean((results %>% filter(hyb_type == "hybrid"))$total_lie_guesses)
total_lie_guesses_non_hyb <- mean((results %>% filter(hyb_type == "non_hybrid"))$total_lie_guesses)

# vector of denominators to use, when creating error bars
denoms_perf <- c(total_truths_non_hyb,
                 total_truths_hyb,
                 total_lies_non_hyb,
                 total_lies_hyb,
                 total_truth_guesses_non_hyb,
                 total_truth_guesses_hyb,
                 total_lie_guesses_non_hyb,
                 total_lie_guesses_hyb)

# visualize results
results %>%
  select(hyb_type, round, sensitivity, specificity, precision, npv) %>%
  gather(key = "metric",
         value = "value",
         sensitivity, specificity, precision, npv) %>%
  group_by(hyb_type, metric) %>%
  summarize(value = mean(value)) %>%
  ungroup() %>%
  mutate(metric = factor(metric,
                            levels = c("sensitivity", "specificity", "precision", "npv"))) %>%
  ggplot(aes(x = hyb_type,
             y = value)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = value - 1.96*sqrt(value*(1-value)/denoms_perf),
                     ymax = value + 1.96*sqrt(value*(1-value)/denoms_perf)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.50, to = 0.70, by = 0.05),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("non_hybrid", "hybrid")) +
  facet_grid(metric ~ .) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        plot.background = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Metrics by Model Type",
       x = "Model Type",
       y = "Proportion")

Save Results

For now, this concludes our analysis. As our last step, let’s save our results.

# rename results df, to be particular to this model type (for disambiguation later)
results_HYB_log <- results

# clear results variable
rm(results)

# save results in Rda file
save(results_HYB_log,
     file = "results_HYB_log.Rda")

Citations

END