In this final series of analyses, I will create hybrid human-computer models to make truth-lie predictions. These models will be almost exactly like the computer models we created earlier (i.e. logistic regression, support vector machine, and neural network). They will use the textual features we extracted and cleaned earlier (i.e. statement length, parts of speech, sentiment, readability & complexity, and bag of words). However, these new models will include one more feature: human predictions. As we saw, humans were able to perform above chance when making predictions about the truth or falsity of statements in our corpus (55.5% accuracy across 3,663 statements). Thus, the hope is that feeding these human predictions into a computer model as an additional feature (alongside the textual features) will further improve its accuracy. The expectation is that these hybrid human-computer models should perform better (have higher overall accuracy) than either purely text-based computer models or pure human prediction.

In the sections that follow, I will create hybrid models of the same three types as earlier: logistic regression, support vector machine, and neural network. In each case, these hybrid human-computer models will then be compared to computer models trained solely on textual features. (We will use the same training-testing scheme as earlier, i.e. a 50-50 training-testing split, repeated 10 times. Note, however, that we only have human predictions for 3,663 of the full 5,004 statements, so our hybrid models will be restricted to this set of 3,663 statements for which both textual features and human predictions are available. As we saw in the “OTHER ANALYSES” section of our logistic regression model, model performance increased as a model had more data to train on. It would therefore not be an equivalent comparison to set hybrid models trained on 50% of 3,663 statements against the earlier non-hybrid computer models, which were trained on 50% of the full 5,004 statements. Instead, the hybrid models will be compared to new non-hybrid models trained on 50% of the same 3,663 statements.)
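To make the comparison scheme concrete, here is a minimal sketch of how 10 repeated 50-50 splits over the 3,663-statement subset could be generated so that the hybrid and text-only models are evaluated on identical rows. It uses base R and simple random sampling on a toy data frame; the object names (`toy`, `splits`, `n_rounds`) are hypothetical, and the splitting function actually used in the earlier model pages may differ (for example, it may stratify by ground truth).

```r
# Toy stand-in for the 3,663-statement subset; all names here are hypothetical
set.seed(1)
n_stat <- 3663
toy <- data.frame(
  stat_id   = seq_len(n_stat),
  grd_truth = sample(c("truth", "lie"), n_stat, replace = TRUE)
)

n_rounds <- 10
n_train  <- floor(n_stat / 2) # half of the statements go to training

# One list element per round, each holding row indices for training and testing
splits <- lapply(seq_len(n_rounds), function(i) {
  train_idx <- sample(n_stat, n_train) # sampled without replacement
  list(train = train_idx,
       test  = setdiff(seq_len(n_stat), train_idx))
})

# Size of the training and testing sets in each round
sapply(splits, function(s) c(train = length(s$train), test = length(s$test)))
```

Reusing `splits[[i]]` for both the hybrid and the text-only model in round `i` means any difference in accuracy cannot be attributed to the two models having seen different statements.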

In the remainder of this page, I just go through some basic preliminary steps needed to eventually build these hybrid human-computer models (namely, combining the textual features with human predictions into one nice tidy data file).

Packages

As usual, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization

Load Data

Next, I will load in the two basic data files we need: the processed textual features for each of the 5,004 statements and the human predictions that were made for 3,663 of the statements.

# load in cleaned features
load("stats_proc.Rda")
rm(stats_raw) # remove raw (un-processed) feature data

# load in guesses
load("stats_guess.Rda")
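Because `load()` restores objects under whatever names they were saved with, it can be useful to check what a file actually contains before removing anything. The sketch below is self-contained and uses a temporary file with made-up objects (`toy_proc`, `toy_raw`); it is only an illustration and says nothing about the real contents of the `.Rda` files above.

```r
# Write a throwaway .Rda file so the example is self-contained
tmp <- tempfile(fileext = ".Rda")
toy_proc <- data.frame(stat_id = 1:3)
toy_raw  <- data.frame(stat_id = 1:3)
save(toy_proc, toy_raw, file = tmp)
rm(toy_proc, toy_raw)

# load() invisibly returns the names of the objects it restored
loaded <- load(tmp)
print(loaded)

# Drop the object that is not needed, mirroring the rm() call above
rm(toy_raw)
exists("toy_proc")
```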

Join Files Together

And here, I will join those files together, and then print the resultant data frame.

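# combine data
stats_combo <-
  stats_guess %>%
  select(stat_id, predict) %>% # keep only the statement id and the human prediction
  left_join(y = stats_proc,
            by = "stat_id") %>% # attach the textual features for each statement
  select(stat_id,
         grd_truth,
         predict,
         everything()) # move id, ground truth, and human prediction to the front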

# print resultant df
stats_combo
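After a join like this, a few quick checks can confirm that the combined data frame has the expected shape. The sketch below reproduces the same join pattern on toy data and then verifies the row count, looks for guesses without matching features, and computes the share of human predictions that agree with the ground truth. The toy values, the extra column `n_words`, and the assumption that `predict` and `grd_truth` use the same labels are all hypothetical and are not taken from the real data.

```r
library(dplyr)

# Toy stand-ins; values and the extra column n_words are hypothetical
toy_proc <- tibble(
  stat_id   = 1:5,
  grd_truth = c("truth", "lie", "truth", "lie", "truth"),
  n_words   = c(12, 30, 18, 25, 9)
)
toy_guess <- tibble(
  stat_id = c(1L, 2L, 4L),
  predict = c("truth", "truth", "lie")
)

# Same join pattern as above, applied to the toy data
toy_combo <- toy_guess %>%
  select(stat_id, predict) %>%
  left_join(toy_proc, by = "stat_id") %>%
  select(stat_id, grd_truth, predict, everything())

# A left join from the guesses keeps one row per guessed statement,
# provided stat_id is unique in the feature table
stopifnot(nrow(toy_combo) == nrow(toy_guess))

# Guesses with no matching feature row would show up here
unmatched <- anti_join(toy_guess, toy_proc, by = "stat_id")
nrow(unmatched)

# Proportion of human predictions that match the ground truth
# (assumes both columns are coded with the same labels)
mean(toy_combo$predict == toy_combo$grd_truth)
```

Applied to the real data, the row-count check would be expected to give 3,663 rows if each statement has exactly one human prediction and one row of features.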