[return to overview page]

As Gelman & Hill (2006, p. 79) note “logistic regression is the standard way to model binary outcomes.” So let’s start our predictive modeling here. Our focus throughout will be on assessing the performance of the logistic regression models we build – more so than interpreting coefficients, which is often the focus in a great deal of social science research (where the emphasis is on explanation and thus the variables which “explain” some outcome).

Packages

Again, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization
library(ggthemes) # visualization
library(caret) # modeling
library(broom) # needed for ggcoef function
library(GGally) # has ggcoef function
library(AppliedPredictiveModeling)
library(pROC) # ROC curve

Load Data

Next, I will load the pre-processed data, which we created earlier (see Data Cleaning & Pre-Processing). This dataset has a row for each of 5,004 statements, a column indicating whether that particular statement was a truth or a lie, and 90 possible predictor variables for each statement, which comes from the textual features we extracted earlier.

# load pre-processed df's
load("stats_proc.Rda")

EXAMPLE (Single Predictive Logistic Regression Model)

As usual, let’s begin with an example. Here we will simply train and test one single logistic regression model.

Split Sample Into Training and Testing Sets

Our first step will be to split the entire dataset into two parts – our training data set, on which the model will be build, and our testing data set, on which the performance of our model will be evaluated. Although many possible splits would be acceptable (e.g. 75-25, 90-10), we are going to conduct an exact 50-50 split, randomly allocating one half of the statements to the training set, and the other half to the testing set. The createDataPartition function in the caret packages makes this easy (Kuhn, 2008).

# set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# partition data in 50-50 lgocv split (create index for test set)
index_train_ex <- 
  createDataPartition(y = stats_proc$stat_id,
                      p = 0.50,
                      times = 1,
                      list = FALSE)

# actually create data frame with training set (predictors and outcome together)
train_set_ex <- stats_proc[index_train_ex, ]

# actualy create data frame with test set (predictors and outcome together)
test_set_ex <- stats_proc[-index_train_ex, ]

Build Model (on Training Set)

Now that the data are split, we can fit a logistic regression model to the training data. Again, the caret package makes this easy with its “train” function (Kuhn, 2008), which allows us to select from over 238 different model type (Kuhn, 2019; see: Chapter 7, including of course the logistic regression model from the family of general lineal models. A single logistic regression model is fitted below.

# set seed, so that statistics don't keep changing for every analysis
# (applies for models which might have random parameters)
set.seed(2019)

# start timer
start_time <- Sys.time()

# use caret "train" function to train logistic regression model
model_ex <- 
  train(form = grd_truth ~ . - stat_id,
        data = train_set_ex,
        method = "glm",
        family = "binomial")

# end timer
total_time <- Sys.time() - start_time

Coefficients

In logistic regression, we model the log odds of a binary event (i.e. log(p/(1-p))) as a linear combination of our chosen predictor variables (with some arithmetic, we can then convert things so we are estimating raw probabilites, from 0 to 1) (Gelman & Hill (2006), Chapter 5; Logistic regression 2019). Normally, our focus would be on interpreting the coefficients of the individual predictor variables in this resulting model – for insight into what sorts of factors explains the occurrence of our binary outcome. These are shown below (with 95% CI’s) in decreasing order.

However, our focus in this context is in simply using this entire model to make new out of sample predictions. (Our entire model is basically an “equation”, where we can plug in values for each feature, multiple that value by the coefficient from the equation and generate a probability estimate of a statement being a lie. We can then use this probability estimate to make a prediction about whether a statement is a lie; i.e. if predicted probability of lie is greater than 50%, predict lie, otherwise predict truth. This is what we will do below, 2504 times for each of the 2504 statements in the testing set).

# use ggcoef from GGally, that allows for nice plotting of coeffs
ggcoef(model_ex$finalModel,
       sort = "decending")

Evaluate Model (on Testing Set)

Finally, let’s see if our model is any good. To do this, we will use it to make predictions about the remaining 2,504 statments in the test set, which we set aside earlier. This is done below. The confusionMatrix function from the caret package provides an easy way to collect some basic statistics on how our model performed. As we can see from the text output of this function, our model did pretty well (Kuhn, 2008). Its overall accuracy was significantly better than chance: 60.7% [95% CI: 58.8, 62.6%]. And it performed well both in identifying truths (i.e. sensitivity: 58.2%) and identifying lies (i.e. specificity: 63.3%). When it made a prediction that a statement was a truth, it was correct more often than not (i.e. precision or positive predictive value: 61.3%). And when it made a prediction that a statement was a lie, it was also correct more often than not (i.e. negative predictive value: 60.2%). (Confidence intervals can easily be generated for these other four statistics as well (i.e. +/- z*(sqrt(p*(1-p)/n), where z = 1.96 under the normal approximation method for calculating binomial proportion confidence intervals (Binomial proportion confidence interval, 2019); I won’t calculate these for this example, but I will do so below in our full analysis.)

(Note: some of the point statistics referenced throughout ths analys)

# make predictions
preds_ex <-
  predict(object = model_ex,
          newdata = test_set_ex,
          type = "raw")
      
# record model performance
conf_ex <-
  confusionMatrix(data = preds_ex,
                  reference = test_set_ex$grd_truth,
                  positive = "truth")
      
# print confusion matrix
conf_ex
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction lie truth
##      lie   791   523
##      truth 459   727
##                                           
##                Accuracy : 0.6072          
##                  95% CI : (0.5877, 0.6264)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.2144          
##  Mcnemar's Test P-Value : 0.04439         
##                                           
##             Sensitivity : 0.5816          
##             Specificity : 0.6328          
##          Pos Pred Value : 0.6130          
##          Neg Pred Value : 0.6020          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2908          
##    Detection Prevalence : 0.4744          
##       Balanced Accuracy : 0.6072          
##                                           
##        'Positive' Class : truth           
## 

FULL (Predictive Logistic Regression Models)

Our full analysis will almost exactly replicate what we did in our example case above, except we will replicate the procedure ten times. Thus, we will build 10 different logistic regression models using 10 different training sets and evaluate them on their 10 different (corresponding) test sets.

Run 10 models

Below is the code that runs through this modeling process 10 different times and saves the result from each round.

# # -----------------------------------------------------------------------------
# STEP 0: set seed, so that statistics don't keep changing for every analysis
set.seed(2019)

# # -----------------------------------------------------------------------------
# STEP 1: decide how many times to run the model
rounds <- 10

# -----------------------------------------------------------------------------
# STEP 2: set up object to store results
# part a: create names of results to store
result_cols <- c("model_type", "round", "accuracy", "accuracy_LL", "accuracy_UL",
                 "sensitivity", "specificity", "precision", "npv", "n")

# part b: create matrix
results <-
  matrix(nrow = rounds,
         ncol = length(result_cols))

# part c: actually name columns in results marix
colnames(results) <- result_cols

# part d: convert to df (so multiple variables of different types can be stored)
results <- data.frame(results)

# -----------------------------------------------------------------------------
# STEP 2: start timer
start_time <- Sys.time()

# -----------------------------------------------------------------------------
# STEP 3: create rounds number of models, and store results each time

for (i in 1:rounds){
  
  # part a: partition data in 50-50 lgocv split (create index for test set)
  index_train <- 
    createDataPartition(y = stats_proc$stat_id,
                        p = 0.50,
                        times = 1,
                        list = FALSE)
  
  # part b: create testing and training data sets
  train_set <- stats_proc[index_train, ]
  test_set <- stats_proc[-index_train, ]
  
  
  # part c: use caret "train" function to train logistic regression model
  model <- 
    train(form = grd_truth ~ . - stat_id,
          data = train_set,
          method = "glm",
          family = "binomial")
  
  # part d: make predictions
  preds <-
    predict(object = model,
            newdata = test_set,
            type = "raw")
  
  # part e: store model performance
  conf_m <-
    confusionMatrix(data = preds,
                    reference = test_set$grd_truth,
                    positive = "truth")
  
  # part f: store model results
  # model type
  results[i, 1] <- "logistic"
  # round
  results[i, 2] <- i
  # accuracy
  results[i, 3] <- conf_m$overall[1]
  # accuracy LL
  results[i, 4] <- conf_m$overall[3]
  # accuracy UL
  results[i, 5] <- conf_m$overall[4]
  # sensitivity
  results[i, 6] <- conf_m$byClass[1]
  # specificity
  results[i, 7] <- conf_m$byClass[2]
  # precision
  results[i, 8] <- conf_m$byClass[3]
  # negative predictive value
  results[i, 9] <- conf_m$byClass[4]
  # sample size (of test set)
  results[i, 10] <- sum(conf_m$table)
  
  # part g: print round and total elapsed time so far
  cumul_time <- difftime(Sys.time(), start_time, units = "mins")
  print(paste("round #", i, ": cumulative time ", round(cumul_time, 2), " mins",
              sep = ""))
  print("--------------------------------------")

}
## [1] "round #1: cumulative time 0.13 mins"
## [1] "--------------------------------------"
## [1] "round #2: cumulative time 0.25 mins"
## [1] "--------------------------------------"
## [1] "round #3: cumulative time 0.36 mins"
## [1] "--------------------------------------"
## [1] "round #4: cumulative time 0.48 mins"
## [1] "--------------------------------------"
## [1] "round #5: cumulative time 0.59 mins"
## [1] "--------------------------------------"
## [1] "round #6: cumulative time 0.71 mins"
## [1] "--------------------------------------"
## [1] "round #7: cumulative time 0.83 mins"
## [1] "--------------------------------------"
## [1] "round #8: cumulative time 0.95 mins"
## [1] "--------------------------------------"
## [1] "round #9: cumulative time 1.08 mins"
## [1] "--------------------------------------"
## [1] "round #10: cumulative time 1.19 mins"
## [1] "--------------------------------------"

View Results (Tabular)

Below, I’ve displayed a raw tabular summary of the results from each of the 10 models. As we can see, the results vary somewhat from model to model (e.g. our first model had an overall accuracy of 60.7%, while our second model had an overall accuracy of 60.2%), although are highly consistent (the variation in our overall performance of our best peforming model (round 6: 61.5%) and our worst performing model (round 9: 59.4%) is less than 3%).

results

View Results (Graphically)

Let’s visualize average performance across our 10 different models, on some of the key performance metrics. This is done below. As we can see, over 10 models, overall accuracy is above chance (with mean performance hovering just below 60%, and even the lower limit of the confidence interval on this estimate well above 55%). Similarly, the models performed above chance when predicting make predictions about statements that were truths and when making predictions about statements that were lies (confidence intervals for both sensitivity and specificity well above 50%). And the models were also more reliable than chance when making a prediction that a statment was a truth and when making a prediction that a statement was a lie (confidence intervals for precision and npv above 50%). These results are promising. They reveal that even basic textual features allow for deciphering of lies from truth.

# calculate average sample size
mean_n <- mean(results$n)

# create df to use for visualization
results_viz <-
  results %>%
  group_by(model_type) %>%
  summarize(accuracy = mean(accuracy),
            sensitivity = mean(sensitivity),
            specificity = mean(specificity),
            precision = mean(precision),
            npv = mean(npv)) %>%
  select(-model_type) %>%
  gather(key = "perf_stat",
         value = "value") %>%
  mutate(value = as.numeric(value))

# actual visualization
ggplot(data = results_viz,
  aes(x = perf_stat,
           y = value)) +
geom_point(size = 2,
           color = "#545EDF") +
geom_errorbar(aes(ymin = (value - 1.96*sqrt(value*(1-value)/mean_n)),
                   ymax = (value + 1.96*sqrt(value*(1-value)/mean_n))),
              color = "#545EDF",
              width = 0.15,
              size = 1.25) +
geom_hline(yintercept = 0.5,
           linetype = "dashed",
           size = 0.5,
           color = "red") +
scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.05),
                   limits = c(0, 1)) +
scale_x_discrete(limits = rev(c("accuracy", "sensitivity", "specificity", 
                            "precision", "npv"))) + 
coord_flip() +
theme(panel.grid.major.x = element_line(color = "grey",
                                        size = 0.25),
      panel.grid.minor.x = element_blank(),
      panel.grid.major.y = element_blank(),
      panel.background = element_blank(),
      axis.ticks = element_blank(),
      plot.title = element_text(hjust = 0.5),
      axis.title.y = element_text(margin = 
                                    margin(t = 0, r = 10, b = 0, l = 0)),
      axis.title.x = element_text(margin = 
                                    margin(t = 10, r = 00, b = 0, l = 0)),
      axis.text.x = element_text(angle = 90)) +
labs(title = "Performance Statistics (Logistic Regression)",
     x = "Performance Statistic",
     y = "Proportion (0 to 1)")

Save Results

# rename results df, to be particular to this model type (for disambiguation later)
results_log <- results

# clear results variable
rm(results)

# save results in Rda file
save(results_log,
     file = "results_log.Rda")

OTHER ANALYSES

Finally, out of curiosity, I wanted to conduct one other analysis. I wanted to examine how the predictive performance of the models was affected by what percentage of the overall data it was trained on. In the analyses above (and for the other models we will build) we relied on repeated training/testing splits, where we split the dataset exaclty in half each time (50-50 split). However, I was curious to know if performance might vary depending on how we conducted this split. It was my suspicion that there would be diminishing returns to the amount of data that the model was trained on. (That is, after a certain point, the models wouldn’t perform better if given a larger portion of the overall data to train on.) To do this, I evaluated performance under differing splitting strategies, ranging from splits as as extreme as 1% of data used for traing and 99% for testing, to 50%-50% splits, up until splits of 90% for training and 10% for testing.

Load Data Used In These Analyses

I actually conducted this analysis in earlier iteration of preparing this report, before I had fully pre-processed and cleaned the data. Thus, this analysis was conducted on the raw uncleaned features. At the point I conducted this analysis, I was also curious to separate out performance by different features (e.g. training models just on the sentiment features, or just on the parts of speech, etc). Thus, to begin this analysis, I need to load in the raw data (from an older raw data file) as well as the raw data for each of the individual feature sets.

# load all the nice tidy df's of features we created (remember stats_words has multiple dtm's)
load("stats_all.Rda")

# load individual feature dfs (for training individual models)
load("stats_clean.Rda")
load("stats_length.Rda")
load("stats_pos.Rda")
load("stats_sent.Rda")
load("stats_complex.Rda")
load("stats_words.Rda")

# so, load the results I generated and saved when I ran these analyses earlier
# (note that that this should have a different "results" df, which we can use
# instead of halving to regenerate the 960 models created just below)
load("log_results1.Rda")

Build Models

My analytic plan was to run through the following training-test splits:

Training Proportion Testing Proportion
1% 99%
2% 98%
5% 95%
10% 90%
20% 80%
30% 70%
40% 60%
50% 50%
60% 40%
70% 30%
80% 20%
90% 10%

As above, for each of the splits, I would repeat the splitting 10 different times, training and testing 10 different models. I would also do these for each of the features sets:

  • Statement Length
  • Parts of Speech
  • Sentiment
  • Readability & Complexity](./hld_FEATURE_complex.html)
  • Bag of Words
  • top 10 words
  • top 25 words
  • top 50 words
  • top 100 words

(And for the bag of words, I would also train models across different numbers of the top words: models only taking into account the top 10 words, the top 25 words, top 50 words, and top 100 words.)

Below is the actual code to run and store the results of these models. (In total, this involved training 960 seperate logistic regression models, which took about half an hour to run on my computer.)

# NOTE: warnings are turned off

# -----------------------------------------------------------------------------
# STEP 1: split probabilities to loop through
split_probs <- c(0.01, 0.02, 0.05, seq(from = 0.10, to = 0.90, by = 0.10))

# -----------------------------------------------------------------------------
# STEP 2: decide how many times to run each model
rounds <- 10

# -----------------------------------------------------------------------------
# STEP 3: create list of all df's to look through
feature_sets <- list(stats_dtm_10, stats_dtm_25, stats_dtm_50, stats_dtm_100,
                     stats_length, stats_pos, stats_sent, stats_complex)
feature_names <- list("top 10 words", "top 25 words", "top 50 words", "top 100 words",
                      "length", "parts of speech", "sentiment", "readability")
feature_sets <- list(feature_sets, feature_names)
num_feature_sets <- length(feature_sets[[1]])

# -----------------------------------------------------------------------------
# STEP 4: set up object to store results
# part a: create matrix
results <-
  matrix(nrow = num_feature_sets * length(split_probs) * rounds,
         ncol = 7)
# part b: name columns
colnames(results) <- c("feature_set", "split", "round", "accuracy", "sensitivity", "specificity", "precision")
# part c: convert to df (so multiple variables of different types can be stored)
results <- data.frame(results)

# -----------------------------------------------------------------------------
# STEP 5: build models
# part a: initialize counter
counter = 0
# part b: set up timer
start_time <- Sys.time()
# part c: loop through each feature set
for (i in 1:num_feature_sets){
  
  # store current feature set
  feature_set_i <- feature_sets[[1]][[i]]
  
  # house-keeping: attach ground truth data to feature set (if feature set does not have it)
  if(!("grd_truth" %in% colnames(feature_set_i))){
    feature_set_i <-
      feature_set_i %>%
        mutate(stat_id = as.integer(stat_id)) %>%
        left_join(y = (stats_all %>% # NOTE "stats_all" may be old variable name
                         select(stat_id,
                                grd_truth)),
                  by = "stat_id")
  }
# part d: loop through all training split probabilities
  for (split_i in split_probs) {
    
# part e: loop through each training split probability, round number of times
    for (round_i in 1:rounds){
      
      # increment counter
      counter = counter + 1
      
      # record current feature set
      results[counter, 1] <- feature_sets[[2]][[i]]
      
      # record current split
      results[counter, 2] <- split_i
      
      # record current round
      results[counter, 3] <- round_i
      
      # create partition
      index_i <- createDataPartition(y = feature_set_i$grd_truth,
                                     p = split_i,
                                     list = FALSE)
      
      # create training and test set
      train_set <- feature_set_i[index_i, ]
      test_set <- feature_set_i[-index_i, ]
      
      # make model
      model_i <-
        train(form = grd_truth ~ . - stat_id,
              data = train_set,
              method = "glm",
              family = "binomial")
      
      # make predictions
      model_preds <-
        predict(object = model_i,
                newdata = test_set,
                type = "raw")
      
      # record model performance
      conf_i <-
        confusionMatrix(data = model_preds,
                        reference = test_set$grd_truth,
                        positive = "truth")
      
      # record accuracy
      results[counter, 4] <- conf_i$overall[1]
      
      # get sensitivity
      results[counter, 5] <- conf_i$byClass[1]
      
      # get specificity
      results[counter, 6] <- conf_i$byClass[2]
      
      # get precision
      results[counter, 7] <- conf_i$byClass[3]
      
      # print progress
      print(paste("iteration: ", counter, sep = ""))
    }
  }
}

# part f: record total time
total_time <- Sys.time() - start_time

Results

Now let’s examine how performance various across these different training-testing splits and different features sets.

Results (Collect Summary Statistics)

First, I averaged together the results of the 10 models for each of the subsets of the analyses (e.g. in the first first row, we can see that the overall accuracy across 10 different models, trained only on the statement length features, with a 1% training set, 99% test set split is 50.2%). It will be easier to assess these results visually.

# get average for each round
results_summ <-
  results %>%
    # dplyr::mutate(split = as.factor(split)) %>%
    dplyr::group_by(feature_set, split) %>%
    dplyr::summarise(avg_accuracy = mean(accuracy),
                     min_accuracy = min(accuracy),
                     max_accuracy = max(accuracy),
                     avg_sensitivity = mean(sensitivity),
                     min_sensitivity = min(sensitivity),
                     max_sensitivity = max(sensitivity),
                     avg_specificity = mean(specificity),
                     min_specificity = min(specificity),
                     max_specificity = max(specificity),
                     avg_precision = mean(precision),
                     min_precision = min(precision),
                     max_precision = max(precision)) %>%
  ungroup() %>%
  mutate(feature_set = factor(feature_set,
                              levels = c("top 10 words",
                                         "top 25 words",
                                         "top 50 words",
                                         "top 100 words",
                                         "length",
                                         "readability",
                                         "parts of speech",
                                         "sentiment")))
# look over results
results_summ

Results (Overall Accuracy)

Below, I plot overall accuracy as a function of training-testing split proportion (along the x-axis) and features the model was training on (the seperate lines of different colors). What we can see from these results is that there are some diminishing results of including a larger proportion of our data in the training set (accuracy does not continue to increase linearly, but rather tapers off as test set proportion increases in size.) Among the features, across all the different training-testing splits, the one which seem to provide the best peformance is the top 100 words. The worst seems to be tie between the statement length features and the sentiment features. (Note that the “error bars” are actually not 95% confidence intervals, but bounds of best and worst performance among the 10 different models.)

# set colors
color_map <- c("top 10 words" = "#91C68D",
               "top 25 words" = "#55C66E",
               "top 50 words" = "#45B731",
               "top 100 words" = "#166B28",
               "length" = "#CB4154",
               "readability" = "#1D588E",
               "parts of speech" = "#ED721A",
               "sentiment" = "#EDD41A")

# recreated split_probs variale here (created in non-evaluted chunk, but used for graph)
split_probs <- c(0.01, 0.02, 0.05, seq(from = 0.10, to = 0.90, by = 0.10))

# print plot
ggplot(data = results_summ,
       aes(x = round(split * 100, 1),
           y = round(avg_accuracy * 100, 1),
           color = feature_set)) +
  geom_point(size = 2) +
  geom_line(size = 0.5) +
  geom_errorbar(aes(ymin = round(min_accuracy * 100, 1),
                  ymax = round(max_accuracy * 100, 1)),
                alpha = 0.5,
                width = 0) +
  geom_hline(yintercept = 50,
             color = "grey",
             linetype = "dotted",
             size = 1) +
  scale_x_continuous(breaks = (split_probs * 100)) +
  scale_y_continuous(breaks = seq(from = 47, to = 65, by = 1)) +
  scale_color_manual(values = color_map) +
  labs(x = "Percent of Data Used for Training",
       y = "Accuracy (avg. of 10 models)",
       title = "Accuracy v. Amount of Training, by Features in Model",
       color = "Features") +
  theme_solarized() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45),
        panel.grid.minor = element_blank(),
        legend.key = element_rect(color = "transparent", fill = "transparent"),
        axis.title.y = element_text(margin = margin(t = 0, r = 8, b = 0, l = 0)))