[return to overview page]

Now that we have extracted various features (sentiment, parts of speech, individual word counts) and these features have been cleaned, we are ready to use them to build statistical models that make binary predictions about whether individual statements are truths or lies. Because these models will differ from each other (e.g. logistic regression v. neural nets), it is important to identify common criteria by which the performance of these different models can be assessed and compared. In this section, my aim is to provide an overview of how we will assess model performance.

At a very high level, our modeling process will have two basic components. First, we build the model (“model training”). Second, we assess its performance (“model testing”).

(The model should be trained on separate data from the data it is tested on. In the final part of this section, “Data Splitting, for Model Training and Testing”, we will review various strategies for splitting a dataset into portions for testing and training. For now, assume we have done this correctly and are at the model “testing” phase.)

The setup for model testing will always be the same. We will have a set of statements (from which various textual features have been extracted): our “testing set”. For each statement in this testing set, the model will be fed the values of that statement’s features (e.g. statement length, sentiment score), and it will output a prediction. We can then compare these predictions to reality. Most obviously, we can look at overall accuracy (# correct predictions / # of predictions). But other important metrics exist as well: sensitivity, specificity, precision, and negative predictive value. I review these below, using example data.

(Note, this is all occurring in the domain of “binary classification”, and “classification” more generally. Models that perform classification aim to predict the membership of items in discrete categories or “classes”. Here, we are predicting whether a statement belongs in the class “lies” or the class “truths”. Other binary classification tasks might include predicting whether a person will or will not default on a loan based on their previous financial history. Not all classification tasks are binary; for example, one might try to build a model that takes in images of animals and uses features extracted from those images to predict what species of animal is in each picture, e.g. dog, cat, elephant. And finally, an entirely different domain of prediction involves making predictions about continuous quantities, e.g. predicting someone’s expected salary based on features like their education and previous employment history. Different metrics, like root mean squared error, are used to assess the performance of models that make predictions about continuous quantities. These are not covered here.)

(Interestingly, the metrics we will review for assessing model performance in our binary classification task can (and will) also be used to quantify human lie detection performance.)

As usual, I will start with the housekeeping matter of loading the packages that will be used in this section.

```
# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization
library(ggthemes) # visualization
```

We will focus on five key performance statistics (overall accuracy, sensitivity, specificity, precision, and negative predictive value). All of these statistics can be derived from a table that is central to evaluating performance in binary classification. This table is called a “confusion matrix” and is a 2x2 table that cross-tabulates predictions and actual outcomes (I use the terms “actual outcomes” and “reality” interchangeably throughout).

For any binary classification task, we can imagine we are trying to identify one of the two classes, whose members we refer to as “positive” cases. For example, in our present task, we might take the perspective that we are trying to identify which statements are true statements and call these our positive cases (making lies our “negative” cases). Thus, when we compare predictions to reality in binary classification, there are only four possible outcomes: a positive case is correctly identified as positive (a “true positive”), a positive case is incorrectly identified as negative (a “false negative”), a negative case is correctly identified as negative (a “true negative”), and a negative case is incorrectly identified as positive (a “false positive”). A confusion matrix simply counts up the number of times each of these four types of events happens.

A generic example is shown below of how this confusion matrix would be filled out once predictions are made on a set of statements in our dataset.

| . | Prediction = Truth | Prediction = Lie |
|---|---|---|
| Reality = Truth | # True Positives | # False Negatives |
| Reality = Lie | # False Positives | # True Negatives |

To make things a little more concrete, imagine that we had a set of 100 statements, 50 of which are true, and 50 of which are lies. And we made a prediction about each of them, with the end result being 40 true statements correctly identified as true, 10 true statements incorrectly identified as lies, 30 lies correctly identified as lies, and 20 lies incorrectly identified as truths. Our resultant confusion matrix would look like the following.

| . | Prediction = Truth | Prediction = Lie |
|---|---|---|
| Reality = Truth | 40 | 10 |
| Reality = Lie | 20 | 30 |
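To keep the worked example handy, we can also enter these four counts into R as a small matrix (a minimal sketch; the object name `conf_mat` is just an illustrative choice):

```
# confusion matrix for the worked example:
# rows = reality, columns = prediction
conf_mat <- matrix(c(40, 10,
                     20, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Reality = c("Truth", "Lie"),
                                   Prediction = c("Truth", "Lie")))
conf_mat
```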

We can also display this information visually, and use this to walk through the binary classification performance metrics.

To do this, first, let’s generate some example data (with the same distribution as above: 40 true positives, 10 false negatives, 30 true negatives, and 20 false positives).

```
# make matrix of the right shape
example <-
  matrix(ncol = 4,
         nrow = 10 * 10)
# name columns
colnames(example) <- c("x_cord", "y_cord", "Reality", "Prediction")
# convert to data frame
example <- data.frame(example)
# fill in values
counter <- 0
for (y in 1:10) {
  for (x in 1:10) {
    counter <- counter + 1
    example[counter, 1] <- x
    example[counter, 2] <- y
    if (y <= 5) {
      if (x <= 8) {
        example[counter, 3:4] <- c("Truth", "Truth")
      } else {
        example[counter, 3:4] <- c("Truth", "Lie")
      }
    } else {
      if (x <= 6) {
        example[counter, 3:4] <- c("Lie", "Lie")
      } else {
        example[counter, 3:4] <- c("Lie", "Truth")
      }
    }
  }
}
# make variable to denote correct and incorrect responses
example <-
  example %>%
  mutate(correct = case_when(Reality == Prediction ~ "Correct",
                             Reality != Prediction ~ "Incorrect"))
# print resulting df (turned off for knitting)
# example
```

In the visualization below, each of the 100 statements is represented by an individual box (numbered in the upper right-hand corner from 1 to 100). The color of the box represents the underlying “reality” of each statement (green boxes represent truthful statements, and red boxes represent lies). The colored circles “within” these boxes represent our predictions for each statement (green circles represent statements predicted to be truthful, and red circles represent statements predicted to be lies). We can compare reality to our predictions by comparing the color of each box to the circle within it. If the colors match (e.g. a green box with a green circle in it), our prediction was correct; if the colors don’t match (e.g. a green box with a red circle in it), our prediction was wrong. To make this clear, an X is placed through a box if our prediction was incorrect, and a plus sign is placed through a box if our prediction was correct. Let’s now walk through the key performance metrics.

```
# make plot
pred_plot <-
  ggplot(data = example,
         aes(x = x_cord,
             y = y_cord)) +
  geom_point(aes(color = Reality),
             shape = 15, # filled square
             size = 11) +
  scale_color_manual(values = c("#de2d26", "#2ca25f")) + # custom square colors
  geom_point(aes(fill = Prediction),
             shape = 21, # outlined circle
             color = "transparent",
             size = 5) +
  scale_fill_manual(values = c("#930000", "#276b1c")) + # custom circle fills
  geom_point(aes(shape = correct),
             size = 11,
             color = "white",
             position = position_nudge(x = 0, y = -0)) + # x = 0.1, y = -0.2
  scale_shape_manual(values = c(3, 4)) + # plus sign (correct), X (incorrect)
  geom_text(aes(label = 1:100),
            size = 3,
            hjust = 1,
            vjust = -0.5,
            color = "white") +
  labs(title = "Reality v. Predictions",
       y = "",
       x = "") +
  scale_y_continuous(breaks = seq(from = 10, to = 0, by = -1),
                     trans = "reverse") +
  scale_x_continuous(breaks = seq(from = 0, to = 10, by = 1),
                     position = "top") +
  guides(color = guide_legend(order = 1),
         fill = guide_legend(order = 2),
         shape = guide_legend(order = 3,
                              title = NULL,
                              label = FALSE)) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.key = element_rect(color = "transparent", fill = "transparent"),
        legend.position = "right",
        legend.title.align = 0.5,
        legend.box.margin = margin(t = 0, b = 0, r = 10, l = 0))
# print base plot
pred_plot
```

Let’s start with the most basic metric of performance – overall accuracy. In our diagram, overall accuracy is the sum of all our correct predictions (the boxes with a plus sign) divided by our total number of predictions (the boxes with a plus sign plus the boxes with an X). Here that gives us 70% overall accuracy.
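As a quick sanity check on the worked example, we can compute overall accuracy directly from the four counts (the variable names `tp`, `fn`, `fp`, and `tn` are illustrative):

```
tp <- 40 # true positives: truths predicted to be true
fn <- 10 # false negatives: truths predicted to be lies
fp <- 20 # false positives: lies predicted to be true
tn <- 30 # true negatives: lies predicted to be lies
accuracy <- (tp + tn) / (tp + tn + fp + fn)
accuracy # 0.7
```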

Mathematically, overall accuracy can also be thought of as:

\[\frac{\sum_{}^{}TRUEpositive + \sum_{}^{}TRUEnegative}{\sum_{}^{}TRUEpositive + \sum_{}^{}TRUEnegative + \sum_{}^{}FALSEpositive + \sum_{}^{}FALSEnegative}\]

For our example data, this calculation is also depicted in the figure below. The total number of boxes within the border of the solid black lines corresponds to the numerator in the equation above (70), and the total number of boxes within the border of the dotted black lines corresponds to the denominator (100).

```
# print plot (with overall accuracy highlighted)
pred_plot +
  geom_rect(aes(xmin = 0,
                xmax = 11,
                ymin = 0,
                ymax = 11),
            fill = "transparent",
            color = "black",
            linetype = "dotted") +
  geom_rect(aes(xmin = 0.25,
                xmax = 8.5,
                ymin = 0.25,
                ymax = 5.4),
            fill = "transparent",
            color = "black",
            linetype = "solid") +
  geom_rect(aes(xmin = 0.25,
                xmax = 6.5,
                ymin = 5.5,
                ymax = 10.8),
            fill = "transparent",
            color = "black",
            linetype = "solid")
```

Another useful metric by which classification models are assessed is sensitivity (also called “recall”). Sensitivity is the percent of all “positive” outcomes that we actually detect. Here is an example. When we put our luggage through the x-ray machine at the airport, we can imagine that the machine is trying to detect whether our luggage has a gun in it or not (presence of a gun is a “positive” case). Sensitivity is the percent of all guns in suitcases that the x-ray machine actually finds. Such a machine could have very good overall accuracy simply by guessing that no one has a gun in their suitcase – because the vast majority of people don’t have guns in their suitcases. However, such a model would have poor sensitivity (in fact, 0% sensitivity) because it would detect 0% of the suitcases that do have guns in them.

In our case, what sensitivity means depends on how we define our “positive” cases. If our task is “truth detection” (i.e. truths are “positive” cases), then sensitivity is the percent of all true statements that we correctly predict to be true. In the figure, this is the sum of the green boxes with a green circle in them, divided by all the green boxes (both those with a green circle in them and those with a red circle in them). Here, that gives us 80%. (If our task were “lie detection” (i.e. lies were considered positive cases), then sensitivity would be the percent of all lies that we correctly identify as lies (red boxes with a red dot in them, divided by all red boxes). Unless stated otherwise, in this report, truths will always be considered “positive cases”, and thus sensitivity will correspond to a task of “truth detection”.)

Thus, in this particular case, we can think of sensitivity as the “truth detection rate” – the percent of all true statements that we correctly identify as true.

Mathematically, sensitivity is:

\[\frac{\sum_{}^{}TRUEpositive}{\sum_{}^{}TRUEpositive + \sum_{}^{}FALSEnegative}\]

Again, for our example data, this calculation is also depicted in the figure. The total number of boxes within the border of the solid black lines corresponds to the numerator in the equation above (40), and the total number of boxes within the border of the dotted black lines corresponds to the denominator (50).
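In code, the same calculation for our worked example looks like this (illustrative variable names):

```
tp <- 40 # truths correctly predicted to be true
fn <- 10 # truths incorrectly predicted to be lies
sensitivity <- tp / (tp + fn)
sensitivity # 0.8
```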

```
# print plot (with sensitivity highlighted)
pred_plot +
  geom_rect(aes(xmin = 0,
                xmax = 11,
                ymin = 0,
                ymax = 5.5),
            fill = "transparent",
            color = "black",
            linetype = "dotted") +
  geom_rect(aes(xmin = 0.25,
                xmax = 8.5,
                ymin = 0.25,
                ymax = 5.4),
            fill = "transparent",
            color = "black",
            linetype = "solid")
```

Specificity is, in a sense, the other side of the coin from sensitivity. Specificity is the percent of all “negative” cases that a model correctly identifies. To continue with the x-ray machine example from above (where the task is “gun detection” in suitcases), specificity is the percent of all suitcases without a gun that we correctly identify as not having a gun. Again, if we turn off the x-ray machine and simply predict that no one has a gun, this “model” would have great (in fact, perfect, 100%) specificity. (Of course, to the detriment of sensitivity. Without perfect information, there is a tradeoff between sensitivity and specificity. We are always between the extremes of being overbroad and identifying everything as a positive case, thus achieving high sensitivity at the cost of low specificity, and being overly conservative, erring on the side of not identifying anything as a positive case, thus achieving high specificity at the expense of low sensitivity. This tradeoff can be depicted with the use of ROC curves, although I will not go into that here.)

In our case, if our task is “truth detection” (identifying true statements), then specificity is the proportion of all lies that we correctly identify as lies. Thus, here we can think of specificity as the “lie detection rate” of our model. For the example data of 100 statements we have been working with, this would be 60% (the red boxes with a red dot, divided by all the red boxes).

Mathematically, specificity is:

\[\frac{\sum_{}^{}TRUEnegative}{\sum_{}^{}TRUEnegative + \sum_{}^{}FALSEpositive}\]

Again, for our example data, this calculation is also depicted in the figure. The total number of boxes within the border of the solid black lines corresponds to the numerator in the equation above (30), and the total number of boxes within the border of the dotted black lines corresponds to the denominator (50).
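The corresponding calculation for our worked example (illustrative variable names):

```
tn <- 30 # lies correctly predicted to be lies
fp <- 20 # lies incorrectly predicted to be true
specificity <- tn / (tn + fp)
specificity # 0.6
```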

```
# print plot (with specificity highlighted)
pred_plot +
  geom_rect(aes(xmin = 0,
                xmax = 11,
                ymin = 5.5,
                ymax = 11),
            fill = "transparent",
            color = "black",
            linetype = "dotted") +
  geom_rect(aes(xmin = 0.25,
                xmax = 6.5,
                ymin = 5.6,
                ymax = 10.8),
            fill = "transparent",
            color = "black",
            linetype = "solid")
```

In a sense, the previous metrics focused on outcomes. That is, we looked at metrics that assessed how well our model could identify positive cases (sensitivity, e.g. “truths”) or how well our model could identify negative cases (specificity, e.g. “lies”). These next two metrics are focused more on predictions. The first such metric I will explain is “precision” (also called positive predictive value). Most simply, this can be thought of as the percent of times that a model is correct when it predicts a “positive” outcome. To continue with the x-ray machine example, it is the percentage of times that a suitcase actually has a gun, given that the machine beeps (i.e. predicts the suitcase has a gun).

In our case, assuming our task is “truth detection”, it is the percent of the time that a statement is actually true, given that we’ve predicted the statement is true. It can be thought of as measuring a model’s “non-gullibility”: if someone is not gullible, then they have high precision (a large proportion of the statements they predict to be true are in fact true). In the figure below for our example data, precision is the sum of all green boxes with a green circle in them divided by all boxes with a green circle in them (66.7%).

Mathematically, precision is:

\[\frac{\sum_{}^{}TRUEpositive}{\sum_{}^{}TRUEpositive + \sum_{}^{}FALSEpositive}\]

Again, for our example data, this calculation is also depicted in the figure. The total number of boxes within the border of the solid black lines corresponds to the numerator in the equation above (40), and the total number of boxes within the border of the dotted black lines corresponds to the denominator (60).
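And the corresponding calculation for our worked example (illustrative variable names):

```
tp <- 40 # truths correctly predicted to be true
fp <- 20 # lies incorrectly predicted to be true
precision <- tp / (tp + fp)
round(precision, 3) # 0.667
```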

```
# print plot (with precision highlighted)
pred_plot +
  geom_rect(aes(xmin = 0,
                xmax = 8.5,
                ymin = 0,
                ymax = 5.5),
            fill = "transparent",
            color = "black",
            linetype = "dotted") +
  geom_rect(aes(xmin = 6.5,
                xmax = 11,
                ymin = 5.5,
                ymax = 11),
            fill = "transparent",
            color = "black",
            linetype = "dotted") +
  geom_rect(aes(xmin = 0.25,
                xmax = 8.4,
                ymin = 0.25,
                ymax = 5.4),
            fill = "transparent",
            color = "black",
            linetype = "solid")
```