
The next feature I will extract is a proxy for each statement’s readability, or linguistic complexity. This is another textual feature that may give us some clue about whether a person is lying. Vrij, Fisher, Mann, & Leal (2006) marshal convincing evidence that lying is mentally taxing (i.e. that it results in greater “cognitive load”). For example, participants in mock interrogations report that lying is more mentally taxing than telling the truth; lying is also associated with greater activation of “executive control” areas of the brain, like the prefrontal cortex. As a result of this mental strain, lying might be accompanied by speech that is simpler and less complex. Indeed, Newman, Pennebaker, Berry, & Richards (2003) present evidence that, when lying, people use fewer “exclusive words” (e.g. but, except), which they take to indicate that liars speak in less sophisticated ways (e.g. making fewer of the nuanced, qualified points that often accompany exclusive words like “but” and “except”). Pérez-Rosas & Mihalcea (2015) also find that various metrics for quantifying the complexity of a piece of text are useful for predicting whether a statement is a truth or a lie. Thus, we will extract this feature from our statements.

Packages

Again, I will start by loading relevant packages.

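library(tidyverse) # cleaning and visualization
library(quanteda) # text analysis
# Note: from quanteda 3.0 onward, textstat_readability() lives in the separate
# quanteda.textstats package, so newer installs also need library(quanteda.textstats)
library(ggthemes) # plot themes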

Load Data

I will load the most recent version of the cleaned statements, which comes from Feature Extraction.
(Note, we created a more recent object, recording the sentiment of each statement. However, we will not be using that object right now.)

# this loads: stats_clean (a data frame of our cleaned statements)
load("stats_clean.Rda")

Readability

There are various methods that attempt to quantify the extent to which a piece of text is easily “readable”. One category of methods tries to measure the complexity of the “content” of the text (e.g. the complexity of the words used). Another category focuses on visual features of the text (Readability, 2018, from Wikipedia). Here we are not concerned with the visual ease of reading text, so I will focus on the former category. Specifically, I will extract the following two popular and fairly crude readability metrics.

These assign a numeric score to pieces of text, based on simple text statistics that aim to approximate the extent to which a text is difficult to read.

Flesch Reading Ease

The Flesch reading ease metric was developed by Rudolf Flesch in 1948; the related Flesch-Kincaid grade-level formula was developed in 1975 under contract to the U.S. Navy. The metric assigns a readability score to a piece of text based on a formula (explained below); that readability score can then be translated to a school “grade level”, to which the writing roughly corresponds (Flesch-Kincaid readability tests, 2018).

Formula

The formula used to compute this metric is shown below. (The formula and table below are taken directly from Wikipedia (Flesch-Kincaid readability tests, 2018)). As we can see, essentially the formula assumes a piece of text is more complex if it has more words per sentence and more syllables per word (the rest is just weighting and adjustment).
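For reference, the standard Flesch reading ease formula, as given in the cited Wikipedia article, can be written as follows. The term labels are descriptive names rather than symbols defined elsewhere in this post.

```latex
\text{Flesch reading ease} = 206.835
  - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right)
  - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)
```

Both ratios enter with a negative sign, so longer sentences and longer words each pull the score down.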

Baselines

And here is a table showing various Flesch readability scores and the grade level those scores are supposed to correspond to. Higher scores indicate more readability (note that scores can go above or below the values presented in the table).

| Score | Grade Level | Description |
|---|---|---|
| 100.00-90.00 | 5th grade | Very easy to read. Easily understood by an average 11-year-old student. |
| 90.0-80.0 | 6th grade | Easy to read. Conversational English for consumers. |
| 80.0-70.0 | 7th grade | Fairly easy to read. |
| 70.0-60.0 | 8th & 9th grade | Plain English. Easily understood by 13- to 15-year-old students. |
| 60.0-50.0 | 10th to 12th grade | Fairly difficult to read. |
| 50.0-30.0 | College | Difficult to read. |
| 30.0-0.0 | College graduate | Very difficult to read. Best understood by university graduates. |
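If it helps to turn scores into these bands programmatically, here is a minimal base-R sketch. The helper name `flesch_band()` is hypothetical (it is not part of quanteda), and the handling of boundary values and of scores outside 0-100 is my own assumption, since the table's ranges overlap at their endpoints.

```r
# Hypothetical helper: map Flesch reading ease scores to the grade bands above.
# Assumption: each band includes its lower bound and excludes its upper bound;
# scores above 100 fall in the easiest band, scores below 0 in the hardest.
flesch_band <- function(score) {
  cut(score,
      breaks = c(-Inf, 30, 50, 60, 70, 80, 90, Inf),
      labels = c("College graduate", "College", "10th to 12th grade",
                 "8th & 9th grade", "7th grade", "6th grade", "5th grade"),
      right = FALSE)
}

# Try it on a few scores
flesch_band(c(109, 65, 52, -103))
```

The function returns a factor, so it can be added as a column to a data frame and used directly for grouping or plotting.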

According to Flesch-Kincaid readability tests (2018), here are the empirical readability scores of various publications.

| Publication | Score |
|---|---|
| Reader’s Digest | 65 |
| Time Magazine | 52 |
| Harvard Law Review | “low 30s” |

Gunning Fog Index

The Gunning Fog Index is another very simple and similar formula for computing readability, created in 1952, by Robert Gunning, who worked in newspaper and textbook publishing (Gunning fog index, 2018). This index tries to directly compute a “grade level” for the writing.

Formula

As we can see, the Gunning Fog Index formula is very similar to the Flesch-Kincaid formula, except that the numerator in the second term counts “complex words” rather than syllables (and, trivially, the coefficients and their signs differ). Note, though, that the designation of “complex words” is still based on a syllable count: words with three or more syllables count as complex. (Again, the formula and table below are taken directly from Wikipedia (Gunning fog index, 2018).)
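For reference, the standard Gunning Fog Index formula, as given in the Wikipedia article on the index, can be written as follows. Again, the term labels are descriptive names rather than symbols defined elsewhere in this post.

```latex
\text{Gunning Fog Index} = 0.4 \left[
  \left( \frac{\text{words}}{\text{sentences}} \right)
  + 100 \left( \frac{\text{complex words}}{\text{words}} \right)
\right]
```

Here both terms are positive, so longer sentences and a larger share of complex words push the index (and the implied grade level) up.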

Baselines

And this table shows how Gunning Fog scores are supposed to map on to grade level.

| Score | Grade Level |
|---|---|
| 17 | College graduate |
| 16 | College senior |
| 15 | College junior |
| 14 | College sophomore |
| 13 | College freshman |
| 12 | High school senior |
| 11 | High school junior |
| 10 | High school sophomore |
| 9 | High school freshman |
| 8 | Eighth grade |
| 7 | Seventh grade |
| 6 | Sixth grade |

Extracting Readability Scores

Now let’s go about extracting readability scores from actual text. Again, the quanteda package has a function, textstat_readability(), that makes the extraction of these readability scores very easy.

Example

As usual, let’s start with a simple example. We will use two statements, one that is complex and low on readability, and another that is simple and high on readability.

  • “The deteriorating octogenarian somnambulated, meandering clumsily about his condominium. Disoriented the disheleved derelict perenigrated tortoislike toward the unprepossessing lavatory facility.”
  • “The old guy walked in his sleep. He made his way to the toilet.”
# Generate sentences
example_1 <- c("The deteriorating octogenarian somnambulated, meandering clumsily about his condominium. Disoriented the disheleved derelict perenigrated tortoislike toward the unprepossessing lavatory facility.")
example_2 <- c("The old guy walked in his sleep. He made his way to the toilet.")

# Store to data frame
example_df <-
  data.frame(statement = c(example_1, example_2),
             stat_num = c(1, 2)) %>%
  mutate(statement = as.character(statement))

# Print Sentences
example_df

Example (Flesch-Kincaid)

And now let’s compute the Flesch-Kincaid score for each sentence. As we can see, the simple sentence gets a high score (109), while the complex sentence gets an extremely low score (-103). (The fact that I was able to come up with a sentence that scores below 0 constitutes proof that I am a graduate student. Not to be outdone, a sentence from Proust’s Swann’s Way scores -515 (Flesch-Kincaid readability tests, 2018).)

# Compute Flesch readability score for each sentence
# (textstat_readability() is vectorized over texts, so no loop is needed)
example_df$read_FLESCH <-
  textstat_readability(x = example_df$statement,
                       measure = "Flesch")$Flesch

# Print result
example_df %>%
  select(stat_num,
         read_FLESCH,
         statement)
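As a sanity check on the score for the simple sentence, the arithmetic can be reproduced by hand. This is a minimal sketch that assumes the standard Flesch reading ease formula and my own manual counts of sentences, words, and syllables; quanteda's tokenizer and syllable counter may differ on other texts.

```r
# Hand counts for: "The old guy walked in his sleep. He made his way to the toilet."
# (these counts are my own assumptions, not output from quanteda)
n_sentences <- 2
n_words     <- 14
n_syllables <- 15  # every word has one syllable except "toilet" (two)

# Standard Flesch reading ease formula
flesch_by_hand <- 206.835 -
  1.015 * (n_words / n_sentences) -
  84.6  * (n_syllables / n_words)

round(flesch_by_hand, 1)
```

The result rounds to 109, in line with the score reported above for the simple sentence.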

Example (Gunning Fog)

And now let’s compute the Gunning Fog Index for each of these statements. (Here higher numbers indicate higher “grade levels”, i.e. lower readability.) As we can see, the simple sentence gets a low score (2.8), and the complex sentence gets a high score (32, suggesting the writer is in the 32nd grade, which, if I stay around Cornell any longer, will not be too far off).

# Compute Gunning Fog score for each sentence
# (again using the vectorized call rather than a loop)
example_df$read_FOG <-
  textstat_readability(x = example_df$statement,
                       measure = "FOG")$FOG

# Print result
example_df %>%
  select(stat_num,
         read_FOG,
         read_FLESCH,
         statement)
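The Gunning Fog score for the simple sentence can be checked by hand in the same way. This sketch assumes the standard Gunning Fog formula and my own manual counts; "complex" here means three or more syllables, and no word in the simple sentence qualifies.

```r
# Hand counts for: "The old guy walked in his sleep. He made his way to the toilet."
# (these counts are my own assumptions, not output from quanteda)
n_sentences     <- 2
n_words         <- 14
n_complex_words <- 0  # no word has three or more syllables

# Standard Gunning Fog formula
fog_by_hand <- 0.4 * ((n_words / n_sentences) +
                      100 * (n_complex_words / n_words))

fog_by_hand
```

With no complex words, the score reduces to 0.4 times the average sentence length of 7 words, which gives the 2.8 reported above.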

Full Dataset

And now let’s apply this to our full set of 5,004 statements, computing Flesch-Kincaid and Gunning Fog scores for each statement.

# Compute Flesch and Gunning-Fog column for all statements
stats_clean$read_FLESCH <-
  textstat_readability(x = stats_clean$statement,
                       measure = "Flesch")$Flesch
## Warning in nsentence.character(x): nsentence() does not correctly count
## sentences in all lower-cased text
stats_clean$read_FOG <-
  textstat_readability(x = stats_clean$statement,
                         measure = "FOG")$FOG
## Warning in nsentence.character(x): nsentence() does not correctly count
## sentences in all lower-cased text
# Print result data frame
stats_clean %>%
  select(stat_id,
         read_FLESCH,
         read_FOG,
         statement)