[return to overview page]

Before beginning attempting to build models to predict the truth or falsity of statements, we need to do some house-keeping. We need to join together, examine, pre-process and clean the sets of features we created in the earlier sections. That is what I do in this section.

Packages

Again, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization
library(plyr) # has join_all function
library(ggthemes) # visualization
library(caret) # modeling
library(AppliedPredictiveModeling)
library(e1071) # has skewness() function
library(DescTools) # has Winsorize() function
library(ggcorrplot) # for correlation plot

Load Data

First, I will load the various data from the features we just extracted.

# load all the nice tidy df's of features we created (remember stats_words has multiple dtm's)
load("stats_clean.Rda") # has text of statement
load("stats_length.Rda") # statement lengths
load("stats_pos.Rda") # parts of speech
load("stats_sent.Rda") # sentiment
load("stats_complex.Rda") # complexity and readability
load("stats_words.Rda") # bag of words (mini document-term matrices)

Join Together All Data

To begin, let’s take all the feature sets we created (statement length, parts of speech, sentiment, readability, and word frequency) and put them together. We can join together these different feature sets by linking individual statements together by their individual statement identification number (“stat_id” column), which we made sure to attach and keep constant throughout the feature extraction process. (This can be done easily using SQL-style join functions available through various R packages, particuly in the tidyverse. For background on SQL joins, which are a super useful and universal theme in data management: Join (SQL), 2018.)

# join all features with ground truth, by stat_id
stats_raw <-
  join_all(dfs = list(stats_length,
                      stats_pos,
                      stats_sent,
                      stats_complex,
                      stats_dtm_100),
           by = "stat_id",
           type = "left") %>%
  left_join(stats_clean %>%
              select(stat_id,
                     grd_truth),
            by = "stat_id") %>%
  select(stat_id,
         grd_truth,
         everything())


# print joined df
stats_raw

# remoe other df (for memory)
rm(stats_length,
   stats_pos,
   stats_sent,
   stats_complex,
   stats_dtm_10,
   stats_dtm_25,
   stats_dtm_50,
   stats_dtm_100)

Cleaning Data

Before engaging in any predictive modeling, it’s important to make sure that there are no major underlying problems in that data. As Kuhn & Johnson (2013, p. 27) note “data preparation can make or break a model’s predictive ability”. They outline a few key issues to check for when cleaning or pre-processing data. The ones that are particular relevant to us are

skew in predictors
zero or near zero variance in predictors
collinearity among predictors

I am going to check for each of these problems and adjust (i.e. delete or transform) data in cases where any major problems seem to arise. Later, when we actually build predictive models, we can compare the performance of models that use this cleaned data compared to models that use the raw data.

Visualizing Variable Distributions

To get an overall sense of our variables, let’s take a look at their distributions. This is plotted below (one distribution for each of our 124 variables; different categories of features are delineated in different colors). when we do this, we see there is a fair deal of skew in many of our variables. And some variables do not seem to have much variance (e.g. the majority of the data points are the same value). The outlying points in these skewed distributions may exert a high degree of influence on some of our model’s estimates. And the variables with zero or near zero variance simply won’t be very useful for prediction. Let’s go about adjusting our data a bit to account for this.

stats_raw %>%
  select(-stat_id) %>%
  keep(is.numeric) %>% 
  gather(key = "variable",
         value = "value") %>%
  mutate(feature_type = case_when(grepl("^n_", variable) ~ "length",
                                  grepl("^pos_", variable) ~ "part_of_speech",
                                  grepl("^sent_", variable) ~ "sentiment",
                                  grepl("^read_", variable) ~ "complexity",
                                  grepl("^wrd_", variable) ~ "word")) %>%
  ggplot(aes(x = value,
             fill = feature_type)) +
    facet_wrap(~ variable,
               ncol = 7,
               scales = "free") +
    geom_histogram() +
    scale_fill_discrete(breaks=c("length", "part_of_speech", "complexity",
                                 "sentiment", "word")) +
    theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Skew

Kuhn & Johnson, (2013) suggest a fair number of popular ways of dealing with skewed data. One common solution is to apply different types of transformations to the data, depending on the direction of the skew. One very popular method (that also has the advantage of minimizing thinking) is to apply the Box & Cox (1964) transformations, which are entire family of transformations that adjust various skewed distributions in pre-defined ways according to their skew. However, I don’t love these transformations because they reduce the interpretability of coefficients (e.g. some variables might be squared, others raised to the power -1). Further, many of our distributions are simply poisson-looking count distributions, with most variables occuring in a few different count bins – various transformations aren’t really going have much of an effect on the distributions between these bins, just the spacing between them. What I am more worried about are those distributions where a few extreme outliers are significantly skewing the distributions (e.g. the distributions for “n_words”, which counts the number of words in each statement). A nicer and simpler type of transformation exists that eliminates these outliers, without needing to change the scales, and thus leaving our variables in interpretable units. This is to simply select some cut points at the ends of the distribution (e.g. at 5% and 95%) and make those the lowest and highest possible values (i.e. any values below the 5th percentile are transformed into whatever the 5th percentile value is, and any values above the 95th percentile value are transformed into whatever the 95 percentile value is). This method is called Winsorizing (Winsorizing, 2019).

Skew (Winsorizing Example)

Let me demonstrate what the this look like on one variable, which appears highly skewed (the “n_words” count, referenced earlier). Visualized below are the raw distribution and the winsorized distribution (windsorized at 1% and 99%). As we can see, this massively reduces the skew in the distribution.

# make data frame to store columns for raw and winsorized "n_words" column
winsor_exmpl <- as.data.frame(stats_raw$n_words)
winsor_exmpl <- winsor_exmpl %>% dplyr::rename(n_words_raw = 1)

# winsorize the variables (at 1% and 99%)
winsor_exmpl$n_words_winsorized <- 
  Winsorize(stats_raw$n_words,
            probs = c(0.01, 0.99))

# visually compare raw and winsorized distributions
winsor_exmpl %>%
  select(n_words_raw,
         n_words_winsorized) %>%
  gather(key = "transform",
         value = "value") %>% 
  ggplot(aes(x = value)) +
    facet_wrap(~ transform,
               scales = "free") +
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Skew (Winsorizing Across All Variables)

To apply this transformation across all variables, we have to have some method for deciding which variables are skewed enough that they warrant winsorization. We could decide this based on visual inspection of the individual histograms above using personal judgment. However, that is a lot of data to look through one by one. And there exist metrics for quantifying skew, which we can leverage here – kurtosis, skew, and other rules of thumb (e.g. the ratio between the max and min value for a variable should be no more than 20) (Kuhn & Johnson, 2013, Chapter 3). I am going to simply choose the metric called “skewness” (there’s no obvious reason why one metric is better than another). The formula for this metric is shown below, and some rules of thumb suggest that we should worry about distributions with skewness values greater than 2. So, we’ll winsorize all variables that have a skewness with an absolute value greater than 2.

# make new data frame to store cleaned data
stats_proc <- stats_raw 

# find skew of each variable
allvar_skewness <-
  as.data.frame(
    lapply(subset(stats_raw, select = -c(stat_id, grd_truth)),
           e1071::skewness)) %>%
    gather(key = "feature",
           value = "skew") %>%
    arrange(desc(skew))

# print variables in order of skewness
allvar_skewness

# make df with names and skew of features that have absolute skew value of more than 2
too_skewed <-
  allvar_skewness %>%
  filter(abs(skew) > 2)

# update cleaned doc (too skewed columns, replaces with winsorized columns)
stats_proc[, names(stats_proc) %in% c(too_skewed$feature, "wrd_I'M")] <- 
# for some reason the column "wrd_I'M" gets left out (can check with setdiff()),
# so I just put it back manually (for some reason gets converted to "wrd_I.M" by too_skewed$feature)
# check with: setdiff(too_skewed$feature, names(stats_proc[, names(stats_proc) %in% too_skewed$feature]))
  as.data.frame(
    lapply(stats_proc[, names(stats_proc) %in% c(too_skewed$feature, "wrd_I'M")],
           Winsorize))

Skew (Visualize All Variables, Now Winsorized)

Let’s take a look at our results. That is, let’s look at all the distributions again, with our now winsorized variables. From looking at these distributions, we see that the situation is improved from before, but certainly not perfect. Let’s move on to our next problem.

stats_proc %>%
  select(-stat_id) %>%
  keep(is.numeric) %>% 
  gather(key = "variable",
         value = "value") %>%
  mutate(feature_type = case_when(grepl("^n_", variable) ~ "length",
                                  grepl("^pos_", variable) ~ "part_of_speech",
                                  grepl("^sent_", variable) ~ "sentiment",
                                  grepl("^read_", variable) ~ "complexity",
                                  grepl("^wrd_", variable) ~ "word")) %>%
  ggplot(aes(x = value,
             fill = feature_type)) +
    facet_wrap(~ variable,
               ncol = 7,
               scales = "free") +
    geom_histogram() +
    scale_fill_discrete(breaks=c("length", "part_of_speech", "complexity",
                                 "sentiment", "word")) +
    theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Computation failed in `stat_bin()`:
## `binwidth` must be positive

## Warning: Computation failed in `stat_bin()`:
## `binwidth` must be positive

## Warning: Computation failed in `stat_bin()`:
## `binwidth` must be positive

Zero or Near Zero Variance (Identify)

Another type of issue in our data we might want to deal with is those variables that exhibit zero or near zear variances – that is, almost all data points take on one value. The caret package has a built in function, nearZeroVar, which helps us automatically identify such variables. (It does this by identifying variables where either literally 100% of the values are the same or the ratio of the number of occurrences of the first most frequent to the number of occurrences of the second most frequent value is beyong some specified threshold; the default, I believe is 95/5. But I will adjust this to be an even more liberal 90/10).

As we can see below, this highlights 30 variables for deletion – e.g. “pos_PROPN”, in which literally all the values are the same, or “wrd_SHE” and “wrd_HE” where the ratio of the first to second most common value of well over 90/10 (i.e. 9.0); other features like “wrd_I” are much more balance, with a frequency ratio of 1.06).

# create object that stores near zero variance stats
features_nzv <-
  nearZeroVar(stats_proc %>% select(-stat_id, -grd_truth),
              saveMetrics = TRUE,
              freqCut = 90/10,
              name = TRUE) %>%
  rownames_to_column(var = "feature")

# display sorted object
features_nzv %>%
  arrange(desc(zeroVar), desc(nzv), desc(freqRatio), percentUnique)

Zero or Near Zero Variance (Eliminate)

Let’s now actually eliminate these 30 zero or near zero variance features, from our set of cleaned features.

# first, make df with names of features (and other nzv stats) to delete
nzv_delete <- 
  features_nzv %>%
  filter(zeroVar == TRUE | nzv == TRUE)

# now remove identified columns for data frame
stats_proc <-
  stats_proc[, !(names(stats_proc) %in% nzv_delete$feature)]

Zero or Near Zero Variance (Visualize Remaining)

Again, let’s check what we are left with. As we can see, we still have some predictors with pretty unequal bins. But at least our worst offenders have been removed.

stats_proc %>%
  select(-stat_id) %>%
  keep(is.numeric) %>% 
  gather(key = "variable",
         value = "value") %>%
  mutate(feature_type = case_when(grepl("^n_", variable) ~ "length",
                                  grepl("^pos_", variable) ~ "part_of_speech",
                                  grepl("^sent_", variable) ~ "sentiment",
                                  grepl("^read_", variable) ~ "complexity",
                                  grepl("^wrd_", variable) ~ "word")) %>%
  ggplot(aes(x = value,
             fill = feature_type)) +
    facet_wrap(~ variable,
               ncol = 7,
               scales = "free") +
    geom_histogram() +
    scale_fill_discrete(breaks=c("length", "part_of_speech", "complexity",
                                 "sentiment", "word")) +
    theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Collinearity

A final problem that that affects some models (like logistic regression) is collinearity. A high degree of correlation between our variables may lead to unstable and unreliable estimates. We already know this will be a problem among at least two of our variables (the two complexity measures,flesch-kincaid and gunning fog scores, which have an absolute pearson correlation coefficient value greater than 0.94). Classic regression statistics like variance inflation factors can be used to identify problems of collinearity. However, Kuhn & Johnson point out that such statistics are often less useful outside the linear regression cases. They suggest another another algorithm for diagnosing and eliminating problems of collinearity (2013, p. 47), which is intuitive and is useful for more cases. We simply pick a correlation threshold (say |r| = 0.75). We then examine the correlation matrix of all the predictors. We then examine if any predictors have a correlation with an absolute value higher then our threshold. If they do, then we compute the average absolute correlation coefficient value of each of those two variables have with the rest of the predictors, and we eliminate the one with the higher average absolute correlation coefficient. We continue this process until there are no pairwise correlations above our threshold. (Note however, that this method only resolves issues of collinearity in two dimensions.)

Collinearity (Visualizing)

Let’s start by simply visualizing all pairwise (96 x 96) correlations between our remaining pre-processed variables. We can notice a few things from this figure. Features seem to have higher correlations within their own type then outside their own type (e.g. look along the red diaganol and notice there are dark sets of boxes eminating out; these are clusters of similar variable types – e.g. all the parts of speech, form the second biggest such box in the bottom left (“pos_ADJ”, “pos_ADP”, …), followed by a smaller box for the three sentiment metrics (“sent_POS”, “sent_NEG”, “sent_NET”, followed by an even smaller box for the readability features (“read_FLESCH”, “read_FOG”), followed by the very largest box for all the individual words (“wrd_I”, “wrd_AND”, …). As we suspected, the two readability features are highly (negatively) correlated. Further, we see that the statement length metrics (“n_words”, “n_unique”, “n_unique_prop”) are fairly highly correlated with a number of variables (follow along vertically the three bottom-most rows or the the three left-most columns horizontally). Finally, all the individual words are somewhat correlated with each other, but not overwhelmingly so.

# get correlations
feature_cors <- cor(stats_proc %>% select(-stat_id, -grd_truth))
feature_pcors <- ggcorrplot::cor_pmat(stats_proc %>% select(-stat_id, -grd_truth))

# visualize all pairwise correlations
ggcorrplot(feature_cors,
           p.mat = feature_pcors,
           insig = "blank") + 
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 90,
                                   vjust = 0.1),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_blank())

Collinearity (Candidates for Elimination)

Now let’s go about eliminating any variables with problematically high correlations with the other variables. We will use the algorithm describes above (with a cutoff of 0.70). This suggest for deletion four variables: read_FOG, n_words, n_unique, and sent_POS. These confirm what we might have suspected from our visual inspection and prior knowledge. We knew the two readability metrics were highly correlated (and so one, the FOG index, was suggested for deletion.) As we saw in the correlation matrix, the statement length metrics were highly correlated with a number of variables, and so they are suggested for deletion. (Likely, many of the counts of individual words are often also a fair proxy for the length of those statements, as certain words occur with great frequency as and regularity as we saw earlier. Although, we had some theoretical reasons for including statement length metrics, I am okay deleting them. It seems some of the other variables already account for a large portion of the variance that these variables pick up. Thus, if statement length is important, it’s not like our model won’t have features which account for this.) Finally, one of the sentiment features (“sent_POS”) was suggested for deletion. Remember, we computed overall number of negative words (“sent_NEG”), overall number of positive words (“sent_POS”), and then net sentiment, the difference between these two (“sent_NET”). This means “sent_POS” and “sent_NEG” will be highly correlated with “sent_NET” – meaning that either of the two or “sent_NET” might be suggested for deletion. The algorithm suggested “sent_POS”.

# find high correlations
high_corr_names <- findCorrelation(x = cor(stats_proc %>% select(-stat_id, -grd_truth)),
                                     cutoff = 0.70,
                                     names = TRUE)

# print names of columns suggested for deletion
high_corr_names

## [1] "n_words"  "n_unique" "read_FOG" "sent_POS"

Collinearity (Actual Elimination)

To deal with possible problems of collinearity, let’s delete these four variables suggested by the algorithm. We now wind up with a total of 92 predictor variables.

# eliminate high correlation columns from cleaned dataset
stats_proc <- 
  stats_proc[ , !(names(stats_proc) %in% high_corr_names)]

# print remaining data frame
stats_proc

Transforming Data

This is the end of the housecleaning. But as a last step, I would like to transform the data to ease interpretation and allow for better comparison in later analyses. Specifically, I am going to center and rescale all predictor variables (i.e. features we previously extracted). First, I will center all variables around their mean – i.e. for each variable, compute its mean and then subtract that value from each row. In this way, we will know that any value which approaches zero will be the near the mean for that variable. Second, I will rescale each variable by its standard deviation – i.e. divide the values in each column by the standard deviation of that column, so that deviations between variables are comparable. Thus, for any variable, a value like 0.5 will mean the same thing – that value is 0.5 standard deviations above the mean for that feature. Luckily, the preProcess function in the caret package (Kuhn, 2008) makes this extremely easy. (Of course, many of variables are still notably skewed or uneven, so some of this centering and scaling could lead to misleading interpretations. However, this process still imposes some more uniformity on our data. And we can keep in mind the caveat just noted in our subsequent interpretations.)

# center and scale with preProcess and predict from caret
stats_proc <- 
  predict(newdata = stats_proc,
          preProcess(x = stats_proc %>% select(-stat_id, -grd_truth),
                     method = c("center", "scale")))

# print centered and scales data frame 
# stats_proc # (turned off, bc it makes knitr exceed pandoc memory limits)

Final Visualization

With all cleaning, transforming and other pre-processing complete, let’s have one final look at the distributions of our remaining processed variables (to make sure we didn’t mess anything up in that last step).

stats_proc %>%
  select(-stat_id) %>%
  keep(is.numeric) %>% 
  gather(key = "variable",
         value = "value") %>%
  mutate(feature_type = case_when(grepl("^n_", variable) ~ "length",
                                  grepl("^pos_", variable) ~ "part_of_speech",
                                  grepl("^sent_", variable) ~ "sentiment",
                                  grepl("^read_", variable) ~ "complexity",
                                  grepl("^wrd_", variable) ~ "word")) %>%
  ggplot(aes(x = value,
             fill = feature_type)) +
    facet_wrap(~ variable,
               ncol = 7,
               scales = "free") +
    geom_histogram() +
    scale_fill_discrete(breaks=c("length", "part_of_speech", "complexity",
                                 "sentiment", "word")) +
    theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Save

And lastly, let’s save both the raw version of the full data set we created in the beginning and the processed version of the data we just finished creating.

# save the raw and various cleaned data objects
save(stats_raw,
     stats_proc,
     file = "stats_proc.Rda")

Now, let’s save the various objects created.

Citations

Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211-243.
Join (SQL). (2018). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Join_(SQL)&oldid=868927067
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). Springer.
Winsorizing. (2019). In Wikipedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Winsorizing&oldid=891819633

Data Cleaning & Pre-Processing