
Before we start building models to predict the truth or falsity of statements, we need to do some housekeeping. We need to join together, examine, pre-process, and clean the sets of features we created in the earlier sections. That is what I do in this section.

Packages

Again, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(plyr) # has join_all function; loaded before tidyverse so it does not mask dplyr verbs
library(tidyverse) # cleaning and visualization
library(ggthemes) # visualization
library(caret) # modeling
library(AppliedPredictiveModeling)
library(e1071) # has skewness() function
library(DescTools) # has Winsorize() function
library(ggcorrplot) # for correlation plot

Load Data

First, I will load the feature sets we just extracted.

# load all the nice tidy df's of features we created (remember stats_words has multiple dtm's)
load("stats_clean.Rda") # has text of statement
load("stats_length.Rda") # statement lengths
load("stats_pos.Rda") # parts of speech
load("stats_sent.Rda") # sentiment
load("stats_complex.Rda") # complexity and readability
load("stats_words.Rda") # bag of words (mini document-term matrices)
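Because load() restores objects under the names they were saved with, it is not obvious from the file name alone which objects a file such as stats_words.Rda adds to the workspace (presumably including the stats_dtm_100 object used later). A minimal sketch of how one might check, using a temporary file and toy objects whose names and values are purely hypothetical:

```r
# Toy objects standing in for document-term matrices (hypothetical names and values)
toy_dtm_10  <- data.frame(stat_id = 1:2, the = c(1, 0))
toy_dtm_100 <- data.frame(stat_id = 1:2, the = c(1, 0), and = c(0, 2))

# Save both objects into a single temporary .Rda file
toy_file <- tempfile(fileext = ".Rda")
save(toy_dtm_10, toy_dtm_100, file = toy_file)

# load() invisibly returns the names of the objects it restored;
# loading into a fresh environment avoids overwriting anything in the workspace
toy_env <- new.env()
loaded_names <- load(toy_file, envir = toy_env)

# Inspect which objects came out of the file
print(loaded_names)
```

The same pattern (capturing the return value of load()) could be applied to the real .Rda files to confirm which data frames each one contains.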

Join Together All Data

To begin, let’s take all the feature sets we created (statement length, parts of speech, sentiment, readability, and word frequency) and put them together. We can join these feature sets by matching rows on each statement’s identification number (the “stat_id” column), which we made sure to attach and keep constant throughout the feature extraction process. (This is easy to do with the SQL-style join functions available in various R packages, particularly in the tidyverse. For background on SQL joins, which are a useful and near-universal tool in data management, see Join (SQL), 2018.)

# join all features with ground truth, by stat_id
stats_raw <-
  join_all(dfs = list(stats_length,
                      stats_pos,
                      stats_sent,
                      stats_complex,
                      stats_dtm_100),
           by = "stat_id",
           type = "left") %>%
  left_join(stats_clean %>%
              select(stat_id,
                     grd_truth),
            by = "stat_id") %>%
  select(stat_id,
         grd_truth,
         everything())


# print joined df
stats_raw
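To make the left-join logic concrete, here is a minimal sketch on toy data. It is not the code used in this analysis: the toy_* tables and their values are invented for illustration, and it swaps plyr's join_all for a tidyverse-only alternative (purrr's reduce combined with dplyr's left_join). It also shows two sanity checks that seem worth running after any join by key: that no rows were gained or lost, and that no identifier is duplicated.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for the real feature tables (hypothetical values)
toy_length <- tibble(stat_id = 1:3, n_words = c(12, 7, 20))
toy_sent   <- tibble(stat_id = c(1L, 3L), sent_score = c(0.4, -0.2))
toy_truth  <- tibble(stat_id = 1:3, grd_truth = c("truth", "lie", "truth"))

# Left-join every table onto the first one, matching rows by stat_id
toy_joined <-
  list(toy_length, toy_sent, toy_truth) %>%
  reduce(function(x, y) left_join(x, y, by = "stat_id")) %>%
  select(stat_id, grd_truth, everything())

# Sanity checks: a left join on a unique key should keep exactly one row per statement
stopifnot(nrow(toy_joined) == nrow(toy_length))
stopifnot(anyDuplicated(toy_joined$stat_id) == 0)

# Statement 2 has no sentiment row, so its sent_score is filled with NA
print(toy_joined)
```

A left join keeps every row of the left-hand table, so statements missing from a later feature table end up with NA in that table's columns rather than being dropped.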