
As before, without running any statistical tests for the moment, I would like to make a quick visual comparison of the performance of the three types of hybrid models we created: logistic regression, support vector machine, and neural network.

Packages

As usual, I will start by loading relevant packages.

# before knitting: message = FALSE, warning = FALSE
library(tidyverse) # cleaning and visualization

Load Data

I will now load the summary results from each of the different models we created in the earlier sections, and combine them together.

# load results df's from each of the models we created
load("results_HYB_log.Rda")
load("results_HYB_svm.Rda")
load("results_HYB_neural.Rda")

# combine all results together
results <-
  rbind(results_HYB_log,
        results_HYB_svm,
        results_HYB_neural)

# filter out non-hybrid results
results <-
  results %>%
  filter(hyb_type == "hybrid")

# calculate average sample sizes (for later CI's)
n_log <- mean(results[results$model_type == "logistic", ]$n)
n_svm <- mean(results[results$model_type == "svm", ]$n)
n_neural <- mean(results[results$model_type == "neural", ]$n)
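As a side note, the three per-model averages could also be computed in a single grouped summary. The following is a minimal sketch on a toy data frame; `results_toy` and its values are hypothetical stand-ins, not the actual results.

```r
library(dplyr)

# hypothetical toy data standing in for the real `results` data frame
results_toy <- data.frame(
  model_type = c("logistic", "logistic", "svm", "svm", "neural", "neural"),
  n          = c(100, 102, 100, 102, 100, 102)
)

# one row per model type, holding its average sample size
n_by_model <-
  results_toy %>%
  group_by(model_type) %>%
  summarize(n = mean(n))

n_by_model
```

Keeping the sample sizes in a data frame, rather than in three separate variables, makes it easier to match each model with its own sample size later on.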

Comparison (Overall Accuracy)

First, I will look at overall accuracy for each of the three models, averaged across their 10 rounds of training. When we display these results, it looks like the hybrid support vector machine model had the best performance, followed by the logistic regression model, which itself was followed by the neural network model.

results %>%
  group_by(model_type) %>%
  # keep each model's own average sample size for its confidence interval
  summarize(accuracy = mean(accuracy),
            n = mean(n)) %>%
  ggplot(aes(x = model_type,
             y = accuracy)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = accuracy - 1.96*sqrt(accuracy*(1-accuracy)/n),
                     ymax = accuracy + 1.96*sqrt(accuracy*(1-accuracy)/n)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.49, to = 0.70, by = 0.01),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("logistic", "svm", "neural")) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        panel.background = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Accuracy by (Hybrid) Model Type",
       x = "Model Type",
       y = "Overall Accuracy")
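The error bars in the plot above are 95% confidence intervals based on the normal approximation for a proportion, i.e. p ± 1.96 * sqrt(p * (1 - p) / n). As a minimal sketch, the same calculation can be written as a small function; `prop_ci` is a hypothetical helper name and the example values are made up for illustration.

```r
# normal-approximation (Wald) confidence interval for a proportion p
# estimated from n cases; z = 1.96 corresponds to a 95% interval
# (prop_ci is a hypothetical helper, not part of the original analysis)
prop_ci <- function(p, n, z = 1.96) {
  se <- sqrt(p * (1 - p) / n)
  c(lower = p - z * se, upper = p + z * se)
}

# example with made-up values: an accuracy of 0.60 estimated from 400 cases
ci <- prop_ci(p = 0.60, n = 400)
ci
```

Note that this approximation becomes less reliable when n is small or when p is close to 0 or 1.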

Comparison (Sensitivity, Specificity, Precision, Negative Predictive Value)

We see the same ordering of results when examining the other four key performance statistics: sensitivity, specificity, precision, and negative predictive value. The highest performance always comes from the support vector machine model, followed by the logistic regression model, which is then followed by the neural network model.

results %>%
  select(model_type, round, n, sensitivity, specificity, precision, npv) %>%
  gather(key = "metric",
         value = "value",
         sensitivity, specificity, precision, npv) %>%
  group_by(model_type, metric) %>%
  # keep each model's own average sample size for its confidence interval
  summarize(value = mean(value),
            n = mean(n)) %>%
  ungroup() %>%
  mutate(metric = factor(metric,
                            levels = c("sensitivity", "specificity", "precision", "npv"))) %>%
  ggplot(aes(x = model_type,
             y = value)) +
  geom_point(size = 2,
             color = "#545EDF") +
  geom_errorbar(aes(ymin = value - 1.96*sqrt(value*(1-value)/n),
                     ymax = value + 1.96*sqrt(value*(1-value)/n)),
                color = "#545EDF",
                width = 0.05,
                size = 1) +
  geom_hline(yintercept = 0.5,
             linetype = "dashed",
             size = 0.5,
             color = "red") +
  scale_y_continuous(breaks = seq(from = 0.50, to = 0.70, by = 0.05),
                     limits = c(0.49, 0.70)) +
  scale_x_discrete(limits = c("logistic", "svm", "neural")) +
  facet_grid(metric ~ .) +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_line(color = "grey",
                                          size = 0.25),
        plot.background = element_blank(),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill=NA, size=1),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5),
        axis.title.y = element_text(margin = 
                                      margin(t = 0, r = 10, b = 0, l = 0)),
        axis.title.x = element_text(margin = 
                                      margin(t = 10, r = 00, b = 0, l = 0))) +
  labs(title = "Metrics by (Hybrid) Model Type",
       x = "Model Type",
       y = "Proportion")
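The reshaping step in the code above turns the four metric columns into a long format, with one row per model, round, and metric, so that the plot can be faceted by metric. As a minimal sketch, the same reshaping can be done with `pivot_longer()` from tidyr; `results_toy` and its values are hypothetical and invented purely for illustration.

```r
library(dplyr)
library(tidyr)

# hypothetical toy data; values are invented for illustration only
results_toy <- data.frame(
  model_type  = c("logistic", "svm"),
  round       = c(1, 1),
  sensitivity = c(0.60, 0.62),
  specificity = c(0.58, 0.61),
  precision   = c(0.59, 0.63),
  npv         = c(0.57, 0.60)
)

# wide to long: one row per (model_type, round, metric) combination
results_long <-
  results_toy %>%
  pivot_longer(cols = c(sensitivity, specificity, precision, npv),
               names_to = "metric",
               values_to = "value")

results_long
```

Here `names_to` and `values_to` play the same roles as `key` and `value` do in `gather()`.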

This is all I wanted to examine, for the moment.

Hybrid Modeling Comparison

Sebastian Deri
