---
title: "cdata Transforms"
author: "Win-Vector LLC"
date: "12/26/2017"
output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
link-citations: yes
---

Start up and show original `Keras` plot.
```{r setup}
library("ggplot2")
library("cdata")
library("seplyr")
library("keras")
library("kableExtra")
options(knitr.table.format = "html")

# load the saved Keras training history and show the default Keras plot
h <- readRDS("historyobject.rds")
plot(h)

# load the metrics frame and add an epoch index
dOrig <- readRDS("metricsframe.rds")
dOrig$epoch <- seq_len(nrow(dOrig))

# work on a copy, and flip loss score so that larger = better
d <- dOrig
d$loss <- -d$loss
d$val_loss <- -d$val_loss

# display the first few rows
cR <- d %.>%
  head(.)
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey")
```

Or with a bit more color:
```{r setup2}
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  row_spec(., 0, background = "lightgreen") %.>%
  column_spec(., 1:4, background = "yellow")
```

# Reproducing the first plot
First plot: reproduce the original with bulk-renaming of columns (via the new [`cdata::map_fieldsD()`](https://winvector.github.io/cdata/reference/map_fieldsD.html) function).
First we perform the equivalent of a "shred", "un-pivot", or "gather":
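```{r firstplot}
# control table: which columns to un-pivot, and what to call the results
cT <- build_unpivot_control(
  nameForNewKeyColumn= 'origColName',
  nameForNewValueColumn= 'performance',
  columnsToTakeFrom= c('val_loss',
                       'val_acc',
                       'loss',
                       'acc' ))
# move each row record into a block of rows, copying epoch along
dT <- rowrecs_to_blocks(
  d,
  controlTable = cT,
  columnsToCopy = "epoch")
# display the first few rows
cR <- dT %.>%
  head(.)
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey")
```
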
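To make the "un-pivot" idea concrete, here is a minimal base R sketch of the same reshaping on a toy frame. It does not use `cdata`; the `toy` data and its values are made up for illustration, and only the column names `origColName` and `performance` come from the control table above.

```r
# Toy stand-in for the metrics frame (values are invented for illustration).
toy <- data.frame(
  epoch = 1:2,
  loss = c(-0.50, -0.40),
  acc = c(0.70, 0.80),
  stringsAsFactors = FALSE)

value_cols <- c("loss", "acc")

# Build one output row per (input row, value column) pair.
long <- do.call(rbind, lapply(value_cols, function(col) {
  data.frame(
    epoch = toy$epoch,
    origColName = col,
    performance = toy[[col]],
    stringsAsFactors = FALSE)
}))

# Group the rows that came from the same epoch together.
long <- long[order(long$epoch), ]
rownames(long) <- NULL
print(long)
```

Each original row becomes a small block of rows, one per measured column, with `epoch` copied onto every row of the block.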
Next we define a value-mapping table that builds the new key columns we need (`dataset` and `measure`).
```{r}
# one row per original column name, giving its dataset and measure
mp <- data.frame(
  origColName = qc(val_loss, val_acc, 
                     loss, acc),
  dataset = qc("validation", "validation", 
               "training", "training"),
  measure = qc("minus binary cross entropy", "accuracy",
               "minus binary cross entropy", "accuracy"),
  stringsAsFactors = FALSE)
mp %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F)  %.>%
  row_spec(., 0:nrow(mp), background = "lightgrey") %.>%
  column_spec(., 1, background = "lightgreen") %.>%
  column_spec(., 2:ncol(mp), background = "yellow")
```

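Conceptually, such a mapping table is a lookup keyed by `origColName`. The following base R sketch shows that lookup with `match()` on made-up data; it is an illustration of the idea, not the `cdata` implementation, and the names `long`, `key_map`, and `mapped` are hypothetical.

```r
# Hypothetical miniature versions of the long table and the mapping table.
long <- data.frame(
  epoch = c(1, 1),
  origColName = c("val_loss", "acc"),
  performance = c(-0.45, 0.70),
  stringsAsFactors = FALSE)
key_map <- data.frame(
  origColName = c("val_loss", "acc"),
  dataset = c("validation", "training"),
  measure = c("minus binary cross entropy", "accuracy"),
  stringsAsFactors = FALSE)

# match() finds, for each data row, the mapping row with the same key.
idx <- match(long$origColName, key_map$origColName)

# Attach the looked-up columns to the data.
mapped <- cbind(long, key_map[idx, c("dataset", "measure"), drop = FALSE])
rownames(mapped) <- NULL
print(mapped)
```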
We apply that key map and we are ready to plot:
```{r}
# add the dataset and measure columns by looking up origColName in mp
dT <- map_fields(dT, 
                 "origColName",
                 mp)
dT$measure <- factor(dT$measure, 
                     levels = c("minus binary cross entropy",
                                "accuracy"))
cR <- dT %.>%
  head(.)
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F)  %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 2, background = "lightgreen") %.>%
  column_spec(., c(4,5), background = "yellow")

# this is picking the epoch with the optimal validation loss
# (minimum(loss) -> maximum(-loss))
pick <- dT %.>%
  filter_se(.,
            qe(measure == "minus binary cross entropy",
               dataset == "validation")) %.>%
  .$epoch[[which.max(.$performance)]]

# pick colors so that validation curve is perceptually dominant.
# training comes first
manual_scale <- c("#fec44f", "#993404")
ggplot(data = dT, 
       aes(x = epoch, 
           y = performance,
           color = dataset)) +
  geom_point() +
  stat_smooth(geom = "line", se = FALSE, method = "loess", alpha = 0.5) +
  facet_wrap(~measure, ncol=1, scales = "free_y") +
  geom_vline(xintercept = pick, alpha=0.7, color="#993404", linetype=2) +
  scale_color_manual(values = manual_scale) + 
  ggtitle("model performance by epoch, dataset, and measure")
```

# Creating an improved performance trajectory plot 
Second plot: the steps that lead to [WVPlots::plot_Keras_fit_trajectory()](https://winvector.github.io/WVPlots/reference/plot_Keras_fit_trajectory.html).  In particular, we show the structure of the control table (especially when applied to itself).  The idea is that any in-table block structure can be taken to any other block structure by moving through a very wide column as an intermediate (or, in the dual form, through a very thin intermediate structure such as RDF triples).
First let's take a look at our data.
```{r lineplot}
cR <- d %.>%
  head(.) %.>%
  select_se(., qc(epoch, val_loss, val_acc, loss, acc))
cR %.>%
  knitr::kable(.)  %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 2:5, background = "yellow") %.>%
  row_spec(., 2:nrow(cR), background = "lightgrey")
```

Let's concentrate on a single row (in this case the first row).
```{r lineplot1}
# keep only the first row, and size the styling to that one-row table
cR <- d %.>%
  head(., n=1) %.>%
  select_se(., qc(epoch, val_loss, val_acc, loss, acc))
cR %.>%
  knitr::kable(.)  %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 2, background = "lightgreen") %.>%
  column_spec(., 3:4, background = "yellow")
```

To create `ggplot2::geom_ribbon()` and `ggplot2::geom_segment()` we
need both the training and validation loss for a given epoch to be in the same row.
We also want the different performance metrics to be in different rows so we can
use `ggplot2::facet_wrap()`.  That means we want the first row of data to look like the following sub-table:
```{r lineplot2}
# control table: each cell names the source column for that position
cT <- dplyr::tribble(
  ~measure,                 ~training, ~validation,
  "minus binary cross entropy", "loss",    "val_loss",
  "accuracy",               "acc",     "val_acc"
  )
cR <- d %.>%
  head(., n=1) %.>%
  rowrecs_to_blocks(
    .,
    controlTable = cT,
    columnsToCopy = "epoch")
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 2, background = "lightgreen") %.>%
  column_spec(., 3:4, background = "yellow")
```

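One way to read the transform above: every cell of the control table that names a column is replaced by that column's value from the row record. The base R sketch below spells this out for a single made-up row. It is a hand-rolled illustration, not how `cdata` is implemented, and `ct`, `row1`, and `block` are hypothetical names.

```r
# Control table as a plain data.frame (same layout as cT in this article).
ct <- data.frame(
  measure = c("minus binary cross entropy", "accuracy"),
  training = c("loss", "acc"),
  validation = c("val_loss", "val_acc"),
  stringsAsFactors = FALSE)

# One made-up row record.
row1 <- data.frame(epoch = 1, val_loss = -0.38, val_acc = 0.81,
                   loss = -0.51, acc = 0.78)

# Replace each column name in the control table by that column's value.
block <- data.frame(
  epoch = row1$epoch,
  measure = ct$measure,
  training = unlist(row1[1, ct$training], use.names = FALSE),
  validation = unlist(row1[1, ct$validation], use.names = FALSE),
  stringsAsFactors = FALSE)
print(block)
```

The result has one row per measure, with the training and validation values for that measure side by side, which is the shape the ribbon and segment layers need.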
What allows this motion is the `controlTable`, which is essentially a before-or-after
diagram of the transform (depending on which direction you are going).  I cannot emphasize
enough the benefit of looking at the data and drawing out the transform on paper *before*
attempting any coding.
The control table is as follows:
```{r lpcct}
cT %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cT), background = "lightgrey") %.>%
  column_spec(., 1, background = "lightgreen") %.>%
  column_spec(., 2:3, background = "yellow")
```

It can be applied to data, or to itself, in a forward or backward direction (depending on whether we use `cdata::rowrecs_to_blocks()` or `cdata::blocks_to_rowrecs()`).
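```{r}
# apply the control table to itself: block -> single row record
cR <- cT %.>%
  blocks_to_rowrecs(
    .,
    controlTable = cT,
    keyColumns = NULL) %.>%
  select_se(.,
            qc(val_loss, val_acc, loss, acc)) 
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 1:4, background = "yellow")

# Now move it back
# (rowrecs_to_blocks() is the current name for the older moveValuesToRowsD())
cR %.>%
  rowrecs_to_blocks(
    .,
    controlTable = cT)
```
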
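The backward direction can be sketched by hand as well: the control table cell gives the destination column name, and the block cell in the same position gives the value. This is a base R illustration on made-up numbers, assuming a single block with no key columns; the names `ct`, `block`, and `row_record` are hypothetical and this is not the `cdata` implementation.

```r
# Control table as a plain data.frame (same layout as cT in this article).
ct <- data.frame(
  measure = c("minus binary cross entropy", "accuracy"),
  training = c("loss", "acc"),
  validation = c("val_loss", "val_acc"),
  stringsAsFactors = FALSE)

# A made-up block with the same shape as the control table.
block <- data.frame(
  measure = c("accuracy", "minus binary cross entropy"),
  training = c(0.78, -0.51),
  validation = c(0.81, -0.38),
  stringsAsFactors = FALSE)

# Line the block rows up with the control table rows by the key column.
block <- block[match(ct$measure, block$measure), ]

value_cols <- c("training", "validation")
# unlist() walks both tables column by column, so names and values stay paired.
dest_names <- unlist(ct[, value_cols], use.names = FALSE)
values <- unlist(block[, value_cols], use.names = FALSE)

# One wide row: each value lands in the column the control table names.
row_record <- as.data.frame(as.list(setNames(values, dest_names)))
print(row_record)
```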
Try this with numbers.
```{r proto}
onerow <- d %.>%
  head(., n=1)
onerow %.>% 
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(onerow), background = "lightgrey") %.>%
  column_spec(., 1:4, background = "yellow")

# row record -> block
# (rowrecs_to_blocks() is the current name for the older moveValuesToRowsD())
r1 <- rowrecs_to_blocks(onerow,
                        controlTable = cT,
                        columnsToCopy = 'epoch')
r1 %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(r1), background = "lightgrey") %.>%
  column_spec(., 2, background = "lightgreen") %.>%
  column_spec(., 3:4, background = "yellow")

# block -> row record
# (blocks_to_rowrecs() is the current name for the older moveValuesToColumnsD())
blocks_to_rowrecs(r1,
                  controlTable = cT,
                  keyColumns = 'epoch')  %.>% 
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(onerow), background = "lightgrey") %.>%
  column_spec(., 2:5, background = "yellow")
```

We can now apply the transform to all the data, and produce the final plot.
```{r moveall}
dT <- rowrecs_to_blocks(
  d,
  controlTable = cT,
  columnsToCopy = "epoch")
cR <- dT %.>%
  head(.)
cR %.>%
  knitr::kable(.) %.>%
  kable_styling(., full_width = F) %.>%
  row_spec(., 0:nrow(cR), background = "lightgrey") %.>%
  column_spec(., 2, background = "lightgreen") %.>%
  column_spec(., 3:4, background = "yellow")

dT$measure <- factor(dT$measure, 
                     levels = c("minus binary cross entropy",
                                "accuracy"))

# rmin/rmax bound the ribbon where validation trails training;
# discounted penalizes validation by 10% of that gap
# note: this step requires wrapr 1.0.3 or better
dT <- dT %.>%
  mutate_se(.,
            qae(rmin := ifelse(validation <= training, validation, NA),
                rmax := ifelse(validation <= training, training, NA),
                discounted := ifelse(validation <= training, 
                                     validation - 0.1*(training-validation), 
                                     validation)))

# pick the epoch with the best discounted (negated) loss
pick <- dT %.>%
  filter_se(.,
            qe(measure == "minus binary cross entropy")) %.>%
  .$epoch[[which.max(.$discounted)]]

ggplot(data = dT, 
       aes(x = epoch,
           xend = epoch,
           y = validation,
           yend = training,
           ymin = rmin,
           ymax = rmax)) +
  geom_segment(alpha = 0.5) +
  geom_point() +
  geom_point(aes(y = training), shape = 3, alpha = 0.5) +
  stat_smooth(geom = "line",
              se = FALSE, 
              color  = "#d95f02", 
              alpha = 0.8,
              method = "loess") +
  stat_smooth(geom = "line",
              aes(y = discounted),
              se = FALSE, 
              color  = "#d95f02", 
              alpha = 0.2,
              method = "loess",
              linetype = 2) +
  geom_ribbon(alpha=0.2, fill = "#1b9e77") +
  geom_vline(xintercept = pick, alpha=0.7, color='#e6ab02') +
  facet_wrap(~measure, ncol=1, scales = 'free_y') +
  ylab("performance") +
  ggtitle("model performance by epoch, dataset, and measure")
```

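The `rmin`, `rmax`, and `discounted` columns are plain vectorized arithmetic. A small base R sketch with two made-up epochs shows what the `ifelse()` expressions compute; the numbers are invented, and the `overfit` helper name is hypothetical.

```r
# Two made-up epochs: in the first, validation trails training;
# in the second, validation is ahead.
training <- c(0.90, 0.80)
validation <- c(0.80, 0.85)

# TRUE where validation performance is no better than training performance.
overfit <- validation <= training

# Ribbon bounds are only defined where validation trails training.
rmin <- ifelse(overfit, validation, NA)
rmax <- ifelse(overfit, training, NA)

# Penalize validation by 10% of the training/validation gap when it trails.
discounted <- ifelse(overfit,
                     validation - 0.1 * (training - validation),
                     validation)
print(data.frame(training, validation, rmin, rmax, discounted))
```

For the first epoch the discounted score is 0.80 - 0.1 * (0.90 - 0.80) = 0.79; for the second it is the validation score unchanged, 0.85.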
All of the above is now wrapped in a convenient function: [WVPlots::plot_Keras_fit_trajectory()](https://winvector.github.io/WVPlots/reference/plot_Keras_fit_trajectory.html).
# Conclusion
All of the above is based on the second-generation fluid data theory behind the `cdata` package.  The [first generation of the theory](http://winvector.github.io/FluidData/RowsAndColumns.html) was about establishing and maintaining invariants that make data transforms reversible and commutative.  The [second generation of the theory](http://winvector.github.io/FluidData/ArbitraryTransform.html) is about transforms beyond pivot/un-pivot (moving sets of values in unison).
A quick list of references:
  * [Coordinatized Data: A Fluid Data Specification](http://winvector.github.io/FluidData/RowsAndColumns.html)
  * [Arbitrary Data Transforms Using cdata](http://winvector.github.io/FluidData/ArbitraryTransform.html)
  * [Data Wrangling at Scale](http://winvector.github.io/FluidData/DataWranglingAtScale.html)
  * [Big Data Transforms](http://winvector.github.io/FluidData/BigDataTransforms.html)
  * [cdata Transforms](http://winvector.github.io/FluidData/PlotExample/PlotExample.html) (*this article*)
  * [Plotting Keras Performance Trajectories](http://winvector.github.io/FluidData/PlotExample/KerasPerfPlot.html)
  * [Plotting xgboost Performance Trajectories](http://winvector.github.io/FluidData/PlotExample/xgboostPerfPlot.html)