---
title: "Getting Started with NMF"
author: "Zach DeBruine"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with NMF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette demonstrates basic usage of the `RcppML::nmf` function and visualization of the results.

## Install RcppML

Install the RcppML R package from CRAN or the development version from GitHub. 

```{R, eval = FALSE}
install.packages('RcppML')                     # install CRAN version
# devtools::install_github("zdebruine/RcppML") # compile dev version
```

## What is NMF?

Non-negative Matrix Factorization (NMF) finds additive signals in non-negative data in terms of the features and samples associated with those signals.

NMF gives an approximation of an input matrix as the cross-product of two low-rank submatrices: 

$$A = wdh$$

Here, $A$ is the input matrix, $w$ is a tall matrix of features in rows and factors in columns, and $h$ is a wide matrix of factors in rows and samples in columns. 

`RcppML::nmf` introduces one more important component into this system, a scaling diagonal, $d$. This scaling diagonal provides:

* consistent factor scalings throughout model fitting
* robustness across random restarts
* symmetry in factorization of symmetric matrices
* a means for convex L1 regularization

## Running NMF

Run NMF on the `iris` dataset. We need to specify a rank (`k`) for the factorization, and will also specify the `seed` for random initialization for reproducibility:

```{R, message = FALSE, warning = FALSE, results = "hide"}
library(RcppML)
library(Matrix)
library(ggplot2)
library(cowplot)

data(iris)
model <- nmf(iris[,1:4], k = 3, seed = 1)
```

```{R}
model
```

## Visualizing NMF Models

The result of `RcppML::nmf` is an S34object of class `nmf`. The `nmf` class has many useful methods:

```{R, warning = FALSE}
methods(class = "nmf")
```

One of these useful methods is `summary` (which in turn has a `plot` method):

```{R}
species_stats <- summary(model, group_by = iris$Species)
species_stats
```

```{R, fig.height = 2.5, fig.width = 3}
plot(species_stats, stat = "sum")
```

Notice how NMF factors capture variable information among iris species.

The `biplot` method for NMF (see `?biplot.nmf` for details) can compare the weights of different features or samples in two factors:

```{R, fig.height = 3, fig.width = 4}
biplot(model, factors = c(1, 2), group_by = iris$Species)
```

## Random Restarts

NMF is randomly initialized, thus results may be slightly different every time. To run NMF many times, set multiple seeds, and the best model will be returned. 

Here we run 10 factorizations at a higher tolerance, and the best model is returned:

```{R, results = "hide"}
model2 <- nmf(iris[,1:4], k = 3, seed = 1:10, tol = 1e-5)
```

```{R}
# MSE of model from single random initialization
evaluate(model, iris[,1:4])

# MSE of best model among 10 random restarts
evaluate(model2, iris[,1:4])
```

The second model is better.

## L1 Regularization

Sparse factors contain only a few non-zero values and make it easy to identify features or samples that are important. 

L1/LASSO regularization is the best method for introducing sparsity into a linear model.

```{R, results = "hide"}
data(movielens)
ratings <- movielens$ratings
model_L1 <- nmf(ratings, k = 7, L1 = 0.1, seed = 123, mask_zeros = TRUE)
```

```{R}
sparsity(model_L1)
```

The `sparsity` S3 method for class `nmf` makes it easy to compute the sparsity of factors, as done above.

Note that `mask_zeros = TRUE` in the example above. This is because zero-valued ratings are missing, and thus should not be considered during factorization.

In the above example, we regularized both $w$ and $h$, however each model can also be regularized separately:

```{R, results = "hide"}
model_no_L1 <- nmf(ratings, k = 7, L1 = 0, seed = 123, mask_zeros = TRUE)
model_L1_h <-  nmf(ratings, k = 7, L1 = c(0, 0.1), seed = 123, mask_zeros = TRUE)
model_L1_w <-  nmf(ratings, k = 7, L1 = c(0.1, 0), seed = 123, mask_zeros = TRUE)

# summarize sparsity of all models in a data.frame
df <- rbind(sparsity(model_no_L1), sparsity(model_L1_h), sparsity(model_L1_w), sparsity(model_L1))
df$side <- c(rep("none", 14), rep("h only", 14), rep("w only", 14), rep("both", 14))
df$side <- factor(df$side, levels = unique(df$side))
```

```{R, fig.height = 3, fig.width = 4}
ggplot(df, aes(x = side, y = sparsity, color = model)) + 
  geom_boxplot(outlier.shape = NA, width = 0.6) + 
  geom_point(position = position_jitterdodge()) + theme_classic() + 
  labs(x = "Regularized side of model", y = "sparsity of model factors")
```

Note how each side of the model is regularized independently.

L1 regularization does not significantly affect model loss:

```{R}
# L1 = 0
evaluate(model_no_L1, movielens$ratings, mask = "zeros")

# L1 = 0.1
evaluate(model_L1, movielens$ratings, mask = "zeros")
```

L1 regularization also does not significantly affect model information at low penalties. Here we measure the cost of bipartite matching between two models on a cosine distance matrix for `L1 = 0`, `L1 = 0.01`, and `L1 = 0.1`:

```{R, results = "hide"}
model_low_L1 <- nmf(movielens$ratings, k = 5, L1 = 0.01, seed = 123)
```

```{R}
# cost of bipartite matching: L1 = 0 vs. L1 = 0.01
bipartiteMatch(1 - cosine(model_no_L1$w, model_low_L1$w))$cost / 10

# cost of bipartite matching: L1 = 0 vs. L1 = 0.1
bipartiteMatch(1 - cosine(model_no_L1$w, model_L1$w))$cost / 10
```

These cosine angles (range 0 to 1) are very small -- in other words, the models are very similar.

See `?RcppML::cosine` for details on very fast computation of cosine similarity.

In the above code, we computed cosine distance by subtracting cosine similarity from 1, matched on this cost matrix, and divided by 10 to find the mean cosine distance between matched factors. In both cases, factors correspond well.

Thus, regularized `RcppML::nmf` increases factor sparsity without significantly affecting the loss or information content of the model.

## Prediction/Recommendation with NMF

NMF models learned on some samples can be projected to other samples, a common routine in recommendation systems or transfer learning.

For instance, we may train a model on some samples and then use it to predict the values in other samples. For instance, in this dataset we predict what bird species are likely to be encountered in a grid of land given information about just a fraction of the species.

```{R, results = "hide", fig.width = 6, fig.height = 3}
data(hawaiibirds)
A <- hawaiibirds$counts

test_grids <- sample(1:ncol(A), ncol(A) / 5)
test_species <- sample(1:nrow(A), nrow(A) * 0.5)

# construct a sparse masking matrix for these species and grids
mask <- matrix(0, nrow(A), ncol(A))
mask[test_species, test_grids] <- 1
mask <- as(mask, "dgCMatrix")

model <- nmf(A, k = 15, mask = mask, tol = 1e-6, seed = 123)

df <- rbind(
  data.frame(
    "observed" = as.vector(A[test_species, test_grids]),
    "predicted" = as.vector(prod(model)[test_species, test_grids]),
    "set" = "test"
  ), data.frame(
    "observed" = as.vector(A[-test_species, -test_grids]),
    "predicted" = as.vector(prod(model)[-test_species, -test_grids]),
    "set" = "train"
  )
)

ggplot(df, aes(observed, predicted, color = set)) + 
  theme_classic() + 
  theme(aspect.ratio = 1) + 
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") + 
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") + 
  geom_point(size = 0.5) + 
  facet_grid(cols = vars(set)) + 
  theme(legend.position = "none")
```

## Cross-validation for rank determination

Cross-validation can assist in finding a reasonable factorization rank. 

Here we determine the optimal rank for the `aml` dataset using cross-validation across three random replicates:

```{R}
data(aml)
cv_data <- crossValidate(aml$data, k = 1:10, reps = 3)
head(cv_data)
```

Use the S4 `plot` method for the `nmfCrossValidation` class to visualize:

```{R}
plot(cv_data)
```

The optimal rank is around `k = 8`.