This vignette demonstrates basic usage of the RcppML::nmf function and visualization of the results.
Install the RcppML R package from CRAN or the development version from GitHub.
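For example (the GitHub repository name below is an assumption; check the package documentation for the canonical location):

# released version from CRAN
install.packages("RcppML")
# development version from GitHub (repository assumed to be zdebruine/RcppML)
# install.packages("devtools")
devtools::install_github("zdebruine/RcppML")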
Non-negative Matrix Factorization (NMF) finds additive signals in non-negative data in terms of the features and samples associated with those signals.
NMF gives an approximation of an input matrix as the product of two low-rank matrices:
A = wh
Here, A is the input matrix, w is a tall matrix with features in rows and factors in columns, and h is a wide matrix with factors in rows and samples in columns.
RcppML::nmf introduces one more important component into this system, a scaling diagonal, d, giving the model the form A = wdh. The scaling diagonal provides consistent factor scalings regardless of random initialization, symmetry in factorizations of symmetric matrices, and a convenient means of convex L1 regularization.
Run NMF on the iris dataset. We need to specify a rank (k) for the factorization, and will also specify the seed for random initialization for reproducibility:
library(RcppML)
library(Matrix)
library(ggplot2)
library(cowplot)
data(iris)
model <- nmf(iris[,1:4], k = 3, seed = 1)
model
#> 150 x 4 x 3 factor model of class "nmf"
#> @ w
#> nmf1 nmf2 nmf3
#> [1,] 7.247732e-05 7.729860e-04 1.481245e-02
#> [2,] . 1.732147e-03 1.302637e-02
#> [3,] 6.215598e-05 8.130570e-04 1.354473e-02
#> [4,] 5.138898e-04 1.325539e-03 1.252291e-02
#> [5,] 9.180207e-04 . 1.466685e-02
#> [6,] 2.343225e-03 . 1.483233e-02
#> [7,] 1.622847e-03 . 1.302818e-02
#> ...suppressing 143 rows
#>
#> @ d
#> 751.569 680.8483 650.4042
#>
#> @ h
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> nmf1 0.32778323 0.23147341 0.29211248 0.14863088
#> nmf2 0.44685928 0.07648536 0.39024299 0.08641237
#> nmf3 0.50380813 0.35848065 0.12272530 0.01498593
#>
#> @ misc
#> List of 4
#> $ tol : num 1e-04
#> $ iter : num 6
#> $ runtime: 'difftime' num 0.00352048873901367
#> ..- attr(*, "units")= chr "secs"
#> $ w_init : num [1:3, 1:150] 0.266 0.372 0.573 0.908 0.202 ...
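As a quick check on the role of the scaling diagonal: the rows of h printed above each sum to approximately 1, the columns of w are scaled in the same way, and d holds the magnitude of each factor. This can be verified directly from the @ w, @ d, and @ h slots shown in the printout:

# columns of w and rows of h are scaled to sum to (approximately) 1
colSums(model@w)
rowSums(model@h)
# d holds the magnitude of each factor
model@d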
The result of RcppML::nmf is an S4 object of class nmf. The nmf class has many useful methods:
methods(class = "nmf")
#> [1] $ [ [[ align biplot coerce dim dimnames
#> [9] evaluate head predict prod show sort sparsity subset
#> [17] summary t
#> see '?methods' for accessing help and source code
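Two of these methods are used later in this vignette: prod reconstructs the low-rank approximation w %*% diag(d) %*% h, and evaluate computes the mean squared error of a model against an input matrix. A minimal sketch:

# reconstruct the approximation of the input matrix
A_hat <- prod(model)
dim(A_hat)
# mean squared error of the reconstruction, computed by hand
mean((as.matrix(iris[, 1:4]) - A_hat)^2)
# evaluate reports the model MSE directly
evaluate(model, iris[, 1:4])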
One of these useful methods is summary (which in turn has a plot method):
species_stats <- summary(model, group_by = iris$Species)
species_stats
#> group factor stat
#> 1 setosa nmf1 0.03254147
#> 2 versicolor nmf1 0.34688642
#> 3 virginica nmf1 0.62057211
#> 4 setosa nmf2 0.04228313
#> 5 versicolor nmf2 0.44358369
#> 6 virginica nmf2 0.51413318
#> 7 setosa nmf3 0.70111729
#> 8 versicolor nmf3 0.23548147
#> 9 virginica nmf3 0.06340124
Notice how the NMF factors capture information that varies among iris species.
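The plot method for this summary can visualize these proportions directly:

plot(species_stats)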
The biplot method for NMF (see ?biplot.nmf for details) can compare the weights of different features or samples in two factors:
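A sketch of such a call follows; the factors and group_by arguments are assumptions here, so consult ?biplot.nmf for the actual interface:

# compare the weights of the 150 iris samples in the first two factors,
# colored by species (argument names are assumed; see ?biplot.nmf)
biplot(model, factors = c(1, 2), group_by = iris$Species)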
Because NMF is randomly initialized, results may differ slightly between runs. To run NMF many times, supply multiple seeds; the best model among them will be returned.
Here we run 10 factorizations at a higher tolerance, and the best model is returned:
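One way to do this, assuming a vector of seeds triggers multiple restarts (the tol value below is illustrative), is:

# 10 random restarts; the model with the lowest loss is returned
model2 <- nmf(iris[, 1:4], k = 3, seed = 1:10, tol = 1e-3)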
# MSE of model from single random initialization
evaluate(model, iris[,1:4])
#> [1] 0.00744055
# MSE of best model among 10 random restarts
evaluate(model2, iris[,1:4])
#> [1] 0.006025415
The second model achieves a lower MSE and is the better fit.
Sparse factors contain only a few non-zero values, making it easy to identify the features or samples that are important. L1/LASSO regularization is a simple and effective way to introduce sparsity into a linear model such as NMF.
data(movielens)
ratings <- movielens$ratings
model_L1 <- nmf(ratings, k = 7, L1 = 0.1, seed = 123, mask_zeros = TRUE)
sparsity(model_L1)
#> factor sparsity model
#> 1 nmf1 0.6193432 w
#> 2 nmf2 0.8774244 w
#> 3 nmf3 0.6346005 w
#> 4 nmf4 0.6030515 w
#> 5 nmf5 0.8986294 w
#> 6 nmf6 0.2927334 w
#> 7 nmf7 0.4569434 w
#> 8 nmf1 0.6786885 h
#> 9 nmf2 0.2819672 h
#> 10 nmf3 0.6540984 h
#> 11 nmf4 0.7114754 h
#> 12 nmf5 0.4311475 h
#> 13 nmf6 0.9245902 h
#> 14 nmf7 0.9426230 h
The sparsity method for the nmf class makes it easy to compute the sparsity of factors, as done above.
Note that mask_zeros = TRUE in the example above: zero-valued ratings are missing values rather than true zeros, and so should not be considered during factorization.
In the example above we regularized both w and h; however, each side of the model can also be regularized separately:
model_no_L1 <- nmf(ratings, k = 7, L1 = 0, seed = 123, mask_zeros = TRUE)
model_L1_h <- nmf(ratings, k = 7, L1 = c(0, 0.1), seed = 123, mask_zeros = TRUE)
model_L1_w <- nmf(ratings, k = 7, L1 = c(0.1, 0), seed = 123, mask_zeros = TRUE)
# summarize sparsity of all models in a data.frame
df <- rbind(sparsity(model_no_L1), sparsity(model_L1_h), sparsity(model_L1_w), sparsity(model_L1))
df$side <- c(rep("none", 14), rep("h only", 14), rep("w only", 14), rep("both", 14))
df$side <- factor(df$side, levels = unique(df$side))
ggplot(df, aes(x = side, y = sparsity, color = model)) +
geom_boxplot(outlier.shape = NA, width = 0.6) +
geom_point(position = position_jitterdodge()) + theme_classic() +
labs(x = "Regularized side of model", y = "sparsity of model factors")
Note how each side of the model is regularized independently.
L1 regularization does not significantly affect model loss:
# L1 = 0
evaluate(model_no_L1, movielens$ratings, mask = "zeros")
#> [1] 6.911534
# L1 = 0.1
evaluate(model_L1, movielens$ratings, mask = "zeros")
#> [1] 7.418375
L1 regularization also does not significantly affect the information content of the model at low penalties. Here we measure the cost of bipartite matching between pairs of models on a cosine distance matrix, comparing the L1 = 0 model with the L1 = 0.01 and L1 = 0.1 models:
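The L1 = 0.01 model (model_low_L1) can be fit in the same way as the other models above:

# model with a small L1 penalty, mirroring the earlier calls
model_low_L1 <- nmf(ratings, k = 7, L1 = 0.01, seed = 123, mask_zeros = TRUE)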
# cost of bipartite matching: L1 = 0 vs. L1 = 0.01
bipartiteMatch(1 - cosine(model_no_L1$w, model_low_L1$w))$cost / 10
#> [1] 0.03909451
# cost of bipartite matching: L1 = 0 vs. L1 = 0.1
bipartiteMatch(1 - cosine(model_no_L1$w, model_L1$w))$cost / 10
#> [1] 0.01713988
These mean cosine distances (on a scale of 0 to 1) are very small – in other words, the models are very similar.
See ?RcppML::cosine for details on very fast computation of cosine similarity.
In the code above, we computed cosine distance by subtracting cosine similarity from 1, matched factors on this cost matrix, and scaled the total matching cost to approximate the mean cosine distance between matched factors. In both cases, the factors correspond well.
Thus, regularized RcppML::nmf
increases factor sparsity
without significantly affecting the loss or information content of the
model.
NMF models learned on some samples can be projected onto other samples, a common routine in recommender systems and transfer learning.
For instance, we may train a model on a subset of the data and use it to predict held-out values. In the hawaiibirds dataset below, we predict which bird species are likely to be encountered in a grid of land given information about just a fraction of the species.
data(hawaiibirds)
A <- hawaiibirds$counts
test_grids <- sample(1:ncol(A), ncol(A) / 5)
test_species <- sample(1:nrow(A), nrow(A) * 0.5)
# construct a sparse masking matrix for these species and grids
mask <- matrix(0, nrow(A), ncol(A))
mask[test_species, test_grids] <- 1
mask <- as(mask, "dgCMatrix")
model <- nmf(A, k = 15, mask = mask, tol = 1e-6, seed = 123)
df <- rbind(
data.frame(
"observed" = as.vector(A[test_species, test_grids]),
"predicted" = as.vector(prod(model)[test_species, test_grids]),
"set" = "test"
), data.frame(
"observed" = as.vector(A[-test_species, -test_grids]),
"predicted" = as.vector(prod(model)[-test_species, -test_grids]),
"set" = "train"
)
)
ggplot(df, aes(observed, predicted, color = set)) +
theme_classic() +
theme(aspect.ratio = 1) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") +
scale_x_continuous(expand = c(0, 0), limits = c(0, 1), trans = "sqrt") +
geom_point(size = 0.5) +
facet_grid(cols = vars(set)) +
theme(legend.position = "none")
#> Warning: Removed 150 rows containing missing values or values outside the scale range
#> (`geom_point()`).
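To put a number on the agreement shown in the plot, one option is to compute the correlation between observed and predicted counts within each set (values will vary between runs because the train/test split above is random):

# Pearson correlation of observed vs. predicted counts, by set
sapply(split(df, df$set), function(x) cor(x$observed, x$predicted))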
Cross-validation can assist in finding a reasonable factorization rank.
Here we determine the optimal rank for the aml dataset using cross-validation across three random replicates:
data(aml)
cv_data <- crossValidate(aml$data, k = 1:10, reps = 3)
#> 'as(<ngCMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "dMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
head(cv_data)
#> rep k value
#> 1 1 1 0.03619901
#> 2 1 2 0.02974895
#> 3 1 3 0.02757322
#> 4 1 4 0.02599658
#> 5 1 5 0.02490966
#> 6 1 6 0.02413747
Use the S4 plot method for the nmfCrossValidation class to visualize:
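The cross-validation results returned above can be passed directly to plot:

plot(cv_data)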
The optimal rank is around k = 8.