--- title: "Robust NMF with random initializations" author: "Zach DeBruine" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Robust NMF with random initializations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Non-negative matrix factorization (NMF) is NP-hard ([Vavasis, 2007](https://arxiv.org/abs/0708.4149)). As such, the best that NMF can do, in practice, is find the best discoverable local minima from some set of initializations. Non-negative Double SVD (NNDSVD) has previously been proposed as a "head-start" for NMF ([Boutsidis, 2008](https://www.sciencedirect.com/science/article/abs/pii/S0031320307004359)), but never does better than random initializations and often does worse (see my [blog post](https://www.zachdebruine.com/post/learning-optimal-nmf-models-from-random-restarts/) on this). ## Reproducibility Simply, set the seed when calling the function. ```{R} library(RcppML) A <- r_sparsematrix(1000, 1000, inv_density = 16) not_reproducible_model <- nmf(A, k = 10, seed = NULL) # default reproducible_model <- nmf(A, k = 10, seed = 123) ``` ## Random initialization Random uniform initializations are the best initializations because they are not local minima and do not assume prior distribution information. Using non-random initializations (like NNDSVD, first proposed by Boutsidis) can trap models in dangerous local minima and mandate a model inspired by orthogonality rather than colinearity. ## Random restarts with RcppML Let's see how multiple restarts can improve a model. In RcppML, we simply specify multiple seeds to run multiple restarts: ```{R, message = FALSE, warning = FALSE} data(hawaiibirds) m1 <- nmf(hawaiibirds$counts, k = 10, seed = 1) m2 <- nmf(hawaiibirds$counts, k = 10, seed = 1:10) ``` ```{R} evaluate(m1, hawaiibirds$counts) ``` ```{R} evaluate(m2, hawaiibirds$counts) ``` Multiple random restarts help discover better models. What happens when only a single seed is specified? RcppML uses `runif` to generate a uniform distribution in a randomly selected range. `rnorm` is not used just to be safe, because sometimes `rnorm` can actually give worse results. What happens when multiple seeds are specified? RcppML generates multiple initializations, for each initialization randomly choosing to use `runif` or `rnorm`, and then randomly selecting a uniform range or mean and standard deviation, respectively. ## Refining NMF models from warm starts Suppose we learn an NMF model and want to come back later and fit it further. Alternatively, we might learn a model from one set of samples, and simply want to fit it to another set of samples. Here we learn an NMF model on half of all survey grids in the `hawaiibirds` dataset, and then fit it further on the other half: ```{R} A <- hawaiibirds$counts grids1 <- sample(1:ncol(A), floor(ncol(A)/2)) grids2 <- (1:ncol(A))[-grids1] model1 <- nmf(A[, grids1], k = 15, seed = 123) model2 <- nmf(A[, grids2], k = 15, seed = model1@w) cat("model 1 iterations: ", model1@misc$iter, ", model 2 iterations: ", model2@misc$iter) ``` Here use used model1 as a "warm start" for training model2, but this was not particularly helpful. Suppose we trained model2 from a "cold start", would we have done better (as measured by MSE)? 
## Refining NMF models from warm starts

Suppose we learn an NMF model and want to come back later and fit it further. Alternatively, we might learn a model from one set of samples, and simply want to fit it to another set of samples. Here we learn an NMF model on half of all survey grids in the `hawaiibirds` dataset, and then fit it further on the other half:

```{R}
A <- hawaiibirds$counts
grids1 <- sample(1:ncol(A), floor(ncol(A)/2))
grids2 <- (1:ncol(A))[-grids1]
model1 <- nmf(A[, grids1], k = 15, seed = 123)
model2 <- nmf(A[, grids2], k = 15, seed = model1@w)
cat("model 1 iterations: ", model1@misc$iter, ", model 2 iterations: ", model2@misc$iter)
```

Here we used `model1` as a "warm start" for training `model2`, but this was not particularly helpful. Had we trained `model2` from a "cold start" instead, would we have done better (as measured by MSE)?

```{R}
model2_new <- nmf(A[, grids2], k = 15, seed = 123)
cat("model 2 cold start: ", evaluate(model2_new, A[, grids2]), ", model 2 warm start: ", evaluate(model2, A[, grids2]))
```

The cold start does only slightly better. In conclusion, it is almost always better to start from a random initialization. Using prior information is rarely helpful and does not appreciably speed up the fitting process.
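As an informal check of that speed claim, one could time the warm and cold starts directly. This is only a sketch; exact timings will vary by machine and by the convergence tolerance in use:

```{R}
# Compare wall-clock time for a warm start (seeded with model1@w)
# against a cold random start (seeded with an integer).
system.time(nmf(A[, grids2], k = 15, seed = model1@w))
system.time(nmf(A[, grids2], k = 15, seed = 123))
```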