Robust NMF with random initializations

Non-negative matrix factorization (NMF) is NP-hard (Vavasis, 2007). As such, the best that NMF can do, in practice, is find the best discoverable local minima from some set of initializations.

Non-negative Double SVD (NNDSVD) has previously been proposed as a “head-start” for NMF (Boutsidis, 2008), but never does better than random initializations and often does worse (see my blog post on this).

Reproducibility

Simply, set the seed when calling the function.

library(RcppML)
A <- r_sparsematrix(1000, 1000, inv_density = 16)
not_reproducible_model <- nmf(A, k = 10, seed = NULL) # default
reproducible_model <- nmf(A, k = 10, seed = 123)

Random initialization

Random uniform initializations are the best initializations because they are not local minima and do not assume prior distribution information. Using non-random initializations (like NNDSVD, first proposed by Boutsidis) can trap models in dangerous local minima and mandate a model inspired by orthogonality rather than colinearity.

Random restarts with RcppML

Let’s see how multiple restarts can improve a model. In RcppML, we simply specify multiple seeds to run multiple restarts:

data(hawaiibirds)
m1 <- nmf(hawaiibirds$counts, k = 10, seed = 1)
m2 <- nmf(hawaiibirds$counts, k = 10, seed = 1:10)

evaluate(m1, hawaiibirds$counts)
#> [1] 0.003552783

evaluate(m2, hawaiibirds$counts)
#> [1] 0.003450242

Multiple random restarts help discover better models.

What happens when only a single seed is specified? RcppML uses runif to generate a uniform distribution in a randomly selected range. rnorm is not used just to be safe, because sometimes rnorm can actually give worse results.

What happens when multiple seeds are specified? RcppML generates multiple initializations, for each initialization randomly choosing to use runif or rnorm, and then randomly selecting a uniform range or mean and standard deviation, respectively.

Refining NMF models from warm starts

Suppose we learn an NMF model and want to come back later and fit it further. Alternatively, we might learn a model from one set of samples, and simply want to fit it to another set of samples.

Here we learn an NMF model on half of all survey grids in the hawaiibirds dataset, and then fit it further on the other half:

A <- hawaiibirds$counts
grids1 <- sample(1:ncol(A), floor(ncol(A)/2))
grids2 <- (1:ncol(A))[-grids1]

model1 <- nmf(A[, grids1], k = 15, seed = 123)
model2 <- nmf(A[, grids2], k = 15, seed = model1@w)

cat("model 1 iterations: ", model1@misc$iter,
    ", model 2 iterations: ", model2@misc$iter)
#> model 1 iterations:  15 , model 2 iterations:  33

Here use used model1 as a “warm start” for training model2, but this was not particularly helpful. Suppose we trained model2 from a “cold start”, would we have done better (as measured by MSE)?

model2_new <- nmf(A[, grids2], k = 15, seed = 123)

cat("model 2 warm-start", evaluate(model2_new, A[, grids2]),
    ", model 2 random start: ", evaluate(model2, A[, grids2]))
#> model 2 warm-start 0.002776528 , model 2 random start:  0.002686309

We only do slightly better.

In conclusion, it is almost always better to start from a random initialization. Using prior information is rarely helpful and does not appreciably speed up the fitting process.