Non-negative matrix factorization (NMF) is NP-hard (Vavasis, 2007). As such, the best that NMF can do, in practice, is find the best discoverable local minima from some set of initializations.
Non-negative Double SVD (NNDSVD) has previously been proposed as a “head-start” for NMF (Boutsidis, 2008), but never does better than random initializations and often does worse (see my blog post on this).
Simply, set the seed when calling the function.
Random uniform initializations are the best initializations because they are not local minima and do not assume prior distribution information. Using non-random initializations (like NNDSVD, first proposed by Boutsidis) can trap models in dangerous local minima and mandate a model inspired by orthogonality rather than colinearity.
Let’s see how multiple restarts can improve a model. In RcppML, we simply specify multiple seeds to run multiple restarts:
data(hawaiibirds)
m1 <- nmf(hawaiibirds$counts, k = 10, seed = 1)
m2 <- nmf(hawaiibirds$counts, k = 10, seed = 1:10)
Multiple random restarts help discover better models.
What happens when only a single seed is specified? RcppML uses
runif
to generate a uniform distribution in a randomly
selected range. rnorm
is not used just to be safe, because
sometimes rnorm
can actually give worse results.
What happens when multiple seeds are specified? RcppML generates
multiple initializations, for each initialization randomly choosing to
use runif
or rnorm
, and then randomly
selecting a uniform range or mean and standard deviation,
respectively.
Suppose we learn an NMF model and want to come back later and fit it further. Alternatively, we might learn a model from one set of samples, and simply want to fit it to another set of samples.
Here we learn an NMF model on half of all survey grids in the
hawaiibirds
dataset, and then fit it further on the other
half:
A <- hawaiibirds$counts
grids1 <- sample(1:ncol(A), floor(ncol(A)/2))
grids2 <- (1:ncol(A))[-grids1]
model1 <- nmf(A[, grids1], k = 15, seed = 123)
model2 <- nmf(A[, grids2], k = 15, seed = model1@w)
cat("model 1 iterations: ", model1@misc$iter,
", model 2 iterations: ", model2@misc$iter)
#> model 1 iterations: 15 , model 2 iterations: 33
Here use used model1 as a “warm start” for training model2, but this was not particularly helpful. Suppose we trained model2 from a “cold start”, would we have done better (as measured by MSE)?
model2_new <- nmf(A[, grids2], k = 15, seed = 123)
cat("model 2 warm-start", evaluate(model2_new, A[, grids2]),
", model 2 random start: ", evaluate(model2, A[, grids2]))
#> model 2 warm-start 0.002776528 , model 2 random start: 0.002686309
We only do slightly better.
In conclusion, it is almost always better to start from a random initialization. Using prior information is rarely helpful and does not appreciably speed up the fitting process.