Title: | Rcpp Machine Learning Library |
---|---|
Description: | Fast machine learning algorithms including matrix factorization and divisive clustering for large sparse and dense matrices. |
Authors: | Zach DeBruine [aut, cre] |
Maintainer: | Zach DeBruine <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 0.5.6 |
Built: | 2024-11-21 06:23:39 UTC |
Source: | https://github.com/zdebruine/rcppml |
Align two NMF models
align(object, ...) ## S4 method for signature 'nmf' align(object, ref, method = "cosine", ...)
align(object, ...) ## S4 method for signature 'nmf' align(object, ref, method = "cosine", ...)
object |
nmf model to be aligned to |
... |
arguments passed to or from other methods |
ref |
reference nmf model to which |
method |
either |
For nmf
models, factors in object
are reordered to minimize the cost of bipartite matching
(see bipartiteMatch
) on a cosine
or correlation distance matrix. The matrix is used for matching, and
must be equidimensional in
object
and ref
.
Same as the align
S4 method for the 'nmf' class, but operates only on the 'w' matrices.
align_models(w, wref, method = "cosine", ...)
align_models(w, wref, method = "cosine", ...)
w |
matrix with columns to be aligned to columns in |
wref |
reference matrix to which columns in |
method |
distance metric (either |
... |
additional arguments |
DNA methylation in ~800 regions in 123 Acute Myelogenous Leukemia (AML) samples classified by probable cell of origin (GMP, L-MPP, or MEP), and 5 samples from healthy references for each suspected cell of origin (GMP, L-MPP, MEP).
data(aml)
data(aml)
list of dense matrix of features (methylated regions) as rows and samples ("AML" or "Control" samples, classified by putative cell of origin or reference cell type) as columns. The "metadata_h" maps to publicly available clinical metadata.
AML methylation signatures differ from their cell of origin by additive methylation and subtractive methylation, in addition to tumor-specific heterogeneity. These AML tumors likely originated from one of three healthy cell types (GMP, LMPP, or MEP), the challenge is to classify the cell of origin based on these healthy cell type DMRs.
Hungarian algorithm for matching samples in a bipartite graph from a distance ("cost") matrix
bipartiteMatch(x)
bipartiteMatch(x)
x |
symmetric matrix giving the cost of every possible pairing |
This implementation was adapted from RcppHungarian, an Rcpp wrapper for the original C++ implementation by Cong Ma (2016).
List of "cost" and "pairs"
Spectral biparitioning by rank-2 matrix factorization
bipartition(data, tol = 1e-05, nonneg = TRUE, ...)
bipartition(data, tol = 1e-05, nonneg = TRUE, ...)
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
tol |
tolerance of the fit |
nonneg |
enforce non-negativity of the rank-2 factorization used for bipartitioning |
... |
development parameters |
Spectral bipartitioning is a popular subroutine in divisive clustering. The sign of the difference between sample loadings in factors of a rank-2 matrix factorization gives a bipartition that is nearly identical to an SVD.
Rank-2 matrix factorization by alternating least squares is faster than rank-2-truncated SVD (i.e. irlba).
This function is a specialization of rank-2 nmf
with support for factorization of only a subset of samples, and with additional calculations on the factorization model relevant to bipartitioning. See nmf
for details regarding rank-2 factorization.
A list giving the bipartition and useful statistics:
v : vector giving difference between sample loadings between factors in a rank-2 factorization
dist : relative cosine distance of samples within a cluster to centroids of assigned vs. not-assigned cluster
size1 : number of samples in first cluster (positive loadings in 'v')
size2 : number of samples in second cluster (negative loadings in 'v')
samples1: indices of samples in first cluster
samples2: indices of samples in second cluster
center1 : mean feature loadings across samples in first cluster
center2 : mean feature loadings across samples in second cluster
Several parameters may be specified in the ...
argument:
diag = TRUE
: scale factors in and
to sum to 1 by introducing a diagonal,
. This should generally never be set to
FALSE
. Diagonalization enables symmetry of models in factorization of symmetric matrices, convex L1 regularization, and consistent factor scalings.
samples = 1:ncol(A)
: samples to include in bipartition, numbered from 1 to ncol(A)
. Default is all samples.
calc_dist = TRUE
: calculate the relative cosine distance of samples within a cluster to either cluster centroid. If TRUE
, centers for clusters will also be calculated.
seed = NULL
: random seed for model initialization, generally not needed for rank-2 factorizations because robust solutions are recovered when diag = TRUE
maxit = 100
: maximum number of alternating updates of and
. Generally, rank-2 factorizations converge quickly and this should not need to be adjusted.
Zach DeBruine
Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.
## Not run: library(Matrix) data(iris) A <- as(as.matrix(iris[,1:4]), "dgCMatrix") bipartition(A, calc_dist = TRUE) ## End(Not run)
## Not run: library(Matrix) data(iris) A <- as(as.matrix(iris[,1:4]), "dgCMatrix") bipartition(A, calc_dist = TRUE) ## End(Not run)
Produces a biplot from the output of nmf
## S4 method for signature 'nmf' biplot(x, factors = c(1, 2), matrix = "w", group_by = NULL, ...)
## S4 method for signature 'nmf' biplot(x, factors = c(1, 2), matrix = "w", group_by = NULL, ...)
x |
an object of class " |
factors |
length 2 vector specifying factors to plot. |
matrix |
either |
group_by |
a discrete factor giving groupings for samples or features. Must be of the same length as number of samples in |
... |
for consistency with |
ggplot2 object
Column-by-column Euclidean norm cosine similarity for a matrix, pair of matrices, pair of vectors, or pair of a vector and matrix. Supports sparse matrices.
cosine(x, y = NULL)
cosine(x, y = NULL)
x |
matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector" |
y |
(optional) matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector" |
This function takes advantage of extremely fast vector operations and is able to handle very large datasets.
cosine
applies a Euclidean norm to provide very similar results to Pearson correlation. Note that negative values may be returned due to the use of Euclidean normalization when all associations are largely random.
This function adopts the sparse matrix computational strategy applied by qlcMatrix::cosSparse
, and extends it to any combination of single and/or pair of sparse matrix and/or dense vector.
dense matrix, vector, or value giving cosine distances
Find an "optimal" rank for a Non-Negative Matrix Factorization using cross-validation. Returns a data.frame
with class nmfCrossValidate
. Plot results using the plot
class method.
crossValidate(data, k, reps = 3, n = 0.05, verbose = FALSE, ...) ## S3 method for class 'nmfCrossValidate' plot(x, ...)
crossValidate(data, k, reps = 3, n = 0.05, verbose = FALSE, ...) ## S3 method for class 'nmfCrossValidate' plot(x, ...)
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
k |
array of factorization ranks to test |
reps |
number of independent replicates to run |
n |
fraction of values to handle as missing (default is 5%, or |
verbose |
should updates be displayed when each factorization is completed |
... |
parameters to |
x |
|
A random speckled pattern of values is masked off during model fitting, and the mean squared error of prediction is evaluated after the model has reached the desired tolerance. The rank at which the model achieves the lowest error (best prediction accuracy) is the optimal rank.
data.frame
with class nmfCrossValidate
with columns rep
, k
, and value
Recursive bipartitioning by rank-2 matrix factorization with an efficient modularity-approximate stopping criteria
dclust( A, min_samples, min_dist = 0, tol = 1e-05, maxit = 100, nonneg = TRUE, seed = NULL )
dclust( A, min_samples, min_dist = 0, tol = 1e-05, maxit = 100, nonneg = TRUE, seed = NULL )
A |
matrix of features-by-samples in sparse format (preferred class is "Matrix::dgCMatrix") |
min_samples |
stopping criteria giving the minimum number of samples permitted in a cluster |
min_dist |
stopping criteria giving the minimum cosine distance of samples within a cluster to the center of their assigned vs. unassigned cluster. If |
tol |
in rank-2 NMF, the correlation distance ( |
maxit |
maximum number of fitting iterations |
nonneg |
in rank-2 NMF, enforce non-negativity |
seed |
random seed for rank-2 NMF model initialization |
Divisive clustering is a sensitive and fast method for sample classification. Samples are recursively partitioned into two groups until a stopping criteria is satisfied and prevents successful partitioning.
See nmf
and bipartition
for technical considerations and optimizations relevant to bipartitioning.
Stopping criteria. Two stopping criteria are used to prevent indefinite division of clusters and tune the clustering resolution to a desirable range:
min_samples
: Minimum number of samples permitted in a cluster
min_dist
: Minimum cosine distance of samples to their cluster center relative to their unassigned cluster center (an approximation of Newman-Girvan modularity)
Newman-Girvan modularity () is an interpretable and widely used measure of modularity for a bipartition. However, it requires the calculation of distance between all within-cluster and between-cluster sample pairs. This is computationally intensive, especially for large sample sets.
dclust
uses a measure which linearly approximates Newman-Girvan modularity, and simply requires the calculation of distance between all samples in a cluster and both cluster centers (the assigned and unassigned center), which is orders of magnitude faster to compute. Cosine distance is used instead of Euclidean distance since it handles outliers and sparsity well.
A bipartition is rejected if either of the two clusters contains fewer than min_samples
or if the mean relative cosine distance of the bipartition is less than min_dist
.
A bipartition will only be attempted if there are more than 2 * min_samples
samples in the cluster, meaning that dist
may not be calculated for some clusters.
Reproducibility. Because rank-2 NMF is approximate and requires random initialization, results may vary slightly across restarts. Therefore, specify a seed
to guarantee absolute reproducibility.
Other than setting the seed, reproducibility may be improved by setting tol
to a smaller number to increase the exactness of each bipartition.
A list of lists corresponding to individual clusters:
id : character sequence of "0" and "1" giving position of clusters along splitting hierarchy
samples : indices of samples in the cluster
center : mean feature expression of all samples in the cluster
dist : if applicable, relative cosine distance of samples in cluster to assigned/unassigned cluster center.
leaf : is cluster a leaf node
Zach DeBruine
Schwartz, G. et al. "TooManyCells identifies and visualizes relationships of single-cell clades". Nature Methods (2020).
Newman, MEJ. "Modularity and community structure in networks". PNAS (2006)
Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.
## Not run: data(USArrests) A <- Matrix::as(as.matrix(t(USArrests)), "dgCMatrix") clusters <- dclust(A, min_samples = 2, min_dist = 0.001) str(clusters) ## End(Not run)
## Not run: data(USArrests) A <- Matrix::as(as.matrix(t(USArrests)), "dgCMatrix") clusters <- dclust(A, min_samples = 2, min_dist = 0.001) str(clusters) ## End(Not run)
Calculate mean squared error for an NMF model, accounting for any masking schemes requested during fitting.
evaluate(x, ...) ## S4 method for signature 'nmf' evaluate(x, data, mask = NULL, missing_only = FALSE, ...)
evaluate(x, ...) ## S4 method for signature 'nmf' evaluate(x, data, mask = NULL, missing_only = FALSE, ...)
x |
fitted model, class |
... |
development parameters |
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
mask |
dense or sparse matrix of values in |
missing_only |
calculate mean squared error only for missing values specified as a matrix in |
Frequency of bird species observation within 1km-squared grids in Hawaii as recorded in the eBird project. Not all counts are complete because some grids are sampled better than others (i.e. urban areas, birding hotspots).
data(hawaiibirds)
data(hawaiibirds)
list of three components: counts
giving mean total counts of species in each grid, metadata_h
giving information about each grid (i.e. latitude, longitude, and island), and metadata_w
giving information about species taxonomic classification.
This dataset was obtained as follows:
All eBird observations in Hawaii were downloaded from the eBird website as of Oct. 2021
Only complete checklists were retained
Only non-X counts were retained
Species were removed with fewer than 50 observations or that were spuhs
Grids were defined by latitude/longitude rounded to two decimal places
Grids were removed with fewer than 10 checklists
Mean frequency of species in each grid was calculated based on the number of times a species was observed and the number of checklists submitted in that grid.
Grids were assigned to one of each major Hawaiian island based on geographical coordinates.
Run lNMF on a list of datasets to separate shared and unique signals.
lnmf( data, k_wh, k_uv, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL )
lnmf( data, k_wh, k_uv, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL )
data |
list of dense or sparse matrices giving features in rows and samples in columns. Rows in all matrices must correspond exactly. Prefer |
k_wh |
rank of the shared factorization |
k_uv |
ranks of the unique factorizations, an array corresponding to each dataset provided |
tol |
tolerance of the fit |
maxit |
maximum number of fitting iterations |
L1 |
LASSO penalties in the range (0, 1], single value or array of length two for |
L2 |
Ridge penalties greater than zero, single value or array of length two for |
seed |
single initialization seed or array of seeds. If multiple seeds are provided, the model with least mean squared error is returned. |
mask |
list of dense or sparse matrices indicating values in |
Detailed documentation to come
an object of class lnmf
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
~250,000 Ratings of ~3800 movies across 19 genres by 610 users on a scale of 1 to 5 stars. Zeros indicate no rating.
data(movielens)
data(movielens)
list of two components: ratings
matrix in Matrix::dgCMatrix
format of movies (rows) vs. users (columns), and genres
as a sparse logical matrix in Matrix::lgCMatrix
format, giving genres (rows) to which each movie (columns) is assigned.
This dataset was derived from https://grouplens.org/datasets/movielens/ (ml-latest-small.zip), and movies without a genre assignment were removed. Half-star ratings were rounded up. Movies with fewer than 5 ratings were removed.
Same as the evaluate
S4 method for the nmf
class, but allows one to input the 'w', 'd', 'h', and 'data' independently.
mse(w, d = NULL, h, data, mask = NULL, missing_only = FALSE, ...)
mse(w, d = NULL, h, data, mask = NULL, missing_only = FALSE, ...)
w |
feature factor matrix (features as rows) |
d |
scaling diagonal vector (if applicable) |
h |
sample factor matrix (samples as columns) |
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
mask |
dense or sparse matrix of values in |
missing_only |
only calculate mean squared error at masked values |
... |
additional arguments |
High-performance NMF of the form for large dense or sparse matrices, returns an object of class
nmf
.
nmf( data, k, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL, ... )
nmf( data, k, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL, ... )
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
k |
rank |
tol |
tolerance of the fit |
maxit |
maximum number of fitting iterations |
L1 |
LASSO penalties in the range (0, 1], single value or array of length two for |
L2 |
Ridge penalties greater than zero, single value or array of length two for |
seed |
single initialization seed or array, or a matrix or list of matrices giving initial |
mask |
dense or sparse matrix of values in |
... |
development parameters |
This fast NMF implementation decomposes a matrix into lower-rank non-negative matrices
and
,
with columns of
and rows of
scaled to sum to 1 via multiplication by a diagonal,
:
The scaling diagonal ensures convex L1 regularization, consistent factor scalings regardless of random initialization, and model symmetry in factorizations of symmetric matrices.
The factorization model is randomly initialized. and
are updated by alternating least squares.
RcppML achieves high performance using the Eigen C++ linear algebra library, OpenMP parallelization, a dedicated Rcpp sparse matrix class, and fast sequential coordinate descent non-negative least squares initialized by Cholesky least squares solutions.
Sparse optimization is automatically applied if the input matrix A
is a sparse matrix (i.e. Matrix::dgCMatrix
). There are also specialized back-ends for symmetric, rank-1, and rank-2 factorizations.
L1 penalization can be used for increasing the sparsity of factors and assisting interpretability. Penalty values should range from 0 to 1, where 1 gives complete sparsity.
Set options(RcppML.verbose = TRUE)
to print model tolerances to the console after each iteration.
Parallelization is applied with OpenMP using the number of threads in getOption("RcppML.threads")
and set by option(RcppML.threads = 0)
, for example. 0
corresponds to all threads, let OpenMP decide.
object of class nmf
w
feature factor matrix
d
scaling diagonal vector
h
sample factor matrix
misc
list often containing components:
tol : tolerance of fit
iter : number of fitting updates
runtime : runtime in seconds
mse : mean squared error of model (calculated for multiple starts only)
w_init : initial w matrix used for model fitting
S4 methods available for the nmf
class:
predict
: project an NMF model (or partial model) onto new samples
evaluate
: calculate mean squared error loss of an NMF model
summary
: data.frame
giving fractional
, total
, or mean
representation of factors in samples or features grouped by some criteria
align
: find an ordering of factors in one nmf
model that best matches those in another nmf
model
prod
: compute the dense approximation of input data
sparsity
: compute the sparsity of each factor in and
subset
: subset, reorder, select, or extract factors (same as [
)
generics such as dim
, dimnames
, t
, show
, head
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
Lin, X, and Boutros, PC (2020). "Optimization and expansion of non-negative matrix factorization." BMC Bioinformatics.
Lee, D, and Seung, HS (1999). "Learning the parts of objects by non-negative matrix factorization." Nature.
Franc, VC, Hlavac, VC, Navara, M. (2005). "Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem". Proc. Int'l Conf. Computer Analysis of Images and Patterns. Lecture Notes in Computer Science.
## Not run: # basic NMF model <- nmf(rsparsematrix(1000, 100, 0.1), k = 10) # compare rank-2 NMF to second left vector in an SVD data(iris) A <- Matrix::as(as.matrix(iris[, 1:4]), "dgCMatrix") nmf_model <- nmf(A, 2, tol = 1e-5) bipartitioning_vector <- apply(nmf_model$w, 1, diff) second_left_svd_vector <- base::svd(A, 2)$u[, 2] abs(cor(bipartitioning_vector, second_left_svd_vector)) # compare rank-1 NMF with first singular vector in an SVD abs(cor(nmf(A, 1)$w[, 1], base::svd(A, 2)$u[, 1])) # symmetric NMF A <- crossprod(rsparsematrix(100, 100, 0.02)) model <- nmf(A, 10, tol = 1e-5, maxit = 1000) plot(model$w, t(model$h)) # see package vignette for more examples ## End(Not run)
## Not run: # basic NMF model <- nmf(rsparsematrix(1000, 100, 0.1), k = 10) # compare rank-2 NMF to second left vector in an SVD data(iris) A <- Matrix::as(as.matrix(iris[, 1:4]), "dgCMatrix") nmf_model <- nmf(A, 2, tol = 1e-5) bipartitioning_vector <- apply(nmf_model$w, 1, diff) second_left_svd_vector <- base::svd(A, 2)$u[, 2] abs(cor(bipartitioning_vector, second_left_svd_vector)) # compare rank-1 NMF with first singular vector in an SVD abs(cor(nmf(A, 1)$w[, 1], base::svd(A, 2)$u[, 1])) # symmetric NMF A <- crossprod(rsparsematrix(100, 100, 0.02)) model <- nmf(A, 10, tol = 1e-5, maxit = 1000) plot(model$w, t(model$h)) # see package vignette for more examples ## End(Not run)
Solves the equation a %*% x = b
for x
subject to .
nnls(a, b, cd_maxit = 100L, cd_tol = 1e-08, L1 = 0, L2 = 0, upper_bound = 0)
nnls(a, b, cd_maxit = 100L, cd_tol = 1e-08, L1 = 0, L2 = 0, upper_bound = 0)
a |
symmetric positive definite matrix giving coefficients of the linear system |
b |
matrix giving the right-hand side(s) of the linear system |
cd_maxit |
maximum number of coordinate descent iterations |
cd_tol |
stopping criteria, difference in |
L1 |
L1/LASSO penalty to be subtracted from |
L2 |
Ridge penalty by which to shrink the diagonal of |
upper_bound |
maximum value permitted in solution, set to |
This is a very fast implementation of sequential coordinate descent non-negative least squares (NNLS), suitable for very small or very large systems.
The algorithm begins with a zero-filled initialization of x
.
Least squares by sequential coordinate descent is used to ensure the solution returned is exact. This algorithm was introduced by Franc et al. (2005), and our implementation is a vectorized and optimized rendition of that found in the NNLM R package by Xihui Lin (2020).
vector or matrix giving solution for x
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
Franc, VC, Hlavac, VC, and Navara, M. (2005). "Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. Proc. Int'l Conf. Computer Analysis of Images and Patterns."
Lin, X, and Boutros, PC (2020). "Optimization and expansion of non-negative matrix factorization." BMC Bioinformatics.
Myre, JM, Frahm, E, Lilja DJ, and Saar, MO. (2017) "TNT-NN: A Fast Active Set Method for Solving Large Non-Negative Least Squares Problems". Proc. Computer Science.
## Not run: # compare solution to base::solve for a random system X <- matrix(runif(100), 10, 10) a <- crossprod(X) b <- crossprod(X, runif(10)) unconstrained_soln <- solve(a, b) nonneg_soln <- nnls(a, b) unconstrained_err <- mean((a %*% unconstrained_soln - b)^2) nonnegative_err <- mean((a %*% nonneg_soln - b)^2) unconstrained_err nonnegative_err all.equal(solve(a, b), nnls(a, b)) # example adapted from multiway::fnnls example 1 X <- matrix(1:100,50,2) y <- matrix(101:150,50,1) beta <- solve(crossprod(X)) %*% crossprod(X, y) beta beta <- nnls(crossprod(X), crossprod(X, y)) beta # learn nmf model and do bvls projection data(hawaiibirds) w <- nmf(hawaiibirds$counts, 10)@w h <- project(w, hawaiibirds$counts) # now impose upper bound on solutions h2 <- project(w, hawaiibirds$counts, upper_bound = 2) ## End(Not run)
## Not run: # compare solution to base::solve for a random system X <- matrix(runif(100), 10, 10) a <- crossprod(X) b <- crossprod(X, runif(10)) unconstrained_soln <- solve(a, b) nonneg_soln <- nnls(a, b) unconstrained_err <- mean((a %*% unconstrained_soln - b)^2) nonnegative_err <- mean((a %*% nonneg_soln - b)^2) unconstrained_err nonnegative_err all.equal(solve(a, b), nnls(a, b)) # example adapted from multiway::fnnls example 1 X <- matrix(1:100,50,2) y <- matrix(101:150,50,1) beta <- solve(crossprod(X)) %*% crossprod(X, y) beta beta <- nnls(crossprod(X), crossprod(X, y)) beta # learn nmf model and do bvls projection data(hawaiibirds) w <- nmf(hawaiibirds$counts, 10)@w h <- project(w, hawaiibirds$counts) # now impose upper bound on solutions h2 <- project(w, hawaiibirds$counts, upper_bound = 2) ## End(Not run)
Equivalent to predict
method for NMF, but requires only the w
matrix to be supplied and not the entire NMF model. Use NNLS to project a basis factor model onto new samples.
project(w, data, L1 = 0, L2 = 0, mask = NULL, upper_bound = 0, ...)
project(w, data, L1 = 0, L2 = 0, mask = NULL, upper_bound = 0, ...)
w |
matrix of features (rows) by factors (columns), corresponding to rows in |
data |
a dense or sparse matrix |
L1 |
L1/LASSO penalty |
L2 |
L2/Ridge penalty |
mask |
masking on data values |
upper_bound |
maximum value permitted in least squares solutions, essentially a bounded-variable least squares problem between 0 and |
... |
arguments passed to |
See nmf
for more info, as well as the predict
method for NMF.
Generate a random sparse matrix, just like Matrix::rsparsematrix
or (matrix(runif(nrow * ncol), nrow,))
, but much faster.
Generation of transpose-identical matrices is also supported without additional computational cost.
r_matrix(nrow, ncol, transpose_identical = FALSE) r_sparsematrix( nrow, ncol, inv_density, transpose_identical = FALSE, pattern = FALSE )
r_matrix(nrow, ncol, transpose_identical = FALSE) r_sparsematrix( nrow, ncol, inv_density, transpose_identical = FALSE, pattern = FALSE )
nrow |
number of rows |
ncol |
number of columns |
transpose_identical |
should the matrix be transpose-identical? |
inv_density |
an integer giving the inverse density of the matrix (i.e. 10 percent density corresponds to |
pattern |
should a pattern matrix ( |
# generate a simple random matrix A <- r_matrix(10, 10) # generate two matrices that are transpose identical set.seed(123) A1 <- r_matrix(10, 100, transpose_identical = TRUE) set.seed(123) A2 <- r_matrix(100, 10, transpose_identical = TRUE) all.equal(t(A2), A1) # generate a transpose-identical pair of speckled matrices set.seed(123) A <- r_sparsematrix(10, 100, inv_density = 10, transpose_identical = TRUE) set.seed(123) A <- r_sparsematrix(100, 10, inv_density = 10, transpose_identical = TRUE) all.equal(t(A), A) Matrix::isSymmetric(A[1:10, 1:10]) heatmap(as.matrix(A), scale = "none", Rowv = NA, Colv = NA) # note that density is probabilistic, not absolute A <- replicate(1000, r_sparsematrix(100, 100, 10)) densities <- sapply(A, function(x) length(x@i) / prod(dim(x))) plot(density(densities)) # normal distribution centered at 0.100 range(densities)
# generate a simple random matrix A <- r_matrix(10, 10) # generate two matrices that are transpose identical set.seed(123) A1 <- r_matrix(10, 100, transpose_identical = TRUE) set.seed(123) A2 <- r_matrix(100, 10, transpose_identical = TRUE) all.equal(t(A2), A1) # generate a transpose-identical pair of speckled matrices set.seed(123) A <- r_sparsematrix(10, 100, inv_density = 10, transpose_identical = TRUE) set.seed(123) A <- r_sparsematrix(100, 10, inv_density = 10, transpose_identical = TRUE) all.equal(t(A), A) Matrix::isSymmetric(A[1:10, 1:10]) heatmap(as.matrix(A), scale = "none", Rowv = NA, Colv = NA) # note that density is probabilistic, not absolute A <- replicate(1000, r_sparsematrix(100, 100, 10)) densities <- sapply(A, function(x) length(x@i) / prod(dim(x))) plot(density(densities)) # normal distribution centered at 0.100 range(densities)
r_sample
is just like base::sample
, only faster.
r_sample
takes a sample of the specified size from the elements of x
using replacement if indicated.
These functions generate random distributions (uniform, normal, or binomial) just like their base R counterparts (runif
, rnorm
, and rbinom
), but faster.
r_sample(x, size = NULL, replace = FALSE) r_unif(n, min = 0, max = 1) r_binom(n, size = 1, inv_prob = 2)
r_sample(x, size = NULL, replace = FALSE) r_unif(n, min = 0, max = 1) r_binom(n, size = 1, inv_prob = 2)
x |
either a positive integer giving the number of items to choose from, or a vector of elements to shuffle or from which to choose. See 'Details'. |
size |
number of trials (one or more) |
replace |
should sampling be with replacement? |
n |
number of observations |
min |
finite lower limit of the uniform distribution |
max |
finite upper limit of the uniform distribution |
inv_prob |
inverse probability of success for each trial, must be integral (e.g. 50 percent success = 2, 10 percent success = 10) |
All RNGs make use of Marsaglia's xorshift method to generate random integers.
r_unif
takes the random integer and divides it by the seed and returns the floating decimal portion of the result.
# draw all integers from 1 to 10 in a random order sample(10) # shuffle a vector of values v <- r_unif(3) v v_ <- sample(v) v_ # draw values from a vector sample(r_unif(100), 3) # draw some integers between 1 and 1000 sample(1000, 3) # simulate a uniform distribution v <- r_unif(10000) plot(density(v)) # simulate a binomial distribution v <- r_binom(10000, 100, inv_prob = 10) hist(v) sum(v) / length(v) # ~10 because 100 trials at 10 percent success odds # is about 10 successes per element # get successful trials in a bernoulli distribution v <- r_binom(100, 1, 20) successful_trials <- slot(as(v, "nsparseVector"), "i") successful_trials
# draw all integers from 1 to 10 in a random order sample(10) # shuffle a vector of values v <- r_unif(3) v v_ <- sample(v) v_ # draw values from a vector sample(r_unif(100), 3) # draw some integers between 1 and 1000 sample(1000, 3) # simulate a uniform distribution v <- r_unif(10000) plot(density(v)) # simulate a binomial distribution v <- r_binom(10000, 100, inv_prob = 10) hist(v) sum(v) / length(v) # ~10 because 100 trials at 10 percent success odds # is about 10 successes per element # get successful trials in a bernoulli distribution v <- r_binom(100, 1, 20) successful_trials <- slot(as(v, "nsparseVector"), "i") successful_trials
High-performance non-negative matrix factorization and linear model projection for sparse matrices, and fast non-negative least squares implementations
Zach DeBruine
Generate a random matrix that follows some defined NMF model to test NMF factorizations. Adapts methods from NMF::syntheticNMF
.
simulateNMF(nrow, ncol, k, noise = 0.5, dropout = 0.5, seed = NULL)
simulateNMF(nrow, ncol, k, noise = 0.5, dropout = 0.5, seed = NULL)
nrow |
number of rows |
ncol |
number of columns |
k |
true rank of simulated model |
noise |
standard deviation of Gaussian noise centered at 0 to add to input matrix. Any negative values after noise addition are set to 0. |
dropout |
density of dropout events |
seed |
seed for random number generation |
list of dense matrix A
and true w
and h
models
Compute the sparsity of each NMF factor
sparsity(object, ...) ## S4 method for signature 'nmf' sparsity(object, ...)
sparsity(object, ...) ## S4 method for signature 'nmf' sparsity(object, ...)
object |
object of class |
... |
additional parameters |
For nmf
models, the sparsity of each factor is computed and summarized
or and
matrices. A long
data.frame
with columns factor
, sparsity
, and model
is returned.
Given an NMF model in the form ,
project
projects w
onto A
to solve for h
.
## S4 method for signature 'nmf' subset(x, i, ...) ## S4 method for signature 'nmf,ANY,ANY,ANY' x[i] ## S4 method for signature 'nmf' head(x, n = getOption("digits"), ...) ## S4 method for signature 'nmf' show(object) ## S4 method for signature 'nmf' dimnames(x) ## S4 method for signature 'nmf' dim(x) ## S4 method for signature 'nmf' t(x) ## S4 method for signature 'nmf' sort(x, decreasing = TRUE, ...) ## S4 method for signature 'nmf' prod(x, ..., na.rm = FALSE) ## S4 method for signature 'nmf' x$name ## S4 method for signature 'nmf,list' coerce(from, to) ## S4 method for signature 'nmf' x[[i]] ## S4 method for signature 'nmf' predict(object, data, L1 = 0, L2 = 0, mask = NULL, upper_bound = NULL, ...)
## S4 method for signature 'nmf' subset(x, i, ...) ## S4 method for signature 'nmf,ANY,ANY,ANY' x[i] ## S4 method for signature 'nmf' head(x, n = getOption("digits"), ...) ## S4 method for signature 'nmf' show(object) ## S4 method for signature 'nmf' dimnames(x) ## S4 method for signature 'nmf' dim(x) ## S4 method for signature 'nmf' t(x) ## S4 method for signature 'nmf' sort(x, decreasing = TRUE, ...) ## S4 method for signature 'nmf' prod(x, ..., na.rm = FALSE) ## S4 method for signature 'nmf' x$name ## S4 method for signature 'nmf,list' coerce(from, to) ## S4 method for signature 'nmf' x[[i]] ## S4 method for signature 'nmf' predict(object, data, L1 = 0, L2 = 0, mask = NULL, upper_bound = NULL, ...)
x |
object of class |
i |
indices |
... |
arguments passed to or from other methods |
n |
number of rows/columns to show |
object |
fitted model, class |
decreasing |
logical. Should the sort be increasing or decreasing? |
na.rm |
remove na values |
name |
name of nmf class slot |
from |
class which the coerce method should perform coercion from |
to |
class which the coerce method should perform coercion to |
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
L1 |
a single LASSO penalty in the range (0, 1] |
L2 |
a single Ridge penalty greater than zero |
mask |
dense or sparse matrix of values in |
upper_bound |
maximum value permitted in least squares solutions, essentially a bounded-variable least squares problem between 0 and |
For the alternating least squares matrix factorization update problem , the updates
(or projection) of
is given by the equation:
which is in the form where
and
for all columns
in
.
Any L1 penalty is subtracted from and should generally be scaled to
max(b)
, where for all columns
in
. An easy way to properly scale an L1 penalty is to normalize all columns in
to sum to the same value (e.g. 1). No scaling is applied in this function. Such scaling guarantees that
L1 = 1
gives a completely sparse solution.
There are specializations for dense and sparse input matrices, symmetric input matrices, and for rank-1 and rank-2 projections. See documentation for nmf
for theoretical details and guidance.
object of class nmf
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
## Not run: w <- matrix(runif(1000 * 10), 1000, 10) h_true <- matrix(runif(10 * 100), 10, 100) # A is the crossproduct of "w" and "h" with 10% signal dropout A <- (w %*% h_true) * (r_sparsematrix(1000, 100, 10) > 0) h <- project(w, A) cor(as.vector(h_true), as.vector(h)) # alternating projections refine solution (like NMF) mse_bad <- mse(w, rep(1, ncol(w)), h, A) # mse before alternating updates h <- project(w, A) w <- t(project(h, t(A))) h <- project(w, A) w <- t(project(h, t(A))) h <- project(w, A) w <- t(project(h, t(A))) mse_better <- mse(w, rep(1, ncol(w)), h, A) # mse after alternating updates mse_better < mse_bad ## End(Not run)
## Not run: w <- matrix(runif(1000 * 10), 1000, 10) h_true <- matrix(runif(10 * 100), 10, 100) # A is the crossproduct of "w" and "h" with 10% signal dropout A <- (w %*% h_true) * (r_sparsematrix(1000, 100, 10) > 0) h <- project(w, A) cor(as.vector(h_true), as.vector(h)) # alternating projections refine solution (like NMF) mse_bad <- mse(w, rep(1, ncol(w)), h, A) # mse before alternating updates h <- project(w, A) w <- t(project(h, t(A))) h <- project(w, A) w <- t(project(h, t(A))) h <- project(w, A) w <- t(project(h, t(A))) mse_better <- mse(w, rep(1, ncol(w)), h, A) # mse after alternating updates mse_better < mse_bad ## End(Not run)
summary
method for class "nmf
". Describes metadata representation in NMF factors. Returns object of class nmfSummary
. Plot results using plot
.
## S4 method for signature 'nmf' summary(object, group_by, stat = "sum", ...) ## S3 method for class 'nmfSummary' plot(x, ...)
## S4 method for signature 'nmf' summary(object, group_by, stat = "sum", ...) ## S3 method for class 'nmfSummary' plot(x, ...)
object |
an object of class " |
group_by |
a discrete factor giving groupings for samples or features. Must be of the same length as number of samples in |
stat |
either |
... |
arguments passed to or from other methods |
x |
|
data.frame
with columns group
, factor
, and stat