| Title: | Fast Non-Negative Matrix Factorization and Divisive Clustering |
|---|---|
| Description: | High-performance non-negative matrix factorization (NMF), singular value decomposition (SVD), and divisive clustering for large sparse and dense matrices. Implements alternating least squares with coordinate descent and Cholesky NNLS solvers, diagonal scaling for interpretable factors, cross-validation for automatic rank selection, multiple distribution-based losses (Gaussian, Poisson, Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie) via iteratively reweighted least squares, regularization (L1, L2, L21, angular, graph Laplacian), and optional GPU acceleration via CUDA. Includes divisive clustering via recursive rank-2 factorization, consensus clustering, and the StreamPress compressed sparse matrix format. Methods are described in DeBruine, Melcher, and Triche (2021) <doi:10.1101/2021.09.01.458620>. |
| Authors: | Zachary DeBruine [aut, cre] (ORCID: <https://orcid.org/0000-0003-2234-4827>) |
| Maintainer: | Zachary DeBruine <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.0.0 |
| Built: | 2026-05-06 09:18:14 UTC |
| Source: | https://github.com/zdebruine/rcppml |
High-performance non-negative matrix factorization (NMF), singular value decomposition (SVD/PCA), and divisive clustering for large sparse and dense matrices, powered by Rcpp and Eigen.
nmfFit NMF model (sparse or dense input, optional cross-validation)
evaluateEvaluate reconstruction loss of an NMF model
alignAlign factors across NMF models
predict,nmf-methodProject new data onto a fitted NMF model
consensus_nmfConsensus clustering from multiple NMF runs
simulateNMFSimulate data from a known NMF model
auto_nmf_distributionSelect distribution based on data characteristics
svdTruncated SVD via deflation
pcaPCA (centered SVD)
reconstructReconstruct matrix from SVD/PCA model
variance_explainedProportion of variance per factor
nnlsSolve non-negative least squares problems
dclustDivisive clustering via recursive rank-2 NMF
bipartitionSplit samples into two groups via rank-2 NMF
bipartiteMatchMatch two sets of cluster labels
factor_netCompile a factorization network
fitFit a compiled factor network
factor_input, nmf_layer, svd_layer
Node constructors
factor_shared, factor_concat, factor_add
Merge operations
factor_config, W, H
Configuration
cross_validate_graphCross-validate a factor network
st_write, st_read
Read/write .spz files
st_infoInspect .spz file metadata
st_read_obs, st_read_var
Read embedded metadata tables
st_read_gpu, st_free_gpu
GPU-direct .spz reading
gpu_availableCheck GPU availability
gpu_infoGet GPU device details
Maintainer: Zachary DeBruine [email protected] (ORCID)
Useful links:
Access layer results by name
## S3 method for class 'factor_net_result' x$name## S3 method for class 'factor_net_result' x$name
x |
A |
name |
Layer name (e.g. |
The layer result list, or the named field if not a layer name.
Align two NMF models
align(object, ...) ## S4 method for signature 'nmf' align(object, ref, method = "cosine", ...)align(object, ...) ## S4 method for signature 'nmf' align(object, ref, method = "cosine", ...)
object |
nmf model to be aligned to |
... |
arguments passed to or from other methods |
ref |
reference nmf model to which |
method |
either |
For nmf models, factors in object are reordered to minimize the cost of bipartite matching
(see bipartiteMatch) on a cosine or correlation distance matrix. The matrix is used for matching, and
must be equidimensional in object and ref.
An nmf object with factors reordered to best match ref.
data <- simulateNMF(50, 30, k = 3, seed = 1) m1 <- nmf(data$A, 3, seed = 1, maxit = 50) m2 <- nmf(data$A, 3, seed = 2, maxit = 50) aligned <- align(m2, m1) # reorder m2 factors to match m1data <- simulateNMF(50, 30, k = 3, seed = 1) m1 <- nmf(data$A, 3, seed = 1, maxit = 50) m2 <- nmf(data$A, 3, seed = 2, maxit = 50) aligned <- align(m2, m1) # reorder m2 factors to match m1
ATAC-seq chromatin accessibility data from 123 acute myelogenous leukemia (AML) patient samples and 5 healthy hematopoietic stem/progenitor cell (HSPC) samples. The dataset contains chromatin accessibility measurements across 824 genomic regions.
amlaml
A matrix (dense) with 824 rows (genomic regions) and
135 columns (samples). Sample metadata is stored as an attribute.
The data is stored as dense because it has very low sparsity (~0.2% zeros).
Access metadata via attr(aml, "metadata_h"), which contains:
Character vector of sample IDs
Character vector indicating sample type (AML subtype or HSPC)
This dataset contains ATAC-seq chromatin accessibility data for studying acute myelogenous leukemia. Samples represent different AML subtypes including:
Common myeloid progenitors (CMP)
Granulocyte-monocyte progenitors (GMP)
Lymphoid-primed multi-potent progenitors (LMPP)
Megakaryocyte-erythrocyte progenitors (MEP)
Multi-potent progenitors (MPP)
Healthy hematopoietic stem/progenitor cells (HSPC)
Corces et al. (2016). "Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution." Nature Genetics 48(10): 1193-1203.
# Load dataset data(aml) # Inspect structure dim(aml) table(attr(aml, "metadata_h")$category) # Run NMF (transpose so samples are columns) result <- nmf(t(aml), k = 6, maxit = 100)# Load dataset data(aml) # Inspect structure dim(aml) table(attr(aml, "metadata_h")$category) # Run NMF (transpose so samples are columns) result <- nmf(t(aml), k = 6, maxit = 100)
Convert assessment results to a one-row data frame
## S3 method for class 'nmf_assessment' as.data.frame(x, ...)## S3 method for class 'nmf_assessment' as.data.frame(x, ...)
x |
An |
... |
Ignored. |
A one-row data frame with all metrics as columns.
Converts the internal training log from an NMF run into a tidy data frame with one row per logged iteration, suitable for plotting convergence curves.
## S3 method for class 'training_logger' as.data.frame(x, ...)## S3 method for class 'training_logger' as.data.frame(x, ...)
x |
A |
... |
Ignored. |
A data.frame with columns iteration, wall_sec,
and optionally total_loss, per-layer loss columns, and
norm-tracking columns.
Unified evaluation of an NMF, SVD, or arbitrary embedding matrix against one or more label vectors. Computes clustering metrics (ARI, NMI, Silhouette), classification metrics (KNN/LR/RF accuracy, F1, AUC via cross-validation), and batch-mixing metrics (donor/batch Silhouette, kNN entropy).
assess( x, labels, batch = NULL, metrics = "all", n_folds = 5L, classifiers = c("knn", "lr", "rf"), k_nn = 15L, seed = 42L, min_class_size = 10L, sil_samples_per_class = 200L, batch_knn_k = 50L )assess( x, labels, batch = NULL, metrics = "all", n_folds = 5L, classifiers = c("knn", "lr", "rf"), k_nn = 15L, seed = 42L, min_class_size = 10L, sil_samples_per_class = 200L, batch_knn_k = 50L )
x |
An object of class |
labels |
A factor, character, or integer vector of primary labels (e.g. cell type). Required for clustering and classification metrics. |
batch |
Optional factor/character/integer vector of batch labels (e.g. donor_id, tissue). When provided, batch-mixing metrics are computed. |
metrics |
Character vector specifying which metrics to compute.
Options: |
n_folds |
Number of cross-validation folds for classification (default 5). |
classifiers |
Character vector of classifiers: |
k_nn |
Number of neighbors for kNN classifier (default 15). |
seed |
Random seed for reproducibility (default 42). |
min_class_size |
Minimum number of samples per class. Classes smaller than this are dropped before evaluation (default 10). |
sil_samples_per_class |
Number of reference samples per class for approximate silhouette (default 200). Higher = more accurate, slower. |
batch_knn_k |
Number of neighbors for batch entropy (default 50). |
Uses GPU-accelerated kernels when available, otherwise falls back to CPU-based approximate algorithms that avoid O(n^2) distance matrices:
Silhouette: sampled approximation (O(n * samples_per_class * C))
kNN metrics: brute-force on GPU or sampled on CPU (O(n * k * n_ref))
ARI/NMI: contingency table (O(n), always fast)
An S3 object of class nmf_assessment containing:
Named list of all computed metric values
Data frame of per-classifier, per-fold results
List recording all parameters used
# Assess an NMF model library(Matrix) A <- abs(rsparsematrix(200, 100, 0.2)) model <- nmf(A, 5, seed = 1) labels <- factor(sample(letters[1:4], 100, replace = TRUE)) result <- assess(model, labels) print(result) # Select specific metrics result <- assess(model, labels, metrics = c("nmi", "ari")) # Include batch assessment batch <- factor(sample(c("A", "B"), 100, replace = TRUE)) result <- assess(model, labels, batch = batch, metrics = "all")# Assess an NMF model library(Matrix) A <- abs(rsparsematrix(200, 100, 0.2)) model <- nmf(A, 5, seed = 1) labels <- factor(sample(letters[1:4], 100, replace = TRUE)) result <- assess(model, labels) print(result) # Select specific metrics result <- assess(model, labels, metrics = c("nmi", "ari")) # Include batch assessment batch <- factor(sample(c("A", "B"), 100, replace = TRUE)) result <- assess(model, labels, batch = batch, metrics = "all")
Fits NMF with multiple loss functions (distributions) and selects the best based on per-element AIC/BIC. Useful for determining whether count data is best modeled with Gaussian (MSE), Poisson/GP, or Negative Binomial loss.
auto_nmf_distribution( data, k, distributions = c("mse", "gp", "nb"), criterion = c("bic", "aic"), maxit = 50, seed = NULL, verbose = FALSE, ... )auto_nmf_distribution( data, k, distributions = c("mse", "gp", "nb"), criterion = c("bic", "aic"), maxit = 50, seed = NULL, verbose = FALSE, ... )
data |
Input matrix (dense or sparse dgCMatrix) |
k |
Factorization rank |
distributions |
Character vector of distributions to compare.
Default: |
criterion |
Selection criterion: |
maxit |
Maximum iterations per fit |
seed |
Random seed for reproducibility |
verbose |
Print progress and comparison table |
... |
Additional arguments passed to |
For each distribution, NMF is fit and the final negative log-likelihood (NLL)
is computed. For GP and NB, the C++ loss is already the total NLL.
For MSE (Gaussian), the C++ loss is the sum of squared errors (SSE), which is
converted to Gaussian NLL: .
The number of effective parameters is:
mse: (factor params + noise variance)
gp: (factor params + dispersion per row)
nb: (factor params + size per row)
BIC = where is
the number of observations (nonzeros for sparse, all entries for dense).
AIC = .
A list with:
Character string: name of the best distribution (loss function)
Data frame with distribution, nll, df, aic, bic, selected
Named list of fitted nmf objects
score_test_distribution, diagnose_zero_inflation,
nmf
library(Matrix) set.seed(42) A <- abs(rsparsematrix(100, 50, 0.3)) result <- auto_nmf_distribution(A, k = 5) print(result$comparison) cat("Best distribution:", result$loss, "\n")library(Matrix) set.seed(42) A <- abs(rsparsematrix(100, 50, 0.3)) result <- auto_nmf_distribution(A, k = 5) print(result$comparison) cat("Best distribution:", result$loss, "\n")
Hungarian algorithm for matching samples in a bipartite graph from a distance ("cost") matrix
bipartiteMatch(x)bipartiteMatch(x)
x |
symmetric matrix giving the cost of every possible pairing |
This implementation was adapted from RcppHungarian, an Rcpp wrapper for the original C++ implementation by Cong Ma (2016).
List with elements cost (total matching cost) and assignment (0-indexed column assignment for each row).
cost <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.3, 0.2, 0.3, 1), nrow = 3, byrow = TRUE) result <- bipartiteMatch(cost) result$cost result$assignmentcost <- matrix(c(1, 0.5, 0.2, 0.5, 1, 0.3, 0.2, 0.3, 1), nrow = 3, byrow = TRUE) result <- bipartiteMatch(cost) result$cost result$assignment
Spectral biparitioning by rank-2 matrix factorization
bipartition( data, tol = 1e-05, nonneg = TRUE, threads = 0, verbose = FALSE, ... )bipartition( data, tol = 1e-05, nonneg = TRUE, threads = 0, verbose = FALSE, ... )
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
tol |
tolerance of the fit (default 1e-4) |
nonneg |
enforce non-negativity of the rank-2 factorization used for bipartitioning |
threads |
number of threads for OpenMP parallelization (default 0 = all available) |
verbose |
print progress information (default FALSE) |
... |
additional arguments (see Advanced Parameters section) |
Spectral bipartitioning is a popular subroutine in divisive clustering. The sign of the difference between sample loadings in factors of a rank-2 matrix factorization gives a bipartition that is nearly identical to an SVD.
Rank-2 matrix factorization by alternating least squares is faster than rank-2-truncated SVD (i.e. irlba).
This function is a specialization of rank-2 nmf with support for factorization of only a subset of samples, and with additional calculations on the factorization model relevant to bipartitioning. See nmf for details regarding rank-2 factorization.
A list giving the bipartition and useful statistics:
v : vector giving difference between sample loadings between factors in a rank-2 factorization
dist : relative cosine distance of samples within a cluster to centroids of assigned vs. not-assigned cluster
size1 : number of samples in first cluster (positive loadings in 'v')
size2 : number of samples in second cluster (negative loadings in 'v')
samples1: indices of samples in first cluster
samples2: indices of samples in second cluster
center1 : mean feature loadings across samples in first cluster
center2 : mean feature loadings across samples in second cluster
Several parameters may be specified in the ... argument:
diag = TRUE: scale factors in and to sum to 1 by introducing a diagonal, . This should generally never be set to FALSE. Diagonalization enables symmetry of models in factorization of symmetric matrices, convex L1 regularization, and consistent factor scalings.
samples = 1:ncol(A): samples to include in bipartition, numbered from 1 to ncol(A). Default is all samples.
calc_dist = TRUE: calculate the relative cosine distance of samples within a cluster to either cluster centroid. If TRUE, centers for clusters will also be calculated.
seed = NULL: random seed for model initialization, generally not needed for rank-2 factorizations because robust solutions are recovered when diag = TRUE
maxit = 100: maximum number of alternating updates of and . Generally, rank-2 factorizations converge quickly and this should not need to be adjusted.
bipartition() uses scalar nonneg (not length-2)
because rank-2 factorizations apply the same constraint to both factors.
The default tol = 1e-5 (inherited from nmf()'s
internal rank-2 path) is tighter than nmf()'s global 1e-4
because single rank-2 subproblems converge faster.
Zach DeBruine
Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.
library(Matrix) data(iris) A <- as(as.matrix(iris[,1:4]), "dgCMatrix") bipartition(A, calc_dist = TRUE)library(Matrix) data(iris) A <- as(as.matrix(iris[,1:4]), "dgCMatrix") bipartition(A, calc_dist = TRUE)
Produces a biplot from the output of nmf
## S4 method for signature 'nmf' biplot(x, factors = c(1, 2), matrix = "w", group_by = NULL, ...)## S4 method for signature 'nmf' biplot(x, factors = c(1, 2), matrix = "w", group_by = NULL, ...)
x |
an object of class " |
factors |
length 2 vector specifying factors to plot. |
matrix |
either |
group_by |
a discrete factor giving groupings for samples or features. Must be of the same length as number of samples in |
... |
for consistency with |
ggplot2 object
library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) if (requireNamespace("ggplot2", quietly = TRUE)) { biplot(model, factors = c(1, 2)) }library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) if (requireNamespace("ggplot2", quietly = TRUE)) { biplot(model, factors = c(1, 2)) }
Trains a k-nearest-neighbor classifier on factor embeddings and evaluates on held-out test samples. Returns a comprehensive metrics object.
classify_embedding( embedding, labels, test_fraction = 0.2, test_idx = NULL, k = 5L, seed = NULL, distance = c("euclidean", "cosine") )classify_embedding( embedding, labels, test_fraction = 0.2, test_idx = NULL, k = 5L, seed = NULL, distance = c("euclidean", "cosine") )
embedding |
Numeric matrix where rows are samples and columns are
features (e.g., |
labels |
Integer or factor vector of class labels. Length must equal
|
test_fraction |
Fraction of samples held out for testing (default 0.2). |
test_idx |
Optional integer vector of test indices. If provided,
|
k |
Number of nearest neighbors (default 5). |
seed |
Random seed for train/test split reproducibility. |
distance |
Distance metric: "euclidean" (default) or "cosine". |
An fn_classifier_eval object with fields:
Overall test accuracy
Data frame with per-class precision, recall, F1, support
Macro-averaged precision
Macro-averaged recall
Macro-averaged F1
Support-weighted F1
Macro-averaged one-vs-rest AUC (from neighbor vote fractions)
Confusion matrix (rows = true, cols = predicted)
Test set predictions
True test labels
Training sample indices
Test sample indices
Number of neighbors used
classify_logistic, classify_rf
data(digits) model <- nmf(digits, 10, maxit = 20, seed = 1, verbose = FALSE) labels <- attr(digits, "target") eval <- classify_embedding(model$w, labels, test_fraction = 0.2, k = 5, seed = 42) print(eval)data(digits) model <- nmf(digits, 10, maxit = 20, seed = 1, verbose = FALSE) labels <- attr(digits, "target") eval <- classify_embedding(model$w, labels, test_fraction = 0.2, k = 5, seed = 42) print(eval)
Trains a multinomial logistic regression on factor embeddings using
stats::glm (one-vs-rest for > 2 classes) and evaluates on test
samples with the same comprehensive metrics as classify_embedding.
classify_logistic( embedding, labels, test_fraction = 0.2, test_idx = NULL, seed = NULL )classify_logistic( embedding, labels, test_fraction = 0.2, test_idx = NULL, seed = NULL )
embedding |
Numeric matrix where rows are samples and columns are
features (e.g., |
labels |
Integer or factor vector of class labels. Length must equal
|
test_fraction |
Fraction of samples held out for testing (default 0.2). |
test_idx |
Optional integer vector of test indices. If provided,
|
seed |
Random seed for train/test split reproducibility. |
An fn_classifier_eval object (same structure as KNN variant).
classify_embedding, classify_rf
# Logistic regression on random embeddings set.seed(42) embed <- matrix(rnorm(200), nrow = 40, ncol = 5) labels <- factor(rep(1:4, each = 10)) eval <- classify_logistic(embed, labels, seed = 1) eval$accuracy# Logistic regression on random embeddings set.seed(42) embed <- matrix(rnorm(200), nrow = 40, ncol = 5) labels <- factor(rep(1:4, each = 10)) eval <- classify_logistic(embed, labels, seed = 1) eval$accuracy
Trains a random forest on factor embeddings using the randomForest package (must be installed) and evaluates on test samples.
classify_rf( embedding, labels, test_fraction = 0.2, test_idx = NULL, ntree = 500L, seed = NULL )classify_rf( embedding, labels, test_fraction = 0.2, test_idx = NULL, ntree = 500L, seed = NULL )
embedding |
Numeric matrix where rows are samples and columns are
features (e.g., |
labels |
Integer or factor vector of class labels. Length must equal
|
test_fraction |
Fraction of samples held out for testing (default 0.2). |
test_idx |
Optional integer vector of test indices. If provided,
|
ntree |
Number of trees (default 500). |
seed |
Random seed for train/test split reproducibility. |
An fn_classifier_eval object (same structure as KNN variant).
classify_embedding, classify_logistic
# Random forest on random embeddings (requires randomForest package) if (requireNamespace("randomForest", quietly = TRUE)) { set.seed(42) embed <- matrix(rnorm(200), nrow = 40, ncol = 5) labels <- factor(rep(1:4, each = 10)) eval <- classify_rf(embed, labels, ntree = 100, seed = 1) eval$accuracy }# Random forest on random embeddings (requires randomForest package) if (requireNamespace("randomForest", quietly = TRUE)) { set.seed(42) embed <- matrix(rnorm(200), nrow = 40, ncol = 5) labels <- factor(rep(1:4, each = 10)) eval <- classify_rf(embed, labels, ntree = 100, seed = 1) eval$accuracy }
Overlay training curves from multiple NMF models to compare convergence behavior.
compare_nmf(..., labels = NULL, metric = "loss", smooth = TRUE, span = 0.3)compare_nmf(..., labels = NULL, metric = "loss", smooth = TRUE, span = 0.3)
... |
nmf objects to compare |
labels |
optional character vector of model labels |
metric |
what to compare: "loss" (default), "sparsity" |
smooth |
apply smoothing (default TRUE) |
span |
smoothing span for LOESS (default 0.3) |
A ggplot2 object comparing the requested metric across models.
library(Matrix) A <- abs(rsparsematrix(100, 50, 0.2)) model1 <- nmf(A, k = 5, L1 = 0.01) model2 <- nmf(A, k = 5, L1 = 0.1) compare_nmf(model1, model2, labels = c("L1=0.01", "L1=0.1"))library(Matrix) A <- abs(rsparsematrix(100, 50, 0.2)) model1 <- nmf(A, k = 5, L1 = 0.01) model2 <- nmf(A, k = 5, L1 = 0.1) compare_nmf(model1, model2, labels = c("L1=0.01", "L1=0.1"))
Builds a k x n target matrix from class labels, suitable for passing
as target_H to nmf. Each sample column is set
to its class centroid (optionally ZCA-whitened), so that target
regularization steers H toward class-discriminative structure.
compute_target(H, labels, whiten = TRUE)compute_target(H, labels, whiten = TRUE)
H |
k x n embedding matrix (e.g. from an initial NMF fit). |
labels |
Factor or character/integer vector of class labels
(length n). |
whiten |
Logical; apply OAS-shrinkage ZCA whitening to class
centroids before broadcasting. Default |
With positive target_lambda, NMF attracts H toward the target
(label enrichment). With negative target_lambda, NMF uses
PROJ_ADV eigenvalue-projected adversarial removal to suppress
target-correlated structure (batch removal). See
vignette("guided-nmf") for mathematical details.
Algorithm:
Compute per-class centroids from rows/columns of H.
(Optional) Apply Oracle-Approximating Shrinkage (OAS) covariance estimation followed by ZCA whitening to the centroids, ensuring isotropic class structure.
Broadcast each sample's class centroid back to a k x n matrix.
A numeric k x n matrix. Pass it to nmf(..., target_H = T,
target_lambda = 0.5) for enrichment or target_lambda = -0.5
for batch removal.
data(hawaiibirds) model <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50) meta <- attr(hawaiibirds, "metadata_h") # Build target from class labels target <- compute_target(model@h, labels = meta$island) # Enrichment: attract H toward island structure guided <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50, target_H = target, target_lambda = 0.5) # Batch removal: suppress island-correlated structure removed <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50, target_H = target, target_lambda = -0.5)data(hawaiibirds) model <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50) meta <- attr(hawaiibirds, "metadata_h") # Build target from class labels target <- compute_target(model@h, labels = meta$island) # Enrichment: attract H toward island structure guided <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50, target_H = target, target_lambda = 0.5) # Batch removal: suppress island-correlated structure removed <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50, target_H = target, target_lambda = -0.5)
Run multiple NMF replicates and compute consensus matrix showing co-clustering frequency of samples.
consensus_nmf( data, k, reps = 50, method = c("hard", "knn_jaccard"), knn = 10, seed = NULL, threads = 0, verbose = FALSE, ... )consensus_nmf( data, k, reps = 50, method = c("hard", "knn_jaccard"), knn = 10, seed = NULL, threads = 0, verbose = FALSE, ... )
data |
input matrix (samples x features for clustering samples) |
k |
rank of factorization |
reps |
number of replicates (default 50) |
method |
consensus method: "hard" for hard cluster assignments (default), or "knn_jaccard" for KNN-based Jaccard overlap of factor loadings |
knn |
number of nearest neighbors to use for KNN Jaccard method (default 10) |
seed |
random seed for reproducibility |
threads |
number of threads for OpenMP parallelization (default 0 = all available) |
verbose |
print progress information (default FALSE) |
... |
additional arguments passed to |
Consensus clustering runs NMF multiple times with different random initializations.
**Hard clustering method** (method = "hard"): For each run, samples are clustered based on their dominant factor in W. The consensus matrix C[i,j] gives the proportion of runs where samples i and j were assigned to the same cluster. This is the traditional consensus clustering approach.
**KNN Jaccard method** (method = "knn_jaccard"): For each run, the k-nearest neighbors of each sample are computed based on factor loadings (W matrix). The consensus matrix C[i,j] is the average Jaccard similarity between the KNN sets of samples i and j across all replicates. This approach is more robust to ambiguous cluster assignments and captures neighborhood structure rather than hard cluster membership.
High consensus values (near 1) indicate stable co-clustering or neighborhood overlap. Intermediate values suggest ambiguous relationships.
The cophenetic correlation coefficient measures cluster stability - higher values (closer to 1) indicate more stable/reproducible clustering.
List with:
consensus - consensus matrix (samples x samples)
models - list of fitted nmf objects
clusters - final cluster assignments
cophenetic - cophenetic correlation coefficient
method - consensus method used
nmf, plot.consensus_nmf, summary.consensus_nmf
library(Matrix) A <- rsparsematrix(100, 50, 0.3) # Traditional hard clustering consensus cons_hard <- consensus_nmf(A, k = 5, reps = 10, method = "hard", seed = 123) # KNN Jaccard consensus (more robust) cons_knn <- consensus_nmf(A, k = 5, reps = 10, method = "knn_jaccard", knn = 15, seed = 123) # Plot consensus heatmap plot(cons_hard) plot(cons_knn) # Check cophenetic coefficient (higher = more stable) print(cons_hard$cophenetic) print(cons_knn$cophenetic) # Get cluster assignments print(table(cons_hard$clusters))library(Matrix) A <- rsparsematrix(100, 50, 0.3) # Traditional hard clustering consensus cons_hard <- consensus_nmf(A, k = 5, reps = 10, method = "hard", seed = 123) # KNN Jaccard consensus (more robust) cons_knn <- consensus_nmf(A, k = 5, reps = 10, method = "knn_jaccard", knn = 15, seed = 123) # Plot consensus heatmap plot(cons_hard) plot(cons_knn) # Check cophenetic coefficient (higher = more stable) print(cons_hard$cophenetic) print(cons_knn$cophenetic) # Get cluster assignments print(table(cons_hard$clusters))
Column-by-column Euclidean norm cosine similarity for a matrix, pair of matrices, pair of vectors, or pair of a vector and matrix. Supports sparse matrices.
cosine(x, y = NULL)cosine(x, y = NULL)
x |
matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector" |
y |
(optional) matrix or vector of, or coercible to, class "dgCMatrix" or "sparseVector" |
This function takes advantage of extremely fast vector operations and is able to handle very large datasets.
cosine applies a Euclidean norm to provide very similar results to Pearson correlation. Note that negative values may be returned due to the use of Euclidean normalization when all associations are largely random.
This function adopts the sparse matrix computational strategy applied by qlcMatrix::cosSparse, and extends it to any combination of single and/or pair of sparse matrix and/or dense vector.
dense matrix, vector, or value giving cosine distances
x <- matrix(runif(20), 4, 5) cosine(x) # self-similarity: 5x5 matrix cosine(x, x[,1]) # similarity of columns to first columnx <- matrix(runif(20), 4, 5) cosine(x) # self-similarity: 5x5 matrix cosine(x, x[,1]) # similarity of columns to first column
Evaluates a factor_net architecture across a grid of hyperparameter combinations using held-out test loss. Each combination is fitted with a fraction of entries masked, and the test loss on those entries determines the optimal configuration.
cross_validate_graph( inputs, layer_fn, params, config = factor_config(), reps = 3L, strategy = c("grid", "random"), n_random = 20L, seed = 42L, verbose = TRUE )cross_validate_graph( inputs, layer_fn, params, config = factor_config(), reps = 3L, strategy = c("grid", "random"), n_random = 20L, seed = 42L, verbose = TRUE )
inputs |
Input node(s) from |
layer_fn |
A function that, given a named list of parameter values,
returns the output layer node of the network. Example:
|
params |
A named list of parameter vectors to search over.
Names should match the arguments expected by |
config |
A |
reps |
Number of CV replicates per parameter combination (each with a different CV mask seed). Default 3. |
strategy |
Search strategy: |
n_random |
Number of combinations for random search. Ignored for grid search. Default 20. |
seed |
Seed for random search sampling and CV mask derivation. Default 42. |
verbose |
Print progress updates. Default TRUE. |
For single-parameter rank selection, pass k = c(5, 10, 20).
For multi-parameter search, use params to specify named lists
of values for each layer and parameter.
A factor_net_cv object with components:
Data frame with columns: param values, rep, test_loss, train_loss, iterations, converged.
Data frame with param values, mean_test_loss, se_test_loss, mean_train_loss, ranked by mean_test_loss.
Named list of the best parameter combination.
List of all fit results (if keep_fits = TRUE).
The seed parameter defaults to 42L (deterministic)
rather than NULL (random) used by nmf() and
svd(). This ensures reproducible cross-validation grid
searches by default. Pass seed = NULL for non-deterministic
behavior.
factor_net, fit, factor_config
library(Matrix) X <- rsparsematrix(100, 50, 0.1) inp <- factor_input(X, "X") # Rank selection cv <- cross_validate_graph( inputs = inp, layer_fn = function(p) inp |> nmf_layer(k = p$k), params = list(k = c(3, 5, 10, 20)), config = factor_config(maxit = 50, seed = 42) ) print(cv) cv$best_params # optimal rank # Multi-parameter search cv2 <- cross_validate_graph( inputs = inp, layer_fn = function(p) inp |> nmf_layer(k = p$k, L1 = p$L1), params = list(k = c(5, 10, 20), L1 = c(0, 0.01, 0.1)), config = factor_config(maxit = 50, seed = 42), reps = 3 )library(Matrix) X <- rsparsematrix(100, 50, 0.1) inp <- factor_input(X, "X") # Rank selection cv <- cross_validate_graph( inputs = inp, layer_fn = function(p) inp |> nmf_layer(k = p$k), params = list(k = c(3, 5, 10, 20)), config = factor_config(maxit = 50, seed = 42) ) print(cv) cv$best_params # optimal rank # Multi-parameter search cv2 <- cross_validate_graph( inputs = inp, layer_fn = function(p) inp |> nmf_layer(k = p$k, L1 = p$L1), params = list(k = c(5, 10, 20), L1 = c(0, 0.01, 0.1)), config = factor_config(maxit = 50, seed = 42), reps = 3 )
Recursive bipartitioning by rank-2 matrix factorization with an efficient modularity-approximate stopping criteria
dclust( A, min_samples, min_dist = 0, tol = 1e-05, maxit = 100, nonneg = TRUE, seed = NULL, threads = 0, verbose = FALSE )dclust( A, min_samples, min_dist = 0, tol = 1e-05, maxit = 100, nonneg = TRUE, seed = NULL, threads = 0, verbose = FALSE )
A |
matrix of features-by-samples in sparse format (preferred class is "Matrix::dgCMatrix") |
min_samples |
stopping criteria giving the minimum number of samples permitted in a cluster |
min_dist |
stopping criteria giving the minimum cosine distance of samples within a cluster to the center of their assigned vs. unassigned cluster. If |
tol |
in rank-2 NMF, the correlation distance ( |
maxit |
maximum number of fitting iterations (default 100) |
nonneg |
in rank-2 NMF, enforce non-negativity |
seed |
random seed for rank-2 NMF model initialization |
threads |
number of threads for OpenMP parallelization (default 0 = all available) |
verbose |
print progress information (default FALSE) |
Divisive clustering is a sensitive and fast method for sample classification. Samples are recursively partitioned into two groups until a stopping criteria is satisfied and prevents successful partitioning.
See nmf and bipartition for technical considerations and optimizations relevant to bipartitioning.
Stopping criteria. Two stopping criteria are used to prevent indefinite division of clusters and tune the clustering resolution to a desirable range:
min_samples: Minimum number of samples permitted in a cluster
min_dist: Minimum cosine distance of samples to their cluster center relative to their unassigned cluster center (an approximation of Newman-Girvan modularity)
Newman-Girvan modularity () is an interpretable and widely used measure of modularity for a bipartition. However, it requires the calculation of distance between all within-cluster and between-cluster sample pairs. This is computationally intensive, especially for large sample sets.
dclust uses a measure which linearly approximates Newman-Girvan modularity, and simply requires the calculation of distance between all samples in a cluster and both cluster centers (the assigned and unassigned center), which is orders of magnitude faster to compute. Cosine distance is used instead of Euclidean distance since it handles outliers and sparsity well.
A bipartition is rejected if either of the two clusters contains fewer than min_samples or if the mean relative cosine distance of the bipartition is less than min_dist.
A bipartition will only be attempted if there are more than 2 * min_samples samples in the cluster, meaning that dist may not be calculated for some clusters.
Reproducibility. Because rank-2 NMF is approximate and requires random initialization, results may vary slightly across restarts. Therefore, specify a seed to guarantee absolute reproducibility.
Other than setting the seed, reproducibility may be improved by setting tol to a smaller number to increase the exactness of each bipartition.
A list of lists corresponding to individual clusters:
id : character sequence of "0" and "1" giving position of clusters along splitting hierarchy
samples : 0-indexed integer indices of samples in the cluster (add 1 for R-style indexing)
center : mean feature expression of all samples in the cluster
size : number of samples in the cluster
dclust() uses A for the data matrix and scalar
nonneg/tol parameters (matching bipartition()).
The default tol = 1e-5 is tighter than nmf()'s
1e-4 because rank-2 subproblems converge faster and benefit
from higher precision.
Zach DeBruine
Schwartz, G. et al. "TooManyCells identifies and visualizes relationships of single-cell clades". Nature Methods (2020).
Newman, MEJ. "Modularity and community structure in networks". PNAS (2006)
Kuang, D, Park, H. (2013). "Fast rank-2 nonnegative matrix factorization for hierarchical document clustering." Proc. 19th ACM SIGKDD intl. conf. on Knowledge discovery and data mining.
data(USArrests) A <- as(as.matrix(t(USArrests)), "dgCMatrix") clusters <- dclust(A, min_samples = 2, min_dist = 0.001) str(clusters)data(USArrests) A <- as(as.matrix(t(USArrests)), "dgCMatrix") clusters <- dclust(A, min_samples = 2, min_dist = 0.001) str(clusters)
Determines whether dispersion should be estimated per-row, per-column, or globally by examining the coefficient of variation of per-row and per-column dispersion estimates.
diagnose_dispersion(data, model, cv_threshold = 0.5, min_mu = 1e-06)diagnose_dispersion(data, model, cv_threshold = 0.5, min_mu = 1e-06)
data |
Input matrix (sparse or dense) |
model |
A fitted NMF model |
cv_threshold |
CV threshold for declaring structured dispersion. Default 0.5. |
min_mu |
Floor for predicted values. Default 1e-6. |
Computes moment-based dispersion estimates
(where is determined by the distribution) per row and per column.
If the coefficient of variation (CV) of per-row estimates exceeds
cv_threshold, per-row dispersion is recommended; similarly for
per-column. If both CVs are low, global dispersion suffices.
A list with:
Recommended DispersionMode: "global", "per_row", or "per_col"
Global dispersion estimate
CV of per-row dispersion estimates
CV of per-col dispersion estimates
data(aml) model <- nmf(aml, k = 3, maxit = 10, seed = 1) disp <- diagnose_dispersion(aml, model) disp$modedata(aml) model <- nmf(aml, k = 3, maxit = 10, seed = 1) disp <- diagnose_dispersion(aml, model) disp$mode
Tests whether a dataset has excess zeros beyond what the chosen distribution predicts, and recommends a zero-inflation mode.
diagnose_zero_inflation(data, model, threshold = 0.05)diagnose_zero_inflation(data, model, threshold = 0.05)
data |
Input matrix (sparse or dense) |
model |
A fitted NMF model (any distribution) |
threshold |
Minimum excess zero fraction to declare ZI. Default 0.05. |
Computes the expected number of zeros under the fitted distribution
(Poisson approximation: ), compares to
the observed zero count, and recommends ZI if the excess is large.
ZI granularity is determined by whether per-row and per-col excess rates have high variance (suggesting different rows/cols have different ZI levels).
A list with:
Fraction of zeros exceeding the expected count
Logical: TRUE if excess_zero_rate > threshold
Recommended mode: "none", "row", or "col"
Per-row excess zero rates
Per-col excess zero rates
data(aml) model <- nmf(aml, k = 3, maxit = 10, seed = 1) zi <- diagnose_zero_inflation(aml, model) zi$has_zidata(aml) model <- nmf(aml, k = 3, maxit = 10, seed = 1) zi <- diagnose_zero_inflation(aml, model) zi$has_zi
MNIST handwritten digit dataset containing grayscale images of digits 0-9. Each image is 28x28 pixels flattened to 784 features.
digitsdigits
A matrix (dense) with pixel intensities.
Rows are samples, columns are features.
This is the MNIST dataset, commonly used for benchmarking machine learning algorithms. The data is stored as a dense matrix (not sparse) since handwritten digit images have substantial non-zero content.
For NMF, it's recommended to:
Normalize pixel values to [0,1] by dividing by 255
Transpose to have pixels as rows, samples as columns
Use a subset for faster experimentation
The true rank is 10 (number of digit classes), though higher ranks may capture stroke variations and writing styles within digits.
MNIST database of handwritten digits. Yann LeCun, Corinna Cortes, Christopher J.C. Burges. http://yann.lecun.com/exdb/mnist/
data(digits) dim(digits) model <- nmf(digits, k = 10, maxit = 50)data(digits) dim(digits) model <- nmf(digits, k = 10, maxit = 50)
Calculate loss for an NMF model using the specified loss function, accounting for any masking schemes requested during fitting.
evaluate(x, ...) ## S4 method for signature 'nmf' evaluate( x, data, mask = NULL, missing_only = FALSE, loss = c("mse", "gp"), test_fraction = 0, test_seed = NULL, eval_set = c("all", "test", "train"), threads = 0, verbose = FALSE, ... )evaluate(x, ...) ## S4 method for signature 'nmf' evaluate( x, data, mask = NULL, missing_only = FALSE, loss = c("mse", "gp"), test_fraction = 0, test_seed = NULL, eval_set = c("all", "test", "train"), threads = 0, verbose = FALSE, ... )
x |
fitted model, class |
... |
advanced parameters. See Advanced Parameters section. |
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
mask |
missing data mask. Accepts:
|
missing_only |
calculate loss only for missing values specified as a matrix in |
loss |
loss function to use: "mse" (Mean Squared Error, default) or "gp" (Generalized Poisson / KL divergence) |
test_fraction |
fraction of entries to hold out as test set (default 0 = disabled). When > 0, creates a random mask for test/train split. |
test_seed |
seed for test set generation. If NULL, attempts to use test mask from model's @misc$test_mask if available. |
eval_set |
which set to evaluate: "all" (default), "test" (held-out entries only), or "train" (non-held-out entries only). Only used when test_fraction > 0 or test mask exists in model. |
threads |
number of threads for OpenMP parallelization (default 0 = all available) |
verbose |
print progress information (default FALSE) |
A single numeric value: the loss (MSE or GP/KL divergence) of the model on the data.
library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, 3, seed = 1, maxit = 50, tol = 1e-4) evaluate(model, A) # MSElibrary(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, 3, seed = 1, maxit = 50, tol = 1e-4) evaluate(model, A) # MSE
Export training log to CSV
export_log(logger, file)export_log(logger, file)
logger |
A |
file |
Path to write the CSV file. |
Invisibly returns the data frame written.
training_logger, as.data.frame.training_logger
logger <- training_logger() # After fitting: export_log(logger, tempfile(fileext = ".csv"))logger <- training_logger() # After fitting: export_log(logger, tempfile(fileext = ".csv"))
Creates a node that adds H factors element-wise from multiple branches. All branches must have the same rank k.
factor_add(...)factor_add(...)
... |
Two or more |
An fn_node of type "add".
factor_concat, factor_shared, factor_net
data(aml) inp <- factor_input(aml) branch1 <- nmf_layer(inp, k = 5) branch2 <- nmf_layer(inp, k = 5) added <- factor_add(branch1, branch2)data(aml) inp <- factor_input(aml) branch1 <- nmf_layer(inp, k = 5) branch2 <- nmf_layer(inp, k = 5) added <- factor_add(branch1, branch2)
Creates a node that row-binds the H factors from multiple branches. Allows combining branches with different ranks into a single higher-dimensional representation: k_out = k_1 + k_2 + ...
factor_concat(...)factor_concat(...)
... |
Two or more |
An fn_node of type "concat".
factor_shared, factor_add, factor_net
data(aml) inp <- factor_input(aml) branch1 <- nmf_layer(inp, k = 3) branch2 <- nmf_layer(inp, k = 5) combined <- factor_concat(branch1, branch2)data(aml) inp <- factor_input(aml) branch1 <- nmf_layer(inp, k = 3) branch2 <- nmf_layer(inp, k = 5) combined <- factor_concat(branch1, branch2)
Appends columns from a metadata matrix Z to the H factor before passing downstream. The next layer's W learns to factor out the conditioning variables.
factor_condition(input, Z)factor_condition(input, Z)
input |
An |
Z |
Conditioning matrix (n x p) or (p x n). Will be oriented to match H dimensions. |
An fn_node of type "condition".
data(aml) inp <- factor_input(aml) layer1 <- nmf_layer(inp, k = 5) # Condition on batch metadata (2 batches) Z <- matrix(rep(c(1, 0, 0, 1), c(60, 75, 60, 75)), nrow = 135, ncol = 2) conditioned <- factor_condition(layer1, Z)data(aml) inp <- factor_input(aml) layer1 <- nmf_layer(inp, k = 5) # Condition on batch metadata (2 batches) Z <- matrix(rep(c(1, 0, 0, 1), c(60, 75, 60, 75)), nrow = 135, ncol = 2) conditioned <- factor_condition(layer1, Z)
Sets network-wide defaults. Layer-level and factor-level settings override these where specified.
factor_config( maxit = 100, tol = 1e-04, loss = c("mse", "gp", "nb", "gamma", "inverse_gaussian", "tweedie"), verbose = FALSE, seed = NULL, threads = 0, norm = c("L1", "L2", "none"), solver = c("auto", "cholesky", "cd"), resource = "auto", test_fraction = 0, cv_seed = 0L, mask_zeros = FALSE, patience = 5L, ... )factor_config( maxit = 100, tol = 1e-04, loss = c("mse", "gp", "nb", "gamma", "inverse_gaussian", "tweedie"), verbose = FALSE, seed = NULL, threads = 0, norm = c("L1", "L2", "none"), solver = c("auto", "cholesky", "cd"), resource = "auto", test_fraction = 0, cv_seed = 0L, mask_zeros = FALSE, patience = 5L, ... )
maxit |
Maximum ALS iterations per layer. Default 100. |
tol |
Convergence tolerance. Default 1e-4. |
loss |
Loss function: "mse", "gp", "nb", "gamma",
"inverse_gaussian", "tweedie". Default "mse". For robust (Huber/MAE)
fitting, use the |
verbose |
Print per-iteration diagnostics. Default FALSE. |
seed |
Random seed. Default NULL (auto). |
threads |
Number of CPU threads (0 = all). Default 0. |
norm |
Normalization type: "L1", "L2", "none". Default "L1". |
solver |
NNLS solver: "auto", "cholesky", "cd". Default "auto". |
resource |
Resource override: "auto", "cpu", "gpu". Default "auto". |
test_fraction |
Fraction of entries held out for cross-validation test set. 0 = no CV (standard fit). Default 0. |
cv_seed |
Separate seed for CV mask (0 = derive from |
mask_zeros |
If TRUE, only non-zero entries are in the test set (suitable for recommendation systems). If FALSE, all entries including zeros may appear in the test set. Default FALSE. |
patience |
Number of iterations without test-loss improvement before early stopping during CV. Default 5. |
... |
Additional arguments applied network-wide. These are forwarded
to the underlying |
Unsupported combinations:
Cholesky solver with non-MSE losses (GP, NB, Gamma, etc.) is not
supported. Use solver="cd" or solver="auto".
Zero-inflation is only available with GP and NB losses at the layer
level; it cannot be set globally in factor_config().
An fn_global_config object.
# Default config cfg <- factor_config() # Poisson NMF with cross-validation cfg <- factor_config(loss = "gp", test_fraction = 0.1, maxit = 50)# Default config cfg <- factor_config() # Poisson NMF with cross-validation cfg <- factor_config(loss = "gp", test_fraction = 0.1, maxit = 50)
Wraps a data matrix as a graph input node. The matrix can be dense or sparse (dgCMatrix), or a file path to a .spz file for streaming.
factor_input(data, name = NULL)factor_input(data, name = NULL)
data |
A numeric matrix, sparse matrix (dgCMatrix), or character string path to a .spz file for out-of-core streaming NMF. |
name |
Optional name for the input (used in multi-modal results). |
An fn_node object of type "input".
nmf_layer, svd_layer, factor_net
data(aml) inp <- factor_input(aml, name = "aml") inpdata(aml) inp <- factor_input(aml, name = "aml") inp
Validates the graph topology, resolves config inheritance, and
returns a compiled network ready for fit().
factor_net(inputs, output, config = factor_config())factor_net(inputs, output, config = factor_config())
inputs |
A single |
output |
The output |
config |
A |
A factor_net object.
fit, factor_config, nmf_layer,
svd_layer, cross_validate_graph
data(aml) inp <- factor_input(aml, "aml") out <- nmf_layer(inp, k = 5) net <- factor_net(inp, out, config = factor_config(maxit = 10)) netdata(aml) inp <- factor_input(aml, "aml") out <- nmf_layer(inp, k = 5) net <- factor_net(inp, out, config = factor_config(maxit = 10)) net
Executes the compiled graph from factor_net(). For single-layer
networks, this delegates directly to nmf() for maximum performance.
Multi-layer networks use R-level alternating least squares.
fit(object, ...) ## S3 method for class 'factor_net' fit(object, logger = NULL, ...)fit(object, ...) ## S3 method for class 'factor_net' fit(object, logger = NULL, ...)
object |
A |
... |
Additional arguments (currently unused). |
logger |
Optional |
A factor_net_result object. Access per-layer results by
name (e.g. result$L1). Each layer is a list with components
W, d, H, and iterations. Additional
fields: n_layers, multi_modal, total_iterations,
total_loss, and converged.
factor_net, predict.factor_net_result,
cross_validate_graph
library(Matrix) X <- rsparsematrix(100, 50, 0.1) inp <- factor_input(X, "X") L1 <- inp |> nmf_layer(k = 5, name = "L1") net <- factor_net(inputs = inp, output = L1, config = factor_config(maxit = 100, seed = 42)) res <- fit(net) res$L1$W # m x k basis matrix res$L1$H # k x n coefficient matrixlibrary(Matrix) X <- rsparsematrix(100, 50, 0.1) inp <- factor_input(X, "X") L1 <- inp |> nmf_layer(k = 5, name = "L1") net <- factor_net(inputs = inp, output = L1, config = factor_config(maxit = 100, seed = 42)) res <- fit(net) res$L1$W # m x k basis matrix res$L1$H # k x n coefficient matrix
Gene expression data from the landmark Golub et al. (1999) study, consisting of bone marrow samples from acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML) patients. This is the 5000 most variable genes subset used in the seminal NMF paper by Brunet et al. (2004, PNAS).
golubgolub
A dgCMatrix sparse matrix (38 x 5000) containing gene expression
values. Rows are patient samples, columns are genes. Cancer type labels are
stored as an attribute.
Access metadata via:
attr(golub, "cancer_type") - Factor: "ALL" or "AML"
attr(golub, "n_all") - Number of ALL samples (27)
attr(golub, "n_aml") - Number of AML samples (11)
This dataset is one of the most widely used benchmarks for class discovery and cancer subtype identification. The original study identified gene expression signatures that distinguish ALL from AML.
NMF analysis reveals:
True rank = 2: The two cancer types (ALL vs AML)
Optimal rank ~3: Brunet et al. found rank 3 more informative, potentially capturing ALL subtypes (B-cell vs T-cell)
This dataset has been preprocessed to include only the 5000 most variable genes for computational efficiency, following Brunet et al. (2004).
Golub et al. (1999). "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286(5439): 531-537.
Brunet et al. (2004). "Metagenes and molecular pattern discovery using matrix factorization." PNAS 101(12): 4164-4169.
Data retrieved via NMF package (Gaujoux and Seoighe).
# Load dataset data(golub) # Inspect dim(golub) # 38 samples x 5000 genes table(attr(golub, "cancer_type")) # 27 ALL, 11 AML # Run NMF for class discovery model <- nmf(t(golub), k = 2, maxit = 100) # Transpose: genes x samples # Try rank 3 as suggested by Brunet et al. model3 <- nmf(t(golub), k = 3, maxit = 100)# Load dataset data(golub) # Inspect dim(golub) # 38 samples x 5000 genes table(attr(golub, "cancer_type")) # 27 ALL, 11 AML # Run NMF for class discovery model <- nmf(t(golub), k = 2, maxit = 100) # Transpose: genes x samples # Try rank 3 as suggested by Brunet et al. model3 <- nmf(t(golub), k = 3, maxit = 100)
Detects CUDA GPUs at runtime. Result is cached for the session.
gpu_available(force_recheck = FALSE)gpu_available(force_recheck = FALSE)
force_recheck |
Re-probe GPUs even if already cached |
logical TRUE if GPU is available
gpu_available()gpu_available()
Get GPU device information
gpu_info()gpu_info()
data.frame with GPU device details, or NULL if no GPU
gpu_info()gpu_info()
S3 methods for the gpu_sparse_matrix class returned by
st_read_gpu.
## S3 method for class 'gpu_sparse_matrix' print(x, ...) ## S3 method for class 'gpu_sparse_matrix' dim(x) ## S3 method for class 'gpu_sparse_matrix' nrow(x) ## S3 method for class 'gpu_sparse_matrix' ncol(x)## S3 method for class 'gpu_sparse_matrix' print(x, ...) ## S3 method for class 'gpu_sparse_matrix' dim(x) ## S3 method for class 'gpu_sparse_matrix' nrow(x) ## S3 method for class 'gpu_sparse_matrix' ncol(x)
x |
a |
... |
additional arguments (unused) |
printInvisibly returns x, prints summary to console.
dimInteger vector of length 2: c(nrow, ncol).
nrowNumber of rows (integer).
ncolNumber of columns (integer).
## Not run: gpu_mat <- st_read_gpu("data.spz") print(gpu_mat) dim(gpu_mat) nrow(gpu_mat) ncol(gpu_mat) st_free_gpu(gpu_mat) ## End(Not run)## Not run: gpu_mat <- st_read_gpu("data.spz") print(gpu_mat) dim(gpu_mat) nrow(gpu_mat) ncol(gpu_mat) st_free_gpu(gpu_mat) ## End(Not run)
Frequency of bird species observations across survey grids in the Hawaiian islands. This dataset contains presence/frequency data for 183 bird species across 1,183 spatial grid cells, along with metadata about survey locations and species characteristics.
hawaiibirdshawaiibirds
A dgCMatrix sparse matrix (183 x 1,183) with bird species
frequency observations. Rows are bird species and columns are survey grid cells.
Metadata is stored as attributes.
Access metadata via:
attr(hawaiibirds, "metadata_h") - Survey grid information (1,183 rows):
Character vector of grid cell identifiers
Factor indicating which Hawaiian island
Numeric latitude coordinate
Numeric longitude coordinate
attr(hawaiibirds, "metadata_w") - Species information (183 rows):
Character vector of species common names
Factor: "introduced" or "native"
Factor indicating bird type (e.g., "birds of prey", "seabirds")
This dataset contains bird observation data from the Hawaiian Islands, useful for ecological and biogeographic studies. The data is sparse, with many grid cells containing no observations for most species.
Bird types include:
Birds of prey
Coastal and wetland birds
Land birds
Seabirds
Waterbirds
Data originally from RcppML package, derived from Hawaii bird observation surveys.
# Load dataset data(hawaiibirds) # Inspect structure dim(hawaiibirds) table(attr(hawaiibirds, "metadata_w")$status) table(attr(hawaiibirds, "metadata_h")$island) # Run NMF to identify species co-occurrence patterns result <- nmf(hawaiibirds, k = 10, maxit = 50)# Load dataset data(hawaiibirds) # Inspect structure dim(hawaiibirds) table(attr(hawaiibirds, "metadata_w")$status) table(attr(hawaiibirds, "metadata_h")$island) # Run NMF to identify species co-occurrence patterns result <- nmf(hawaiibirds, k = 10, maxit = 50)
Movie ratings data from the MovieLens 100K dataset, containing user ratings for movies along with genre information.
movielensmovielens
A dgCMatrix sparse matrix (3,867 x 610) containing movie ratings.
Rows are movies, columns are users, and values are ratings from 1-5 stars.
Most entries are zero (unrated). Genre information is stored as an attribute.
Access genre data via attr(movielens, "genres"), which returns an
ngCMatrix sparse binary matrix (19 x 3,867) indicating genre membership.
Rows are genres, columns are movies, and entries are TRUE/FALSE for genre membership.
This is a subset of the MovieLens dataset, widely used for demonstrating collaborative filtering and recommendation systems. The data is highly sparse, as most users have only rated a small fraction of available movies.
The 19 genres include:
Action, Adventure, Animation, Children
Comedy, Crime, Documentary, Drama
Fantasy, Film-Noir, Horror, Musical
Mystery, Romance, Sci-Fi, Thriller
War, Western, IMAX
MovieLens 100K Dataset from GroupLens Research. https://grouplens.org/datasets/movielens/
# Load dataset data(movielens) # Inspect structure dim(movielens) dim(attr(movielens, "genres")) # Sparsity Matrix::nnzero(movielens) / prod(dim(movielens)) # Run NMF for collaborative filtering result <- nmf(movielens, k = 20, maxit = 50)# Load dataset data(movielens) # Inspect structure dim(movielens) dim(attr(movielens, "genres")) # Sparsity Matrix::nnzero(movielens) / prod(dim(movielens)) # Run NMF for collaborative filtering result <- nmf(movielens, k = 20, maxit = 50)
High-performance NMF of the form for large dense or sparse matrices. Supports single-rank fitting or cross-validation across multiple ranks. Returns an nmf object or nmfCrossValidate data.frame.
nmf( data, k, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL, loss = c("mse", "gp", "nb", "gamma", "inverse_gaussian", "tweedie"), nonneg = c(TRUE, TRUE), test_fraction = 0, verbose = FALSE, projective = FALSE, symmetric = FALSE, zi = c("none", "row", "col"), robust = FALSE, ... )nmf( data, k, tol = 1e-04, maxit = 100, L1 = c(0, 0), L2 = c(0, 0), seed = NULL, mask = NULL, loss = c("mse", "gp", "nb", "gamma", "inverse_gaussian", "tweedie"), nonneg = c(TRUE, TRUE), test_fraction = 0, verbose = FALSE, projective = FALSE, symmetric = FALSE, zi = c("none", "row", "col"), robust = FALSE, ... )
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
k |
rank (integer), vector of ranks for cross-validation, or "auto" for automatic rank determination. |
tol |
tolerance of the fit (default 1e-4) |
maxit |
maximum number of fitting iterations (default 100) |
L1 |
LASSO penalties in the range (0, 1], single value or array of length two for |
L2 |
Ridge penalties greater than zero, single value or array of length two for |
seed |
initialization control. Accepts:
|
mask |
missing data mask. Accepts:
|
loss |
loss function: |
nonneg |
logical vector of length 2 for |
test_fraction |
fraction of entries to hold out for cross-validation (default 0 = disabled). |
verbose |
print progress information during fitting (default FALSE) |
projective |
if |
symmetric |
if |
zi |
zero-inflation mode: |
robust |
robustness control. |
... |
advanced parameters. See Advanced Parameters section. |
This fast NMF implementation decomposes a matrix into lower-rank non-negative matrices and ,
with columns of and rows of scaled to sum to 1 via multiplication by a diagonal, :
The scaling diagonal ensures convex L1 regularization, consistent factor scalings regardless of random initialization, and model symmetry in factorizations of symmetric matrices.
The factorization model is randomly initialized. and are updated by alternating least squares.
RcppML achieves high performance using the Eigen C++ linear algebra library, OpenMP parallelization, a dedicated Rcpp sparse matrix class, and fast sequential coordinate descent non-negative least squares initialized by Cholesky least squares solutions.
Sparse optimization is automatically applied if the input matrix A is a sparse matrix (i.e. Matrix::dgCMatrix). There are also specialized back-ends for symmetric, rank-1, and rank-2 factorizations.
L1 penalization can be used for increasing the sparsity of factors and assisting interpretability. Penalty values should range from 0 to 1, where 1 gives complete sparsity.
Set options(RcppML.verbose = TRUE) to print model tolerances to the console after each iteration.
Parallelization is applied with OpenMP using the number of threads in getOption("RcppML.threads") and set by option(RcppML.threads = 0), for example. 0 corresponds to all threads, let OpenMP decide.
When k is a single integer: an S4 object of class nmf with slots:
wfeature factor matrix, m x k
dscaling diagonal vector of length k (descending order after sorting)
hsample factor matrix, k x n
misclist containing tol (final tolerance), iter (iteration count),
loss (final loss value), loss_type (loss function used), and runtime (seconds).
When k is a vector: a data.frame of class nmfCrossValidate with columns
k, rep, train_loss, test_loss, and best_iter.
When k is a vector, the function performs cross-validation to find the optimal rank:
A fraction (test_fraction) of entries are held out for validation
Models are fit for each rank in k using only training data
Test loss is computed on held-out entries
Returns a data.frame with k, rep, train_mse, test_mse, and best_iter for each rank/replicate
Use plot() on the result to visualize loss vs rank
The loss parameter controls the objective function:
"mse": Mean Squared Error (default). Standard Frobenius norm minimization.
"gp": Generalized Poisson. For overdispersed count data (e.g., scRNA-seq).
Use dispersion to control per-row or global overdispersion estimation.
With dispersion="none", equivalent to KL divergence (Poisson model).
"nb": Negative Binomial. Quadratic variance-mean relationship. Standard for
scRNA-seq data with overdispersion. Size parameter (r) estimated per-row or globally.
"gamma": Gamma distribution. Variance proportional to mu^2. For positive
continuous data. Dispersion (phi) estimated via MoM Pearson estimator.
"inverse_gaussian": Inverse Gaussian. Variance proportional to mu^3. For
positive continuous data with heavy right tails.
"tweedie": Tweedie distribution. Variance proportional to mu^p where p is
set via tweedie_power (default 1.5). Interpolates between Poisson (p=1),
Gamma (p=2), and Inverse Gaussian (p=3).
For robust fitting (equivalent to Huber or MAE loss), use the robust parameter
instead of a separate loss function. Setting robust = TRUE applies Huber-weighted
IRLS with delta=1.345; robust = "mae" approximates mean absolute error.
Non-MSE losses use Iteratively Reweighted Least Squares (IRLS) which may be slower but provides better fits for count data (GP, NB) and positive continuous data (Gamma, Inverse Gaussian).
...)The following parameters can be passed via ...:
Regularization:
L21Group sparsity penalty, single value or c(w, h) (default c(0,0))
angularAngular decorrelation penalty, c(w, h) (default c(0,0))
upper_boundBox constraint on factors, c(w, h) (default c(0,0) = no bound)
graph_W, graph_H
Sparse graph Laplacian matrices for feature/sample regularization
graph_lambdaGraph regularization strength, c(w, h) (default c(0,0))
target_HTarget matrix (k x n) for H-side regularization.
Steers H toward a desired structure during fitting. See compute_target.
target_lambdaTarget regularization strength, single value
or c(w, h) (default 0). Positive values attract H toward the
target (label enrichment); negative values use PROJ_ADV
eigenvalue-projected adversarial removal to suppress
target-correlated structure (batch removal).
See vignette("guided-nmf") for details.
Distribution Tuning:
dispersionDispersion mode for GP loss: "per_row", "per_col", "global", "none"
theta_init, theta_max, theta_min
GP theta bounds
nb_size_init, nb_size_max, nb_size_min
NB dispersion bounds
gamma_phi_init, gamma_phi_max, gamma_phi_min
Gamma/IG dispersion bounds
tweedie_powerTweedie variance power (default 1.5)
irls_max_iter, irls_tol
IRLS convergence parameters
Zero-Inflation:
zi_em_itersEM iterations per NMF step (default 1)
Solver:
solverNNLS solver: "auto", "cholesky", "cd"
cd_tolCD convergence tolerance (default 1e-8)
cd_maxitCD max iterations (default 100)
h_initCustom initial H matrix
Cross-Validation:
cv_seedSeparate seed(s) for CV holdout pattern
patienceEarly stopping patience (default 5)
cv_k_rangeAuto-rank search range (default c(2, 50))
track_train_lossTrack training loss in CV (default TRUE)
Resources & Output:
threadsOpenMP threads (default 0 = all)
resourceCompute backend: "auto", "cpu", "gpu"
normFactor normalization: "L1", "L2", "none"
sort_modelSort factors by diagonal (default TRUE)
Streaming:
streamingSPZ streaming mode: "auto", TRUE, FALSE
panel_colsPanel size for streaming (default 0 = auto)
dispatchStreamPress dispatch override
Callbacks:
on_iterationPer-iteration callback function
profileEnable timing profiling (default FALSE)
By default (resource = "auto"), RcppML auto-detects available hardware.
All features are fully supported on CPU (the default backend).
GPU acceleration (when compiled with CUDA support) accelerates sparse and dense NMF.
GPU is experimental and falls back to CPU automatically if unavailable.
S4 methods available for the nmf class:
predict: project an NMF model (or partial model) onto new samples
evaluate: calculate mean squared error loss of an NMF model
summary: data.frame giving fractional, total, or mean representation of factors in samples or features grouped by some criteria
align: find an ordering of factors in one nmf model that best matches those in another nmf model
prod: compute the dense approximation of input data
sparsity: compute the sparsity of each factor in and
subset: subset, reorder, select, or extract factors (same as [)
generics such as dim, dimnames, t, show, head
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
predict, mse, nnls,
align, summary,nmf-method
# basic NMF model <- nmf(Matrix::rsparsematrix(1000, 100, 0.1), k = 10) # cross-validation to find optimal rank sim <- simulateNMF(200, 80, k = 5, noise = 3.0, seed = 42) cv <- nmf(sim$A, k = 2:10, test_fraction = 0.05, cv_seed = 1:3, tol = 1e-5, maxit = 200) plot(cv) optimal_k <- cv$k[which.min(cv$test_mse)] # fit final model with optimal rank model <- nmf(sim$A, optimal_k)# basic NMF model <- nmf(Matrix::rsparsematrix(1000, 100, 0.1), k = 10) # cross-validation to find optimal rank sim <- simulateNMF(200, 80, k = 5, noise = 3.0, seed = 42) cv <- nmf(sim$A, k = 2:10, test_fraction = 0.05, cv_seed = 1:3, tol = 1e-5, maxit = 200) plot(cv) optimal_k <- cv$k[which.min(cv$test_mse)] # fit final model with optimal rank model <- nmf(sim$A, optimal_k)
Creates a non-negative matrix factorization layer that decomposes its input into W * diag(d) * H with non-negativity constraints.
nmf_layer( input, k, L1 = 0, L2 = 0, L21 = 0, angular = 0, upper_bound = 0, mask = NULL, zi = c("none", "row", "col"), projective = FALSE, symmetric = FALSE, robust = FALSE, W = NULL, H = NULL, name = NULL, ... )nmf_layer( input, k, L1 = 0, L2 = 0, L21 = 0, angular = 0, upper_bound = 0, mask = NULL, zi = c("none", "row", "col"), projective = FALSE, symmetric = FALSE, robust = FALSE, W = NULL, H = NULL, name = NULL, ... )
input |
An |
k |
Factorization rank. |
L1 |
L1 penalty (shared by W and H unless overridden). Default 0. |
L2 |
L2 penalty. Default 0. |
L21 |
Group sparsity penalty. Default 0. |
angular |
Orthogonality penalty. Default 0. |
upper_bound |
Box constraint. Default 0. |
mask |
Masking mode: NULL (none), |
zi |
Zero-inflation mode: |
projective |
Use projective NMF (W is reused as H). Default FALSE. |
symmetric |
Use symmetric NMF (W == H). Default FALSE. |
robust |
Robustness control: |
W |
Optional |
H |
Optional |
name |
Optional layer name (for results access). |
... |
Additional arguments forwarded to |
When used with |>, the input node is the first argument:
x |> nmf_layer(k = 64).
Layer-level regularization parameters (L1, L2, etc.) apply to both
W and H unless overridden by W() or H().
An fn_node of type "nmf_layer".
svd_layer, factor_net, factor_input,
nmf
data(aml) inp <- factor_input(aml) layer <- nmf_layer(inp, k = 5) # With zero-inflation and distribution-specific tuning layer_gp <- nmf_layer(inp, k = 5, zi = "row", dispersion = "per_col", theta_init = 0.5)data(aml) inp <- factor_input(aml) layer <- nmf_layer(inp, k = 5) # With zero-inflation and distribution-specific tuning layer_gp <- nmf_layer(inp, k = 5, zi = "row", dispersion = "per_col", theta_init = 0.5)
The S4 class for NMF model results.
wfeature factor matrix
dscaling diagonal vector
hsample factor matrix
misclist containing optional components including tol (tolerance), iter (iterations), loss (final loss value), loss_type (loss function used), runtime (in seconds), w_init (initial w matrix), test_mask (CV test set), test_seed (CV seed), test_fraction (CV holdout fraction), train_loss (CV training loss), test_loss (CV test loss), and best_iter (CV best iteration)
nmf, evaluate, align, predict,nmf-method
Project a factor matrix onto new data to solve for the complementary matrix.
Given and , find such that is minimized.
Alternatively, given and , find such that is minimized.
nnls( w = NULL, h = NULL, A, L1 = c(0, 0), L2 = c(0, 0), loss = "mse", upper_bound = c(0, 0), nonneg = c(TRUE, TRUE), threads = 0, verbose = FALSE, ... )nnls( w = NULL, h = NULL, A, L1 = c(0, 0), L2 = c(0, 0), loss = "mse", upper_bound = c(0, 0), nonneg = c(TRUE, TRUE), threads = 0, verbose = FALSE, ... )
w |
factor model matrix (n_features x k) for solving H. Set to NULL to solve for W. |
h |
factor model matrix (k x n_samples) for solving W. Set to NULL to solve for H. |
A |
data matrix. Can be dense (matrix) or sparse (dgCMatrix). |
L1 |
L1/LASSO penalty, length 1 or 2. Range [0, 1). |
L2 |
Ridge penalty, length 1 or 2. Range [0, Inf). |
loss |
loss function: |
upper_bound |
maximum value in solution, length 1 or 2. 0 = no bound. |
nonneg |
non-negativity constraints, length 1 or 2. |
threads |
number of threads for OpenMP parallelization (0 = all available). |
verbose |
print progress information. |
... |
advanced parameters. See Advanced Parameters section. |
This function solves NNLS projection problems with full flexibility.
Projection Modes:
w provided, h = NULL: Solve for H given W and A (standard projection)
h provided, w = NULL: Solve for W given H and A (transpose projection)
Exactly one of w or h must be NULL
matrix of dimension (k x n_samples) for H or (n_features x k) for W
...)L21L2,1 group sparsity penalty, length 1 or 2 (default c(0,0))
angularAngular decorrelation penalty, length 1 or 2 (default c(0,0))
cd_maxitMax coordinate descent iterations (default 100)
cd_tolCD stopping tolerance (default 1e-8)
warm_startOptional initial solution matrix
dispersionDispersion estimation mode: "per_row" (default) or "global"
theta_initInitial GP theta (default 0.1)
theta_maxMaximum GP theta (default 5.0)
theta_minMinimum GP theta (default 0.0)
nb_size_initInitial NB size parameter (default 10.0)
nb_size_maxMaximum NB size (default 1e6)
nb_size_minMinimum NB size (default 0.01)
gamma_phi_initInitial Gamma shape parameter (default 1.0)
gamma_phi_maxMaximum Gamma shape (default 1e4)
gamma_phi_minMinimum Gamma shape (default 1e-6)
tweedie_powerTweedie variance power (default 1.5)
irls_max_iterMaximum IRLS iterations per solve (default 5)
irls_tolIRLS convergence tolerance (default 1e-4)
When target_H and target_lambda are provided (via ...),
nnls() routes the solve through a single NMF iteration internally.
Positive target_lambda attracts the solution toward target_H
(label enrichment). Negative target_lambda uses eigenvalue-projected
adversarial removal (PROJ_ADV) to suppress target-correlated structure
(batch removal). See vignette("guided-nmf") for details.
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
Franc, VC, Hlavac, VC, and Navara, M. (2005). "Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem."
# Generate synthetic NMF problem set.seed(123) w <- matrix(runif(100), 20, 5) # 20 features, 5 factors h_true <- matrix(runif(50), 5, 10) # 5 factors, 10 samples A <- w %*% h_true + matrix(rnorm(200, 0, 0.1), 20, 10) # Add noise # Project W onto new data to find H h_recovered <- nnls(w = w, A = A) cor(as.vector(h_true), as.vector(h_recovered)) # With L1 penalty for sparse H h_sparse <- nnls(w = w, A = A, L1 = c(0, 0.1))# Generate synthetic NMF problem set.seed(123) w <- matrix(runif(100), 20, 5) # 20 features, 5 factors h_true <- matrix(runif(50), 5, 10) # 5 factors, 10 samples A <- w %*% h_true + matrix(rnorm(200, 0, 0.1), 20, 10) # Add noise # Project W onto new data to find H h_recovered <- nnls(w = w, A = A) cor(as.vector(h_true), as.vector(h_recovered)) # With L1 penalty for sparse H h_sparse <- nnls(w = w, A = A, L1 = c(0, 0.1))
Grayscale face images from AT&T Laboratories Cambridge. This dataset contains 400 face images (64x64 pixels) from 40 subjects, with 10 images per subject showing different poses, expressions, and lighting conditions.
olivettiolivetti
A dgCMatrix sparse matrix (400 x 4096) containing grayscale
face images. Each row is a flattened 64x64 pixel image with values in [0,1].
Subject labels are stored as an attribute.
Access metadata via:
attr(olivetti, "subject") - Factor indicating which of 40 subjects
attr(olivetti, "image_shape") - c(64, 64) image dimensions
attr(olivetti, "n_subjects") - Number of subjects (40)
attr(olivetti, "images_per_subject") - Images per subject (10)
This dataset is commonly used for face recognition, clustering, and dimensionality reduction benchmarks. The images show variation in:
Facial expression (smiling, neutral, etc.)
Head pose (left, right, up, down)
Lighting conditions
Accessories (glasses on/off in some cases)
The true rank for NMF is 40 (number of subjects), though lower ranks may capture common facial features and higher ranks may distinguish expression and pose variations within subjects.
To reshape a row back to an image:
matrix(olivetti[i,], nrow=64, ncol=64, byrow=TRUE)
AT&T Laboratories Cambridge (formerly Olivetti Research Laboratory). https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
Original images 92x112, downsampled to 64x64 in sklearn version.
# Load dataset data(olivetti) # Inspect dim(olivetti) # 400 x 4096 table(attr(olivetti, "subject")) # 10 images per subject # Visualize first face face_img <- matrix(olivetti[1,], nrow=64, ncol=64, byrow=TRUE) image(t(face_img)[,64:1], col=grey.colors(256)) # Run NMF to discover face components (small k for speed) model <- nmf(t(olivetti), k = 5, maxit = 10)# Load dataset data(olivetti) # Inspect dim(olivetti) # 400 x 4096 table(attr(olivetti, "subject")) # 10 images per subject # Visualize first face face_img <- matrix(olivetti[1,], nrow=64, ncol=64, byrow=TRUE) image(t(face_img)[,64:1], col=grey.colors(256)) # Run NMF to discover face components (small k for speed) model <- nmf(t(olivetti), k = 5, maxit = 10)
The full 10x Genomics PBMC 3k single-cell RNA-seq dataset with Seurat cell type annotations, shipped as StreamPress-compressed raw bytes. Contains 13,714 genes across 2,638 cells with 9 annotated cell types.
pbmc3kpbmc3k
A raw vector containing StreamPress (.spz) compressed bytes.
To obtain the sparse matrix, write the bytes to a temporary file and read
with st_read:
data(pbmc3k) tmp <- tempfile(fileext = ".spz") writeBin(pbmc3k, tmp) counts <- st_read(tmp) # counts is a dgCMatrix: 13,714 genes x 2,638 cells
Cell type annotations are embedded in the .spz file as column (var) metadata:
cell_types <- st_read_var(tmp)$cell_type table(cell_types)
The underlying matrix is a dgCMatrix with 13,714 rows (genes) and
2,638 columns (cells), containing 2,238,732 non-zero entries (integer UMI counts).
Cell type annotations (9 types: Naive CD4 T, Memory CD4 T, CD14+ Mono, B,
CD8 T, FCGR3A+ Mono, NK, DC, Platelet) were obtained from the Seurat
pbmc3k.final reference object via the SeuratData package and stored
as StreamPress column metadata.
This dataset is commonly used for demonstrating single-cell analysis workflows including distribution-aware NMF and zero-inflation diagnostics.
10x Genomics PBMC 3k dataset, processed with Seurat (SeuratData::pbmc3k.final).
# Load the compressed bytes data(pbmc3k) # Decompress to sparse matrix tmp <- tempfile(fileext = ".spz") writeBin(pbmc3k, tmp) counts <- st_read(tmp) dim(counts) # 13714 x 2638 # Access cell type annotations cell_types <- st_read_var(tmp)$cell_type table(cell_types)# Load the compressed bytes data(pbmc3k) # Decompress to sparse matrix tmp <- tempfile(fileext = ".spz") writeBin(pbmc3k, tmp) counts <- st_read(tmp) dim(counts) # 13714 x 2638 # Access cell type annotations cell_types <- st_read_var(tmp)$cell_type table(cell_types)
Convenience wrapper around svd with center = TRUE.
pca(A, k = 10, ...)pca(A, k = 10, ...)
A |
Input matrix. May be dense ( |
k |
Number of factors (rank). Use |
... |
Additional arguments passed to |
An S4 object of class svd (see svd).
library(Matrix) data(aml) result <- pca(aml, k = 5) resultlibrary(Matrix) data(aml) result <- pca(aml, k = 5) result
Plot Consensus Matrix Heatmap
## S3 method for class 'consensus_nmf' plot( x, cluster_rows = TRUE, cluster_cols = TRUE, show_clusters = TRUE, color_palette = c("white", "#fee5d9", "#fcae91", "#fb6a4a", "#de2d26", "#a50f15"), interactive = FALSE, ... )## S3 method for class 'consensus_nmf' plot( x, cluster_rows = TRUE, cluster_cols = TRUE, show_clusters = TRUE, color_palette = c("white", "#fee5d9", "#fcae91", "#fb6a4a", "#de2d26", "#a50f15"), interactive = FALSE, ... )
x |
consensus_nmf object |
cluster_rows |
whether to reorder rows by hierarchical clustering (default TRUE) |
cluster_cols |
whether to reorder columns (default TRUE, same as rows) |
show_clusters |
whether to show cluster assignments as sidebar (default TRUE) |
color_palette |
color palette name or vector of colors |
interactive |
whether to make interactive plotly heatmap (default FALSE) |
... |
additional arguments (unused) |
A ggplot2 object (or plotly object if interactive = TRUE) showing the consensus heatmap.
consensus_nmf, summary.consensus_nmf
library(Matrix) A <- rsparsematrix(50, 30, 0.3) cons <- consensus_nmf(A, k = 3, reps = 5, seed = 42) if (requireNamespace("ggplot2", quietly = TRUE)) { plot(cons) }library(Matrix) A <- rsparsematrix(50, 30, 0.3) cons <- consensus_nmf(A, k = 3, reps = 5, seed = 42) if (requireNamespace("ggplot2", quietly = TRUE)) { plot(cons) }
Reconstructs and plots the binary splitting tree from a
dclust result. Each cluster's binary path ID encodes its
position in the hierarchy (e.g., "01" = root->left->right).
If labels are provided, a stacked composition bar is drawn
below each leaf showing label proportions.
## S3 method for class 'dclust' plot( x, labels = NULL, palette = NULL, main = "Divisive Clustering Hierarchy", ... )## S3 method for class 'dclust' plot( x, labels = NULL, palette = NULL, main = "Divisive Clustering Hierarchy", ... )
x |
a |
labels |
optional character or factor vector of class labels, one per
sample in the original data matrix passed to |
palette |
optional named character vector mapping label levels to colors.
If |
main |
plot title |
... |
additional arguments passed to |
x (invisibly)
TensorBoard-style visualization of NMF training dynamics, convergence analysis, and factor diagnostics. Provides multiple plot types for comprehensive model analysis.
## S3 method for class 'nmf' plot( x, type = c("loss", "convergence", "regularization", "sparsity"), smooth = TRUE, span = 0.3, log_scale = FALSE, interactive = FALSE, theme = "classic", ... )## S3 method for class 'nmf' plot( x, type = c("loss", "convergence", "regularization", "sparsity"), smooth = TRUE, span = 0.3, log_scale = FALSE, interactive = FALSE, theme = "classic", ... )
x |
object of class "nmf" |
type |
plot type: - "loss": Loss components over iterations (default) - "convergence": Log-scale loss convergence - "regularization": Regularization penalty contributions - "sparsity": Factor sparsity patterns |
smooth |
apply smoothing (LOESS) for noisy curves (default TRUE) |
span |
smoothing span for LOESS (default 0.3) |
log_scale |
use log scale for y-axis (default FALSE, auto TRUE for "convergence") |
interactive |
create interactive plotly plot (default FALSE) |
theme |
ggplot2 theme: "classic", "minimal", "dark" (default "classic") |
... |
additional arguments passed to specific plotting functions |
ggplot2 or plotly object
# Basic loss plot model <- nmf(hawaiibirds, k = 10) plot(model) # Convergence analysis plot(model, type = "convergence") # Interactive plot plot(model, type = "loss", interactive = TRUE) # Compare multiple runs models <- replicate(5, nmf(hawaiibirds, k = 10), simplify = FALSE) plot(models[[1]], type = "sparsity")# Basic loss plot model <- nmf(hawaiibirds, k = 10) plot(model) # Convergence analysis plot(model, type = "convergence") # Interactive plot plot(model, type = "loss", interactive = TRUE) # Compare multiple runs models <- replicate(5, nmf(hawaiibirds, k = 10), simplify = FALSE) plot(models[[1]], type = "sparsity")
Visualize NMF cross-validation results showing test (and optionally train) loss across candidate ranks. Useful for selecting the optimal factorization rank.
## S3 method for class 'nmfCrossValidate' plot(x, show_train = NULL, point_size = 3, interactive = FALSE, ...)## S3 method for class 'nmfCrossValidate' plot(x, show_train = NULL, point_size = 3, interactive = FALSE, ...)
x |
object of class |
show_train |
logical, overlay train loss on the plot (default TRUE if
|
point_size |
size of data points (default 3) |
interactive |
create interactive plotly plot (default FALSE) |
... |
additional arguments (unused) |
ggplot2 or plotly object
library(Matrix) A <- abs(rsparsematrix(50, 30, 0.3)) cv_result <- nmf(A, k = 2:6, test_fraction = 0.1, cv_seed = 1, maxit = 20) plot(cv_result)library(Matrix) A <- abs(rsparsematrix(50, 30, 0.3)) cv_result <- nmf(A, k = 2:6, test_fraction = 0.1, cv_seed = 1, maxit = 20) plot(cv_result)
Produces a multi-panel plot of loss, classifier metrics, and factor norms over training iterations.
## S3 method for class 'training_logger' plot(x, ...)## S3 method for class 'training_logger' plot(x, ...)
x |
A |
... |
Ignored. |
Invisibly returns NULL. Called for its side effect (plotting).
For single-layer results, projects new samples using the trained W and d. For multi-layer results, chains through each layer.
## S3 method for class 'factor_net_result' predict(object, newdata, ...)## S3 method for class 'factor_net_result' predict(object, newdata, ...)
object |
A |
newdata |
A matrix (features x samples) to project. |
... |
Additional arguments (currently unused). |
For single-layer: a matrix (k x n_new). For multi-layer: a list of H matrices, one per layer.
Print a factor_net
## S3 method for class 'factor_net' print(x, ...)## S3 method for class 'factor_net' print(x, ...)
x |
A |
... |
Additional arguments (unused). |
Invisibly returns x.
Print a factor_net_cv result
## S3 method for class 'factor_net_cv' print(x, ...)## S3 method for class 'factor_net_cv' print(x, ...)
x |
A |
... |
Additional arguments (unused). |
Invisibly returns x.
cross_validate_graph, factor_net
Print a factor_net_result
## S3 method for class 'factor_net_result' print(x, ...)## S3 method for class 'factor_net_result' print(x, ...)
x |
A |
... |
Additional arguments (unused). |
Invisibly returns x.
factor_net, fit, summary.factor_net_result
Displays a human-readable summary of classification metrics including accuracy, macro/weighted F1, AUC, and per-class precision/recall.
## S3 method for class 'fn_classifier_eval' print(x, ...)## S3 method for class 'fn_classifier_eval' print(x, ...)
x |
An |
... |
Additional arguments (unused). |
Invisibly returns x.
summary.fn_classifier_eval, classify_embedding
Print an fn_factor_config
## S3 method for class 'fn_factor_config' print(x, ...)## S3 method for class 'fn_factor_config' print(x, ...)
x |
An |
... |
Additional arguments (unused). |
Invisibly returns x.
W, H, factor_config
Print an fn_global_config
## S3 method for class 'fn_global_config' print(x, ...)## S3 method for class 'fn_global_config' print(x, ...)
x |
An |
... |
Additional arguments (unused). |
Invisibly returns x.
Print an fn_node
## S3 method for class 'fn_node' print(x, ...)## S3 method for class 'fn_node' print(x, ...)
x |
An |
... |
Additional arguments (unused). |
Invisibly returns x.
factor_input, nmf_layer, factor_net
Print method for nmf_assessment objects
## S3 method for class 'nmf_assessment' print(x, ...)## S3 method for class 'nmf_assessment' print(x, ...)
x |
an |
... |
additional arguments (unused) |
Invisibly returns x
Print a training log
## S3 method for class 'training_logger' print(x, ...)## S3 method for class 'training_logger' print(x, ...)
x |
A |
... |
Additional arguments (unused). |
Invisibly returns x.
Post-hoc correction of an NMF embedding to improve class separation or remove batch effects, optionally followed by W-refit cycles that propagate the corrected H back through the model.
refine( x, data = NULL, labels, batch = NULL, lambda = 0.8, cycles = 0L, nonneg = TRUE, whiten = TRUE )refine( x, data = NULL, labels, batch = NULL, lambda = 0.8, cycles = 0L, nonneg = TRUE, whiten = TRUE )
x |
An NMF model (S4 object of class |
data |
Original data matrix (required when |
labels |
Factor or character/integer vector of class labels (length n). |
batch |
Optional factor of batch labels (length n) for batch removal.
When supplied, batch-correlated structure is suppressed using the
PROJ_ADV method (eigenvalue-projected adversarial removal).
See |
lambda |
Correction strength in |
cycles |
Number of W-refit cycles. 0 = post-hoc only (default). Each cycle: refit W from corrected H, refit H from new W, re-correct H. |
nonneg |
Enforce non-negativity on refitted factors. Default |
whiten |
Apply OAS-ZCA whitening to class centroids. Default |
The correction proceeds in two stages:
Stage 1: Post-hoc centroid correction (always performed)
Computes a target matrix from labels via compute_target,
then shifts each sample's embedding toward its class centroid:
where T is the target matrix (class centroid shifts, optionally whitened).
Stage 1b: Batch removal (when batch is supplied)
Uses PROJ_ADV (Projected Adversarial) method: computes a batch target
from batch labels with negative target_lambda, which subtracts
the batch Gram matrix from the NNLS Gram matrix, eigendecomposes,
and clips negative eigenvalues. This suppresses batch-correlated
directions while preserving all other structure.
Stage 2: W-refit cycles (when cycles > 0)
Iteratively refits the model:
Solve for W given corrected H:
Solve for H given new W:
Re-apply centroid correction to the new H
If x is an nmf object, returns an updated
nmf object. If x is a matrix, returns a corrected
k x n matrix.
data(hawaiibirds) model <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50) meta <- attr(hawaiibirds, "metadata_h") # Post-hoc correction only (cycles = 0) corrected <- refine(model, labels = meta$island, lambda = 0.8) # W-refit cycles: propagate correction back through the model refined <- refine(model, data = hawaiibirds, labels = meta$island, lambda = 0.8, cycles = 3)data(hawaiibirds) model <- nmf(hawaiibirds, k = 4, seed = 42, maxit = 50) meta <- attr(hawaiibirds, "metadata_h") # Post-hoc correction only (cycles = 0) corrected <- refine(model, labels = meta$island, lambda = 0.8) # W-refit cycles: propagate correction back through the model refined <- refine(model, data = hawaiibirds, labels = meta$island, lambda = 0.8, cycles = 3)
Given a baseline MSE-fitted NMF model and the original data,
computes score-test statistics for the power-variance family
() to determine the best-fitting distribution without
refitting. Optionally tests for NB overdispersion.
score_test_distribution( data, model, powers = c(0, 1, 2, 3), test_nb = TRUE, min_mu = 1e-06 )score_test_distribution( data, model, powers = c(0, 1, 2, 3), test_nb = TRUE, min_mu = 1e-06 )
data |
Original data matrix (sparse or dense) |
model |
A fitted NMF object (from |
powers |
Numeric vector of variance powers to test.
Default |
test_nb |
Logical; if |
min_mu |
Floor for predicted values to avoid division by zero. Default 1e-6. |
The score test statistic for variance power is:
where are residuals and
are predicted values.
Under the correct model, . The power minimizing
best matches the observed variance-mean relationship.
The NB diagnostic tests for quadratic overdispersion:
If , there is substantial overdispersion beyond Poisson,
suggesting NB may be preferable to GP.
A list with:
Data frame with columns power, T_stat, abs_T,
distribution (label)
Numeric: the power p with smallest |T_p|
Character: name of the best-matching distribution
If test_nb=TRUE: list with T_NB and
overdispersed (logical)
A <- abs(rsparsematrix(200, 100, 0.3)) model <- nmf(A, k = 5, loss = "mse") diag <- score_test_distribution(A, model) print(diag$scores) cat("Best distribution:", diag$best_distribution, "\n")A <- abs(rsparsematrix(200, 100, 0.3)) model <- nmf(A, k = 5, loss = "mse") diag <- score_test_distribution(A, model) print(diag$scores) cat("Best distribution:", diag$best_distribution, "\n")
Generate a random nonnegative matrix with known factor structure for benchmarking NMF recovery.
Uses block-diagonal construction: each factor owns a disjoint subset of features (rows) and dominates
a disjoint subset of samples (columns), with small cross-talk for realism. This produces clearly
recoverable factors even at moderate noise levels. Inspired by NMF::syntheticNMF.
simulateNMF(nrow, ncol, k, noise = 0.5, dropout = 0, seed = NULL)simulateNMF(nrow, ncol, k, noise = 0.5, dropout = 0, seed = NULL)
nrow |
number of rows (features) |
ncol |
number of columns (samples) |
k |
true rank (number of factors) |
noise |
noise level as a multiplier on the mean signal. A value of 1.0 means the noise standard deviation equals the mean signal value. Default: 0.5. |
dropout |
fraction of entries to set to zero (0 = no dropout). Default: 0. |
seed |
seed for random number generation |
list of dense matrix A and true w and h models
data <- simulateNMF(50, 30, k = 3, noise = 0.1, seed = 42) dim(data$A) # 50 x 3 dim(data$w) # 50 x 3 dim(data$h) # 3 x 30data <- simulateNMF(50, 30, k = 3, noise = 0.1, seed = 42) dim(data$A) # 50 x 3 dim(data$w) # 50 x 3 dim(data$h) # 3 x 30
Generate a synthetic "swimmer" dataset consisting of stick figure images in various swimming poses. This is a classic benchmark dataset for NMF, originally described in Donoho and Stodden (2003). Each image is a 32x32 pixel representation of a stick figure with a torso and four limbs (left arm, right arm, left leg, right leg), where each limb can be in one of 4 positions.
simulateSwimmer( n_images = 256, style = c("stick", "gaussian"), sigma = 1.5, noise = 0, seed = NULL, return_factors = FALSE )simulateSwimmer( n_images = 256, style = c("stick", "gaussian"), sigma = 1.5, noise = 0, seed = NULL, return_factors = FALSE )
n_images |
number of images to generate (default: 256 for all combinations) |
style |
either "stick" for stick figures or "gaussian" for smoothed Gaussian blobs (default: "stick") |
sigma |
standard deviation for Gaussian smoothing when style = "gaussian" (default: 1.5) |
noise |
standard deviation of Gaussian noise to add (default: 0) |
seed |
seed for random number generation |
return_factors |
logical, if TRUE return the true W and H factors (default: FALSE) |
The swimmer dataset consists of stick figure images with:
A fixed torso (circle for head, vertical line for body)
4 limbs, each with 4 possible angles: 0, 45, 90, 135 degrees from body
Total of 16 "limb factors" (4 limbs x 4 positions each)
256 possible combinations when sampling all positions
The true rank of this dataset is 16 (the number of unique limb positions).
When return_factors = TRUE, the function returns the true generative
factors W (256 x 16) and H (16 x 1024), where each column of H represents
one limb position pattern.
Style options:
style = "stick": Sharp lines for limbs (binary image)
style = "gaussian": Smoothed with Gaussian kernel for softer edges
If return_factors = FALSE, returns a sparse matrix (n_images x 1024)
where each row is a flattened 32x32 image. If return_factors = TRUE, returns
a list with components:
The data matrix (n_images x 1024)
True factor matrix W (n_images x 16)
True factor matrix H (16 x 1024)
Matrix (n_images x 4) indicating limb positions
Donoho, D. and Stodden, V. (2003). "When does non-negative matrix factorization give a correct decomposition into parts?" Advances in Neural Information Processing Systems 16.
# Generate all 256 swimmer combinations swimmers <- simulateSwimmer() dim(swimmers) # 256 x 1024 # Generate random subset with Gaussian smoothing swimmers_smooth <- simulateSwimmer(n_images = 100, style = "gaussian", seed = 123) # Get true factors for validation swimmer_data <- simulateSwimmer(return_factors = TRUE) # Run NMF and compare to true factors model <- nmf(swimmer_data$A, k = 16, maxit = 100)# Generate all 256 swimmer combinations swimmers <- simulateSwimmer() dim(swimmers) # 256 x 1024 # Generate random subset with Gaussian smoothing swimmers_smooth <- simulateSwimmer(n_images = 100, style = "gaussian", seed = 123) # Get true factors for validation swimmer_data <- simulateSwimmer(return_factors = TRUE) # Run NMF and compare to true factors model <- nmf(swimmer_data$A, k = 16, maxit = 100)
Compute the sparsity of each NMF factor
sparsity(object, ...) ## S4 method for signature 'nmf' sparsity(object, ...)sparsity(object, ...) ## S4 method for signature 'nmf' sparsity(object, ...)
object |
object of class |
... |
additional parameters |
For nmf models, the sparsity of each factor is computed and summarized
or and matrices. A long data.frame with columns factor, sparsity, and model is returned.
A data.frame with columns factor, sparsity, and model ("w" or "h").
data <- simulateNMF(50, 30, k = 3, seed = 1) model <- nmf(data$A, 3, seed = 1, maxit = 50) sparsity(model)data <- simulateNMF(50, 30, k = 3, seed = 1) model <- nmf(data$A, 3, seed = 1, maxit = 50) sparsity(model)
Add Transpose Section to an Existing StreamPress File
st_add_transpose(path, verbose = TRUE)st_add_transpose(path, verbose = TRUE)
path |
Path to a |
verbose |
Logical; print progress. Default TRUE. |
Invisibly returns the path.
library(Matrix) m <- rsparsematrix(50, 20, 0.3) tmp <- tempfile(fileext = ".spz") st_write(m, tmp, include_transpose = FALSE) st_add_transpose(tmp) info <- st_info(tmp) unlink(tmp)library(Matrix) m <- rsparsematrix(50, 20, 0.3) tmp <- tempfile(fileext = ".spz") st_write(m, tmp, include_transpose = FALSE) st_add_transpose(tmp) info <- st_info(tmp) unlink(tmp)
Returns the column ranges (1-indexed, inclusive) for each chunk without decompressing any data.
st_chunk_ranges(path)st_chunk_ranges(path)
path |
Path to a |
A data.frame with columns start and end.
## Not run: ranges <- st_chunk_ranges("data.spz") print(ranges) ## End(Not run)## Not run: ranges <- st_chunk_ranges("data.spz") print(ranges) ## End(Not run)
Slice Columns Matching Variable Metadata Filter
st_filter_cols(path, ..., threads = 0L)st_filter_cols(path, ..., threads = 0L)
path |
Path to a |
... |
Filter expression on var columns (passed to |
threads |
Integer decode threads. 0 = all. |
A dgCMatrix sparse matrix.
## Not run: mat <- st_filter_cols("data.spz", highly_variable == TRUE) dim(mat) ## End(Not run)## Not run: mat <- st_filter_cols("data.spz", highly_variable == TRUE) dim(mat) ## End(Not run)
Slice Rows Matching Observation Metadata Filter
st_filter_rows(path, ..., threads = 0L)st_filter_rows(path, ..., threads = 0L)
path |
Path to a |
... |
Filter expression on obs columns (passed to |
threads |
Integer decode threads. 0 = all. |
A dgCMatrix sparse matrix.
st_filter_cols, st_obs_indices
## Not run: mat <- st_filter_rows("data.spz", cell_type == "B cell") dim(mat) ## End(Not run)## Not run: mat <- st_filter_rows("data.spz", cell_type == "B cell") dim(mat) ## End(Not run)
Explicitly frees CUDA device memory held by a gpu_sparse_matrix object.
This is optional — the memory will be freed automatically when the object
is garbage-collected.
st_free_gpu(x)st_free_gpu(x)
x |
A |
Invisibly returns NULL.
## Not run: gpu_mat <- st_read_gpu("matrix.spz") st_free_gpu(gpu_mat) ## End(Not run)## Not run: gpu_mat <- st_read_gpu("matrix.spz") st_free_gpu(gpu_mat) ## End(Not run)
Reads only the header – no decompression is performed.
st_info(path)st_info(path)
path |
Path to a |
A list with fields including rows, cols, nnz,
density_pct, file_bytes, raw_bytes, ratio,
version, has_obs, has_var, has_transpose,
transp_chunk_cols.
library(Matrix) A <- rsparsematrix(100, 50, 0.1) f <- tempfile(fileext = ".spz") st_write(A, f) info <- st_info(f) cat(sprintf("Matrix: %d x %d, nnz=%d\n", info$rows, info$cols, info$nnz)) unlink(f)library(Matrix) A <- rsparsematrix(100, 50, 0.1) f <- tempfile(fileext = ".spz") st_write(A, f) info <- st_info(f) cat(sprintf("Matrix: %d x %d, nnz=%d\n", info$rows, info$cols, info$nnz)) unlink(f)
Sequentially reads and decodes each chunk, applies fn, and
collects results.
st_map_chunks(path, fn, transpose = FALSE, threads = 0L)st_map_chunks(path, fn, transpose = FALSE, threads = 0L)
path |
Path to a |
fn |
Function taking |
transpose |
Logical; if |
threads |
Integer; decode threads per chunk. Default 0 (all threads). |
Invisible list of results from fn.
## Not run: # Compute column sums per chunk st_map_chunks("data.spz", function(chunk, s, e) Matrix::colSums(chunk)) ## End(Not run)## Not run: # Compute column sums per chunk st_map_chunks("data.spz", function(chunk, s, e) Matrix::colSums(chunk)) ## End(Not run)
Reads the obs table, applies a filter expression via subset(),
and returns matching row indices.
st_obs_indices(path, ...)st_obs_indices(path, ...)
path |
Path to a |
... |
Filter expressions passed to |
Integer vector of matching row indices (1-based).
## Not run: idx <- st_obs_indices("data.spz", cell_type == "B cell") length(idx) ## End(Not run)## Not run: idx <- st_obs_indices("data.spz", cell_type == "B cell") length(idx) ## End(Not run)
Read a StreamPress file into a dgCMatrix
st_read(path, cols = NULL, reorder = TRUE, threads = NULL)st_read(path, cols = NULL, reorder = TRUE, threads = NULL)
path |
Path to a |
cols |
Optional integer vector of column indices to read (1-indexed). |
reorder |
Logical; if |
threads |
Integer or NULL; threads for parallel decompression.
|
A dgCMatrix sparse matrix with dimnames if available.
library(Matrix) A <- rsparsematrix(100, 50, 0.1) f <- tempfile(fileext = ".spz") st_write(A, f) B <- st_read(f) all.equal(A, B) unlink(f)library(Matrix) A <- rsparsematrix(100, 50, 0.1) f <- tempfile(fileext = ".spz") st_write(A, f) B <- st_read(f) all.equal(A, B) unlink(f)
Read a Dense Matrix from StreamPress v3 Format
st_read_dense(path)st_read_dense(path)
path |
Path to a StreamPress v3 |
A numeric matrix.
A <- matrix(rnorm(1000), 50, 20) f <- tempfile(fileext = ".spz") st_write_dense(A, f) B <- st_read_dense(f) max(abs(A - B)) unlink(f)A <- matrix(rnorm(1000), 50, 20) f <- tempfile(fileext = ".spz") st_write_dense(A, f) B <- st_read_dense(f) max(abs(A - B)) unlink(f)
Reads a .spz v2 file and decodes it directly on the GPU, returning
an opaque GPU-resident CSC matrix. This avoids the CPU-to-GPU transfer that
occurs when using st_read() followed by nmf(data, gpu = TRUE).
The returned object is an external reference — the matrix data lives
entirely in GPU device memory. Pass it directly to nmf() for
zero-copy GPU NMF.
st_read_gpu(path, device = 0L)st_read_gpu(path, device = 0L)
path |
Path to a |
device |
Integer; CUDA device ID (default 0). |
Only .spz v2 format is supported for GPU decode. Use
st_convert() to convert other formats to v2.
The returned object has a finalizer that automatically frees GPU memory
when the R object is garbage-collected. You can also free it manually
with st_free_gpu().
An object of class "gpu_sparse_matrix" with fields:
Number of rows
Number of columns
Number of non-zeros
CUDA device ID
Opaque device pointer (numeric)
Opaque device pointer (numeric)
Opaque device pointer (numeric)
## Not run: # Read directly to GPU gpu_data <- st_read_gpu("data.spz") # Run NMF on GPU-resident data (zero-copy) result <- nmf(gpu_data, k = 10) # Clean up (optional — GC will do this automatically) st_free_gpu(gpu_data) ## End(Not run)## Not run: # Read directly to GPU gpu_data <- st_read_gpu("data.spz") # Run NMF on GPU-resident data (zero-copy) result <- nmf(gpu_data, k = 10) # Clean up (optional — GC will do this automatically) st_free_gpu(gpu_data) ## End(Not run)
Reads the embedded obs table from a v2 .spz file without
decompressing the matrix data. Returns an empty data.frame if no obs
table was stored.
st_read_obs(path)st_read_obs(path)
path |
Path to a |
A data.frame with observation metadata, or an empty
data.frame if no obs table is present.
## Not run: obs <- st_read_obs("data.spz") head(obs) ## End(Not run)## Not run: obs <- st_read_obs("data.spz") head(obs) ## End(Not run)
Reads the embedded var table from a v2 .spz file without
decompressing the matrix data. Returns an empty data.frame if no var
table was stored.
st_read_var(path)st_read_var(path)
path |
Path to a |
A data.frame with variable metadata, or an empty
data.frame if no var table is present.
## Not run: var <- st_read_var("data.spz") head(var) ## End(Not run)## Not run: var <- st_read_var("data.spz") head(var) ## End(Not run)
Convenience function combining row and column slicing.
st_slice(path, rows = NULL, cols = NULL, threads = 0L)st_slice(path, rows = NULL, cols = NULL, threads = 0L)
path |
Path to a |
rows |
Optional integer vector of row indices (1-indexed). |
cols |
Optional integer vector of column indices (1-indexed). |
threads |
Integer; number of threads (0 = all available). Default 0. |
A dgCMatrix sparse matrix.
## Not run: mat <- st_slice("data.spz", rows = 1:100, cols = 1:10) dim(mat) ## End(Not run)## Not run: mat <- st_slice("data.spz", rows = 1:100, cols = 1:10) dim(mat) ## End(Not run)
Read a subset of columns from a .spz file without decompressing
the entire matrix.
st_slice_cols(path, cols, threads = 0L)st_slice_cols(path, cols, threads = 0L)
path |
Path to a |
cols |
Integer vector of column indices (1-indexed). |
threads |
Integer; number of threads (0 = all available). Default 0. |
A dgCMatrix sparse matrix.
## Not run: mat <- st_slice_cols("data.spz", cols = 1:10) dim(mat) ## End(Not run)## Not run: mat <- st_slice_cols("data.spz", cols = 1:10) dim(mat) ## End(Not run)
Read a subset of rows from a .spz file using the pre-stored
transpose section. Requires that the file was written with
include_transpose = TRUE.
st_slice_rows(path, rows, threads = 0L)st_slice_rows(path, rows, threads = 0L)
path |
Path to a |
rows |
Integer vector of row indices (1-indexed). |
threads |
Integer; number of threads (0 = all available). Default 0. |
A dgCMatrix sparse matrix.
## Not run: mat <- st_slice_rows("data.spz", rows = 1:100) dim(mat) ## End(Not run)## Not run: mat <- st_slice_rows("data.spz", rows = 1:100) dim(mat) ## End(Not run)
Write a sparse matrix to a StreamPress file
st_write( x, path, obs = NULL, var = NULL, delta = TRUE, value_pred = FALSE, verbose = FALSE, precision = "auto", row_sort = FALSE, include_transpose = TRUE, chunk_cols = NULL, chunk_bytes = 8e+06, transp_chunk_cols = NULL, transp_chunk_bytes = NULL, threads = 0L )st_write( x, path, obs = NULL, var = NULL, delta = TRUE, value_pred = FALSE, verbose = FALSE, precision = "auto", row_sort = FALSE, include_transpose = TRUE, chunk_cols = NULL, chunk_bytes = 8e+06, transp_chunk_cols = NULL, transp_chunk_bytes = NULL, threads = 0L )
x |
A sparse matrix ( |
path |
Output file path. Extension |
obs |
Optional data.frame of observation (row/cell) metadata.
Must have |
var |
Optional data.frame of variable (column/gene) metadata.
Must have |
delta |
Logical; use density-based delta prediction for structure.
Default |
value_pred |
Logical; use value prediction for integer-valued data.
Default |
verbose |
Logical; print compression statistics. Default |
precision |
Value precision: |
row_sort |
Logical; sort rows by nnz for better compression. |
include_transpose |
Logical; store CSC(A^T) in the file. Default |
chunk_cols |
Integer or NULL; columns per chunk. If NULL, computed from
|
chunk_bytes |
Numeric; target bytes per chunk when |
transp_chunk_cols |
Integer or NULL; columns per transpose chunk. |
transp_chunk_bytes |
Numeric or NULL; target bytes per transpose chunk. |
threads |
Integer; number of threads for parallel compression (0 = all available). Default 0. |
Invisibly returns a list with compression statistics.
library(Matrix) A <- rsparsematrix(1000, 500, 0.05) f <- tempfile(fileext = ".spz") st_write(A, f) B <- st_read(f) all.equal(A, B) # TRUE unlink(f)library(Matrix) A <- rsparsematrix(1000, 500, 0.05) f <- tempfile(fileext = ".spz") st_write(A, f) B <- st_read(f) all.equal(A, B) # TRUE unlink(f)
Write a Dense Matrix to StreamPress v3 Format
st_write_dense( x, path, include_transpose = FALSE, chunk_cols = 2048L, codec = "raw", delta = FALSE, verbose = FALSE )st_write_dense( x, path, include_transpose = FALSE, chunk_cols = 2048L, codec = "raw", delta = FALSE, verbose = FALSE )
x |
A numeric matrix. |
path |
Output file path. Extension |
include_transpose |
Logical; store transposed panels. Default |
chunk_cols |
Integer; columns per chunk. Default 2048. |
codec |
Compression codec: |
delta |
Logical; apply XOR-delta encoding. Default |
verbose |
Logical; print write statistics. Default |
Invisibly returns a list with write statistics.
A <- matrix(rnorm(1000), 50, 20) f <- tempfile(fileext = ".spz") st_write_dense(A, f, include_transpose = TRUE) B <- st_read_dense(f) max(abs(A - B)) unlink(f)A <- matrix(rnorm(1000), 50, 20) f <- tempfile(fileext = ".spz") st_write_dense(A, f, include_transpose = TRUE) B <- st_read_dense(f) max(abs(A - B)) unlink(f)
Column-concatenates a list of matrices and writes them as a single
.spz file. All matrices must have the same number of rows.
st_write_list( x, path, obs = NULL, var = NULL, chunk_bytes = 6.4e+07, chunk_cols = NULL, include_transpose = TRUE, precision = "auto", threads = 0L, verbose = FALSE )st_write_list( x, path, obs = NULL, var = NULL, chunk_bytes = 6.4e+07, chunk_cols = NULL, include_transpose = TRUE, precision = "auto", threads = 0L, verbose = FALSE )
x |
A list of |
path |
Output |
obs |
Optional data.frame of cell metadata ( |
var |
Optional data.frame of gene metadata ( |
chunk_bytes |
Target bytes per chunk. Default 64 MB. |
chunk_cols |
Explicit column count per chunk. Overrides |
include_transpose |
Logical. Default |
precision |
Value precision. Default |
threads |
Integer. 0 = all threads. |
verbose |
Logical. |
Invisibly, compression statistics.
## Not run: mats <- list(mat1, mat2) st_write_list(mats, "combined.spz") ## End(Not run)## Not run: mats <- list(mat1, mat2) st_write_list(mats, "combined.spz") ## End(Not run)
Functions for reading and writing matrices in StreamPress (.spz) format. StreamPress achieves 5-10x compression over raw float32 CSC binary on typical scRNA-seq sparse matrices using rANS entropy coding. Beyond storage savings, SPZ is also faster to read than raw CSC binary at any thread count: the bottleneck in reading large sparse matrices is sparse object construction (sorting indices, allocating R/Eigen structures), which SPZ parallelises across independent chunks while raw CSC must perform sequentially.
StreamPress (.spz) supports two format versions:
v2 (sparse): Column-oriented compressed CSC format. Lossless, 5-10x compression over raw float32 CSC. Self-describing header.
v3 (dense): Column-major dense panels with optional FP16/QUANT8/rANS compression. For streaming NMF on dense data.
st_write, st_read, st_info,
st_write_dense, st_read_dense
Given an NMF model in the form , project projects w onto A to solve for h.
## S4 method for signature 'nmf' subset(x, i, ...) ## S4 method for signature 'nmf,ANY,ANY,ANY' x[i] ## S4 method for signature 'nmf' head(x, n = getOption("digits"), ...) ## S4 method for signature 'nmf' show(object) ## S4 method for signature 'nmf' dimnames(x) ## S4 method for signature 'nmf' dim(x) ## S4 method for signature 'nmf' t(x) ## S4 method for signature 'nmf' sort(x, decreasing = TRUE, ...) ## S4 method for signature 'nmf' prod(x, ..., na.rm = FALSE) ## S4 method for signature 'nmf' x$name ## S4 method for signature 'nmf,list' coerce(from, to) ## S4 method for signature 'nmf' x[[i]] ## S4 method for signature 'nmf' predict( object, data, L1 = NULL, L2 = NULL, mask = NULL, upper_bound = NULL, threads = 0, verbose = FALSE, ... )## S4 method for signature 'nmf' subset(x, i, ...) ## S4 method for signature 'nmf,ANY,ANY,ANY' x[i] ## S4 method for signature 'nmf' head(x, n = getOption("digits"), ...) ## S4 method for signature 'nmf' show(object) ## S4 method for signature 'nmf' dimnames(x) ## S4 method for signature 'nmf' dim(x) ## S4 method for signature 'nmf' t(x) ## S4 method for signature 'nmf' sort(x, decreasing = TRUE, ...) ## S4 method for signature 'nmf' prod(x, ..., na.rm = FALSE) ## S4 method for signature 'nmf' x$name ## S4 method for signature 'nmf,list' coerce(from, to) ## S4 method for signature 'nmf' x[[i]] ## S4 method for signature 'nmf' predict( object, data, L1 = NULL, L2 = NULL, mask = NULL, upper_bound = NULL, threads = 0, verbose = FALSE, ... )
x |
object of class |
i |
indices |
... |
arguments passed to or from other methods |
n |
number of rows/columns to show |
object |
fitted model, class |
decreasing |
logical. Should the sort be increasing or decreasing? |
na.rm |
remove na values |
name |
name of nmf class slot |
from |
class which the coerce method should perform coercion from |
to |
class which the coerce method should perform coercion to |
data |
dense or sparse matrix of features in rows and samples in columns. Prefer |
L1 |
a single LASSO penalty in the range (0, 1] |
L2 |
a single Ridge penalty greater than zero |
mask |
missing data mask. Accepts:
|
upper_bound |
maximum value permitted in least squares solutions, essentially a bounded-variable least squares problem between 0 and |
threads |
number of threads for OpenMP parallelization (default 0 = all available) |
verbose |
print progress information (default FALSE) |
For the alternating least squares matrix factorization update problem , the updates
(or projection) of is given by the equation:
which is in the form where and for all columns in .
Any L1 penalty is subtracted from and should generally be scaled to max(b), where for all columns in . An easy way to properly scale an L1 penalty is to normalize all columns in to sum to the same value (e.g. 1). No scaling is applied in this function. Such scaling guarantees that L1 = 1 gives a completely sparse solution.
There are specializations for dense and sparse input matrices, symmetric input matrices, and for rank-1 and rank-2 projections. See documentation for nmf for theoretical details and guidance.
An nmf object subsetted to the specified factors.
An nmf object subsetted to the specified factors.
Invisibly returns the nmf object.
Invisibly returns the nmf object.
A list of length two: row names of w and column names of h.
An integer vector of length 3: number of features, number of samples, and rank.
An nmf object with transposed factorization (w and h swapped and transposed).
An nmf object with factors reordered by decreasing (or increasing) d.
A dense matrix equal to w %*% diag(d) %*% h.
Contents of the named slot (w, d, h, or misc), or the named element from misc.
A named list with elements w, d, h, and misc.
The element from the misc list at the given name or index.
An nmf object containing the projected model (with updated h matrix).
Zach DeBruine
DeBruine, ZJ, Melcher, K, and Triche, TJ. (2021). "High-performance non-negative matrix factorization for large single-cell data." BioRXiv.
library(Matrix) w <- matrix(runif(1000 * 10), 1000, 10) h_true <- matrix(runif(10 * 100), 10, 100) # A is the crossproduct of "w" and "h" with 10% signal dropout mask <- rsparsematrix(1000, 100, density = 0.1) A <- (w %*% h_true) * (mask != 0) h <- nnls(w = w, A = A) cor(as.vector(h_true), as.vector(h)) # alternating projections refine solution (like NMF) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A)library(Matrix) w <- matrix(runif(1000 * 10), 1000, 10) h_true <- matrix(runif(10 * 100), 10, 100) # A is the crossproduct of "w" and "h" with 10% signal dropout mask <- rsparsematrix(1000, 100, density = 0.1) A <- (w %*% h_true) * (mask != 0) h <- nnls(w = w, A = A) cor(as.vector(h_true), as.vector(h)) # alternating projections refine solution (like NMF) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A) h <- nnls(w = w, A = A) w <- nnls(h = h, A = A)
summary method for class "nmf". Describes metadata representation in NMF factors. Returns object of class nmfSummary. Plot results using plot.
## S4 method for signature 'nmf' summary(object, group_by, stat = "sum", ...) ## S3 method for class 'nmfSummary' plot(x, ...)## S4 method for signature 'nmf' summary(object, group_by, stat = "sum", ...) ## S3 method for class 'nmfSummary' plot(x, ...)
object |
an object of class " |
group_by |
a discrete factor giving groupings for samples or features. Must be of the same length as number of samples in |
stat |
either |
... |
Additional arguments (unused). |
x |
|
data.frame with columns group, factor, and stat
A ggplot2 object showing factor representation by group.
library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) groups <- factor(sample(c("A", "B"), 50, replace = TRUE)) s <- summary(model, group_by = groups) library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) groups <- factor(sample(c("A", "B"), 50, replace = TRUE)) s <- summary(model, group_by = groups) if (requireNamespace("ggplot2", quietly = TRUE)) { plot(s) }library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) groups <- factor(sample(c("A", "B"), 50, replace = TRUE)) s <- summary(model, group_by = groups) library(Matrix) A <- rsparsematrix(100, 50, 0.1) model <- nmf(A, k = 3, seed = 42, tol = 1e-2, maxit = 10) groups <- factor(sample(c("A", "B"), 50, replace = TRUE)) s <- summary(model, group_by = groups) if (requireNamespace("ggplot2", quietly = TRUE)) { plot(s) }
Summary for Consensus NMF
## S3 method for class 'consensus_nmf' summary(object, ...)## S3 method for class 'consensus_nmf' summary(object, ...)
object |
consensus_nmf object |
... |
additional arguments (unused) |
Invisibly returns the consensus_nmf object. Summary statistics are printed to the console.
consensus_nmf, plot.consensus_nmf
library(Matrix) A <- rsparsematrix(50, 30, 0.3) cons <- consensus_nmf(A, k = 3, reps = 5, seed = 42) summary(cons)library(Matrix) A <- rsparsematrix(50, 30, 0.3) cons <- consensus_nmf(A, k = 3, reps = 5, seed = 42) summary(cons)
Summarize a factor_net_result
## S3 method for class 'factor_net_result' summary(object, ...)## S3 method for class 'factor_net_result' summary(object, ...)
object |
A |
... |
Additional arguments (unused). |
Invisibly returns object.
factor_net, fit, print.factor_net_result
Returns a tidy data frame of aggregate classification metrics.
## S3 method for class 'fn_classifier_eval' summary(object, ...)## S3 method for class 'fn_classifier_eval' summary(object, ...)
object |
An |
... |
Additional arguments (unused). |
A data frame with columns metric and value.
print.fn_classifier_eval, classify_embedding
Compute a truncated SVD or PCA of a matrix using multiple algorithm backends. Supports sparse and dense matrices, implicit PCA centering, per-element regularization (L1/L2/non-negativity/bounds), Gram-level regularization (L2,1/angular/graph Laplacian), and automatic rank selection via speckled holdout cross-validation.
svd( A, k = 10, tol = 1e-05, maxit = 200, center = FALSE, scale = FALSE, verbose = FALSE, seed = NULL, L1 = 0, L2 = 0, nonneg = FALSE, upper_bound = 0, test_fraction = 0, mask = NULL, robust = FALSE, resource = "auto", method = "auto", ... )svd( A, k = 10, tol = 1e-05, maxit = 200, center = FALSE, scale = FALSE, verbose = FALSE, seed = NULL, L1 = 0, L2 = 0, nonneg = FALSE, upper_bound = 0, test_fraction = 0, mask = NULL, robust = FALSE, resource = "auto", method = "auto", ... )
A |
Input matrix. May be dense ( |
k |
Number of factors (rank). Use |
tol |
Convergence tolerance per rank-1 subproblem. Default: 1e-5. |
maxit |
Maximum ALS iterations per factor. Default: 200. |
center |
If |
scale |
If |
verbose |
Print per-factor diagnostics. Default: |
seed |
Random seed for initialization and cross-validation. Default: NULL. |
L1 |
L1 (lasso) penalty. Single value or length-2 vector. Default: 0. |
L2 |
L2 (ridge) penalty. Single value or length-2 vector. Default: 0. |
nonneg |
Non-negativity constraints. Single logical or length-2 vector. Default: |
upper_bound |
Upper bound constraints. Single value or length-2 vector. Default: 0. |
test_fraction |
Fraction of entries to hold out for cross-validation. Default: 0. |
mask |
Masking control. |
robust |
Robustness control: |
resource |
Compute backend: |
method |
Algorithm: |
... |
Advanced parameters. See Advanced Parameters section. |
Multiple algorithms are available via the method parameter:
"deflation"Sequential rank-1 ALS with deflation correction. Supports all constraints/CV/auto-rank. Best for small k with mixed constraints.
"krylov"Krylov-Seeded Projected Refinement (KSPR). Block method: computes all k factors simultaneously via Lanczos seed + Gram-solve-then-project. Faster than deflation for larger k. Supports all regularization except robust. CV is evaluated post-convergence.
"lanczos"Unconstrained Lanczos bidiagonalization. Fast for unconstrained SVD. No regularization support.
"irlba"Implicitly Restarted Lanczos. Good general-purpose unconstrained SVD. No regularization support.
"randomized"Randomized SVD with power iterations. Fast approximate SVD for large matrices. No regularization support.
An S4 object of class svd with slots:
uLeft singular vectors (scores), m x k matrix
dSingular values, length-k numeric vector
vRight singular vectors (loadings), n x k matrix
miscList with metadata: centered, row_means,
test_loss, iters_per_factor, wall_time_ms
When k = "auto", a speckled holdout set is created and test MSE is
evaluated after each factor. Rank selection stops when test MSE fails to
improve for patience consecutive factors. The returned model is
truncated to the best rank.
...)The following parameters can be passed via ...:
L21L2,1 (group sparsity) penalty. Single value or length-2 vector (default 0).
angularAngular (orthogonality) penalty. Single value or length-2 vector (default 0).
graph_USparse graph Laplacian for features (m x m). Default NULL.
graph_VSparse graph Laplacian for samples (n x n). Default NULL.
graph_lambdaGraph regularization strength. Single value or length-2 (default 0).
convergenceConvergence criterion: "factor" (default) or "global".
cv_seedSeparate seed for holdout mask. Default NULL.
patienceAuto-rank non-improving factor patience (default 3).
k_maxMaximum rank for auto-rank mode (default 50).
threadsNumber of OpenMP threads. 0 = all (default 0).
This function shadows base::svd(). To use the base R version,
call base::svd() explicitly.
library(RcppML) data(iris) A <- as.matrix(iris[, 1:4]) # Standard SVD (3 factors) s <- svd(A, k = 3) # PCA with centering (convenience wrapper) p <- pca(A, k = 3) # Auto-rank PCA ar <- svd(A, k = "auto", center = TRUE, verbose = TRUE) # Robust PCA (outlier-resistant) rp <- svd(A, k = 3, center = TRUE, robust = TRUE, verbose = TRUE)library(RcppML) data(iris) A <- as.matrix(iris[, 1:4]) # Standard SVD (3 factors) s <- svd(A, k = 3) # PCA with centering (convenience wrapper) p <- pca(A, k = 3) # Auto-rank PCA ar <- svd(A, k = "auto", center = TRUE, verbose = TRUE) # Robust PCA (outlier-resistant) rp <- svd(A, k = 3, center = TRUE, robust = TRUE, verbose = TRUE)
Creates an unconstrained (SVD/PCA) factorization layer. No non-negativity constraint by default; factors are signed.
svd_layer( input, k, L1 = 0, L2 = 0, L21 = 0, angular = 0, upper_bound = 0, center = FALSE, scale = FALSE, method = c("auto", "deflation", "krylov", "lanczos", "irlba", "randomized"), mask = NULL, robust = FALSE, W = NULL, H = NULL, name = NULL, ... )svd_layer( input, k, L1 = 0, L2 = 0, L21 = 0, angular = 0, upper_bound = 0, center = FALSE, scale = FALSE, method = c("auto", "deflation", "krylov", "lanczos", "irlba", "randomized"), mask = NULL, robust = FALSE, W = NULL, H = NULL, name = NULL, ... )
input |
An |
k |
Factorization rank. |
L1 |
L1 penalty (shared by U and V unless overridden). Default 0. |
L2 |
L2 penalty. Default 0. |
L21 |
Group sparsity penalty. Default 0. |
angular |
Orthogonality penalty. Default 0. |
upper_bound |
Box constraint. Default 0. |
center |
Center columns before factorization (PCA mode). Default FALSE. |
scale |
Scale columns to unit variance. Default FALSE. |
method |
SVD algorithm: |
mask |
Masking mode: NULL (none), |
robust |
Use robust loss. Default FALSE. |
W |
Optional |
H |
Optional |
name |
Optional layer name (for results access). |
... |
Additional arguments forwarded to |
SVD-specific parameters (center, scale, method)
control the SVD algorithm directly. Regularization parameters
and ... are forwarded to svd at fit time.
An fn_node of type "svd_layer".
nmf_layer, factor_input,
factor_net, svd
data(aml) inp <- factor_input(aml) layer <- svd_layer(inp, k = 5) # PCA with centering and scaling pca_layer <- svd_layer(inp, k = 10, center = TRUE, scale = TRUE)data(aml) inp <- factor_input(aml) layer <- svd_layer(inp, k = 5) # PCA with centering and scaling pca_layer <- svd_layer(inp, k = 10, center = TRUE, scale = TRUE)
S4 class for deflation SVD/PCA results.
Computes scores for new samples by projecting newdata onto the right
singular vectors. Equivalent to PCA "out-of-sample prediction".
## S4 method for signature 'svd,ANY,ANY,ANY' x[i] ## S4 method for signature 'svd' head(x, n = getOption("digits"), ...) ## S4 method for signature 'svd' show(object) ## S4 method for signature 'svd' dim(x) reconstruct(object, ...) ## S4 method for signature 'svd' reconstruct(object, ...) ## S4 method for signature 'svd' predict(object, newdata, ...) variance_explained(object, ...) ## S4 method for signature 'svd' variance_explained(object, ...)## S4 method for signature 'svd,ANY,ANY,ANY' x[i] ## S4 method for signature 'svd' head(x, n = getOption("digits"), ...) ## S4 method for signature 'svd' show(object) ## S4 method for signature 'svd' dim(x) reconstruct(object, ...) ## S4 method for signature 'svd' reconstruct(object, ...) ## S4 method for signature 'svd' predict(object, newdata, ...) variance_explained(object, ...) ## S4 method for signature 'svd' variance_explained(object, ...)
x |
object of class |
i |
indices for subsetting factors |
n |
number of rows/columns to show |
... |
Ignored. |
object |
An |
newdata |
A numeric matrix of new samples (rows = samples,
columns = features). Must have the same number of features as the
original data ( |
A subsetted svd object containing only the selected factors.
Invisibly returns the svd object x.
Invisibly returns the svd object.
Integer vector of length 3: c(m, n, k) where m is the number of
rows, n the number of columns, and k the rank.
Dense matrix: (plus row means if centered)
A numeric matrix of scores with nrow(newdata) rows and
length(object@d) columns (i.e., same as the model).
Each row is the projection of one new sample onto the singular space.
Numeric vector of proportion of variance explained by each factor
uleft singular vectors (scores), m x k matrix
dsingular values, numeric vector of length k
vright singular vectors (loadings), n x k matrix
misclist containing metadata: centered, row_means, test_loss, iters_per_factor, wall_time_ms, auto_rank
svd, pca, reconstruct, variance_explained
A <- matrix(rnorm(200), 20, 10) s <- RcppML::svd(A, k = 3) Ahat <- reconstruct(s) dim(Ahat) A <- matrix(rnorm(200), 20, 10) s <- RcppML::svd(A, k = 3) variance_explained(s)A <- matrix(rnorm(200), 20, 10) s <- RcppML::svd(A, k = 3) Ahat <- reconstruct(s) dim(Ahat) A <- matrix(rnorm(200), 20, 10) s <- RcppML::svd(A, k = 3) variance_explained(s)
Returns a logger object that records per-iteration metrics during
fit(). After training, the log can be printed, plotted, or
exported to CSV.
training_logger( log_loss = TRUE, log_norms = FALSE, log_classifier = NULL, interval = 1L )training_logger( log_loss = TRUE, log_norms = FALSE, log_classifier = NULL, interval = 1L )
log_loss |
Log reconstruction loss per layer (default TRUE). |
log_norms |
Log factor W/H norms per layer (default FALSE). |
log_classifier |
Evaluate classifier accuracy per iteration
using a supplied |
interval |
Log every |
A training_logger object to pass to fit().
## Not run: logger <- training_logger( log_norms = TRUE, log_classifier = list(labels = labels, test_idx = test_idx, k = 5) ) res <- fit(net, logger = logger) print(logger) plot(logger) export_log(logger, "training_log.csv") ## End(Not run)## Not run: logger <- training_logger( log_norms = TRUE, log_classifier = list(labels = labels, test_idx = test_idx, k = 5) ) res <- fit(net, logger = logger) print(logger) plot(logger) export_log(logger, "training_log.csv") ## End(Not run)
Use W() and H() inside nmf_layer() or
svd_layer() to set factor-specific regularization, constraints,
and targets. Parameters not set inherit from the layer defaults.
W( L1 = NULL, L2 = NULL, L21 = NULL, angular = NULL, upper_bound = NULL, nonneg = NULL, graph = NULL, graph_lambda = NULL, target = NULL, target_lambda = NULL ) H( L1 = NULL, L2 = NULL, L21 = NULL, angular = NULL, upper_bound = NULL, nonneg = NULL, graph = NULL, graph_lambda = NULL, target = NULL, target_lambda = NULL )W( L1 = NULL, L2 = NULL, L21 = NULL, angular = NULL, upper_bound = NULL, nonneg = NULL, graph = NULL, graph_lambda = NULL, target = NULL, target_lambda = NULL ) H( L1 = NULL, L2 = NULL, L21 = NULL, angular = NULL, upper_bound = NULL, nonneg = NULL, graph = NULL, graph_lambda = NULL, target = NULL, target_lambda = NULL )
L1 |
L1 (lasso) penalty. Default 0. |
L2 |
L2 (ridge) penalty. Default 0. |
L21 |
Group sparsity penalty. Default 0. |
angular |
Orthogonality penalty. Default 0. |
upper_bound |
Box constraint upper bound (0 = none). Default 0. |
nonneg |
Non-negativity constraint. Default TRUE for NMF, FALSE for SVD. |
graph |
Sparse graph Laplacian matrix (or NULL). |
graph_lambda |
Graph regularization strength. Default 0. |
target |
Target matrix for regularization (k x n for H, k x m for W).
See |
target_lambda |
Target regularization strength. Default 0. |
An fn_factor_config object.
factor_config, nmf_layer, factor_net
# Per-factor config for W with L1 sparsity w_cfg <- W(L1 = 0.1, nonneg = TRUE) h_cfg <- H(L2 = 0.01)# Per-factor config for W with L1 sparsity w_cfg <- W(L1 = 0.1, nonneg = TRUE) h_cfg <- H(L2 = 0.01)