NMF learns an interpretable low-rank representation of data. However, how do we make sense of the factors in this low-rank latent model? A great way to begin annotating a latent space is to simply map it back to known sample and feature traits.
This vignette demonstrates these concepts using an NMF model of bird species communities throughout the Hawaiian islands.
Install the RcppML R package from CRAN or the development version from GitHub.
The hawaiibirds
dataset gives the frequency of bird
species in small geographical grids throughout the state of Hawaii.
## 4 x 4 sparse Matrix of class "dgCMatrix"
## grid1 grid2 grid3 grid4
## Common Myna 0.32432432 0.19230769 0.242753623 0.80208333
## Black-crowned Night-Heron 0.06756757 0.07692308 0.007246377 0.03819444
## Black Noddy . 0.26923077 0.188405797 .
## Brown Noddy . 0.38461538 . .
A separate metadata_h
matrix gives the geographical
coordinates and the corresponding island for each grid.
## grid island lat lng
## 1 grid1 Maui 20.87 -156.44
## 2 grid2 Oahu 21.33 -157.66
## 3 grid3 Hawaii 19.33 -155.19
## 4 grid4 Oahu 21.37 -157.94
## 5 grid5 Hawaii 19.72 -155.11
## 6 grid6 Maui 20.74 -156.24
And a separate metadata_w
matrix gives taxonomic
information about each species in the database.
## species status type
## 1 Common Myna introduced perching birds
## 2 Black-crowned Night-Heron native waders
## 3 Black Noddy native shorebirds
## 4 Brown Noddy native shorebirds
## 5 Bulwer's Petrel native seabirds
## 6 Sooty Tern native shorebirds
We can learn an NMF model to describe linear combinations of species across geographical grids. First we need to choose a rank.
The rank of a factorization is a crucial hyperparameter. One way to
help decide on a rank is cross-validation. This is made easy using the
crossValidate
function. See ?crossValidate
for
details on methods.
For many applications, there is no “optimal” rank. In this case, we do expect some amount of distinct biodiversity across the various islands, but within the islands there will be a continuum of habitat niches confounding rank of the signal. Additionally, there may be a number of “missing” observations where surveys were incomplete, which will confound signal separation.
Here we cross-validate across 3 independent replicates and plot the result (this code is not evaluated in this vignette since it takes about a minute to execute):
plot(crossValidate(hawaiibirds$counts, k = c(1:20), reps = 3)) + scale_y_continuous(trans = "log10")
We’ll choose a rank of k = 15
since this seems to return
the best prediction accuracy.
Let’s generate a high-quality NMF model across 3 random restarts at very low tolerance:
## 183 x 1183 x 15 factor model of class "nmf"
## @ w
## nmf1 nmf2 nmf3 nmf4
## Common Myna 0.148328389 0.097004277 0.070322627 0.126175888
## Black-crowned Night-Heron 0.007912592 0.009766297 . 0.006037977
## Black Noddy . . . 0.009565077
## Brown Noddy . . . .
## Bulwer's Petrel . . . .
## Sooty Tern . . 0.001327280 .
## Wedge-tailed Shearwater . . . .
## nmf5 nmf6 nmf7
## Common Myna 0.024307396 0.001601450 .
## Black-crowned Night-Heron 0.087244449 . .
## Black Noddy . . .
## Brown Noddy . . .
## Bulwer's Petrel . . .
## Sooty Tern . . .
## Wedge-tailed Shearwater . . .
## ...suppressing 176 rows and 8 columns
##
## @ d
## 1180.546 991.8922 869.0251 833.8436 686.2124 684.9236 559.2632
## ...suppressing 8 values
##
## @ h
## grid1 grid2 grid3 grid4 grid5
## nmf1 8.912460e-04 6.376840e-04 5.574931e-04 3.221175e-03 1.039969e-03
## nmf2 7.929197e-04 . 1.946878e-04 . 3.430622e-05
## nmf3 2.719171e-04 4.056909e-04 . 4.664333e-03 .
## nmf4 . . 1.187724e-03 1.497042e-04 1.682771e-03
## nmf5 5.350900e-03 4.269131e-05 . 3.384526e-04 .
## nmf6 4.803149e-04 4.019916e-04 4.962776e-04 2.990747e-05 1.287416e-03
## nmf7 . . 1.156676e-03 . .
## grid6 grid7
## nmf1 . 5.320816e-04
## nmf2 2.434128e-05 3.503179e-03
## nmf3 . 3.180302e-04
## nmf4 . 7.406379e-05
## nmf5 . .
## nmf6 . 9.921460e-05
## nmf7 . .
## ...suppressing 8 rows and 1176 columns
##
## @ misc
## List of 5
## $ tol : num 9.7e-07
## $ iter : num 51
## $ runtime: 'difftime' num 0.444320917129517
## ..- attr(*, "units")= chr "secs"
## $ mse : num 0.00274
## $ w_init : num [1:15, 1:183] 1.038 1.707 2.259 0.848 2.196 ...
In the w
matrix we have factors describing communities
of co-occuring bird species.
In the h
matrix we have the association of these bird
communities in each surveyed geographical grid.
What does each NMF factor tell us?
The sample embeddings matrix (h
) gives information about
the geographical representation of each NMF factor across all grids.
We’ll look at just the first four factors:
plots <- list()
for(i in 1:4){
df <- data.frame(
"lat" = hawaiibirds$metadata_h$lat,
"lng" = hawaiibirds$metadata_h$lng,
"nmf_factor" = model$h[i, ])
plots[[i]] <- ggplot(df, aes(x = lng, y = lat, color = nmf_factor)) +
geom_point() +
scale_color_viridis(option = "B") +
theme_void() +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
ggtitle(paste0("Factor ", i))
}
plot_grid(plotlist = plots, nrow = 2)
Factors can capture both island-restricted bird species information and also information shared across islands.
Quantitatively, the summary
method for the
nmf
S4 class makes it easy to annotate factors using
metadata about samples or features.
In this case, we will use summary
to map factor
enrichment in grids corresponding to each Hawaiian island, and species
enrichment corresponding to each type of species.
In general, grids separate based on island – consistent with the expectation that islands contain distinct species communities.
Notice what type of bird species tend to co-occur – waders with perching birds and waterfowl, waterfowl with waders, seabirds with shorebirds, and many different perching bird contexts.
Compare species composition in any two factors, for instance factor 2 and 3:
We might also be interested in visualizing how factors in
w
capture similarities among bird species using UMAP.
set.seed(123)
umap <- data.frame(uwot::umap(model$w))
umap$taxon <- hawaiibirds$metadata_w$type
umap$status <- hawaiibirds$metadata_w$status
plot_grid(
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = taxon)) +
geom_point() + theme_void(),
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = status)) +
geom_point() + theme_void(),
nrow = 1
)
Species are classified based on habitat niche and taxonomic membership. There are also two groups of “waterfowl”, consistent with ocean shoreline and inland wetland niches.
Hawaii is bird species extinction kingdom: more than 20 species of endemic honeycreeper have gone extinct in the past two centuries due to the establishment of introduced species and habitat devastation. Few remain. In the UMAP plot above on the right, we can observe that introduced species dominate habitat niches occupied by native perching and non-perching birds, a problem underlying historic and ongoing mass extinction events.
set.seed(123)
umap <- data.frame(uwot::umap(t(model$h), metric = "cosine"))
umap$group <- hawaiibirds$metadata_h$island
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = group)) +
geom_point() + theme_void() + theme(aspect.ratio = 1)
Islands are also well-defined by the NMF model.
The Palila is a highly endangered species that survives in small numbers on the southwestern slopes of Mauna Kea in a shrubby dry “rainforest” biome, characterized by the last stands of endemic Mamame trees.
What species coexist with the Palila?
Let’s have a look at the species composition in the factor with the highest Palila representation, specifically identifying which species are introduced and which are native:
ggplot(data.frame("value" = model$w["Palila", ], "factor" = 1:ncol(model$w)), aes(factor, value)) +
geom_point() +
theme_classic() +
theme(aspect.ratio = 1) +
labs(x = "NMF factor", y = "Palila weight in NMF factor")
df <- data.frame("value" = model$w[, which.max(model$w["Palila", ])])
df$status <- hawaiibirds$metadata_w$status
df <- df[order(-df$value), ]
df <- df[df$value > 0.001, ]
df
## value status
## Hawaii Amakihi 0.429048285 native
## Warbling White-eye 0.197932914 native
## House Finch 0.165412451 introduced
## California Quail 0.036549864 native
## Yellow-fronted Canary 0.035886958 introduced
## Palila 0.033441821 native
## Erckel's Francolin 0.028694271 introduced
## Hawaii Elepaio 0.025463511 native
## Eurasian Skylark 0.025317139 introduced
## Red-billed Leiothrix 0.006118122 introduced
## Chukar 0.006071825 introduced
## Chinese Hwamei 0.005362257 introduced
## Hawaiian Hawk 0.001774794 native
## Red-masked Parakeet 0.001499308 introduced
The diet of the Palilla is largely seeds from the “mamame” tree, but also naio berries and mamame flowers, buds, and young leaves. What introduced perching birds may be competing with the Palila for these resources?
perching_birds <- hawaiibirds$metadata_w$species[hawaiibirds$metadata_w$type == "perching birds"]
df[which(rownames(df) %in% perching_birds & df$status == "introduced"), ]
## value status
## House Finch 0.165412451 introduced
## Yellow-fronted Canary 0.035886958 introduced
## Eurasian Skylark 0.025317139 introduced
## Red-billed Leiothrix 0.006118122 introduced
## Chinese Hwamei 0.005362257 introduced
The “House Finch” and “Yellow-fronted Canary” seem to be the most significant competitors in the Palila habitat niche.