Learning and Annotating NMF Models

Annotating NMF factors

NMF learns an interpretable low-rank representation of data. However, how do we make sense of the factors in this low-rank latent model? A great way to begin annotating a latent space is to simply map it back to known sample and feature traits.

This vignette demonstrates these concepts using an NMF model of bird species communities throughout the Hawaiian islands.

Install RcppML

Install the RcppML R package from CRAN or the development version from GitHub.

install.packages('RcppML')                     # install CRAN version
# devtools::install_github("zdebruine/RcppML") # compile dev version
library(RcppML)
library(ggplot2)
library(cowplot)
library(viridis)
library(ggrepel)
library(uwot)

The hawaiibirds dataset

The hawaiibirds dataset gives the frequency of bird species in small geographical grids throughout the state of Hawaii.

data(hawaiibirds)
hawaiibirds$counts[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##                                grid1      grid2       grid3      grid4
## Common Myna               0.32432432 0.19230769 0.242753623 0.80208333
## Black-crowned Night-Heron 0.06756757 0.07692308 0.007246377 0.03819444
## Black Noddy               .          0.26923077 0.188405797 .         
## Brown Noddy               .          0.38461538 .           .

A separate metadata_h matrix gives the geographical coordinates and the corresponding island for each grid.

head(hawaiibirds$metadata_h)
##    grid island   lat     lng
## 1 grid1   Maui 20.87 -156.44
## 2 grid2   Oahu 21.33 -157.66
## 3 grid3 Hawaii 19.33 -155.19
## 4 grid4   Oahu 21.37 -157.94
## 5 grid5 Hawaii 19.72 -155.11
## 6 grid6   Maui 20.74 -156.24

And a separate metadata_w matrix gives taxonomic information about each species in the database.

head(hawaiibirds$metadata_w)
##                     species     status           type
## 1               Common Myna introduced perching birds
## 2 Black-crowned Night-Heron     native         waders
## 3               Black Noddy     native     shorebirds
## 4               Brown Noddy     native     shorebirds
## 5           Bulwer's Petrel     native       seabirds
## 6                Sooty Tern     native     shorebirds

Cross-validation for Rank Determination

We can learn an NMF model to describe linear combinations of species across geographical grids. First we need to choose a rank.

The rank of a factorization is a crucial hyperparameter. One way to help decide on a rank is cross-validation. This is made easy using the crossValidate function. See ?crossValidate for details on methods.

For many applications, there is no “optimal” rank. In this case, we do expect some amount of distinct biodiversity across the various islands, but within the islands there will be a continuum of habitat niches confounding rank of the signal. Additionally, there may be a number of “missing” observations where surveys were incomplete, which will confound signal separation.

Here we cross-validate across 3 independent replicates and plot the result (this code is not evaluated in this vignette since it takes about a minute to execute):

plot(crossValidate(hawaiibirds$counts, k = c(1:20), reps = 3)) + scale_y_continuous(trans = "log10")

We’ll choose a rank of k = 15 since this seems to return the best prediction accuracy.

Run robust NMF

Let’s generate a high-quality NMF model across 3 random restarts at very low tolerance:

model <- nmf(hawaiibirds$counts, k = 15, seed = 1:3, tol = 1e-6)
model
## 183 x 1183 x 15 factor model of class "nmf"
## @ w
##                                  nmf1        nmf2        nmf3        nmf4
## Common Myna               0.148328389 0.097004277 0.070322627 0.126175888
## Black-crowned Night-Heron 0.007912592 0.009766297           . 0.006037977
## Black Noddy                         .           .           . 0.009565077
## Brown Noddy                         .           .           .           .
## Bulwer's Petrel                     .           .           .           .
## Sooty Tern                          .           . 0.001327280           .
## Wedge-tailed Shearwater             .           .           .           .
##                                  nmf5        nmf6 nmf7
## Common Myna               0.024307396 0.001601450    .
## Black-crowned Night-Heron 0.087244449           .    .
## Black Noddy                         .           .    .
## Brown Noddy                         .           .    .
## Bulwer's Petrel                     .           .    .
## Sooty Tern                          .           .    .
## Wedge-tailed Shearwater             .           .    .
## ...suppressing 176 rows and 8 columns
## 
## @ d
## 1180.546  991.8922  869.0251  833.8436  686.2124  684.9236  559.2632  
## ...suppressing 8 values
## 
## @ h
##             grid1        grid2        grid3        grid4        grid5
## nmf1 8.912460e-04 6.376840e-04 5.574931e-04 3.221175e-03 1.039969e-03
## nmf2 7.929197e-04            . 1.946878e-04            . 3.430622e-05
## nmf3 2.719171e-04 4.056909e-04            . 4.664333e-03            .
## nmf4            .            . 1.187724e-03 1.497042e-04 1.682771e-03
## nmf5 5.350900e-03 4.269131e-05            . 3.384526e-04            .
## nmf6 4.803149e-04 4.019916e-04 4.962776e-04 2.990747e-05 1.287416e-03
## nmf7            .            . 1.156676e-03            .            .
##             grid6        grid7
## nmf1            . 5.320816e-04
## nmf2 2.434128e-05 3.503179e-03
## nmf3            . 3.180302e-04
## nmf4            . 7.406379e-05
## nmf5            .            .
## nmf6            . 9.921460e-05
## nmf7            .            .
## ...suppressing 8 rows and 1176 columns
## 
## @ misc
## List of 5
##  $ tol    : num 9.7e-07
##  $ iter   : num 51
##  $ runtime: 'difftime' num 0.444320917129517
##   ..- attr(*, "units")= chr "secs"
##  $ mse    : num 0.00274
##  $ w_init : num [1:15, 1:183] 1.038 1.707 2.259 0.848 2.196 ...

In the w matrix we have factors describing communities of co-occuring bird species.

In the h matrix we have the association of these bird communities in each surveyed geographical grid.

Geographic focus on NMF factors

What does each NMF factor tell us?

The sample embeddings matrix (h) gives information about the geographical representation of each NMF factor across all grids. We’ll look at just the first four factors:

plots <- list()
for(i in 1:4){
  df <- data.frame(
    "lat" = hawaiibirds$metadata_h$lat,
    "lng" = hawaiibirds$metadata_h$lng,
    "nmf_factor" = model$h[i, ])
  plots[[i]] <- ggplot(df, aes(x = lng, y = lat, color = nmf_factor)) +
    geom_point() +
    scale_color_viridis(option = "B") +
    theme_void() +
    theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) + 
    ggtitle(paste0("Factor ", i))
}
plot_grid(plotlist = plots, nrow = 2)

Metadata enrichment in factors

Factors can capture both island-restricted bird species information and also information shared across islands.

Quantitatively, the summary method for the nmf S4 class makes it easy to annotate factors using metadata about samples or features.

In this case, we will use summary to map factor enrichment in grids corresponding to each Hawaiian island, and species enrichment corresponding to each type of species.

plot(summary(model, group_by = hawaiibirds$metadata_h$island, stat = "mean"))

In general, grids separate based on island – consistent with the expectation that islands contain distinct species communities.

plot(summary(model, group_by = hawaiibirds$metadata_w$type, stat = "mean"))

Notice what type of bird species tend to co-occur – waders with perching birds and waterfowl, waterfowl with waders, seabirds with shorebirds, and many different perching bird contexts.

NMF biplots

Compare species composition in any two factors, for instance factor 2 and 3:

biplot(model, factors = c(2, 3), matrix = "w", group_by = hawaiibirds$metadata_w$type) + 
  scale_y_continuous(trans = "sqrt") + 
  scale_x_continuous(trans = "sqrt") +
  geom_text_repel(size = 2.5, seed = 123, max.overlaps = 15) +
  theme(aspect.ratio = 1)

UMAP on NMF embeddings

We might also be interested in visualizing how factors in w capture similarities among bird species using UMAP.

set.seed(123)
umap <- data.frame(uwot::umap(model$w))
umap$taxon <- hawaiibirds$metadata_w$type
umap$status <- hawaiibirds$metadata_w$status
plot_grid(
  ggplot(umap, aes(x = umap[,1], y = umap[,2], color = taxon)) +
    geom_point() + theme_void(),
  ggplot(umap, aes(x = umap[,1], y = umap[,2], color = status)) +
    geom_point() + theme_void(),
  nrow = 1
)

Species are classified based on habitat niche and taxonomic membership. There are also two groups of “waterfowl”, consistent with ocean shoreline and inland wetland niches.

Hawaii is bird species extinction kingdom: more than 20 species of endemic honeycreeper have gone extinct in the past two centuries due to the establishment of introduced species and habitat devastation. Few remain. In the UMAP plot above on the right, we can observe that introduced species dominate habitat niches occupied by native perching and non-perching birds, a problem underlying historic and ongoing mass extinction events.

set.seed(123)
umap <- data.frame(uwot::umap(t(model$h), metric = "cosine"))
umap$group <- hawaiibirds$metadata_h$island
ggplot(umap, aes(x = umap[,1], y = umap[,2], color = group)) +
  geom_point() + theme_void() + theme(aspect.ratio = 1)

Islands are also well-defined by the NMF model.

Defining the “Palila” species niche

The Palila is a highly endangered species that survives in small numbers on the southwestern slopes of Mauna Kea in a shrubby dry “rainforest” biome, characterized by the last stands of endemic Mamame trees.

What species coexist with the Palila?

Let’s have a look at the species composition in the factor with the highest Palila representation, specifically identifying which species are introduced and which are native:

ggplot(data.frame("value" = model$w["Palila", ], "factor" = 1:ncol(model$w)), aes(factor, value)) + 
  geom_point() + 
  theme_classic() + 
  theme(aspect.ratio = 1) + 
  labs(x = "NMF factor", y = "Palila weight in NMF factor")

df <- data.frame("value" = model$w[, which.max(model$w["Palila", ])])
df$status <- hawaiibirds$metadata_w$status
df <- df[order(-df$value), ]
df <- df[df$value > 0.001, ]
df
##                             value     status
## Hawaii Amakihi        0.429048285     native
## Warbling White-eye    0.197932914     native
## House Finch           0.165412451 introduced
## California Quail      0.036549864     native
## Yellow-fronted Canary 0.035886958 introduced
## Palila                0.033441821     native
## Erckel's Francolin    0.028694271 introduced
## Hawaii Elepaio        0.025463511     native
## Eurasian Skylark      0.025317139 introduced
## Red-billed Leiothrix  0.006118122 introduced
## Chukar                0.006071825 introduced
## Chinese Hwamei        0.005362257 introduced
## Hawaiian Hawk         0.001774794     native
## Red-masked Parakeet   0.001499308 introduced

The diet of the Palilla is largely seeds from the “mamame” tree, but also naio berries and mamame flowers, buds, and young leaves. What introduced perching birds may be competing with the Palila for these resources?

perching_birds <- hawaiibirds$metadata_w$species[hawaiibirds$metadata_w$type == "perching birds"]
df[which(rownames(df) %in% perching_birds & df$status == "introduced"), ]
##                             value     status
## House Finch           0.165412451 introduced
## Yellow-fronted Canary 0.035886958 introduced
## Eurasian Skylark      0.025317139 introduced
## Red-billed Leiothrix  0.006118122 introduced
## Chinese Hwamei        0.005362257 introduced

The “House Finch” and “Yellow-fronted Canary” seem to be the most significant competitors in the Palila habitat niche.