CellFunTopic provides methods for rapid, automated cell-type annotation across datasets without relying on marker genes, reducing the labor-intensive and time-consuming work of collecting markers and annotating cell types by hand.

Predict cell type of query data based on reference data

Here, we take pbmc3k.final as the reference data:

data('pbmc3k.final', package = "pbmc3k.SeuratData")
unique(Seurat::Idents(pbmc3k.final))
## [1] Memory CD4 T B            CD14+ Mono   NK           CD8 T       
## [6] Naive CD4 T  FCGR3A+ Mono DC           Platelet    
## 9 Levels: Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK ... Platelet

To keep the example simple, we re-analyze and re-cluster pbmc3k.final and use the result as the query data, so that each cell receives a new cluster identity.

SeuratObj <- readData(data = pbmc3k.final, type = 'Seurat', species = "Homo sapiens")
SeuratObj <- CalMTpercent(SeuratObj, by = "use_internal_data")
SeuratObj <- QCfun(SeuratObj, plot = FALSE)
SeuratObj <- RunSeurat(SeuratObj, nPCs = 10, resolution = 1, plot = FALSE)
SeuratObj <- RunGSEA(SeuratObj, by = 'GO')
# check the new clustering
table(Seurat::Idents(SeuratObj))
## 
##   0   1   2   3   4   5   6   7   8   9 
## 537 460 325 252 220 159 143 141 140 134

Annotations can then be transferred from the reference to the query data:

df <- predictFun(query_SeuratObj = SeuratObj, reference_SeuratObj = pbmc3k.final,
                 group_by = 'seurat_annotations', cluster_by = 'seurat_clusters',
                 species = "Homo sapiens", by = 'GO', k = NULL, LDAmethod = "VEM")
## Calculating differentially expressed genes for clusters in reference data......
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Performing GSEA on reference data......
## 'select()' returned 1:many mapping between keys and columns
## Performing topic modelling on reference data......
## training a svm classifier......
## Loading required package: lattice
## Loading required package: ggplot2
## predicting cell types of the query data......
df  # prediction result
##    query   prediction
## 1      0  Naive CD4 T
## 2      1  Naive CD4 T
## 3      2            B
## 4      3   CD14+ Mono
## 5      4   CD14+ Mono
## 6      5        CD8 T
## 7      6  Naive CD4 T
## 8      7           NK
## 9      8 FCGR3A+ Mono
## 10     9        CD8 T
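
The prediction table holds one label per query cluster. The following is a minimal sketch of how such a table could be expanded to one label per cell with a named lookup vector. The objects `df_toy` and `cell_clusters` are hypothetical toy stand-ins, not output of CellFunTopic.

```r
# Toy stand-ins (hypothetical values, not taken from the run above)
df_toy <- data.frame(query = c("0", "1", "2"),
                     prediction = c("Naive CD4 T", "B", "NK"),
                     stringsAsFactors = FALSE)
# cluster membership of four imaginary cells
cell_clusters <- c(cellA = "0", cellB = "2", cellC = "1", cellD = "0")

# named lookup: cluster id -> predicted cell type
lookup <- setNames(df_toy$prediction, as.character(df_toy$query))

# index the lookup by each cell's cluster, then restore the cell names
cell_labels <- setNames(unname(lookup[cell_clusters]), names(cell_clusters))
cell_labels
```

With a Seurat object, the same idea would presumably read `SeuratObj$predicted_type <- unname(lookup[as.character(SeuratObj$seurat_clusters)])`, where `predicted_type` is an assumed column name; this line is an untested sketch.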

To measure the prediction accuracy in this case, we assign each query cluster the most frequent ground-truth cell type among its cells and compare it with the predicted label.

mm <- table(SeuratObj$seurat_annotations, SeuratObj$seurat_clusters)
nn <- setNames(rownames(mm)[apply(mm, 2, which.max)], colnames(mm))
nn[df$query] # ground-truth cell type of each query cluster (most frequent label among its cells)
##              0              1              2              3              4 
## "Memory CD4 T"  "Naive CD4 T"            "B"   "CD14+ Mono"   "CD14+ Mono" 
##              5              6              7              8              9 
##        "CD8 T"  "Naive CD4 T"           "NK" "FCGR3A+ Mono"        "CD8 T"
# check the accuracy of prediction
caret::confusionMatrix(data = as.factor(df$prediction), reference = as.factor(nn[df$query]))$overall[["Accuracy"]]
## [1] 0.9
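
The accuracy above counts every cluster equally, regardless of its size. As a rough illustration of the majority-vote step and of how a cluster-level score can differ from a cell-level one, here is a small base-R sketch on toy data. The vectors `truth`, `cluster`, and `predicted` are hypothetical and unrelated to the pbmc3k results.

```r
# Toy per-cell data (hypothetical)
truth   <- c("T", "T", "B", "B", "B", "NK", "T", "NK")   # ground-truth label per cell
cluster <- c("0", "0", "0", "1", "1", "2",  "2", "2")    # cluster id per cell

# rows = cell types, columns = clusters
mm <- table(truth, cluster)

# most frequent ground-truth label in each cluster (ties go to the first row)
majority <- setNames(rownames(mm)[apply(mm, 2, which.max)], colnames(mm))

# hypothetical cluster-level predictions, one per cluster
predicted <- c("0" = "T", "1" = "B", "2" = "T")

# cluster-level accuracy: share of clusters whose prediction equals the majority label
cluster_acc <- mean(predicted[names(majority)] == majority)

# cell-level accuracy: share of cells whose cluster's prediction equals their own label
cell_acc <- mean(predicted[cluster] == truth)

c(cluster = cluster_acc, cell = cell_acc)
```

In this toy case two of the three clusters are labeled correctly, while five of the eight cells are; which of the two summaries is more informative depends on how unevenly sized the clusters are.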

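To improve prediction accuracy, users can oversample the reference data. We skip this step here because it can be time-consuming; the code below is a template that is not run, in which reference_SeuratObj, query_SeuratObj, prediction, and true_label stand for the user's own objects.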

reference_SeuratObj <- oversample_ref(reference_SeuratObj, number_clusters = 5,
                                      group_by = 'cellType', cluster_by = 'seurat_clusters',
                                      species = "Homo sapiens", by = 'GO', k = NULL, method = "VEM")
predictFun(query_SeuratObj, reference_SeuratObj,
           group_by = 'seurat_annotations', cluster_by = 'seurat_clusters',
           species = "Homo sapiens", by = 'GO', k = NULL, LDAmethod = "VEM")
# if the true labels of the query data are available, use caret::confusionMatrix to inspect the confusion matrix and accuracy
caret::confusionMatrix(data = prediction, reference = true_label)
caret::confusionMatrix(data = prediction, reference = true_label)$overall[["Accuracy"]]
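
`caret::confusionMatrix` expects `data` and `reference` to be factors, and the comparison is easiest to read when both share one set of levels. A possible way to prepare the inputs, sketched in base R with hypothetical character vectors `prediction_chr` and `true_label_chr`:

```r
# Hypothetical labels; "Mono" is never predicted
prediction_chr <- c("B", "NK", "T", "T")
true_label_chr <- c("B", "NK", "T", "Mono")

# one shared level set covering both vectors
lv <- sort(union(prediction_chr, true_label_chr))
prediction <- factor(prediction_chr, levels = lv)
true_label <- factor(true_label_chr, levels = lv)

# square confusion table: rows = predicted, columns = true
cm <- table(prediction = prediction, reference = true_label)

# accuracy = correctly labeled items (diagonal) over all items
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The harmonized `prediction` and `true_label` factors can then be passed to `caret::confusionMatrix` as in the template above.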