vignettes/transfer_annotation.Rmd
CellFunTopic provides methods for rapid, automated cell-type annotation across datasets without relying on marker genes, reducing the labor-intensive and time-consuming work of collecting markers and annotating cell types by hand.
Here, we take pbmc3k.final as the reference data:
## [1] Memory CD4 T B CD14+ Mono NK CD8 T
## [6] Naive CD4 T FCGR3A+ Mono DC Platelet
## 9 Levels: Naive CD4 T Memory CD4 T CD14+ Mono B CD8 T FCGR3A+ Mono NK ... Platelet
To keep things simple, we re-analyze and re-cluster pbmc3k.final to serve as the query data, so that each cell receives a new cluster identity.
SeuratObj <- readData(data = pbmc3k.final, type = 'Seurat', species = "Homo sapiens")
SeuratObj <- CalMTpercent(SeuratObj, by = "use_internal_data")
SeuratObj <- QCfun(SeuratObj, plot = FALSE)
SeuratObj <- RunSeurat(SeuratObj, nPCs = 10, resolution = 1, plot = FALSE)
SeuratObj <- RunGSEA(SeuratObj, by = 'GO')
##
## 0 1 2 3 4 5 6 7 8 9
## 537 460 325 252 220 159 143 141 140 134
It’s convenient to transfer annotation from reference to query data:
df <- predictFun(query_SeuratObj = SeuratObj, reference_SeuratObj = pbmc3k.final,
                 group_by = 'seurat_annotations', cluster_by = 'seurat_clusters',
                 species = "Homo sapiens", by = 'GO', k = NULL, LDAmethod = "VEM")
## Calculating differentially expressed genes for clusters in reference data......
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Performing GSEA on reference data......
## 'select()' returned 1:many mapping between keys and columns
## Performing topic modelling on reference data......
## training a svm classifier......
## Loading required package: lattice
## Loading required package: ggplot2
## predicting cell types of the query data......
df # prediction result
##    query   prediction
## 1      0  Naive CD4 T
## 2      1  Naive CD4 T
## 3      2            B
## 4      3   CD14+ Mono
## 5      4   CD14+ Mono
## 6      5        CD8 T
## 7      6  Naive CD4 T
## 8      7           NK
## 9      8 FCGR3A+ Mono
## 10     9        CD8 T
To measure prediction accuracy in this case, we assign each cluster in the query data the ground-truth cell type that is most common among its cells.
mm <- table(SeuratObj$seurat_annotations, SeuratObj$seurat_clusters)
nn <- setNames(rownames(mm)[apply(mm, 2, which.max)], colnames(mm))
nn[df$query] # cell types of query data based on identity of each cell
##              0              1              2              3              4
## "Memory CD4 T"  "Naive CD4 T"            "B"   "CD14+ Mono"   "CD14+ Mono"
##              5              6              7              8              9
##        "CD8 T"  "Naive CD4 T"           "NK" "FCGR3A+ Mono"        "CD8 T"
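The majority-vote step above can be tried on a small example. This is a minimal sketch on hypothetical toy data (the `truth` and `cluster` vectors are invented for illustration and are not taken from pbmc3k.final); it uses only base R and applies the same `table()` / `which.max()` pattern.

```r
# Toy data (hypothetical): ground-truth label and new cluster id for six cells
truth   <- c("B", "B", "NK", "NK", "NK", "B")
cluster <- c("0", "0", "0", "1", "1", "1")

# Contingency table: rows are cell types, columns are clusters
mm <- table(truth, cluster)

# For each column (cluster), pick the row (cell type) with the highest count
majority <- setNames(rownames(mm)[apply(mm, 2, which.max)], colnames(mm))
majority
```

Note that `which.max()` returns the first maximum, so a cluster split evenly between two cell types is assigned whichever type comes first in the table's row order.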
# check the accuracy of prediction
caret::confusionMatrix(data = as.factor(df$prediction), reference = as.factor(nn[df$query]))$overall[["Accuracy"]]
## [1] 0.9
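The accuracy above is computed per cluster (ten clusters), not per cell. As a cross-check, the same figure can be reproduced in base R without caret. The sketch below simply re-types the predicted and majority-vote labels shown in the output above; the variable names `truth`, `accuracy`, and `mismatch` are illustrative only.

```r
# Labels copied from the prediction table and the majority-vote output above
prediction <- c("Naive CD4 T", "Naive CD4 T", "B", "CD14+ Mono", "CD14+ Mono",
                "CD8 T", "Naive CD4 T", "NK", "FCGR3A+ Mono", "CD8 T")
truth <- c("Memory CD4 T", "Naive CD4 T", "B", "CD14+ Mono", "CD14+ Mono",
           "CD8 T", "Naive CD4 T", "NK", "FCGR3A+ Mono", "CD8 T")
names(prediction) <- names(truth) <- as.character(0:9)

# Overall accuracy is the fraction of clusters whose two labels agree
accuracy <- mean(prediction == truth)
accuracy

# List the clusters where the predicted and majority-vote labels disagree
mismatch <- names(truth)[prediction != truth]
data.frame(cluster = mismatch,
           predicted = unname(prediction[mismatch]),
           truth = unname(truth[mismatch]))
```

Here the only disagreement is cluster 0, which is predicted as Naive CD4 T while most of its cells are labeled Memory CD4 T.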
To improve prediction accuracy, users can oversample the reference data. We skip this step here because it can be time-consuming.
reference_SeuratObj <- oversample_ref(reference_SeuratObj, number_clusters = 5,
                                      group_by = 'cellType', cluster_by = 'seurat_clusters',
                                      species = "Homo sapiens", by = 'GO', k = NULL, method = "VEM")
predictFun(query_SeuratObj, reference_SeuratObj,
           group_by = 'seurat_annotations', cluster_by = 'seurat_clusters',
           species = "Homo sapiens", by = 'GO', k = NULL, LDAmethod = "VEM")
# If the true labels of the query data are available, use caret::confusionMatrix
# to inspect the confusion matrix and the accuracy (both arguments must be factors)
caret::confusionMatrix(data = prediction, reference = true_label)
caret::confusionMatrix(data = prediction, reference = true_label)$overall[["Accuracy"]]