261-如何自己写函数做peaks的注释?
刘小泽写于2023.8.30
前言
分析ChIP-Seq、ATAC-Seq的小伙伴都知道,call peaks后的注释过程是必须之路。常用好用的工具有很多,比如之前写过的 一起学习一遍ChIPseeker的使用(之前写的,现在应该有更新内容了)、ChIPpeakAnno或者基于perl的homer(annotatePeak.pl)
今天要介绍的是自己写函数操作,主要基于了AnnotationHub、GRange的一系列操作
参考:https://compgenomr.github.io/book/peak-calling.html
首先 call peaks
因为这本书是R相关的,所以作者全程都在用R进行分析,call peaks也没有使用比较常用的macs,而是使用了normr
的R包。
library(normr)
# 如果要call broad peaks,需要先设置一下binsize
# countConfiguration = countConfigSingleEnd(binsize = 5000)
# call peaks参数也是很简单
ctcf_fit = enrichR(
# ChIP file
treatment = chip_file,
# control file
control = control_file,
# genome version
genome = "hg38",
# print intermediary steps during the analysis
verbose = FALSE,
# 针对broad peaks
#countConfig = countConfiguration
)
# 然后从这个对象中,提取出坐标信息
ctcf_peaks = getRanges(ctcf_fit)
# 都是包装好的函数
ctcf_peaks$qvalue = getQvalues(ctcf_fit)
ctcf_peaks$enrichment = getEnrichment(ctcf_fit)
# 过滤操作
ctcf_peaks = subset(ctcf_peaks, !is.na(component))
ctcf_peaks = subset(ctcf_peaks, qvalue < 0.01)
# 排个序
ctcf_peaks = ctcf_peaks[order(ctcf_peaks$qvalue)]
最后的结果就长这样:
## GRanges object with 724 ranges and 3 metadata columns:
## seqnames ranges strand | component qvalue enrichment
## <Rle> <IRanges> <Rle> | <integer> <numeric> <numeric>
## [1] chr21 43939251-43939500 * | 1 4.69881e-140 1.37891
## [2] chr21 43646751-43647000 * | 1 2.52006e-137 1.42361
## [3] chr21 43810751-43811000 * | 1 1.86404e-121 1.30519
## [4] chr21 43939001-43939250 * | 1 2.10822e-121 1.19820
## [5] chr21 37712251-37712500 * | 1 6.35711e-118 1.70989
## ... ... ... ... . ... ... ...
## [720] chr21 38172001-38172250 * | 1 0.00867374 0.951189
## [721] chr21 38806001-38806250 * | 1 0.00867374 0.951189
## [722] chr21 42009501-42009750 * | 1 0.00867374 0.656253
## [723] chr21 46153001-46153250 * | 1 0.00867374 0.951189
## [724] chr21 46294751-46295000 * | 1 0.00867374 0.722822
## -------
## seqinfo: 24 sequences from an unspecified genome
然后拿到基因组的注释内容
核心就是将类似GTF文件中的“exon”、“intron”等等进行分组,自定义一些比如“TSS”、“Intergenic”之类的。
# 利用AnnotationHub下载人类的信息
hub = AnnotationHub()
gtf = hub[['AH61126']]
seqlevels(gtf, pruning.mode='coarse') = '21'
seqlevels(gtf, pruning.mode='coarse') = paste0('chr', seqlevels(gtf))
# 自定义(这里设置promoter为TSS上下游1k),当然还可以继续细分
annotation_list = GRangesList(
tss = promoters(subset(gtf, type=='gene'), 1000, 1000),# 这里其实写promoter更合理,但是作者写了tss去作图,就先按他的意思来
exon = subset(gtf, type=='exon'),
intron = subset(gtf, type=='gene')
)
其实除了上面这样的做法,还可以:
require(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
promoters_txdb <- promoters(txdb)
promoters_txdb
# 但是这个promoter regions包含了基因+转录本
# 如果只想要基因相关的
genes_txdb <- genes(txdb)
promoters_txdb <- promoters(genes_txdb)
promoters_txdb
unique(width(promoters_txdb)) # 统计一下数量会比之前少很多,从几万降到几千
# 还可以指定上下游
promoters_txdb <- promoters(genes(txdb), upstream = 1500, downstream = 500)
unique(width(promoters_txdb))
最后将peak坐标映射到注释坐标上去
作者做了这么几步:
- Creating a disjoint set of peak regions =》 使用了
recude
函数(https://seandavi.github.io/ITR/RangesAndSignal.html):the reduce method will align the ranges and merge overlapping ranges to produce a simplified set - Finding the overlapping annotation for each peak.
- Annotating each peak with the corresponding annotation class.
- Calculating summary statistics
annotatePeaks = function(peaks, annotation_list, name){
# ------------------------------------------------ #
# 1. getting disjoint regions
# collapse touching enriched regions
peaks = reduce(peaks)
# ------------------------------------------------ #
# 2. overlapping peaks and annotation
# find overlaps between the peaks and annotation_list
result = as.data.frame(findOverlaps(peaks, annotation_list))
# ------------------------------------------------ #
# 3. annotating peaks
# fetch annotation names
result$annotation = names(annotation_list)[result$subjectHits]
# rank by annotation precedence
result = result[order(result$subjectHits),]
# remove overlapping annotations
result = subset(result, !duplicated(queryHits))
# ------------------------------------------------ #
# 4. calculating statistics
# count the number of peaks in each annotation category
result = group_by(.data = result, annotation)
result = summarise(.data = result, counts = length(annotation))
# fetch the number of intergenic peaks
result = rbind(result,
data.frame(annotation = 'intergenic',
counts = length(peaks) - sum(result$counts)))
result$frequency = with(result, round(counts/sum(counts),2))
result$experiment = name
return(result)
}
当然还有一些地方还是值得讨论的:
- 他是直接将坐标“唯一化”,直接用
recude
进行了merge,就是为了下面findOverlaps
的方便 - 分类没有做的很精细,因为精细的分类需要对基因组坐标了解更透彻
但不管怎样,也算是提供一个基础思路
利用函数注释并作图
# 如果有多组peaks结果
peak_list = list(
CTCF = ctcf_peaks,
H3K36me3 = h3k36_peaks
)
# 循环注释
annot_peaks_list = lapply(names(peak_list), function(peak_name){
annotatePeaks(peak_list[[peak_name]], annotation_list, peak_name)
})
# 整合结果
annot_peaks_df = dplyr::bind_rows(annot_peaks_list)
# 最后画图
ggplot(data = annot_peaks_df,
aes(
x = experiment,
y = frequency,
fill = annotation
)) +
geom_bar(stat='identity') +
scale_fill_brewer(palette='Set2') +
theme_bw()+
theme(
axis.text = element_text(size=18, face='bold'),
axis.title = element_text(size=14,face="bold"),
plot.title = element_text(hjust = 0.5)) +
ggtitle('Peak distribution in\ngenomic regions') +
xlab('Experiment') +
ylab('Frequency')