261-如何自己写函数做peaks的注释？

Aug 30, 2023 3 min read cnposts

刘小泽写于2023.8.30

前言

分析ChIP-Seq、ATAC-Seq的小伙伴都知道，call peaks后的注释过程是必须之路。常用好用的工具有很多，比如之前写过的一起学习一遍ChIPseeker的使用（之前写的，现在应该有更新内容了）、ChIPpeakAnno或者基于perl的homer（annotatePeak.pl）

今天要介绍的是自己写函数操作，主要基于了AnnotationHub、GRange的一系列操作

参考：https://compgenomr.github.io/book/peak-calling.html

首先 call peaks

因为这本书是R相关的，所以作者全程都在用R进行分析，call peaks也没有使用比较常用的macs，而是使用了normr的R包。

library(normr)
# 如果要call broad peaks，需要先设置一下binsize
# countConfiguration = countConfigSingleEnd(binsize = 5000)

# call peaks参数也是很简单
ctcf_fit = enrichR(
    
            # ChIP file
            treatment = chip_file,
            
            # control file
            control   = control_file,
            
            # genome version
            genome    = "hg38",
            
            # print intermediary steps during the analysis
            verbose   = FALSE,

  					# 针对broad peaks
						#countConfig = countConfiguration
)

# 然后从这个对象中，提取出坐标信息
ctcf_peaks = getRanges(ctcf_fit)
# 都是包装好的函数
ctcf_peaks$qvalue     = getQvalues(ctcf_fit)
ctcf_peaks$enrichment = getEnrichment(ctcf_fit)

# 过滤操作
ctcf_peaks = subset(ctcf_peaks, !is.na(component))
ctcf_peaks = subset(ctcf_peaks, qvalue < 0.01)

# 排个序
ctcf_peaks = ctcf_peaks[order(ctcf_peaks$qvalue)]

最后的结果就长这样：

## GRanges object with 724 ranges and 3 metadata columns:
##         seqnames            ranges strand | component       qvalue enrichment
##            <Rle>         <IRanges>  <Rle> | <integer>    <numeric>  <numeric>
##     [1]    chr21 43939251-43939500      * |         1 4.69881e-140    1.37891
##     [2]    chr21 43646751-43647000      * |         1 2.52006e-137    1.42361
##     [3]    chr21 43810751-43811000      * |         1 1.86404e-121    1.30519
##     [4]    chr21 43939001-43939250      * |         1 2.10822e-121    1.19820
##     [5]    chr21 37712251-37712500      * |         1 6.35711e-118    1.70989
##     ...      ...               ...    ... .       ...          ...        ...
##   [720]    chr21 38172001-38172250      * |         1   0.00867374   0.951189
##   [721]    chr21 38806001-38806250      * |         1   0.00867374   0.951189
##   [722]    chr21 42009501-42009750      * |         1   0.00867374   0.656253
##   [723]    chr21 46153001-46153250      * |         1   0.00867374   0.951189
##   [724]    chr21 46294751-46295000      * |         1   0.00867374   0.722822
##   -------
##   seqinfo: 24 sequences from an unspecified genome

然后拿到基因组的注释内容

核心就是将类似GTF文件中的“exon”、“intron”等等进行分组，自定义一些比如“TSS”、“Intergenic”之类的。

# 利用AnnotationHub下载人类的信息
hub = AnnotationHub()
gtf = hub[['AH61126']]
seqlevels(gtf, pruning.mode='coarse') = '21'
seqlevels(gtf, pruning.mode='coarse') = paste0('chr', seqlevels(gtf))

# 自定义（这里设置promoter为TSS上下游1k），当然还可以继续细分
annotation_list = GRangesList(
  tss    = promoters(subset(gtf, type=='gene'), 1000, 1000),# 这里其实写promoter更合理，但是作者写了tss去作图，就先按他的意思来
  exon   = subset(gtf, type=='exon'),
  intron = subset(gtf, type=='gene')
)

其实除了上面这样的做法，还可以：

require(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

promoters_txdb <- promoters(txdb)
promoters_txdb
# 但是这个promoter regions包含了基因+转录本

# 如果只想要基因相关的
genes_txdb <- genes(txdb)
promoters_txdb <- promoters(genes_txdb)
promoters_txdb
unique(width(promoters_txdb)) # 统计一下数量会比之前少很多，从几万降到几千

# 还可以指定上下游
promoters_txdb <- promoters(genes(txdb), upstream = 1500, downstream = 500)
unique(width(promoters_txdb))

最后将peak坐标映射到注释坐标上去

作者做了这么几步：

Creating a disjoint set of peak regions =》使用了recude函数（https://seandavi.github.io/ITR/RangesAndSignal.html）：the reduce method will align the ranges and merge overlapping ranges to produce a simplified set
Finding the overlapping annotation for each peak.
Annotating each peak with the corresponding annotation class.
Calculating summary statistics

annotatePeaks = function(peaks, annotation_list, name){
  
  # ------------------------------------------------ #
  # 1. getting disjoint regions
  # collapse touching enriched regions
  peaks = reduce(peaks)
  
  # ------------------------------------------------ #
  # 2. overlapping peaks and annotation
  # find overlaps between the peaks and annotation_list
  result = as.data.frame(findOverlaps(peaks, annotation_list))
  
  # ------------------------------------------------ #
  # 3. annotating peaks
  # fetch annotation names
  result$annotation = names(annotation_list)[result$subjectHits]
  
  # rank by annotation precedence
  result = result[order(result$subjectHits),]    
  
  # remove overlapping annotations
  result = subset(result, !duplicated(queryHits))
  
  # ------------------------------------------------ #
  # 4. calculating statistics
  # count the number of peaks in each annotation category
  result = group_by(.data = result, annotation)
  result = summarise(.data = result, counts = length(annotation))
  
  # fetch the number of intergenic peaks
  result = rbind(result, 
                 data.frame(annotation = 'intergenic', 
                            counts     = length(peaks) - sum(result$counts)))
  
  result$frequency  = with(result, round(counts/sum(counts),2))
  result$experiment = name
  
  return(result)
}

当然还有一些地方还是值得讨论的：

他是直接将坐标“唯一化”，直接用recude进行了merge，就是为了下面findOverlaps的方便
分类没有做的很精细，因为精细的分类需要对基因组坐标了解更透彻

但不管怎样，也算是提供一个基础思路

利用函数注释并作图

# 如果有多组peaks结果
peak_list = list(
    CTCF     = ctcf_peaks, 
    H3K36me3 = h3k36_peaks
)

# 循环注释
annot_peaks_list = lapply(names(peak_list), function(peak_name){
  annotatePeaks(peak_list[[peak_name]], annotation_list, peak_name)
})

# 整合结果
annot_peaks_df = dplyr::bind_rows(annot_peaks_list)

# 最后画图
ggplot(data = annot_peaks_df, 
       aes(
           x    = experiment, 
           y    = frequency, 
           fill = annotation
        )) +
  geom_bar(stat='identity') +
  scale_fill_brewer(palette='Set2') +
  theme_bw()+
  theme(
    axis.text = element_text(size=18, face='bold'),
    axis.title = element_text(size=14,face="bold"),
    plot.title = element_text(hjust = 0.5))  +
  ggtitle('Peak distribution in\ngenomic regions') +
  xlab('Experiment') +
  ylab('Frequency')

ChIP-Seq

261-如何自己写函数做peaks的注释？

前言

首先 call peaks

然后拿到基因组的注释内容

最后将peak坐标映射到注释坐标上去

利用函数注释并作图

Yunze Liu

Bioinformatics Sharer

Related