Journal of Computer Applications, 2018, Vol. 38, Issue (6): 1826-1830. DOI: 10.11772/j.issn.1001-9081.2017112749

• Frontier, Interdisciplinary and Comprehensive Applications •

Cis-regulatory motif finding algorithm in chromatin immunoprecipitation sequencing datasets

FENG Yanxia, ZHANG Zhihong, ZHANG Shaoqiang

  1. College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
  • Received: 2017-11-22 Revised: 2018-01-16 Online: 2018-06-10 Published: 2018-06-13
  • Corresponding author: ZHANG Shaoqiang
  • About the authors: FENG Yanxia (born 1991), female, from Lüliang, Shanxi; M.S. candidate, CCF member; research interest: bioinformatics computing. ZHANG Zhihong (born 1991), female, from Zhoukou, Henan; M.S. candidate; research interest: bioinformatics computing. ZHANG Shaoqiang (born 1976), male, from Tianjin; professor, Ph.D., CCF member; research interest: bioinformatics computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61572358) and the Natural Science Foundation of Tianjin (16JCYBJC23600).

Abstract: To address motif finding in Chromatin Immunoprecipitation Sequencing (ChIP-Seq) datasets produced by Next-Generation Sequencing (NGS), a motif finding algorithm based on Fisher's exact test, called FisherNet, is proposed. First, Fisher's exact test is used to compute the P values of all k-mers, and motif seeds are selected from them. Second, the position weight matrix of the initial motif is constructed. Finally, the position weight matrix is used to scan all k-mers to form the final motif. The algorithm was validated on ChIP-Seq datasets from mouse Embryonic Stem Cells (mESC), mouse erythrocytes, and human lymphoblastoid cell lines, as well as on data from the ENCODE database. The results show that the proposed algorithm is more accurate and faster than other common motif finding algorithms, and that it finds more than 80% of the core motifs of known transcription factors and their co-factors. The algorithm can therefore be applied to large-scale sequencing datasets while maintaining high accuracy.

Key words: motif finding algorithm, cis-regulatory, eukaryote, Chromatin Immunoprecipitation Sequencing (ChIP-Seq), transcription factor
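
The three steps named in the abstract (Fisher's exact test on k-mers, position weight matrix construction, k-mer scanning) can be illustrated with a minimal Python sketch. This is not the FisherNet implementation: the abstract does not specify the contingency table, the seed selection rule, or the scoring scheme, so the following are all assumptions made for illustration: a one-sided test comparing how many peak (foreground) sequences versus background sequences contain a k-mer, a uniform base background of 0.25, a pseudocount of 1, and a fixed score threshold. The function names are hypothetical.

```python
from math import comb, log2


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1))
    return tail / comb(n, col1)


def kmers_in(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_p_values(foreground, background, k):
    """Step 1 (assumed table layout): rows are foreground/background,
    columns are sequences with/without the k-mer."""
    fg_sets = [kmers_in(s, k) for s in foreground]
    bg_sets = [kmers_in(s, k) for s in background]
    p_values = {}
    for kmer in set().union(*fg_sets):
        a = sum(kmer in s for s in fg_sets)
        c = sum(kmer in s for s in bg_sets)
        p_values[kmer] = fisher_right_tail(a, len(fg_sets) - a,
                                           c, len(bg_sets) - c)
    return p_values


def build_pwm(sites, pseudocount=1.0):
    """Step 2: log-odds position weight matrix from aligned k-mers,
    assuming a uniform background frequency of 0.25 per base."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({base: log2((column.count(base) + pseudocount) / total / 0.25)
                    for base in "ACGT"})
    return pwm


def pwm_score(pwm, kmer):
    """Sum of per-position log-odds scores for one k-mer."""
    return sum(col[base] for col, base in zip(pwm, kmer))


def scan_kmers(pwm, sequences, k, threshold):
    """Step 3: keep every k-mer whose PWM score reaches the threshold."""
    return sorted({km for s in sequences for km in kmers_in(s, k)
                   if pwm_score(pwm, km) >= threshold})


# Toy data (invented for illustration): GATAAG is planted in every peak.
peaks = ["TTGATAAGCC", "CAGATAAGTT", "GGGATAAGAC"]
controls = ["TTTTCCCCAA", "ACACACACAC", "CCGGTTAACC"]

p_values = kmer_p_values(peaks, controls, k=6)
seed = min(p_values, key=p_values.get)          # most significant k-mer
motif = scan_kmers(build_pwm([seed]), peaks, k=6, threshold=4.0)
```

On real data the seed set would contain many k-mers and the P values would need correction for multiple testing; the sketch omits both to keep the three steps visible.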

CLC Number: