计算机应用 ›› 2010, Vol. 30 ›› Issue (05): 1379-1382.

• 典型应用 • 上一篇    下一篇

多重压缩DNA序列数据

张丽霞1,宋鸿陟2   

  1. 1. 华南农业大学信息学院
    2.
  • 收稿日期:2009-10-30 修回日期:2009-12-24 发布日期:2010-05-04 出版日期:2010-05-01
  • 通讯作者: 张丽霞
  • 基金资助:
    国家自然科学基金资助项目;教育部留学回国人员科研启动基金第31批次

Multiple-compression of DNA sequence data

  • Received:2009-10-30 Revised:2009-12-24 Online:2010-05-04 Published:2010-05-01

摘要: 根据DNA序列数据的特点,提出对DNA序列数据进行多重压缩的思想。多重压缩的首要步骤是扩展字母表。首先对DNA序列数据进行0/1编码,然后每8位转换成一个ASCII码字符,将原来的DNA序列数据仅含有的4个字符扩展到256个字符。第二步采取基于统计模型的Huffman编码压缩算法和基于转换模型的Burrows-Wheeler算法,对扩展后的DNA序列数据进行二次压缩。最后对各种算法的压缩结果进行性能分析比较,比较结果显示,多重压缩算法有较优的压缩比。

关键词: DNA序列数据, 多重压缩, Huffman编码, Burrows-Wheeler编码

Abstract: The multiple-compression concept was proposed based on the characteristics of DNA sequence data. The first step of multiple-compression is to enlarge the alphabet. Firstly, DNA sequence data was encoded by 0/1 coding, and then every eight bits was transformed into an ASCII character. Through this step, the former DNA sequence data with only four characters would be enlarged to 256 characters. Secondly, the enlarged DNA sequence data was compressed for the second time with Huffman coding compression algorithm based on statistical model and Burrows-Wheeler coding algorithm based on transition model separately. Finally, the compression performances of all algorithms were compared. The results show that the multiple-compression algorithm is better than others in terms of compression ratio.

Key words: DNA sequence data, multiple-compression, Huffman coding, Burrows-Wheeler coding