Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (4): 1136-1141. DOI: 10.11772/j.issn.1001-9081.2022040489

• Data science and technology •

Data enhancement method for drugs under graph-structured representation

Yinjiang CAI1,2, Guangjun XU3, Xibo MA1,2

  1. National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China
  2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
  3. Data Center Beijing Branch, Agricultural Bank of China, Beijing 100088, China
  • Received: 2022-04-14 Revised: 2022-06-24 Accepted: 2022-07-05 Online: 2023-04-11 Published: 2023-04-10
  • Contact: Xibo MA
  • About the authors: CAI Yinjiang, born in 1997, M.S. candidate. His research interests include intelligent medicine.
    XU Guangjun, born in 1980, M.S., engineer. His research interests include big data mining.
  • Supported by:
    National Natural Science Foundation of China (82090051); Excellent Member Program of Youth Innovation Promotion Association of Chinese Academy of Sciences (Y201930)

图结构表示下的药物数据增强方法

蔡引江1,2, 许光俊3, 马喜波1,2

  1. 模式识别国家重点实验室(中国科学院 自动化研究所), 北京 100190
  2. 中国科学院大学 人工智能学院, 北京 100049
  3. 中国农业银行 数据中心北京分部, 北京 100088
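  • Corresponding author: 马喜波 (Xibo MA)
  • About the authors: 蔡引江 (CAI Yinjiang), born in 1997, male, native of Yangzhou, Jiangsu, M.S. candidate. Main research interest: intelligent medicine.
    许光俊 (XU Guangjun), born in 1980, male, native of Jinan, Shandong, engineer, M.S. Main research interest: big data mining.
  • Supported by:
    National Natural Science Foundation of China (82090051); Excellent Member Program of Youth Innovation Promotion Association of Chinese Academy of Sciences (Y201930)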

Abstract:

Small-sample data can cause machine learning models to overfit. In drug development, most datasets are small, which greatly limits the application of machine learning techniques in this field. To address this problem, a drug data enhancement method based on graph-structured representation is proposed. The method perturbs existing samples to generate new, similar samples and thereby expands the dataset. It consists of four sub-methods: a node discarding method based on the molecular backbone, an edge discarding method based on the molecular backbone, a multi-sample splicing method, and a hybrid strategy method. Specifically, the backbone-based node and edge discarding methods perturb a drug molecule by deleting a small part of its composition and structure; the multi-sample splicing method is an additive operation that perturbs samples by combining different molecules; and the hybrid strategy method mixes the deletion and addition operations in a certain ratio to increase the diversity of the enhanced data. On the public datasets BACE, BBBP, ToxCast and ClinTox, the proposed method improves the Area Under the receiver operating characteristic Curve (AUC) of the drug attribute prediction baseline model MG-BERT (Molecular Graph Bidirectional Encoder Representations from Transformers) by 1.94% to 12.49%. The experimental results demonstrate the effectiveness of the proposed method for enhancing small-sample drug data.
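
The four sub-methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the backbone is identified, how many nodes or edges are deleted, or how spliced molecules are connected and labelled. The sketch therefore assumes that the backbone is supplied as a set of protected node indices, that deletion only touches nodes and bonds outside the backbone, and that splicing is a plain disjoint union of two graphs. The names MolGraph, drop_nodes, drop_edges, splice and hybrid, and the parameters ratio and p_delete, are hypothetical.

```python
from __future__ import annotations
import random
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    atoms: list[str]                # element symbol of each node
    bonds: set[frozenset[int]]      # undirected edges between node indices
    backbone: set[int] = field(default_factory=set)  # assumed: protected nodes


def drop_nodes(g: MolGraph, ratio: float, rng: random.Random) -> MolGraph:
    """Delete a fraction of the non-backbone nodes and their incident bonds."""
    candidates = [i for i in range(len(g.atoms)) if i not in g.backbone]
    removed = set(rng.sample(candidates, int(len(candidates) * ratio)))
    keep = [i for i in range(len(g.atoms)) if i not in removed]
    remap = {old: new for new, old in enumerate(keep)}  # reindex survivors
    bonds = {frozenset(remap[i] for i in b) for b in g.bonds if not (b & removed)}
    return MolGraph([g.atoms[i] for i in keep], bonds, {remap[i] for i in g.backbone})


def drop_edges(g: MolGraph, ratio: float, rng: random.Random) -> MolGraph:
    """Delete a fraction of the bonds that are not entirely inside the backbone."""
    candidates = sorted((b for b in g.bonds if not b <= g.backbone), key=sorted)
    removed = set(rng.sample(candidates, int(len(candidates) * ratio)))
    return MolGraph(list(g.atoms), g.bonds - removed, set(g.backbone))


def splice(a: MolGraph, b: MolGraph) -> MolGraph:
    """Additive perturbation (assumed form): disjoint union of two molecules."""
    off = len(a.atoms)  # shift the indices of the second molecule
    bonds = a.bonds | {frozenset(i + off for i in e) for e in b.bonds}
    return MolGraph(a.atoms + b.atoms, bonds, a.backbone | {i + off for i in b.backbone})


def hybrid(g: MolGraph, pool: list[MolGraph], rng: random.Random,
           p_delete: float = 0.5, ratio: float = 0.2) -> MolGraph:
    """Pick a deletion operation with probability p_delete, otherwise splice."""
    if rng.random() < p_delete:
        return rng.choice([drop_nodes, drop_edges])(g, ratio, rng)
    return splice(g, rng.choice(pool))


# Toy molecule: a six-carbon ring (treated as the backbone) with a C-O side chain.
ring = {frozenset((i, (i + 1) % 6)) for i in range(6)}
mol = MolGraph(["C"] * 7 + ["O"], ring | {frozenset((0, 6)), frozenset((6, 7))}, set(range(6)))
rng = random.Random(0)
augmented = [drop_nodes(mol, 0.5, rng), drop_edges(mol, 0.5, rng),
             splice(mol, mol), hybrid(mol, [mol], rng)]
print([(len(a.atoms), len(a.bonds)) for a in augmented])
```

Each function returns a new graph and leaves its input unchanged, so one original molecule can yield several perturbed copies. How the property label of a spliced sample should be set is outside the scope of this sketch.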

Key words: small sample data, drug molecule, data enhancement, graph-structured representation, drug attribute prediction

CLC Number: