ChinaXiv.org: Chinese Academy of Sciences Preprint Platform for Scientific Papers

Institution: Agricultural Big Data Research Center, College of Information Science and Engineering, Shandong Agricultural University

1. ChinaXiv:202302.00053

Comprehensive evaluation of gene sequence encoding methods in deep learning

Category: Computer Science >> Computer Application Technology
Submitted: 2023-02-09

Authors: Li Han (李晗), Hu Jiming (胡继明), Sun Xiaoyong (孙晓勇)

Abstract:

Background: The prediction of genomic structure has become an active topic in genome research. At present, prediction methods based on deep learning are more effective and accurate than other machine learning algorithms. Because gene sequence data cannot be fed directly into a deep learning model, the original data must be encoded and converted into numerical features before model prediction. As a result, the choice of encoding method may affect final accuracy.

Methods: To explore the performance of different encoding methods, we compared ten strategies in six deep learning models. We also compared the performance of all methods on independent datasets and models from our laboratory. For all models, we used their original parameters.

Results: Dummy encoding, hash encoding, and one-hot encoding perform best across the models. Dummy encoding and one-hot encoding are the best for processing RNA data, while hash encoding is superior to the other methods for processing promoter data. When processing partial or full-sequence data, the performance of dummy encoding, hash encoding, and one-hot encoding is similar. In addition, in the sisRNA datasets and prediction models of Arabidopsis and rice, dummy encoding and one-hot encoding achieve higher prediction accuracy.

Conclusions: We conclude that the best encoding method varies with the dataset. One-hot encoding, dummy encoding, and hash encoding are the three best methods across the six models. This study addresses a gap in the evaluation of sequence encoding methods in deep learning and can provide a valuable reference for the community.
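To make the encoding step concrete, the following is a minimal sketch of two of the strategies named in the abstract, one-hot and dummy encoding. It follows the common textbook definitions and is not taken from the paper: the `ACGT` alphabet order, the function names, and the handling of unknown bases are all assumptions for illustration. Hash encoding is omitted because the abstract does not define the variant used.

```python
ALPHABET = "ACGT"  # assumed base order; the abstract does not specify one


def one_hot_encode(seq, alphabet=ALPHABET):
    """Encode each base as a vector with a single 1.

    Bases outside the alphabet (for example N) become all-zero rows,
    which is an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


def dummy_encode(seq, alphabet=ALPHABET):
    """One-hot encoding with the last category dropped.

    The dropped category (T here) is represented by the all-zero row,
    so it cannot be told apart from an unknown base in this sketch.
    """
    return [row[:-1] for row in one_hot_encode(seq, alphabet)]


print(one_hot_encode("ACGT"))
print(dummy_encode("GT"))
```

For RNA sequences the same functions could be called with `alphabet="ACGU"`. The resulting list of rows is the kind of numerical matrix that would then be passed to a model's input layer.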

Peer review status: Awaiting review

Views: 5818 | Downloads: 680 | Comments: 0
