  • Comprehensive evaluation of gene sequence encoding methods in deep learning

    Category: Computer Science >> Computer Application Technology Submitted: 2023-02-09

    Abstract: Background: The prediction of genomic structure has become a focus of genome research. At present, prediction methods based on deep learning are more effective and accurate than other machine learning algorithms. Since gene sequence data cannot be fed directly into a deep learning model, the original data need to be encoded and converted into numerical features before model prediction. As a result, different encoding methods may affect the final accuracy. Methods: To explore the performance of different encoding methods, we compared ten strategies in six deep learning models. We also compared the performance of all methods on independent datasets and models from our laboratory. For all models, we used their original parameters. Results: Dummy encoding, hash encoding, and one-hot encoding perform best across the models. In addition, dummy encoding and one-hot encoding are the best for processing RNA data, while hash encoding is superior to the other methods for processing promoter data. When processing partial- or full-sequence data, dummy encoding, hash encoding, and one-hot encoding perform similarly. Moreover, on the sisRNA datasets and prediction models for Arabidopsis and rice, dummy encoding and one-hot encoding achieve higher prediction accuracy. Conclusions: We conclude that the best encoding method varies with the dataset. One-hot encoding, dummy encoding, and hash encoding are the three best methods for the six models. This study fills a gap in the evaluation of sequence encoding methods for deep learning and can provide a valuable reference for the community.
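
To make the encoding step in the abstract concrete, the following is a minimal sketch of two of the named methods, one-hot and dummy encoding, applied to a nucleotide sequence. It is not the authors' implementation: the function names, the `ACGT` alphabet order, the choice of the first symbol as the dropped reference category, and the all-zero handling of unknown symbols are all assumptions made for illustration. Hash encoding is omitted because the abstract does not define it.

```python
# Illustrative sketch only. The alphabet order, the dropped reference
# category, and the treatment of unknown symbols are assumptions.
ALPHABET = "ACGT"  # assumed nucleotide order


def one_hot_encode(seq, alphabet=ALPHABET):
    """Encode each symbol as a vector with one column per alphabet symbol."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:  # unknown symbols (e.g. N) stay all-zero (assumption)
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def dummy_encode(seq, alphabet=ALPHABET):
    """Like one-hot, but with the first column dropped (k - 1 columns).

    The first alphabet symbol becomes the all-zero reference category,
    so in this sketch it is indistinguishable from an unknown symbol.
    """
    return [row[1:] for row in one_hot_encode(seq, alphabet)]


if __name__ == "__main__":
    print(one_hot_encode("ACGT"))
    # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(dummy_encode("ACGT"))
    # [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

The two schemes carry the same information for a four-letter alphabet; dummy encoding simply uses one fewer column per position, which changes the input width a model sees.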