ChinaXiv.org: the scientific paper preprint platform of the Chinese Academy of Sciences

26 results in total

1. ChinaXiv:202311.00040

A Novel Framework for Future Natural Language Processing From a Database Perspective

Category: Computer Science >> Natural Language Understanding and Machine Translation | Submitted: 2023-11-01

Limin Zhang

Abstract: Most research and applications on natural language still concentrate on its superficial features and structures. However, natural language is essentially a way of encoding information and knowledge. Thus, the focus should be on what is encoded and how it is encoded. In line with this, we suggest a database-based approach to natural language processing that emulates the encoding of information and knowledge to build models. Based on these models, 1) generating sentences becomes akin to reading data from the models (or databases) and encoding it following some rules; 2) understanding sentences involves decoding rules and a series of Boolean operations on the databases; 3) learning can be accomplished by writing to the databases. Our method closely mirrors how the human brain processes information, offering excellent interpretability and expandability.

Peer review status: awaiting review

Views: 831 | Downloads: 201 | Comments: 0
2. ChinaXiv:202309.00134

Understanding principal component analysis

Category: Statistics >> Applied Statistical Mathematics | Submitted: 2023-09-15

Lin, Shiqiang

Abstract: Principal component analysis (PCA) is a frequently used machine learning method. In this paper, the PCA operation is explained through examples illustrated with Python programs. A proof of the diagonalizability of real symmetric matrices is also included, which may help in understanding the mathematics behind PCA.

Peer review status: awaiting review

Views: 934 | Downloads: 187 | Comments: 0
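
The PCA abstract above (entry 2) refers to Python illustrations and to the diagonalizability of real symmetric matrices. As a rough sketch of that idea, and not the paper's own program, the following pure-Python example diagonalizes the 2x2 sample covariance matrix of a small 2D dataset in closed form. The function name `pca_2d` and the sample points are assumptions made for illustration.

```python
import math


def pca_2d(points):
    """Return (eigenvalues, eigenvectors) of the 2x2 sample covariance
    matrix of `points`, ordered from largest to smallest eigenvalue."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a real symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = [half_trace + radius, half_trace - radius]
    # The leading eigenvector is the rotation of the x-axis by theta;
    # the second eigenvector is perpendicular to it.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return eigenvalues, [first, second]


# Hypothetical data lying exactly on the line y = x.
values, vectors = pca_2d([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(values)      # all of the variance falls on the first component
print(vectors[0])  # direction of the first principal component
```

Because the covariance matrix is real and symmetric, its two eigenvectors are orthogonal, which is the property the abstract's proof concerns. For higher dimensions a numerical eigensolver would replace the closed form.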
3. ChinaXiv:202005.00043

DEED: A general quantization scheme for saving bits in communication

Category: Mathematics >> Control and Optimization | Submitted: 2020-06-16

Tian Ye, Peijun Xiao, Ruoyu Sun

Abstract: Quantization is a popular technique to reduce communication in distributed optimization. Motivated by the classical work on inexact gradient descent (GD) \cite{bertsekas2000gradient}, we provide a general convergence analysis framework for inexact GD that is tailored for quantization schemes. We also propose a quantization scheme, Double Encoding and Error Diminishing (DEED). DEED can achieve small communication complexity in three settings: frequent-communication large-memory, frequent-communication small-memory, and infrequent-communication (e.g., federated learning). More specifically, in the frequent-communication large-memory setting, DEED can be easily combined with Nesterov's method, so that the total number of bits required is $\tilde{O}(\sqrt{\kappa} \log(1/\epsilon))$, where $\tilde{O}$ hides numerical constants and $\log \kappa$ factors. In the frequent-communication small-memory setting, DEED combined with SGD requires only $\tilde{O}(\kappa \log(1/\epsilon))$ bits in the interpolation regime. In the infrequent-communication setting, DEED combined with Federated Averaging requires a smaller total number of bits than Federated Averaging alone. All these algorithms converge at the same rate as their non-quantized versions while using fewer bits.

Peer review status: awaiting review

Views: 23758 | Downloads: 2527 | Comments: 0
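
To make the idea of gradient descent with quantized (inexact) gradients in entry 3 more concrete, here is a toy sketch. It is not the DEED scheme, whose details the abstract does not give. It only shows a generic uniform quantizer whose range shrinks geometrically as the iterates converge, so a fixed number of bits per step keeps buying more precision. The objective f(x) = x²/2, the function names, and all parameter values are assumptions chosen for illustration.

```python
def quantize(value, radius, bits):
    """Round `value` to the nearest of 2**bits evenly spaced levels
    on [-radius, radius]; the error is at most half a step in range."""
    intervals = 2 ** bits - 1
    step = 2 * radius / intervals
    clipped = max(-radius, min(radius, value))
    index = round((clipped + radius) / step)  # integer that would be sent
    return -radius + index * step


def quantized_gd(x0, radius0, bits=4, lr=0.5, shrink=0.7, iterations=30):
    """Minimize the toy objective f(x) = x**2 / 2 using only quantized
    gradients. Returns the final iterate and the total bits sent."""
    x, radius, total_bits = x0, radius0, 0
    for _ in range(iterations):
        gradient = x  # f'(x) = x for this toy objective
        x -= lr * quantize(gradient, radius, bits)
        total_bits += bits
        radius *= shrink  # shrink the quantization range each round
    return x, total_bits


x_final, bits_used = quantized_gd(x0=10.0, radius0=16.0)
print(x_final, bits_used)
```

With these toy settings the gradient always stays inside the shrinking range, so the iterate contracts geometrically even though each step transmits only 4 bits.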
4. ChinaXiv:202407.00103

Experimental investigation on acoustic emission precursor of rockburst based on unsupervised machine learning method

Category: Mining Engineering Technology >> Other Disciplines of Mining Engineering Technology | Submitted: 2024-07-08

Jie Sun, Dongqiao Liu, Pengfei He, Longji Guo, Binghao Cao, Lei Zhang, Zhe Li

Abstract: The key to achieving rockburst warning lies in understanding rockburst precursors. Considering the correlation characteristics of rockburst acoustic emission (AE) parameters, a self-organizing map neural network (SOMNN) based method for rockburst precursor inversion was proposed. The distinguishing feature of this method is a cyclic data segmentation iteration process built on the ideas of "interference signal screening", "key signal extraction", and "precursor signal inversion". The rationality of this method was verified in three groups of rockburst experiments. The results revealed that rockburst AE precursor signals consist of a series of signals characterized by long duration, high energy, low average frequency, high energy amplitude, and low peak frequency. Subsequently, the potential value of the precursor obtained in this study for long-term rockburst warning was shown through comparison with conventional precursors. Finally, a preliminary interpretation of the rockburst precursor was proposed within the framework of the physical significance of AE parameters, and it is revealed that AE precursor signals are likely linked to the creation of large-scale tensile cracks before rockburst.

Peer review status: passed

Views: 214 | Downloads: 37 | Comments: 0
5. ChinaXiv:202309.00079

Predictions of nuclear charge radii based on the convolutional neural network

Category: Physics >> Nuclear Physics | Submitted: 2023-09-07

Cao, Yingyu; Guo, Jianyou; Zhou, Bo

Abstract: In this study, we developed a neural network that combines a fully connected layer with a convolutional layer to predict nuclear charge radii based on the relationships between four local nuclear charge radii. The convolutional neural network (CNN) combines the isospin and pairing effects to describe the charge radii of nuclei with $A \geq 39$ and $Z \geq 20$. The developed neural network achieved a root-mean-square (RMS) deviation of 0.0195 fm for a dataset of 928 nuclei. Specifically, the CNN reproduced the inverted parabolic trend and the odd-even staggering observed in the calcium isotopic chain, demonstrating reliable predictive capability.

Peer review status: passed

Views: 3984 | Downloads: 224 | Comments: 0
6. ChinaXiv:202211.00424

Comparative Evaluation and Comprehensive Analysis of Machine Learning Models for Regression Problems

Category: Computer Science >> Integrated Theory of Computer Science | Submitted: 2022-11-28 | Partner journal: Data Intelligence (《数据智能（英文）》)

Boran, Sekeroglu; Yoney, Kirsal Ever; Kamil, Dimililer; Fadi, Al-Turjman

Abstract: Artificial intelligence and machine learning applications are of significant importance in almost every field of human life for solving problems or supporting human experts. However, choosing the machine learning model that achieves a superior result for a particular problem across the wide range of real-life application areas is still a challenging task for researchers. The success of a model can be affected by several factors, such as dataset characteristics, training strategy, and model responses. Therefore, a comprehensive analysis is required to determine model ability and the efficiency of the considered strategies. This study implemented ten benchmark machine learning models on seventeen varied datasets. Experiments are performed using four different training strategies: 60:40, 70:30, and 80:20 hold-out splits and five-fold cross-validation. We used three evaluation metrics to evaluate the experimental results: mean squared error, mean absolute error, and the coefficient of determination ($R^2$ score). The considered models are analyzed, and each model's advantages, disadvantages, and data dependencies are indicated. Across the large number of experiments performed, the deep Long Short-Term Memory (LSTM) neural network outperformed the other considered models, namely decision tree, linear regression, support vector regression with linear and radial basis function kernels, random forest, gradient boosting, extreme gradient boosting, shallow neural network, and deep neural network. It has also been shown that cross-validation has a tremendous impact on the results of the experiments and should be considered for model evaluation in regression studies where data mining or selection is not performed.

Views: 2804 | Downloads: 450 | Comments: 0
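
The three evaluation metrics named in entry 6 (mean squared error, mean absolute error, and the coefficient of determination) can be written down in a few lines. The sketch below uses their standard definitions; the function name `regression_metrics` and the sample values are made up for illustration and do not come from the paper.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Hypothetical targets and predictions.
mse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(mse, mae, r2)
```

Under a hold-out strategy these metrics are computed once on the held-out portion. Under five-fold cross-validation they are computed on each fold and then averaged, which is one reason the two strategies can rank models differently.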
7. ChinaXiv:202405.00253

Machine learning the apparent diffusion coefficient of Se(IV) in compacted bentonite

Category: Nuclear Science and Technology >> Radiation Physics and Technology | Submitted: 2024-05-22

Xiaoqiong Shi, Junlei Tian, Jiacong Shen, Zhengye Feng, Jiaxing Feng, Tao Wu, Qingfeng Li

Abstract: Light Gradient Boosting Machine (LightGBM) and Random Forest (RF) algorithms were used to predict the apparent diffusion coefficient of Se(IV) in compacted bentonite. Seven instances of Se(IV) were measured using the through-diffusion method. LightGBM ($R^2$ = 0.98 and RMSE = 0.025) exhibited superior predictive accuracy with a training dataset consisting of 956 instances and eight input features from the Japan Atomic Energy Agency (JAEA-DDB). Shapley Additive Explanation and Partial Dependence Plot analyses revealed valuable insights into the diffusion mechanism of adsorbed anions, obtained by evaluating the relationships between the apparent diffusion coefficient and its dependence on each input feature.

Peer review status: awaiting review

Views: 769 | Downloads: 311 | Comments: 0
8. ChinaXiv:202211.00440

HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction

Category: Computer Science >> Integrated Theory of Computer Science | Submitted: 2022-11-28 | Partner journal: Data Intelligence (《数据智能（英文）》)

Amirpasha, Mozaffari; Michael, Langguth; Bing, Gong; Jessica, Ahring; Adrian, Rojas Campos; Pascal, Nieters; Otoniel, José Campos Escobar; Martin, Wittenbrink; Peter, Baumann; Martin, G. Schultz

Abstract: Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power pave the way. Ensuring FAIR data and reproducible ML practices is a significant challenge for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss the FAIR Digital Object (FDO) and Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.

Views: 956 | Downloads: 290 | Comments: 0
9. ChinaXiv:202211.00447

Canonical Workflow for Machine Learning Tasks

Category: Computer Science >> Integrated Theory of Computer Science | Submitted: 2022-11-28 | Partner journal: Data Intelligence (《数据智能（英文）》)

Christophe, Blanchi; Binyam, Gebre; Peter, Wittenburg

Abstract: There is a huge gap between (1) the state of workflow technology on the one hand and the practices in the many labs working with data-driven methods on the other, and (2) the awareness of the FAIR principles and the lack of changes in practices during the last 5 years. The CWFR concept has been defined to address both, increasing the use of workflow technology and improving FAIR compliance. In the study described in this paper, we indicate how this could be applied to machine learning, which is now used by almost all research disciplines, with the well-known effect of a huge lack of repeatability and reproducibility. Researchers will only change practices if they can work efficiently and are not loaded with additional tasks. A comprehensive CWFR framework would be an umbrella for all steps that need to be carried out to do machine learning on selected data collections and would immediately create comprehensive and FAIR-compliant documentation. The researcher is guided by such a framework, and information, once entered, can easily be shared and reused. The many iterations normally required in machine learning can be dealt with efficiently using CWFR methods. Libraries of components that can be easily orchestrated, using FAIR Digital Objects as a common entity to document all actions and to exchange information between steps without the researcher needing to understand anything about PIDs and FDO details, are probably the way to increase efficiency in repeating research workflows. As the Galaxy project indicates, the availability of supporting tools will be important to let researchers use these methods. Unlike what the Galaxy framework suggests, however, it would be necessary to include all steps needed for a machine learning task, including those that require human interaction, and to document all phases with the help of structured FDOs.

Views: 537 | Downloads: 158 | Comments: 0
10. ChinaXiv:202405.00120

Reliable calculations of nuclear binding energies by the Gaussian process of machine learning

Category: Physics >> Nuclear Physics | Submitted: 2024-05-11

Ziyi Yuan, Dong Bai, Zhen Wang, Zhongzhou Ren

Abstract: Reliable calculations of nuclear binding energies are crucial for advancing research in nuclear physics. Machine learning provides an innovative approach to exploring complex physical problems. In this study, nuclear binding energies are modeled directly using a machine-learning method called the Gaussian process. First, the binding energies for 2238 nuclei with $Z > 20$ and $N > 20$ are calculated using the Gaussian process in a physically motivated feature space, yielding an average deviation of 0.046 MeV and a standard deviation of 0.066 MeV. The results show the good learning ability of the Gaussian process in studies of binding energies. Then, the predictive power of the Gaussian process is studied by calculating the binding energies for 108 nuclei newly included in AME2020. The theoretical results are in good agreement with the experimental data, reflecting the good predictive power of the Gaussian process. Moreover, the α-decay energies for 1169 nuclei with $50 \leq Z \leq 110$ are derived from the theoretical binding energies calculated using the Gaussian process. The average deviation and the standard deviation are 0.047 MeV and 0.070 MeV, respectively. Notably, the calculated α-decay energies for the two new isotopes $^{204}$Ac [M. H. Huang et al., Phys. Lett. B 834, 137484 (2022)] and $^{207}$Th [H. B. Yang et al., Phys. Rev. C 105, L051302 (2022)] agree well with the latest experimental data. These results demonstrate that the Gaussian process is reliable for the calculation of nuclear binding energies. Finally, the α-decay properties of some unknown actinide nuclei are predicted using the Gaussian process. The predicted results can be useful guides for future research on binding energies and α-decay properties.

Peer review status: passed

Views: 939 | Downloads: 497 | Comments: 0
11. ChinaXiv:202404.00032

Comparison of different kernel functions in nuclear charge radius predictions by the kernel ridge regression method

Category: Physics >> Nuclear Physics | Submitted: 2024-04-01

Zhen-Hua Zhang, Lu Tang

Abstract: Using two nuclear models, i) the relativistic continuum Hartree-Bogoliubov (RCHB) theory and ii) the Weizsäcker-Skyrme (WS) model WS$^\ast$, the performances of nine kinds of kernel functions in the kernel ridge regression (KRR) method are investigated by comparing their accuracy in describing the experimental nuclear charge radii and their extrapolation abilities. It is found that, except for the inverse power kernel, the other kernels can reach the same level, around 0.015-0.016 fm, for these two models with the KRR method. The extrapolation ability of each kernel in the neutron-rich region depends on the training data. Our investigation shows that the power kernel and the multiquadric kernel perform better in the RCHB+KRR calculation, and the Gaussian kernel performs better in the WS$^\ast$+KRR calculation. In addition, the performance of different basis functions in the radial basis function method is also investigated for comparison. The results are similar to those of the KRR method. The influence of different kernels on the KRR reconstruction function is discussed by investigating the whole nuclear chart. Finally, the charge radii of some specific isotopic chains have been investigated by RCHB+KRR with the power kernel and WS$^\ast$+KRR with the Gaussian kernel. The charge radii and most of the specific features in these isotopic chains can be reproduced after applying the KRR method.

Peer review status: passed

Views: 1442 | Downloads: 750 | Comments: 0
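
For readers unfamiliar with the kernel ridge regression (KRR) method discussed in entry 11, the following is a minimal sketch of textbook KRR with a Gaussian kernel. It solves (K + λI)α = y and predicts with f(x) = Σ αᵢ k(x, xᵢ). It is not the authors' implementation. The one-dimensional inputs, the target values, and the helper names (`gaussian_kernel`, `solve`, `krr_fit`, `krr_predict`) are all assumptions for illustration; the paper works with nuclear data and compares nine kernels.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def krr_fit(xs, ys, sigma, lam):
    """Return the weights alpha solving (K + lam * I) alpha = y."""
    n = len(xs)
    k = [[gaussian_kernel(xs[i], xs[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(k, ys)


def krr_predict(x, xs, alphas, sigma):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alphas, xs))


# Hypothetical training data: positions and target values to be learned.
train_x = [0.0, 1.0, 2.0]
train_y = [0.1, -0.2, 0.05]
weights = krr_fit(train_x, train_y, sigma=1.0, lam=1e-9)
print(krr_predict(1.5, train_x, weights, sigma=1.0))
```

The regularization strength `lam` controls the trade-off: a tiny value reproduces the training targets almost exactly, while a large value pulls every prediction toward zero. Swapping `gaussian_kernel` for another kernel function is the kind of comparison the abstract describes.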
12. ChinaXiv:202312.00131

Nuclear charge radius predictions by kernel ridge regression with odd-even effects

Category: Physics >> Nuclear Physics | Submitted: 2023-12-13

Lu Tang, Zhen-Hua Zhang

Abstract: The extended kernel ridge regression (EKRR) method with odd-even effects was adopted to improve the description of the nuclear charge radius using five commonly used nuclear models. These are: (i) the isospin-dependent $A^{1/3}$ formula, (ii) the relativistic continuum Hartree-Bogoliubov (RCHB) theory, (iii) the Hartree-Fock-Bogoliubov (HFB) model HFB25, (iv) the Weizsäcker-Skyrme (WS) model WS$^\ast$, and (v) the HFB25$^\ast$ model. In the last two models, the charge radii were calculated using a five-parameter formula with the nuclear shell corrections and deformations obtained from the WS and HFB25 models, respectively. For each model, the resultant root-mean-square deviation for the 1014 nuclei with proton number $Z \geq 8$ can be significantly reduced to 0.009-0.013 fm after considering the modification with the EKRR method. The best among them was the RCHB model, with a root-mean-square deviation of 0.0092 fm. The extrapolation abilities of the KRR and EKRR methods for the neutron-rich region were examined, and it was found that after considering the odd-even effects, the extrapolation power was improved compared with that of the original KRR method. The strong odd-even staggering of the nuclear charge radii of Ca and Cu isotopes and the abrupt kinks across the neutron $N=126$ and 82 shell closures were also calculated and could be reproduced quite well using the EKRR method.

Peer review status: passed

Views: 548 | Downloads: 146 | Comments: 0
13. ChinaXiv:202301.00054

Machine Learning and Font Design -- Taking the Print Advertisement Generated by Artificial Intelligence As an Example

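Category: Engineering and Technological Sciences >> Other Disciplines of Engineering and Technological Sciences | Submitted: 2023-01-03 | Partner journal: 2022 3rd Symposium on Art Design, Communication and Engineering Science (《2022年第三届艺术设计、传播与工程科学研讨会》)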

Jianjun Fang

Abstract: As artificial intelligence has gradually matured, it has been widely combined with print advertising, and the intelligent generation of print advertisements is now widely used. However, AI-generated print ads have gradually exposed their shortcomings in font design. Relying only on a single copyrighted font library for graphic composition easily leads to fonts that do not suit the advertising theme, that are inconsistent with the brand image positioning, and that fall short of the advertising claims. Therefore, introducing machine-learning font design into the field of print advertising has practical value.

Views: 713 | Downloads: 241 | Comments: 0
14. ChinaXiv:202105.00003

Predicting Sepsis via an ICU Clinical Data Integration System

Category: Medicine and Pharmacy >> Clinical Medicine | Submitted: 2021-05-07

Chen, Qiyu; Li, Ranran; Lin, Zhizhe; Lai, Zhiming; Xue, Peijiao; Jiang, Jingfeng; Lu, Wenlian; Li, Lei; Tang, Yaoqing

Abstract: Sepsis is a major issue in critical care medicine, and early detection and intervention are key to survival. We established a sepsis early warning system based on a data integration platform that can be implemented in the ICU. The sepsis early warning module can detect the onset of sepsis up to 5 hours in advance, and the data integration platform integrates, standardizes, and stores information from different medical devices, making the inference of the early warning module possible. Our best early warning model achieved an AUC of 0.9833 in the task of detecting sepsis 4 hours in advance on the open-source database. Our data integration platform has already been operational in a hospital for months.

Peer review status: awaiting review

Views: 31113 | Downloads: 1666 | Comments: 0
15. ChinaXiv:202407.00124

Simulation and experimental comparison of the performance of four-corner-readout plastic scintillator muon-detector system

Category: Nuclear Science and Technology >> Nuclear Detection Technology and Nuclear Electronics | Submitted: 2024-07-04

HeLie Wang, Xiao Dong, Si-Yuan Luo, Xiang-Man Liu

Abstract: Cosmic-ray muons are highly penetrating background-radiation particles found in natural environments. In this study, we develop and test a plastic scintillator muon detector based on machine-learning algorithms. The detector underwent muon position-resolution tests at the Institute of Modern Physics in Lanzhou using a multiwire drift chamber (MWDC) experimental platform. In the simulation, the same structural and performance parameters were maintained to ensure the reliability of the simulation results. The Gaussian process regression (GPR) algorithm was used as the position-reconstruction algorithm owing to its optimal performance. The results of the Time Difference of Arrival algorithm were incorporated as one of the features of the GPR model to reconstruct the muon hit positions. The accuracy of the position reconstruction was evaluated by comparing the experimental results with Geant4 simulation results. In the simulation, large-area plastic scintillator detectors achieved a position resolution better than 20 mm. In the experimental-platform tests, the position resolutions of the test detectors were 27.9 mm. We also analyzed factors affecting the position resolution, including the critical angle of the total internal reflection of the photomultiplier tubes and the distribution of muons in the MWDC. Simulations were performed to image both large objects and objects with different atomic numbers. The results showed that the system could image high- and low-Z materials in the constructed model and distinguish objects with significant density differences. This study demonstrates the feasibility of the proposed system, thereby providing a new detector system for muon-imaging applications.

Peer review status: passed

Views: 124 | Downloads: 27 | Comments: 0
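
Entry 15 mentions the Time Difference of Arrival (TDOA) algorithm as one input feature for position reconstruction. The sketch below shows the basic TDOA idea in a deliberately simplified setting: a one-dimensional bar read out at both ends, with light travelling at a constant effective speed. The paper's detector is read out at four corners and combines TDOA with Gaussian process regression, which this example does not attempt. The bar length, the effective speed of 150 mm/ns, and the function names are assumptions.

```python
def arrival_times(hit_x, length, speed, t0=0.0):
    """Arrival times at the left and right ends of a bar of the given
    length for a hit at `hit_x`, measured from the bar's center."""
    t_left = t0 + (length / 2 + hit_x) / speed
    t_right = t0 + (length / 2 - hit_x) / speed
    return t_left, t_right


def tdoa_position(t_left, t_right, speed):
    """Reconstruct the hit position from the arrival-time difference.
    Since t_left - t_right = 2 * x / speed, the unknown emission time
    and the bar length cancel out."""
    return speed * (t_left - t_right) / 2.0


# Hypothetical bar: 1000 mm long, effective light speed 150 mm/ns.
t_l, t_r = arrival_times(hit_x=100.0, length=1000.0, speed=150.0, t0=5.0)
print(tdoa_position(t_l, t_r, speed=150.0))
```

In a real detector, timing jitter and light attenuation blur this estimate, which is presumably why the TDOA result is fed to a regression model as one feature instead of being used on its own.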
16. ChinaXiv:202211.00209

The Open Data Challenge: An Analysis of 124,000 Data Availability Statements and an Ironic Lesson about Data Management Plans

Category: Computer Science >> Integrated Theory of Computer Science | Submitted: 2022-11-18 | Partner journal: Data Intelligence (《数据智能（英文）》)

Graf, Chris; Flanagan, Dave; Wylic, Lisa; Silver, Deirdre

Abstract: Data availability statements can provide useful information about how researchers actually share research data. We used unsupervised machine learning to analyze 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. We categorized the data availability statements and looked at trends over time. We found the expected increases in the number of data availability statements submitted over time, and marked increases that correlate with policy changes made by journals. Our open data challenge is now to use what we have learned to present researchers with relevant and easy options that help them share, and make an impact with, new research data.

Views: 750 | Downloads: 192 | Comments: 0
17. ChinaXiv:202105.00003

SEPRES: Sepsis prediction via a clinical data integration system and real-world studies in the intensive care unit

Category: Medicine and Pharmacy >> Clinical Medicine | Submitted: 2021-11-22

Chen, Qiyu; Li, Ranran; Lin, Chihche; Lai, Chiming; Chen, Dechang; Qu, Hongping; Huang, Yaling; Lu, Wenlian; Tang, Yaoqing; Li, Lei

Abstract: Background: Sepsis is a vital issue in critical care medicine, and early detection and intervention are key to survival. We aimed to establish an early warning system for sepsis based on a data integration system that can be implemented in the intensive care unit (ICU). Methods: We trained LightGBM and a multilayer perceptron on the open-source database Medical Information Mart for Intensive Care for sepsis prediction. An ensemble sepsis prediction model was established using transfer learning and ensemble learning techniques on the private dataset of Ruijin Hospital. Shapley Additive Explanations analysis was applied to present feature importance for the prediction inference. With the development of a data-integrating hub to collect and transmit data from different brands of ICU medical devices, the data integration system was established to receive, integrate, standardize, and store the real-time clinical data. In this way, the sepsis prediction model was developed in the ICU of the Ruijin Hospital for the real-world study of sepsis early warning in ICU management. The trial was registered with ClinicalTrials.gov (NCT05088850). Findings: Our best early warning model achieved an area under the receiver operating characteristic curve (AUC) of 0·9833 in the task of detecting sepsis 4 h in advance on the open-source database, while our ensemble model achieved an AUC of 0·9065-0·9436 in the retrospective research from 1-5 h in advance on the private database, and 0·8636-0·8992 in real-time real-world studies using the data integration system in the ICU of the Ruijin Hospital. In the continuous early warning process of patients admitted to the ICU, 22 patients who met the diagnostic criteria for sepsis during hospitalization were predicted as positive cases; 29 patients without sepsis were predicted as negative cases. Additionally, 17 patients were predicted as false-positive cases; in six patients with sepsis during their ICU stay, the predicted probabilities at different time nodes were all less than the warning threshold of 0·7, and these patients were predicted as false-negative cases. Interpretation: Machine learning models could allow accurate and real-time inference to detect sepsis onset up to 5 h in advance with the help of the data integration system. We identified features such as age, antibiotics, ventilation, and net balance as important for the sepsis prediction inference. We argue that this system has promising potential to improve ICU management by helping medical practitioners identify patients at risk of sepsis and prepare for timely diagnosis and intervention. Funding: Shanghai Municipal Science and Technology Major Project, the ZHANGJIANG LAB, and the Science and Technology Commission of Shanghai Municipality.

Peer review status: awaiting review

Views: 17233 | Downloads: 1577 | Comments: 0
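
The real-world counts reported in entry 17 (22 true positives, 29 true negatives, 17 false positives, and 6 false negatives) can be turned into the usual confusion-matrix rates. The abstract itself does not report these derived rates; the sketch below only applies the standard definitions to the stated counts, and the function name `confusion_metrics` is an assumption.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard rates derived from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # share of sepsis cases flagged
        "specificity": tn / (tn + fp),            # share of non-sepsis cases cleared
        "precision": tp / (tp + fp),              # share of alarms that were sepsis
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts taken from the abstract of entry 17.
rates = confusion_metrics(tp=22, tn=29, fp=17, fn=6)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these counts, sensitivity is 22/28 (about 0.786) and specificity is 29/46 (about 0.630). Both depend on the warning threshold of 0·7 mentioned in the abstract, whereas the AUC summarizes performance across all thresholds.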
18. ChinaXiv:202011.00129

Precipitation forecasting by large-scale climate indices and machine learning techniques

Category: Earth Sciences >> Geography | Submitted: 2020-11-25 | Partner journal: Journal of Arid Land (《干旱区科学》)

GHOLAMI ROSTAM, Mehdi; SADATINEJAD, Seyyed Javad; MALEKIAN, Arash

Abstract: Global warming is one of the most complicated challenges of our time, placing considerable strain on our societies and on the environment. Its impacts are felt to an unprecedented degree in a wide variety of ways, from shifting weather patterns that threaten food production to rising sea levels that increase the risk of catastrophic flooding. Among all aspects related to global warming, there is growing concern about water resource management. This field aims to prevent future water crises that threaten human beings. The very first stage in such management is to identify the prospective climate parameters influencing future water resource conditions. Numerous prediction models, methods, and tools have been developed and applied for this purpose so far. In line with this trend, the current study compares three optimization algorithms on the platform of a multilayer perceptron (MLP) network to explore any meaningful connection between large-scale climate indices (LSCIs) and precipitation in the capital of Iran, a country that is located in an arid and semi-arid region and suffers from severe water scarcity caused by years of mismanagement and intensified by global warming. This situation has driven a large share of the population to migrate towards more developed cities within the country, especially Tehran. Therefore, the current and future environmental conditions of this city, especially its water supply conditions, are of great importance. To tackle this problem, an outlook for future precipitation should be provided, and appropriate forecasting trajectories compatible with this region's characteristics should be developed. To this end, the present study investigates three training methods, namely backpropagation (BP), genetic algorithms (GAs), and particle swarm optimization (PSO), on an MLP platform. Two frameworks distinguished by their input compositions are defined in this study: the Concurrent Model Framework (CMF) and the Integrated Model Framework (IMF). Through these two frameworks, 13 cases are generated: 12 cases within CMF, each of which contains all selected LSCIs at the same lead time, and one case within IMF that is constituted from the combination of the LSCIs most correlated with Tehran precipitation at each lead time. Following the evaluation of all model performances through related statistical tests, a Taylor diagram is used to compare the final selected models across all three optimization algorithms, the best of which is found to be MLP-PSO in IMF.

Views: 4280 | Downloads: 931 | Comments: 0
19. ChinaXiv:202404.00233

Elucidating Electronic Structure Variations in Nucleic Acid-Protein Complexes Involved in Transcription Regulation Using a Tight-Binding Approach

Categories: Biology >> Biophysics; Chemistry >> Physical Chemistry | Submitted: 2024-04-16

Likai Du, Chengbu Liu

Abstract: Transcription factors (TFs) are proteins that regulate the transcription of genetic information from DNA to messenger RNA by binding to a specific DNA sequence. Nucleic acid-protein interactions are crucial in regulating transcription in biological systems. This work presents a quick and convenient method for constructing tight-binding models and offers physical insights into the electronic structure properties of transcription factor complexes and DNA motifs. The tight-binding Hamiltonian parameters are generated using the random forest regression algorithm, which reproduces the given ab initio level calculations with reasonable accuracy. We present a library of residue-level parameters derived from extensive electronic structure calculations over various possible combinations of nucleobases and amino acid side chains from high-quality DNA-protein complex structures. As an example, our approach can reasonably generate the subtle electronic structure details for the orthologous transcription factors human AP-1 and Epstein-Barr virus Zta within a few seconds on a laptop. This method potentially enhances our understanding of the electronic structure variations of gene-protein interaction complexes, even those involving dozens of proteins and genes. We hope this study offers a powerful tool for analyzing transcription regulation mechanisms at the electronic structure level.

Peer review status: awaiting review

Views: 555 | Downloads: 154 | Comments: 0
20. ChinaXiv:202404.00125

Unveiling the Re, Cr, and I diffusion in saturated compacted bentonite using machine-learning methods

Category: Nuclear Science and Technology >> Nuclear Science and Technology | Submitted: 2024-04-08

Zheng-Ye Feng, Jun-Lei Tian, Tao Wu, Guo-Jun Wei, Zhi-Long Li, Xiao-Qiong Shi, Yong-Jia Wang, Qing-Feng Li

Abstract: The safety assessment of high-level radioactive waste repositories requires a high predictive accuracy for radionuclide diffusion and a comprehensive understanding of the diffusion mechanism. In this study, a through-diffusion method and six machine-learning methods were employed to investigate the diffusion of ReO$_4^-$, HCrO$_4^-$, and I$^-$ in saturated compacted bentonite under different salinities and compacted dry densities. The machine-learning models were trained using two datasets. One dataset contained six input features and 293 instances obtained from the diffusion database system of the Japan Atomic Energy Agency (JAEA-DDB) and 15 publications. The other dataset, comprising 15,000 pseudo-instances, was produced using a multi-porosity model and contained eight input features. The results indicate that the former dataset yielded a higher predictive accuracy than the latter. Light gradient-boosting exhibited a higher prediction accuracy ($R^2$ = 0.92) and lower error (MSE = 0.01) than the other machine-learning algorithms. In addition, Shapley Additive Explanations, Feature Importance, and Partial Dependence Plot analysis results indicate that the rock capacity factor and compacted dry density had the two most significant effects on predicting the effective diffusion coefficient, thereby offering valuable insights.

Peer review status: passed

Views: 658 | Downloads: 172 | Comments: 0