• A Novel Framework for Future Natural Language Processing From a Database Perspective

    Category: Computer Science >> Natural Language Understanding and Machine Translation Submitted: 2023-11-01

    Abstract: Most research and applications on natural language still concentrate on its superficial features and structures. However, natural language is essentially a way of encoding information and knowledge. Thus, the focus should be on what is encoded and how it is encoded. In line with this, we suggest a database-based approach for natural language processing that emulates the encoding of information and knowledge to build models. Based on these models, 1) generating sentences becomes akin to reading data from the models (or databases) and encoding it following some rules; 2) understanding sentences involves decoding rules and a series of boolean operations on the databases; 3) learning can be accomplished by writing on the databases. Our method closely mirrors how the human brain processes information, offering excellent interpretability and expandability.

  • Understanding principal component analysis

    Category: Statistics >> Applied Statistical Mathematics Submitted: 2023-09-15

    Abstract: Principal component analysis (PCA) is a frequently used machine learning method. In this paper, the PCA operation is explained through examples illustrated with Python programs. A proof of the diagonalizability of real symmetric matrices is also included, which may help in understanding the mathematics behind PCA.
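
As an illustration of the idea in this abstract (not the paper's own program), the following minimal sketch finds the first principal component of a small dataset. It assumes the sample covariance matrix with an n - 1 denominator and uses power iteration as a stand-in for a full eigendecomposition; the function name `top_principal_component` and the toy data are hypothetical.

```python
import math


def top_principal_component(rows, iterations=200):
    """Return (variance, unit direction) of the first principal component.

    Assumption: sample covariance with an n - 1 denominator, and power
    iteration instead of a full eigendecomposition.
    """
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # The covariance matrix is real and symmetric, hence diagonalizable.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero covariance or an unlucky start vector
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    return variance, v


# Toy data lying exactly on the line y = 2x (hypothetical example).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
variance, direction = top_principal_component(data)
print(variance, direction)
```

For data lying on a single line, the returned direction is parallel to that line and the returned variance equals the total variance of the data. Power iteration can stall if the start vector happens to be orthogonal to the leading eigenvector, so a library eigensolver is the safer choice in practice.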

  • DEED: A general quantization scheme for saving bits in communication

    Category: Mathematics >> Control and Optimization Submitted: 2020-06-16

    Abstract: Quantization is a popular technique to reduce communication in distributed optimization. Motivated by the classical work on inexact gradient descent (GD) \cite{bertsekas2000gradient}, we provide a general convergence analysis framework for inexact GD that is tailored for quantization schemes. We also propose a quantization scheme, Double Encoding and Error Diminishing (DEED). DEED can achieve small communication complexity in three settings: frequent-communication large-memory, frequent-communication small-memory, and infrequent-communication (e.g. federated learning). More specifically, in the frequent-communication large-memory setting, DEED can be easily combined with Nesterov's method, so that the total number of bits required is $\tilde{O}(\sqrt{\kappa} \log(1/\epsilon))$, where $\tilde{O}$ hides numerical constants and $\log \kappa$ factors. In the frequent-communication small-memory setting, DEED combined with SGD requires only $\tilde{O}(\kappa \log(1/\epsilon))$ bits in the interpolation regime. In the infrequent-communication setting, DEED combined with Federated Averaging requires a smaller total number of bits than Federated Averaging alone. All these algorithms converge at the same rate as their non-quantized versions, while using a smaller number of bits.

  • Predictions of nuclear charge radii based on the convolutional neural network

    Category: Physics >> Nuclear Physics Submitted: 2023-09-07

    Abstract: In this study, we developed a neural network that incorporates a fully connected layer with a convolutional layer to predict the nuclear charge radii based on the relationships between four local nuclear charge radii. The convolutional neural network (CNN) combines the isospin and pairing effects to describe the charge radii of nuclei with $A \geq 39$ and $Z \geq 20$. The developed neural network achieved a root-mean-square (RMS) deviation of 0.0195 fm for a dataset with 928 nuclei. Specifically, the CNN reproduced the trend of the inverted parabolic behavior and odd-even staggering observed in the calcium isotopic chain, demonstrating reliable predictive capability.

  • Comparative Evaluation and Comprehensive Analysis of Machine Learning Models for Regression Problems

    Category: Computer Science >> Integrated Theory of Computer Science Submitted: 2022-11-28 Partner journal: Data Intelligence

    Abstract: Artificial intelligence and machine learning applications are of significant importance in almost every field of human life to solve problems or support human experts. However, determining which machine learning model achieves a superior result for a particular problem within the wide range of real-life application areas is still a challenging task for researchers. The success of a model can be affected by several factors such as dataset characteristics, training strategy and model responses. Therefore, a comprehensive analysis is required to determine model ability and the efficiency of the considered strategies. This study implemented ten benchmark machine learning models on seventeen varied datasets. Experiments were performed using four different training strategies: 60:40, 70:30, and 80:20 hold-out splits and five-fold cross-validation. We used three evaluation metrics to evaluate the experimental results: mean squared error, mean absolute error, and coefficient of determination ($R^2$ score). The considered models are analyzed, and each model's advantages, disadvantages, and data dependencies are indicated. Across the large number of experiments performed, the deep Long Short-Term Memory (LSTM) neural network outperformed the other considered models, namely, decision tree, linear regression, support vector regression with linear and radial basis function kernels, random forest, gradient boosting, extreme gradient boosting, shallow neural network, and deep neural network. It has also been shown that cross-validation has a tremendous impact on the results of the experiments and should be considered for model evaluation in regression studies where data mining or selection is not performed.
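
The three evaluation metrics and the hold-out ratios named in this abstract can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions, not the study's code; the function names, the unshuffled `holdout_split` helper, and the toy values are assumptions made for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def holdout_split(items, train_fraction):
    """Split a sequence into (train, test) without shuffling.

    Hypothetical helper: a real experiment would normally shuffle first.
    """
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]


# Toy targets and predictions (hypothetical values).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The three hold-out ratios from the abstract applied to ten samples.
samples = list(range(10))
split_sizes = [tuple(len(part) for part in holdout_split(samples, f))
               for f in (0.6, 0.7, 0.8)]
print(mse, mae, r2, split_sizes)
```

Five-fold cross-validation repeats the same evaluation on five different train/test partitions and averages the scores, which is why it tends to give a more stable estimate than any single hold-out split.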

  • Machine learning the apparent diffusion coefficient of Se(IV) in compacted bentonite

    Category: Nuclear Science and Technology >> Radiation Physics and Technology Submitted: 2024-05-22

    Abstract: Light Gradient Boosting Machine (LightGBM) and Random Forest (RF) algorithms were used to predict the apparent diffusion coefficient of Se(IV) in compacted bentonite. Seven instances of Se(IV) were measured using the through-diffusion method. LightGBM ($R^2$ = 0.98 and RMSE = 0.025) exhibited superior predictive accuracy with a training dataset consisting of 956 instances and eight input features from the Japan Atomic Energy Agency (JAEA-DDB). Shapley Additive Explanation and Partial Dependence Plot analyses revealed valuable insights into the diffusion mechanism of adsorbed anions, obtained by evaluating the relationships between the apparent diffusion coefficient and the dependency of each input feature.

  • HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction

    Category: Computer Science >> Integrated Theory of Computer Science Submitted: 2022-11-28 Partner journal: Data Intelligence

    Abstract: Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power are paving the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities are slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss the FAIR Digital Object (FDO) and Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany by using ML. Our concept envisages the raster datacube to provide data harmonization and fast and scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.

  • Canonical Workflow for Machine Learning Tasks

    Category: Computer Science >> Integrated Theory of Computer Science Submitted: 2022-11-28 Partner journal: Data Intelligence

    Abstract: There are huge gaps between (1) the state of workflow technology and the practices in the many labs working with data-driven methods, and (2) the awareness of the FAIR principles and the lack of change in practices during the last 5 years. The CWFR concept was defined to combine two intentions: increasing the use of workflow technology and improving FAIR compliance. In the study described in this paper we indicate how this could be applied to machine learning, which is now used by almost all research disciplines, with the well-known effects of a huge lack of repeatability and reproducibility. Researchers will only change practices if they can work efficiently and are not loaded with additional tasks. A comprehensive CWFR framework would be an umbrella for all steps that need to be carried out to do machine learning on selected data collections and would immediately create comprehensive and FAIR-compliant documentation. The researcher is guided by such a framework, and information once entered can easily be shared and reused. The many iterations normally required in machine learning can be dealt with efficiently using CWFR methods. Libraries of components that can be easily orchestrated, using FAIR Digital Objects as a common entity to document all actions and to exchange information between steps without the researcher needing to understand anything about PIDs and FDO details, are probably the way to increase efficiency in repeating research workflows. As the Galaxy project indicates, the availability of supporting tools will be important to let researchers use these methods. Unlike what the Galaxy framework suggests, however, it would be necessary to include all steps necessary for doing a machine learning task, including those that require human interaction, and to document all phases with the help of structured FDOs.

  • Reliable calculations of nuclear binding energies by the Gaussian process of machine learning

    Category: Physics >> Nuclear Physics Submitted: 2024-05-11

    Abstract: Reliable calculations of nuclear binding energies are crucial for advancing research in nuclear physics. Machine learning provides an innovative approach to exploring complex physical problems. In this study, the nuclear binding energies are modeled directly using a machine-learning method called the Gaussian process. First, the binding energies for 2238 nuclei with Z > 20 and N > 20 are calculated using the Gaussian process in a physically motivated feature space, yielding an average deviation of 0.046 MeV and a standard deviation of 0.066 MeV. The results show the good learning ability of the Gaussian process in the studies of binding energies. Then, the predictive power of the Gaussian process is studied by calculating the binding energies for 108 nuclei newly included in AME2020. The theoretical results are in good agreement with the experimental data, reflecting the good predictive power of the Gaussian process. Moreover, the α-decay energies for 1169 nuclei with 50 ≤ Z ≤ 110 are derived from the theoretical binding energies calculated using the Gaussian process. The average deviation and the standard deviation are, respectively, 0.047 MeV and 0.070 MeV. Notably, the calculated α-decay energies for the two new isotopes $^{204}$Ac [M. H. Huang et al., Phys. Lett. B 834, 137484 (2022)] and $^{207}$Th [H. B. Yang et al., Phys. Rev. C 105, L051302 (2022)] agree well with the latest experimental data. These results demonstrate that the Gaussian process is reliable for the calculations of nuclear binding energies. Finally, the α-decay properties of some unknown actinide nuclei are predicted using the Gaussian process. The predicted results can be useful guides for future research on binding energies and α-decay properties.

  • Comparison of different kernel functions in nuclear charge radius predictions by the kernel ridge regression method

    Category: Physics >> Nuclear Physics Submitted: 2024-04-01

    Abstract: Using two nuclear models, i) the relativistic continuum Hartree-Bogoliubov (RCHB) theory and ii) the Weizsäcker-Skyrme (WS) model WS$^\ast$, the performances of nine kinds of kernel functions in the kernel ridge regression (KRR) method are investigated by comparing the accuracies of describing the experimental nuclear charge radii and the extrapolation abilities. It is found that, except for the inverse power kernel, the other kernels can reach the same level of around 0.015-0.016 fm for these two models with the KRR method. The extrapolation ability of each kernel for the neutron-rich region depends on the training data. Our investigation shows that the performances of the power kernel and Multiquadric kernel are better in the RCHB+KRR calculation, and the Gaussian kernel is better in the WS$^\ast$+KRR calculation. In addition, the performance of different basis functions in the radial basis function method is also investigated for comparison. The results are similar to those of the KRR method. The influence of different kernels on the KRR reconstruction function is discussed by investigating the whole nuclear chart. Finally, the charge radii of some specific isotopic chains have been investigated by the RCHB+KRR with the power kernel and the WS$^\ast$+KRR with the Gaussian kernel. The charge radii and most of the specific features in these isotopic chains can be reproduced after considering the KRR method.

  • Nuclear charge radius predictions by kernel ridge regression with odd-even effects

    Category: Physics >> Nuclear Physics Submitted: 2023-12-13

    Abstract: The extended kernel ridge regression (EKRR) method with odd-even effects was adopted to improve the description of the nuclear charge radius using five commonly used nuclear models. These are: (i) the isospin-dependent $A^{1/3}$ formula; (ii) the relativistic continuum Hartree-Bogoliubov (RCHB) theory; (iii) the Hartree-Fock-Bogoliubov (HFB) model HFB25; (iv) the Weizsäcker-Skyrme (WS) model WS$^\ast$; and (v) the HFB25$^\ast$ model. In the last two models, the charge radii were calculated using a five-parameter formula with the nuclear shell corrections and deformations obtained from the WS and HFB25 models, respectively. For each model, the resultant root-mean-square deviation for the 1014 nuclei with proton number $Z \geq 8$ can be significantly reduced to 0.009-0.013 fm after considering the modification with the EKRR method. The best among them was the RCHB model, with a root-mean-square deviation of 0.0092 fm. The extrapolation abilities of the KRR and EKRR methods for the neutron-rich region were examined, and it was found that after considering the odd-even effects, the extrapolation power was improved compared with that of the original KRR method. The strong odd-even staggering of nuclear charge radii of Ca and Cu isotopes and the abrupt kinks across the neutron $N=126$ and 82 shell closures were also calculated and could be reproduced quite well by calculations using the EKRR method.

  • Machine Learning and Font Design -- Taking the Print Advertisement Generated by Artificial Intelligence As an Example

    Category: Engineering and Technical Sciences >> Other Disciplines of Engineering and Technical Sciences Submitted: 2023-01-03 Partner journal: 2022 3rd Symposium on Art Design, Communication and Engineering Science

    Abstract: As artificial intelligence has matured, it has been widely combined with print advertising, and AI-generated print advertisements have come into wide use. However, AI-generated print ads have gradually exposed shortcomings in font design. Relying only on a single copyrighted font library for graphic composition easily leads to fonts that do not suit the advertising theme, are inconsistent with the brand image positioning, or fail to support the advertising claims. Therefore, introducing machine-learning-based font design into the field of print advertising has practical value.

  • Predicting Sepsis via an ICU Clinical Data Integration System

    Category: Medicine and Pharmacy >> Clinical Medicine Submitted: 2021-05-07

    Abstract: Sepsis is a critical issue in critical care medicine, and early detection and intervention are key for survival. We established a sepsis early warning system based on a data integration platform that can be implemented in the ICU. The sepsis early warning module can detect the onset of sepsis up to 5 hours in advance, and the data integration platform integrates, standardizes, and stores information from different medical devices, making the inference of the early warning module possible. Our best early warning model achieved an AUC of 0.9833 in the task of detecting sepsis 4 hours in advance on the open-source database. Our data integration platform has already been operational in a hospital for months.

  • The Open Data Challenge: An Analysis of 124,000 Data Availability Statements and an Ironic Lesson about Data Management Plans

    Category: Computer Science >> Integrated Theory of Computer Science Submitted: 2022-11-18 Partner journal: Data Intelligence

    Abstract: Data availability statements can provide useful information about how researchers actually share research data. We used unsupervised machine learning to analyze 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. We categorized the data availability statements and looked at trends over time. We found expected increases in the number of data availability statements submitted over time, and marked increases that correlate with policy changes made by journals. Our open data challenge becomes to use what we have learned to present researchers with relevant and easy options that help them to share and make an impact with new research data.

  • SEPRES: Sepsis prediction via a clinical data integration system and real-world studies in the intensive care unit

    Category: Medicine and Pharmacy >> Clinical Medicine Submitted: 2021-11-22

    Abstract: Background: Sepsis is a vital concern in critical care medicine, and early detection and intervention are key to survival. We aimed to establish an early warning system for sepsis based on a data integration system that can be implemented in the intensive care unit (ICU). Methods: We trained LightGBM and a multilayer perceptron on the open-source database Medical Information Mart for Intensive Care for sepsis prediction. An ensemble sepsis prediction model was established based on transfer learning and ensemble learning techniques on the private dataset of Ruijin Hospital. Shapley Additive Explanations analysis was applied to present feature importance in the prediction inference. With the development of a data-integrating hub to collect and transmit data from different brands of ICU medical devices, the data integration system was established to receive, integrate, standardize, and store the real-time clinical data. In this way, the sepsis prediction model was applied in the ICU of the Ruijin Hospital for the real-world study of sepsis early warning in ICU management. The trial was registered with ClinicalTrials.gov (NCT05088850). Findings: Our best early warning model achieved an area under the receiver operating characteristic curve (AUC) of 0·9833 in the task of detecting sepsis 4 h in advance on the open-source database, while our ensemble model achieved an AUC of 0·9065-0·9436 in the retrospective research from 1 to 5 h in advance on the private database, and 0·8636-0·8992 in real-time real-world studies using the data integration system in the ICU of the Ruijin Hospital. In the continuous early warning process of patients admitted to the ICU, 22 patients who met the diagnostic criteria for sepsis during hospitalization were predicted as positive cases; 29 patients without sepsis were predicted as negative cases. Additionally, 17 patients were predicted as false-positive cases; in six patients with sepsis during ICU stay, the predicted probabilities at different time nodes were all less than the warning threshold of 0·7, and they were predicted as false-negative cases. Interpretation: Machine learning models could allow accurate and real-time inference to detect sepsis onset up to 5 h in advance with the help of the data integration system. We identified features such as age, antibiotics, ventilation, and net balance to be important for the sepsis prediction inference. We argue that this system has promising potential to improve ICU management by helping medical practitioners identify at-sepsis-risk patients and prepare for timely diagnosis and intervention. Funding: Shanghai Municipal Science and Technology Major Project, the ZHANGJIANG LAB, and the Science and Technology Commission of Shanghai Municipality.
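
The patient counts reported in the Findings can be turned into standard screening metrics. The sketch below is a worked calculation rather than a result stated by the authors: it assumes the 22, 29, 17, and 6 patients correspond to true positives, true negatives, false positives, and false negatives respectively, and the variable names are hypothetical.

```python
# Counts read from the abstract (the mapping to confusion-matrix cells is an assumption).
true_positive = 22   # sepsis patients predicted as positive
true_negative = 29   # non-sepsis patients predicted as negative
false_positive = 17  # patients flagged without sepsis
false_negative = 6   # sepsis patients never crossing the 0.7 threshold

total = true_positive + true_negative + false_positive + false_negative

# Sensitivity (recall): share of sepsis patients that were flagged.
sensitivity = true_positive / (true_positive + false_negative)
# Specificity: share of non-sepsis patients that were not flagged.
specificity = true_negative / (true_negative + false_positive)
# Precision: share of flagged patients that actually had sepsis.
precision = true_positive / (true_positive + false_positive)
# Accuracy: share of all patients classified correctly.
accuracy = (true_positive + true_negative) / total

print(f"n={total} sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"precision={precision:.3f} accuracy={accuracy:.3f}")
```

Under that reading, the system flags most sepsis patients at the cost of a sizeable number of false alarms, which is a common trade-off for an early warning threshold.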

  • Precipitation forecasting by large-scale climate indices and machine learning techniques

    Category: Earth Sciences >> Geography Submitted: 2020-11-25 Partner journal: Journal of Arid Land

    Abstract: Global warming is one of the most complicated challenges of our time, causing considerable tension in our societies and the environment. The impacts of global warming are felt in an unprecedentedly wide variety of ways, from shifting weather patterns that threaten food production to rising sea levels that increase the risk of catastrophic flooding. Among all aspects related to global warming, there is a growing concern about water resource management. This field is aimed at preventing future water crises that threaten human beings. The very first stage in such management is to recognize the prospective climate parameters influencing future water resource conditions. Numerous prediction models, methods and tools have been developed and applied for this purpose so far. In line with this trend, the current study compares three optimization algorithms on the platform of a multilayer perceptron (MLP) network to explore any meaningful connection between large-scale climate indices (LSCIs) and precipitation in the capital of Iran, a country which is located in an arid and semi-arid region and suffers from severe water scarcity caused by mismanagement over the years and intensified by global warming. This situation has driven a large part of the population to migrate towards more developed cities within the country, especially Tehran. Therefore, the current and future environmental conditions of this city, especially its water supply conditions, are of great importance. To tackle this complication, an outlook for future precipitation should be provided and appropriate forecasting trajectories compatible with this region's characteristics should be developed. To this end, the present study investigates three training methods, namely backpropagation (BP), genetic algorithms (GAs), and particle swarm optimization (PSO), on an MLP platform. Two frameworks distinguished by their input compositions are defined in this study: the Concurrent Model Framework (CMF) and the Integrated Model Framework (IMF). Through these two frameworks, 13 cases are generated: 12 cases within CMF, each of which contains all selected LSCIs at the same lead-time, and one case within IMF that is constituted from the combination of the LSCIs most correlated with Tehran precipitation at each lead-time. Following the evaluation of all model performances through related statistical tests, a Taylor diagram is used to compare the final selected models for all three optimization algorithms, the best of which is found to be MLP-PSO in IMF.

  • Elucidating Electronic Structure Variations in Nucleic Acid-Protein Complexes Involved in Transcription Regulation Using a Tight-Binding Approach

    Category: Biology >> Biophysics Category: Chemistry >> Physical Chemistry Submitted: 2024-04-16

    Abstract: Transcription factors (TFs) are proteins that regulate the transcription of genetic information from DNA to messenger RNA by binding to a specific DNA sequence. Nucleic acid-protein interactions are crucial in regulating transcription in biological systems. This work presents a quick and convenient method for constructing tight-binding models and offers physical insights into the electronic structure properties of transcription factor complexes and DNA motifs. The tight-binding Hamiltonian parameters are generated using the random forest regression algorithm, which reproduces the given ab initio level calculations with reasonable accuracy. We present a library of residue-level parameters derived from extensive electronic structure calculations over various possible combinations of nucleobases and amino acid side chains from high-quality DNA-protein complex structures. As an example, our approach can reasonably generate the subtle electronic structure details for the orthologous transcription factors human AP-1 and Epstein-Barr virus Zta within a few seconds on a laptop. This method potentially enhances our understanding of the electronic structure variations of gene-protein interaction complexes, even those involving dozens of proteins and genes. We hope this study offers a powerful tool for analyzing transcription regulation mechanisms at an electronic structural level.

  • Unveiling the Re, Cr, and I diffusion in saturated compacted bentonite using machine-learning methods

    Category: Nuclear Science and Technology >> Nuclear Science and Technology Submitted: 2024-04-08

    Abstract: The safety assessment of high-level radioactive waste repositories requires a high predictive accuracy for radionuclide diffusion and a comprehensive understanding of the diffusion mechanism. In this study, a through-diffusion method and six machine-learning methods were employed to investigate the diffusion of ReO$_4^-$, HCrO$_4^-$, and I$^-$ in saturated compacted bentonite under different salinities and compacted dry densities. The machine-learning models were trained using two datasets. One dataset contained six input features and 293 instances obtained from the diffusion database system of the Japan Atomic Energy Agency (JAEA-DDB) and 15 publications. The other dataset, comprising 15,000 pseudo-instances, was produced using a multi-porosity model and contained eight input features. The results indicate that the former dataset yielded a higher predictive accuracy than the latter. Light gradient-boosting exhibited a higher prediction accuracy ($R^2$ = 0.92) and lower error (MSE = 0.01) than the other machine-learning algorithms. In addition, Shapley Additive Explanations, Feature Importance, and Partial Dependence Plot analysis results indicate that the rock capacity factor and compacted dry density had the two most significant effects on predicting the effective diffusion coefficient, thereby offering valuable insights.

  • Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review

    Category: Psychology >> Applied Psychology Submitted: 2024-01-09

    Abstract: This paper explores the frontiers of large language models (LLMs) in psychology applications. Psychology has undergone several theoretical changes, and the current use of Artificial Intelligence (AI) and Machine Learning, particularly LLMs, promises to open up new research directions. We provide a detailed exploration of how LLMs like ChatGPT are transforming psychological research. The paper discusses the impact of LLMs across various branches of psychology, including cognitive and behavioral, clinical and counseling, educational and developmental, and social and cultural psychology, highlighting their potential to simulate aspects of human cognition and behavior. The paper delves into the capabilities of these models to emulate human-like text generation, offering innovative tools for literature review, hypothesis generation, experimental design, experimental subjects, data analysis, academic writing, and peer review in psychology. While LLMs are essential in advancing research methodologies in psychology, the paper also cautions about their technical and ethical challenges. There are issues like data privacy, the ethical implications of using LLMs in psychological research, and the need for a deeper understanding of these models' limitations. Researchers should use LLMs responsibly in psychological studies, adhering to ethical standards and considering the potential consequences of deploying these technologies in sensitive areas. Overall, the article provides a comprehensive overview of the current state of LLMs in psychology, exploring potential benefits and challenges. It serves as a call to action for researchers to leverage LLMs' advantages responsibly while addressing associated risks.

  • Predicting League of Legends Match Results Based on Machine

    Category: Computer Science >> Natural Language Understanding and Machine Translation Submitted: 2024-01-03

    Abstract: League of Legends (LoL) is a highly popular multiplayer online competitive game, featuring intricate game mechanics and team cooperation, making the prediction of match outcomes a challenging task. This study utilizes a dataset from Kaggle, comprising 9,879 ranked matches ranging from Diamond I to Master tier, to build a machine learning model predicting the ultimate winner, either the blue or red team, based on the features of the first 10 minutes of gameplay. Through steps such as data loading, preprocessing, and feature engineering, we provided effective inputs for the model. For model selection, we opted for the Logistic Regression algorithm, achieving a model accuracy of 0.7277 through data splitting and training. This accuracy supports predictions of the winning side, whether blue or red. However, to further enhance model performance, we recommend exploring additional feature engineering methods, investigating alternative machine learning algorithms, and fine-tuning hyperparameters. The introduction of deep learning models is also a promising avenue to better capture the complex relationships within the game. Through these improvements, we anticipate increasing the model's predictive accuracy for future matches, offering valuable insights for game development and enhancement.