《数据智能》刊载数据智能相关研究新成果、新方法、新技术、促进学术交流,推动数据融合及数据与数据处理平台的有效共享,助力知识实时构建,提高我国在该领域的研究应用水平和国际影响力
关键词: Data science education; Reproducibility; Reproducible data analysis; Communication; Action research;
DOI:10.1162/dint_a_00053
提交时间: 2022-11-29
摘要:Reproducibility is a cornerstone of scientific research. Data science is not an exception. In recent years scientists were concerned about a large number of irreproducible studies. Such reproducibility crisis in science could severely undermine public trust in science and science-based public policy. Recent efforts to promote reproducible research mainly focused on matured scientists and much less on student training. In this study, we conducted action research on students in data science to evaluate to what extent students are ready for communicating reproducible data analysis. The results show that although two-thirds of the students claimed they were able to reproduce results in peer reports, only one-third of reports provided all necessary information for replication. The actual replication results also include conflicting claims; some lacked comparisons of original and replication results, indicating that some students did not share a consistent understanding of what reproducibility means and how to report replication results. The findings suggest that more training is needed to help data science students communicating reproducible data analysis.
点击量
7266
下载量
1409
评论
0
关键词: Open data; Data sharing; Data citation; Open research;
DOI:10.1162/ dint_a_00021
提交时间: 2022-11-29
摘要:It is easy to argue that open data is critical to enabling faster and more effective research discovery. In this article, we describe the approach we have taken at Wiley to support open data and to start enabling more data to be FAIR data (Findable, Accessible, Interoperable and Reusable) with the implementation of four data policies: “Encourages”, “Expects”, “Mandates” and “Mandates and Peer Reviews Data”. We describe the rationale for these policies and levels of adoption so far. In the coming months we plan to measure and monitor the implementation of these policies via the publication of data availability statements and data citations. With this information, we’ll be able to celebrate adoption of data-sharing practices by the research communities we work with and serve, and we hope to showcase researchers from those communities leading in open research.
点击量
8355
下载量
1703
评论
0
关键词: Open data; Data sharing; Data citation; Open research;
DOI:10.1162/dint_a_00020
提交时间: 2022-11-29
摘要:Over the past five years, Elsevier has focused on implementing FAIR and best practices in data management, from data preservation through reuse. In this paper we describe a series of efforts undertaken in this time to support proper data management practices. In particular, we discuss our journal data policies and their implementation, the current status and future goals for the research data management platform Mendeley Data, and clear and persistent linkages to individual data sets stored on external data repositories from corresponding published papers through partnership with Scholix. Early analysis of our data policies implementation confirms significant disparities at the subject level regarding data sharing practices, with most uptake within disciplines of Physical Sciences. Future directions at Elsevier include implementing better discoverability of linked data within an article and incorporating research data usage metrics.
点击量
7883
下载量
1456
评论
0
关键词: Knowledge graph; Search engine; Question answering;
DOI:10.1162/dint_a_00019
提交时间: 2022-11-29
摘要:Knowledge graph (KG) has played an important role in enhancing the performance of many intelligent systems. In this paper, we introduce the solution of building a large-scale multi-source knowledge graph from scratch in Sogou Inc., including its architecture, technical implementation and applications. Unlike previous works that build knowledge graph with graph databases, we build the knowledge graph on top of SogouQdb, a distributed search engine developed by Sogou Web Search Department, which can be easily scaled to support petabytes of data. As a supplement to the search engine, we also introduce a series of models to support inference and graph based querying. Currently, the data of Sogou knowledge graph that are collected from 136 different websites and constantly updated consist of 54 million entities and over 600 million entity links. We also introduce three applications of knowledge graph in Sogou Inc.: entity detection and linking, knowledge based question answering and knowledge based dialogue system. These applications have been used in Web search products to help user acquire information more efficiently.
点击量
9193
下载量
1790
评论
0
关键词: Open government data; Risk; Taxonomy; Lifecycle; Risk management;
DOI:10.1162/dint_a_00018
提交时间: 2022-11-29
摘要:For many government departments, uncertainty aversion is a source of barriers in the advancement of data openness. A more active response to potential risks is needed and necessitates an in-depth examination of risks related to open government data (OGD). With a cross-case study in which three cases from the United Kingdom, the United States and China are examined, this study identifies potential risks that might emerge at different stages of the lifecycle of OGD programs and constructs a taxonomy model for them. The taxonomy model distinguishes the “risks from OGD” from the “risks to OGD”, which can help government departments make better responses. Finally, risk response strategies are suggested based on the research results.
点击量
7484
下载量
1605
评论
0
关键词: Knowledge Graph; Multi-modal Learning; Poly Encoders;
DOI:10.1162/dint_a_00146
提交时间: 2022-11-28
摘要:Multi-modal entity linking plays a crucial role in a wide range of knowledge-based modal-fusion tasks, i.e., multi-modal retrieval and multi-modal event extraction. We introduce the new ZEro-shot Multi-modal Entity Linking (ZEMEL) task, the format is similar to multi-modal entity linking, but multi-modal mentions are linked to unseen entities in the knowledge graph, and the purpose of zero-shot setting is to realize robust linking in highly specialized domains. Simultaneously, the inference efficiency of existing models is low when there are many candidate entities. On this account, we propose a novel model that leverages visual#2; linguistic representation through the co-attentional mechanism to deal with the ZEMEL task, considering the trade-off between performance and efficiency of the model. We also build a dataset named ZEMELD for the new task, which contains multi-modal data resources collected from Wikipedia, and we annotate the entities as ground truth. Extensive experimental results on the dataset show that our proposed model is effective as it significantly improves the precision from 68.93% to 82.62% comparing with baselines in the ZEMEL task.
点击量
6580
下载量
1435
评论
0
关键词: Public culture; Short text clustering; Topic modeling; LDA; Big data analysis;
DOI: 10.1162/dint_a_00121
提交时间: 2022-11-28
摘要:In this study, we uncover the topics of Chinese public cultural activities in 2020 with a two-step short text clustering (self-taught neural networks and graph-based clustering) and topic modeling approach. The dataset we use for this research is collected from 108 websites of libraries and cultural centers, containing over 17,000 articles. With the novel framework we propose, we derive 3 clusters and 8 topics from 21 provincial#2; level regions in China. By plotting the topic distribution of each cluster, we are able to shows unique tendencies of local cultural institutes, that is, free lessons and lectures on art and culture, entertainment and service for socially vulnerable groups, and the preservation of intangible cultural heritage respectively. The findings of our study provide decision-making support for cultural institutes, thus promoting public cultural service from a data-driven perspective.
点击量
7804
下载量
1982
评论
0
关键词: Graph pattern matching; Medical Knowledge Graphs; Fuzzy constraints; Breast cancer; Diagnostic classification;
DOI:10.1162/dint_a_00153
提交时间: 2022-11-28
摘要:The research on graph pattern matching (GPM) has attracted a lot of attention. However, most of the research has focused on complex networks, and there are few researches on GPM in the medical field. Hence, with GPM this paper is to make a breast cancer-oriented diagnosis before the surgery. Technically, this paper has firstly made a new definition of GPM, aiming to explore the GPM in the medical field, especially in Medical Knowledge Graphs (MKGs). Then, in the specific matching process, this paper introduces fuzzy calculation, and proposes a multi-threaded bidirectional routing exploration (M-TBRE) algorithm based on depth first search and a two-way routing matching algorithm based on multi-threading. In addition, fuzzy constraints are introduced in the M-TBRE algorithm, which leads to the Fuzzy-M-TBRE algorithm. The experimental results on the two datasets show that compared with existing algorithms, our proposed algorithm is more efficient and effective.
点击量
5677
下载量
1572
评论
0
关键词: Relation extraction; Bi-GRU; CRF keywords attention; Hidden similarity;
DOI:10.1162/dint_a_00147
提交时间: 2022-11-28
摘要:Relational extraction plays an important role in the field of natural language processing to predict semantic relationships between entities in a sentence. Currently, most models have typically utilized the natural language processing tools to capture high-level features with an attention mechanism to mitigate the adverse effects of noise in sentences for the prediction results. However, in the task of relational classification, these attention mechanisms do not take full advantage of the semantic information of some keywords which have information on relational expressions in the sentences. Therefore, we propose a novel relation extraction model based on the attention mechanism with keywords, named Relation Extraction Based on Keywords Attention (REKA). In particular, the proposed model makes use of bi-directional GRU (Bi-GRU) to reduce computation, obtain the representation of sentences , and extracts prior knowledge of entity pair without any NLP tools. Besides the calculation of the entity-pair similarity, Keywords attention in the REKA model also utilizes a linear-chain conditional random field (CRF) combining entity-pair features, similarity features between entity-pair features, and its hidden vectors, to obtain the attention weight resulting from the marginal distribution of each word. Experiments demonstrate that the proposed approach can utilize keywords incorporating relational expression semantics in sentences without the assistance of any high-level features and achieve better performance than traditional methods.
点击量
7066
下载量
1708
评论
0
关键词: Machine learning; Regression; Comparative evaluation; Analysis; Validation;
DOI: 10.1162/dint_a_00155
提交时间: 2022-11-28
摘要:Artificial intelligence and machine learning applications are of significant importance almost in every field of human life to solve problems or support human experts. However, the determination of the machine learning model to achieve a superior result for a particular problem within the wide real-life application areas is still a challenging task for researchers. The success of a model could be affected by several factors such as dataset characteristics, training strategy and model responses. Therefore, a comprehensive analysis is required to determine model ability and the efficiency of the considered strategies. This study implemented ten benchmark machine learning models on seventeen varied datasets. Experiments are performed using four different training strategies 60:40, 70:30, and 80:20 hold-out and five-fold cross-validation techniques. We used three evaluation metrics to evaluate the experimental results: mean squared error, mean absolute error, and coefficient of determination (R2 score). The considered models are analyzed, and each model's advantages, disadvantages, and data dependencies are indicated. As a result of performed excess number of experiments, the deep Long-Short Term Memory (LSTM) neural network outperformed other considered models, namely, decision tree, linear regression, support vector regression with a linear and radial basis function kernels, random forest, gradient boosting, extreme gradient boosting, shallow neural network, and deep neural network. It has also been shown that cross-validation has a tremendous impact on the results of the experiments and should be considered for the model evaluation in regression studies where data mining or selection is not performed.
点击量
7112
下载量
1637
评论
0