ChinaXiv.org: Chinese Academy of Sciences Scientific Paper Preprint Platform

1,877 resources in total

1. ChinaXiv:202406.00408

Humans are invited to write cell backbones as complex numbers by writing polyribonucleotides as computable numbers

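Category: Computer Science >> Other Disciplines of Computer Science and Technology; Category: Biology >> Molecular Biology; Category: Mathematics >> Logic; Submitted: 2024-06-28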

Huijuan Wang

Abstract: Polymer aggregates and molecular polymers are written as computable numbers, realizing the unity of cells and universal Turing machines with the Entscheidungsproblem. However, whether the Entscheidungsproblem of cells really exists remains elusive. Alan Turing found that universal Turing machines read only computable numbers written by humans, who further differentiate transcendental numbers from the set of computable numbers by Georg Cantor’s diagonal process. It follows that the decidability of the Entscheidungsproblem derived from humans eliminates the independence of computable numbers from each other and enables computable numbers to be fused with each other into the set of computable numbers, with the result that humans are endowed with a capacity to read of the fusion of computable numbers with each other into the set of computable numbers by humans to read the set of computable numbers by being endowed with a capacity to write computable numbers. Accordingly, it is shown here how humans write cell backbones as complex numbers read by artificial intelligence machines emulated by cells by writing polyribonucleotides as computable numbers read by universal Turing machines emulated by extracellular ribosomes to extend Georg Cantor’s continuum hypothesis by extending Alan Turing’s work on the Entscheidungsproblem, realizing the unity of humans, cells, and artificial intelligence machines without the Entscheidungsproblem.

Peer review status: pending review

Views: 146; Downloads: 20; Comments: 0
2. ChinaXiv:202406.00416

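Comparative Analysis of R&D and Innovation at Leading AI Enterprises in China and the United States, and Its Implications

Category: Computer Science >> Other Disciplines of Computer Science and Technology; Submitted: 2024-06-28; Partner journal: 《中国科学院院刊》 (Bulletin of Chinese Academy of Sciences)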

杨锡怡贾佳周小宇汪寿阳

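Abstract: Artificial intelligence (AI) is one of the most closely watched fields in science and technology today, and China and the United States are the world's two most important centers of AI research and development. The two countries nevertheless differ markedly in how the field has developed. In particular, the release of ChatGPT in 2022 triggered broad discussion of the capabilities and competitiveness of Chinese AI enterprises. Analyzing more than 120,000 AI invention patents granted in China and the United States over the past five years, this article first constructs a multidimensional indicator based on patent characteristics and uses it to identify the top ten AI enterprises in each country. Further analysis shows that the two groups of enterprises differ significantly in patented technology and research networks. China's leading AI enterprises hold comparatively few patents, with lower citation and conversion rates. Their patents are concentrated in application-layer technologies such as image recognition and speech recognition, and they have not yet formed distinctive technology clusters. By contrast, leading US AI enterprises have produced more high-impact patents and have formed multiple technology clusters in the foundation and technology layers of the AI industry. In academic research, China's leading AI enterprises collaborate mainly with domestic research institutions, whereas leading US enterprises show stronger China-US collaboration and stronger collaboration among US-based enterprises. The comparison reveals differences between the two countries' leading AI enterprises in technical capability and collaboration strategy, and offers enterprise management insights and three policy recommendations for better developing China's AI industry.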

Peer review status: passed

Views: 74; Downloads: 28; Comments: 0
3. ChinaXiv:202406.00412

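Early Warning of College English Test Band 4 (CET-4) Scores Based on Deep Convolutional Neural Networks

Category: Computer Science >> Computer Application Technology; Submitted: 2024-06-28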

王宝罗淼

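Abstract: Early-warning models for College English Test Band 4 (CET-4) scores are easily disturbed by differences in students' daily behavior patterns, which reduces prediction accuracy. Using modular learning scores for CET-4 question types, taken from a smart teaching platform and directly related to the exam, as the data source, this study builds a database of grayscale images of modular learning and introduces deep learning into early warning. The result is a CET-4 score warning model based on a deep convolutional neural network that predicts in advance whether a student will achieve the expected CET-4 score. Validation results show that the deep convolutional neural network model achieves higher prediction accuracy than existing prediction models and yields an earlier optimal intervention time, which helps teachers intervene with at-risk students, raise students' CET-4 scores, and improve their ability to use English.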

Peer review status: pending review

Views: 170; Downloads: 20; Comments: 0
4. ChinaXiv:202406.00317

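Automatic Indexing of Chinese Library Classification Codes for Scientific and Technological Achievements Based on the BERT Model

Category: Computer Science >> Computer Application Technology; Submitted: 2024-06-21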

薛钊刘千祥吴昌权李亢陈永海

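Abstract: As deep-learning pre-trained language models (PLMs) developed, they were quickly applied to domain classification of scientific literature, with results far exceeding those of traditional natural language processing techniques on the same task. Registration data for scientific and technological achievements resemble scientific literature: both have highly condensed titles and fairly detailed long-text descriptions, which can serve as the basis for PLM-based classification. Achievements also have their own peculiarities: their descriptions cover the project source, project background, application status, awards, and other aspects, whereas scientific literature usually focuses tightly on the research content. This peculiarity makes it harder for PLM-based methods to predict the Chinese Library Classification (CLC) of an achievement correctly. In this study we build an automatic CLC indexing system for scientific and technological achievements on a pre-trained BERT model (RoBERTa). Inspired by the decoding process of generative large language models, we introduce a decoding strategy that turns the original classification problem into a decoding problem. The method not only improves prediction accuracy but also removes the limitation that earlier classification models could predict at only a single level, enabling the dynamic prediction that the business requires. Filtering conditions can also be set on the cumulative probability along the prediction chain and on the terminal probability, trading off reliability against classification granularity according to actual business needs.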
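The idea of treating hierarchical classification as decoding, with a cutoff on the cumulative probability along the prediction chain, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' system: the hierarchy fragment, the probabilities, and the names `CHILDREN`, `decode_path`, and `toy_scorer` are all hypothetical, and a real scorer would be a fine-tuned model rather than a lookup table. Only the cumulative-probability filter is shown; a terminal-probability filter would be an additional check.

```python
from typing import Callable, Dict, List, Tuple

# Toy fragment of a classification hierarchy (illustrative only): parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "": ["T", "Q"],
    "T": ["TP", "TN"],
    "TP": ["TP3", "TP1"],
}

Scorer = Callable[[str, List[str]], Dict[str, float]]


def decode_path(score_children: Scorer, min_cumulative: float) -> List[Tuple[str, float]]:
    """Greedily walk down the hierarchy, one level per decoding step.

    Stops early when taking the best child would push the cumulative
    probability of the chain below `min_cumulative`.
    """
    path: List[Tuple[str, float]] = []
    node, cumulative = "", 1.0
    while node in CHILDREN:
        probs = score_children(node, CHILDREN[node])
        best = max(probs, key=lambda code: probs[code])
        if cumulative * probs[best] < min_cumulative:
            break  # deeper prediction is not reliable enough
        cumulative *= probs[best]
        node = best
        path.append((node, cumulative))
    return path


# Hypothetical stand-in for a model: fixed child probabilities per parent.
TOY_PROBS: Dict[str, Dict[str, float]] = {
    "": {"T": 0.9, "Q": 0.1},
    "T": {"TP": 0.8, "TN": 0.2},
    "TP": {"TP3": 0.6, "TP1": 0.4},
}


def toy_scorer(parent: str, children: List[str]) -> Dict[str, float]:
    return {code: TOY_PROBS[parent][code] for code in children}


print([code for code, _ in decode_path(toy_scorer, 0.5)])  # ['T', 'TP']
print([code for code, _ in decode_path(toy_scorer, 0.4)])  # ['T', 'TP', 'TP3']
```

Raising the threshold yields a shorter, more reliable chain; lowering it yields a finer-grained but less certain class, which is the reliability versus granularity trade-off the abstract describes.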

Peer review status: pending review

Views: 161; Downloads: 46; Comments: 0
5. ChinaXiv:202406.00272

A New Index for Clustering Evaluation Based on Density Estimation

Category: Computer Science >> Integration Theory of Computer Science; Submitted: 2024-06-18

Gangli Liu

Abstract: A new index for the internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices. The first sub-index, $I_a$, is called the Ambiguous Index; the second, $I_s$, is called the Similarity Index. Both sub-indices are calculated from a density estimate of each cluster in a partition of the data. An experiment tests the performance of the new index and compares it with six other internal clustering evaluation indices (the Calinski-Harabasz index, Silhouette coefficient, Davies-Bouldin index, CDbw, DBCV, and VIASCKDE) on a set of 145 datasets. The results show that the new index significantly improves on the other internal clustering evaluation indices.

Peer review status: pending review

Views: 191; Downloads: 52; Comments: 0
6. ChinaXiv:202406.00125

Construction and Study of a Gansu Dialect Database

Category: Computer Science >> Computer Software; Submitted: 2024-06-12

杜吉梁赵博辛浩然张启超董昊焦佳怡朱珍云罗红军孙美琴张锐

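Abstract: This paper discusses the importance, current state, and future development trends of dialect databases. First, it introduces the concept of a dialect database: a corpus-like resource formed by digitizing, annotating, and classifying the dialects of China's regions. Next, it analyzes the current state of dialect databases in China, including the construction of dialect speech databases and dialect document databases and the development of dialect digitization and dialect translation technology. It then forecasts future trends, including the application of big data and cloud computing, deep learning, and blockchain technology, as well as new research methods and technical updates. The paper places special emphasis on the research and construction of a Gansu dialect database, including the steps and key technical points of building it and the research results and their significance. Overall, research on and development of dialect databases are vital for protecting dialect culture, promoting the national language, and advancing dialect research.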

Peer review status: pending review

Views: 231; Downloads: 63; Comments: 0
7. ChinaXiv:202406.00020

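Sentence Alignment Scoring of Parallel Corpora for Low-Resource Language Machine Translation

Category: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics; Category: Computer Science >> Natural Language Understanding and Machine Translation; Submitted: 2024-06-05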

李林霞陈波周毛克赵小兵

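Abstract: Objective: To quantify sentence alignment scores for low-resource language parallel corpora, obtain high-quality parallel corpora, and improve machine translation performance. Methods: We propose NeuroAlign, a neural-network-based method that scores sentence alignment in bilingual parallel corpora using unsupervised sentence embeddings. Parallel sentence pairs are embedded into a shared vector space, an alignment score is computed for each candidate pair in the corpus, and the pairs are then ranked by score and low-scoring pairs filtered out, yielding a high-quality low-resource bilingual parallel corpus. Results: On the BUCC2018 parallel text mining task, F1 improves by 0.5-0.8; on CCMT2021 low-resource neural machine translation, BLEU improves by 0.1-10.9; and the alignment scores approach human scores. Limitations: Because low-resource bilingual parallel corpora are scarce, language pairs other than Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese were not explored. Conclusions: The method can be applied effectively to sentence alignment scoring of low-resource parallel corpora, improving corpus quality at the data source and thereby improving machine translation.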
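The embed, score, rank, and filter pipeline can be sketched in a few lines. This is a minimal illustration, not NeuroAlign itself: it assumes sentence embeddings in a shared space have already been computed by some encoder, uses plain cosine similarity as a stand-in scoring function (the abstract does not specify the actual score), and the names `cosine` and `filter_pairs` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(
    pairs: List[Tuple[Vector, Vector]], min_score: float
) -> List[Tuple[int, float]]:
    """Score each (source, target) embedding pair, rank by score, drop low scores.

    Returns (pair index, score) tuples, best first.
    """
    scored = [(cosine(src, tgt), index) for index, (src, tgt) in enumerate(pairs)]
    scored.sort(reverse=True)
    return [(index, score) for score, index in scored if score >= min_score]


# Toy 2-D "embeddings": pair 0 is well aligned, pair 1 is unrelated, pair 2 is partial.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
kept = filter_pairs(toy_pairs, min_score=0.5)
print([index for index, _ in kept])  # [0, 2]
```

The threshold `min_score` controls the trade-off between corpus size and corpus quality; in practice it would be tuned against a downstream metric such as BLEU.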

Peer review status: passed

Views: 396; Downloads: 92; Comments: 0
8. ChinaXiv:202405.00127

Turing’s thinking machine and ’t Hooft’s principle of superposition of states

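Category: Physics >> General Physics: Statistical and Quantum Mechanics, Quantum Information, etc.; Category: Computer Science >> Other Disciplines of Computer Science and Technology; Submitted: 2024-05-14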

Zeqian Chen

Abstract: In his 1950 paper [11], Turing proposed the notion of a thinking machine, namely a machine that can think. But a thinking machine, if realized physically, has to follow some law of physics. In this paper, we show that Turing’s thinking machine necessarily obeys ’t Hooft’s principle of superposition of states, which ’t Hooft presented in 2016 [8] and which goes beyond the usual principle described by Dirac [4] in conventional quantum mechanics. Precisely, Turing’s thinking machine must be a quantum machine, while ’t Hooft’s principle characterizes its thinking behavior in a probabilistic way.

Peer review status: pending review

Views: 552; Downloads: 163; Comments: 0
9. ChinaXiv:202405.00061

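The SCMP Classification Framework for Malicious Code and a Multi-Label Mechanism for Risk Behaviors

Category: Computer Science >> Other Disciplines of Computer Science and Technology; Submitted: 2024-05-09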

肖新光李晨平韩耀光童志明李琦

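Abstract: In response to the need in academia and industry for a scientific method of classifying malicious code, this study builds on existing work and borrows the strengths of Kaspersky's relatively rigorous multi-part classification naming. Following the principles of mutual exclusivity, complete coverage, and convergence, and combined with "threat risk behavior labels", it forms a malicious code classification framework that conforms to the MECE principle, converges in its categories, and is compatible with the de facto classifications used in industry, effectively supporting security defense and governance.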

Peer review status: passed

Views: 483; Downloads: 132; Comments: 0
10. ChinaXiv:202404.00375

Brief Discussion on Scenes and Strategies in Capital Markets Manipulation Detection: From Influence Diffusion Perspectives

Category: Computer Science >> Other Disciplines of Computer Science and Technology; Submitted: 2024-04-24

Chang Liao

Abstract: In capital markets, early detection of influential entities, those whose changes can significantly affect the overall trend of related entities, can benefit the decision making of both investors and regulators. Meanwhile, market manipulation is a serious concern: it covers a range of activities designed to artificially influence price or trading volume, including tactics such as pump and dump, market cornering, spoofing, and wash trading, which disrupt market fairness and erode investor trust. By leveraging both information behavior data (stock news opinion and volume) and business behavior data (stock trading price and volume), together with trade patterns and communication channels, several herding-based manipulation scenes and detection models are discussed and proposed.

Peer review status: pending review

Views: 381; Downloads: 120; Comments: 0
11. ChinaXiv:202404.00272

Guiding Large Language Models to Generate Computer-Parsable Content

Category: Computer Science >> Computer Software; Submitted: 2024-04-23

Jiaye Wang

Abstract: We propose a method to guide Large Language Models (LLMs) in generating structured content adhering to specific conventions without fine-tuning. By utilizing coroutine-based content generation constraints through a pre-agreed context-free grammar (CFG), LLMs are directed during decoding to produce formal language compliant outputs. This enhances stability and consistency in generating target data structures, types, or instructions, reducing application development complexities. Experimentally, error rates of GPT-2 and Gemma exceed 95% for DSLs longer than 36 and 282 tokens, respectively. We introduce YieldLang, a coroutine-based DSL generation framework, and evaluate it with LLMs on various tasks including JSON and Mermaid flowchart generation. Compared to benchmarks, our approach raises accuracy to 1.09 to 11.6 times the baseline, with LLMs requiring only about 16.5% of the samples to generate JSON effectively. This enhances usability of LLM-generated content for computer programs.
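To make the idea of coroutine-based constraints during decoding concrete, here is a minimal sketch that uses balanced parentheses as a toy grammar. It is not YieldLang's API: the names `balanced_parens`, `constrained_decode`, and `toy_scores` are hypothetical, the "model" is a fixed score table, and the grammar is hand-written as a Python generator rather than derived from a CFG. The generator yields the set of tokens allowed next and receives the token that was chosen; the decoder picks the highest-scoring token among the allowed ones.

```python
from typing import Callable, Dict, Generator, List, Set

EOS = "<eos>"


def balanced_parens(max_len: int) -> Generator[Set[str], str, None]:
    """Coroutine for balanced parentheses of at most `max_len` symbols.

    Yields the set of tokens allowed next and receives the chosen "(" or ")".
    The decoder stops on EOS, so EOS is never sent back in.
    """
    depth = 0   # currently open parentheses
    length = 0  # symbols emitted so far
    while True:
        allowed: Set[str] = set()
        remaining = max_len - length
        if depth > 0:
            allowed.add(")")
        if remaining - 1 >= depth + 1:  # room to open now and still close everything
            allowed.add("(")
        if depth == 0:
            allowed.add(EOS)            # a balanced prefix may end here
        token = yield allowed
        depth += 1 if token == "(" else -1
        length += 1


def constrained_decode(
    score_tokens: Callable[[List[str]], Dict[str, float]],
    grammar: Generator[Set[str], str, None],
) -> str:
    """Greedy decoding that only considers tokens the grammar currently allows."""
    output: List[str] = []
    allowed = next(grammar)
    while True:
        scores = score_tokens(output)
        token = max(sorted(allowed), key=lambda t: scores.get(t, float("-inf")))
        if token == EOS:
            return "".join(output)
        output.append(token)
        allowed = grammar.send(token)


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    # Hypothetical stand-in for model logits: prefers an out-of-grammar token "x",
    # then "(", so unconstrained greedy decoding would never produce valid output.
    return {"x": 5.0, "(": 3.0, ")": 2.0, EOS: 1.0}


print(constrained_decode(toy_scores, balanced_parens(6)))  # ((()))
```

Because invalid tokens are masked before selection, every output is well formed by construction, regardless of what the underlying scores prefer.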

Peer review status: pending review

Views: 456; Downloads: 102; Comments: 0
12. ChinaXiv:202404.00287

SteganoDDPM: A high-quality image steganography self-learning method using diffusion model

Category: Computer Science >> Information Security; Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-23

Mengnan Qu Yuhao Jin Guanghua Zhang

Abstract: Image steganography has become a focal point of interest for researchers due to its capacity for the covert transmission of sensitive data. Traditional diffusion models often struggle with image steganography tasks involving paired data, as their core principle of gradually removing noise is not directly suited for maintaining the correspondence between carrier and secret information. To address this challenge, this paper conducts an in-depth analysis of the principles behind diffusion models and proposes a novel framework for an image steganography diffusion model. The study begins by mathematically representing the steganography tasks of paired images, introducing two optimization objectives: minimizing the secrecy leakage function and embedding distortion function. Subsequently, it identifies three key issues that need to be addressed in paired image steganography tasks and, through specific constraint mechanisms and optimization strategies, enables the diffusion model to effectively handle paired data. This enhances the quality of the generated stego-images and resolves issues such as image clarity. Finally, on public datasets like CelebA, the proposed model is compared with existing generation model-based image steganography techniques, analyzing its implementation effects and performance parameters. Experimental results indicate that, compared to current technologies, the model framework proposed in this study not only improves image quality but also achieves significant enhancements in multiple performance metrics, including the imperceptibility and anti-detection capabilities of the images. Specifically, the PSNR of its stego-images reaches 93.14 dB, and the extracted images’ PSNR reaches 91.23 dB, an approximate improvement of 30% over existing technologies; the attack success rate is reduced to $2.4 \times 10^{-38}$. These experimental outcomes validate the efficacy and superiority of the method in image steganography tasks.

Peer review status: pending review

Views: 596; Downloads: 172; Comments: 0
13. ChinaXiv:202404.00273

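Guiding Large Language Models to Generate Computer-Parsable Content

Category: Computer Science >> Computer Software; Category: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics; Submitted: 2024-04-21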

王家晔

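Abstract: This slide deck presents the study "Guiding Large Language Models to Generate Computer-Parsable Content" in six parts: background, motivation, method, results, outlook, and acknowledgments. For the full text, see https://arxiv.org/abs/2404.05499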

Peer review status: pending review

Views: 1511; Downloads: 405; Comments: 0
14. ChinaXiv:202404.00195

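Modeling Integrative Complexity in Chinese and English Based on Large Language Models

Category: Psychology >> Applied Psychology; Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-10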

李东启朱廷劭

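Abstract: Integrative complexity is a concept in psychology used to measure the structure of an individual's thinking. It has two main aspects: differentiation and integration. Differentiation is the ability to recognize and understand the different viewpoints or elements present in information; integration is the ability to combine those viewpoints or elements into a logical, coherent whole. Integrative complexity is measured mainly by manual analysis of text, which may be written material, speeches, interview records, or any other form of spoken or written expression. To address the high cost of manual assessment, the low accuracy of automated assessment, and the lack of assessment schemes for Chinese text, this study uses large-language-model text data augmentation and model transfer to design automated assessment schemes for Chinese and English text, and explores automated assessment of two substructures of integrative complexity: elaborative integrative complexity and dialectical integrative complexity. Two studies were designed and carried out: the first builds a prediction model for the integrative complexity of English text using LLM-based text data augmentation, and the second builds a prediction model for Chinese text using model transfer. The results show: (1) English text data were augmented with GPT-3.5-Turbo, word vectors were extracted with a pre-trained multilingual RoBERTa model, and a text convolutional neural network served as the downstream model. Against human annotation, the Spearman correlation coefficient was 0.62 for integrative complexity, the correlation coefficient was 0.51 for dialectical integrative complexity, and the Spearman correlation coefficient was 0.60 for elaborative integrative complexity. These results outperform machine learning methods and neural network models without data augmentation. (2) Study 2 built a model with the same neural network structure as in Study 1, transferred the final model parameters from Study 1 into it, and trained it on the integrative complexity of Chinese text. In the zero-shot setting, the transfer-learning model reached a Spearman correlation coefficient of 0.31 for integrative complexity, a Spearman correlation coefficient of 0.31 for dialectical integrative complexity, and a correlation coefficient of 0.33 for elaborative integrative complexity, all better than the model with random parameters (integrative complexity: 0.17; dialectical integrative complexity: 0.10; elaborative integrative complexity: 0.10). In the few-shot setting, the transfer-learning model reached a Spearman correlation coefficient of 0.73 for integrative complexity, a Spearman correlation coefficient of 0.51 for dialectical integrative complexity, and a correlation coefficient of 0.73 for elaborative integrative complexity.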
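For readers unfamiliar with the evaluation metric, the Spearman coefficient is the Pearson correlation of the rank-transformed scores. The following is a small standard-library sketch of that definition, with tied values given their average rank; the function names and the toy scores are illustrative and are not taken from the study.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy example: hypothetical human ratings vs. model predictions for five texts.
human = [1, 2, 3, 4, 5]
model = [2.1, 1.7, 4.4, 3.9, 5.0]
print(round(spearman(human, model), 3))  # 0.8
```

Because only ranks matter, the metric rewards a model that orders texts the way human annotators do, even if its raw scores are on a different scale.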

Peer review status: pending review

Views: 636; Downloads: 169; Comments: 0
15. ChinaXiv:202404.00141

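Exploring the Integrated Application of Large Models and Standards Literature Knowledge Bases

Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-10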

徐松林

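Abstract: Against the background of artificial intelligence and big data technology, using large models and building knowledge bases of standards literature is of significant value for scientific innovation, knowledge mining, and information retrieval. A standards literature knowledge base provides solid support for normalization and standardization across industries. This study first examines the current state of standards literature, then builds a retrieval-augmented framework that integrates large models with a standards literature knowledge base, and proposes enhancements and optimizations to explore at each stage. Finally, it looks ahead to future research directions and application prospects.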

Peer review status: pending review

Views: 699; Downloads: 208; Comments: 0
16. ChinaXiv:202404.00159

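Revision and Validation of the Simplified Chinese LIWC2024 (SCLIWC2024) Dictionary

Category: Psychology >> Applied Psychology; Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-09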

崔雪婷陈思仪赵楠刘晓倩朱廷劭

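Abstract: In recent years, word-analysis approaches have gained increasing attention, especially the Linguistic Inquiry and Word Count (LIWC) tool, whose arrival rekindled many psychologists' enthusiasm for language analysis research. The revision that produced the latest LIWC-22 dictionary added many psychological variables, increasing the tool's application potential while making it more complete. To further advance the adaptation of LIWC for Chinese, we consolidated multiple versions of the Chinese LIWC dictionary, revised them to form SCLIWC2024, and tested its validity. In Study 1, we revised the SCLIWC dictionary against the LIWC-22 and CLIWC2015 dictionaries to form the SCLIWC2024 dictionary. In Study 2, we ran two experiments to test the validity of SCLIWC2024 for psychological expression in different types of online text, and addressed the important question of how to use SCLIWC2024 more effectively to detect psychological expression in short texts on social network platforms.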

Peer review status: pending review

Views: 951; Downloads: 259; Comments: 0
17. ChinaXiv:202404.00111

Multimodal Physical Fitness Monitoring (PFM) Framework Based on TimeMAE-PFM in Wearable Scenarios

Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-07

Junjie Zhang Zheming Zhang Huachen Xiang Yangquan Tan Linnan Huo Fengyi Wang

Abstract: Physical function monitoring (PFM) plays a crucial role in healthcare, especially for the elderly. Traditional assessment methods such as the Short Physical Performance Battery (SPPB) have failed to capture the full dynamic characteristics of physical function. Wearable sensors such as smart wristbands offer a promising solution to this issue. However, challenges exist, such as the computational complexity of machine learning methods and inadequate information capture. This paper proposes a multi-modal PFM framework based on an improved TimeMAE, which compresses time-series data into a low-dimensional latent space and integrates a self-enhanced attention module. This framework achieves effective monitoring of physical health, providing a solution for real-time and personalized assessment. The method is validated using the NHATS dataset, and the results demonstrate an accuracy of 70.6% and an AUC of 82.20%, surpassing other state-of-the-art time-series classification models.

Peer review status: pending review

Views: 649; Downloads: 157; Comments: 0
18. ChinaXiv:202403.00340

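Guiding Large Language Models to Generate Computer-Parsable Content

Category: Computer Science >> Computer Software; Category: Linguistics and Applied Linguistics >> Linguistics and Applied Linguistics; Submitted: 2024-04-07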

王家晔

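Abstract: Large language models (LLMs) can learn patterns from the context of large corpora, including relationships between words, sentence structure, and even more complex semantic and pragmatic information. However, getting a pre-trained language model to generate structured content that strictly follows a convention remains a challenge. This paper proposes a scheme for guiding LLMs to generate content that is highly usable by computers, without fine-tuning or additional neural network inference. A pre-agreed context-free grammar (CFG) introduces a coroutine-based content generation constraint mechanism that, in the decoding stage of the autoregressive Transformer model, guides the model to sample correct tokens so that they form a formal language conforming to the program's convention. This effectively improves the stability and consistency with which LLMs generate target data structures, types, or instructions, and lowers the difficulty of application development and integration. Through a "matching bracket pairs" experiment, the author first verifies that the error rate of models such as GPT-2 and Gemma reaches 95% once the generated DSL length exceeds 36 and 282, respectively, which illustrates the performance problems of current LLMs in generating particular DSLs. The author also proposes YieldLang, a coroutine-based DSL generation framework, and runs experiments with LLMs on several task datasets, including generation of JSON, Mermaid flowcharts, and function call expressions. These experiments show that, relative to the baseline, the method raises accuracy to 109% to 1160% of the original and, in the best case, reduces the number of samples an LLM needs to generate JSON to about 16.5% of the baseline, effectively improving the usability of LLM-generated content for computer programs.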

Peer review status: pending review

Views: 2313; Downloads: 504; Comments: 1
19. ChinaXiv:202404.00076

Terrain Point Cloud Inpainting via Signal Decomposition

Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-05

Yizhou Xie Xiangning Xie Yuran Wang Yanci Zhang Zejun Lv

Abstract: The rapid development of 3D acquisition technology has made it possible to obtain point clouds of real-world terrains. However, due to limitations in sensor acquisition technology or specific requirements, point clouds often contain defects such as holes with missing data. Inpainting algorithms are widely used to patch these holes. However, existing traditional inpainting algorithms rely on precise hole boundaries, which limits their ability to handle cases where the boundaries are not well-defined. On the other hand, learning-based completion methods often prioritize reconstructing the entire point cloud instead of solely focusing on hole filling. Based on the fact that real-world terrain exhibits both global smoothness and rich local detail, we propose a novel representation for terrain point clouds. This representation can help to repair the holes without clear boundaries. Specifically, it decomposes terrains into low-frequency and high-frequency components, which are represented by B-spline surfaces and relative height maps respectively. In this way, the terrain point cloud inpainting problem is transformed into a B-spline surface fitting problem and a 2D image inpainting problem. By solving the two problems, the highly complex and irregular holes on the terrain point clouds can be filled well, in a way that both follows the global terrain undulation and exhibits rich geometric details. The experimental results also demonstrate the effectiveness of our method.

Peer review status: pending review

Views: 727; Downloads: 243; Comments: 0
20. ChinaXiv:202404.00067

Text Analysis and Processing of Japanese Articles with the MeCab Library in Python

Category: Computer Science >> Computer Application Technology; Submitted: 2024-04-04

于瑾麟

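Abstract: Text analysis and processing is becoming an increasingly important topic. Many examples exist for Chinese word segmentation with jieba, but there is very little related work on segmenting Japanese. This paper introduces the Japanese word segmentation capability of the MeCab library in Python and gives example code so that readers can implement Japanese word segmentation as needed.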
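As a rough illustration of the kind of segmentation workflow the abstract describes (this is not the paper's code), the sketch below wraps a tagger whose `parse` method returns space-separated tokens. With the `mecab-python3` package and a dictionary installed, `MeCab.Tagger("-Owakati")` behaves this way; so that the sketch runs without MeCab, a hypothetical `StubTagger` with one hard-coded segmentation stands in for it. The helper names `tokenize` and `word_frequencies` are assumptions made for this example.

```python
from collections import Counter
from typing import List, Protocol


class WakatiTagger(Protocol):
    """Anything with a parse(text) method that returns space-separated tokens."""

    def parse(self, text: str) -> str: ...


def tokenize(tagger: WakatiTagger, text: str) -> List[str]:
    """Split Japanese text into tokens using a wakati-style tagger."""
    return tagger.parse(text).split()


def word_frequencies(tagger: WakatiTagger, text: str) -> Counter:
    """Count how often each token occurs in the text."""
    return Counter(tokenize(tagger, text))


class StubTagger:
    """Stand-in for MeCab.Tagger("-Owakati"); knows a single example sentence."""

    _KNOWN = {"すもももももももものうち": "すもも も もも も もも の うち \n"}

    def parse(self, text: str) -> str:
        return self._KNOWN[text]


# With MeCab installed, replace the stub with:
#   import MeCab
#   tagger = MeCab.Tagger("-Owakati")
tagger = StubTagger()
sentence = "すもももももももものうち"
print(tokenize(tagger, sentence))
print(word_frequencies(tagger, sentence).most_common(2))
```

Keeping the tagger behind a small interface makes the downstream analysis (frequency counts, keyword extraction, and so on) testable without a MeCab installation.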

Peer review status: pending review

Views: 818; Downloads: 244; Comments: 0
