
Constraining Large Language Model for Generating Computer-Parsable Content

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in learning patterns from massive text corpora, including word relationships, sentence structures, and even complex semantic and pragmatic information. However, it remains challenging to induce pre-trained language models to generate structured content that strictly follows specific conventions. We propose a scheme for guiding LLMs to generate highly usable content for computers, without fine-tuning or additional neural network inference, by introducing coroutine-based generation constraints through a pre-agreed context-free grammar (CFG). The CFG guides the autoregressive Transformer to sample the correct tokens during its decoding phase, so that the output forms a formal language conforming to the program's conventions. This effectively improves the stability and consistency of LLMs in generating target data structures, types, or instructions, and reduces the difficulty of application development and integration. Through a bracket-pair matching experiment, we first verified that the error rate of models such as GPT-2 and Gemma reaches 95% when the length of the generated DSL exceeds 36 and 282 tokens, respectively, which illustrates the performance problem of some current LLMs in generating specific DSLs. We also present YieldLang, a coroutine-based DSL generation framework, and conduct experiments with LLMs on multiple task datasets, including JSON, Mermaid flowchart, and function-call expression generation.
These experiments show that, compared to the benchmarks, our approach improves accuracy by a factor of 1.09 to 11.6, and in the best case reduces the number of samples the LLM needs to generate valid JSON to about 16.5% of the benchmark, which will effectively improve the usability of LLM-generated content for computer programs.
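The core idea of grammar-constrained decoding described above can be illustrated with a minimal sketch: at each decoding step, the model's ranked token proposals are filtered so that only tokens permitted by the grammar survive. The sketch below uses a toy balanced-bracket grammar and a stand-in for the model's proposal function; the names (`constrained_decode`, `propose`) are illustrative assumptions, not YieldLang's actual API.

```python
def allowed_next(stack, token):
    """Return True if `token` keeps a bracket sequence well-formed."""
    pairs = {")": "(", "]": "[", "}": "{"}
    if token in "([{":
        return True
    # A closer is only legal if it matches the most recent open bracket.
    return bool(stack) and stack[-1] == pairs[token]

def constrained_decode(propose, max_len=8):
    """Greedily pick the highest-ranked proposed token the grammar allows,
    then force-close any open brackets so the output always parses."""
    pairs = {"(": ")", "[": "]", "{": "}"}
    stack, out = [], []
    while len(out) < max_len:
        # Filter the "model's" ranked proposals through the grammar.
        candidates = [t for t in propose(out) if allowed_next(stack, t)]
        if not candidates:
            break
        tok = candidates[0]
        if tok in pairs:
            stack.append(tok)
        else:
            stack.pop()
        out.append(tok)
    while stack:  # close remaining open brackets
        out.append(pairs[stack.pop()])
    return "".join(out)

# Mock "model" that always prefers '(' — stands in for LLM logits.
result = constrained_decode(lambda prefix: ["(", ")"], max_len=4)
# result == "((((" + "))))"
```

Because every emitted token passes the grammar check, the output is guaranteed parsable regardless of how unreliable the underlying model's raw sampling is; this is the property the bracket-matching experiment measures in unconstrained models.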

Version History

[V5] 2024-04-07 04:03:34 ChinaXiv:202403.00340V5
[V1] 2024-03-26 22:54:14 ChinaXiv:202403.00340v1
Peer Review Status
Awaiting Review