
Text Distance from Nested and Hierarchical Repetitions: A Compression-Based Perspective

Abstract: We present a new method for structural sequence analysis grounded in Algorithmic Information Theory (AIT). At its core is the Ladderpath approach, which extracts nested and hierarchical relationships among repeated substructures in linguistic sequences, an instantiation of AIT's principle of describing data through minimal generative programs. These structures are then used to define three distance measures: a normalized compression distance (NCD) and two alternative distances derived directly from the Ladderpath representation. Combined with a k-nearest-neighbor classifier, these distances achieve strong and consistent performance across in-distribution, out-of-distribution (OOD), and few-shot text classification tasks. In particular, all three methods outperform both gzip-based NCD and BERT under OOD and low-resource settings. These results demonstrate that the structured representations captured by Ladderpath preserve intrinsic properties of sequences and provide a lightweight, interpretable, and training-free alternative for text modeling. This work highlights the potential of AIT-based approaches for structural and domain-agnostic sequence understanding.
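For orientation, the gzip-based NCD baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard NCD formula, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), paired with a k-nearest-neighbor vote; it is not the Ladderpath method, whose details are not given here. The function names (`compressed_size`, `ncd`, `knn_predict`) and the choice to join the two texts with a single space are assumptions made for the example.

```python
import gzip
from collections import Counter


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings, using gzip as C."""
    cx, cy = compressed_size(x), compressed_size(y)
    # Assumption: the pair is concatenated with a single space separator.
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Majority label among the k training texts closest to query under NCD.

    train is a list of (text, label) pairs; no model is fitted.
    """
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because the classifier only ranks training texts by distance, swapping in a different distance (such as one derived from a structured representation) would only require replacing `ncd`.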

Version History

[V1] 2025-06-11 12:06:52 ChinaXiv:202506.00060V1
Peer Review Status
Awaiting Review