Signal in Noise: Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models

Chu, Mark; Desikan, Bhargav Srinivasa; Nadler, Ethan O.; Sardo, D. Ruggiero Lo; Darragh-Ford, Elise; Guilbeault, Douglas

arXiv:2203.07911 (cs)

[Submitted on 15 Mar 2022 (v1), last revised 20 Apr 2022 (this version, v2)]

Abstract: Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or "garble", provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2203.07911 [cs.CL]
	(or arXiv:2203.07911v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.07911

Submission history

From: Bhargav Srinivasa Desikan
[v1] Tue, 15 Mar 2022 13:48:38 UTC (20,658 KB)
[v2] Wed, 20 Apr 2022 16:06:48 UTC (20,658 KB)
