
Current Research

My dissertation work focuses on multilingual language models. Some of the questions I am interested in include:

  • How do multilingual language models represent information for the different languages they were trained on?
  • What are the optimal conditions for crosslingual transfer in multilingual models?
  • How can we improve performance for low-resource languages?
  • How does a language’s morphology affect tokenization?

Morphology and Tokenization

With Pamela Rivière, Tyler Chang, and Sean Trott, I investigated the relationship between the morphological alignment of tokenization (whether the tokenizer segments a word along its morpheme boundaries) and verb number agreement performance in Spanish. We found no impact of tokenization scheme on agreement performance. Read the paper here.
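As a rough illustration of what "morphological alignment" means here, the sketch below (hypothetical code, not the paper's actual evaluation pipeline) checks whether a tokenizer's segmentation of a word places boundaries only at gold morpheme boundaries:

```python
def is_morphologically_aligned(token_pieces: list[str], morphemes: list[str]) -> bool:
    """One simple operationalization: a word's tokenization is aligned if every
    token boundary coincides with a morpheme boundary."""
    def internal_boundaries(pieces: list[str]) -> set[int]:
        # Character offsets of the boundaries between consecutive pieces.
        offsets, position = set(), 0
        for piece in pieces[:-1]:
            position += len(piece)
            offsets.add(position)
        return offsets

    return internal_boundaries(token_pieces) <= internal_boundaries(morphemes)

# Toy Spanish example: "hablamos" = habl- (stem) + -amos (1pl present ending).
print(is_morphologically_aligned(["habl", "amos"], ["habl", "amos"]))  # True
print(is_morphologically_aligned(["ha", "blamos"], ["habl", "amos"]))  # False
```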

Cross-lingual Differences in Language Modeling

We found that different languages take different amounts of space (in bytes) to represent content-matched text. We describe these differences in terms of byte premiums and release the Byte Premium Tool to easily calculate these relative differences in dataset measurements. Read the full paper here.
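The released tool handles the data and estimation details, but the core quantity is easy to illustrate. Below is a minimal, hypothetical sketch (not the tool's actual API) of a byte premium as the ratio of UTF-8 bytes needed to encode content-matched text in two languages:

```python
def byte_premium(text_target: str, text_reference: str) -> float:
    """Ratio of UTF-8 bytes needed to encode the same content in a target
    language relative to a reference language. Values above 1.0 mean the
    target language needs more bytes for content-matched text."""
    return len(text_target.encode("utf-8")) / len(text_reference.encode("utf-8"))

# Toy content-matched pair (illustrative only; real estimates use parallel corpora).
print(byte_premium("Это предложение на русском языке.", "This is a sentence in Russian."))
```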

We used these byte premiums to scale training data and trained small, comparable language models for 350 languages. We find that our models outperform larger models with an order of magnitude more parameters for 98 languages. We call these the Goldfish models. Read the preprint here.
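For intuition on the scaling step, a per-language byte budget can be multiplied by that language's byte premium so that each training set covers roughly the same amount of content (a simplified illustration, not the actual Goldfish training pipeline):

```python
def scaled_byte_budget(base_budget_bytes: int, byte_premium: float) -> int:
    """Scale a per-language byte budget by that language's byte premium,
    so budgets are comparable in content rather than in raw bytes."""
    return round(base_budget_bytes * byte_premium)

# e.g. with a 1 GB baseline and a byte premium of 1.8, allocate ~1.8 GB.
print(scaled_byte_budget(10**9, 1.8))
```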

Curse of Multilinguality

My collaborator Tyler Chang and I trained 10,000 language models on 252 languages to investigate the conditions for optimal crosslingual transfer, especially for low-resource languages. We have a preprint entitled “When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages”. We found that crosslingual transfer benefited low-resource languages when the languages involved were similar, but that multilinguality harmed high-resource language performance.

Structural Priming

We used structural priming, an experimental paradigm from psycholinguistics, to find evidence for multilingual abstract grammatical representations in multilingual language models. My collaborators (James Michaelov, Tyler Chang, and Benjamin Bergen) and I presented the full paper “Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models” at EMNLP 2023. We also presented an extended abstract, “Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models”, at the Multilingual Representation Learning Workshop co-located with EMNLP 2023.

Previous Work

Verbal Reduplication in Mandarin

I previously worked on understanding how the various verbal reduplication patterns in Mandarin Chinese are used and what they mean.

For a brief overview, you can check out my AMLaP and NACCL-32 presentations. For more information, see the Reduplication page.

Language and Environment

My collaborator, Maho Takahashi, and I conducted a reanalysis of work on the relationship between a language and the environment in which it is spoken. We found that the reported correlations are not robust when other factors are taken into consideration.

We presented a poster, “Creating a Baseline to Evaluate Correlations Between Language and Environment” [poster], at the Machine Learning for Language Evolution Workshop at the Joint Conference on Language Evolution 2022.

Event Framing

During my undergraduate degree, I worked on the typology of event framing in Romance languages. Using several corpora, I examined the change in verb framing from Latin through Medieval French and Spanish to the modern Romance varieties.

To learn more, you can read my proceedings paper.