Research
You can view my full CV here.
Publications
NLP
- Pavel Chizhov☆, Catherine Arnett☆, Elizaveta Korotkova, Ivan P. Yamshchikov (2024). BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Miami, FL, USA. ☆equal contribution
- Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen (2024). When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Miami, FL, USA.
- James Michaelov, Catherine Arnett, Benjamin K. Bergen (2024). Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics. The First Conference on Language Modeling (COLM). Philadelphia, USA.
- Catherine Arnett☆, Pamela D. Rivière☆, Tyler A. Chang, and Sean Trott (2024). Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement. Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) co-located at NAACL. Mexico City, Mexico. ☆equal contribution
- Catherine Arnett☆, Tyler A. Chang☆, Benjamin K. Bergen (2024). A Bit of a Problem: Measurement Disparities in Dataset Sizes Across Languages. 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL), co-located at LREC-COLING. Torino, Italy. ☆equal contribution
- James A. Michaelov☆, Catherine Arnett☆, Tyler A. Chang, Benjamin K. Bergen (2023). Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore. ☆equal contribution
Linguistics and Psycholinguistics
- Jesse Quinn, Matthew Goldrick, Catherine Arnett, Victor S. Ferreira, Tamar H. Gollan (2024). Syntax Drives Default Language Selection in Bilingual Connected Speech Production. Journal of Experimental Psychology: Learning, Memory, and Cognition.
- Catherine Arnett (2019). Pathways of Change in Romance Motion Events: A Corpus-based Comparison. Proceedings of the Thirtieth Western Conference on Linguistics (WeCOL). Vol 23. Fresno, CA, USA.
Manuscripts
- Catherine Arnett and Benjamin K. Bergen (under review). Why do language models perform worse for morphologically complex languages?
- Catherine Arnett☆, Eliot Jones☆, Ivan P. Yamshchikov, Pierre-Carl Langlais (under review). Toxicity of the Commons: Curating Open-Source Pre-Training Data. ☆equal contribution
- Tyler A. Chang, Catherine Arnett, Zhuowen Tu, Benjamin K. Bergen (under review). Goldfish: Monolingual Language Models for 350 Languages.
Conference Presentations
NLP
- Catherine Arnett, Tyler A. Chang, James A. Michaelov, and Benjamin K. Bergen (2023). Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models. The 3rd Workshop on Multilingual Representation Learning co-located with EMNLP 2023. Singapore. [abstract]
Psycholinguistics
- Catherine Arnett & Maho Takahashi (2022). Creating a Baseline to Evaluate Correlations Between Language and Environment. Machine Learning and the Evolution of Language. Kanagawa/online.
- Catherine Arnett & Eva Wittenberg (2020). Multiple Meanings of Doubling Up: Mandarin Verbal Reduplication. The 26th Architectures and Mechanisms for Language Processing Conference (AMLaP). Potsdam/online.
- Catherine Arnett & Eva Wittenberg (2020). Conceptual Effects of Verbal Reduplication in Mandarin Chinese. North American Conference on Chinese Linguistics 32. Storrs, CT/online.
- Catherine Arnett & Eva Wittenberg (2019). Conceptual Effects of Verbal Reduplication in Mandarin Chinese. California Meeting on Psycholinguistics. Santa Cruz, California.
- Catherine Arnett (2018). A Diachronic Study of the Typology of Motion Verbs in the Romance Languages. Undergraduate Linguistics Association of Britain Conference, University of Edinburgh.