Welcome
My name is Catherine Arnett. I’m an NLP Researcher at EleutherAI. I am mainly interested in cross-lingual and multilingual NLP (see the research page for more information). I recently finished my PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. I was previously Lead Research Scientist at PleIAs.
To contact me, you can email me at catherine [dot] arnett [at] gmail [dot] com or find me on BlueSky.
Other links:
HuggingFace
GitHub
Google Scholar
Semantic Scholar
Orcid
News
- My paper with Sander Land, BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization, won Best Paper at the first Tokenization Workshop at ICML 2025!
- I released MorphScore v2 with Marisa Hudspeth and Brendan O’Connor. Check out the new preprint about the updates: Evaluating Morphological Alignment of Tokenizers in 70 Languages.
- My paper “Why do language models perform worse for morphologically complex languages?” was awarded Best Paper at COLING 2025!
- My paper with Tyler Chang, “When is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages,” was awarded Outstanding Paper at EMNLP!