Welcome

My name is Catherine Arnett. I’m an NLP Researcher at EleutherAI. I am interested in cross-lingual and multilingual NLP. I received my PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. I was previously Lead Research Scientist at PleIAs.

To contact me, you can email me at catherine [dot] arnett [at] gmail [dot] com or find me on 🦋 BlueSky or Twitter.

Other links:

HuggingFace, GitHub, Google Scholar, Semantic Scholar, ORCID

News

I have a new blog post out called “There is no such thing as a tokenizer-free lunch”
My paper with Tyler Chang, Stella Biderman, and Ben Bergen got accepted to NeurIPS! Preprint coming soon 👀
My paper with Sander Land, “BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization,” won Best Paper at the first Tokenization Workshop at ICML 2025! 🏆
My paper “Why do language models perform worse for morphologically complex languages?” was awarded Best Paper at COLING 2025! 🏆
My paper with Tyler Chang, “When is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages,” was awarded Outstanding Paper at EMNLP! 🥇