Welcome
My name is Catherine Arnett. I’m an NLP Researcher at EleutherAI. I am interested in cross-lingual and multilingual NLP. I received my PhD in Linguistics with a specialization in Computational Social Science at UC San Diego. I was previously Lead Research Scientist at PleIAs.
To contact me, you can email me at catherine [dot] arnett [at] gmail [dot] com or find me on BlueSky or Twitter.
Other links:
HuggingFace
GitHub
Google Scholar
Semantic Scholar
Orcid
News
- I have a new blog post out called “There is no such thing as a tokenizer-free lunch”
- My paper with Tyler Chang, Stella Biderman, and Ben Bergen got accepted to NeurIPS! Preprint coming soon.
- My paper with Sander Land, BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization, won Best Paper at the first Tokenization Workshop at ICML 2025!
- My paper “Why do language models perform worse for morphologically complex languages?” was awarded Best Paper at COLING 2025!
- My paper with Tyler Chang, “When is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages”, was awarded Outstanding Paper at EMNLP!