

I’m Catherine. I’m currently a PhD Candidate in Linguistics at the University of California at San Diego with a specialization in Computational Social Science. My dissertation is about multilingual language models. I’m especially interested in language modeling in low-resource contexts and cross-lingual differences that affect language modeling. I am also interested in the intersection of psycholinguistics and NLP.

I’m also a research scientist at PleIAs, where I am contributing to the development of multilingual open-source language models, with a particular emphasis on open and public domain training data. I have been working on tokenization in a multilingual context, with an emphasis on OCR data. I have also helped develop a toxicity-filtering pipeline for our training data.

I did my undergraduate degree (MA Hons) in Chinese and Linguistics at the University of Edinburgh, including one year at Zhejiang University in Hangzhou, China.

You can download a copy of my CV here.

Contact me: ccarnett [at] ucsd [dot] edu.
