
Current Research

My dissertation work focuses on multilingual language models. Some of the questions I am interested in include:

  • How do multilingual language models represent information for the different languages they were trained on?
  • What are the optimal conditions for crosslingual transfer in multilingual models?
  • How can we improve performance for low-resource languages?
  • How does a language’s morphology affect tokenization?

Morphology and Tokenization

With Pamela Rivière, Tyler Chang, and Sean Trott, I investigated the relationship between the morphological alignment of tokenization (whether the tokenizer segments a word along its morpheme boundaries) and verb number agreement performance in Spanish. We found no impact of tokenization scheme on agreement performance. Read the paper here.
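As a rough illustration of what "morphological alignment" means here, the sketch below (hypothetical code, not the paper's actual evaluation pipeline) checks whether a tokenizer's segmentation of a word places boundaries only at gold morpheme boundaries:

```python
def is_morphologically_aligned(token_pieces: list[str], morphemes: list[str]) -> bool:
    """One simple operationalization: a word's tokenization is aligned if every
    token boundary coincides with a morpheme boundary."""
    def internal_boundaries(pieces: list[str]) -> set[int]:
        # Character offsets of the boundaries between consecutive pieces.
        offsets, position = set(), 0
        for piece in pieces[:-1]:
            position += len(piece)
            offsets.add(position)
        return offsets

    return internal_boundaries(token_pieces) <= internal_boundaries(morphemes)

# Toy Spanish example: "hablamos" = habl- (stem) + -amos (1pl present ending).
print(is_morphologically_aligned(["habl", "amos"], ["habl", "amos"]))  # True
print(is_morphologically_aligned(["ha", "blamos"], ["habl", "amos"]))  # False
```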

Cross-lingual Differences in Language Modeling

We found that different languages take different amounts of space (in bytes) to represent content-matched text. We describe these differences in terms of byte premiums and release the Byte Premium Tool to easily calculate these relative differences in dataset measurements. Read the full paper here.
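The released tool handles the data and estimation details, but the core quantity is easy to illustrate. Below is a minimal, hypothetical sketch (not the tool's actual API) of a byte premium as the ratio of UTF-8 bytes needed to encode content-matched text in two languages:

```python
def byte_premium(text_target: str, text_reference: str) -> float:
    """Ratio of UTF-8 bytes needed to encode the same content in a target
    language relative to a reference language. Values above 1.0 mean the
    target language needs more bytes for content-matched text."""
    return len(text_target.encode("utf-8")) / len(text_reference.encode("utf-8"))

# Toy content-matched pair (illustrative only; real estimates use parallel corpora).
print(byte_premium("Это предложение на русском языке.", "This is a sentence in Russian."))
```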

We used these byte premiums to scale training data and trained small, comparable language models for 350 languages. We find that our models outperform larger models with an order of magnitude more parameters for 98 languages. We call these the Goldfish models. Read the preprint here.
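For intuition on the scaling step, a per-language byte budget can be multiplied by that language's byte premium so that each training set covers roughly the same amount of content (a simplified illustration, not the actual Goldfish training pipeline):

```python
def scaled_byte_budget(base_budget_bytes: int, byte_premium: float) -> int:
    """Scale a per-language byte budget by that language's byte premium,
    so budgets are comparable in content rather than in raw bytes."""
    return round(base_budget_bytes * byte_premium)

# e.g. with a 1 GB baseline and a byte premium of 1.8, allocate ~1.8 GB.
print(scaled_byte_budget(10**9, 1.8))
```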

Curse of Multilinguality

My collaborator Tyler Chang and I trained 10,000 language models on 252 languages to investigate the conditions for optimal crosslingual transfer, especially for low-resource languages. We have a preprint entitled “When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages”. We found that crosslingual transfer benefited low-resource languages when the languages involved were similar, but that multilinguality harmed high-resource language performance.

Structural Priming

We used structural priming, an experimental paradigm from psycholinguistics, to find evidence for multilingual abstract grammatical representations in multilingual language models. My collaborators (James Michaelov, Tyler Chang, and Benjamin Bergen) and I presented the full paper “Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models” at EMNLP 2023. We also presented an extended abstract, “Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models”, at the Multilingual Representation Learning Workshop co-located with EMNLP 2023.

Previous Work

Verbal Reduplication in Mandarin

I previously worked on understanding how the various verbal reduplication patterns in Mandarin Chinese are used and what they mean.

For a brief overview, you can check out my AMLaP and NACCL-32 presentations. For more information, see the Reduplication page.

Language and Environment

My collaborator, Maho Takahashi, and I conducted a reanalysis of work on the relationship between a language and the environment in which it is spoken. We found that the reported correlations are not robust when other factors are taken into consideration.

We presented a poster, “Creating a Baseline to Evaluate Correlations Between Language and Environment” [poster], at the Machine Learning for Language Evolution Workshop at the Joint Conference on Language Evolution 2022.

Event Framing

During my undergraduate degree, I worked on the typology of event framing in Romance languages. Using several corpora, I examined the change in verb framing from Latin through Medieval French and Spanish to the modern Romance varieties.

To learn more, you can read my proceedings paper.