I work in natural language processing and machine learning. I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial to building general-purpose language models.

The following are the research themes I am interested in, along with some representative publications. I am eager to explore these themes individually or at their intersection. You can also check my Google Scholar for a complete and up-to-date list.

Understanding high-quality data

What constitutes high-quality data? And how can we collect it efficiently and at scale? I’ve explored these questions in the context of acquiring post-training data, examining human factors in annotation and devising new techniques for data synthesis and generation.

Data for multilingual and equitable NLP

No language should be left behind, especially in data. I’m motivated to pursue research that ensures the next generation of state-of-the-art LLMs caters to languages beyond English, both by improving training data quality and by building faithful multilingual benchmarks.

Within this theme, I care deeply about improving the state of Filipino NLP and representing my native language. This involves building resources for both training and evaluation. I also write about Filipino NLP on this blog.