I work in natural language processing and machine learning. I’m broadly interested in building language technologies that are equitable and useful through data. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial to building general-purpose language models.

Below are the research themes I am interested in, along with some representative publications. My work has been published in top NLP conferences such as ACL, NAACL, and EMNLP. You can also check my Google Scholar for a complete and up-to-date list.

Keywords: data-centric AI, multilinguality, resources & evaluation

Collecting high-quality data for LM training

What constitutes high-quality data, and how can we collect it efficiently and at scale? I’ve explored these questions in the context of acquiring post-training data: examining human factors in preference annotation, as in HyPER, and devising new techniques for data synthesis and generation, as in Tülu 3.

  • Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
    ACL ’25 Main
    Lester James V. Miranda*, Yizhong Wang*, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi.

  • Tülu 3: Pushing Frontiers in Open Language Model Post-Training
    COLM ’25
    Nathan Lambert*, Jacob Morrison*, Valentina Pyatkin*, Shengyi Huang*, Hamish Ivison*, Faeze Brahman*, Lester James V. Miranda*, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi (∗: core contributor).

Resources for multilingual and equitable NLP

No language should be left behind, especially in data. I’m motivated to pursue research that ensures the next generation of state-of-the-art LLMs caters to languages beyond English, by improving training data quality and building faithful multilingual benchmarks.

Within this theme, I also care deeply about improving the state of Filipino NLP and representing my native language. This involves building resources for both training and evaluation. I also write about Filipino NLP on this blog.