I’m broadly interested in data-centric approaches to building language technologies at scale. My goal is to develop systematic methodologies for efficiently constructing NLP resources while actively building new datasets and benchmarks to enhance language model training and evaluation. More concretely, I’m interested in the following areas:

  • Efficient approaches to annotation: Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I’m currently exploring this question in the context of human preferences in LLM post-training (RLHF).

  • Resources for multilingual NLP: No language should be left behind, especially in data. I’ve worked on several datasets to improve the state of low-resource and multilingual NLP. These projects include Filipino datasets and tooling, as well as large-scale multilingual datasets.

  • Faithful benchmarks of model capabilities: How can we design benchmarks that accurately reflect the true capabilities and limitations of LLMs? I’ve explored this question in the context of evaluating reward models (RewardBench), and in assessing multilingual capabilities of LLMs on Southeast Asian languages.

If you are interested in this kind of work, especially in improving the state of Filipino NLP, don’t hesitate to reach out. I’m happy to discuss research and collaborate!

Selected Publications

Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for the most up-to-date information.

2024

At AI2, I’m working on various aspects of LM adaptation, such as preference data collection and evaluation. I’ve also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).

2023

I spent the early part of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on core NLP tasks: POS tagging, NER, dependency parsing, etc.

2022

My first foray into NLP research was a technical report on spaCy’s hash embedding method. I was lucky to have worked with established researchers in the field.

  • Multi hash embeddings in spaCy
    arXiv preprint ‘22
    Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (∗: equal contributions).
    [Code]
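
For a flavor of the method: below is a minimal sketch, assuming spaCy v3.x, of how the multi hash embedding layer can be built directly. This is my own illustration rather than code from the report, and the width, attrs, and rows settings are hypothetical. Each lexical attribute is hashed into its own fixed number of rows, so the size of the embedding table is independent of the vocabulary size.

    from spacy.ml.models import MultiHashEmbed

    # Illustrative (hypothetical) settings: four lexical attributes, each
    # hashed into a fixed-size table, so memory does not grow with vocabulary.
    embed = MultiHashEmbed(
        width=96,                                     # output vector width
        attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],  # attributes to embed
        rows=[5000, 1000, 2500, 2500],                # hash-table rows per attribute
        include_static_vectors=False,                 # no pretrained static vectors
    )
    # `embed` is a Thinc model mapping a batch of Docs to one vector per token.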

Previous Research

I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.

I was also involved in research early in my undergrad: