Research work
My field is natural language processing and machine learning. I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial to building general-purpose language models.
Below are the research themes I am interested in, along with some representative publications. I am eager to explore these themes individually or at their intersection. You can also check my Google Scholar for a complete and up-to-date list.
Understanding high-quality data
What constitutes high-quality data, and how can we collect it efficiently and at scale? I’ve explored these questions in the context of acquiring post-training data, examining human factors in annotation and devising new techniques for data synthesis and generation.
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
  Preprint ‘24
  Lester James V. Miranda*, Yizhong Wang*, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi.
  [Code] [MultiPref Dataset]
- Tülu 3: Exploring Frontiers in Open Language Model Post-Training
  Preprint ‘24
  Nathan Lambert*, Jacob Morrison*, Valentina Pyatkin*, Shengyi Huang*, Hamish Ivison*, Faeze Brahman*, Lester James V. Miranda*, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi (*: core contributor).
  [Models] [Datasets] [Website]
Data for multilingual and equitable NLP
No language should be left behind, especially in data. I’m motivated to pursue research that ensures the next generation of state-of-the-art LLMs caters to languages beyond English, both by improving training data quality and by building faithful multilingual benchmarks.
- M-RewardBench: Evaluating Reward Models in Multilingual Settings
  Preprint ‘24
  Srishti Gureja*, Lester James V. Miranda*, Shayekh bin Islam*, Rishabh Maheshwary*, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee.
  [Code] [Dataset]
- SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
  EMNLP ‘24, Preprint ‘24
  Holy Lovenia*, Rahmad Mahendra*, Salsabil Maulana Akbar*, Lester James Miranda*, and 50+ other authors (*: major contributor).
  [Catalogue] [Code]
Within this theme, I also care deeply about improving the state of Filipino NLP and representing my native language. This involves building resources for both training and evaluation. I also write a lot about Filipino NLP on this blog.
- The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
  Angelina A. Aquino*, Lester James V. Miranda*, and Elsie Marie T. Or*.
  [Dataset]
- Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
  NAACL ‘24, Preprint ‘23
  Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, and Yuval Pinter.
  [Dataset] [Website]
- calamanCy: a Tagalog Natural Language Processing Toolkit
  NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP ‘23
  Lester James V. Miranda
  [Code] [Poster] [Video]
- Developing a Named Entity Recognition Dataset for Tagalog
  Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL ‘23
  Lester James V. Miranda
  [Code] [Dataset] [Video]