Research work
I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial for building general-purpose language models. More specifically, I’m interested in working on the following topics:
Efficient and scalable data collection: Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I’ve explored this in the context of LM post-training by routing preference instances to humans or LMs and by generating synthetic preference data.
Multilingual NLP: No language should be left behind, especially in data. I’ve worked on evaluating reward models in multilingual settings, curating datasets for Southeast Asia, and building large-scale named-entity recognition datasets.
Improving the state of Filipino NLP: I care a lot about representing my language, and I hope to continue doing so by building NLP resources. I’ve built datasets and tooling, to name a few, with more to come! I’ve also written a lot about Filipino NLP on this blog.
If you are interested in these types of work, then do not hesitate to reach out. I’m happy to discuss research and collaborate!
Selected Publications
Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for more up-to-date information.
At AI2, I’ve worked on various aspects of LM post-training, such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, M-RewardBench).
Tülu 3: Exploring Frontiers in Open Language Model Post-Training
Preprint ‘24
Nathan Lambert*, Jacob Morrison*, Valentina Pyatkin*, Shengyi Huang*, Hamish Ivison*, Faeze Brahman*, Lester James V. Miranda*, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, Hannaneh Hajishirzi (∗: core contributor).
[Models] [Datasets] [Website]
2 OLMo 2 Furious
Preprint ‘24
OLMo Team (30+ authors). I contributed by applying the Tülu 3 recipe for generating synthetic preferences to the OLMo 2 suite of DPO models.
[Collection] [Website]
Bridging the Data Provenance Gap Across Text, Speech, and Video
ICLR ‘25, Preprint ‘24
Data Provenance Initiative Team (40+ authors). I contributed to the annotation process design for Web Domain services and to annotation quality review.
[Website] [MIT Technology Review]
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Preprint ‘24
Lester James V. Miranda*, Yizhong Wang*, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi.
[Code] [MultiPref Dataset]
M-RewardBench: Evaluating Reward Models in Multilingual Settings
Preprint ‘24
Srishti Gureja*, Lester James V. Miranda*, Shayekh bin Islam*, Rishabh Maheshwary*, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee.
[Code] [Dataset]
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
EMNLP ‘24, Preprint ‘24
Holy Lovenia*, Rahmad Mahendra*, Salsabil Maulana Akbar*, Lester James Miranda*, and 50+ other authors (∗: major contributor).
[Catalogue] [Code]
Consent in Crisis: The Rapid Decline of the AI Data Commons
NeurIPS D&B ‘24, Preprint ‘24
Data Provenance Initiative Team (40+ authors). I contributed to the annotation process design for Web Domain services and to annotation quality review.
[Website] [Collection] [New York Times Feature]
RewardBench: Evaluating Reward Models for Language Modeling
NAACL (Findings) ‘25, Preprint ‘24
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi
[Leaderboard] [Code] [Blog]
Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages
Special Interest Group on Typology (SIGTYP) Workshop @ EACL ‘24
Lester James V. Miranda
[Code] [Video]
I spent the early parts of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on linguistic tasks such as POS tagging, NER, and dependency parsing.
calamanCy: a Tagalog Natural Language Processing Toolkit
NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP ‘23
Lester James V. Miranda
[Code] [Poster] [Video]
Developing a Named Entity Recognition Dataset for Tagalog
Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL ‘23
Lester James V. Miranda
[Code] [Dataset] [Video]
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
NAACL ‘24, Preprint ‘23
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
[Dataset] [Website]
My first foray into NLP research was a technical report on spaCy’s hash embedding method. I’m lucky to have worked with established researchers in the field.
Multi hash embeddings in spaCy
Preprint ‘22
Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (∗: equal contributions).
Previous research
I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.
Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction.
IEEE Systems, Man, and Cybernetics (SMC) ‘18
Lester James V. Miranda and Jinglu Hu
A Deep Learning Approach based on Stacked Denoising Autoencoder for Protein Function Prediction.
IEEE Computer, Software, and Applications (COMPSAC) ‘18
Lester James V. Miranda and Jinglu Hu
PySwarms, a research-toolkit for Particle Swarm Optimization in Python
Journal of Open Source Software (JOSS) ‘18, vol.3, no. 433
Lester James V. Miranda
I was also involved in research early on during my undergrad:
Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems
IEEE Region 10 Conference (TENCON) ‘16
Lester James V. Miranda*, Marian Joice Gutierrez*, Samuel Matthew Dumlao, and Rosula Reyes (∗: equal contributions).