Research work
【🎓 Google Scholar】 【📚 Semantic Scholar】
I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial for building general-purpose language models. More specifically, I’m interested in working on the following topics:
- Efficient approaches to annotation: Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I’m currently exploring this question in the context of human preferences in LLM post-training (RLHF).
- Resources for multilingual NLP: No language should be left behind, especially in data. I’ve worked on several datasets to improve the state of low-resource and multilingual NLP, including Filipino datasets and tooling as well as large-scale multilingual datasets.
- Evaluating overlooked aspects of LLMs: I am interested in evaluating aspects of LLMs that mainstream research tends to overlook, such as their multilingual capabilities and reward model performance.
If you are interested in these research directions, especially in improving the state of Filipino NLP, do not hesitate to reach out. I’m happy to discuss research and collaborate!
Selected Publications
Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for the most up-to-date information.
2024
At AI2, I’m working on various aspects of LM adaptation, such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, M-RewardBench).
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Preprint ‘24
Lester James V. Miranda*, Yizhong Wang*, Yanai Elazar, Sachin Kumar, Valentina Pyatkin, Faeze Brahman, Noah A. Smith, Hannaneh Hajishirzi, and Pradeep Dasigi.
[Code] [MultiPref Dataset]
- M-RewardBench: Evaluating Reward Models in Multilingual Settings
Preprint ‘24
Srishti Gureja*, Lester James V. Miranda*, Shayekh bin Islam*, Rishabh Maheshwary*, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee.
[Code] [Dataset]
- SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
EMNLP ‘24, Preprint ‘24
Holy Lovenia*, Rahmad Mahendra*, Salsabil Maulana Akbar*, Lester James Miranda*, and 50+ other authors (∗: major contributors).
[Catalogue] [Code]
- Consent in Crisis: The Rapid Decline of the AI Data Commons
NeurIPS D&B ‘24, Preprint ‘24
Data Provenance Initiative Team (40+ authors). I contributed to the design of the annotation process for Web Domain services and to annotation quality review.
[Website] [Collection] [New York Times Feature]
- RewardBench: Evaluating Reward Models for Language Modeling
Preprint ‘24
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi.
[Leaderboard] [Code] [Blog]
- Allen Institute for AI @ SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages
Special Interest Group on Typology (SIGTYP) Workshop @ EACL ‘24
Lester James V. Miranda
[Code] [Video]
2023
I spent the early part of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on linguistic tasks such as POS tagging, NER, and dependency parsing.
- calamanCy: a Tagalog Natural Language Processing Toolkit
NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP ‘23
Lester James V. Miranda
[Code] [Poster] [Video]
- Developing a Named Entity Recognition Dataset for Tagalog
Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL ‘23
Lester James V. Miranda
[Code] [Dataset] [Video]
- Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
NAACL ‘24, Preprint ‘23
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, and Yuval Pinter.
[Dataset] [Website]
2022
My first foray into NLP research was a technical report on spaCy’s hash embedding method. I was lucky to have worked with established researchers in the field.
- Multi hash embeddings in spaCy
Preprint ‘22
Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (∗: equal contributions).
[Code]
Previous research
I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.
- Feature Extraction using a Mutually-Competitive Autoencoder for Protein Function Prediction
IEEE Systems, Man, and Cybernetics (SMC) ‘18
Lester James V. Miranda and Jinglu Hu
- A Deep Learning Approach based on Stacked Denoising Autoencoder for Protein Function Prediction
IEEE Computer, Software, and Applications (COMPSAC) ‘18
Lester James V. Miranda and Jinglu Hu
- PySwarms: a research toolkit for Particle Swarm Optimization in Python
Journal of Open Source Software (JOSS) ‘18, vol. 3, no. 433
Lester James V. Miranda
I was also involved in research early in my undergrad:
- Appliance Recognition using Hall-Effect Sensors and k-Nearest Neighbors for Power Management Systems
IEEE Region 10 Conference (TENCON) ‘16
Lester James V. Miranda*, Marian Joice Gutierrez*, Samuel Matthew Dumlao, and Rosula Reyes (∗: equal contributions).