I’m broadly interested in data-centric approaches to building language technologies at scale. My goal is to develop systematic methodologies for efficiently constructing NLP resources while actively building new datasets and benchmarks to enhance language model training and evaluation. More concretely, I’m interested in the following areas:

  • Efficient approaches to annotation: Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I’m currently exploring this question in the context of human preferences in LLM post-training (RLHF).

  • Resources for multilingual NLP: No language should be left behind, especially in data. I’ve worked on several datasets to improve the state of low-resource and multilingual NLP. These projects include Filipino datasets and tooling, as well as large-scale multilingual datasets.

  • Faithful benchmarks of model capabilities: How can we design benchmarks that accurately reflect the true capabilities and limitations of LLMs? I’ve explored this question in the context of evaluating reward models (RewardBench), and in assessing multilingual capabilities of LLMs on Southeast Asian languages.

If you are interested in this kind of work, especially in improving the state of Filipino NLP, don’t hesitate to reach out. I’m happy to discuss research and collaborate!

Selected Publications

Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for the most up-to-date information.

2024

At AI2, I’m working on various aspects of LM adaptation, such as preference data collection and evaluation. I’ve also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).

2023

I spent the early part of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on core NLP tasks: POS tagging, NER, dependency parsing, etc.

2022

My first foray into NLP research was a technical report on spaCy’s hash embedding method. I was lucky to have worked with established researchers in the field.

  • Multi hash embeddings in spaCy
    arXiv preprint ‘22
    Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (∗: equal contributions).
    [Code]
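
For a flavor of the method: below is a minimal sketch, assuming spaCy v3.x, of how the multi hash embedding layer can be built directly. This is my own illustration rather than code from the report, and the width, attrs, and rows settings are hypothetical. Each lexical attribute is hashed into its own fixed number of rows, so the size of the embedding table is independent of the vocabulary size.

    from spacy.ml.models import MultiHashEmbed

    # Illustrative (hypothetical) settings: four lexical attributes, each
    # hashed into a fixed-size table, so memory does not grow with vocabulary.
    embed = MultiHashEmbed(
        width=96,                                     # output vector width
        attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],  # attributes to embed
        rows=[5000, 1000, 2500, 2500],                # hash-table rows per attribute
        include_static_vectors=False,                 # no pretrained static vectors
    )
    # `embed` is a Thinc model mapping a batch of Docs to one vector per token.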

Previous Research

I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.

I was also involved in research early in my undergrad: