🎓 Google Scholar】 【📚 Semantic Scholar

I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial to building general-purpose language models. More specifically, I’m interested in working on the following topics:

  • Efficient approaches to annotation: Human annotations are costly. How can we reduce this cost while preserving the nuance that human annotators provide? I’m currently exploring this question in the context of human preferences in LLM post-training (RLHF).

  • Resources for multilingual NLP: No language should be left behind, especially in data. I’ve worked on several datasets to improve the state of low-resource and multilingual NLP. These projects involve Filipino datasets & tooling, and large-scale multilingual datasets.

  • Evaluating overlooked aspects of LLMs: I’m interested in evaluating aspects of LLMs that mainstream research and discussion often overlook, such as their multilingual capabilities or reward model performance.

If you are interested in this kind of work, especially in improving the state of Filipino NLP, do not hesitate to reach out. I’m happy to discuss research and collaborate!

Selected Publications

Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for more up-to-date information.

2024

At AI2, I’m working on various aspects of LM adaptation such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, SIGTYP).

2023

I spent the early parts of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on linguistic tasks such as POS tagging, NER, and dependency parsing.

2022

My first foray into NLP research was a technical report on spaCy’s hash embedding method. I’m lucky to have worked with established researchers in the field.

  • Multi hash embeddings in spaCy
    arXiv preprint ’22
    Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (*: equal contribution).
    [Code]

Previous research

I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.

I was also involved in research early on during my undergrad: