I’m broadly interested in data-centric approaches to building language technologies at scale. I believe that a careful and systematic understanding of data, from its collection to its downstream influence on training, is crucial for building general-purpose language models. More specifically, I’m interested in working on the following topics:

If you are interested in this kind of work, don’t hesitate to reach out. I’m happy to discuss research and collaborate!


Selected Publications

Below is a list of my publications. You can also check my Google Scholar and Semantic Scholar profiles for more up-to-date information.

2024

At AI2, I’ve worked on various aspects of LM post-training, such as preference data collection and evaluation. I also expanded my work on the multilingual NLP front (SEACrowd, M-RewardBench).

2023

I spent the early part of 2023 working on low-resource languages and multilinguality, especially Tagalog, my native language. I mostly focused on linguistic tasks such as POS tagging, NER, and dependency parsing.

2022

My first foray into NLP research was a technical report on spaCy’s hash embedding method. I was lucky to have worked with established researchers in the field.

  • Multi hash embeddings in spaCy
    Preprint ‘22
    Lester James V. Miranda*, Ákos Kádár*, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal (*: equal contributions).
    [Code]


Previous Research

I used to be a bioinformatics researcher at the Furuzuki Neurocomputing Systems Laboratory, working on nature-inspired algorithms and proteomics.

I was also involved in research early in my undergrad: