🇵🇭 Filipino NLP Collection

This collection contains all of my work on Filipino NLP.

My initial motivation to work on Filipino NLP was quite modest: I was the only Filipino on the spaCy team, and I thought it would be nice to represent and add Tagalog language support to the library. No one else was going to work on it, so why not me? It’s like being alone in a grocery aisle and spotting a cereal box on the floor—you’re the only person there, you’re capable, so why not just pick it up?

This sense of responsibility grew and took shape into what it is today. I’m also glad to know that I’m not alone: there are several researchers and practitioners who are just as passionate about improving the state of Filipino NLP. We had our first success with FilBench-Eval (EMNLP ’25 Main), demonstrating that a scrappy grassroots team can do amazing research. There’s so much more to come, and I’m excited to see what else we can achieve.

I’m actively organizing researchers interested in improving Filipino NLP through FilBench. It’s a collective of researchers who work together on a few focused bets throughout the year. Sometimes I lead projects, as with FilBench-Eval; sometimes others do. Join us or reach out!


Research

FilBench: Can LLMs Generate and Understand Filipino?
EMNLP ’25 Main
Lester James V. Miranda*, Elyanah Aco*, Conner Manuel*, Jan Christian Blaise Cruz, Joseph Marvin Imperial
Code / Website

The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
ACL ’25 Main
Angelina A. Aquino*, Lester James V. Miranda*, Elsie Marie T. Or*
Dataset / Slides / Poster / Video

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
EMNLP ’24 Main
Holy Lovenia*, Rahmad Mahendra*, Salsabil Maulana Akbar*, Lester James Miranda*, and 50+ other authors (*: major contributor).
Code / Website

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
NAACL ’24 Main
Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
Dataset / Website

calamanCy: a Tagalog Natural Language Processing Toolkit
NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP ’23
Lester James V. Miranda
Code / Poster / Video

Developing a Named Entity Recognition Dataset for Tagalog
Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL ’23
Lester James V. Miranda
Code / Dataset / Slides / Video

Writings

2025

2024

2023

2022
