🇵🇭 Filipino NLP Collection
This collection contains all my work on Filipino NLP.
My initial motivation to work on Filipino NLP was quite modest: I was the only Filipino on the spaCy team, and I thought it would be nice to represent and add Tagalog language support to the library. No one else was going to work on it, so why not me? It’s like being alone in a grocery aisle and spotting a cereal box on the floor. You’re the only person there, you’re capable, so why not just pick it up?
This sense of responsibility grew into what it is today. I’m also glad to know that I’m not alone: there are several researchers and practitioners who are also passionate about improving the state of Filipino NLP. We had our first success with FilBench-Eval (EMNLP ’25 Main), demonstrating that a scrappy grassroots team can do amazing research. There’s so much more to come, and I’m excited to see what else we can achieve.
I’m actively organizing researchers interested in improving Filipino NLP through FilBench. It’s a collective of researchers who work together on a few focused bets throughout the year. Sometimes I lead projects, as with FilBench-Eval; sometimes others do. Join us or reach out!
Writings
2025
- Introducing FilBench: An Open LLM Evaluation Suite for Filipino (August 21, 2025)
2024
- Desiderata for Filipino NLP in the Age of LLMs (December 17, 2024)
- Guest lecture @ DLSU Manila: Artisanal Filipino NLP Resources in the time of Large Language Models (July 2, 2024)
2023
- Visualizing Tagalog NER embeddings (November 20, 2023)
- Do large language models work on Tagalog? (August 4, 2023)
- calamanCy: NLP pipelines for Tagalog (August 1, 2023)
- Some thoughts on the annotation process (July 3, 2023)
- Towards a Tagalog NLP pipeline (February 4, 2023)
2022
- Pagmumuni ukol sa wika at kahulugan (December 1, 2022)
- Your train-test split may be doing you a disservice (August 2, 2022)
- Dependency parsing for a low-resource language (Tagalog) (April 24, 2022)
Research
- FilBench: Can LLMs Generate and Understand Filipino?
  EMNLP ’25 Main
  Lester James V. Miranda*, Elyanah Aco*, Conner Manuel*, Jan Christian Blaise Cruz, Joseph Marvin Imperial
  Code / Website
- The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
  ACL ’25 Main
  Angelina A. Aquino*, Lester James V. Miranda*, Elsie Marie T. Or*
  Dataset / Slides / Poster / Video
- SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
  EMNLP ’24 Main
  Holy Lovenia*, Rahmad Mahendra*, Salsabil Maulana Akbar*, Lester James Miranda*, and 50+ other authors (*: major contributor).
  Code / Website
- Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
  NAACL ’24 Main
  Stephen Mayhew, Terra Blevins, Shuheng Liu, Marek Šuppa, Hila Gonen, Joseph Marvin Imperial, Börje F. Karlsson, Peiqin Lin, Nikola Ljubešić, LJ Miranda, Barbara Plank, Arij Riabi, Yuval Pinter
  Dataset / Website
- calamanCy: a Tagalog Natural Language Processing Toolkit
  NLP Open-Source Software (NLP-OSS) Workshop @ EMNLP ’23
  Lester James V. Miranda
  Code / Poster / Video
- Developing a Named Entity Recognition Dataset for Tagalog
  Southeast Asian Language Processing (SEALP) Workshop @ IJCNLP-AACL ’23
  Lester James V. Miranda
  Code / Dataset / Slides / Video