Introducing FilBench: An Open LLM Evaluation Suite for Filipino
At the end of 2024, I wrote about my desiderata for Filipino NLP. One of them was evaluation: I said that most of “how we measure LLM capabilities in Filipino are anecdotal: we post a screenshot of ChatGPT writing in Filipino and claim that it already has that capability—we need a systematic approach to evaluating these models.” Fast forward to today, and I’m happy that we are now inching toward systematic evaluations for Filipino.
Without further ado, I introduce FilBench, an LLM Evaluation Suite for Filipino!
- Paper: https://arxiv.org/abs/2508.03523
- Code: github.com/filbench/filbench-eval
- Leaderboard: hf.co/spaces/UD-Filipino/filbench-leaderboard
What is FilBench?
FilBench is (1) a benchmark to test LLM capabilities in Filipino, and (2) a leaderboard to track the progress of LLM development for Philippine languages. When building FilBench, I imagined two audiences:
- The multilingual NLP research scientist who wants to test whether their new language model generalizes to other languages. FilBench aims to provide a robust and comprehensive evaluation suite for Filipino tasks and use cases.
- The language model developer who wants to know which language model fits best for the application they’re building. The FilBench leaderboard provides a detailed breakdown of a model’s capabilities, along with analyses of the parameter- and cost-efficiency of each model.
When building FilBench, we took stock of the datasets the Philippine NLP research community used to evaluate pre-ChatGPT language models. Although these datasets are good at probing language-system capabilities, they are ill-suited for evaluating current LLMs. We then curated them into four main categories: Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation.
Finally, FilBench has been accepted to EMNLP Main! EMNLP is one of the top NLP conferences in the field, and getting a paper published in this venue is competitive. We’re happy with the outcome, so catch Ely and Joseph as they present FilBench in Suzhou, China this November!
On Building FilBench
One thing I found exciting when building FilBench is that it felt like assembling the Avengers of Filipino NLP. It started with just the three of us (Ely, Conner, and me) working on the Data Is Better Together annotation project from HuggingFace. I had met them separately through other projects like SEACrowd and earlier correspondence. Then we onboarded Joseph and Blaise, who have been working on Filipino NLP since I entered the field. We’re just five right now, but hopefully next time there’ll be more of us.
I find grassroots projects like this appealing, given that all of us are volunteers spending our free time on FilBench. We automatically pass the “will-they-care-about-this-project” bar, which makes collaboration smoother. There were hurdles too, such as scrounging up compute for running evals or finding the right meeting time across five time zones. But the scrappiness of the group is energizing, and it allowed us to iterate and build FilBench with little to no resources.
In addition, it was nice working with the OpenEvals team at HuggingFace, who help maintain lighteval, the evaluation framework we used for FilBench. I’m also thankful to Nathan Habib, Clementine Fourrier, and Daniel van Strien for their feedback on the official HuggingFace blog post. FilBench is now also part of the official community evaluations in lighteval, so check it out there!
We have some interesting projects lined up and there’s definitely still some space on the bench. I’m very optimistic about the future of Filipino NLP research, and would love to broaden my collaborations more. So if you’re interested in collaborating with us, then reach out!
My next plans for Filipino NLP
For me, the next few months will bring a lot of change and adjustment. I’m currently vacationing in the Philippines until October, so I have yet to initiate new research projects for the latter half of the year. However, I haven’t been idle: I’ve been shaping up ideas and talking to people, so hopefully I’ll be back in action by Q4.
- On FilBench v2: I have a lot of new ideas for FilBench v2. However, my key priority for developing FilBench remains the same: improving the coverage of Philippine languages in the benchmark. More specifically, my metric of success for FilBench is to cover all major regional languages in the Philippines.

  We also have plans for a cost-efficient, mini version of FilBench, which would let model evaluators get quick feedback on their FilBench scores without running the full suite. The details are still being finalized, but the goal is to release FilBench-Mini by the end of the year.
- On Filipino NLP: Most of what I mentioned in my desiderata still holds. Now that I’m heading into my PhD, I’m also figuring out how to balance my interests in Filipino NLP with my general PhD research. Nevertheless, I believe our efforts on FilBench have proven that grassroots, volunteer-led research projects can be successful.
I’d like to collaborate with organizations that can provide compute. I have some ideas on low-resource LLM training and data curation (with Filipino as a case study). Training a Filipino-centric LLM is a topic I hear a lot, but I think executing and framing such a project requires some finesse, so that it becomes impactful and not “yet another” LLM that the next frontier model, like GPT-4.1, can easily beat.
Final thoughts
Overall, I’m very happy with the progress we’ve made on FilBench. It was a good project where I got to collaborate with contemporaries in Filipino NLP, and it resulted in a main-conference acceptance. Projects like these don’t stop after publication; in fact, the whole “Filipino NLP” project is just beginning. With FilBench, we now have a comprehensive and easy-to-run benchmark for LLMs. The next step is to figure out what kind of model (or more specifically, what type of pre/post-training data, architecture, and training paradigm) can top this leaderboard.
If you’re interested in working on language-specific evals or Filipino NLP in general, just shoot me an e-mail and let’s chat!
Citation
@article{miranda2025filbench,
title = "Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?",
author = "Miranda, Lester James Validad and
Aco, Elyanah and
Manuel, Conner and
Cruz, Jan Christian Blaise and
Imperial, Joseph Marvin",
journal = "arXiv preprint arXiv:2508.03523",
year = 2025
}