Release v0.1.0 - Initial Release

2023/08/01

Hi everyone, I’m happy to share the first minor release of calamanCy!

This release adds our first tl_calamancy models with varying sizes to suit any performance or accuracy requirements. The table below shows more information about these pipelines.

Motivation: Tagalog NLP resources are disjointed

Despite Tagalog being a widely-spoken language here in the Philippines, model and data resources are still scarce. For example, our Universal Dependencies (UD) treebanks are tiny (less than 20k words) and domain-specific corpora are few and far between.

In addition, we only have limited choices when it comes to Tagalog language models (LMs). For monolingual LMs, the state-of-the-art is RoBERTa-Tagalog. For multilingual LMs, we have the usual XLM-RoBERTa and multilingual BERT. Tagalog is included in their training pool, but these models are still prone to the curse of multilinguality.

Therefore, consolidating these resources and providing more options to build Tagalog NLP pipelines is still an open problem. This is what I hope and endeavor to solve, as a Filipino and an NLP researcher!

Presenting the first calamanCy models

The models are hosted on Hugging Face, but you can also use the calamancy library to download and access them.

| Model | Pipelines | Description |
|---|---|---|
| tl_calamancy_md (73.7 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized Tagalog NLP model. Pretrained using the TLUnified dataset. Uses floret vectors (50k keys). |
| tl_calamancy_lg (431.9 MB) | tok2vec, tagger, morphologizer, parser, ner | CPU-optimized large Tagalog NLP model. Pretrained using the TLUnified dataset. Uses fastText vectors (714k keys). |
| tl_calamancy_trf (775.6 MB) | transformer, tagger, parser, ner | GPU-optimized transformer Tagalog NLP model. Uses roberta-tagalog-base as context vectors. |
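
As a quick sketch of the intended usage (assuming `pip install calamancy`; the exact versioned model name string may differ per release, and the sample sentence is purely illustrative):

```python
# Sketch: download and load a calamanCy pipeline, then run it on a sentence.
# The model name and example text are illustrative assumptions.
try:
    import calamancy

    # Downloads the model on first use and returns a spaCy Language object
    nlp = calamancy.load("tl_calamancy_md-0.1.0")

    doc = nlp("Pumunta si Juan sa Maynila kahapon.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
except ImportError:
    print("calamancy is not installed; run `pip install calamancy` first.")
```

Because the loaded object is an ordinary spaCy pipeline, the usual `Doc`, `Token`, and `Span` APIs all apply.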

Performance and baselines

Before calamanCy, you usually had two options if you wanted to build a pipeline for Tagalog: (1) piggyback on a model trained on a linguistically similar language (cross-lingual transfer) or (2) finetune a multilingual LM like XLM-R or multilingual BERT on your data (multilingual finetuning). Here, I want to check if calamanCy is competitive against these alternatives. I tested on the following tasks and datasets:

| Dataset | Task / Labels | Description |
|---|---|---|
| Hatespeech (Cabasag et al., 2019) | Binary text classification (hate speech, not hate speech) | Contains 10k tweets collected during the 2016 Philippine Presidential Elections labeled as hate speech or non-hate speech. |
| Dengue (Livelo and Cheng, 2018) | Multilabel text classification (absent, dengue, health, sick, mosquito) | Contains 4k dengue-related tweets collected for a health infoveillance application that classifies text into dengue subtopics. |
| TLUnified-NER (Cruz and Cheng, 2021) | NER (Person, Organization, Location) | A held-out test split from the annotated TLUnified corpora containing news reports. |
| Merged UD (Aquino and de Leon, 2020; Samson, 2018) | Dependency parsing and POS tagging | Merged version of the Ugnayan and TRG treebanks from the Universal Dependencies framework. |

For text categorization and NER, I ran the experiments for five trials and reported their average and standard deviation. For dependency parsing and POS tagging, I used 10-fold cross-validation because the combined UD treebank is still too small.
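
To make the reporting convention concrete, each cell of the form `88.90 (0.01)` in the tables below is the mean over trials with the standard deviation in parentheses. A minimal sketch (the five scores here are made up for illustration, not actual results):

```python
import statistics

# Hypothetical F1 scores from five trials (illustrative numbers only)
scores = [88.91, 88.88, 88.92, 88.90, 88.89]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation

# Reported in the tables as "mean (std)"
print(f"{mean:.2f} ({std:.2f})")
```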

The results show that our calamanCy pipelines are competitive (you can reproduce the results by following this spaCy project):

| Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
|---|---|---|---|---|---|
| tl_calamancy_md | 74.40 (0.05) | 65.32 (0.04) | 87.67 (0.03) | 76.47 | 54.40 |
| tl_calamancy_lg | 75.62 (0.02) | 68.42 (0.01) | 88.90 (0.01) | 82.13 | 70.32 |
| tl_calamancy_trf | 78.25 (0.06) | 72.45 (0.02) | 90.34 (0.02) | 92.48 | 80.90 |

We also evaluated cross-lingual and multilingual approaches in our benchmarks:

| Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
|---|---|---|---|---|---|
| uk_core_news_trf | 75.24 (0.05) | 65.57 (0.01) | 51.11 (0.02) | 54.77 | 37.68 |
| ro_core_news_lg | 69.01 (0.01) | 59.10 (0.01) | 02.01 (0.00) | 84.65 | 65.30 |
| ca_core_news_trf | 70.01 (0.02) | 59.42 (0.03) | 14.58 (0.02) | 91.17 | 79.30 |

| Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
|---|---|---|---|---|---|
| xlm-roberta-base | 77.57 (0.01) | 67.20 (0.01) | 88.03 (0.03) | 88.34 | 76.07 |
| bert-base-multilingual | 76.40 (0.02) | 71.07 (0.04) | 87.40 (0.02) | 90.79 | 78.52 |

Data sources

The table below shows the data sources used to train the pipelines. Note that the Ugnayan treebank is not licensed for commercial use while TLUnified is under GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application. I’d definitely want to gain access to commercial-friendly datasets (or develop my own). If you have any leads or just wanna help out, feel free to contact me by e-mail (ljvmiranda at gmail dot com)!

| Source | Authors | License |
|---|---|---|
| TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
| UD_Tagalog-TRG | Stephanie Samson, Daniel Zeman, and Mary Ann Tan | CC BY-SA 3.0 |
| UD_Tagalog-Ugnayan | Angelina Aquino | CC BY-NC-SA 4.0 |

Next steps

For the past few months, I found two annotators and did a small annotation project to re-annotate TLUnified. I learned a lot about this process and I’ll be sharing my learnings in a blog post very soon. In the medium-term, I want to re-annotate TLUnified again with more fine-grained entity types and perhaps create our own treebank.

I am still in the process of testing these models, so expect a few more patch releases in the future. I'm quite ahead of my self-imposed August deadline, but I want to release early and often, so here it is. If you find any issues, feel free to report them in the issue tracker.