
calamanCy: NLP pipelines for Tagalog

calamanCy is a Tagalog natural language preprocessing framework made with spaCy. Its goal is to provide pipelines and datasets for core NLP tasks such as dependency parsing, morphological analysis, part-of-speech tagging, and named entity recognition. calamanCy takes inspiration from other language-specific spaCy frameworks such as DaCy (Danish) and huSpaCy (Hungarian).

The name is based on calamansi, a citrus fruit native to the Philippines and used in traditional Filipino cuisine.

Running your first pipeline

First, install calamanCy, then pick the medium, large, or transformer model. The calamancy.load call below automatically downloads and loads the model you pass:

!pip install calamanCy

import calamancy

# Download and load the medium model, then run it on a Tagalog sentence
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz")

You can see all available calamanCy models in this 🤗 Hugging Face collection. Alternatively, you can install a model as a regular Python package and load it directly with spaCy:

!pip install spacy
!pip install https://huggingface.co/ljvmiranda921/tl_calamancy_md/resolve/main/tl_calamancy_md-any-py3-none-any.whl

import spacy
nlp = spacy.load("tl_calamancy_md")
doc = nlp("Ako si Juan de la Cruz")

To learn more about how calamanCy (or spaCy) processes your text, feel free to explore the official documentation.
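As a rough illustration of what a processed document holds, here is a minimal sketch that collects the annotations for the tasks listed earlier (part-of-speech tags, dependency labels, and named entities). The helper name summarize is hypothetical and not part of calamanCy. It relies only on the standard spaCy attributes token.text, token.pos_, token.dep_, doc.ents, and ent.label_, and which of them are populated depends on the components in the model you load.

```python
def summarize(doc):
    """Collect token-level and entity-level annotations from a spaCy Doc.

    `summarize` is a hypothetical helper for illustration only; it reads
    standard spaCy attributes and does not call any calamanCy-specific API.
    """
    # One (text, coarse POS tag, dependency label) triple per token
    tokens = [(token.text, token.pos_, token.dep_) for token in doc]
    # One (text, entity label) pair per named-entity span
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return {"tokens": tokens, "entities": entities}


# Usage, assuming calamanCy and the medium model are installed:
#   import calamancy
#   nlp = calamancy.load("tl_calamancy_md-0.1.0")
#   print(summarize(nlp("Ako si Juan de la Cruz")))
```

The actual tags and entity labels returned depend on the model version, so no expected output is shown here.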

Citation

If you’re using calamanCy in your paper, please cite our publication:

@inproceedings{miranda-2023-calamancy,
    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
    author = "Miranda, Lester James",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.nlposs-1.1/",
    doi = "10.18653/v1/2023.nlposs-1.1",
    pages = "1--7",
}
