However, preference is subjective by nature, and few studies have tried articulating it. For example, some looked into different aspects of a response’s helpfulness / harmlessness (Bai et al., 2022) while others investigated surface-level characteristics like its length (Singhal et al., 2023).
In this blog post, I want to offer a different approach: what if, instead of looking at qualitative aspects or token-level features, we use sentence embeddings? Sentence embeddings capture a text’s lexical and semantic meaning in a high-dimensional vector space. Can we then ascertain lexical differences between chosen and rejected responses just by looking at their embeddings?
One reason this matters is synthetic data. I think it is easier to generate synthetic pairs conditioned on lexical distance (as opposed to some quality-based metric). There may be tasks and domains where generating with respect to cosine distance is plausible.
First, I sampled preference data across different sources. For bigger datasets such as SHP, I only took a particular subset I am interested in. The table below shows the sources I used:
Dataset | Description |
---|---|
OpenAI’s Summarize from Human Feedback (Stiennon et al., 2022) | Dataset used to train a summarization reward model. I used the comparisons subset where each instance represents a matchup between two summaries. |
Stanford Human Preferences Dataset (Ethayarajh et al., 2022) | Contains a collection of human preferences over responses to questions or instructions. I used the explainlikeimfive_train subset to represent OpenQA questions. |
Argilla’s Ultrafeedback Multi-Binarized Cleaned Dataset | A clean version of the original Ultrafeedback dataset (Cui et al., 2023). The cleanup process can be found in their writeup. |
Tatsu Lab’s Alpaca Farm (Dubois et al., 2023) | The human preference subset of the Alpaca Farm dataset. The researchers used this subset to compare their LLM judge’s preferences. |
Berkeley Nest Lab’s Nectar Dataset | Preference ranking dataset for training the Starling 7B reward model (Zhu et al., 2023), and consequently, the Starling 7B language model. |
For OpenAI’s Summarize and SHP, the preferences are in the form of individual matchups. To get the canonical chosen and rejected responses, I used the Elo rating system to obtain the top and bottom completions.
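To make that concrete, here is a minimal Elo sketch, not necessarily the exact procedure I used: it assumes each matchup is a (winner, loser) pair of completion IDs, and the K-factor and starting rating are arbitrary choices.

```python
from collections import defaultdict

def elo_ratings(matchups, k=32, start=1000.0):
    """Compute Elo ratings from a list of (winner, loser) matchups."""
    ratings = defaultdict(lambda: start)
    for winner, loser in matchups:
        # Expected score of the winner given the current ratings
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1 - expected_win)
        ratings[loser] -= k * (1 - expected_win)
    return ratings

# Hypothetical matchups between three completions for the same prompt
ratings = elo_ratings([("a", "b"), ("a", "c"), ("b", "c")])
chosen = max(ratings, key=ratings.get)    # top-rated completion
rejected = min(ratings, key=ratings.get)  # bottom-rated completion
```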
Given a set of preference data, I split the completions based on whether they were chosen (\(\mathbf{y}_w\)) or rejected (\(\mathbf{y}_l\)) by an evaluator—human or GPT, depending on the dataset. Then, I embedded them using sentence-transformers/all-MiniLM-L6-v2 to produce 384-dimensional sentence embeddings. Finally, for each row, I computed the distance (\(\mathbf{d}\)) between the chosen and rejected vectors. The figure below illustrates this process.
To compute the distances, I used the cosine distance from scipy.
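Here is a minimal sketch of the embedding and distance computation; the texts are placeholders and the per-dataset preprocessing is more involved than this.

```python
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chosen = "The summary covers the main points of the post."     # placeholder chosen response
rejected = "The post talks about a lot of different things."   # placeholder rejected response
emb_chosen, emb_rejected = model.encode([chosen, rejected])    # 384-dim vectors

d = cosine(emb_chosen, emb_rejected)  # cosine distance = 1 - cosine similarity
print(f"Cosine distance: {d:.3f}")
```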
Cosine distance compares the direction of two vectors rather than their magnitude, allowing us to capture similarity even if the lengths of the sentences or the overall frequencies of the words differ.
It is represented by the following equation:

\[\mathbf{d}(\mathbf{y}_w, \mathbf{y}_l) = 1 - \frac{\mathbf{y}_w \cdot \mathbf{y}_l}{\lVert \mathbf{y}_w \rVert \, \lVert \mathbf{y}_l \rVert}\]

where the distance value ranges from \([0, 2]\). Usually, when we talk about distances between preference pairs, we mean quality-based distances: they’re often in the form of rankings (i.e., take the top-1 and top-N) based on an evaluator’s assessment. Again, in this blog post we’re looking at lexical distances that are readily available from a text’s surface form. In the next section, I’ll discuss some interesting findings from these distance calculations.
Most of the charts I’ll be showing below are histograms. Here, the x-axis represents the cosine distance whereas the y-axis represents the probability density. We compute the probability density by normalizing the fraction of samples in each bin so that the sum of all bar areas equals 1. The best way to think about these values is in terms of chance: how likely is it that a random preference pair has a distance \(\mathbf{d}\) on the x-axis?
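As a quick illustration of that normalization (assuming matplotlib and placeholder distances, not my actual plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

distances = np.random.default_rng(0).uniform(0, 2, size=1_000)  # placeholder cosine distances
counts, bin_edges, _ = plt.hist(distances, bins=50, density=True)

# With density=True, each bar height is (fraction of samples in the bin) / (bin width),
# so the total area of all bars sums to 1.
assert np.isclose((counts * np.diff(bin_edges)).sum(), 1.0)
plt.xlabel("Cosine distance")
plt.ylabel("Probability density")
```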
The chart below shows the distribution across multiple preference datasets. AlpacaFarm and Nectar lie at the two extremes. AlpacaFarm is particularly interesting because its completions were generated by API-based LLMs using prompts that replicate human variability and agreement. I’m unfamiliar with how exactly they prompted the LLM, but does that mean their process resulted in similar-looking texts?
On the other hand, Nectar’s completions were a combination of LLM outputs (GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-instruct, LLaMa-2-7B-chat, and Mistral-7B-Instruct) alongside other existing datasets. Because Nectar formats its preferences in terms of ranking, the chosen and rejected pairs here represent the top and bottom choices.
Other datasets have distributions closer to what I expected. For example, OpenAI’s summarization dataset should have closer preference pairs because of the task’s inherent nature: summarization is about compressing a text while maintaining its information. Upon checking the actual preferences and corresponding evaluator notes, I noticed that rejected completions often differ mainly in recall, i.e., how much of the source information they keep.
Next, I looked into how Elo ranking corresponds to the cosine distance of the text embeddings. Preference datasets like OpenAI’s Summarization, SHP, and Berkeley-Nest’s Nectar represent their preferences as individual matchups, allowing us to compute the Elo rating of individual completions. Then, we can order these ratings to achieve a rank of completions from most preferable to least.
However, OpenAI’s Summarization and SHP have an unequal number of ranks per prompt \(\mathbf{x}\). So to simplify the visualizations, I took the chosen completion \(\mathbf{y}_w\), the second-ranked completion \(\mathbf{y}_{l,next}\), the middle performer \(\mathbf{y}_{l,mid}\), and the bottom performer \(\mathbf{y}_{l,last}\) (which is equivalent to \(\mathbf{y}_l\) in the previous section). On the other hand, Berkeley-Nest’s Nectar provides a 7-rank scale of preferences. This allowed me to compute the distance from the first choice to each of the remaining ones: \(\mathbf{d}(\mathbf{y}_1, \mathbf{y}_{2\ldots7})\). Then, I plotted these distances in a histogram (I only retained the curve so that the charts look cleaner), as seen below:
The cosine distances from the OpenAI Summarization preference dataset follow a certain pattern: completions that are closer in ranking have smaller lexical distance. The average mid ranking is 2.042 (with a 4.109 average number of ranks) and the Pearson correlation between the distances and Elo ranking is 0.779.
For the Stanford Human Preferences (SHP) dataset, I chose the explainlikeimfive subset to simulate OpenQA tasks.
Interestingly, it has a less pronounced visual correlation even though its Pearson-r is 0.785, slightly higher than OpenAI Summarization’s.
The average mid ranking is 1.967 with an average rank number of 4.600.
For Berkeley-Nest’s Nectar dataset, the rankings were already given so I didn’t have to compute my own. Here, the Pearson correlation is 0.818. If you look at the “chosen and rejected (2)” red line, you’ll notice that the cosine distances start very small but fall off afterwards. It is interesting that completions that performed similarly during matchups are quite similar to one another based on their embeddings.
Dataset | Number of ranks (avg) | Mid rank | Pearson-r Elo ranking | Pearson-r Elo rating |
---|---|---|---|---|
openai/summarize_from_feedback | 4.109 | 2.042 | 0.779 | -0.534 |
stanfordnlp/SHP | 4.600 | 1.967 | 0.7845 | -0.458 |
berkeley-nest/Nectar | 7.000 | 4.000 | 0.818 | - |
The table above shows the ranking statistics for each dataset. I also measured the Pearson correlation between the rejected text’s ranking (and Elo rating) and its embedding distance from the chosen text. The sign (+/-) corresponds to the direction of the correlation. For example, the negative sign in the last column shows that as a text’s Elo rating increases, its lexical distance from the chosen text decreases (i.e., they become more similar).
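The correlation itself is a one-liner with scipy; the lists below are hypothetical and only show the expected shape of the inputs (one rank or rating and one distance per rejected completion).

```python
from scipy.stats import pearsonr

ranks = [2, 3, 4, 5, 6, 7]                        # hypothetical ranks of rejected completions
distances = [0.08, 0.15, 0.21, 0.25, 0.31, 0.36]  # cosine distance from the chosen completion
r, p_value = pearsonr(ranks, distances)
print(f"Pearson-r: {r:.3f} (p={p_value:.3f})")
```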
Finally, I was curious how individual attributes of preference manifest in lexical distances using the HelpSteer dataset (Wang et al., 2023). Most datasets only give us a single view of human judgment, but HelpSteer provides fine-grained preference attributes such as helpfulness, correctness, coherence, complexity, and verbosity.
So, I did the same experiments for each of these attributes and found that the distributions didn’t change much, as shown in the figure below. I’m not quite confident about how I preprocessed this dataset. Unlike other preference datasets that use matchups, HelpSteer uses scores from 0 to 4, so some texts can end up having the same scores. Here, I simply sorted the texts by their score, designated the chosen text as the first one on the list (whatever Python’s sort function made it to be), and the rejected text as the last element.
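Here’s a rough sketch of that (admittedly naive) preprocessing using the helpfulness attribute as an example. The dataset ID and field names are what I recall from the HuggingFace release, so treat them as assumptions.

```python
from collections import defaultdict
from datasets import load_dataset

dataset = load_dataset("nvidia/HelpSteer", split="train")

responses_per_prompt = defaultdict(list)
for row in dataset:
    responses_per_prompt[row["prompt"]].append((row["helpfulness"], row["response"]))

pairs = []
for prompt, scored in responses_per_prompt.items():
    if len(scored) < 2:
        continue
    # Sort by score; ties keep whatever order the sort gives them
    scored.sort(key=lambda item: item[0], reverse=True)
    pairs.append({"prompt": prompt, "chosen": scored[0][1], "rejected": scored[-1][1]})
```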
I think there’s still a lot that can be done from this angle. One way is to format the data in terms of individual matchups. This process leads to a forced ranking, allowing us to easily designate the chosen and rejected pair. Since HelpSteer is the only such dataset we have (as far as I know), I’ll leave my analysis as is for now.
In this blog post, I presented a lexical view of preference pairs using embeddings. Across different preference datasets, I computed sentence embeddings and then measured the cosine distance between chosen and rejected pairs. I found that some datasets exhibit clear lexical differences and that these correlate with human judgment (i.e., Elo rating). Finally, using the HelpSteer dataset, I saw that cosine distances are consistent even across different attributes of preference.
This experiment is really just a curiosity as I work on RLHF. I’ve been doing some experiments on my job that are a little bit orthogonal to this work. I think this is just my way of exploring interesting avenues and scratching my itch. If you’re interested in this type of work, feel free to reach out and discuss!
You can find the source code for this work on GitHub!
This blog post is my (abridged) lecture in written format. You can find the slides here. Finally, thanks to Ryan Wesslen and Chang Hsin Lee for inviting me!
One of the major problems of the 21st century is disinformation. You’ll see it everywhere, from Facebook posts or X tweets to fake news websites! Combating disinformation is labor-intensive. Politifact, a fact-checking website, relies on volunteer journalists to scour the internet and manually label each source.
There are several efforts to automate the fact-checking process. A common approach is to treat it as an NLP pipeline composed of different tasks (Guo et al., 2022). Today, we will only focus on claim detection, the first step in an automated fact-checking pipeline.
Detecting claims is usually a dual problem: you’d also want to find the premises that support it. Together, the claim and its premises make up an argument. Applying NLP to this domain is often called argument mining. For this talk, I want to introduce two argument mining sub-tasks: (1) first, we want to highlight the claim and premise given a text (claim & premise extraction), and then, (2) we want to determine if a text supports, opposes, or is neutral to a certain topic (stance detection).
So, our general approach is to reframe these two sub-tasks as NLP tasks. First, we treat claim & premise extraction as a span labeling problem. We can use spaCy’s SpanCategorizer to obtain spans or arbitrary slices of text. Then, we treat stance detection as a text categorization problem. Similarly, we can use spaCy’s TextCategorizer to classify a text among our three stances (support, oppose, neutral).
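As a small sketch of how the two sub-tasks map to spaCy components (the label names here are illustrative, not the ones from the talk):

```python
import spacy

nlp = spacy.blank("en")

# Claim & premise extraction as span labeling
spancat = nlp.add_pipe("spancat")
spancat.add_label("CLAIM")
spancat.add_label("PREMISE")

# Stance detection as text categorization
textcat = nlp.add_pipe("textcat")
for label in ("support", "oppose", "neutral"):
    textcat.add_label(label)
```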
Notice how we’ve decomposed the general problem of disinformation into tractable NLP tasks. This is an important muscle to train: in computer science, we often learn about divide and conquer, and this is a good application of that approach to a fuzzier and, admittedly, more complex problem.
As we already know, training NLP models such as a span or text categorizer requires a lot of data. I want to talk about different methods of collecting this dataset and emphasize how LLMs can fit into this workflow.
Before we get into LLMs, I want to talk about “traditional” ways of annotating data. On the left end, we have manual processes involving much human effort and curation. And then, on the right, we have more automated methods that rely heavily on a reference or base model. LLMs, as advanced as they are, still fall in between. They’re not fully manual but also not fully automated because writing a prompt still requires tuning and domain expertise.
But why are we still interested in LLMs? It’s because LLMs provide something that most semi-automated methods can’t: a model pretrained on web-scale data, and a highly flexible zero-shot capability. Let me put this in a Venn diagram— and for each space in this diagram, I’ll talk about how LLMs can specifically help in our annotation workflows.
One of the most straightforward applications of large language models is bootstrapping labeled data. Here, an LLM is a drop-in replacement for a base model that you’d usually train. LLMs differ because they were pretrained on web-scale data, giving them enough capacity even for your domain-specific task. So, how good is an LLM annotator?
To test this question, I worked on a portion of the UKP Sentential Argument Mining corpus (Stab et al., 2018). It contains several statements across various topics, and the task is to determine whether the statement supports, opposes, or is neutral to the topic— a text categorization problem.
The process was simple: I included each statement in a prompt and asked GPT-3.5 what the stance was. You can read more about my process in this blog post. My findings show that LLMs, when prompted in a zero-shot manner, are competitive on a baseline that I trained on the original labels. In addition, I also found myself annotating faster (and more correctly) when correcting LLM annotations compared to annotating from scratch. The latter finding is important because correcting annotations induces less cognitive load and human effort (Li et al., 2023, Zhang et al., 2023).
So, if LLMs can already provide competitive annotations, is our problem solved? We don’t have to annotate anymore? Remember, the reason why we collect these annotations is so that we can train a supervised model that can reliably approximate the task we’re interested in. The operating word here is reliable. There’s a huge variance in LLM performance, and one way to thin out that curve is to insert it in a human-in-the-loop workflow (Dai et al., 2023; Boubdir et al., 2023; Wang et al., 2023).
Another way we can use LLMs for annotation is by taking advantage of their flexibility. LLMs have zero-shot capabilities, i.e., we can always frame structured prediction tasks such as text categorization or named entity recognition as a question-answering problem. Back then, you’d need to train separate supervised models to achieve multi-task skills. I want to use an LLM’s flexibility to enhance the annotation experience.
This time, I want to introduce two workflows. The first one is still a text categorization problem, but I want to ask an LLM to pre-highlight the claims and premises so I can reference them during annotation. For the second one, I want to ask the LLM to do the reasoning for me. I’ll let it identify the claims and premises, then pre-annotate an answer, and then give me a reason for choosing that answer. This exercise aims to explore creative ways we can harness LLMs.
The process is similar to the first section, but I prompt for auxiliary information instead of prompting for the direct labels. LLMs make this possible because we can formulate each task as a question-answer pair. You’ll find examples of my prompt in the slides below. The prompt on the left is a straightforward span labeling prompt, where we ask the LLM to provide the exact spans from a text. On the other hand, the prompt on the right is a chain-of-thought prompt (Wei et al., 2023). Here, we induce an LLM to perform a series of reasoning tasks to arrive at a final answer.
The good thing about Prodigy is that you can easily incorporate this extra information in your annotation UI. On the bottom left, you’ll see that it highlights the claims and premises for each statement, allowing you to focus on the relevant details when labeling. On the bottom right, you’ll find that the UI metadata now contains the prompt’s reasoning steps.
There are many creative ways to improve annotation efficiency (and quality) using LLMs. One of my favorite papers from EMNLP was CoAnnotating (Li et al., 2023), which uses an uncertainty metric to allocate annotation tasks between humans and a chat model such as ChatGPT. We’ve seen a lot of LLM-as-an-assistant applications in the market for the past year, and I think that there’s an opportunity to apply the same perspective to the task of annotation.
Finally, I’m curious how LLMs parse information originally intended for humans. In most annotation projects, researchers write an annotation guideline to set up the parameters of the labeling task. These guidelines aim to reduce uncertainty about the phenomenon we are annotating. We can even think of these as prompts for humans!
This time, I want to focus on a simple task: determine whether a statement is an argument. It sounds easy because it’s “just” a binary classification task. However, after looking through various argument mining papers and their annotation guidelines, I realized that they each have their definition of what makes an argument!
So, this got me thinking: what if we include the annotation guideline in the prompt? You can check my entire experiment in this blog post. Back then, you could not fit a whole document into an LLM’s limited context length, so I used a continuous prompting strategy that showed chunks of the document and let the LLM update its answer based on new information. Langchain calls this a “refine chain” in their docs. As an aside, I’ve opted to use minichain in my recent projects as it is more lightweight and enough for my needs.
Surprisingly, including an annotation guideline in the prompt led to worse results. I couldn’t dig deeper, but I hypothesize that prompts for LLMs have a particular “dialect” vastly different from how we talk as humans. Annotation guidelines were written with humans in mind, and perhaps some qualities don’t transfer properly into LLM prompts. There are many confounding factors, of course. Maybe the refine strategy is not the best, or maybe I should’ve processed the text better. An LLM’s prompt sensitivity is still an open problem.
But I learned one thing: we can use LLMs as a “first pass” when iterating over our annotation guidelines. Typically, you’d start with a pilot annotation with a small group of annotators as you write the guidelines, but there’s an opportunity to incorporate LLMs into the mix.
Before we end, I want to share an important question before you begin your annotation projects. You should always ask yourself: what is the label supposed to reflect? Knowing what you want to use the collected dataset for is paramount.
Rottger et al. (2022) named two paradigms for data annotation: prescriptive and descriptive. Prescriptive annotation is usually found in linguistic tasks such as named entity recognition or parts-of-speech tagging—where there is a “correct” answer for each instance. Here, you already have a function in mind and need to collect enough data to train a reliable model. On the other hand, descriptive annotation aims to capture the whole diversity of human judgment. You’d usually find this in subjective tasks like hate speech detection or human preference collection.
LLMs are pretty good at prescriptive annotation tasks. There is some empirical evidence that supports this (Ashok et al., 2023; Chen et al., 2023; Sun et al., 2023), and it allows us to tap into the web-scale data they were pretrained on.
And now to my final point: despite their web-scale pretraining and zero-shot capabilities, LLMs are only as good as how well you prompt them. During my early days in data science, there was this common adage: “garbage in, garbage out.” Usually, we say this when we want to refer to bad data. The problem with prompts is that the degree of freedom is much higher, which introduces ambiguity to our inputs. Hence, I don’t recommend using LLM outputs straight from the firehose and serving them immediately. There should be an intermediary step that minimizes this uncertainty, and that step is human annotation.
$ git clone git@github.com:myorg/repo.git
Cloning into 'repo'...
ERROR: The 'myorg' organization has enabled or enforced SAML SSO.
To access this repository, you must use the HTTPS remote with a
personal access token or SSH with an SSH key and passphrase that
has been authorized for this organization.
Visit https://docs.github.com/articles/authenticating-to-a-github-organization-with-saml-single-sign-on/ for more information.
First you need to generate your SSH key.
Sometimes, your organization will require you to generate a new one using your company email.
Nevertheless, the common denominator would be to run the ssh-keygen command below:
ssh-keygen -t ed25519 -C lj@myorg.org
This will generate a key pair in the form of id_ed25519 and id_ed25519.pub.
In Linux, you can find them in the ~/.ssh/ directory.
We need to upload the one with the .pub extension to GitHub.
Go to your GitHub Settings > SSH and GPG Keys > New SSH Key (or head to github.com/settings/keys).
Write a semi-descriptive title (I usually put the organization name), set the Key Type as “Authentication Key,” and copy the contents of id_ed25519.pub into the Key field.
First, test the connection by running:
$ ssh -T git@github.com
Hi username! You've successfully authenticated, but GitHub does
not provide shell access.
Then, start the SSH agent:
$ eval "$(ssh-agent -s)"
Agent pid 16935
It starts a background daemon and displays its process ID (in this case, 16935). We can then add our private keys while this agent is running.
$ ssh-add .ssh/id_ed25519
Identity added: .ssh/id_ed25519 (some other info)
At this point, you should now be able to clone your organization’s private repository. I haven’t really dug deep into why it errored out the first time; I assumed that keys are automatically added whenever I create them. Anyway, in case you also encountered this error, I hope this tutorial helps!
For now, I am just curious how they would look in our Tagalog NER dataset. My approach is similar to the blog post I mentioned. The only difference is that I’m using a trained RoBERTa Tagalog model to get the embeddings. Finally, there’s nothing new about these methods: dataset cartography has been an active area of NLP research for a while (Swayamdipta et al., 2020; Balasubramanian et al., 2020).
You can find the code and implementation here.
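If you just want the gist, here is a condensed sketch of the embedding and projection step. The RoBERTa Tagalog checkpoint name, the placeholder spans, and the mean-pooling choice are my assumptions; the actual implementation in the repo is more thorough.

```python
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

model_name = "jcblaise/roberta-tagalog-base"  # RoBERTa Tagalog checkpoint (adjust as needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

spans = ["Malacañang", "Juan de la Cruz", "Davao"]  # placeholder entity spans
with torch.no_grad():
    inputs = tokenizer(spans, padding=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state     # (batch, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding when pooling
    embeddings = ((hidden * mask).sum(1) / mask.sum(1)).numpy()

projected = TSNE(n_components=2, perplexity=2).fit_transform(embeddings)  # (batch, 2)
```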
Below you’ll find the t-SNE plot for all entities in our dataset. They are color-coded based on their type— Person (PER), Organization (ORG), and Location (LOC). When you hover over each point, you’ll see the span text, its label, and a portion of the sentence it belongs to. Feel free to explore around using the visualization tools from Plotly.
Figure: t-SNE plot for all entity labels in TLUnified-NER.
At first glance, we see that PER entities are clearly separated from ORG and LOC, whereas the other two have noticeable overlaps. My hunch here is that even if two entities have the same lexical properties, they have different semantic use. For example, Malacañang (the place where the Philippine President resides, akin to The White House) can either be an organization or location based on its usage. We can verify this observation by examining the confusion matrix of a simple transition-based NER model: it significantly misconstrues LOC entities as ORG (and vice-versa).
Embeddings set-up | ORG (relative error reduction) | LOC (relative error reduction) |
---|---|---|
Shared | \(+5\%\) | \(+3\%\) |
Context-sensitive | \(+12\%\) | \(+18\%\) |
To further test this “lexical-semantic confusion” hypothesis, I trained two additional models that account for a word’s position in the text. The first model uses spaCy’s shared token-to-vector layer that includes a dependency parser and POS tagger aside from NER. The hope is that by sharing information between these three components, our downstream NER model can learn how to disambiguate between these semantic differences. The second model uses a transformer network to obtain context-sensitive vectors. It is interesting, then, that the relative error for LOC ↔ ORG decreased when using these methods. Therefore, I highly recommend using context-sensitive techniques when training models on this dataset.
I also want to share interesting clusters from examining all labels. I’m literally just spitballing here: there’s nothing methodical aside from inspecting a few clusters and checking their neighbors. With that in mind, take these observations with a grain of salt.
Political clusters are intriguing. Some neighborhoods stood out to me. For example, the Nograles cluster is isolated from most PER entities. Its closest PER cluster is Arroyo, and the majority of its neighboring clusters include Mindanao, MILF, and some cities near Davao. My hunch is that most news stories in the corpus were written during a time when Prospero Nograles’s involvement in Davao and the Arroyo administration was apparent (he was the Speaker of the House).
Now, we’re entering speculative territory but it’s cool that you can at least draw political lines during the 2004-2010 administration. Of course, it’s hard to draw these lines because unlike the US, the Philippines has a multi-party system. It’s fun to point out but I admit that what I’ve been doing is just akin to a Rorschach test. If you’re looking for something more rigorous, I suggest reading the work of Rheault et al (2019). This led me to ask: can we predict shifts in political alliances from words alone? I think it is an interesting exercise— and especially challenging— given that political parties in the Philippines are not really defined by their ideologies.
Biases exist. I also noticed clusters that might potentially be sources of bias when training models from this dataset. For example, most news sources from Mindanao involve acts of terrorism from Abu Sayyaf and the Moro National Liberation Front. It is then unfortunate that entities such as Allah and Muslim are co-located within this neighborhood.
Personally, I’m interested to explore techniques to debias corpora from an embeddings standpoint. The works of Prost et al. (2019) and Kaneko et al. (2021) for gender bias come to mind.
Here, I plotted the embeddings for each entity type while categorizing them based on their span property: paren (if the span is preceded by a parenthesis), all_caps (if all characters in the span are in uppercase), initial (if the span is the first subword in the text), and plain as a catch-all category.
These classes are mutually exclusive, i.e., I automatically assign them based on the first property they fulfill.
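A small helper mirroring these rules might look like the sketch below; the two boolean flags stand in for checks on the surrounding document.

```python
def span_property(span_text: str, preceded_by_paren: bool, first_in_text: bool) -> str:
    """Assign exactly one category by checking the properties in order."""
    if preceded_by_paren:
        return "paren"
    if span_text.isupper():
        return "all_caps"
    if first_in_text:
        return "initial"
    return "plain"
```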
Most PER entities were categorized as plain, which is mostly expected. Although I find it interesting that there is a sizeable number of names made up of initials, such as FVR for Fidel V. Ramos or GMA for Gloria Macapagal Arroyo. Most of our benchmarks had little to no problem recognizing PER entities, and my hunch is that this is due to the straightforward and consistent structure of such entities.
The prevalence of initials in naming individuals usually stems from cultural influences in the Philippines. It is customary to use initials to refer to prominent figures, such as Presidents and CEOs. I have only scratched the surface on these entities, so feel free to explore the interactive plot!
It is cool that if you squint hard enough, you can see cities in the Philippines arranged according to their geographical location, based on their embeddings alone. Of course, there are still inconsistencies: Manila is located at the rightmost portion whereas Bulacan appears in the middle. Why would that be the case? My hunch is that in-country linguistic diversity is still apparent, and somewhat recoverable, through these embeddings. For example, parts of Mindanao are predominantly Muslim, which affects naming and proper noun patterns (you’ll often see instances of Muhammad or Al- in Mindanaoan names). These variations borne out of geographical differences may correlate with the linguistic clusters we now see.
My other theory is politics. Although there is a central government, regional politics still dominate. Politicians tend to co-occur with one another in news reports, especially if they belong to the same region. Perhaps these co-occurrences caused some of the “geographical separation” we see in our embeddings. Might be fun to explore in the future!
It’s nice that most government departments are members of the same cluster.
This can lead to improved accuracy in NER tasks, as a model can easily recognize and categorize these entities.
Hopefully, this can be a good visual cue that the embeddings have captured the underlying relationships between different organizations.
They are similar to PER entities, with the exception that they have a more recognizable orthographic “shape.”
For example, most organizations in news reports are acronyms.
And so, writers tend to give an organization’s full name first, followed by its shorthand (e.g., XXXXX XXXXX XXX (XXX)).
Visualizing embeddings is a nice exercise in “examining your data.” Although I have to admit that I made some huge leaps of logic while explaining my observations above (I’m still getting better at this!). In the future, I’m more interested in how these techniques can be applied and how we can test them via a more empirical approach. For example, we can get LLM embeddings, cluster them together, and examine the outliers. There might be cool applications for correcting annotations or annotating a dataset from scratch. For now, I think this is fun!
Parts of this work were published in the paper, “Developing a Named Entity Recognition Dataset for Tagalog”, at IJCNLP-AACL’s Southeast Asian Language Processing Workshop. Feel free to cite that paper in your own work.
A few weeks ago, I saw an interesting blog post from Thinking Machines where they ran Filipino tweets on GPT-4 for a sentiment analysis task. Their prompt was simple. They asked: “what is the sentiment of this tweet?” They obtained a weighted F1-score of 76%— pretty decent for a straightforward zero-shot approach. This inspired me to test LLM performance on other Tagalog NLP tasks, hence these experiments.
In this blog post, I will test how these large language models (LLMs) fare compared to standard finetuning techniques in Tagalog. I will be benchmarking them on the named entity recognition (NER) and text categorization datasets from the calamanCy project.
As a refresher, you can check the datasets I’m using in the table below. I didn’t include the Universal Dependencies (UD) treebanks this time because querying third-party APIs is getting too costly.
Dataset | Task / Labels | Description |
---|---|---|
Hatespeech (Cabasag et al., 2019) | Binary text classification (hate speech, not hate speech) | Contains 10k tweets collected during the 2016 Philippine Presidential Elections labeled as hatespeech or non-hate speech. |
Dengue (Livelo and Cheng, 2018) | Multilabel text classification (absent, dengue, health, sick, mosquito) | Contains 4k dengue-related tweets collected for a health infoveillance application that classifies text into dengue subtopics. |
TLUnified-NER (Cruz and Cheng, 2021) | NER (Person, Organization, Location) | A held-out test split from the annotated TLUnified corpora containing news reports. |
I wrote a zero-shot prompt and ran it on the test set. Zero-shot prompting only requires a task description for inference. Few-shot prompting is out of scope for this blog post: it’s too laborious to engineer the prompts, and it might be difficult to compare them properly. I ran the experiments for three trials and report the mean and standard deviation to account for variance. The prompt text is still in English to be consistent with the Thinking Machines blog post.
Finally, I am using spacy-llm throughout the experiments. I highly recommend trying spacy-llm if you’re building production-grade LLM pipelines. You can find and reproduce my work on Github! (Full disclosure: I used to contribute to earlier versions of spacy-llm as part of my work at Explosion)
The spacy-llm library provides a set of built-in prompt templates for zero- and few-shot prompting.
These prompts are categorized and versioned per task.
You can view them by checking the configuration file in the Github repo and looking at the components.llm.task section.
For example, in NER, we have something like this:
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PER","ORG","LOC"]
label_definitions = {"PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION OR GEOPOLITICAL ENTITY"}
Here, spacy.NER.v2 points to a task with its own prompt.
From there, you can check the documentation and cross-reference the template (tip: check the template argument in the docs).
For NER, we have this Jinja2 file.
At runtime, spacy-llm renders our config to the Jinja2 template, thereby producing the final prompt sent to the LLM:
You are an expert Named Entity Recognition (NER) system. Your task is
to accept Text as input and extract named entities for the set of
predefined entity labels. From the Text input provided, extract named
entities for each label in the following format:
PER: <comma delimited list of strings>
ORG: <comma delimited list of strings>
LOC: <comma delimited list of strings>
Below are definitions of each label to help aid you in what kinds of
named entities to extract for each label. Assume these definitions are
written by an expert and follow them closely.
PER: PERSON
ORG: ORGANIZATION
LOC: LOCATION OR GEOPOLITICAL ENTITY
Text:
'''
Pumunta si Juan sa Japan.
'''
I won’t be pasting the prompts for binary and multilabel text categorization here to save space. Again, the best way to view them is to check my configuration files and cross-reference them with the prompt templates in the spacy-llm repository.
Lastly, some spacy-llm tasks provide additional arguments such as label_definitions for explicitly describing a label to an LLM, and examples for incorporating exemplars in few-shot prompting.
The library covers most of the core NLP tasks (NER, text categorization, and lemmatization) and seems to be adding more in the natural language understanding (NLU) space (e.g., summarization).
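Once the config is complete (the snippet above only shows the task; you also need a [components.llm.model] section), running the pipeline looks roughly like this:

```python
from spacy_llm.util import assemble

nlp = assemble("config.cfg")  # path to the full spacy-llm config
doc = nlp("Pumunta si Juan sa Japan.")
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g., [('Juan', 'PER'), ('Japan', 'LOC')]
```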
I tested on a variety of decoder-only large language models, from commercial ones like GPT-4 to open-source models like Dolly. The table below reports the results (Metrics: macro F1-score for Dengue and Hatespeech and F1-score for TLUnified-NER):
LLM | Dengue | Hatespeech | TLUnified-NER |
---|---|---|---|
OpenAI (gpt-4) | \(\mathbf{62.04 (0.20)}\) | \(45.74 (1.16)\) | \(\mathbf{65.89 (0.44)}\) |
OpenAI (gpt-3.5-turbo) | \(51.21 (0.38)\) | \(\mathbf{73.90 (0.27)}\) | \(53.05 (0.42)\) |
Anthropic (claude-1) | \(35.85 (0.02)\) | \(58.70 (0.03)\) | \(58.88 (0.03)\) |
Cohere (command) | \(39.27 (0.64)\) | \(16.38 (0.88)\) | \(25.48 (0.11)\) |
Databricks (dolly-v2-7b) | \(27.26 (0.40)\) | \(32.30 (0.18)\) | \(13.07 (0.14)\) |
TII (falcon-7b) | \(14.77 (0.35)\) | \(33.00 (0.11)\) | \(8.65 (0.04)\) |
Stability (stablelm-base-alpha-7b) | \(15.56 (0.08)\) | \(32.17 (0.24)\) | \(00.25 (0.03)\) |
OpenLM (open_llama_7b) | \(15.24 (0.43)\) | \(32.18 (0.73)\) | \(15.09 (0.48)\) |
For comparison, the table below shows the results for the word vector and encoder-only transformer-based pipelines from calamanCy. Both were trained using good old-fashioned supervised learning. I also included the results from finetuning XLM-RoBERTa (Conneau et al., 2019) and multilingual BERT (Devlin et al., 2019). You can read more about these pipelines in this blog post.
Pipeline | Dengue | Hatespeech | TLUnified-NER |
---|---|---|---|
Large word-vector (tl_calamancy_lg) | \(68.42 (0.01)\) | \(75.62 (0.02)\) | \(88.90 (0.01)\) |
Transformer-based (tl_calamancy_trf) | \(72.45 (0.02)\) | \(78.25 (0.06)\) | \(90.34 (0.02)\) |
XLM-RoBERTa (xlm-roberta-base) | \(67.20 (0.01)\) | \(77.57 (0.01)\) | \(88.03 (0.03)\) |
Multilingual BERT (bert-base-multilingual) | \(71.07 (0.04)\) | \(76.40 (0.02)\) | \(87.40 (0.02)\) |
The graph below shows a better visual of our results. The grey bars represent our large language models while the red bars represent the supervised ones.
It is apparent that our supervised approach outperformed zero-shot prompting in our datasets. These results are consistent with the findings of the BigScience group (Wang et al., 2022), where they showed that although decoder-only models trained on an autoregressive objective exhibited the strongest zero-shot generalization, they’re still outperformed by models trained via masked language modeling followed by multitask finetuning. I think there are two major reasons why LLMs underperform on Tagalog:
Conceptual gap: text generation != text understanding. I argue that one common misconception with LLMs is that we equate the generation of coherent text with language understanding. Just because an LLM can “speak” coño or jejemon doesn’t mean it understands linguistic grammar. LLMs are, after all, stochastic parrots (Bender et al., 2021). They might be performant on leaderboards, but they can cause harm that goes unnoticed when used liberally. In addition, our decoder-only LLMs may not be suited to our structured prediction benchmarks.
As an aside, this may also be a call for building NLU-benchmarks for low-resource corpora. If you’re working on something related, I’m interested to help so feel free to reach out!
Data gap: Tagalog is still underrepresented in most LLM training corpora. Training an LLM requires a large data mixture. Datasets in this mixture usually include CommonCrawl, The Pile, C4, and Wikipedia among many others. Most of these datasets are heavily Anglocentric. For example, the Pile dataset is English-only while CommonCrawl is dominated by Western languages (Tagalog is at a mere \(0.0073\%\)).
Unfortunately, Tagalog is underrepresented even in multilingual LMs. For example, the XGLM language model (Lin et al., 2022) was only trained on 2.3k tokens of Tagalog data whereas the Bloom language model (Scao et al., 2022) doesn’t contain any Tagalog text at all. There’s still a long way to go. Currently, there are efforts such as Cohere’s Aya project that aim to close that multilingual gap.
Given these gaps, I think that the best way to use LLMs in the context of low-resource languages is to maximize information-per-query (IPQ). Yes, I made that one up. The idea is to extract higher quality information for every query from an LLM. Here, I define quality as reusability, i.e., something that can still be refined further for other downstream tasks. I’d even argue that NLU-based outputs such as summarization, common-sense reasoning, and question-answering have inherently high information bandwidth (and hence high IPQ) because they tap into the LLM’s interconnections from its very large training corpora.
For example, using raw LLM outputs as the final predictions for a structured prediction task (NER, text categorization, sentiment analysis) has low IPQ. This is because we already exhausted the lifetime of our query by serving it directly in our system. Asking an LLM to tag a text as hatespeech or not may not be the most efficient use of its capabilities.
We can increase IPQ by using these predictions to assist data annotation, thereby producing a supervised model with a more deterministic performance. Other ways to maximize IPQ include prompting an LLM in a chain-of-thought style to elicit “reasoning” (Wei et al., 2022) or building a knowledge graph from its “internal” model (Cohen et al., 2023): anything that utilizes an LLM to its fullest potential in a single query.
I admit that this made-up measure is rough at best. For a more thoughtful reading, I highly recommend Eugene Yan’s blog post on LLM patterns. Notice that most of these patterns aim to maximize IPQ. I also recommend Vicki Boykis’s reflection as it echoes what I feel towards ChatGPT in this age of AI hype.
I hope that this blog post is a more sober view of LLM capabilities for Tagalog. It would be great to live in a world where we don’t need to build corpora, but I don’t think we’re there yet. I still believe that LLMs have a use for structured prediction tasks, such as in annotation or as a silver-standard knowledge-base.
Personally, I’m interested in building parallel corpora so that we can have a comprehensive view of an LLM’s multilingual performance. I’m also curious about the various ways we can maximize the information obtained from LLMs and use it for downstream tasks. Finally, there might also be a good argument for building an LLM geared towards low-resource languages; I do think that it is a worthwhile endeavour.
Update (2023-12-06): This work was published in the EMNLP NLP-OSS workshop! Check out “calamanCy: A Tagalog Natural Language Processing Toolkit”!
I am excited to introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog.
It is built on top of spaCy to ensure easy experimentation and integration with other frameworks.
It provides general-purpose multitask models with out-of-the-box support for dependency parsing, part-of-speech (POS) tagging, and named entity recognition (NER).
The repository is available on Github, and you can install the package via pip:
pip install calamancy
import calamancy
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz.") # returns a spaCy Doc object
More importantly, calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. In this blog post, I want to talk about the problem it’s trying to solve, my process on building the framework, some benchmarks, and future work.
calamanCy offers two word vector-based pipelines and one transformer-based pipeline. Suppose we want to detect named entities from a given text (NER):
import calamancy
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Pumunta si Juan sa Japan kahapon.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # (Juan, PER), (Japan, LOC)
Here, the variable nlp is an instance of spaCy’s Language class while doc is a spaCy Doc object.
Each pipeline contains a dependency parser, tagger, and entity recognizer.
The built-in entity recognizer was trained with annotations resembling CoNLL. It contains entities such as Person, Organization, and Location.
You can use these models as-is or finetune them on your dataset. In a later section, I’ll demonstrate these capabilities by benchmarking our models on both seen and unseen tasks. For more information on model training, please check out the spaCy documentation.
Despite Tagalog being a widely-spoken language here in the Philippines, model and data resources are still scarce. For example, our Universal Dependencies (UD) treebanks are tiny (less than 20k words) (Samson, 2018; Aquino and de Leon, 2020) and domain-specific corpora are few and far between (Cabasag et al., 2019; Livelo and Cheng, 2018).
In addition, we only have limited choices when it comes to Tagalog language models (LMs). For monolingual LMs, the state-of-the-art is RoBERTa-Tagalog (Cruz and Cheng, 2021). For multilingual LMs, we have the usual XLM-RoBERTa (Conneau et al., 2019) and multilingual BERT (Devlin et al., 2019). Tagalog is included in their training pool, but these models are still prone to the curse of multilinguality.
Therefore, consolidating these resources and providing more options to build Tagalog NLP pipelines is still an open problem.
calamanCy provides three language pipelines that fit any performance or accuracy requirements. Each pipeline provides out-of-the-box support for core NLP tasks:
Pipeline | Pretraining objective | Word embeddings | Dimensions |
---|---|---|---|
Medium-sized pipeline (tl_calamancy_md) | Predict some number of leading and trailing UTF-8 bytes for the words. | Uses floret static vectors trained on the TLUnified corpora. | 50k unique vectors (200 dimensions), Size: 77 MB |
Large-sized pipeline (tl_calamancy_lg) | Same pretraining objective as the medium-sized pipeline. | Uses fastText static vectors trained on CommonCrawl corpora. | 714k unique vectors (300 dimensions), Size: 455 MB |
Transformer-based pipeline (tl_calamancy_trf) | No separate pretraining because there’s no token-to-vector component. | Context-sensitive vectors from a transformer network. | Uses roberta-tagalog-base. Size: 813 MB |
The training process involves pretraining a filtered version of TLUnified (Cruz and Cheng, 2021), constructing static word embeddings if necessary, and training the downstream components.
Each pipeline contains Tagger, DependencyParser, Morphologizer, and EntityRecognizer spaCy components.
These models are also available on HuggingFace 🤗.
You can reproduce the whole training procedure by running the corresponding spaCy project on Github. Finally, you can find a list of data sources in the table below:
Source | Authors | License |
---|---|---|
TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
UD_Tagalog-TRG | Stephanie Samson | CC BY-SA 3.0 |
UD_Tagalog-Ugnayan | Angelina Aquino and Franz de Leon | CC BY-NC-SA 4.0 |
Note that the Ugnayan treebank is not licensed for commercial use while TLUnified is under GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application.
Before calamanCy, you usually have two options if you want to build a pipeline for Tagalog: (1) piggyback on a model trained from a linguistically-similar language (cross-lingual transfer) or (2) finetune a multilingual LM like XLM-R or multilingual BERT on your data (multilingual finetuning). Here, I want to check if calamanCy is competitive enough against these alternatives. I tested on the following tasks and datasets:
Dataset | Task / Labels | Description |
---|---|---|
Hatespeech (Cabasag et al., 2019) | Binary text classification (hate speech, not hate speech) | Contains 10k tweets collected during the 2016 Philippine Presidential Elections labeled as hatespeech or non-hate speech. |
Dengue (Livelo and Cheng, 2018) | Multilabel text classification (absent, dengue, health, sick, mosquito) | Contains 4k dengue-related tweets collected for a health infoveillance application that classifies text into dengue subtopics. |
TLUnified-NER (Cruz and Cheng, 2021) | NER (Person, Organization, Location) | A held-out test split from the annotated TLUnified corpora containing news reports. |
Merged UD (Aquino and de Leon, 2020; Samson, 2018) | Dependency parsing and POS tagging | Merged version of the Ugnayan and TRG treebanks from the Universal Dependencies framework. |
For text categorization and NER, I ran the experiments for five trials and reported their average and standard deviation. For dependency parsing and POS tagging, I used 10-fold cross-validation because the combined UD treebank is still too small.
The results show that our calamanCy pipelines are competitive (you can reproduce the results by following this spaCy project):
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
tl_calamancy_md | \(74.40 (0.05)\) | \(65.32 (0.04)\) | \(87.67 (0.03)\) | \(76.47\) | \(54.40\) |
tl_calamancy_lg | \(75.62 (0.02)\) | \(68.42 (0.01)\) | \(88.90 (0.01)\) | \(82.13\) | \(70.32\) |
tl_calamancy_trf | \(78.25 (0.06)\) | \(72.45 (0.02)\) | \(90.34 (0.02)\) | \(92.48\) | \(80.90\) |
We also evaluated cross-lingual and multilingual approaches in our benchmarks:
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
uk_core_news_trf | \(75.24 (0.05)\) | \(65.57 (0.01)\) | \(51.11 (0.02)\) | \(54.77\) | \(37.68\) |
ro_core_news_lg | \(69.01 (0.01)\) | \(59.10 (0.01)\) | \(02.01 (0.00)\) | \(84.65\) | \(65.30\) |
ca_core_news_trf | \(70.01 (0.02)\) | \(59.42 (0.03)\) | \(14.58 (0.02)\) | \(91.17\) | \(79.30\) |
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
xlm-roberta-base | \(77.57 (0.01)\) | \(67.20 (0.01)\) | \(88.03 (0.03)\) | \(88.34\) | \(76.07\) |
bert-base-multilingual | \(76.40 (0.02)\) | \(71.07 (0.04)\) | \(87.40 (0.02)\) | \(90.79\) | \(78.52\) |
I highly recommend trying out calamanCy and giving your feedback so that I can improve the models over time. My priority for 0.2.0+ is to write domain-specific tokenizers to help with simple NLP tasks (e.g., for parsing tweets). I also want to do a few more data annotation projects as a precursor to v1.0.0. These projects include a more fine-grained label set for NER, and a better Universal Dependencies Treebank.
On the research side, I’m curious how large language models fare on Tagalog data. We’ve already built up some nice benchmarks because of this effort so it might be nice to do a side-by-side comparison for zero/few-shot prompting. I’m also interested in training language-specific adapters for efficiency.
If you have any questions, feel free to reach out on Github or my email!
Parts of this work were published in the paper, “Developing a Named Entity Recognition Dataset for Tagalog”, at IJCNLP-AACL’s Southeast Asian Language Processing Workshop. Feel free to cite that paper in your own work.
First off, I’m happy to see such warm reception to my first blog post. Thank you! There are a few more experiments that I wanted to do for the sake of completeness and rigor. I hope to release the alpha version of calamanCy in August, so this blog post may as well be a lead-up to that release.
This blog post can be summarized as: “young and naive researcher just learned something very obvious!” I am mostly referring to the annotation process. Luckily, I was able to find two more folks to help with data annotation, and we’ve been updating the original dataset for the past three months.
Tl;dr, we just finished re-annotating TLUnified with NER tags! You can access the updated corpus, TLUnified-NER, in HuggingFace Datasets.
It’s harder to imagine this when you’re annotating alone. In fact, the final diagram in my February blog post shows this misconception. We don’t just annotate a thousand examples until we’re tired and call it a day. Instead, annotation is iterative:
Nils Reiter’s blog post has been my annotation bible for the past few months. The figure above is a simplified version of his annotation workflow. We start by creating a set of pilot annotations and then continually iterate until we reach a stop condition. For each round, we add new annotations while correcting our past annotations. This process makes everything a bit more involved, but at least we get higher quality annotations in the end.
We tried to annotate 800-1000 examples for each round and ensured that each annotator gets the same batch of texts.
Labeling that amount of data takes one-and-a-half to two weeks at most.
During the pilot phase, we used the annotation guidelines I initially developed for myself.
As for our software, we used Prodigy with the ner.manual recipe.
Note that for each round, we added more examples to the corpus. After six to seven syncs over the course of four months, we arrived at our target dataset size. Finally, we also tried to correct our past annotations based on our revisions to the annotation guidelines. However, there were no checks or QA for these corrections, so I wasn’t able to track their diffs.
The evaluation step is perhaps the most crucial part of the annotation process. Ultimately, the goal is to improve the annotators’ understanding of the “phenomena.” This step usually involves the following activities:
Resolving disagreements / misconceptions: I compiled the annotations and computed for a partial inter-annotator agreement score (IAA). This process allowed me to estimate if our annotations are improving in quality.
For named-entity recognition (NER), it is not straightforward to compute for this metric (I used Cohen’s Kappa). It is possible to compute for this value at the token level, but this leads to an imbalanced dataset (e.g., there are many unlabeled tokens). So I followed what Deleger et al., (2012) and Brandsen et al., (2020) did and computed for the pairwise F1-score as well.
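A rough sketch of that pairwise computation is below; it assumes each annotator’s work is a list of documents, where every document is a set of (start, end, label) span tuples. Swapping the annotators only swaps precision and recall, so the F1-score stays the same.

```python
def pairwise_f1(annotator_a: list[set], annotator_b: list[set]) -> float:
    """Pairwise F1 between two annotators over the same documents."""
    tp = fp = fn = 0
    for spans_a, spans_b in zip(annotator_a, annotator_b):
        tp += len(spans_a & spans_b)   # spans both annotators marked
        fp += len(spans_b - spans_a)   # spans only annotator B marked
        fn += len(spans_a - spans_b)   # spans only annotator A marked
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```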
Mini-retrospective meetings: During the initial months of the annotation phase, I conducted sync meetings to discuss confusing examples and labels. Confusion usually came from edge-case examples or vagueness in the annotation guidelines. We try to resolve this by updating the guidelines or correcting our past annotations.
I try to make these meetings as short and async as possible (30 minutes to 1 hour). I pattern these meetings to a typical software development sprint retrospective.
We use Parabol (I co-opted the “Start, Stop, Continue” free template) as our collaboration software. We frame the questions like so:
Personally, I enjoy discussing rules as it compels me to establish a coherent pattern when labeling examples. We try to keep a “bank” of edge-cases and work together to address them. However, if we focus too much on individual examples, our meetings may become inundated with edge cases, hindering the improvement of the guidelines.
Assess if we need more annotations: For this annotation project, I have two stop conditions: (1) if the train curve doesn’t improve or (2) if we reached at least 5000 examples.
Prodigy provides a train-curve command to check if we still need more examples by training a model on 25%, 50%, and 75% of the training set.
For the most part, the trend points to us annotating more data, but my budget is running low and I have other things to do, so I stopped after we reached 7000 examples. I’m definitely game to annotate a few more, but in the future I’d want to include other useful labels such as morphological features or parts-of-speech (something for Universal Dependencies) in my next annotation project.
Here’s a chart on how our IAA metrics improved over time. These numbers don’t factor in our corrections. I’m just computing the metrics per batch as I receive them.
Finally, I found it helpful to have “north star questions” as I evaluate our annotations. It is easy to get bogged down by details and miss the bigger picture. These north star questions include:
In the future, I think it would be better to do a more fine-grained annotation project. I realized later in the project that my entity types are too general that we tend to lump several categories in a single label. I think that’s the next phase of improvement that I can do next.
Updating the annotation guidelines allows us to codify our understanding of the task. Personally, it is also a good way to preserve our learnings and onboard future annotators in the project. Most of the updates involve adding new examples, clarifying definitions, and noting edge-cases.
I’ll be updating the calamanCy repository with the latest version of the annotation guidelines. It has grown quite a bit (and we’re using Google Docs for easier collaboration), but it might be good to transfer it in the open.
I’m happy to see the growth of the project from its inception a few months ago. I’m also excited to share these updates and release the trained pipelines soon. Finally, I’d like to thank our annotators for helping out in the process.
In this blog post, I want to share my notes on parameter-efficient finetuning techniques (PEFT). Here, we only finetune a small number of parameters while keeping most of the LM parameters frozen. As a result, PEFT allows domain adaptation at a lower compute (and storage) cost. Lastly, this blog post is not a literature review; I will only discuss methods I personally like. For each PEFT, I will talk about its overview, related works, and high-level implementation.
To recap, pretrained language models like BERT (Devlin et al., 2019) contain contextualized word representations that capture the meaning of each token and its context within the text. By themselves, they’re already useful. However, language models have enjoyed greater versatility and state-of-the-art performance because of finetuning (Howard and Ruder, 2018).
Most of the pretrained LMs we use today are based on transformer networks (Vaswani et al., 2017). Let's review the architecture, as it will help us understand the PEFT techniques later on. Recall that most transformer networks consist of a stack of encoder and decoder layers with an attention mechanism:
The encoder layer consists of two sub-layers: an attention layer and a feedforward network. Its outputs are passed to the decoder, which consists of the same two sub-layers plus a cross-attention mechanism that attends to the encoder's output. Between each sub-layer, there is a residual (or skip) connection that is normalized through LayerNorm (Ba et al., 2016).
For transformers like BERT, there is no generative step, hence the model contains only encoder layers. Here's what a typical encoder layer looks like:
# Pseudocode of a typical encoder layer
class Encoder:
    def __call__(self, x: Tensor) -> Tensor:
        # Attention sub-layer with a residual connection and LayerNorm
        residual = x
        x = MultiHeadAttention(x)
        x = LayerNorm(x + residual)
        # Feedforward sub-layer with a residual connection and LayerNorm
        residual = x
        x = FeedForwardLayers(x)
        x = LayerNorm(x + residual)
        return x
This encoder-only transformer uses multi-head attention, where the attention function is applied in parallel over \(N_h\) heads. So given (1) a sequence of \(m\) vectors \(\mathbf{C}\in \mathbb{R}^{m\times d}\) over which we attend, and (2) a query vector \(\mathbf{x} \in \mathbb{R}^{d}\), multi-head attention computes the output on each head and concatenates them:
\[MultiHeadAttention(\mathbf{C}, \mathbf{x}) = Concat(head_1, \dots, head_{N_h})\\ where~~head_i = Attention(\mathbf{x}\mathbf{W}_q^{(i)}, \mathbf{C}\mathbf{W}_k^{(i)}, \mathbf{C}\mathbf{W}_v^{(i)})\]

The feedforward network consists of two linear transformations with a ReLU activation in between:

\[Feedforward(\mathbf{x}) = ReLU(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\]

One common way to finetune is to attach a task-specific head at the end of a pretrained language model, then train the entire network on our labeled data.1 Although these LMs were trained on different objectives (for example, BERT was trained on masked language modeling and next sentence prediction), it is possible to refine their weights for other NLP problems (e.g., sentence tagging, sequence classification, etc.).
However, this process has become inefficient (Treviso et al., 2023). The number of parameters in pretrained language models has increased exponentially, exacerbating concerns about an LM's environmental impact (Strubell et al., 2019) and making these models inaccessible in resource-constrained, consumer-grade environments. Hence, an efficient approach to finetuning is becoming more desirable.
Recently, I’ve been interested in parameter-efficient techniques (PEFT) that are modular in nature. Most of these involve creating small, trainable modules (sometimes a feedforward network or a special matrix) that we attach to different parts of a larger, pretrained network. Usually, we keep the larger network frozen while we train the smaller modules on our data.
Most [parameter-efficient techniques] involve creating small, trainable modules that we attach to different parts of a larger, pretrained network.
I like parameter-efficient and modular techniques because they appeal to my engineering sensibilities. I like the “separation of concerns” between our network’s fundamental understanding about language (pretrained network) and its task-specific capabilities (modular network), as opposed to training a single monolithic model. We can even aggregate these modules to solve multi-domain (Gururangan et al., 2022; Chronopoulou et al., 2023; Asai et al., 2022; Pfeiffer et al., 2021) or multilingual problems (Pfeiffer et al., 2020; Pfeiffer et al., 2021).
I won’t be talking about every technique in this post. Instead, I’ll focus on methods that I like. For a more comprehensive overview, I highly recommend looking at Treviso et al.’s (2022) survey of efficient NLP techniques and Pfeiffer et al.’s (2023) work on modular deep learning.
Houlsby et al. (2019) first proposed the use of adapters in the context of NLP. The idea is to attach a small feedforward network after each transformer sub-layer and tune it while keeping the larger network frozen:
# Pseudocode of a typical encoder layer with adapter
class EncoderWithAdapter:
def __call__(self, x: Tensor) -> Tensor:
residual = x
x = MultiHeadAttention(x)
x = AdapterNetwork(x) # Usually another feedforward layer
x = LayerNorm(x + residual)
residual = x
x = FeedForwardLayers(x)
x = AdapterNetwork(x) # Usually another feedforward layer
x = LayerNorm(x + residual)
return x
The adapter consists of a bottleneck architecture with a down- and up-projection, similar to autoencoders. It also contains an internal skip-connection. According to Houlsby et al. (2019), they were able to limit the number of trainable parameters to \(0.5-8\%\) of the original model.
# Pseudocode of an adapter network
class AdapterNetwork:
    def __call__(self, x: Tensor, m: int) -> Tensor:
        residual = x
        d = get_dims(x)                                   # hidden dimension of the transformer
        x = DownProjection(x, input_dim=d, output_dim=m)  # bottleneck: project d -> m
        x = NonLinearity(x)
        x = UpProjection(x, input_dim=m, output_dim=d)    # project back: m -> d
        return x + residual                               # internal skip-connection
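As a rough sanity check on that range (my own back-of-the-envelope numbers, not from the paper), we can count the trainable parameters for a BERT-base-sized encoder with a bottleneck of 64 and two adapters per layer:

# Back-of-the-envelope adapter parameter count (assumed sizes, not from the paper)
n_layers, d, m = 12, 768, 64                             # BERT-base-ish: 12 layers, hidden dim 768, bottleneck 64
params_per_adapter = (d * m + m) + (m * d + d)           # down-projection + up-projection, with biases
total_adapter_params = params_per_adapter * 2 * n_layers # two adapters per encoder layer
base_params = 110_000_000                                # approximate size of BERT-base
print(f"{total_adapter_params:,} trainable params "
      f"({total_adapter_params / base_params:.1%} of the base model)")
# ~2.4M trainable parameters, roughly 2% of the base model

This ignores LayerNorm and any adapter-internal normalization parameters, but it is enough to see why the approach is cheap.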
Since then, researchers have proposed alternative adapter architectures. One notable development is from Pfeiffer et al. (2021), who reduced the number of adapters significantly by attaching only one adapter per transformer layer with minimal performance degradation. Alternative approaches also include training only the bias parameters as in BitFit (Ben Zaken et al., 2022), learning a task-specific "diff" vector as in diff-pruning (Guo et al., 2021), or connecting adapters in parallel (He et al., 2022).
Most adapters consist of a bottleneck architecture with a down- and up-projection
However, I’m more interested in composing adapters together to solve multi-domain, multi-task, or multilingual problems.2 They add another level of flexibility to PEFT:
Multi-domain: For example, Chronopoulou et al. (2023) trained an adapter for each domain and computed their weight-space average at test time, dubbing the result an AdapterSoup. The idea of "model soups" was based on Wortsman et al. (2022) and was inspired by convex optimization techniques (I sketch this averaging in code right after this list).
Multi-task: Pfeiffer et al.'s (2021) work on AdapterFusion attempts to transfer knowledge from one task to another by combining their corresponding adapter representations. The "fusion" is done by introducing a new set of parameters \(\Psi\) (separate from the pretrained LM and adapter parameters, \(\Theta\) and \(\Phi\)) that learn to combine all the task adapters.
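Going back to the multi-domain case for a moment, here is the weight-space averaging sketch promised above. All names are hypothetical, and the actual AdapterSoup method also chooses which domain adapters to include for a given test domain, which I skip here:

# Minimal sketch of AdapterSoup-style weight-space averaging (hypothetical names)
import torch

def average_adapters(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Uniformly average the weights of several adapters that share one architecture."""
    keys = state_dicts[0].keys()
    return {key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0) for key in keys}

# soup = average_adapters([legal_adapter, finance_adapter, biomed_adapter])
# frozen_base_model.load_state_dict(soup, strict=False)  # load only the averaged adapter weights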
The table below shows monolithic and modular techniques for each dimension of an NLP problem. Note that adapters still require a pretrained LM as its base, so don’t treat this as an either-or comparison.
Dimension | Monolithic | Modular |
---|---|---|
Multi-domain | Most general-purpose BERT pretrained models | AdapterSoup (Chronopoulou et al., 2023), DEMix (Gururangan et al., 2022) |
Multi-task | MT-DNN (Liu et al., 2019) | AdapterFusion (Pfeiffer et al., 2021) |
Multi-lingual | XLM-R (Conneau et al., 2019), Multilingual BERT (Devlin et al., 2018) | MAD-X (Pfeiffer et al., 2020), UNKs everywhere (Pfeiffer et al., 2021) |
There are still open problems I foresee with adapters. One is the complexity-efficiency tradeoff: it's possible to "go crazy" architecting adapters to the point that the efficiency gains are not worth the complexity of the training set-up. Just like in engineering, there's a tendency to go full Kubernetes when a simple virtual machine would do. I'm curious about the practical aspects of adapter training; I think AdapterHub is a good start. I'm still looking forward to developments in this field.
Nowadays, the common way to leverage large language models is through in-context learning via hard prompts (Brown et al., 2020). Take this prompt, for example, in a text classification task:
Determine whether the text below is a "Recipe" or "Not a recipe"
Text: """Add 2 cups of rice to 3 cloves of garlic, then
add butter to make fried rice"""
Answer: Recipe
Text: """I'm not sure if that will work"""
Answer: Not a recipe
Text: """To make a caesar salad, combine romaine lettuce,
parmesan cheese, olive oil, and grated eggs"""
Answer:
We refer to the text above as a hard prompt because we're using actual tokens that are not differentiable (think of "hard" as something static or set in stone). The problem here is that the output of our LLM is highly dependent on how we constructed our prompt. Language is combinatorial: it can take a lot of time to find the right "incantation" so that our LLM performs optimally. This begs the question: what if we could learn our prompts?
The problem [with in-context learning] is that the output of our LLM is highly dependent on how we constructed our prompt.
Prompt and prefix tuning solve this by making use of soft prompts: a vector attached to the input embedding that we train while keeping the pretrained LM frozen. These two ideas appeared at around the same time (Lester et al., 2021; Li and Liang, 2021) and have similar mechanisms. The main difference is that prompt tuning adds the tensor only at the input, while prefix tuning adds a tensor at each transformer layer:
# Pseudocode of a typical encoder layer with soft prompts
class EncoderWithSoftPrompts:
    def __call__(self, x: Tensor, soft_prompt: Tensor) -> Tensor:
        # (Optionally) reparameterize the soft prompt with a small network
        soft_prompt = FeedForwardLayer(soft_prompt)
        # Prepend the trainable soft prompt to the input sequence
        x = concatenate([soft_prompt, x])
        residual = x
        x = MultiHeadAttention(x)
        x = LayerNorm(x + residual)
        residual = x
        x = FeedForwardLayers(x)
        x = LayerNorm(x + residual)
        return x
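To make the "trainable prompt, frozen LM" split concrete, here is a minimal PyTorch-style sketch of how the soft prompt itself could be created and prepended. The shapes and the `pretrained_lm` object are assumptions on my part, not the reference implementation of either paper:

# Minimal sketch of a trainable soft prompt (assumed shapes; not the papers' reference code)
import torch
import torch.nn as nn

d_model, n_prompt_tokens = 768, 20
soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)  # "virtual token" embeddings

# Freeze the pretrained LM so only the soft prompt receives gradients:
# for param in pretrained_lm.parameters():
#     param.requires_grad = False
# optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def prepend_prompt(input_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend the soft prompt to a batch of input embeddings (batch, seq_len, d_model)."""
    batch_size = input_embeddings.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    return torch.cat([prompt, input_embeddings], dim=1)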
One of the currently popular implementations of prefix tuning is the LLaMA-Adapter (Zhang et al., 2023). The mechanism is similar to prefix tuning with some subtle differences:
I find these techniques (collectively known as p*-tuning) exciting because they open up different ways to update transformer inputs. I'm interested in examining, post hoc, how these soft prompt weights depend on the data or domain. For example, we can probe these soft prompts to determine which parts of the input were most influential in generating the output (for both autoregressive and encoder-decoder networks). Lastly, I also find it exciting to look for ways to aggregate soft prompts for multilingual applications.
Lastly, Hu et al.'s (2021) work on Low-Rank Adaptation (LoRA) aims to make the weight update process more efficient. Their premise, based on Aghajanyan et al. (2021), is that although weight updates are full-rank (each row and column is linearly independent), they can still be represented in a lower-dimensional space while retaining most of their structure (low-rank).
This technique is akin to using PCA or UMAP to reduce an n-dimensional (\(n>2\)) dataset into 2-d space for visualization. In LoRA, we're reducing the weight update matrix \(\Delta W\). Note that we're not actually changing the shape of \(\Delta W\); instead, we're representing it in a low-rank form where redundancies are acceptable.
LoRA constrains the weight update \(\Delta W\) of a matrix \(W_0 \in \mathbb{R}^{d\times k}\) with two trainable low-rank matrices \(A\) and \(B\). The weight update computation then becomes \(W_0 + \Delta W \rightarrow W_0 + BA\), where \(B \in \mathbb{R}^{d\times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll min(d,k)\):
These \(B\) and \(A\) matrices are trainable. The authors used a random Gaussian initialization for \(A\) and zeros for \(B\), so \(\Delta W\) is a zero matrix at the beginning of training. Here, the rank \(r\) is a hyperparameter that must be tuned. In a transformer network, LoRA is applied to the attention weights:
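Here is a minimal PyTorch-style sketch of a LoRA-augmented linear layer based on the description above. It is my own simplification: the official implementation also includes a scaling factor \(\alpha / r\), dropout, and utilities for merging the update back into \(W_0\):

# Minimal sketch of a LoRA-augmented linear layer (my simplification, not the official code)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 8):
        super().__init__()
        self.W0 = nn.Linear(d, k, bias=False)             # pretrained weight, kept frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # random Gaussian initialization
        self.B = nn.Parameter(torch.zeros(d, r))          # zero initialization, so BA starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d) -> (..., k); the trainable update is the low-rank product BA
        return self.W0(x) + (x @ self.B) @ self.A

In the paper, this kind of update is applied to the attention projection matrices (e.g., the query and value projections).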
I haven’t seen a lot of work that expands on LoRA or rank decomposition as an efficient finetuning technique. However, I’m interested in performing more empirical experiments on this method, especially in figuring out where else we can attach these LoRA matrices. Interestingly, LoRA has found a foothold in image generation, especially in diffusion models (Rombach et al., 2021). Perhaps this technique can be extended to other data modalities.
In this blog post, we looked into a few parameter-efficient finetuning techniques (PEFTs) for large language models. First, we had adapters, which attach small trainable modules between transformer sub-layers. Then prompt and prefix tuning, which attach tunable parameters near the transformer input. Finally, we had low-rank adaptation, which decomposes the weight update into low-rank matrices.
It is interesting that despite their differences, there are structural similarities. Recently, I've been interested in reading papers that surmise a general framework for PEFTs. He et al. (2021) posit that PEFTs can be seen as clever weight updates on a transformer subspace. Pfeiffer et al. (2023) generalize this further into what we now know as modular deep learning. On the application side, AdapterHub provides an implementation framework that lets users leverage these PEFTs.
It is interesting that despite their differences, there are structural similarities between parameter-efficient finetuning techniques.
Finally, my interest in PEFTs is motivated by going against the current zeitgeist in NLP: training larger and larger models on larger and larger datasets. Perhaps there's value in moving in the other direction. Moving the needle on efficiency can lead to truly accessible and democratized NLP.
Marcos Treviso, et al. “Efficient methods for natural language processing: a survey.” arXiv preprint arXiv:2209.00099 (2022).
Note that it’s also possible to freeze the entire pretrained LM and only update the weights of the task-specific head. However, this only works well if the task-specific dataset is small (to avoid overfitting). In most cases, updating both sets of weights leads to better performance on the task at hand. ↩
NLP problems have multiple dimensions. Multi-domain: legal texts, finance documents, scientific publications, etc.; Multi-task: question answering, sequence classification, sequence tagging, etc.; Multi-lingual: English, German, French, etc. ↩
Because [annotation] guidelines were written to define the parameters of a task, we hope that they can improve the annotation process by providing more context and examples.
In this blog post, I want to focus on argumentative sentence detection: we want to know if a given text is an argument. I’ll use the “minimum wage” dataset from the UKP Sentential Argument Mining Corpus (Stab, et al., 2018). In addition, I’ll use three other annotation guidelines from different NLP papers. I based these choices on the work of Jakobsen et al. (2022).
Because each guideline defines an argument differently and asks for different labels, I normalized them into `1: Argument` and `0: No argument`, similar to Jakobsen et al.’s (2022) work. The table below summarizes these guidelines (the number beside each label is its normalized version):
Authors | Labels | How they defined an argument |
---|---|---|
Levy et al., 2018 | `Accept (Pro/Con)` (1), `Reject` (0) | Here they defined a claim as the conclusion, that is, the assertion the argument aims to prove. They didn’t mention anything about premises. |
Stab et al., 2018 | `Attacking` (1), `Opposing` (1), `Non argument` (0) | They have an explicit requirement where each claim should be backed up by another claim or premise. Claims alone don’t equate to an argument. |
Shnarch et al., 2018 | `Accept` (1), `Reject` (0) | They defined an argument as containing a claim (conclusion) and premise (evidence). Claims alone don’t equate to an argument. |
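In code, this normalization is just a lookup from each guideline's label set into the shared binary scheme. A small sketch (label strings follow the table above):

# Normalize each guideline's labels into the shared binary scheme (1 = Argument, 0 = No argument)
LABEL_MAP = {
    "Accept (Pro/Con)": 1,   # Levy et al., 2018
    "Attacking": 1,          # Stab et al., 2018
    "Opposing": 1,           # Stab et al., 2018
    "Accept": 1,             # Shnarch et al., 2018
    "Reject": 0,             # Levy et al., 2018 / Shnarch et al., 2018
    "Non argument": 0,       # Stab et al., 2018
}

def normalize(label: str) -> int:
    return LABEL_MAP[label]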
By feeding the annotation guideline into the prompt, we can get LLM predictions that incorporate its definitions and examples. This is similar to my previous blog post, with the addition of more context from the annotation guideline. The engineering challenge lies in fitting a long string of text into a prompt constrained to a fixed number of tokens.
I plan to accomplish this task using Prodigy, an annotation tool, and LangChain, a library for working with large language models. You can find the source code in this GitHub repository.
To save costs, I’ll only be using the 496 samples from the test set. I also discarded the Morante et al., 2020 annotation guideline because it’s eight pages long, and API costs can balloon from processing it.
Fitting a long document into OpenAI’s prompt is one of the primary engineering challenges in this project. The GPT-3.5 `text-davinci-003` model only allows a maximum request length of 4097 tokens, which are shared between the prompt and its completion, so most annotation guidelines won’t fit.
LangChain offers a simple solution: split the document into chunks and think of prompts as functions. By doing so, we can leverage a wide range of data engineering concepts such as map-reduce, reranking, and sequential processing. The LangChain API dubs these as `MapReduce`, `MapRerank`, and `Refine`, respectively.
LangChain offers a simple solution: split the document into chunks and think of prompts as functions. We can then leverage a wide range of data engineering concepts.
During my experiments, I found that using the sequential prompting technique, `Refine`, works best for annotation guidelines. The output is more consistent and the model does not fail at performing the task. The figure below provides an overview of this process:
Write a seed prompt. The seed prompt asks GPT-3.5 to classify an example given the first chunk of the annotation guideline. It then returns a preliminary answer that will be refined later on using the refine prompt. For our project, the seed prompt looks like this:
Context information is below.
-----------------------------------------
{context}
-----------------------------------------
Given the context information and not prior knowledge, classify
the following text:
{question}
Write a refine prompt. The refine prompt asks GPT-3.5 to refine its answer given new information. The prompt is called successively until all chunks have been shown. Then, we take the refined answer and assign it as our LLM’s prediction. The refine prompt looks like this:
The original text to classify is as follows: {question}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed)
with some more context below.
----------------------------------------------
{context}
----------------------------------------------
Given the new context, refine the original answer to better
classify the question. If the context isn't useful, return
the original answer.
Notice that I’m using some terms from the Question-Answering domain: context refers to the chunk of text from the annotation guideline, and question is the example to be classified. I patterned my prompts after this domain because it’s easier to think of annotation as a question-answering task.
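Putting it together, the whole strategy boils down to a loop over guideline chunks. The sketch below is not LangChain's internals; it assumes a hypothetical `complete()` function that wraps the OpenAI call and uses simplified versions of the two prompts above:

# Sketch of the Refine strategy over guideline chunks (hypothetical `complete()` LLM wrapper)
from typing import Callable

SEED_PROMPT = (
    "Context information is below.\n"
    "-----------------------------------------\n"
    "{context}\n"
    "-----------------------------------------\n"
    "Given the context information and not prior knowledge, classify the following text:\n"
    "{question}"
)

REFINE_PROMPT = (
    "The original text to classify is as follows: {question}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer (only if needed) "
    "with some more context below.\n"
    "----------------------------------------------\n"
    "{context}\n"
    "----------------------------------------------\n"
    "Given the new context, refine the original answer to better classify the question. "
    "If the context isn't useful, return the original answer."
)

def classify_with_guideline(question: str, chunks: list[str], complete: Callable[[str], str]) -> str:
    """Sequentially refine an answer over each chunk of the annotation guideline."""
    answer = complete(SEED_PROMPT.format(context=chunks[0], question=question))
    for chunk in chunks[1:]:
        answer = complete(REFINE_PROMPT.format(question=question, existing_answer=answer, context=chunk))
    return answer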
Funnily enough, this problem may be considered solved with the release of GPT-4. With a longer context length (32k tokens or roughly 25,000 words), you can fit in a whole annotation guideline without splitting it into chunks.
For this evaluation step, I want to compare the predictions from each annotation guideline to the gold-standard annotations found in the UKP dataset. First, I normalized all labels into a binary classification task between an `Argument` and `No argument`:

- `Argument`: Accept (Pro/Con), Attacking, Opposing, Accept
- `No argument`: Reject, Non argument

This process gave us a dataset with \(227\) examples with the `Argument` class and \(270\) examples with the `No argument` class. The F1-score for each annotation guideline is shown below:
Scores | Stab et al. (2018) | Levy et al. (2018) | Shnarch et al. (2018) |
---|---|---|---|
Micro F1-score | \(\mathbf{68.35}\) | \(60.08\) | \(42.54\) |
Macro F1-score | \(\mathbf{67.61}\) | \(53.71\) | \(37.94\) |
F1-score (per type) | Stab et al. (2018) | Levy et al. (2018) | Shnarch et al. (2018) |
---|---|---|---|
No argument (`NoArgument`) | \(\mathbf{72.50}\) | \(89.26\) | \(54.83\) |
Argument (`Argument`) | \(\mathbf{62.71}\) | \(36.54\) | \(21.05\) |
As expected, the performance of the Stab et al. (2018) guideline is closest to the original annotations. It’s also interesting how the results are heavily biased towards the `NoArgument` class for both the Levy et al. (2018) and Shnarch et al. (2018) guidelines. From a qualitative inspection, these results make sense because the wording of these guidelines denotes more stringent criteria for accepting statements as arguments:
Figure: Portion of annotation guidelines from Shnarch et al. (2018)
Figure: Portion of annotation guidelines from Levy et al. (2018)
It’s still hard to say which exact statements from the guideline informed an LLM’s decision. But because our prompting strategy refines the answer for each chunk of text, it’s possible that original `Accept` answers were rejected because of new information from the prompt.
Finally, I also noticed that the performance from the Stab et al. (2018) annotation guideline is worse than the supervised and zero-shot predictions from my previous blog post:
Scores | Zero-shot | Supervised | Stab et al. (2018) |
---|---|---|---|
Micro F1-score | \(\mathbf{81.45}\) | \(79.88\) | \(61.90\) |
Macro F1-score | \(\mathbf{78.74}\) | \(77.52\) | \(55.02\) |
F1-score (per type) | Zero-shot | Supervised | Stab, et al. (2018) |
---|---|---|---|
Supporting argument (`Argument_for`) | \(\mathbf{75.21}\) | \(73.60\) | \(48.74\) |
No argument (`NoArgument`) | \(\mathbf{86.74}\) | \(85.66\) | \(72.50\) |
Opposing argument (`Argument_against`) | \(\mathbf{74.26}\) | \(73.30\) | \(46.00\) |
It’s an interesting negative result because it ran contrary to what I expected: we already have the annotation guideline, so isn’t it supposed to work well? However, I realized that it’s still difficult to make a straight-up comparison between two prompts: it’s possible that one prompt was simply written more poorly than the other (i.e., not “fine-tuned” with known prompt engineering techniques). Personally, I will still dabble in this research lead, but it’s also possible that writing a short and sweet zero-shot prompt works best for our task.
This time, let’s lean into the idea that LLMs capture the intention of annotation guidelines and compare them against one another. We take one guideline as the reference, treat the rest as predictions, and compute the F1-score. We then arrive at the graph below:
It’s interesting that in all cases, the Stab et al. (2018) annotation guideline performs best (of course, discounting cases where we evaluate a guideline against itself). On the other hand, the Shnarch et al. (2018) guideline performs the worst.
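For anyone who wants to reproduce this kind of comparison, the computation is just pairwise F1 over aligned prediction lists; the predictions below are made-up placeholders:

# Pairwise guideline-vs-guideline comparison (placeholder predictions; 1 = Argument, 0 = No argument)
from itertools import permutations
from sklearn.metrics import f1_score

predictions = {
    "stab": [1, 0, 1, 1, 0],
    "levy": [1, 0, 0, 1, 0],
    "shnarch": [0, 0, 0, 1, 0],
}

for reference, candidate in permutations(predictions, 2):
    score = f1_score(predictions[reference], predictions[candidate], average="macro")
    print(f"reference={reference:<8} prediction={candidate:<8} macro F1={score:.2f}")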
I don’t think there’s anything vital to conclude from these results. Perhaps they say something about how strict a guideline is? Maybe this can lead to experiments that investigate how similar guidelines are to one another. We usually measure text similarity via some cosine distance between the text’s vectors. However, guidelines are intentional, and maybe something can be said about the products of these intentions, which in this case, are the annotations.
In this blog post, we looked into how we can incorporate annotation guidelines into our annotation workflows by including them in the prompt. In order to get around OpenAI’s token limit, we partitioned our document and passed each chunk sequentially into the prompt. All of these were accomplished using Prodigy and LangChain.
When compared to gold-standard annotations, the original guideline for the UKP dataset performed better than guidelines written for other tasks. However, a zero-shot approach outperformed all methods; in fact, even a straightforward supervised approach outperforms a prompt with annotation guidelines. I see this more as a negative result.
Moving forward, I think much can still be done in this line of work. I imagine using this process to evaluate how well an annotation guideline “models” the task. I wouldn’t use it to get few-shot predictions; it’s costly and not performant. In addition, it might be interesting to incorporate annotation guidelines into the annotation UI, perhaps to highlight relevant parts of the document that are useful for accomplishing a task. I’m interested to hear any suggestions or thoughts about this experiment. Feel free to reach out or comment below!
In this blog post, I want to explore how this approach translates to more intricate and challenging annotation tasks, such as argument mining where one has to trace chains of reasoning. I’ll be working on a portion of the UKP Sentential Argument Mining Corpus (Stab, et al., 2018), where sentences are categorized either as a supporting argument, an attacking argument, or a non-argument for a given topic. I want to investigate two major questions:
Can zero-shot annotations be reliable? Here, I’d like to benchmark zero-shot annotations from GPT-3 and compare them with a baseline approach. I don’t think we should rely on GPT-3 alone to annotate our data, but it doesn’t hurt to see if they work.
Can LLMs provide extra affordance? I’d like to explore UI elements in which LLMs can help human annotators reduce their cognitive load when labeling. I want to explore an LLM’s ability to highlight spans or provide a reason for its labels. Each of these affordances represents a different level of “reliance” on an LLM’s capabilities.
I will focus on the topic of minimum wage in the UKP corpus. I find it interesting, and the number of samples is small enough that I don’t have to worry about OpenAI API costs.
| Tokens | No argument | Supporting | Opposing |
---|---|---|---|---|
Training set | \(42589\) | \(968\) | \(414\) | \(396\) |
Development set | \(4899\) | \(108\) | \(46\) | \(44\) |
Test set | \(13257\) | \(270\) | \(116\) | \(111\) |
Total | \(60745\) | \(1346\) | \(576\) | \(551\) |
Table: Dataset statistics for the `minimum wage` subset of the UKP Sentential Argument Mining corpus (Stab, et al., 2018).
You can find the full project in this Github repository. It is an instance of a spaCy project that you can run to reproduce my results.
First, I want to check how accurate GPT-3’s zero-shot annotations are. Then, I will compare it against a standard approach of training a supervised model from the corpus. To qualify, the word “reliability” here is shallow. I’m not making claims on a language model’s trustworthiness, only its test set accuracy.
I will compare GPT-3’s zero-shot annotations against a standard approach of training a supervised model from the corpus.
In the supervised set-up, I’m using spaCy’s TextCategorizer to perform an exclusive text classification task. It uses a stacked ensemble of a linear bag-of-words model and a neural network model initialized with the weights of a large RoBERTa transformer (Liu, et al., 2019).1 To recap how supervised learning works, we train a model from the training and development sets, then evaluate the predictions on a held-out test set as shown in the figure below:
In the zero-shot set-up, I completely ignore the training and development sets and include test set examples into the prompt. Then, I send this prompt to GPT-3 and parse the results. Finally, I treat whatever it returns as its predictions and compare them with the gold-standard test data.
I formatted the prompt like this:
Determine whether the text is a supporting argument (Argument_for),
opposing argument (Argument_against), or none (NoArgument) regarding
the topic of "minimum_wage." Answer in the following format:
answer: <Argument_for,Argument_against, or NoArgument>
Text:
"""
Increasing minimum wage will increase our standard of living.
"""
And GPT-3 answers in the form of:2
answer: Argument_for
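For the actual runs I used Prodigy's OpenAI recipes, but the core of the zero-shot setup is just a completion request plus a small parser. Here's a rough standalone sketch assuming the legacy `openai` (pre-1.0) Python client:

# Standalone sketch of the zero-shot call and parsing (assumes the legacy openai<1.0 client)
import openai

PROMPT = (
    'Determine whether the text is a supporting argument (Argument_for),\n'
    'opposing argument (Argument_against), or none (NoArgument) regarding\n'
    'the topic of "minimum_wage." Answer in the following format:\n'
    'answer: <Argument_for,Argument_against, or NoArgument>\n\n'
    'Text:\n"""\n{text}\n"""'
)

def zero_shot_label(text: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT.format(text=text),
        max_tokens=16,
        temperature=0,
    )
    completion = response["choices"][0]["text"]
    # Parse the "answer: <label>" line; fall back to NoArgument if the format drifts
    for line in completion.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return "NoArgument"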
For comparison, I evaluated the predictions of the supervised and zero-shot setup using the gold-standard test data. The results are shown in the table below:
Scores | Zero-shot | Supervised |
---|---|---|
Micro F1-score | \(\mathbf{81.45}\) | \(79.88\) |
Macro F1-score | \(\mathbf{78.74}\) | \(77.52\) |
F1-score (per type) | Zero-shot | Supervised |
---|---|---|
Supporting argument (`Argument_for`) | \(\mathbf{75.21}\) | \(73.60\) |
No argument (`NoArgument`) | \(\mathbf{86.74}\) | \(85.66\) |
Opposing argument (`Argument_against`) | \(\mathbf{74.26}\) | \(73.30\) |
Interestingly, the zero-shot pipeline performs better than the supervised pipeline. For example, the Macro F1-score, which reports how well the two classifiers fare in light of an imbalanced dataset, shows the zero-shot classifier leading by a hair against the supervised model. These results answer the question we posed in this section: we can rely on GPT-3’s zero-shot annotations when labeling an argument mining dataset.
Perhaps the wrong conclusion to make here is that we can just replace our supervised model with a zero-shot classifier in production. It’s an appealing thought because we see higher scores from our LLM predictions. However, that’s a trap because we already know the test set. In production, we either don’t have access to gold-standard annotations, or we’re still making it ourselves. So instead, think of our zero-shot predictions as silver-standard annotations that we can refine further to produce trustworthy, gold-standard labels.
Think of our zero-shot predictions as silver-standard annotations that we can refine further to produce trustworthy, gold-standard labels.
In this section, I want to explore what other capabilities a large language model can offer as we annotate our dataset. I want to think of these as affordances, something that an annotator can use as they produce gold-standard data.
In the context of argument mining, we can use LLMs to (1) highlight an argument’s claim and premise and (2) provide a reason as to why it labeled a particular text as such. I’d like to think of these affordances as different levels of reliance over GPT-3’s capabilities:
I noticed that I become more inattentive as I use affordances that rely heavily on LLMs.
From my annotation experience, I noticed that I become more inattentive as I use affordances that rely heavily on LLMs. It’s easier for me to just accept whatever the language model suggests. This can be dangerous because an LLM can, in all its biases, influence my annotations. More so if it demonstrates some level of intelligible text to defend its decision. Perhaps it’s just my personal negligence and laziness, but I’d like to highlight this experience to provide some context to the next sections.
Palau and Moens (2009) define an argument as a set of premises that support a claim.3 For our dataset, I would like to introduce an annotation set-up where the premise and the claim, if present, are highlighted to guide annotation.
Annotation set-up where the premise and claim, if present, are highlighted to guide annotation.
This set-up aims to give the annotator extra information to label a particular text. For example, they can use the highlighted spans to easily check the premise of an argument with respect to its claim. The hope is that through this, we can reduce the cognitive load of annotation as the relevant parts of the document are already emphasized.
For this to work, we need to (1) treat the premise and the claim as spans and prompt GPT-3 to identify them for each text as a span labeling task. Then, we (2) pass this information into our annotation tool and label as usual, except that the relevant spans are now highlighted:
For GPT-3, the prompt goes like this:
From the text below, identify the exact span of text that represents
the premise, and the claim regarding the topic of minimum wage.
Answer in the following format:
Premise: <comma delimited list of string or N/A>
Claim: <string or N/A>
Here's the text
Text:
"""
In 2009, an increase in minimum wage resulted to a
higher standard of living .
"""
And we expect the answer in the form of:
Premise: "In 2009", "higher standard of living"
Claim: "increase in minimum wage"
I implemented this process using the OpenAI recipes from Prodigy. In particular, I used the `ner.openai.fetch` recipe to prompt GPT-3 to extract spans based on the labels I provided, i.e., `Premise` and `Claim`. This recipe attaches the premise and claim as spans into a new corpus that I can load using Prodigy’s built-in `textcat.manual` recipe. Because of this set-up, the spans are highlighted in the UI as shown below:
This set-up allows annotators to take advantage of relevant spans as they decide the label for a particular example. We can also combine our `textcat` annotations earlier to pre-select the category choice so that annotators can confirm an LLM’s prediction. For example, they can compare the highlighted premise to the predicted category and decide whether to correct or accept the suggested annotation.
Finally, I want to mention that this set-up still relies on a human annotator’s reasoning in order to label the text. The UI presents the relevant components, i.e. an argument’s premise and claim, but it’s still up to the annotator whether to use or ignore these affordances. In the next section, we will double-down on GPT-3’s capabilities and ask it to “reason” why it labeled a particular text as such.
This time, we increase our dependence on an LLM’s output by asking it to explain why it labeled a particular text as such. Wei, et al. (2022) proposes a prompting method called chain-of-thought to elicit complex reasoning from large language models. We will be using this technique in our argument mining dataset.
Chain-of-thought prompting is often applied to arithmetic, commonsense, and symbolic reasoning tasks. The trick is to decompose a problem into intermediate steps and solve each before giving the final answer. Researchers have found empirical gains when using chain-of-thought prompting, but it’s still a hotly debated topic whether large language models can reason like humans do (Bender et al., 2021 and Bender and Koller, 2020).
One way we can apply chain-of-thought prompting to our dataset is by decomposing an argument into its premise and claim. Then, we can examine each premise and determine if they are in favor, against, or irrelevant to the claim.4 From there, we can start classifying the text amongst our three labels. This “reasoning pipeline” is shown in the figure below:
The prompt is a bit long, but it goes like this: I first provided the instructions for the task (i.e., the labels to classify with, the format for parsing, etc.), then I included exemplars via chain-of-thought. You can find the complete prompt by clicking the details tab below:
Chain of thought prompt for argument mining annotation
Determine whether the text below is a supporting argument (Argument_for),
opposing argument (Argument_against), or none (NoArgument) regarding
the topic of "minimum wage." First, identify the premise and the
claim, then determine if the reasons or evidence presented in the
argument are in favor of the claim, against the claim, or irrelevant
to the claim. Answer in the following format:
answer: <Argument_for, Argument_against, or NoArgument>
reason: <reasoning process>
Below are some examples
Text:
"""
Or instead of hiring fewer employees , the company may start
outsourcing jobs to employees in countries that are willing to
work for much less than $ 10.10 per hour , resulting in fewer
jobs for Americans ."
"""
answer: Argument_against
reason: The premise of the text is that if the minimum
wage is increased to $10.10 per hour, then companies may
start outsourcing jobs to countries where employees are
willing to work for much less, leading to fewer jobs for
Americans. This is an argument against increasing the
minimum wage. Therefore the answer is Argument_against.
Text:
"""
{{text}}
"""
I used the same recipe as the zero-shot experiment earlier, but changed the prompt to use chain-of-thought. I then loaded OpenAI’s output back into a custom Prodigy recipe so that I can view the results in a UI. Below, you’ll find some screenshots of this updated UI (they might be too small on your screen, so feel free to open the images in another tab). Here, we see GPT-3’s explanation alongside its suggestion:
Ideally, annotators can refer to GPT-3’s explanation to confirm or correct a suggestion. Personally, I find this workflow helpful on texts that are either confusing or hard to parse. Having the premise and claim deconstructed in an explicit way can help improve annotation efficiency and hopefully reduce annotator fatigue.
In this blog post, we looked into different ways large language models like GPT-3 can augment the annotation process. We did this in the context of argument mining, a complex annotation task that involves reasoning. First, we examined how zero-shot predictions fare on the corpus, and then explored potential affordances to improve the annotation process. We categorized these affordances based on their reliance on LLMs: manual, directed, and dependent.
A directed approach uses an LLM to provide supplementary information that a human annotator can use to inform their labeling decision. On the other hand, a dependent approach asks an LLM to not only suggest labels, but also explain why a text was labeled as such. Personally, I find myself becoming more inattentive as I move toward the more dependent, LLM-guided approaches.
The final annotation still rests on a human annotator’s decision, not the LLM’s. These affordances are mere guides that provide convenience through suggestions.
The final annotation still rests on a human annotator’s decision, not the LLM’s. These affordances are mere guides that provide convenience through suggestions. I find it comforting, because I believe there’s still some nuance in most annotation tasks that only we can capture.
Finally, I think I made some claims here about improving efficiency because of these UI affordances (e.g., inattentiveness, parsing chain-of-thought). I admit that these were all from my personal experiences when annotating and trying out these recipes. I’m happy to hear suggestions on how I can verify or test these claims. Feel free to drop a comment below.
You can check the configuration file I used for this project in this Github repository. ↩
I don’t really expect that GPT-3 will always abide by the response format I set; it’s a statistical model, after all. However, I didn’t encounter any parser errors during my experiments. ↩
Several NLP papers may still have varying degrees of explicitness towards this definition (Jakobsen, et al., 2022), but for the purposes of this blog post, we’ll stick with the one above. ↩
Personally, I think what qualifies as a chain-of-thought prompt is a bit vague. For now, I scoped it as “any prompt that contains a coherent series of intermediate reasoning steps that lead to the final answer of a problem.” Lastly, I highly recommend going through the paper’s OpenReview discussion, there are some interesting points raised by the reviewers regarding evaluation. ↩