However, preference is subjective by nature, and few studies have tried articulating it. For example, some looked into different aspects of a response’s helpfulness / harmlessness (Bai et al., 2022) while others investigated surface-level characteristics like its length (Singhal et al., 2023).
In this blog post, I want to offer a different approach: what if, instead of looking at qualitative aspects or token-level features, we use sentence embeddings? Sentence embeddings capture a text’s lexical and semantic meaning in a high-dimensional vector space. Can we then ascertain lexical differences between chosen and rejected responses just by looking at their embeddings?
One reason this matters is synthetic data. I think it is easier to generate synthetic pairs conditioned on lexical distance (as opposed to some quality-based metric). There may be tasks and domains where generating with respect to cosine distance is plausible.
First, I sampled preference data across different sources. For bigger datasets such as SHP, I only took a particular subset I am interested in. The table below shows the sources I used:
Dataset | Description |
---|---|
OpenAI’s Summarize from Human Feedback (Stiennon et al., 2022) | Dataset used to train a summarization reward model. I used the comparisons subset where each instance represents a matchup between two summaries. |
Stanford Human Preferences Dataset (Ethayarajh et al., 2022) | Contains a collection of human preferences over responses to questions or instructions. I used the explainlikeimfive_train subset to represent OpenQA questions. |
Argilla’s Ultrafeedback Multi-Binarized Cleaned Dataset | A clean version of the original Ultrafeedback dataset (Cui et al., 2023). The cleanup process can be found in their writeup. |
Tatsu Lab’s Alpaca Farm (Dubois et al., 2023) | The human preference subset of the Alpaca Farm dataset. The researchers used this subset to compare their LLM judge’s preferences. |
Berkeley Nest Lab’s Nectar Dataset | Preference ranking dataset for training the Starling 7B reward model (Zhu et al., 2023), and consequently, the Starling 7B language model. |
For OpenAI’s Summarize and SHP, the preferences are in the form of individual matchups. To get the canonical chosen and rejected responses, I used the Elo rating system to obtain the top and bottom completions.
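To make that concrete, here is a minimal Elo sketch, not necessarily the exact procedure I used: it assumes each matchup is a (winner, loser) pair of completion IDs, and the K-factor and starting rating are arbitrary choices.

```python
from collections import defaultdict

def elo_ratings(matchups, k=32, start=1000.0):
    """Compute Elo ratings from a list of (winner, loser) matchups."""
    ratings = defaultdict(lambda: start)
    for winner, loser in matchups:
        # Expected score of the winner given the current ratings
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1 - expected_win)
        ratings[loser] -= k * (1 - expected_win)
    return ratings

# Hypothetical matchups between three completions for the same prompt
ratings = elo_ratings([("a", "b"), ("a", "c"), ("b", "c")])
chosen = max(ratings, key=ratings.get)    # top-rated completion
rejected = min(ratings, key=ratings.get)  # bottom-rated completion
```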
Given a set of preference data, I split the completions based on whether they were chosen (\(\mathbf{y}_w\)) or rejected (\(\mathbf{y}_l\)) by an evaluator—human or GPT, depending on the dataset. Then, I embedded them using sentence-transformers/all-MiniLM-L6-v2 to produce 384-dimensional sentence embeddings. Finally, for each row, I computed the distance (\(\mathbf{d}\)) between the chosen and rejected vectors. The figure below illustrates this process.
To compute the distances, I used the cosine distance from scipy.
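Here is a minimal sketch of the embedding and distance computation; the texts are placeholders and the per-dataset preprocessing is more involved than this.

```python
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chosen = "The summary covers the main points of the post."     # placeholder chosen response
rejected = "The post talks about a lot of different things."   # placeholder rejected response
emb_chosen, emb_rejected = model.encode([chosen, rejected])    # 384-dim vectors

d = cosine(emb_chosen, emb_rejected)  # cosine distance = 1 - cosine similarity
print(f"Cosine distance: {d:.3f}")
```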
Cosine distance compares the direction of two vectors rather than their magnitude, allowing us to capture similarity even if the lengths of the sentences or the overall frequencies of the words differ.
It is represented by the following equation:

\[\mathbf{d}(\mathbf{y}_w, \mathbf{y}_l) = 1 - \frac{\mathbf{y}_w \cdot \mathbf{y}_l}{\lVert \mathbf{y}_w \rVert \, \lVert \mathbf{y}_l \rVert}\]

where the distance value ranges from \([0, 2]\). Usually, when we talk about distances between preference pairs, we mean quality-based distances: they’re often in the form of rankings (i.e., take the top-1 and top-N) based on an evaluator’s assessment. Again, in this blog post we’re looking at lexical distances that are readily available from a text’s surface form. In the next section, I’ll discuss some interesting findings from these distance calculations.
Most of the charts I’ll be showing below are histograms. Here, the x-axis represents the cosine distance whereas the y-axis represents the probability density. We compute the probability density by normalizing the fraction of samples in each bin so that the sum of all bar areas equals 1. The best way to think about these values is in terms of chance: how likely is it that a random preference pair has a distance \(\mathbf{d}\) on the x-axis?
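As a quick illustration of that normalization (assuming matplotlib and placeholder distances, not my actual plotting code):

```python
import numpy as np
import matplotlib.pyplot as plt

distances = np.random.default_rng(0).uniform(0, 2, size=1_000)  # placeholder cosine distances
counts, bin_edges, _ = plt.hist(distances, bins=50, density=True)

# With density=True, each bar height is (fraction of samples in the bin) / (bin width),
# so the total area of all bars sums to 1.
assert np.isclose((counts * np.diff(bin_edges)).sum(), 1.0)
plt.xlabel("Cosine distance")
plt.ylabel("Probability density")
```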
The chart below shows the distribution across multiple preference datasets. AlpacaFarm and Nectar lie at the two extremes. AlpacaFarm is particularly interesting because its completions were generated by API-based LLMs using prompts that replicate human variability and agreement. I’m unfamiliar with how exactly they prompted the LLM, but does that mean their process resulted in similar-looking texts?
On the other hand, Nectar’s completions were a combination of LLM outputs (GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-instruct, LLaMa-2-7B-chat, and Mistral-7B-Instruct) alongside other existing datasets. Because Nectar formats its preferences in terms of ranking, the chosen and rejected pairs here represent the top and bottom choices.
Other datasets have distributions closer to what I expected. For example, OpenAI’s summarization dataset should have closer preference pairs because of the task’s inherent nature: summarization is about compressing a text while maintaining its information. Upon checking the actual preferences and corresponding evaluator notes, I noticed that rejected completions often differ mainly in recall, i.e., how much of the source information they keep.
Next, I looked into how Elo ranking corresponds to the cosine distance of the text embeddings. Preference datasets like OpenAI’s Summarization, SHP, and Berkeley-Nest’s Nectar represent their preferences as individual matchups, allowing us to compute the Elo rating of individual completions. Then, we can order these ratings to achieve a rank of completions from most preferable to least.
However, OpenAI’s Summarization and SHP have an unequal number of ranks per prompt \(\mathbf{x}\). So to simplify the visualizations, I took the chosen completion \(\mathbf{y}_w\), the second-ranked completion \(\mathbf{y}_{l,next}\), the middle performer \(\mathbf{y}_{l,mid}\), and the bottom performer \(\mathbf{y}_{l,last}\) (which is equivalent to \(\mathbf{y}_l\) in the previous section). On the other hand, Berkeley-Nest’s Nectar provides a 7-rank scale of preferences. This allowed me to compute the distance from the first choice to each of the remaining ones: \(\mathbf{d}(\mathbf{y}_1, \mathbf{y}_{2\ldots7})\). Then, I plotted these distances in a histogram (I only retained the curve so that the charts look cleaner), as seen below:
The cosine distances from the OpenAI Summarization preference dataset follow a certain pattern: completions that are closer in ranking have smaller lexical distance. The average mid ranking is 2.042 (with a 4.109 average number of ranks) and the Pearson correlation between the distances and Elo ranking is 0.779.
For the Stanford Human Preferences (SHP) dataset, I chose the explainlikeimfive subset to simulate OpenQA tasks.
Interestingly, it has a less pronounced visual correlation even though its Pearson-r is 0.785, slightly higher than OpenAI Summarization’s.
The average mid ranking is 1.967 with an average rank number of 4.600.
For Berkeley-Nest’s Nectar dataset, the rankings were already given so I didn’t have to compute my own. Here, the Pearson correlation is 0.818. If you look at the “chosen and rejected (2)” red line, you’ll notice that the cosine distances start very small but fall off afterwards. It is interesting that completions that performed similarly during matchups are quite similar to one another based on their embeddings.
Dataset | Number of ranks (avg) | Mid rank | Pearson-r Elo ranking | Pearson-r Elo rating |
---|---|---|---|---|
openai/summarize_from_feedback | 4.109 | 2.042 | 0.779 | -0.534 |
stanfordnlp/SHP | 4.600 | 1.967 | 0.7845 | -0.458 |
berkeley-nest/Nectar | 7.000 | 4.000 | 0.818 | - |
The table above shows the ranking statistics for each dataset. I also measured the Pearson correlation between the rejected text’s ranking (and Elo rating) and its embedding distance from the chosen text. The sign (+/-) corresponds to the direction of the correlation. For example, the negative sign in the last column shows that as a text’s Elo rating increases, its lexical distance from the chosen text decreases (i.e., they become more similar).
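The correlation itself is a one-liner with scipy; the lists below are hypothetical and only show the expected shape of the inputs (one rank or rating and one distance per rejected completion).

```python
from scipy.stats import pearsonr

ranks = [2, 3, 4, 5, 6, 7]                        # hypothetical ranks of rejected completions
distances = [0.08, 0.15, 0.21, 0.25, 0.31, 0.36]  # cosine distance from the chosen completion
r, p_value = pearsonr(ranks, distances)
print(f"Pearson-r: {r:.3f} (p={p_value:.3f})")
```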
Finally, I was curious how individual attributes of preference manifest in lexical distances using the HelpSteer dataset (Wang et al., 2023). Most datasets only give us a single view of human judgment, but HelpSteer provides fine-grained preference attributes such as helpfulness, correctness, coherence, complexity, and verbosity.
So, I did the same experiments for each of these attributes and found that the distributions didn’t change much, as shown in the figure below. I’m not quite confident about how I preprocessed this dataset. Unlike other preference datasets that use matchups, HelpSteer uses scores from 0 to 4, so some texts can end up having the same scores. Here, I simply sorted the texts by their score, designated the chosen text as the first one on the list (whatever Python’s sort function made it to be), and the rejected text as the last element.
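Here’s a rough sketch of that (admittedly naive) preprocessing using the helpfulness attribute as an example. The dataset ID and field names are what I recall from the HuggingFace release, so treat them as assumptions.

```python
from collections import defaultdict
from datasets import load_dataset

dataset = load_dataset("nvidia/HelpSteer", split="train")

responses_per_prompt = defaultdict(list)
for row in dataset:
    responses_per_prompt[row["prompt"]].append((row["helpfulness"], row["response"]))

pairs = []
for prompt, scored in responses_per_prompt.items():
    if len(scored) < 2:
        continue
    # Sort by score; ties keep whatever order the sort gives them
    scored.sort(key=lambda item: item[0], reverse=True)
    pairs.append({"prompt": prompt, "chosen": scored[0][1], "rejected": scored[-1][1]})
```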
I think there’s still a lot that can be done from this angle. One way is to format the data in terms of individual matchups. This process leads to a forced ranking, allowing us to easily designate the chosen and rejected pair. Since HelpSteer is the only such dataset we have (as far as I know), I’ll leave my analysis as is for now.
In this blog post, I presented a lexical view of preference pairs using embeddings. Across different preference datasets, I computed sentence embeddings and then measured the cosine distance between chosen and rejected pairs. I found that some datasets exhibit clear lexical differences and that these correlate with human judgment (i.e., Elo rating). Finally, using the HelpSteer dataset, I saw that cosine distances are consistent even across different attributes of preference.
This experiment is really just a curiosity as I work on RLHF. I’ve been doing some experiments on my job that are a little bit orthogonal to this work. I think this is just my way of exploring interesting avenues and scratching my itch. If you’re interested in this type of work, feel free to reach out and discuss!
You can find the source code for this work on GitHub!
This blog post is my (abridged) lecture in written format. You can find the slides here. Finally, thanks to Ryan Wesslen and Chang Hsin Lee for inviting me!
One of the major problems of the 21st century is disinformation. You’ll see it everywhere, from Facebook posts or X tweets to fake news websites! Combating disinformation is labor-intensive. Politifact, a fact-checking website, relies on volunteer journalists to scour the internet and manually label each source.
There are several efforts to automate the fact-checking process. A common approach is to treat it as an NLP pipeline composed of different tasks (Guo et al., 2022). Today, we will only focus on claim detection, the first step in an automated fact-checking pipeline.
Detecting claims is usually a dual problem: you’d also want to find the premises that support it. Together, the claim and its premises make up an argument. Applying NLP to this domain is often called argument mining. For this talk, I want to introduce two argument mining sub-tasks: (1) first, we want to highlight the claim and premise given a text (claim & premise extraction), and then, (2) we want to determine if a text supports, opposes, or is neutral to a certain topic (stance detection).
So, our general approach is to reframe these two sub-tasks as NLP tasks. First, we treat claim & premise extraction as a span labeling problem. We can use spaCy’s SpanCategorizer to obtain spans or arbitrary slices of text. Then, we treat stance detection as a text categorization problem. Similarly, we can use spaCy’s TextCategorizer to classify a text among our three stances (support, oppose, neutral).
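As a small sketch of how the two sub-tasks map to spaCy components (the label names here are illustrative, not the ones from the talk):

```python
import spacy

nlp = spacy.blank("en")

# Claim & premise extraction as span labeling
spancat = nlp.add_pipe("spancat")
spancat.add_label("CLAIM")
spancat.add_label("PREMISE")

# Stance detection as text categorization
textcat = nlp.add_pipe("textcat")
for label in ("support", "oppose", "neutral"):
    textcat.add_label(label)
```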
Notice how we’ve decomposed the general problem of disinformation into tractable NLP tasks. This is an important muscle to train: in computer science, we often learn about divide and conquer, and this is a good application of that approach to a fuzzier and, admittedly, more complex problem.
As we already know, training NLP models such as a span or text categorizer requires a lot of data. I want to talk about different methods of collecting this dataset and emphasize how LLMs can fit into this workflow.
Before we get into LLMs, I want to talk about “traditional” ways of annotating data. On the left end, we have manual processes involving much human effort and curation. And then, on the right, we have more automated methods that rely heavily on a reference or base model. LLMs, as advanced as they are, still fall in between. They’re not fully manual but also not fully automated because writing a prompt still requires tuning and domain expertise.
But why are we still interested in LLMs? It’s because LLMs provide something that most semi-automated methods can’t: a model pretrained on web-scale data, and a highly flexible zero-shot capability. Let me put this in a Venn diagram— and for each space in this diagram, I’ll talk about how LLMs can specifically help in our annotation workflows.
One of the most straightforward applications of large language models is bootstrapping labeled data. Here, an LLM is a drop-in replacement for a base model that you’d usually train. LLMs differ because they were pretrained on web-scale data, giving them enough capacity even for your domain-specific task. So, how good is an LLM annotator?
To test this question, I worked on a portion of the UKP Sentential Argument Mining corpus (Stab et al., 2018). It contains several statements across various topics, and the task is to determine whether the statement supports, opposes, or is neutral to the topic— a text categorization problem.
The process was simple: I included each statement in a prompt and asked GPT-3.5 what the stance was. You can read more about my process in this blog post. My findings show that LLMs, when prompted in a zero-shot manner, are competitive on a baseline that I trained on the original labels. In addition, I also found myself annotating faster (and more correctly) when correcting LLM annotations compared to annotating from scratch. The latter finding is important because correcting annotations induces less cognitive load and human effort (Li et al., 2023, Zhang et al., 2023).
So, if LLMs can already provide competitive annotations, is our problem solved? We don’t have to annotate anymore? Remember, the reason why we collect these annotations is so that we can train a supervised model that can reliably approximate the task we’re interested in. The operating word here is reliable. There’s a huge variance in LLM performance, and one way to thin out that curve is to insert it in a human-in-the-loop workflow (Dai et al., 2023; Boubdir et al., 2023; Wang et al., 2023).
Another way we can use LLMs for annotation is by taking advantage of their flexibility. LLMs have zero-shot capabilities, i.e., we can always frame structured prediction tasks such as text categorization or named entity recognition as a question-answering problem. Back then, you’d need to train separate supervised models to achieve multi-task skills. I want to use an LLM’s flexibility to enhance the annotation experience.
This time, I want to introduce two workflows. The first one is still a text categorization problem, but I want to ask an LLM to pre-highlight the claims and premises so I can reference them during annotation. For the second one, I want to ask the LLM to do the reasoning for me. I’ll let it identify the claims and premises, then pre-annotate an answer, and then give me a reason for choosing that answer. This exercise aims to explore creative ways we can harness LLMs.
The process is similar to the first section, but I prompt for auxiliary information instead of prompting for the direct labels. LLMs make this possible because we can formulate each task as a question-answer pair. You’ll find examples of my prompt in the slides below. The prompt on the left is a straightforward span labeling prompt, where we ask the LLM to provide the exact spans from a text. On the other hand, the prompt on the right is a chain-of-thought prompt (Wei et al., 2023). Here, we induce an LLM to perform a series of reasoning tasks to arrive at a final answer.
The good thing about Prodigy is that you can easily incorporate this extra information in your annotation UI. On the bottom left, you’ll see that it highlights the claims and premises for each statement, allowing you to focus on the relevant details when labeling. On the bottom right, you’ll find that the UI metadata now contains the prompt’s reasoning steps.
There are many creative ways to improve annotation efficiency (and quality) using LLMs. One of my favorite papers from EMNLP was CoAnnotating (Li et al., 2023), which uses an uncertainty metric to allocate annotation tasks between humans and a chat model such as ChatGPT. We’ve seen a lot of LLM-as-an-assistant applications in the market for the past year, and I think that there’s an opportunity to apply the same perspective to the task of annotation.
Finally, I’m curious how LLMs parse information originally intended for humans. In most annotation projects, researchers write an annotation guideline to set up the parameters of the labeling task. These guidelines aim to reduce uncertainty about the phenomenon we are annotating. We can even think of these as prompts for humans!
This time, I want to focus on a simple task: determine whether a statement is an argument. It sounds easy because it’s “just” a binary classification task. However, after looking through various argument mining papers and their annotation guidelines, I realized that they each have their definition of what makes an argument!
So, this got me thinking: what if we include the annotation guideline in the prompt? You can check my entire experiment in this blog post. Back then, you could not fit a whole document into an LLM’s limited context length, so I used a continuous prompting strategy that showed chunks of the document and let the LLM update its answer based on new information. Langchain calls this a “refine chain” in their docs. As an aside, I’ve opted to use minichain in my recent projects as it is more lightweight and enough for my needs.
Surprisingly, including an annotation guideline in the prompt led to worse results. I couldn’t dig deeper, but I hypothesize that prompts for LLMs have a particular “dialect” vastly different from how we talk as humans. Annotation guidelines were written with humans in mind, and perhaps some qualities don’t transfer properly into LLM prompts. There are many confounding factors, of course. Maybe the refine strategy is not the best, or maybe I should’ve processed the text better. An LLM’s prompt sensitivity is still an open problem.
But I learned one thing: we can use LLMs as a “first pass” when iterating over our annotation guidelines. Typically, you’d start with a pilot annotation with a small group of annotators as you write the guidelines, but there’s an opportunity to incorporate LLMs into the mix.
Before we end, I want to share an important question before you begin your annotation projects. You should always ask yourself: what is the label supposed to reflect? Knowing what you want to use the collected dataset for is paramount.
Rottger et al. (2022) named two paradigms for data annotation: prescriptive and descriptive. Prescriptive annotation is usually found in linguistic tasks such as named entity recognition or parts-of-speech tagging—where there is a “correct” answer for each instance. Here, you already have a function in mind and need to collect enough data to train a reliable model. On the other hand, descriptive annotation aims to capture the whole diversity of human judgment. You’d usually find this in subjective tasks like hate speech detection or human preference collection.
LLMs are pretty good at prescriptive annotation tasks. There is some empirical evidence that supports this (Ashok et al., 2023; Chen et al., 2023; Sun et al., 2023), and it allows us to tap into the web-scale data they were pretrained on.
And now to my final point: despite their web-scale pretraining and zero-shot capabilities, LLMs are only as good as how well you prompt them. During my early days in data science, there was this common adage: “garbage in, garbage out.” Usually, we say this when we want to refer to bad data. The problem with prompts is that the degree of freedom is much higher, which introduces ambiguity to our inputs. Hence, I don’t recommend using LLM outputs straight from the firehose and serving them immediately. There should be an intermediary step that minimizes this uncertainty, and that step is human annotation.
$ git clone git@github.com:myorg/repo.git
Cloning into 'repo'...
ERROR: The 'myorg' organization has enabled or enforced SAML SSO.
To access this repository, you must use the HTTPS remote with a
personal access token or SSH with an SSH key and passphrase that
has been authorized for this organization.
Visit https://docs.github.com/articles/authenticating-to-a-github-organization-with-saml-single-sign-on/ for more information.
First you need to generate your SSH key.
Sometimes, your organization will require you to generate a new one using your company email.
Nevertheless, the common denominator would be to run the ssh-keygen command below:
ssh-keygen -t ed25519 -C lj@myorg.org
This will generate a key pair in the form of id_ed25519 and id_ed25519.pub.
In Linux, you can find them in the ~/.ssh/ directory.
We need to upload the one with the .pub extension to GitHub.
Go to your GitHub Settings > SSH and GPG Keys > New SSH Key (or head to github.com/settings/keys).
Write a semi-descriptive title (I usually put the organization name), set the Key Type as “Authentication Key,” and copy the contents of id_ed25519.pub into the Key field.
First, test the connection by running:
$ ssh -T git@github.com
Hi username! You've successfully authenticated, but GitHub does
not provide shell access.
Then, start the SSH agent:
$ eval "$(ssh-agent -s)"
Agent pid 16935
It starts a background daemon and displays its process ID (in this case, 16935). We can then add our private keys while this agent is running.
$ ssh-add .ssh/id_ed25519
Identity added: .ssh/id_ed25519 (some other info)
At this point, you should now be able to clone your organization’s private repository. I haven’t really dug deep into why it errored out the first time; I assumed that keys are automatically added whenever I create them. Anyway, in case you also encountered this error, I hope this tutorial helps!
For now, I am just curious how they would look in our Tagalog NER dataset. My approach is similar to the blog post I mentioned. The only difference is that I’m using a trained RoBERTa Tagalog model to get the embeddings. Finally, there’s nothing new about these methods: dataset cartography has been an active area of NLP research for a while (Swayamdipta et al., 2020; Balasubramanian et al., 2020).
You can find the code and implementation here.
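If you just want the gist, here is a condensed sketch of the embedding and projection step. The RoBERTa Tagalog checkpoint name, the placeholder spans, and the mean-pooling choice are my assumptions; the actual implementation in the repo is more thorough.

```python
import torch
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

model_name = "jcblaise/roberta-tagalog-base"  # RoBERTa Tagalog checkpoint (adjust as needed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

spans = ["Malacañang", "Juan de la Cruz", "Davao"]  # placeholder entity spans
with torch.no_grad():
    inputs = tokenizer(spans, padding=True, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state     # (batch, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padding when pooling
    embeddings = ((hidden * mask).sum(1) / mask.sum(1)).numpy()

projected = TSNE(n_components=2, perplexity=2).fit_transform(embeddings)  # (batch, 2)
```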
Below you’ll find the t-SNE plot for all entities in our dataset. They are color-coded based on their type— Person (PER), Organization (ORG), and Location (LOC). When you hover over each point, you’ll see the span text, its label, and a portion of the sentence it belongs to. Feel free to explore around using the visualization tools from Plotly.
Figure: t-SNE plot for all entity labels in TLUnified-NER.
At first glance, we see that PER entities are clearly separated from ORG and LOC, whereas the other two have noticeable overlaps. My hunch here is that even if two entities have the same lexical properties, they have different semantic use. For example, Malacañang (the place where the Philippine President resides, akin to The White House) can either be an organization or location based on its usage. We can verify this observation by examining the confusion matrix of a simple transition-based NER model: it significantly misconstrues LOC entities as ORG (and vice-versa).
Embeddings set-up | ORG (relative error reduction) | LOC (relative error reduction) |
---|---|---|
Shared | \(+5\%\) | \(+3\%\) |
Context-sensitive | \(+12\%\) | \(+18\%\) |
To further test this “lexical-semantic confusion” hypothesis, I trained two additional models that account for a word’s position in the text. The first model uses spaCy’s shared token-to-vector layer that includes a dependency parser and POS tagger aside from NER. The hope is that by sharing information between these three components, our downstream NER model can learn how to disambiguate between these semantic differences. The second model uses a transformer network to obtain context-sensitive vectors. It is interesting, then, that the relative error for LOC ↔ ORG decreased when using these methods. Therefore, I highly recommend using context-sensitive techniques when training models on this dataset.
I also want to share interesting clusters from examining all labels. I’m literally just spitballing here: there’s nothing methodical aside from inspecting a few clusters and checking their neighbors. With that in mind, take these observations with a grain of salt.
Political clusters are intriguing. Some neighborhoods stood out to me. For example, the Nograles cluster is isolated from most PER entities. Its closest PER cluster is Arroyo, and the majority of its neighboring clusters include Mindanao, MILF, and some cities near Davao. My hunch is that most news stories in the corpus were written during a time when Prospero Nograles’s involvement in Davao and the Arroyo administration was apparent (he was the Speaker of the House).
Now, we’re entering speculative territory but it’s cool that you can at least draw political lines during the 2004-2010 administration. Of course, it’s hard to draw these lines because unlike the US, the Philippines has a multi-party system. It’s fun to point out but I admit that what I’ve been doing is just akin to a Rorschach test. If you’re looking for something more rigorous, I suggest reading the work of Rheault et al (2019). This led me to ask: can we predict shifts in political alliances from words alone? I think it is an interesting exercise— and especially challenging— given that political parties in the Philippines are not really defined by their ideologies.
Biases exist. I also noticed clusters that might potentially be sources of bias when training models from this dataset. For example, most news sources from Mindanao involve acts of terrorism from Abu Sayyaf and the Moro National Liberation Front. It is then unfortunate that entities such as Allah and Muslim are co-located within this neighborhood.
Personally, I’m interested to explore techniques to debias corpora from an embeddings standpoint. The works of Prost et al. (2019) and Kaneko et al. (2021) for gender bias come to mind.
Here, I plotted the embeddings for each entity type while categorizing them based on their span property: paren (if the span is preceded by a parenthesis), all_caps (if all characters in the span are in uppercase), initial (if the span is the first subword in the text), and plain as a catch-all category.
These classes are mutually exclusive, i.e., I automatically assign them based on the first property they fulfill.
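A small helper mirroring these rules might look like the sketch below; the two boolean flags stand in for checks on the surrounding document.

```python
def span_property(span_text: str, preceded_by_paren: bool, first_in_text: bool) -> str:
    """Assign exactly one category by checking the properties in order."""
    if preceded_by_paren:
        return "paren"
    if span_text.isupper():
        return "all_caps"
    if first_in_text:
        return "initial"
    return "plain"
```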
Most PER entities were categorized as plain, which is mostly expected. Although I find it interesting that there is a sizeable number of names made up of initials, such as FVR for Fidel V. Ramos or GMA for Gloria Macapagal Arroyo. Most of our benchmarks had little to no problem recognizing PER entities, and my hunch is that this is due to the straightforward and consistent structure of such entities.
The prevalence of initials in naming individuals usually stems from cultural influences in the Philippines. It is customary to use initials to refer to prominent figures, such as Presidents and CEOs. I have only scratched the surface on these entities, so feel free to explore the interactive plot!
It is cool that if you squint hard enough, you can see cities in the Philippines arranged according to their geographical location, based on their embeddings alone. Of course, there are still inconsistencies: Manila is located at the rightmost portion whereas Bulacan appears in the middle. Why would that be the case? My hunch is that in-country linguistic diversity is still apparent, and somewhat recoverable, through these embeddings. For example, parts of Mindanao are predominantly Muslim, which affects naming and proper noun patterns (you’ll often see instances of Muhammad or Al- in Mindanaoan names). These variations borne out of geographical differences may correlate with the linguistic clusters we now see.
My other theory is politics. Although there is a central government, regional politics still dominate. Politicians tend to co-occur with one another in news reports, especially if they belong to the same region. Perhaps these co-occurrences caused some of the “geographical separation” we see in our embeddings. Might be fun to explore in the future!
It’s nice that most government departments are members of the same cluster.
This can lead to improved accuracy in NER tasks, as a model can easily recognize and categorize these entities.
Hopefully, this can be a good visual cue that the embeddings have captured the underlying relationships between different organizations.
They are similar to PER entities, with the exception that they have a more recognizable orthographic “shape.”
For example, most organizations in news reports are acronyms.
And so, writers tend to give an organization’s full name first, followed by its shorthand (e.g., XXXXX XXXXX XXX (XXX)).
Visualizing embeddings is a nice exercise in “examining your data.” Although I have to admit that I made some huge leaps of logic while explaining my observations above (I’m still getting better at this!). In the future, I’m more interested in how these techniques can be applied and how we can test them via a more empirical approach. For example, we can get LLM embeddings, cluster them together, and examine the outliers. There might be cool applications for correcting annotations or annotating a dataset from scratch. For now, I think this is fun!
Parts of this work were published in the paper, “Developing a Named Entity Recognition Dataset for Tagalog”, at IJCNLP-AACL’s Southeast Asian Language Processing Workshop. Feel free to cite that paper in your own work.
A few weeks ago, I saw an interesting blog post from Thinking Machines where they ran Filipino tweets on GPT-4 for a sentiment analysis task. Their prompt was simple. They asked: “what is the sentiment of this tweet?” They obtained a weighted F1-score of 76%— pretty decent for a straightforward zero-shot approach. This inspired me to test LLM performance on other Tagalog NLP tasks, hence these experiments.
In this blog post, I will test how these large language models (LLMs) fare compared to standard finetuning techniques in Tagalog. I will be benchmarking them on the named entity recognition (NER) and text categorization datasets from the calamanCy project.
As a refresher, you can check the datasets I’m using in the table below. I didn’t include the Universal Dependencies (UD) treebanks this time because querying third-party APIs is getting too costly.
Dataset | Task / Labels | Description |
---|---|---|
Hatespeech (Cabasag et al., 2019) | Binary text classification (hate speech, not hate speech) | Contains 10k tweets collected during the 2016 Philippine Presidential Elections labeled as hatespeech or non-hate speech. |
Dengue (Livelo and Cheng, 2018) | Multilabel text classification (absent, dengue, health, sick, mosquito) | Contains 4k dengue-related tweets collected for a health infoveillance application that classifies text into dengue subtopics. |
TLUnified-NER (Cruz and Cheng, 2021) | NER (Person, Organization, Location) | A held-out test split from the annotated TLUnified corpora containing news reports. |
I wrote a zero-shot prompt and ran it on the test set. Zero-shot prompting only requires a task description for inference. Few-shot prompting is out of scope for this blog post: it’s too laborious to engineer the prompts, and it might be difficult to compare them properly. I ran the experiments for three trials and report the mean and standard deviation to account for variance. The prompt text is still in English to be consistent with the Thinking Machines blog post.
Finally, I am using spacy-llm throughout the experiments. I highly recommend trying spacy-llm if you’re building production-grade LLM pipelines. You can find and reproduce my work on Github! (Full disclosure: I used to contribute to earlier versions of spacy-llm as part of my work at Explosion)
The spacy-llm library provides a set of built-in prompt templates for zero- and few-shot prompting.
These prompts are categorized and versioned per task.
You can view them by checking the configuration file in the Github repo and looking at the components.llm.task section.
For example, in NER, we have something like this:
[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PER","ORG","LOC"]
label_definitions = {"PER": "PERSON", "ORG": "ORGANIZATION", "LOC": "LOCATION OR GEOPOLITICAL ENTITY"}
Here, spacy.NER.v2 points to a task with its own prompt.
From there, you can check the documentation and cross-reference the template (tip: check the template argument in the docs).
For NER, we have this Jinja2 file.
At runtime, spacy-llm renders our config to the Jinja2 template, thereby producing the final prompt sent to the LLM:
You are an expert Named Entity Recognition (NER) system. Your task is
to accept Text as input and extract named entities for the set of
predefined entity labels. From the Text input provided, extract named
entities for each label in the following format:
PER: <comma delimited list of strings>
ORG: <comma delimited list of strings>
LOC: <comma delimited list of strings>
Below are definitions of each label to help aid you in what kinds of
named entities to extract for each label. Assume these definitions are
written by an expert and follow them closely.
PER: PERSON
ORG: ORGANIZATION
LOC: LOCATION OR GEOPOLITICAL ENTITY
Text:
'''
Pumunta si Juan sa Japan.
'''
I won’t be pasting the prompts for binary and multilabel text categorization here to save space. Again, the best way to view them is to check my configuration files and cross-reference them with the prompt templates in the spacy-llm repository.
Lastly, some spacy-llm tasks provide additional arguments such as label_definitions for explicitly describing a label to an LLM, and examples for incorporating exemplars in few-shot prompting.
The library covers most of the core NLP tasks (NER, text categorization, and lemmatization) and seems to be adding more in the natural language understanding (NLU) space (e.g., summarization).
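Once the config is complete (the snippet above only shows the task; you also need a [components.llm.model] section), running the pipeline looks roughly like this:

```python
from spacy_llm.util import assemble

nlp = assemble("config.cfg")  # path to the full spacy-llm config
doc = nlp("Pumunta si Juan sa Japan.")
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g., [('Juan', 'PER'), ('Japan', 'LOC')]
```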
I tested on a variety of decoder-only large language models, from commercial ones like GPT-4 to open-source models like Dolly. The table below reports the results (Metrics: macro F1-score for Dengue and Hatespeech and F1-score for TLUnified-NER):
LLM | Dengue | Hatespeech | TLUnified-NER |
---|---|---|---|
OpenAI (gpt-4) | \(\mathbf{62.04 (0.20)}\) | \(45.74 (1.16)\) | \(\mathbf{65.89 (0.44)}\) |
OpenAI (gpt-3.5-turbo) | \(51.21 (0.38)\) | \(\mathbf{73.90 (0.27)}\) | \(53.05 (0.42)\) |
Anthropic (claude-1) | \(35.85 (0.02)\) | \(58.70 (0.03)\) | \(58.88 (0.03)\) |
Cohere (command) | \(39.27 (0.64)\) | \(16.38 (0.88)\) | \(25.48 (0.11)\) |
Databricks (dolly-v2-7b) | \(27.26 (0.40)\) | \(32.30 (0.18)\) | \(13.07 (0.14)\) |
TII (falcon-7b) | \(14.77 (0.35)\) | \(33.00 (0.11)\) | \(8.65 (0.04)\) |
Stability (stablelm-base-alpha-7b) | \(15.56 (0.08)\) | \(32.17 (0.24)\) | \(00.25 (0.03)\) |
OpenLM (open_llama_7b) | \(15.24 (0.43)\) | \(32.18 (0.73)\) | \(15.09 (0.48)\) |
For comparison, the table below shows the results for the word vector and encoder-only transformer-based pipelines from calamanCy. Both were trained using good old-fashioned supervised learning. I also included the results from finetuning XLM-RoBERTa (Conneau et al., 2019) and multilingual BERT (Devlin et al., 2019). You can read more about these pipelines in this blog post.
Pipeline | Dengue | Hatespeech | TLUnified-NER |
---|---|---|---|
Large word-vector (tl_calamancy_lg) | \(68.42 (0.01)\) | \(75.62 (0.02)\) | \(88.90 (0.01)\) |
Transformer-based (tl_calamancy_trf) | \(72.45 (0.02)\) | \(78.25 (0.06)\) | \(90.34 (0.02)\) |
XLM-RoBERTa (xlm-roberta-base) | \(67.20 (0.01)\) | \(77.57 (0.01)\) | \(88.03 (0.03)\) |
Multilingual BERT (bert-base-multilingual) | \(71.07 (0.04)\) | \(76.40 (0.02)\) | \(87.40 (0.02)\) |
The graph below shows a better visual of our results. The grey bars represent our large language models while the red bars represent the supervised ones.
It is apparent that our supervised approach outperformed zero-shot prompting in our datasets. These results are consistent with the findings of the BigScience group (Wang et al., 2022), where they showed that although decoder-only models trained on an autoregressive objective exhibited the strongest zero-shot generalization, they’re still outperformed by models trained via masked language modeling followed by multitask finetuning. I think there are two major reasons why LLMs underperform on Tagalog:
Conceptual gap: text generation != text understanding. I argue that one common misconception with LLMs is that we equate the generation of coherent text with language understanding. Just because an LLM can “speak” coño or jejemon doesn’t mean it understands linguistic grammar. LLMs are, after all, stochastic parrots (Bender et al., 2021). They might be performant on leaderboards, but they can cause harm that goes unnoticed when used liberally. In addition, our decoder-only LLMs may not be suited to our structured prediction benchmarks.
As an aside, this may also be a call for building NLU-benchmarks for low-resource corpora. If you’re working on something related, I’m interested to help so feel free to reach out!
Data gap: Tagalog is still underrepresented in most LLM training corpora. Training an LLM requires a large data mixture. Datasets in this mixture usually include CommonCrawl, The Pile, C4, and Wikipedia among many others. Most of these datasets are heavily Anglocentric. For example, the Pile dataset is English-only while CommonCrawl is dominated by Western languages (Tagalog is at a mere \(0.0073\%\)).
Unfortunately, Tagalog is underrepresented even in multilingual LMs. For example, the XGLM language model (Lin et al., 2022) was only trained on 2.3k tokens of Tagalog data whereas the Bloom language model (Scao et al., 2022) doesn’t contain any Tagalog text at all. There’s still a long way to go. Currently, there are efforts such as Cohere’s Aya project that aim to close that multilingual gap.
Given these gaps, I think that the best way to use LLMs in the context of low-resource languages is to maximize information-per-query (IPQ). Yes, I made that one up. The idea is to extract higher quality information for every query from an LLM. Here, I define quality as reusability, i.e., something that can still be refined further for other downstream tasks. I’d even argue that NLU-based outputs such as summarization, common-sense reasoning, and question-answering have inherently high information bandwidth (and hence high IPQ) because they tap into the LLM’s interconnections from its very large training corpora.
For example, using raw LLM outputs as the final predictions for a structured prediction task (NER, text categorization, sentiment analysis) has low IPQ. This is because we already exhausted the lifetime of our query by serving it directly in our system. Asking an LLM to tag a text as hatespeech or not may not be the most efficient use of its capabilities.
We can increase IPQ by using these predictions to assist data annotation, thereby producing a supervised model with a more deterministic performance. Other ways to maximize IPQ include prompting an LLM in a chain-of-thought style to elicit “reasoning” (Wei et al., 2022) or building a knowledge graph from its “internal” model (Cohen et al., 2023): anything that utilizes an LLM to its fullest potential in a single query.
I admit that this made-up measure is rough at best. For a more thoughtful reading, I highly recommend Eugene Yan’s blog post on LLM patterns. Notice that most of these patterns aim to maximize IPQ. I also recommend Vicki Boykis’s reflection as it echoes what I feel towards ChatGPT in this age of AI hype.
I hope that this blog post is a more sober view of LLM capabilities for Tagalog. It would be great to live in a world where we don’t need to build corpora, but I don’t think we’re there yet. I still believe that LLMs have a use for structured prediction tasks, such as in annotation or as a silver-standard knowledge-base.
Personally, I’m interested in building parallel corpora so that we can have a comprehensive view of an LLM’s multilingual performance. I’m also curious about the various ways we can maximize the information obtained from LLMs and use it for downstream tasks. Finally, there might also be a good argument for building an LLM geared towards low-resource languages; I do think that it is a worthwhile endeavour.
Update (2023-12-06): This work was published in the EMNLP NLP-OSS workshop! Check out “calamanCy: A Tagalog Natural Language Processing Toolkit”!
I am excited to introduce calamanCy, an open-source toolkit for constructing natural language processing (NLP) pipelines for Tagalog.
It is built on top of spaCy to ensure easy experimentation and integration with other frameworks.
It provides general-purpose multitask models with out-of-the-box support for dependency parsing, part-of-speech (POS) tagging, and named entity recognition (NER).
The repository is available on Github, and you can install the package via pip:
pip install calamancy
import calamancy
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Ako si Juan de la Cruz.") # returns a spaCy Doc object
More importantly, calamanCy aims to accelerate the progress of Tagalog NLP by consolidating disjointed resources in a unified framework. In this blog post, I want to talk about the problem it’s trying to solve, my process on building the framework, some benchmarks, and future work.
calamanCy offers two word vector-based pipelines and one transformer-based pipeline. Suppose we want to detect named entities from a given text (NER):
import calamancy
nlp = calamancy.load("tl_calamancy_md-0.1.0")
doc = nlp("Pumunta si Juan sa Japan kahapon.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # (Juan, PER), (Japan, LOC)
Here, the variable nlp is an instance of spaCy’s Language class while doc is a spaCy Doc object.
Each pipeline contains a dependency parser, tagger, and entity recognizer.
The built-in entity recognizer was trained with annotations resembling CoNLL. It contains entities such as Person, Organization, and Location.
You can use these models as-is or finetune them on your dataset. In a later section, I’ll demonstrate these capabilities by benchmarking our models on both seen and unseen tasks. For more information on model training, please check out the spaCy documentation.
Despite Tagalog being a widely-spoken language here in the Philippines, model and data resources are still scarce. For example, our Universal Dependencies (UD) treebanks are tiny (less than 20k words) (Samson, 2018; Aquino and de Leon, 2020) and domain-specific corpora are few and far between (Cabasag et al., 2019; Livelo and Cheng, 2018).
In addition, we only have limited choices when it comes to Tagalog language models (LMs). For monolingual LMs, the state-of-the-art is RoBERTa-Tagalog (Cruz and Cheng, 2021). For multilingual LMs, we have the usual XLM-RoBERTa (Conneau et al., 2019) and multilingual BERT (Devlin et al., 2019). Tagalog is included in their training pool, but these models are still prone to the curse of multilinguality.
Therefore, consolidating these resources and providing more options to build Tagalog NLP pipelines is still an open problem.
calamanCy provides three language pipelines that fit any performance or accuracy requirements. Each pipeline provides out-of-the-box support for core NLP tasks:
Pipeline | Pretraining objective | Word embeddings | Dimensions |
---|---|---|---|
Medium-sized pipeline (tl_calamancy_md) | Predict some number of leading and trailing UTF-8 bytes for the words. | Uses floret static vectors trained on the TLUnified corpora. | 50k unique vectors (200 dimensions), Size: 77 MB |
Large-sized pipeline (tl_calamancy_lg) | Same pretraining objective as the medium-sized pipeline. | Uses fastText static vectors trained on CommonCrawl corpora. | 714k unique vectors (300 dimensions), Size: 455 MB |
Transformer-based pipeline (tl_calamancy_trf) | No separate pretraining because there’s no token-to-vector component. | Context-sensitive vectors from a transformer network. | Uses roberta-tagalog-base. Size: 813 MB |
The training process involves pretraining a filtered version of TLUnified (Cruz and Cheng, 2021), constructing static word embeddings if necessary, and training the downstream components.
Each pipeline contains Tagger, DependencyParser, Morphologizer, and EntityRecognizer spaCy components.
These models are also available on HuggingFace 🤗.
You can reproduce the whole training procedure by running the corresponding spaCy project on Github. Finally, you can find a list of data sources in the table below:
Source | Authors | License |
---|---|---|
TLUnified Dataset | Jan Christian Blaise Cruz and Charibeth Cheng | GNU GPL 3.0 |
UD_Tagalog-TRG | Stephanie Samson | CC BY-SA 3.0 |
UD_Tagalog-Ugnayan | Angelina Aquino and Franz de Leon | CC BY-NC-SA 4.0 |
Note that the Ugnayan treebank is not licensed for commercial use while TLUnified is under GNU GPL. Please consider these licenses when using the calamanCy pipelines in your application.
Before calamanCy, you usually have two options if you want to build a pipeline for Tagalog: (1) piggyback on a model trained from a linguistically-similar language (cross-lingual transfer) or (2) finetune a multilingual LM like XLM-R or multilingual BERT on your data (multilingual finetuning). Here, I want to check if calamanCy is competitive enough against these alternatives. I tested on the following tasks and datasets:
Dataset | Task / Labels | Description |
---|---|---|
Hatespeech (Cabasag et al., 2019) | Binary text classification (hate speech, not hate speech) | Contains 10k tweets collected during the 2016 Philippine Presidential Elections labeled as hatespeech or non-hate speech. |
Dengue (Livelo and Cheng, 2018) | Multilabel text classification (absent, dengue, health, sick, mosquito) | Contains 4k dengue-related tweets collected for a health infoveillance application that classifies text into dengue subtopics. |
TLUnified-NER (Cruz and Cheng, 2021) | NER (Person, Organization, Location) | A held-out test split from the annotated TLUnified corpora containing news reports. |
Merged UD (Aquino and de Leon, 2020; Samson, 2018) | Dependency parsing and POS tagging | Merged version of the Ugnayan and TRG treebanks from the Universal Dependencies framework. |
For text categorization and NER, I ran the experiments for five trials and reported their average and standard deviation. For dependency parsing and POS tagging, I used 10-fold cross-validation because the combined UD treebank is still too small.
The results show that our calamanCy pipelines are competitive (you can reproduce the results by following this spaCy project):
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
tl_calamancy_md | \(74.40 (0.05)\) | \(65.32 (0.04)\) | \(87.67 (0.03)\) | \(76.47\) | \(54.40\) |
tl_calamancy_lg | \(75.62 (0.02)\) | \(68.42 (0.01)\) | \(88.90 (0.01)\) | \(82.13\) | \(70.32\) |
tl_calamancy_trf | \(78.25 (0.06)\) | \(72.45 (0.02)\) | \(90.34 (0.02)\) | \(92.48\) | \(80.90\) |
We also evaluated cross-lingual and multilingual approaches in our benchmarks:
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
uk_core_news_trf | \(75.24 (0.05)\) | \(65.57 (0.01)\) | \(51.11 (0.02)\) | \(54.77\) | \(37.68\) |
ro_core_news_lg | \(69.01 (0.01)\) | \(59.10 (0.01)\) | \(02.01 (0.00)\) | \(84.65\) | \(65.30\) |
ca_core_news_trf | \(70.01 (0.02)\) | \(59.42 (0.03)\) | \(14.58 (0.02)\) | \(91.17\) | \(79.30\) |
Language Pipeline | Binary textcat (Hatespeech) | Multilabel textcat (Dengue) | NER (TLUnified-NER) | Dependency parsing, UAS (Merged UD) | Dependency parsing, LAS (Merged UD) |
---|---|---|---|---|---|
xlm-roberta-base | \(77.57 (0.01)\) | \(67.20 (0.01)\) | \(88.03 (0.03)\) | \(88.34\) | \(76.07\) |
bert-base-multilingual | \(76.40 (0.02)\) | \(71.07 (0.04)\) | \(87.40 (0.02)\) | \(90.79\) | \(78.52\) |
I highly recommend trying out calamanCy and giving your feedback so that I can improve the models over time. My priority for 0.2.0+ is to write domain-specific tokenizers to help with simple NLP tasks (e.g., for parsing tweets). I also want to do a few more data annotation projects as a precursor to v1.0.0. These projects include a more fine-grained label set for NER, and a better Universal Dependencies Treebank.
On the research side, I’m curious how large language models fare on Tagalog data. We’ve already built up some nice benchmarks because of this effort so it might be nice to do a side-by-side comparison for zero/few-shot prompting. I’m also interested in training language-specific adapters for efficiency.
If you have any questions, feel free to reach out on Github or my email!
Parts of this work were published in the paper, “Developing a Named Entity Recognition Dataset for Tagalog”, at IJCNLP-AACL’s Southeast Asian Language Processing Workshop. Feel free to cite that paper in your own work.
First off, I’m happy to see such warm reception to my first blog post. Thank you! There are a few more experiments that I wanted to do for the sake of completeness and rigor. I hope to release the alpha version of calamanCy in August, so this blog post may as well be a lead-up to that release.
This blog post can be summarized as: “young and naive researcher just learned something very obvious!” I am mostly referring to the annotation process. Luckily, I was able to find two more folks to help with data annotation, and we’ve been updating the original dataset for the past three months.
Tl;dr, we just finished re-annotating TLUnified with NER tags! You can access the updated corpus, TLUnified-NER, in HuggingFace Datasets.
It’s harder to imagine this when you’re annotating alone. In fact, the final diagram in my February blog post shows this misconception. We don’t just annotate a thousand examples until we’re tired and call it a day. Instead, annotation is iterative:
Nils Reiter’s blog post has been my annotation bible for the past few months. The figure above is a simplified version of his annotation workflow. We start by creating a set of pilot annotations and then continually iterate until we reach a stop condition. For each round, we add new annotations while correcting our past annotations. This process makes everything a bit more involved, but at least we get higher quality annotations in the end.
We tried to annotate 800-1000 examples for each round and ensured that each annotator gets the same batch of texts.
Labeling that amount of data takes one-and-a-half to two weeks at most.
During the pilot phase, we used the annotation guidelines I initially developed for myself.
As for our software, we used Prodigy with the ner.manual recipe.
Note that for each round, we added more examples to the corpus. After six to seven syncs over the course of four months, we arrived at our target dataset size. Finally, we also tried to correct our past annotations based on our revisions to the annotation guidelines. However, there were no checks or QA for these corrections, so I wasn’t able to track their diffs.
The evaluation step is perhaps the most crucial part of the annotation process. Ultimately, the goal is to improve the annotators’ understanding of the “phenomena.” This step usually involves the following activities:
Resolving disagreements / misconceptions: I compiled the annotations and computed for a partial inter-annotator agreement score (IAA). This process allowed me to estimate if our annotations are improving in quality.
For named-entity recognition (NER), it is not straightforward to compute for this metric (I used Cohen’s Kappa). It is possible to compute for this value at the token level, but this leads to an imbalanced dataset (e.g., there are many unlabeled tokens). So I followed what Deleger et al., (2012) and Brandsen et al., (2020) did and computed for the pairwise F1-score as well.
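A rough sketch of that pairwise computation is below; it assumes each annotator’s work is a list of documents, where every document is a set of (start, end, label) span tuples. Swapping the annotators only swaps precision and recall, so the F1-score stays the same.

```python
def pairwise_f1(annotator_a: list[set], annotator_b: list[set]) -> float:
    """Pairwise F1 between two annotators over the same documents."""
    tp = fp = fn = 0
    for spans_a, spans_b in zip(annotator_a, annotator_b):
        tp += len(spans_a & spans_b)   # spans both annotators marked
        fp += len(spans_b - spans_a)   # spans only annotator B marked
        fn += len(spans_a - spans_b)   # spans only annotator A marked
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```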
Mini-retrospective meetings: During the initial months of the annotation phase, I conducted sync meetings to discuss confusing examples and labels. Confusion usually came from edge-case examples or vagueness in the annotation guidelines. We try to resolve this by updating the guidelines or correcting our past annotations.
I try to make these meetings as short and async as possible (30 minutes to 1 hour). I pattern these meetings to a typical software development sprint retrospective.
We use Parabol (I co-opted the “Start, Stop, Continue” free template) as our collaboration software. We frame the questions like so:
Personally, I enjoy discussing rules as it compels me to establish a coherent pattern when labeling examples. We try to keep a “bank” of edge-cases and work together to address them. However, if we focus too much on individual examples, our meetings may become inundated with edge cases, hindering the improvement of the guidelines.
Assess if we need more annotations: For this annotation project, I have two stop conditions: (1) if the train curve doesn’t improve or (2) if we reached at least 5000 examples.
Prodigy provides a train-curve command to check if we still need more examples by training a model on 25%, 50%, and 75% of the training set.
For the most part, the trend points to us annotating more data, but my budget is running low and I have other things to do, so I stopped after we reached 7000 examples. I’m definitely game to annotate a few more, but in the future I’d want to include other useful labels such as morphological features or parts-of-speech (something for Universal Dependencies) in my next annotation project.
Here’s a chart on how our IAA metrics improved over time. These numbers don’t factor in our corrections. I’m just computing the metrics per batch as I receive them.
Finally, I found it helpful to have “north star questions” as I evaluate our annotations. It is easy to get bogged down by details and miss the bigger picture. These north star questions include:
In the future, I think it would be better to do a more fine-grained annotation project. I realized later in the project that my entity types are too general that we tend to lump several categories in a single label. I think that’s the next phase of improvement that I can do next.
Updating the annotation guidelines allows us to codify our understanding of the task. Personally, it is also a good way to preserve our learnings and onboard future annotators in the project. Most of the updates involve adding new examples, clarifying definitions, and noting edge-cases.
I’ll be updating the calamanCy repository with the latest version of the annotation guidelines. It has grown quite a bit (and we’re using Google Docs for easier collaboration), but it might be good to transfer it in the open.
I’m happy to see the growth of the project from its inception a few months ago. I’m also excited to share these updates and release the trained pipelines soon. Finally, I’d like to thank our annotators for helping out in the process.
In this blog post, I want to share my notes on parameter-efficient finetuning techniques (PEFT). Here, we only finetune a small number of parameters while keeping most of the LM parameters frozen. As a result, PEFT allows domain adaptation at a lower compute (and storage) cost. Lastly, this blog post is not a literature review; I will only discuss methods I personally like. For each PEFT, I will talk about its overview, related works, and high-level implementation.
To recap, pretrained language models like BERT (Devlin et al., 2019) contain contextualized word representations that capture the meaning of each token and its context within the text. By themselves, they’re already useful. However, language models have enjoyed greater versatility and state-of-the-art performance because of finetuning (Howard and Ruder, 2018).
Most of the pretrained LMs we use today are based on transformer networks (Vaswani et al., 2017). Let's review the architecture, as it will help us understand the PEFT techniques later on. Recall that most transformer networks consist of a stack of encoder and decoder layers with an attention mechanism:
The encoder layer consists of two sub-layers: an attention layer and a feedforward network. Its outputs are passed to the decoder, which consists of the same two sub-layers plus a cross-attention mechanism that attends to the encoder's output. Between each sub-layer, there is a residual (or skip) connection that is normalized through LayerNorm (Ba et al., 2016).
For transformers like BERT, there is no generative step, hence the model contains only encoder layers. Here's what a typical encoder layer looks like:
# Pseudocode of a typical encoder layer
class Encoder:
    def __call__(self, x: Tensor) -> Tensor:
        # Attention sub-layer with a residual connection and LayerNorm
        residual = x
        x = MultiHeadAttention(x)
        x = LayerNorm(x + residual)
        # Feedforward sub-layer with a residual connection and LayerNorm
        residual = x
        x = FeedForwardLayers(x)
        x = LayerNorm(x + residual)
        return x
This encoder-only transformer uses multi-head attention, where the attention function is applied in parallel over \(N_h\) heads. So given (1) a sequence of \(m\) vectors \(\mathbf{C}\in \mathbb{R}^{m\times d}\) over which we attend, and (2) a query vector \(\mathbf{x} \in \mathbb{R}^{d}\), multi-head attention computes the output on each head and concatenates them:
\[MultiHeadAttention(\mathbf{C}, \mathbf{x}) = Concat(head_1, \dots, head_{N_h})\\ where~~head_i = Attention(\mathbf{x}\mathbf{W}_q^{(i)}, \mathbf{C}\mathbf{W}_k^{(i)}, \mathbf{C}\mathbf{W}_v^{(i)})\]

The feedforward network consists of two linear transformations with a ReLU activation in between:

\[Feedforward(\mathbf{x}) = ReLU(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2\]

One common way to finetune is to attach a task-specific head at the end of a pretrained language model, then train the entire network on our labeled data.1 Although these LMs were trained on different objectives (for example, BERT was trained on masked language modeling and next sentence prediction), it is possible to refine their weights for other NLP problems (e.g., sentence tagging, sequence classification, etc.).
However, this process has become inefficient (Treviso et al., 2023). The number of parameters in pretrained language models has increased exponentially, exacerbating concerns about an LM's environmental impact (Strubell et al., 2019) and making these models inaccessible in resource-constrained, consumer-grade environments. Hence, an efficient approach to finetuning is becoming more desirable.
Recently, I’ve been interested in parameter-efficient techniques (PEFT) that are modular in nature. Most of these involve creating small, trainable modules (sometimes a feedforward network or a special matrix) that we attach to different parts of a larger, pretrained network. Usually, we keep the larger network frozen while we train the smaller modules on our data.
Most [parameter-efficient techniques] involve creating small, trainable modules that we attach to different parts of a larger, pretrained network.
I like parameter-efficient and modular techniques because they appeal to my engineering sensibilities. I like the “separation of concerns” between our network’s fundamental understanding about language (pretrained network) and its task-specific capabilities (modular network), as opposed to training a single monolithic model. We can even aggregate these modules to solve multi-domain (Gururangan et al., 2022; Chronopoulou et al., 2023; Asai et al., 2022; Pfeiffer et al., 2021) or multilingual problems (Pfeiffer et al., 2020; Pfeiffer et al., 2021).
I won’t be talking about every technique in this post. Instead, I’ll focus on methods that I like. For a more comprehensive overview, I highly recommend looking at Treviso et al.’s (2022) survey of efficient NLP techniques and Pfeiffer et al.’s (2023) work on modular deep learning.
Houlsby et al. (2019) first proposed the use of adapters in the context of NLP. The idea is to attach a small feedforward network after each transformer sub-layer and tune it while keeping the larger network frozen:
# Pseudocode of a typical encoder layer with adapter
class EncoderWithAdapter:
def __call__(self, x: Tensor) -> Tensor:
residual = x
x = MultiHeadAttention(x)
x = AdapterNetwork(x) # Usually another feedforward layer
x = LayerNorm(x + residual)
residual = x
x = FeedForwardLayers(x)
x = AdapterNetwork(x) # Usually another feedforward layer
x = LayerNorm(x + residual)
return x
The adapter consists of a bottleneck architecture with a down- and up-projection, similar to autoencoders. It also contains an internal skip-connection. According to Houlsby et al. (2019), they were able to limit the number of trainable parameters to \(0.5-8\%\) of the original model.
# Pseudocode of an adapter network
class AdapterNetwork:
    def __call__(self, x: Tensor, m: int) -> Tensor:
        residual = x
        d = get_dims(x)                                   # hidden dimension of the transformer
        x = DownProjection(x, input_dim=d, output_dim=m)  # bottleneck: project d -> m
        x = NonLinearity(x)
        x = UpProjection(x, input_dim=m, output_dim=d)    # project back: m -> d
        return x + residual                               # internal skip-connection
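As a rough sanity check on that range (my own back-of-the-envelope numbers, not from the paper), we can count the trainable parameters for a BERT-base-sized encoder with a bottleneck of 64 and two adapters per layer:

# Back-of-the-envelope adapter parameter count (assumed sizes, not from the paper)
n_layers, d, m = 12, 768, 64                             # BERT-base-ish: 12 layers, hidden dim 768, bottleneck 64
params_per_adapter = (d * m + m) + (m * d + d)           # down-projection + up-projection, with biases
total_adapter_params = params_per_adapter * 2 * n_layers # two adapters per encoder layer
base_params = 110_000_000                                # approximate size of BERT-base
print(f"{total_adapter_params:,} trainable params "
      f"({total_adapter_params / base_params:.1%} of the base model)")
# ~2.4M trainable parameters, roughly 2% of the base model

This ignores LayerNorm and any adapter-internal normalization parameters, but it is enough to see why the approach is cheap.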
Since then, researchers have proposed alternative adapter architectures. One notable development is from Pfeiffer et al. (2021), who reduced the number of adapters significantly by attaching only one adapter per transformer layer with minimal performance degradation. Alternative approaches also include training only the bias parameters as in BitFit (Ben Zaken et al., 2022), learning a task-specific "diff" vector as in diff-pruning (Guo et al., 2021), or connecting adapters in parallel (He et al., 2022).
Most adapters consist of a bottleneck architecture with a down- and up-projection
However, I’m more interested in composing adapters together to solve multi-domain, multi-task, or multilingual problems.2 They add another level of flexibility to PEFT:
Multi-domain: For example, Chronopoulou et al. (2023) trained an adapter for each domain and computed their weight-space average at test time, dubbing the result an AdapterSoup. The idea of "model soups" was based on Wortsman et al. (2022) and was inspired by convex optimization techniques (I sketch this averaging in code right after this list).
Multi-task: Pfeiffer et al.'s (2021) work on AdapterFusion attempts to transfer knowledge from one task to another by combining their corresponding adapter representations. The "fusion" is done by introducing a new set of parameters \(\Psi\) (separate from the pretrained LM and adapter parameters, \(\Theta\) and \(\Phi\)) that learn to combine all the task adapters.
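Going back to the multi-domain case for a moment, here is the weight-space averaging sketch promised above. All names are hypothetical, and the actual AdapterSoup method also chooses which domain adapters to include for a given test domain, which I skip here:

# Minimal sketch of AdapterSoup-style weight-space averaging (hypothetical names)
import torch

def average_adapters(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Uniformly average the weights of several adapters that share one architecture."""
    keys = state_dicts[0].keys()
    return {key: torch.stack([sd[key] for sd in state_dicts]).mean(dim=0) for key in keys}

# soup = average_adapters([legal_adapter, finance_adapter, biomed_adapter])
# frozen_base_model.load_state_dict(soup, strict=False)  # load only the averaged adapter weights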
The table below shows monolithic and modular techniques for each dimension of an NLP problem. Note that adapters still require a pretrained LM as its base, so don’t treat this as an either-or comparison.
Dimension | Monolithic | Modular |
---|---|---|
Multi-domain | Most general-purpose BERT pretrained models | AdapterSoup (Chronopoulou et al., 2023), DEMix (Gururangan et al., 2022) |
Multi-task | MT-DNN (Liu et al., 2019) | AdapterFusion (Pfeiffer et al., 2021) |
Multi-lingual | XLM-R (Conneau et al., 2019), Multilingual BERT (Devlin et al., 2018) | MAD-X (Pfeiffer et al., 2020), UNKs everywhere (Pfeiffer et al., 2021) |
There are still open problems I foresee with adapters. One is the complexity-efficiency tradeoff: it's possible to "go crazy" architecting adapters to the point that the efficiency gains are not worth the complexity of the training set-up. Just like in engineering, there's a tendency to go full Kubernetes when a simple virtual machine would do. I'm curious about the practical aspects of adapter training; I think AdapterHub is a good start. I'm still looking forward to developments in this field.
Nowadays, the common way to leverage large language models is through in-context learning via hard prompts (Brown et al., 2020). Take this prompt, for example, in a text classification task:
Determine whether the text below is a "Recipe" or "Not a recipe"
Text: """Add 2 cups of rice to 3 cloves of garlic, then
add butter to make fried rice"""
Answer: Recipe
Text: """I'm not sure if that will work"""
Answer: Not a recipe
Text: """To make a caesar salad, combine romaine lettuce,
parmesan cheese, olive oil, and grated eggs"""
Answer:
We refer to the text above as a hard prompt because we're using actual tokens that are not differentiable (think of "hard" as something static or set in stone). The problem here is that the output of our LLM is highly dependent on how we constructed our prompt. Language is combinatorial: it can take a lot of time to find the right "incantation" so that our LLM performs optimally. This begs the question: what if we could learn our prompts?
The problem [with in-context learning] is that the output of our LLM is highly dependent on how we constructed our prompt.
Prompt and prefix tuning solve this by making use of soft prompts: a vector attached to the input embedding that we train while keeping the pretrained LM frozen. These two ideas appeared at around the same time (Lester et al., 2021; Li and Liang, 2021) and have similar mechanisms. The main difference is that prompt tuning adds the tensor only at the input, while prefix tuning adds a tensor at each transformer layer:
# Pseudocode of a typical encoder layer with soft prompts
class EncoderWithSoftPrompts:
    def __call__(self, x: Tensor, soft_prompt: Tensor) -> Tensor:
        # (Optionally) reparameterize the soft prompt with a small network
        soft_prompt = FeedForwardLayer(soft_prompt)
        # Prepend the trainable soft prompt to the input sequence
        x = concatenate([soft_prompt, x])
        residual = x
        x = MultiHeadAttention(x)
        x = LayerNorm(x + residual)
        residual = x
        x = FeedForwardLayers(x)
        x = LayerNorm(x + residual)
        return x
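To make the "trainable prompt, frozen LM" split concrete, here is a minimal PyTorch-style sketch of how the soft prompt itself could be created and prepended. The shapes and the `pretrained_lm` object are assumptions on my part, not the reference implementation of either paper:

# Minimal sketch of a trainable soft prompt (assumed shapes; not the papers' reference code)
import torch
import torch.nn as nn

d_model, n_prompt_tokens = 768, 20
soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)  # "virtual token" embeddings

# Freeze the pretrained LM so only the soft prompt receives gradients:
# for param in pretrained_lm.parameters():
#     param.requires_grad = False
# optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def prepend_prompt(input_embeddings: torch.Tensor) -> torch.Tensor:
    """Prepend the soft prompt to a batch of input embeddings (batch, seq_len, d_model)."""
    batch_size = input_embeddings.shape[0]
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    return torch.cat([prompt, input_embeddings], dim=1)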
One of the currently popular implementations of prefix tuning is the LLaMA-Adapter (Zhang et al., 2023). The mechanism is similar to prefix tuning with some subtle differences:
I find these techniques (collectively known as p*-tuning) exciting because they open up different ways to update transformer inputs. I'm interested in examining, post hoc, how these soft prompt weights depend on the data or domain. For example, we can probe these soft prompts to determine which parts of the input were most influential in generating the output (for both autoregressive and encoder-decoder networks). Lastly, I also find it exciting to look for ways to aggregate soft prompts for multilingual applications.
Lastly, Hu et al.'s (2021) work on Low-Rank Adaptation (LoRA) aims to make the weight update process more efficient. Their premise, based on Aghajanyan et al. (2021), is that although weight updates are full-rank (each row and column is linearly independent), they can still be represented in a lower-dimensional space while retaining most of their structure (low-rank).
This technique is akin to using PCA or UMAP to reduce an n-dimensional (\(n>2\)) dataset into 2-d space for visualization. In LoRA, we're reducing the weight update matrix \(\Delta W\). Note that we're not actually changing the shape of \(\Delta W\); instead, we're representing it in a low-rank form where redundancies are acceptable.
LoRA constrains the weight update \(\Delta W\) of a matrix \(W_0 \in \mathbb{R}^{d\times k}\) with two trainable low-rank matrices \(A\) and \(B\). The weight update computation then becomes \(W_0 + \Delta W \rightarrow W_0 + BA\), where \(B \in \mathbb{R}^{d\times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll min(d,k)\):
These \(B\) and \(A\) matrices are trainable. The authors used a random Gaussian initialization for \(A\) and zeros for \(B\), so \(\Delta W\) is a zero matrix at the beginning of training. Here, the rank \(r\) is a hyperparameter that must be tuned. In a transformer network, LoRA is applied to the attention weights:
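Here is a minimal PyTorch-style sketch of a LoRA-augmented linear layer based on the description above. It is my own simplification: the official implementation also includes a scaling factor \(\alpha / r\), dropout, and utilities for merging the update back into \(W_0\):

# Minimal sketch of a LoRA-augmented linear layer (my simplification, not the official code)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, k: int, r: int = 8):
        super().__init__()
        self.W0 = nn.Linear(d, k, bias=False)             # pretrained weight, kept frozen
        self.W0.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # random Gaussian initialization
        self.B = nn.Parameter(torch.zeros(d, r))          # zero initialization, so BA starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d) -> (..., k); the trainable update is the low-rank product BA
        return self.W0(x) + (x @ self.B) @ self.A

In the paper, this kind of update is applied to the attention projection matrices (e.g., the query and value projections).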
I haven’t seen a lot of work that expands on LoRA or rank decomposition as an efficient finetuning technique. However, I’m interested in performing more empirical experiments on this method, especially in figuring out where else we can attach these LoRA matrices. Interestingly, LoRA has found a foothold in image generation, especially in diffusion models (Rombach et al., 2021). Perhaps this technique can be extended to other data modalities.
In this blog post, we looked into a few parameter-efficient finetuning techniques (PEFTs) for large language models. First, we had adapters, which attach small trainable modules between transformer sub-layers. Then prompt and prefix tuning, which attach tunable parameters near the transformer input. Finally, we had low-rank adaptation, which decomposes the weight update into low-rank matrices.
It is interesting that despite their differences, there are structural similarities. Recently, I've been interested in reading papers that surmise a general framework for PEFTs. He et al. (2021) posit that PEFTs can be seen as clever weight updates on a transformer subspace. Pfeiffer et al. (2023) generalize this further into what we now know as modular deep learning. On the application side, AdapterHub provides an implementation framework that lets users leverage these PEFTs.
It is interesting that despite their differences, there are structural similarities between parameter-efficient finetuning techniques.
Finally, my interest in PEFTs is motivated by going against the current zeitgeist in NLP: training larger and larger models on larger and larger datasets. Perhaps there's value in moving in the other direction. Moving the needle on efficiency can lead to truly accessible and democratized NLP.
Marcos Treviso, et al. “Efficient methods for natural language processing: a survey.” arXiv preprint arXiv:2209.00099 (2022).
Note that it’s also possible to freeze the entire pretrained LM and only update the weights of the task-specific head. However, this only works well if the task-specific dataset is small (to avoid overfitting). In most cases, updating both sets of weights leads to better performance on the task at hand. ↩
NLP problems have multiple dimensions. Multi-domain: legal texts, finance documents, scientific publications, etc.; Multi-task: question answering, sequence classification, sequence tagging, etc.; Multi-lingual: English, German, French, etc. ↩
Because [annotation] guidelines were written to define the parameters of a task, we hope that they can improve the annotation process by providing more context and examples.
In this blog post, I want to focus on argumentative sentence detection: we want to know if a given text is an argument. I’ll use the “minimum wage” dataset from the UKP Sentential Argument Mining Corpus (Stab, et al., 2018). In addition, I’ll use three other annotation guidelines from different NLP papers. I based these choices on the work of Jakobsen et al. (2022).
Because each guideline defines an argument differently and asks for different labels, I normalized them into `1: Argument` and `0: No argument`, similar to Jakobsen et al.’s (2022) work. The table below summarizes these guidelines (the number beside each label is its normalized version):
Authors | Labels | How they defined an argument |
---|---|---|
Levy et al., 2018 | `Accept (Pro/Con)` (1), `Reject` (0) | Here they defined a claim as the conclusion, that is, the assertion the argument aims to prove. They didn’t mention anything about premises. |
Stab et al., 2018 | `Attacking` (1), `Opposing` (1), `Non argument` (0) | They have an explicit requirement where each claim should be backed up by another claim or premise. Claims alone don’t equate to an argument. |
Shnarch et al., 2018 | `Accept` (1), `Reject` (0) | They defined an argument as containing a claim (conclusion) and premise (evidence). Claims alone don’t equate to an argument. |
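In code, this normalization is just a lookup from each guideline's label set into the shared binary scheme. A small sketch (label strings follow the table above):

# Normalize each guideline's labels into the shared binary scheme (1 = Argument, 0 = No argument)
LABEL_MAP = {
    "Accept (Pro/Con)": 1,   # Levy et al., 2018
    "Attacking": 1,          # Stab et al., 2018
    "Opposing": 1,           # Stab et al., 2018
    "Accept": 1,             # Shnarch et al., 2018
    "Reject": 0,             # Levy et al., 2018 / Shnarch et al., 2018
    "Non argument": 0,       # Stab et al., 2018
}

def normalize(label: str) -> int:
    return LABEL_MAP[label]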
By feeding the annotation guideline into the prompt, we can get LLM predictions that incorporate its definitions and examples. This is similar to my previous blog post, with the addition of more context from the annotation guideline. The engineering challenge lies in fitting a long string of text into a prompt constrained to a fixed number of tokens.
I plan to accomplish this task using Prodigy, an annotation tool, and LangChain, a library for working with large language models. You can find the source code in this GitHub repository.
To save costs, I’ll only be using the 496 samples from the test set. I also discarded the Morante et al., 2020 annotation guideline because it’s eight pages long, and API costs can balloon from processing it.
Fitting a long document into OpenAI’s prompt is one of the primary engineering challenges in this project. The GPT-3.5 `text-davinci-003` model only allows a maximum request length of 4097 tokens, which are shared between the prompt and its completion, so most annotation guidelines won’t fit.
LangChain offers a simple solution: split the document into chunks and think of prompts as functions. By doing so, we can leverage a wide range of data engineering concepts such as map-reduce, reranking, and sequential processing. The LangChain API dubs these as `MapReduce`, `MapRerank`, and `Refine`, respectively.
LangChain offers a simple solution: split the document into chunks and think of prompts as functions. We can then leverage a wide range of data engineering concepts.
During my experiments, I found that using the sequential prompting technique, `Refine`, works best for annotation guidelines. The output is more consistent and the model does not fail at performing the task. The figure below provides an overview of this process:
Write a seed prompt. The seed prompt asks GPT-3.5 to classify an example given the first chunk of the annotation guideline. It then returns a preliminary answer that will be refined later on using the refine prompt. For our project, the seed prompt looks like this:
Context information is below.
-----------------------------------------
{context}
-----------------------------------------
Given the context information and not prior knowledge, classify
the following text:
{question}
Write a refine prompt. The refine prompt asks GPT-3.5 to refine its answer given new information. The prompt is called successively until all chunks have been shown. Then, we take the refined answer and assign it as our LLM’s prediction. The refine prompt looks like this:
The original text to classify is as follows: {question}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed)
with some more context below.
----------------------------------------------
{context}
----------------------------------------------
Given the new context, refine the original answer to better
classify the question. If the context isn't useful, return
the original answer.
Notice that I’m using some terms from the Question-Answering domain: context refers to the chunk of text from the annotation guideline, and question is the example to be classified. I patterned my prompts after this domain because it’s easier to think of annotation as a question-answering task.
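Putting it together, the whole strategy boils down to a loop over guideline chunks. The sketch below is not LangChain's internals; it assumes a hypothetical `complete()` function that wraps the OpenAI call and uses simplified versions of the two prompts above:

# Sketch of the Refine strategy over guideline chunks (hypothetical `complete()` LLM wrapper)
from typing import Callable

SEED_PROMPT = (
    "Context information is below.\n"
    "-----------------------------------------\n"
    "{context}\n"
    "-----------------------------------------\n"
    "Given the context information and not prior knowledge, classify the following text:\n"
    "{question}"
)

REFINE_PROMPT = (
    "The original text to classify is as follows: {question}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer (only if needed) "
    "with some more context below.\n"
    "----------------------------------------------\n"
    "{context}\n"
    "----------------------------------------------\n"
    "Given the new context, refine the original answer to better classify the question. "
    "If the context isn't useful, return the original answer."
)

def classify_with_guideline(question: str, chunks: list[str], complete: Callable[[str], str]) -> str:
    """Sequentially refine an answer over each chunk of the annotation guideline."""
    answer = complete(SEED_PROMPT.format(context=chunks[0], question=question))
    for chunk in chunks[1:]:
        answer = complete(REFINE_PROMPT.format(question=question, existing_answer=answer, context=chunk))
    return answer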
Funnily enough, this problem may be considered solved with the release of GPT-4. With a longer context length (32k tokens or roughly 25,000 words), you can fit in a whole annotation guideline without splitting it into chunks.
For this evaluation step, I want to compare the predictions from each annotation guideline to the gold-standard annotations found in the UKP dataset. First, I normalized all labels into a binary classification task between an `Argument` and `No argument`:

- `Argument`: Accept (Pro/Con), Attacking, Opposing, Accept
- `No argument`: Reject, Non argument

This process gave us a dataset with \(227\) examples with the `Argument` class and \(270\) examples with the `No argument` class. The F1-score for each annotation guideline is shown below:
Scores | Stab et al. (2018) | Levy et al. (2018) | Shnarch et al. (2018) |
---|---|---|---|
Micro F1-score | \(\mathbf{68.35}\) | \(60.08\) | \(42.54\) |
Macro F1-score | \(\mathbf{67.61}\) | \(53.71\) | \(37.94\) |
F1-score (per type) | Stab et al. (2018) | Levy et al. (2018) | Shnarch et al. (2018) |
---|---|---|---|
No argument (`NoArgument`) | \(\mathbf{72.50}\) | \(89.26\) | \(54.83\) |
Argument (`Argument`) | \(\mathbf{62.71}\) | \(36.54\) | \(21.05\) |
As expected, the performance of the Stab et al. (2018) guideline is closest to the original annotations. It’s also interesting how the results are heavily biased towards the `NoArgument` class for both the Levy et al. (2018) and Shnarch et al. (2018) guidelines. From a qualitative inspection, these results make sense because the wording of these guidelines denotes more stringent criteria for accepting statements as arguments:
Figure: Portion of annotation guidelines from Shnarch et al. (2018)
Figure: Portion of annotation guidelines from Levy et al. (2018)
It’s still hard to say which exact statements from the guideline informed an LLM’s decision. But because our prompting strategy refines the answer for each chunk of text, it’s possible that original `Accept` answers were rejected because of new information from the prompt.
Finally, I also noticed that the performance from the Stab et al. (2018) annotation guideline is worse than the supervised and zero-shot predictions from my previous blog post:
Scores | Zero-shot | Supervised | Stab et al. (2018) |
---|---|---|---|
Micro F1-score | \(\mathbf{81.45}\) | \(79.88\) | \(61.90\) |
Macro F1-score | \(\mathbf{78.74}\) | \(77.52\) | \(55.02\) |
F1-score (per type) | Zero-shot | Supervised | Stab, et al. (2018) |
---|---|---|---|
Supporting argument (`Argument_for`) | \(\mathbf{75.21}\) | \(73.60\) | \(48.74\) |
No argument (`NoArgument`) | \(\mathbf{86.74}\) | \(85.66\) | \(72.50\) |
Opposing argument (`Argument_against`) | \(\mathbf{74.26}\) | \(73.30\) | \(46.00\) |
It’s an interesting negative result because it ran contrary to what I expected: we already have the annotation guideline, so isn’t it supposed to work well? However, I realized that it’s still difficult to make a straight-up comparison between two prompts: it’s possible that one prompt was simply written more poorly than the other (i.e., not “fine-tuned” with known prompt engineering techniques). Personally, I will still dabble in this research lead, but it’s also possible that writing a short and sweet zero-shot prompt works best for our task.
This time, let’s lean into the idea that LLMs capture the intention of annotation guidelines and compare them against one another. We take one guideline as the reference, treat the rest as predictions, and compute the F1-score. We then arrive at the graph below:
It’s interesting that in all cases, the Stab et al. (2018) annotation guideline performs best (of course, discounting cases where we evaluate a guideline against itself). On the other hand, the Shnarch et al. (2018) guideline performs the worst.
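For anyone who wants to reproduce this kind of comparison, the computation is just pairwise F1 over aligned prediction lists; the predictions below are made-up placeholders:

# Pairwise guideline-vs-guideline comparison (placeholder predictions; 1 = Argument, 0 = No argument)
from itertools import permutations
from sklearn.metrics import f1_score

predictions = {
    "stab": [1, 0, 1, 1, 0],
    "levy": [1, 0, 0, 1, 0],
    "shnarch": [0, 0, 0, 1, 0],
}

for reference, candidate in permutations(predictions, 2):
    score = f1_score(predictions[reference], predictions[candidate], average="macro")
    print(f"reference={reference:<8} prediction={candidate:<8} macro F1={score:.2f}")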
I don’t think there’s anything vital to conclude from these results. Perhaps they say something about how strict a guideline is? Maybe this can lead to experiments that investigate how similar guidelines are to one another. We usually measure text similarity via some cosine distance between the text’s vectors. However, guidelines are intentional, and maybe something can be said about the products of these intentions, which in this case, are the annotations.
In this blog post, we looked into how we can incorporate annotation guidelines into our annotation workflows by including them in the prompt. In order to get around OpenAI’s token limit, we partitioned our document and passed each chunk sequentially into the prompt. All of these were accomplished using Prodigy and LangChain.
When compared to gold-standard annotations, the original guideline for the UKP dataset performed better than guidelines written for other tasks. However, a zero-shot approach outperformed all methods; in fact, even a straightforward supervised approach outperforms a prompt with annotation guidelines. I see this more as a negative result.
Moving forward, I think much can still be done in this line of work. I imagine using this process to evaluate how well an annotation guideline “models” the task. I wouldn’t use it to get few-shot predictions; it’s costly and not performant. In addition, it might be interesting to incorporate annotation guidelines into the annotation UI, perhaps to highlight relevant parts of the document that are useful for accomplishing a task. I’m interested to hear any suggestions or thoughts about this experiment. Feel free to reach out or comment below!
In this blog post, I want to explore how this approach translates to more intricate and challenging annotation tasks, such as argument mining where one has to trace chains of reasoning. I’ll be working on a portion of the UKP Sentential Argument Mining Corpus (Stab, et al., 2018), where sentences are categorized either as a supporting argument, an attacking argument, or a non-argument for a given topic. I want to investigate two major questions:
Can zero-shot annotations be reliable? Here, I’d like to benchmark zero-shot annotations from GPT-3 and compare them with a baseline approach. I don’t think we should rely on GPT-3 alone to annotate our data, but it doesn’t hurt to see if they work.
Can LLMs provide extra affordance? I’d like to explore UI elements in which LLMs can help human annotators reduce their cognitive load when labeling. I want to explore an LLM’s ability to highlight spans or provide a reason for its labels. Each of these affordances represents a different level of “reliance” on an LLM’s capabilities.
I will focus on the topic of minimum wage in the UKP corpus. I find it interesting, and the number of samples is small enough that I don’t have to worry about OpenAI API costs.
| Tokens | No argument | Supporting | Opposing |
---|---|---|---|---|
Training set | \(42589\) | \(968\) | \(414\) | \(396\) |
Development set | \(4899\) | \(108\) | \(46\) | \(44\) |
Test set | \(13257\) | \(270\) | \(116\) | \(111\) |
Total | \(60745\) | \(1346\) | \(576\) | \(551\) |
Table: Dataset statistics for the `minimum wage` subset of the UKP Sentential Argument Mining corpus (Stab, et al., 2018).
You can find the full project in this Github repository. It is an instance of a spaCy project that you can run to reproduce my results.
First, I want to check how accurate GPT-3’s zero-shot annotations are. Then, I will compare it against a standard approach of training a supervised model from the corpus. To qualify, the word “reliability” here is shallow. I’m not making claims on a language model’s trustworthiness, only its test set accuracy.
I will compare GPT-3’s zero-shot annotations against a standard approach of training a supervised model from the corpus.
In the supervised set-up, I’m using spaCy’s TextCategorizer to perform an exclusive text classification task. It uses a stacked ensemble of a linear bag-of-words model and a neural network model initialized with the weights of a large RoBERTa transformer (Liu, et al., 2019).1 To recap how supervised learning works, we train a model from the training and development sets, then evaluate the predictions on a held-out test set as shown in the figure below:
In the zero-shot set-up, I completely ignore the training and development sets and include test set examples into the prompt. Then, I send this prompt to GPT-3 and parse the results. Finally, I treat whatever it returns as its predictions and compare them with the gold-standard test data.
I formatted the prompt like this:
Determine whether the text is a supporting argument (Argument_for),
opposing argument (Argument_against), or none (NoArgument) regarding
the topic of "minimum_wage." Answer in the following format:
answer: <Argument_for,Argument_against, or NoArgument>
Text:
"""
Increasing minimum wage will increase our standard of living.
"""
And GPT-3 answers in the form of:2
answer: Argument_for
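For the actual runs I used Prodigy's OpenAI recipes, but the core of the zero-shot setup is just a completion request plus a small parser. Here's a rough standalone sketch assuming the legacy `openai` (pre-1.0) Python client:

# Standalone sketch of the zero-shot call and parsing (assumes the legacy openai<1.0 client)
import openai

PROMPT = (
    'Determine whether the text is a supporting argument (Argument_for),\n'
    'opposing argument (Argument_against), or none (NoArgument) regarding\n'
    'the topic of "minimum_wage." Answer in the following format:\n'
    'answer: <Argument_for,Argument_against, or NoArgument>\n\n'
    'Text:\n"""\n{text}\n"""'
)

def zero_shot_label(text: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=PROMPT.format(text=text),
        max_tokens=16,
        temperature=0,
    )
    completion = response["choices"][0]["text"]
    # Parse the "answer: <label>" line; fall back to NoArgument if the format drifts
    for line in completion.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return "NoArgument"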
For comparison, I evaluated the predictions of the supervised and zero-shot setup using the gold-standard test data. The results are shown in the table below:
Scores | Zero-shot | Supervised |
---|---|---|
Micro F1-score | \(\mathbf{81.45}\) | \(79.88\) |
Macro F1-score | \(\mathbf{78.74}\) | \(77.52\) |
F1-score (per type) | Zero-shot | Supervised |
---|---|---|
Supporting argument (`Argument_for`) | \(\mathbf{75.21}\) | \(73.60\) |
No argument (`NoArgument`) | \(\mathbf{86.74}\) | \(85.66\) |
Opposing argument (`Argument_against`) | \(\mathbf{74.26}\) | \(73.30\) |
Interestingly, the zero-shot pipeline performs better than the supervised pipeline. For example, the Macro F1-score, which reports how well the two classifiers fare in light of an imbalanced dataset, shows the zero-shot classifier leading by a hair against the supervised model. These results answer the question we posed in this section: we can rely on GPT-3’s zero-shot annotations when labeling an argument mining dataset.
Perhaps the wrong conclusion to make here is that we can just replace our supervised model with a zero-shot classifier in production. It’s an appealing thought because we see higher scores from our LLM predictions. However, that’s a trap because we already know the test set. In production, we either don’t have access to gold-standard annotations, or we’re still making it ourselves. So instead, think of our zero-shot predictions as silver-standard annotations that we can refine further to produce trustworthy, gold-standard labels.
Think of our zero-shot predictions as silver-standard annotations that we can refine further to produce trustworthy, gold-standard labels.
In this section, I want to explore what other capabilities a large language model can offer as we annotate our dataset. I want to think of these as affordances, something that an annotator can use as they produce gold-standard data.
In the context of argument mining, we can use LLMs to (1) highlight an argument’s claim and premise and (2) provide a reason as to why it labeled a particular text as such. I’d like to think of these affordances as different levels of reliance over GPT-3’s capabilities:
I noticed that I become more inattentive as I use affordances that rely heavily on LLMs.
From my annotation experience, I noticed that I become more inattentive as I use affordances that rely heavily on LLMs. It’s easier for me to just accept whatever the language model suggests. This can be dangerous because an LLM can, in all its biases, influence my annotations. More so if it demonstrates some level of intelligible text to defend its decision. Perhaps it’s just my personal negligence and laziness, but I’d like to highlight this experience to provide some context to the next sections.
Palau and Moens (2009) define an argument as a set of premises that support a claim.3 For our dataset, I would like to introduce an annotation set-up where the premise and the claim, if present, are highlighted to guide annotation.
Annotation set-up where the premise and claim, if present, are highlighted to guide annotation.
This set-up aims to give the annotator extra information to label a particular text. For example, they can use the highlighted spans to easily check the premise of an argument with respect to its claim. The hope is that through this, we can reduce the cognitive load of annotation as the relevant parts of the document are already emphasized.
For this to work, we need to (1) treat the premise and the claim as spans and prompt GPT-3 to identify them for each text as a span labeling task. Then, we (2) pass this information into our annotation tool and label as usual, except that the relevant spans are now highlighted:
For GPT-3, the prompt goes like this:
From the text below, identify the exact span of text that represents
the premise, and the claim regarding the topic of minimum wage.
Answer in the following format:
Premise: <comma delimited list of string or N/A>
Claim: <string or N/A>
Here's the text
Text:
"""
In 2009, an increase in minimum wage resulted to a
higher standard of living .
"""
And we expect the answer in the form of:
Premise: "In 2009", "higher standard of living"
Claim: "increase in minimum wage"
I implemented this process using the OpenAI recipes from Prodigy. In particular, I used the `ner.openai.fetch` recipe to prompt GPT-3 to extract spans based on the labels I provided, i.e., `Premise` and `Claim`. This recipe attaches the premise and claim as spans into a new corpus that I can load using Prodigy’s built-in `textcat.manual` recipe. Because of this set-up, the spans are highlighted in the UI as shown below:
This set-up allows annotators to take advantage of relevant spans as they decide the label for a particular example. We can also combine our `textcat` annotations earlier to pre-select the category choice so that annotators can confirm an LLM’s prediction. For example, they can compare the highlighted premise to the predicted category and decide whether to correct or accept the suggested annotation.
Finally, I want to mention that this set-up still relies on a human annotator’s reasoning in order to label the text. The UI presents the relevant components, i.e. an argument’s premise and claim, but it’s still up to the annotator whether to use or ignore these affordances. In the next section, we will double-down on GPT-3’s capabilities and ask it to “reason” why it labeled a particular text as such.
This time, we increase our dependence on an LLM’s output by asking it to explain why it labeled a particular text as such. Wei, et al. (2022) proposes a prompting method called chain-of-thought to elicit complex reasoning from large language models. We will be using this technique in our argument mining dataset.
Chain-of-thought prompting is often applied to arithmetic, commonsense, and symbolic reasoning tasks. The trick is to decompose a problem into intermediate steps and solve each before giving the final answer. Researchers have found empirical gains when using chain-of-thought prompting, but it’s still a hotly debated topic whether large language models can reason like humans do (Bender et al., 2021 and Bender and Koller, 2020).
One way we can apply chain-of-thought prompting to our dataset is by decomposing an argument into its premise and claim. Then, we can examine each premise and determine if they are in favor, against, or irrelevant to the claim.4 From there, we can start classifying the text amongst our three labels. This “reasoning pipeline” is shown in the figure below:
The prompt is a bit long, but it goes like this: I first provided the instructions for the task (i.e., the labels to classify with, the format for parsing, etc.), then I included exemplars via chain-of-thought. You can find the complete prompt by clicking the details tab below:
Chain of thought prompt for argument mining annotation
Determine whether the text below is a supporting argument (Argument_for),
opposing argument (Argument_against), or none (NoArgument) regarding
the topic of "minimum wage." First, identify the premise and the
claim, then determine if the reasons or evidence presented in the
argument are in favor of the claim, against the claim, or irrelevant
to the claim. Answer in the following format:
answer: <Argument_for, Argument_against, or NoArgument>
reason: <reasoning process>
Below are some examples
Text:
"""
Or instead of hiring fewer employees , the company may start
outsourcing jobs to employees in countries that are willing to
work for much less than $ 10.10 per hour , resulting in fewer
jobs for Americans ."
"""
answer: Argument_against
reason: The premise of the text is that if the minimum
wage is increased to $10.10 per hour, then companies may
start outsourcing jobs to countries where employees are
willing to work for much less, leading to fewer jobs for
Americans. This is an argument against increasing the
minimum wage. Therefore the answer is Argument_against.
Text:
"""
{{text}}
"""
I used the same recipe as the zero-shot experiment earlier, but changed the prompt to use chain-of-thought. I then loaded OpenAI’s output back into a custom Prodigy recipe so that I can view the results in a UI. Below, you’ll find some screenshots of this updated UI (they might be too small on your screen, so feel free to open the images in another tab). Here, we see GPT-3’s explanation alongside its suggestion:
Ideally, annotators can refer to GPT-3’s explanation to confirm or correct a suggestion. Personally, I find this workflow helpful on texts that are either confusing or hard to parse. Having the premise and claim deconstructed in an explicit way can help improve annotation efficiency and hopefully reduce annotator fatigue.
In this blog post, we looked into different ways large language models like GPT-3 can augment the annotation process. We did this in the context of argument mining, a complex annotation task that involves reasoning. First, we examined how zero-shot predictions fare on the corpus, and then explored potential affordances to improve the annotation process. We categorized these affordances based on their reliance on LLMs: manual, directed, and dependent.
A directed approach uses an LLM to provide supplementary information that a human annotator can use to inform their labeling decision. On the other hand, a dependent approach asks an LLM to not only suggest labels, but also explain why a text was labeled as such. Personally, I find myself becoming more inattentive as I move toward the more dependent, LLM-guided approaches.
The final annotation still rests on a human annotator’s decision, not the LLM’s. These affordances are mere guides that provide convenience through suggestions.
The final annotation still rests on a human annotator’s decision, not the LLM’s. These affordances are mere guides that provide convenience through suggestions. I find it comforting, because I believe there’s still some nuance in most annotation tasks that only we can capture.
Finally, I think I made some claims here about improving efficiency because of these UI affordances (e.g., inattentiveness, parsing chain-of-thought). I admit that these were all from my personal experiences when annotating and trying out these recipes. I’m happy to hear suggestions on how I can verify or test these claims. Feel free to drop a comment below.
You can check the configuration file I used for this project in this Github repository. ↩
I don’t really expect that GPT-3 will always abide by the response format I set; it’s a statistical model, after all. However, I didn’t encounter any parser errors during my experiments. ↩
Several NLP papers may still have varying degrees of explicitness towards this definition (Jakobsen, et al., 2022), but for the purposes of this blog post, we’ll stick with the one above. ↩
Personally, I think what qualifies as a chain-of-thought prompt is a bit vague. For now, I scoped it as “any prompt that contains a coherent series of intermediate reasoning steps that lead to the final answer of a problem.” Lastly, I highly recommend going through the paper’s OpenReview discussion, there are some interesting points raised by the reviewers regarding evaluation. ↩