IOAI Philippines 2026 NLP Lecture

Contents

  • Problem 1: SkyAssist
  • Step 1: (R)ead the problem
  • Step 2: (I)mplement the Baseline
  • Step 3: (C)heck for Errors
  • Step 4: (E)nhance the Solution
  • Trivial enhancements
  • Data-centric improvements

Problem 1: SkyAssist

Task Description: You've been hired by SkyAssist, a startup building a voice assistant for airline travelers. Passengers say things like:

"Show me flights from Boston to Denver on Friday"
"What's the fare for a first class ticket to Atlanta"
"I need ground transportation in San Francisco"

Your system needs to do two things:

  1. Task A — Understand the request. Figure out what the passenger wants overall. Each utterance falls into exactly one category (e.g. flight, airfare, ground_service, airline, ...).

  2. Task B — Extract the details. Pull out the key pieces of information from the utterance. Each word is given a label marking which detail (if any) it belongs to.

Your goal is to train a single multi-task model that handles both Task A and Task B at once.

Dataset

You are provided with the following files:

  1. train.csv — labeled training data (4,978 utterances).
  2. test.csv — held-out test data for your final submission.

CSV format: each row is one passenger utterance with four columns:

column | description
id | unique row identifier
intent | the single intent label for the whole utterance (e.g. flight, airfare, ground_service)
text | the utterance, lower-cased and split into space-separated words
slots | space-separated BIO tags, one for every word in text, aligned one-to-one

BIO format (for slot filling):

  • B-xxx — the Beginning of a slot of type xxx
  • I-xxx — a word Inside (continuing) the same slot (e.g. san francisco → B-toloc.city_name I-toloc.city_name)
  • O — Outside any slot (a word with no detail to extract)

The training set contains 22 intent labels and 123 slot labels. Note that the number of words in text always equals the number of tags in slots.
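
To make the BIO scheme concrete, here is a minimal sketch that groups tagged words back into (slot type, phrase) pairs. The helper name group_slots and the toy utterance are illustrative assumptions, not part of the provided data or starter code.

```python
def group_slots(text: str, slots: str) -> list[tuple[str, str]]:
    """Group BIO-tagged words into (slot_type, phrase) pairs."""
    words, tags = text.split(), slots.split()
    assert len(words) == len(tags), "text and slots must align one-to-one"

    spans: list[tuple[str, list[str]]] = []
    current_type = None  # slot type of the span being built, if any
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span
            current_type = tag[2:]
            spans.append((current_type, [word]))
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span
            spans[-1][1].append(word)
        else:
            # "O", or a stray I- tag with no matching B- before it
            current_type = None
    return [(slot_type, " ".join(ws)) for slot_type, ws in spans]


example = group_slots(
    "show me flights from boston to san francisco",
    "O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name",
)
print(example)
# [('fromloc.city_name', 'boston'), ('toloc.city_name', 'san francisco')]
```

Note how san francisco comes back as one detail even though it spans two words, which is exactly what the B-/I- distinction buys us.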

Restrictions

  1. Do not use any external data. You may augment the training set, but only using data provided in this problem.
  2. Do not train or fine-tune on the test set. test.csv may be used only to generate your final predictions — never for training.
  3. Submission format: produce a CSV with columns id, intent, slots, where slots is a space-separated BIO tag per word (one tag for each word in text).

Step 1: (R)ead the problem

The first step in R.I.C.E. is to read and understand what the task entails. Doing so shows that the problem is really two machine learning tasks that share a single input, i.e., the passenger's utterance:

  • Task A assigns one label to the whole sentence. Sorting a piece of text into one of several categories is text classification; for a voice assistant, this specific version is called intent classification.
  • Task B assigns one label to every word. Labeling each token in a sequence is sequence labeling (or token classification); here it is called slot filling, and the BIO scheme lets us group multi-word details such as san francisco into a single slot.

The nice thing about NLP is that the shape of a problem is often the same even when it goes by different names depending on the field:

  • Text -> a label: text categorization
  • Span of text -> a label: span categorization
  • Text -> text (generated one word at a time): language modelling / text generation

For our case, we need a combination of text and span categorization. Officially, this problem is called intent classification (still text categorization) and slot filling (still span categorization).

Once you can spot the shape, you'll see it everywhere. NER is span categorization, pretraining a model is language modelling, and supervised fine-tuning is text generation (a text -> text problem that maps an instruction to its desired response).

Rule of thumb: the first two shapes (categorization) are usually solved with encoders like BERT, models built to understand text. The last shape (language modelling and generation) is solved with decoders like GPT, models built to produce text. Since SkyAssist is text + span categorization, we'll reach for an encoder.

Step 2: (I)mplement the Baseline

My suggestion is to first write a function for ingesting the data and a function for preparing a submission.

Here's a trick: choose one data ingestion strategy and stick with it. For NLP, HuggingFace's transformers library is a good default, so we express any input data in the format it expects. The goal is to create a PyTorch DataLoader.

When creating the DataLoader function, you need to first figure out:

  • How should a single instance be represented (torch.utils.data.Dataset).
  • How should each batch (i.e., a group of instances) be loaded (transformers.DataCollatorWithPadding).
In [1]:
from typing import Any

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import (
    AutoModel,
    AutoTokenizer,
    DataCollatorWithPadding,
    PreTrainedTokenizerBase,
    BatchEncoding,
)
import pandas as pd
from sklearn.model_selection import train_test_split

So the setup will be like this:

label2id = ...  # (A) Map each text label to an integer
id2label = ...  # (B) Just the reverse

data_loader = DataLoader(
    dataset,  # (C) Dataset class (SkyAssistDataset)
    collate_fn=data_collator,  # (D) DataCollatorWithPadding (SkyAssistDataCollator)
    batch_size=256,  # how many instances the collate_fn receives per batch
    shuffle=True,  # reshuffle the instances every epoch (training only)
)
In [2]:
full_train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df, dev_df = train_test_split(
    full_train_df,
    test_size=0.1,  # 10% held out for the dev set
    random_state=42,  # fixed seed so the split is reproducible
    shuffle=True,
)

train_df = train_df.reset_index(drop=True)
dev_df = dev_df.reset_index(drop=True)
In [ ]:
# (A) Implement text label mapping to integers
# Intents: one label per row
# Slots: the column is space-separated, so split every row and collect the unique tags
intents = sorted(train_df["intent"].unique())
label2id_intents = {label: i for i, label in enumerate(intents)}
slots = sorted({tag for row in train_df["slots"] for tag in row.split()})
label2id_slots = {label: i for i, label in enumerate(slots)}

# (B) Just the inverse
id2label_intents = {i: label for label, i in label2id_intents.items()}
id2label_slots = {i: label for label, i in label2id_slots.items()}
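
One thing worth checking here: the maps are built from train_df after the split, so a rare label that only landed in the dev set would be missing from them. The sketch below shows one way to detect that on toy data. The helper names build_label_map and unseen_labels, and the toy rows, are assumptions made for illustration.

```python
def build_label_map(tag_rows: list[str]) -> dict[str, int]:
    """Same recipe as above: split every row, collect unique tags, sort, enumerate."""
    labels = sorted({tag for row in tag_rows for tag in row.split()})
    return {label: i for i, label in enumerate(labels)}


def unseen_labels(label2id: dict[str, int], tag_rows: list[str]) -> set[str]:
    """Tags that occur in tag_rows but have no id in label2id."""
    return {tag for row in tag_rows for tag in row.split() if tag not in label2id}


# Toy stand-ins for train_df["slots"] and dev_df["slots"]
train_rows = ["O B-toloc.city_name", "O O B-fromloc.city_name I-fromloc.city_name"]
dev_rows = ["O B-airline_name", "O B-toloc.city_name"]

label2id = build_label_map(train_rows)
id2label = {i: label for label, i in label2id.items()}

# The inverse map must round-trip every label exactly
assert all(id2label[label2id[label]] == label for label in label2id)

missing = unseen_labels(label2id, dev_rows)
print(missing)  # {'B-airline_name'}
```

If this set is non-empty on the real split, one option is to build the maps from full_train_df instead, which is still within the provided training data.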
In [4]:
# (C) How to represent a single instance?
class SkyAssistDataset(Dataset):
    def __init__(
        self,
        df: pd.DataFrame,
        label2id_intents: dict[str, int],
        label2id_slots: dict[str, int],
    ) -> None:
        self.df = df.reset_index(drop=True)
        self.label2id_intents = label2id_intents
        self.label2id_slots = label2id_slots

    def __len__(self) -> int:
        return len(self.df)

    def __getitem__(self, idx: int) -> dict[str, Any]:
        row = self.df.iloc[idx]
        words = row["text"].split()  # ["i", "want", "boston", ...]

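        # One id per word (note: .get(..., 0) silently maps any unknown tag to id 0)
        slot_labels = [
            self.label2id_slots.get(slot, 0) for slot in row["slots"].split()
        ]

        # One id for the whole text (unknown intents also fall back to id 0)
        intent_label = self.label2id_intents.get(row["intent"], 0)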

        return {
            "input_text": words,
            "intent_label": intent_label,
            "slot_labels": slot_labels,
        }
In [5]:
# (D) How to represent a batch (a group of instances)?
class SkyAssistDataCollator(DataCollatorWithPadding):
    def __init__(
        self, tokenizer: PreTrainedTokenizerBase, slot_pad_id: int = -100
    ) -> None:
        super().__init__(tokenizer)
        self.slot_pad_id = slot_pad_id  # positions the loss should ignore

    def __call__(self, features: list[dict[str, Any]]) -> BatchEncoding:
        # Tokenize the already-split words and pad the whole batch at once
        batch = self.tokenizer(
            [f["input_text"] for f in features],
            is_split_into_words=True,
            padding=True,
            truncation=True,
            return_tensors="pt",
        )

        # Line up the per-word slot labels with the (sub)word tokens
        aligned_slots = []
        for i, f in enumerate(features):
            word_ids = batch.word_ids(i)  # which original word each token came from
            labels = []
            previous_word = None
            for word_id in word_ids:
                if word_id is None:
                    # special tokens: [CLS], [SEP], [PAD]
                    labels.append(self.slot_pad_id)
                elif word_id != previous_word:
                    # first subword of a word gets the real label
                    labels.append(f["slot_labels"][word_id])
                else:
                    # extra subwords of the same word are ignored
                    labels.append(self.slot_pad_id)
                previous_word = word_id
            aligned_slots.append(labels)

        # Attach the labels as tensors
        batch["slot_labels"] = torch.tensor(aligned_slots, dtype=torch.long)
        batch["intent_label"] = torch.tensor(
            [f["intent_label"] for f in features], dtype=torch.long
        )
        return batch
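
The alignment loop is the part of the collator that most often goes wrong, so it helps to trace it on a tiny example without any tokenizer. The sketch below re-implements the same rule in plain Python. The word_ids list is hand-written under the assumption that the third word was split into two subword pieces, and the label ids are made up.

```python
def align_labels(word_ids, word_labels, ignore_id=-100):
    """Give the first subword of each word its label; everything else gets ignore_id."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # special/padding token, or an extra subword of the same word
            aligned.append(ignore_id)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical tokenization of 3 words: [CLS] w0 w1 w2a w2b [SEP] [PAD]
word_ids = [None, 0, 1, 2, 2, None, None]
word_labels = [0, 0, 5]  # one (made-up) label id per word

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 0, 5, -100, -100, -100]

# At inference time the same positions select one prediction per word
token_preds = [9, 0, 0, 5, 7, 9, 9]
word_preds = [p for p, a in zip(token_preds, aligned) if a != -100]
print(word_preds)  # [0, 0, 5]
```

The number of kept positions always equals the number of words, as long as truncation did not cut the sentence short.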

Then, let's write a function that prepares the submission file.

In [6]:
def make_submission(
    test_df: pd.DataFrame,
    intent_preds: list[int],  # one intent id per row, in test_df order
    slot_preds: list[list[int]],  # one list of slot ids per row (one id per word)
    id2label_intents: dict[int, str],
    id2label_slots: dict[int, str],
    path: str = "submission.csv",
) -> pd.DataFrame:
    """Format model predictions into the submission CSV. No inference here."""
    intents = [id2label_intents[i] for i in intent_preds]
    slots = [" ".join(id2label_slots[s] for s in row) for row in slot_preds]

    submission = pd.DataFrame(
        {
            "id": test_df["id"].tolist(),
            "intent": intents,
            "slots": slots,
        }
    )
    submission.to_csv(path, index=False)
    return submission
In [7]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
In [8]:
# Wrap each split into a DataLoader using the Dataset (C) and collator (D).
collator = SkyAssistDataCollator(tokenizer)

train_ds = SkyAssistDataset(train_df, label2id_intents, label2id_slots)
dev_ds = SkyAssistDataset(dev_df, label2id_intents, label2id_slots)
test_ds = SkyAssistDataset(test_df, label2id_intents, label2id_slots)

train_loader = DataLoader(train_ds, collate_fn=collator, batch_size=256, shuffle=True)
dev_loader = DataLoader(dev_ds, collate_fn=collator, batch_size=32, shuffle=False)
test_loader = DataLoader(test_ds, collate_fn=collator, batch_size=32, shuffle=False)

The simplest baseline one can implement is adding a Head on top of an Encoder (usually a BERT model). Normally, each Head is a neural network (hence nn.Linear) that performs a specific task (TextCat, SpanCat, etc.).

So we create a shared encoder that reads the sentence, plus two heads: a sentence-level head that predicts the intent, and a word-level head that predicts a slot tag for each word.

In [9]:
# A single multi-task model: one shared encoder + two heads.
class SkyAssistModel(nn.Module):
    def __init__(self, model_name: str, num_intents: int, num_slots: int) -> None:
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # the shared BERT body
        hidden_size = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(
            hidden_size, num_intents
        )  # whole sentence -> intent
        self.slot_head = nn.Linear(hidden_size, num_slots)  # each token -> slot

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        slot_labels: torch.Tensor | None = None,
        intent_label: torch.Tensor | None = None,
        **kwargs: Any,
    ) -> dict[str, Any]:
        # B = Batch Size, T = Tokens, H = Hidden Size

        # Let BERT read the batch (the mask tells it to skip padding)
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)

        # For each token, a vector capturing its meaning in context (a "context vector")
        sequence_output = outputs.last_hidden_state  # (B, T, H)

        # The [CLS] token (position 0) stands for the whole sentence
        cls_output = sequence_output[:, 0]  # (B, H)

        # Sentence summary -> one score per intent
        intent_logits = self.intent_head(cls_output)  # (B, num_intents)

        # Each token -> one score per slot label
        slot_logits = self.slot_head(sequence_output)  # (B, T, num_slots)

        # Only score ourselves when labels are given (i.e. during training)
        loss = None
        if intent_label is not None and slot_labels is not None:
            # CrossEntropy skips positions labeled -100 by default
            loss_fn = nn.CrossEntropyLoss()

            # How wrong the intent guess is
            intent_loss = loss_fn(intent_logits, intent_label)

            # How wrong the slot guesses are (flattened to (B*T, num_slots))
            slot_loss = loss_fn(
                slot_logits.reshape(-1, slot_logits.size(-1)),
                slot_labels.reshape(-1),
            )

            # Learn both tasks at once by adding the two losses
            loss = intent_loss + slot_loss

        # loss is for training, the logits are for predicting
        return {
            "loss": loss,
            "intent_logits": intent_logits,
            "slot_logits": slot_logits,
        }


model = SkyAssistModel(
    model_name,
    num_intents=len(label2id_intents),
    num_slots=len(label2id_slots),
)
Loading weights: 100%|██████████| 199/199 [00:00<00:00, 14618.39it/s]
[transformers] BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED:	can be ignored when loading from different task/architecture; not ok if you expect identical arch.
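
The loss inside forward relies on two defaults of nn.CrossEntropyLoss: positions labeled -100 are skipped, and the remaining per-position losses are averaged. As a rough pure-Python sketch of that behavior (the helper name and the toy logits are illustrative, not part of the notebook):

```python
import math


def cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label != ignore_index."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens, padding, extra subwords
        top = max(row)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in row))
        losses.append(log_z - row[label])
    return sum(losses) / len(losses)


# Slot task: 3 token positions, 2 slot labels; the first position is ignored
slot_logits = [[5.0, -5.0], [0.0, 0.0], [0.0, 0.0]]
slot_labels = [-100, 1, 0]

# Intent task: 1 sentence, 3 intents
intent_logits = [[0.0, 0.0, 0.0]]
intent_label = [2]

slot_loss = cross_entropy(slot_logits, slot_labels)
intent_loss = cross_entropy(intent_logits, intent_label)
joint_loss = intent_loss + slot_loss  # the multi-task objective is a plain sum
print(round(slot_loss, 4), round(intent_loss, 4), round(joint_loss, 4))
```

Because the two losses are simply added, each task contributes with equal weight. Weighting them differently is another knob, but the equal sum is a reasonable starting point.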
In [10]:
# Train the model: adjust the encoder + heads to fit the training data ("fine-tuning").
device = (
    "cuda"
    if torch.cuda.is_available()
    else ("mps" if torch.backends.mps.is_available() else "cpu")
)
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        # Move the batch tensors onto the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass: the model returns the combined intent + slot loss
        outputs = model(**batch)
        loss = outputs["loss"]

        # Backward pass: compute gradients, take one step, then reset gradients
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += loss.item()

    print(
        f"epoch {epoch + 1}/{num_epochs}  train loss: {total_loss / len(train_loader):.4f}"
    )
epoch 1/3  train loss: 4.9622
epoch 2/3  train loss: 2.4199
epoch 3/3  train loss: 1.7027
In [11]:
# Inference: run the trained model on the test set, then format the submission.
model.eval()
intent_preds: list[int] = []
slot_preds: list[list[int]] = []

with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)

        # Best intent per sentence, best slot per token
        batch_intents = outputs["intent_logits"].argmax(dim=-1)  # (B,)
        batch_slots = outputs["slot_logits"].argmax(dim=-1)  # (B, T)
        true = batch["slot_labels"]  # (B, T); used only to find first-subword positions (-100 = ignore)

        for i in range(batch_intents.size(0)):
            intent_preds.append(batch_intents[i].item())

            # Keep one prediction per word (the first-subword positions)
            mask = true[i] != -100
            slot_preds.append(batch_slots[i][mask].tolist())

submission = make_submission(
    test_df, intent_preds, slot_preds, id2label_intents, id2label_slots
)
submission.head()
Out[11]:
id intent slots
0 0 flight O O O O O O O O B-fromloc.city_name O B-toloc....
1 1 airfare O O O O O O O O B-fromloc.city_name O B-toloc....
2 2 flight O O O O O O O O O B-fromloc.city_name O B-tolo...
3 3 flight O O O O O O O O O B-fromloc.city_name O B-tolo...
4 4 flight O O O O O O B-fromloc.city_name O B-toloc.city...

Step 3: (C)heck for Errors

A useful habit here is to look for a symptom, then run a quick probe to confirm the cause. The common ones for this pipeline:

Sign something is wrong | Likely cause | How to probe and check
Loss is flat, NaN, or bouncing around | loss not wired up, learning rate too high, or zero_grad missing | Overfit 1 to 2 batches for many steps. The loss should fall close to 0. If it cannot, the bug is structural, not a tuning issue.
Dev accuracy looks high but the predictions feel useless | class imbalance, so the model just predicts the majority (O for slots, flight for intent) | Report F1 (and look at it per class), compare against an always-predict-majority baseline, and print the distribution of predictions.
A row's predicted slot count does not equal its number of words | alignment or truncation bug | For every row, assert len(pred_slots) == len(text.split()). Inspect one long sentence by hand.
The submission seems scrambled against the ids | test_loader was shuffled, or the row order changed somewhere | Confirm test_loader uses shuffle=False, then spot-check a few ids against their text.
Illegal tag sequences (an I-x with no B-x before it) | each token is labeled independently, so nothing enforces valid order | Scan predictions for any I- tag not preceded by a matching B-.
Labels look shifted or just wrong, even though the code runs fine | id2label is not the exact inverse of label2id, or a special-token offset | Decode one example and print the word, the gold label, and the predicted label side by side.
CrossEntropyLoss throws an index error | a label id is greater than or equal to the number of classes (head size mismatch) | Check that num_slots == len(label2id_slots) and that every label id is either in range or -100.
Dev score is far below the train score | overfitting to the training set | Track train loss vs dev loss (or F1) across epochs and watch the gap.
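
Several of these probes need only a few lines of plain Python. The sketch below covers three of them: the word/tag count check, the scan for illegal I- tags, and the majority-class baseline. The function names and toy predictions are assumptions for illustration.

```python
from collections import Counter


def length_mismatches(texts: list[str], pred_slots: list[str]) -> list[int]:
    """Row indices where the number of predicted tags != the number of words."""
    return [
        i
        for i, (text, slots) in enumerate(zip(texts, pred_slots))
        if len(text.split()) != len(slots.split())
    ]


def illegal_i_tags(slots: str) -> list[int]:
    """Positions of I-x tags that do not follow a B-x or I-x of the same type.

    Only the first tag of each illegal run is reported.
    """
    tags = slots.split()
    bad = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            previous = tags[i - 1] if i > 0 else "O"
            if previous not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
                bad.append(i)
    return bad


def majority_baseline_accuracy(gold_labels: list[str]) -> float:
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)


texts = ["flights to boston", "fare to san francisco"]
preds = [
    "O O B-toloc.city_name",
    "O O I-toloc.city_name I-toloc.city_name O",  # 5 tags for 4 words, and a stray I-
]

print(length_mismatches(texts, preds))  # [1]
print(illegal_i_tags(preds[1]))  # [2]
print(majority_baseline_accuracy(["flight", "flight", "airfare", "flight"]))  # 0.75
```

If the model's dev accuracy is barely above the majority baseline, the high number is mostly class imbalance rather than learning.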

Step 4: (E)nhance the Solution

There are many ways to enhance the solution. I'll suggest some trivial ones (hot-swapping the encoder, tuning hyperparameters), a small architectural change (adding a Conditional Random Field layer), and a more involved one (data-centric improvements).

Trivial enhancements

Stronger Encoder

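The most trivial enhancement is to rerun the whole pipeline with a stronger encoder, for example roberta-base or bert-large-uncased. (distilbert-base-uncased goes the other way: it is smaller and faster than BERT, which helps when compute is tight, but it is not a stronger model.) Note that roberta-base's tokenizer must be loaded with add_prefix_space=True to accept the pre-split words our collator passes in. Also, always check the restrictions. Sometimes you are limited to a given base model, so you won't have this degree of freedom.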

Hyperparameter Tuning

Tuning is cheap and often gives a large improvement for little effort. Here are some hyperparameters to watch out for:

  • Learning rate schedule: encoders are quite sensitive to this. I suggest a short warmup, then decaying the learning rate afterwards.

    from transformers import get_linear_schedule_with_warmup
    
    num_epochs = 10
    num_steps = num_epochs * len(train_loader)
    
    optimizer = AdamW(model.parameters(), lr=3e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_steps),  # warm up over the first 10% of steps
        num_training_steps=num_steps,
    )
    
    # ...then call scheduler.step() right after optimizer.step() in the loop
    
  • Train longer with early stopping: instead of fixing the number of epochs to a set number, keep training and stop at the best epoch (once the dev score stops improving).

    best_dev_score = 0.0
    patience, bad_epochs = 2, 0
    
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)  # your training loop
        dev_score = evaluate(model, dev_loader)          # your dev F1
    
        if dev_score > best_dev_score:
            best_dev_score = dev_score
            torch.save(model.state_dict(), "best.pt")    # keep the best checkpoint
            bad_epochs = 0
        else:
            bad_epochs += 1
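            if bad_epochs >= patience:                   # no improvement for `patience` epochs
                break
    
    # Restore the best checkpoint before predicting (assumes at least one epoch saved best.pt)
    model.load_state_dict(torch.load("best.pt"))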
    
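  • Smaller training batch: another knob. With batch_size=256 and roughly 4,480 training rows, the baseline takes only about 18 optimizer steps per epoch. Reduce the batch size to get more updates per epoch, and adjust the learning rate to match.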

    train_loader = DataLoader(train_ds, collate_fn=collator, batch_size=32, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=2e-5)
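
Early stopping needs a dev score to watch. The problem statement above does not name the official metric, so the following is only one plausible choice: span-level F1 for the slots, where a predicted span counts as correct only if its type and both boundaries match the gold span. The names bio_spans and span_f1 are hypothetical helpers.

```python
def bio_spans(tags: list[str]) -> set[tuple[str, int, int]]:
    """Return every slot span as (slot_type, start, end), with end exclusive."""
    spans, i = set(), 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            slot_type, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == f"I-{slot_type}":
                j += 1
            spans.add((slot_type, i, j))
            i = j
        else:
            i += 1
    return spans


def span_f1(gold_rows: list[str], pred_rows: list[str]) -> float:
    """Micro-averaged F1 over exact-match slot spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_rows, pred_rows):
        gold_spans, pred_spans = bio_spans(gold.split()), bio_spans(pred.split())
        tp += len(gold_spans & pred_spans)  # exact matches
        fp += len(pred_spans - gold_spans)  # predicted but not in gold
        fn += len(gold_spans - pred_spans)  # in gold but missed
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["O B-toloc.city_name I-toloc.city_name", "O B-fromloc.city_name"]
pred = ["O B-toloc.city_name O", "O B-fromloc.city_name"]
score = span_f1(gold, pred)
print(score)  # 0.5
```

In the toy example the first prediction cuts the city span short, so it counts as both a false positive and a false negative. A per-token accuracy would have scored that row 2 out of 3, which is why span-level scoring is the stricter probe.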
    

Architectural Improvements

Another easy change is to add a Conditional Random Field (CRF) layer on top of the slot head.

Right now each token is labeled on its own, so nothing stops the model from producing an illegal sequence, like an I-toloc.city_name with no B-toloc.city_name before it. A CRF addresses this at the source: it scores the whole label sequence at once and learns transition scores for which labels tend to follow which, so illegal transitions become very unlikely.

It changes two things in practice:

  • Training: the per-token cross-entropy on the slots is replaced by a single sequence-level loss (the negative log-likelihood of the entire label path).
  • Inference: instead of taking the best label per token (argmax), it runs Viterbi decoding (a dynamic-programming method that finds the best path through the sequence), which returns the highest-scoring label sequence under the learned transitions.
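
To see why Viterbi can beat per-token argmax, here is a minimal pure-Python sketch on three labels. It is a simplification: the transition scores are set by hand rather than learned, there are no separate start/end scores, and the label names and emission numbers are made up.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path.

    emissions[t][k]: score of label k at step t.
    transitions[j][k]: score of moving from label j to label k.
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(num_labels):
            # Best previous label to arrive at label k
            best_prev = max(
                range(num_labels), key=lambda j: score[j] + transitions[j][k]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + transitions[best_prev][k] + emissions[t][k]
            )
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final label
    path = [max(range(num_labels), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-city", "I-city"]
# Hand-set transitions: O -> I-city is heavily penalized, everything else is neutral
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Emissions for the words "to san francisco"
emissions = [[2.0, 0.0, 0.0], [1.0, 0.9, 0.0], [0.0, 0.0, 2.0]]

greedy = [max(range(3), key=lambda k: row[k]) for row in emissions]
best = viterbi(emissions, transitions)
print([labels[k] for k in greedy])  # ['O', 'O', 'I-city']
print([labels[k] for k in best])  # ['O', 'B-city', 'I-city']
```

Argmax picks O for san by a small margin and ends up with a stray I-city. Viterbi gives up that small margin to avoid the heavily penalized O to I-city transition, which yields a legal sequence.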

In code, the slot scores become the CRF's emissions (the raw per-token scores it reads); the full SkyAssistModelCRF class is listed at the end of this section.

And at inference you decode the best valid path instead of argmax:

mask = attention_mask.bool()
mask[:, 0] = True
best_paths = model.crf.decode(slot_logits, mask=mask)  # list[list[int]]: one tag id per unmasked token

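One practical gotcha: the CRF's mask and your -100 alignment have to agree. Here we mask out only padding with attention_mask and replace every -100 with a dummy label, which is the simplest version that runs. The catch is that the non-padding -100 positions ([CLS], [SEP], and extra subwords) stay active, so the CRF is trained to predict the dummy label there and decode returns a tag for them too. Lining the mask up exactly with the first-subword positions takes a little more care.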

In [ ]:
# pip install pytorch-crf
from torchcrf import CRF


class SkyAssistModelCRF(nn.Module):
    def __init__(self, model_name, num_intents, num_slots):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slots)
        self.crf = CRF(
            num_slots, batch_first=True
        )  # learns which labels are allowed to follow which

    def forward(
        self, input_ids, attention_mask, slot_labels=None, intent_label=None, **kwargs
    ):
        seq = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        intent_logits = self.intent_head(seq[:, 0])
        slot_logits = self.slot_head(seq)  # the CRF's "emissions"

        # The CRF cannot read -100, so build a mask and give those spots a dummy label
        mask = attention_mask.bool()  # True for real tokens, False for padding
        mask[:, 0] = True  # the CRF requires the first step to be active

        loss = None
        if slot_labels is not None and intent_label is not None:
            safe = slot_labels.clone()
            # Any valid id works. Padding is masked out, but [CLS], [SEP], and extra
            # subwords are still active, so the CRF is trained on this dummy label there.
            safe[slot_labels == -100] = 0
            # One loss over the whole label sequence. reduction="mean" averages over the
            # batch (the default "sum" would dwarf the mean-reduced intent loss).
            slot_loss = -self.crf(slot_logits, safe, mask=mask, reduction="mean")
            intent_loss = nn.functional.cross_entropy(intent_logits, intent_label)
            loss = intent_loss + slot_loss

        return {
            "loss": loss,
            "intent_logits": intent_logits,
            "slot_logits": slot_logits,
        }

Data-centric improvements

Often the biggest and most reliable gains come not from changing the model but from improving the data. The Restrictions forbid outside data, but they do allow augmenting the training set using what you already have.

A strong trick for slot filling is entity swapping. Because every slot span is already labeled, you can replace one value with another value of the same type that appears elsewhere in training. For example:

show me flights to  boston
O    O  O       O   B-toloc.city_name

Swap boston for another city the data already contains (denver, san francisco, ...) and you get a brand new, correctly labeled sentence, without ever leaving the provided data.

First, collect every surface form of each slot type that appears in the training data:

In [13]:
import random
from collections import defaultdict


def extract_spans(words, tags):
    """Find each slot span as (slot_type, start, end), with end exclusive."""
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            slot_type = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{slot_type}":
                j += 1
            spans.append((slot_type, i, j))
            i = j
        else:
            i += 1
    return spans


def build_value_pool(df):
    """Collect every surface form per slot type, from the training data only."""
    pool = defaultdict(list)
    for _, row in df.iterrows():
        words, tags = row["text"].split(), row["slots"].split()
        for slot_type, start, end in extract_spans(words, tags):
            pool[slot_type].append(words[start:end])
    return pool


value_pool = build_value_pool(train_df)
print(
    "cities seen as toloc.city_name:",
    [" ".join(v) for v in value_pool["toloc.city_name"][:5]],
)
cities seen as toloc.city_name: ['san francisco', 'baltimore', 'philadelphia', 'atlanta', 'salt lake city']

Then, for each sentence, swap its spans for other values of the same type and append the new rows to the training set:

In [14]:
def augment_row(text, slots, value_pool, p=0.5, rng=random.Random(42)):
    """Swap each slot span with another value of the same type (entity swapping)."""
    words, tags = text.split(), slots.split()
    spans = {
        start: (slot_type, end) for slot_type, start, end in extract_spans(words, tags)
    }

    new_words, new_tags, i = [], [], 0
    while i < len(words):
        if i in spans:
            slot_type, end = spans[i]
            pool = value_pool.get(slot_type, [])
            # With probability p, swap in another value of the same type
            new_value = rng.choice(pool) if pool and rng.random() < p else words[i:end]
            new_words += new_value
            new_tags += [f"B-{slot_type}"] + [f"I-{slot_type}"] * (len(new_value) - 1)
            i = end
        else:
            new_words.append(words[i])
            new_tags.append(tags[i])
            i += 1
    return " ".join(new_words), " ".join(new_tags)


# Make one augmented copy of each training row, then add them to the training set
augmented = []
for _, row in train_df.iterrows():
    new_text, new_slots = augment_row(row["text"], row["slots"], value_pool)
    augmented.append(
        {"id": row["id"], "intent": row["intent"], "text": new_text, "slots": new_slots}
    )

train_df_aug = pd.concat([train_df, pd.DataFrame(augmented)], ignore_index=True)

print("original: ", train_df.iloc[0]["text"])
print("augmented:", augmented[0]["text"])
original:  please show me any united flights including connections between boston and san francisco at 5 in the evening
augmented: please show me any united flights including connect between baltimore and san francisco at 5 in the evening
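
One detail to keep an eye on: with p=0.5, a row that has no slots, or whose spans all happen to stay unchanged, produces an augmented copy identical to the original. Exact duplicates add no new information and only up-weight those rows. Whether that matters is a judgment call, but here is a small sketch for filtering them out. The helper name drop_unchanged and the toy rows are assumptions.

```python
def drop_unchanged(original_rows: list[dict], augmented_rows: list[dict]) -> list[dict]:
    """Keep only augmented rows whose (text, slots) pair is new."""
    seen = {(row["text"], row["slots"]) for row in original_rows}
    kept = []
    for row in augmented_rows:
        key = (row["text"], row["slots"])
        if key not in seen:
            seen.add(key)  # also drops repeats among the augmented rows
            kept.append(row)
    return kept


original = [{"text": "flights to boston", "slots": "O O B-toloc.city_name"}]
augmented_toy = [
    {"text": "flights to boston", "slots": "O O B-toloc.city_name"},  # unchanged copy
    {"text": "flights to denver", "slots": "O O B-toloc.city_name"},  # new
    {"text": "flights to denver", "slots": "O O B-toloc.city_name"},  # repeat of the new row
]

kept = drop_unchanged(original, augmented_toy)
print([row["text"] for row in kept])  # ['flights to denver']
```

With the DataFrames above, the same function can be called as drop_unchanged(train_df.to_dict("records"), augmented) before building train_df_aug.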
