<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ljvmiranda921.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ljvmiranda921.github.io/" rel="alternate" type="text/html" /><updated>2026-06-01T03:24:40+08:00</updated><id>https://ljvmiranda921.github.io/feed.xml</id><title type="html">Lj V. Miranda</title><subtitle>A collection of notes, projects, and essays.
</subtitle><entry><title type="html">The Filipino qualities we need to build world-class language technologies</title><link href="https://ljvmiranda921.github.io/notebook/2026/05/23/on-what-it-takes/" rel="alternate" type="text/html" title="The Filipino qualities we need to build world-class language technologies" /><published>2026-05-23T00:00:00+08:00</published><updated>2026-05-23T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2026/05/23/on-what-it-takes</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2026/05/23/on-what-it-takes/"><![CDATA[<p><span class="firstcharacter">I</span> gave a talk at the <a href="https://www.aap.ph/">Analytics &amp; AI Association of the Philippines (AAP)</a> on the topic of Philippine language model evaluation (<a href="https://docs.google.com/presentation/d/1KOdSsFsk8io59bKRzwU-4u1iFNxDumlSNLQfUYgbTfU/edit?usp=sharing">Slides</a>), specifically on <a href="/projects/2025/08/21/filbench/">FilBench</a>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
The purpose of the meetup was really to figure out (1) whether we can build a “Philippine LLM” and (2) what doing so would entail.
I had some experience building open language models in my previous job, and my research on multilinguality is directly related, so I was able to share my thoughts and experiences.</p>

<p>To preface: the answer to the question of “can we actually build Filipino-centric LLMs?” is definitely <strong>YES</strong>.
The tools are open-source, the <a href="https://rlhfbook.com">recipes</a> are publicly available, and there’s definitely a lot of talent.
However, the challenges are two-fold:</p>

<ol>
  <li>
    <p>Building Filipino-centric LLMs <strong>does not entail applying Silicon Valley approaches to our local contexts</strong>. The capital difference is vast and the ecosystem is different. It’s better to look into how our neighbors (geographically and economically) are doing it, such as Malaysia or countries in Africa.</p>
  </li>
  <li>
    <p>Figuring out real needs is paramount to <strong>prevent the project from becoming a vanity training exercise.</strong> If the use-case can be solved by the next ChatGPT or Claude release, then maybe it’s not worth going deeper than the application layer.</p>

    <p>Recently, there’s been a lot of talk about retaining ownership over the whole LM development pipeline, hence the proliferation of sovereign LLMs. It’s a common motivation, but I’m not yet well-versed in this topic. The point is: are we building this to fulfill a business need? To include more Philippine languages? To own the LM development stack?</p>
  </li>
</ol>

<p>So can we build Filipino-centric LLMs and contribute world-class technologies? My answer is <strong>YES</strong>, because we Filipinos have the <strong>unique qualities and values</strong> to do so.</p>

<p>The next step then is to provide an environment where these qualities will flourish. 
These qualities are: <strong>Diskarte</strong>, <strong>Sipag at Tiyaga</strong>, and <strong>Bayanihan</strong>.
Let me explain them in the following sections.</p>

<blockquote>
  <p>We can build Filipino-centric LLMs and contribute world-class language technologies because we have the 
unique qualities to do so: Diskarte, Sipag at Tiyaga, and Bayanihan. The challenge then is to create an environment that allows these qualities to flourish.</p>
</blockquote>

<div style="text-align: center;">
  <p><img src="/assets/images/on-what-it-takes/summary.png" style="border: 1px solid black; padding: 10px; width: 600px" /></p>
</div>

<p>The slides below are taken from the closing remarks of my talk (about 75% of the talk is about FilBench), but the text doesn’t map to them one-to-one.
Just think of this blog post as an extended version of those closing slides.</p>

<h2 id="diskarte---achieving-a-goal-under-extreme-constraints">Diskarte - achieving a goal under extreme constraints</h2>

<p>After World War II, several US Army Willys MB jeeps were left behind in the Philippines as American troops withdrew.
There was a shortage of public transportation due to the destruction of infrastructure, so we began stripping down these vehicles and altering them, adding metal roofs
and various paintings and ornaments.
This later became the jeepney, or jeep, which is now a cultural icon and remains a common mode of public transportation today.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>
Diskarte is difficult to translate into English because it can mean many things: resourcefulness, ingenuity, creative thinking, etc. 
There is this nice <a href="https://www.pap.ph/assets/files/journals/defining-diskarte-exploring-cognitive-processes-personality-traits-and-social-constraints-in-crea.pdf">paper</a> that talks about this at length, and I encourage anyone to read it.
Personally, I see diskarte as the ability to achieve a goal under extreme constraints.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/on-what-it-takes/diskarte.png" style="border: 1px solid black; padding: 10px; width: 600px" /></p>
</div>

<blockquote>
  <p>The best way to innovate is similar to how we approached the jeepney: taking these innovations from the Global North and adapting them to our local contexts through ingenuity and resourcefulness.</p>
</blockquote>

<p>I argue that the <strong>best way to innovate is similar to how we approached the jeepney</strong>: taking these innovations from the Global North and adapting them to our local contexts through ingenuity and resourcefulness.
Simply importing Silicon Valley approaches is not enough.
I wrote about this at length in a recent <a href="https://arxiv.org/abs/2604.21637">survey paper</a>, but one approach I’m currently exploring is careful synthetic data generation to address the lack of training data.
Synthetic data is now a common approach in training frontier language models, but I believe we can strip it down and, similar to the jeepney, add some ornaments / adjustments to make it work for low-resource languages.
I encourage you to read the survey paper, as I mention potential paths for improvement such as creating task-specific small language models, deploying models at the edge, and improving a model’s capabilities through a robust set of harnesses.</p>

<div style="float:right; width:320px; margin:0 0 10px 20px;">
  <p><img src="/assets/images/on-what-it-takes/internet-penetration-asean.png" style="width: 100%;" /></p>
  <p style="font-size: 0.8em; color: #555; text-align: center; margin-top: 6px; line-height: 1.35;">
    <em>Individuals using the Internet (% of population), 1990–2024. The Philippines trails most ASEAN neighbors. Source: <a href="https://data.worldbank.org/indicator/IT.NET.USER.ZS">World Bank / ITU (IT.NET.USER.ZS)</a>.</em>
  </p>
</div>

<p>That said, the Philippine NLP landscape is constrained in terms of both data and compute.
Although there is a significant presence of Filipino speakers and the language is well-represented on the internet, few have transformed this resource into useful, high-quality datasets.
This situation is even more pronounced for other Philippine languages.
In addition, our compute situation (for both development and deployment) leaves a lot to be desired.
We <a href="https://epoch.ai/data/gpu-clusters?view=map&amp;tab=point&amp;mapPointBubbleSize=log+Hardware+quantity">lack the compute infrastructure</a> to train LLMs (Epoch AI Data on GPU Clusters, 2026), and our <a href="https://data.worldbank.org/indicator/IT.NET.USER.ZS?end=2024&amp;locations=PH-TH-SG-MY-VN-LA-ID-KH-MM-BN-TL&amp;start=1990&amp;view=chart">internet and technology penetration lags behind that of our ASEAN neighbors</a> (World Bank data, sourced from the ITU).</p>

<h2 id="sipag-at-tiyaga---supporting-sustained-effort-through-time">Sipag at Tiyaga - supporting sustained effort through time</h2>

<p>There is a famous saying: “Pag may tiyaga, may nilaga.”
Perhaps the nearest English translation I can offer is: “If the patience is true, you’ll get a stew.”<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
I’d argue that most components of a language modeling recipe (especially in multilingual post-training) are more a test of patience than flashes of insight: 
collecting enough data through large-scale annotation or synthesis, 
figuring out the right data mix across many ablations, 
and ensuring the evals faithfully reflect the capabilities we care about.
How, then, can we support these types of activities?</p>

<div style="text-align: center;">
  <p><img src="/assets/images/on-what-it-takes/tiyaga.png" style="border: 1px solid black; padding: 10px; width: 600px" /></p>
</div>

<blockquote>
  <p>Most components of a language modeling recipe (especially in multilingual post-training) are more a test of patience than flashes of insight.</p>
</blockquote>

<p>On the brighter side, I feel quite optimistic about some of the teams scattered across government and academia.
For example, E-CAIR has built task-specific systems, and I’ve seen some of their members publishing in *CL and ICML!
I also met some folks from AIM <a href="/notebook/2025/08/01/field-report-acl25/">last year in Vienna at ACL</a>, so there’s good investment happening there too.
And I’ve chatted with a few undergraduate thesis groups in UST and PUP (and I still get the occasional e-mail about <a href="/projects/2023/08/01/calamancy/">calamanCy</a>), so there’s clearly a loose network of groups working on the fringes.
What I’m trying to say is that there is talent here willing to do the hard parts—it’s just a matter of getting these efforts more coordinated and aligned.</p>

<p>I’d like to propose two dimensions on how to support these activities: infrastructure and incentives.
Infrastructure here can take two forms: <em>compute</em> grants that give people room to experiment, and <em>governance</em> structures with a central decision-maker to keep efforts coordinated.
The first one is quite self-explanatory.
But I think the second one is much more important.</p>

<p>For example, it is <strong>quite difficult to find the canonical national AI strategy for the Philippines</strong>.
The <a href="https://oecd.ai/en/dashboards/policy-initiatives/national-ai-strategy-roadmap-20-naisr-20">OECD AI Policy Navigator</a> and <a href="https://www.unesco.org/ethics-ai/en/philippines">UNESCO</a> both point to <a href="https://naisr.cair.ph/">NAISR v2</a>, which is from DTI, but the website says it was superseded by another Philippine AI Strategy Roadmap by DOST.
Now, the only DOST-related roadmap I can find is this <a href="https://pcieerd.dost.gov.ph/wp-content/uploads/2026/01/Artificial_Intelligence_Roadmap_Dec15.pdf">slide deck</a>, but that can’t be it, right?</p>

<h2 id="bayanihan---collective-action-to-achieve-a-goal">Bayanihan - collective action to achieve a goal</h2>

<p>I always find the imagery of bayanihan inspiring.
The practice itself is uncommon today because most houses are now made of concrete, but the spirit of bayanihan still lives:
I remember Brigada Eskwela back in high school, when my classmates, teachers, and even parents gathered to clean classrooms, and the disaster relief packages we assembled for victims of Typhoon Yolanda back in college.
I’d like to think that this is possible in the context of LLM development.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/on-what-it-takes/bayanihan.png" style="border: 1px solid black; padding: 10px; width: 600px" /></p>
</div>

<blockquote>
  <p>For such a network to work, it has to be focused on a single objective, much like the imagery of bayanihan where the only goal is to move a house from one place to another.</p>
</blockquote>

<p>I wrote about this <a href="/notebook/2024/12/17/filipino-llm/">back in 2024</a>, and I still believe in the capability of grassroots networks.
We attempted to build a <a href="https://filbench.github.io">small research network</a> earlier this year, and while it has been a slow start with a few hiccups along the way, I feel optimistic about what we can achieve next.</p>

<p>However, for such a network to work, it has to be <strong>focused on a single objective</strong>, much like the imagery of bayanihan where the only goal is to move a house from one place to another.
The challenge with loose networks is that everyone has their own priorities and interests, so the effort diffuses.
Again, I want to stress that focusing on a single, measurable objective is important.
In my opinion, the project should be well-scoped, short-term, and designed so that nodes can pitch in and make drive-by contributions whenever they want.
I still have a lot to learn about managing these types of networks, so if you’ve done this before and would like to chat, please reach out!</p>

<p>If grassroots networks are to be multi-sectoral, there has to be a way to align diverse and often conflicting incentives.
In academia, for example, there is pressure to secure grants and publications (which, in computer science, means publishing in top conferences and journals).
Industry, on the other hand, tends to prefer engineering-heavy, context-specific deployments.
As a PhD student, I find that Industry Tracks at conferences are a good avenue for these partnerships: the student gets to work on a concrete project and can even write a paper about it.
Finally, the involvement across sectors doesn’t have to be deep on all sides.
It could be a donor-recipient setup where, for example, the government provides the problem statement, industry provides the resources, and academia provides the scientific rigor to achieve the goal.</p>

<h2 id="final-thoughts-and-reflections">Final thoughts and reflections</h2>

<p>If we want to build Filipino-centric language models, we need to build them our way.
This entails adapting innovations from other parts of the world to our local contexts and constraints rather than simply importing them <em>(diskarte)</em>,
creating governance and incentive structures that support long-horizon, experimental work <em>(sipag at tiyaga)</em>,
and building grassroots networks around a shared, well-scoped goal <em>(bayanihan)</em>.</p>

<p>To make this concrete: a good goal would be to train Filipino-centric LMs that cover the country’s major languages (Tagalog, Cebuano, Ilokano, Hiligaynon, Bikol, Kapampangan).
The most reasonable approach is to post-train on a strong multilingual base model, then evaluate on a new version of FilBench where all these languages are well-represented.
The first major constraint is data, so we need to source it through scraping, native-speaker annotation, or synthetic data generation (<em>diskarte</em>).
The second is compute and funding, which calls for partnerships with industry or organizations willing to underwrite them (<em>bayanihan</em>).
Finally, much of this work demands sustained experimentation, data ablations, and people willing to see it through (<em>sipag at tiyaga</em>).</p>

<p>As I’ve said, the tools and recipes are publicly available, and I truly believe we have the qualities to make it work.
The ball simply needs to get rolling.</p>

<!-- For LI:

I gave a talk at the Analytics & AI Association of the Philippines (AAP) about FilBench, our Filipino-centric LLM evaluation benchmark. The purpose of the meetup was actually to discuss what it takes to build a Filipino-centric language model.

I closed my talk with two major points: (1) we cannot simply import Silicon Valley approaches into our local context, we need to adapt them; and (2) the tools and recipes are already open-source, but what matters more is that we Filipinos possess the qualities to build world-class technologies.

So what are these qualities? In my opinion, they are:
1. Diskarte (achieving goals under extreme constraints)
2. Sipag at Tiyaga (hard work and patience on long-horizon tasks)
3. Bayanihan (uniting around a shared goal)

I then walked through specific challenges in building LLMs where each of these qualities shines. 
Ultimately, I believe that to build Filipino-centric LMs, we need to create an environment where these qualities can flourish.

I wrote more about my thoughts in a blog post (linked in the comments). Would love to discuss this further, just reach out :)  -->
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I won’t be talking a lot about FilBench in this blog post. But if you’re curious, check out my <a href="/projects/2025/08/21/filbench/">blog post</a>, check the <a href="https://huggingface.co/spaces/filbench/filbench-leaderboard">leaderboard</a>, or read the <a href="https://aclanthology.org/2025.emnlp-main.127/">paper</a>! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I really like the imagery of the jeepney to symbolize diskarte because my dad used to own and drive one, and I learned how to read by reading handpainted jeepney signs. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>The idea, of course, is that making a stew takes time: tough cuts of meat need hours of slow simmering to become tender. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="tagalog" /><category term="filipino" /><category term="nlp" /><category term="llm" /><summary type="html"><![CDATA[I was invited to give a talk at Analytics & AI Association of the Philippines (AAP) on FilBench, and in general, building Filipino LLMs. This blog post covers some of my thoughts on this topic.]]></summary></entry><entry><title type="html">Reflections of a Pinoy applying to CS PhD programs abroad</title><link href="https://ljvmiranda921.github.io/life/2025/10/30/grad-school-apps/" rel="alternate" type="text/html" title="Reflections of a Pinoy applying to CS PhD programs abroad" /><published>2025-10-30T00:00:00+08:00</published><updated>2025-10-30T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/life/2025/10/30/grad-school-apps</id><content type="html" xml:base="https://ljvmiranda921.github.io/life/2025/10/30/grad-school-apps/"><![CDATA[<p><span class="firstcharacter">B</span>ack in 2021, I decided to embark on a PhD program and pursue an academic research career.
I knew I wanted to do research and had a strong feeling I’d enjoy academia.
However, it took me <em>four years</em> to finally apply.
<em>What happened, and what took me so long?</em>
In this blog post, I want to reflect on the past four years—spanning multiple countries, jobs, and experiences.
The path was long and unconventional.
That journey is finally over, and now I’m on the cusp of another one.</p>

<p>This is not a “How to apply to grad school” blog post, but perhaps you’ll find some nuggets of wisdom along the way.
If you want an actual guide, I recommend reading John Boaz Lee’s <a href="https://drive.google.com/file/d/1N5ETwBh9dyLpxGRKIA9LXXJ_Jy44i1TP/view">guide</a> (more Filipino context) or Tim Dettmers’ <a href="https://timdettmers.com/2018/11/26/phd-applications/">blog post</a> (realistic and quite sobering).</p>

<!-- include: https://nguyenthanhvuh.github.io/phd-cs-us/demystify.pdf -->

<p style="border:3px; border-style:solid; border-color:#a00000; padding: 1em;">
📣 Finally, some news: I started my <strong>PhD in Computation, Cognition, and Language</strong> at the <strong>University of Cambridge</strong> this Fall!
I also got accepted to universities in the US and Canada, so this admissions cycle was truly a blessing.
</p>

<p><em>All photos you’ll see below were taken by a Gameboy Camera. I realized when writing this blog post that my camera captured all events in the past four years. For more photos, check out my <a href="/gallery">gallery</a>.</em></p>

<h2 id="the-reality-of-cs-grad-school-applications">The reality of CS grad school applications</h2>

<p>Applying to PhD programs, especially in machine learning and NLP, has become increasingly competitive.
When I looked at the profiles of newly admitted PhD students, it was common to see first-author publications in top conferences from before they started their PhD.
One needs only to look at <a href="https://twitter.com/ShirleyYXWu/status/1876033230186615251?ref_src=twsrc%5Etfw">these tweets</a> from professors and admissions committee members during the Fall 2025 cycle to recognize how competitive CS PhD applications have become.</p>

<p>Despite my NLP industry experience, I didn’t have the academic research background expected from PhD applicants.
Early on, I realized that pivoting from industry to academia would entail a lot of hard work.
A part of me thought that maybe I should have just continued to a PhD right after my Masters when the path seemed clearer.
But my industry years helped confirm that I genuinely enjoyed research work!</p>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/bike.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/bgc.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/lantern.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (–2021): Snippets of my life in Bonifacio Global City (BGC) a few weeks after my time in TM. A bike on display inside Coffee Project, some view of BGC, and a Japanese lantern in The Fort.</em></p>

<p>I could already imagine the challenges in this transition: relocating to another country, accepting a significant pay cut, and adjusting my life stages (e.g., getting married, starting a family) to align with a new career trajectory.
Looking back at 2021, I remember feeling overwhelmed by the self-imposed pressure to make everything fall perfectly into place.</p>

<h2 id="the-long-road-to-phd-applications">The long road to PhD applications</h2>

<p>While working at <a href="https://explosion.ai">Explosion</a>, I was fortunate to be exposed to NLP research early on by working on our very first <a href="https://arxiv.org/abs/2212.09255">technical report</a> on hash embeddings.
This experience gave me the confidence to undertake my own independent work in Tagalog NLP, which led to <a href="https://aclanthology.org/2023.nlposs-1.1.pdf">calamanCy</a> and <a href="https://arxiv.org/abs/2311.07161">TLUnified-NER</a>.
In addition, doing software engineering work on the <a href="https://prodigy.ai/">Prodigy annotation tool</a> inspired my interest in researching data-centric approaches to NLP.</p>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/skyscraper.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/berlin.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/remote.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (2021–2022): view from my condo when I was working from home in McKinley Hill, the Brandenburg Gate during the ExplosionAI meet-up in Berlin, my mechanical keyboard.</em></p>

<p>My time at <a href="https://allenai.org">Ai2</a> as a pre-doc was incredibly formative.
Working alongside experienced researchers gave me a clearer picture of what academic research entailed and helped me identify the qualities I wanted to develop in myself as I grow into this career.
I was also lucky to be part of large collaborative projects like <a href="https://allenai.org/papers/tulu-3-report.pdf">Tülu 3</a> and <a href="https://allenai.org/blog/olmo2">OLMo 2</a> which gave me hands-on experience with frontier model post-training.
Working at Ai2 gave me a baseline of what high-quality research looks like.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
In addition, there was really something in the air in the US that made me more ambitious and driven, and I hope to bring these qualities wherever I go.</p>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/library.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/aurora.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/ai2.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (2024): the Fremont library where I spent a lot of time writing papers and doing PhD apps, view of the Aurora bridge on the way to the office, the Ai2 logo in the office.</em></p>

<p>I got lucky because of the people who took chances on me, which is why I developed a mantra over time: <strong>people over projects, projects over publications.</strong>
I prioritize finding people whom I’m excited and energized to work with over specific projects—I believe that every topic is equally interesting given the right people.
During the four-year journey, I’ve been fortunate to collaborate with researchers from different places who shared my interests, particularly in multilingual NLP.
Everyone I worked with made me feel welcome, and I hope I can be that person for another aspiring researcher in the future.</p>

<blockquote>
  <p>I’ve developed a mantra over time: people over projects, projects over publications.</p>
</blockquote>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/fremont.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/pottery.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/market.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (2024): snippets of my life in Seattle including the Fremont artwork, the pottery studio at Yu Tang Ceramics, and the Farmer’s Market in downtown.</em></p>

<p>My Masters professor, <a href="https://nclab.w.waseda.jp/jinglu/personal.html">Jinglu Hu</a>, also supported me all the way during my PhD applications.
He was happy to write me a recommendation letter, and I’m more than happy that he did; it felt like things had come full circle.
I have fond memories of Waseda back in 2016, and that time was formative in my career.</p>

<h2 id="on-the-admissions-cycle">On the admissions cycle</h2>

<p>The admissions cycle spanned three months, from December to February, with the earliest deadline on December 4.
Because I also had to prepare for my wedding in December, I tried to finish all applications and essays by November.
I vividly recall arriving at the library as soon as it opened and remaining there until it closed.
The holiday season also brought a reflective mood: with fewer people around and a quieter atmosphere, I was able to think about what type of research I wanted to pursue.</p>

<!-- add pictures of fremont library, pho place in fremont, cafe ladro -->
<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/fremont2.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/ladro3.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/pcc.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (2024-2025): snapshots during the admissions season: writing essays in the Fremont library or in Cafe Ladro, buying food from PCC.</em></p>

<p>I applied to 7 programs, and they’re actually quite expensive!
I missed some of the fee waiver deadlines because I was busy, so I had to shell out $1,350 to $1,500 for application fees.
In addition, some universities require official TOEFL scores, so you also have to pay TOEFL fees in order to send your test results to those universities (~$35).
Don’t make my mistake: fee waivers exist, and most universities are generous with them, especially for international students. Even so, keep in mind that grad school applications can be very expensive.</p>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/bike2.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/phinney.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/usa.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>The PhD applications forced me to think about whether I wanted to stay in the US or move abroad again. Snippets of my life, from left to right (2024-2025): a bike on Roosevelt Island, my apartment at Phinney, a US flag in Grand Central Station.</em></p>

<p>By January, I started receiving interview invites.
I had a mix of interview formats during this cycle: a research presentation, a short 15-20 minute informal chat, and a formal interview.
The last two weeks of January and the early weeks of February were the most intense, as they involved a lot of waiting and uncertainty.
I tried to keep myself busy, focusing on work and my current research projects. Those were the longest two months I had ever experienced.</p>

<h2 id="on-choosing-phd-programs">On choosing PhD programs</h2>

<p>One of the hardest challenges in entering a PhD is balancing your career aspirations and future life plans.
This is especially true for someone like me in their early 30s: a PhD program can take 4-6 years, pays less than a typical software engineering job, and will coincide with the time when I might want to start a family.
Making this decision in my thirties is different from making it in my twenties.
Getting into a PhD becomes a life decision more so than a career decision.</p>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/cbtl.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/retreat.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/ring.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right (early 2022 and 2023): coffee shop where my wife and I talked about our plans, my room during a Holy Week retreat in Sacred Heart, our engagement ring.</em></p>

<p>I’d like to think that my industry years have matured me well enough to de-risk entering a PhD program.
This factored into my choices: will I enjoy my research topic? will I be inspired by my environment? how old will I be after the PhD program and what life/career options will I have by then?</p>

<p>A lot has been written on how to choose PhD programs. My favorite was <a href="https://timdettmers.com/2022/03/13/how-to-choose-your-grad-school">Tim Dettmers’ post</a>.
Based on his blog, I think I went with the Stability perspective.
During ACL, I met several folks from Cambridge (although from different departments), and they seemed genuinely happy.
Interestingly, during the conference, I met an old friend of mine from spaCy and only then realized that he was also from Cambridge!
In addition, the UK is one of the few countries where my wife can stay with me and work.
PhD years are long and gritty, and having a rock to lean on is important.</p>

<blockquote>
  <p>Getting into a PhD becomes a life decision more so than a career decision.</p>
</blockquote>

<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/cornell3.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/sage_hall.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/nyc.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>Photos during my PhD visit days in NYC.</em></p>

<p>Although choosing a PhD program seems like a cerebral choice that entails a lot of calculus, a large part of my decision-making involved a lot of <a href="https://www.ignatianspirituality.com/making-good-decisions/an-approach-to-good-choices/an-ignatian-framework-for-making-a-decision/">discernment</a>.
Attuning myself to my <a href="https://godinallthings.com/2013/09/02/desire/"><em>desires</em></a> and praying over them helped a lot.
One thing I’ve learned during this time is that <strong>I can never make this decision with absolute certainty.</strong>
There’s definitely this fear of closing other doors and potential futures—what if you made the wrong choice?
Doubt will always be present, and I should welcome that.</p>

<h2 id="final-thoughts">Final thoughts</h2>

<p>It’s almost November, and my wife and I have been here at Cambridge for almost a month.
There’s a lot to say about Cambridge—the collegiate system,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> the traditions, the culture, and more.
Things still look unfamiliar, but I’m taking my time to let the place reveal itself to me, knowing that we’ll be here for a while.
It’s also encouraging to see that Cambridge has strong support for mature students like me.
I’ve met PhDs who started at the same age as I did; some have even brought their kids!
It’s very different from my previous campus tour experience, where most incoming PhDs were fresh from undergrad.
In some way, I don’t feel alone and different, and that’s important.</p>

<!-- pictures of Cambridge! -->
<div style="display: flex; justify-content: center;">
  <p><img src="/assets/images/grad-school-apps/churchill.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/cambridge_ul.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" />
<img src="/assets/images/grad-school-apps/uk_phonebooth.png" width="220" style="border: 2px solid black; margin: 0 5px; filter: grayscale(95%);" /></p>
</div>
<p style="text-align: center;"><em>From left to right: Churchill College (a modern college in Cambridge with brutalist architecture), the University Library, a phonebooth repurposed into a library book deposit.</em></p>

<p>I’m happy to close this chapter that lasted for four years.
I’m thankful to all the people who helped me along the way, as I’ve learned a lot about myself and who I wanted to be.
The path was long and winding, but I’m happy we made it.
The next few years will be equally exciting, but I just want to linger in this moment—that in-between space as you turn the page to another chapter.</p>

<p>Now that I’m past the hump of PhD applications, I’d like to give back: if you’re interested in doing research on Tagalog and Filipino NLP with the intention to publish, feel free to reach out, and I’m happy to mentor you! I can also review research statements, but I’m not an expert!</p>

<p>If you’ve reached this part, dear reader, thank you for reading. Here I am in my matriculation gown!</p>

<p style="text-align: center;"><img src="/assets/images/grad-school-apps/matriculation.jpg" alt="" width="400px" /></p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>There are, of course, many people at Ai2 who helped me: Yanai, Valentina, Faeze, Nathan, Kyle R., Sachin, Shashank, Harsh, Ashish, Hannah, and Noah (and many more)! It does take a village. I’m also super thankful to Pradeep, Hanna, and Yizhong, who drafted my recommendation letters! <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>It’s an interesting system: all students also apply to colleges that are akin to “houses” in Harry Potter. I got into <a href="https://www.chu.cam.ac.uk/">Churchill</a> which focuses on STEM and has the best (according to my research) accommodation. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>LJ MIRANDA</name></author><category term="life" /><summary type="html"><![CDATA[A long reflection about my journey from industry to grad school applications&mdash;spanning multiple countries, jobs, and experiences. This is not an advice post, but I hope you'll find something valuable along the way.]]></summary></entry><entry><title type="html">Introducing FilBench: An Open LLM Evaluation Suite for Filipino</title><link href="https://ljvmiranda921.github.io/projects/2025/08/21/filbench/" rel="alternate" type="text/html" title="Introducing FilBench: An Open LLM Evaluation Suite for Filipino" /><published>2025-08-21T00:00:00+08:00</published><updated>2025-08-21T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/projects/2025/08/21/filbench</id><content type="html" xml:base="https://ljvmiranda921.github.io/projects/2025/08/21/filbench/"><![CDATA[<p><a href="https://github.com/filbench/filbench-eval" class="github-corner" aria-label="View source on GitHub"><svg width="80" height="80" viewBox="0 0 250 250" style="fill:#151513; color:#fff; position: absolute; top: 0; border: 0; left: 0; transform: scale(-1, 1);" aria-hidden="true"><path d="m0,0 115,115 15,0 12,27 108,108 0,-250 z"></path><path d="m128.3,109.0 c-14.5,-9.3 -9.3,-19.4 -9.3,-19.4 3,-6.9 1.5,-11 1.5,-11 -1.3,-6.6 2.9,-2.3 2.9,-2.3 3.9,4.6 2.1,11 2.1,11 -2.6,10.3 5.1,14.6 8.9,15.9" fill="currentColor" style="transform-origin: 130px 106px;" class="octo-arm"></path><path d="m115.0,115.0 c-0.1,0.1 3.7,1.5 4.8,0.4 l13.9,-13.8 c3.2,-2.4 6.2,-3.2 8.5,-3.0 -8.4,-10.6 -14.7,-24.2 1.6,-40.6 4.7,-4.6 10.2,-6.8 15.9,-7.0 0.6,-1.6 3.5,-7.4 11.7,-10.9 0,0 4.7,2.4 7.4,16.1 4.3,2.4 8.4,5.6 12.1,9.2 3.6,3.6 6.8,7.8 9.2,12.2 13.7,2.7 16.1,7.4 16.1,7.4 -3.5,8.2 -9.3,11.1 -10.9,11.7 -0.2,5.7 -2.4,11.2 -7.0,15.9 -16.4,16.3 -30.0,10.0 -40.6,1.6 0.2,2.3 -0.6,5.3 -3.0,8.5 l-13.8,13.9 c-1.1,1.1 0.3,4.9 0.4,4.8" fill="currentColor"></path></svg></a><style>.github-corner:hover 
.octo-arm{animation:octocat-wave 560ms ease-in-out}@keyframes octocat-wave{0%,100%{transform:rotate(0)}20%,60%{transform:rotate(-25deg)}40%,80%{transform:rotate(10deg)}}@media (max-width:500px){.github-corner:hover .octo-arm{animation:none}.github-corner .octo-arm{animation:octocat-wave 560ms ease-in-out}}</style></p>

<p><span class="firstcharacter">A</span>t the end of 2024, I wrote about my <a href="/notebook/2024/12/17/filipino-llm/">desiderata for Filipino NLP</a>.
One of which was <strong>evaluation</strong>.
I said that most of “how we measure LLM capabilities in Filipino are anecdotal: we post a screenshot of ChatGPT writing in Filipino and claim that it already has that capability—we need a systematic approach to evaluating these models.”
Fast forward to today, I’m happy that we are now inching towards systematic evaluations for Filipino.</p>

<p>Without further ado, I introduce <strong>FilBench</strong>, an LLM Evaluation Suite for Filipino!</p>

<iframe src="https://filbench-filbench-leaderboard.hf.space" frameborder="1" width="800" height="500"></iframe>

<p> </p>

<ul>
  <li><strong>Paper</strong>: <a href="https://arxiv.org/abs/2508.03523">https://arxiv.org/abs/2508.03523</a></li>
  <li><strong>Code</strong>: <a href="https://github.com/filbench/filbench-eval">github.com/filbench/filbench-eval</a></li>
  <li><strong>Leaderboard</strong>: <a href="https://huggingface.co/spaces/UD-Filipino/filbench-leaderboard">hf.co/spaces/UD-Filipino/filbench-leaderboard</a></li>
</ul>

<h2 id="what-is-filbench">What is FilBench?</h2>

<p><img src="/assets/images/filbench/filbench_main.svg" align="right" height="500" /></p>

<p><a href="">FilBench</a> is (1) a <strong>benchmark</strong> to test LLM capabilities on Filipino, and (2) a <strong>leaderboard</strong> to track the progress of LLM development for Philippine languages.
When building FilBench, I imagined two types of audiences:</p>

<ul>
  <li>
    <p>The <strong>multilingual NLP research scientist</strong> who wants to test whether their new language model generalizes to other languages.
FilBench aims to provide a robust and comprehensive evaluation suite for Filipino tasks and use-cases.</p>
  </li>
  <li>
    <p>The <strong>language model developer</strong> who wants to know which language model best fits the application they’re building.
The FilBench leaderboard provides a detailed breakdown of a model’s capabilities, and analyses of the parameter- and cost-efficiency of such models.</p>
  </li>
</ul>

<p>When building FilBench, we took stock of the datasets that the Philippine NLP research community used to evaluate pre-ChatGPT language models.
Although these datasets are useful for understanding the capabilities of language systems, they are ill-posed for current LLM evaluation as-is.
We then curated them into four main categories: Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation.</p>

<p>Finally, FilBench has been accepted to <a href="https://2025.emnlp.org/">EMNLP Main</a>! 
EMNLP is one of the top NLP conferences in the field and getting a paper published in this venue can be competitive. 
We’re happy with the outcome, so catch Ely and Joseph as they present FilBench in Suzhou, China this November!</p>

<h2 id="on-building-filbench">On Building FilBench</h2>

<p>One thing that I found exciting when building FilBench is that it felt like I was assembling the Avengers of Filipino NLP.
It started with just three of us (<a href="https://www.linkedin.com/in/elyanahaco2000/">Ely</a>, <a href="https://www.linkedin.com/in/connermanuel/">Conner</a>, and me) working on the <a href="https://github.com/huggingface/data-is-better-together">Data Is Better Together annotation project</a> from HuggingFace.
I had met them separately through other projects like <a href="https://seacrowd.github.io/">SEACrowd</a> and earlier correspondences.
Then, we onboarded <a href="https://www.josephimperial.com/">Joseph</a> and <a href="https://blaisecruz.com/">Blaise</a>, who have been working on Filipino NLP since I entered the field.
We’re just five right now, but hopefully next time there’ll be more of us.</p>

<p>I find grassroots projects like this to be appealing, given that all of us are volunteers spending our free time on FilBench.
We automatically pass the “will-they-care-about-this-project” bar, making collaboration smoother.
There were hurdles too, such as scrounging up compute for running evals, or finding the right time across five timezones.
But the scrappiness of the group is energizing, and it allowed us to iterate and build FilBench despite little to no resources.</p>

<p>In addition, it was nice working with the OpenEvals team at HuggingFace, since they help maintain <a href="https://github.com/huggingface/lighteval">lighteval</a>, the evaluation framework we used for FilBench.
I’m also thankful to <a href="https://github.com/NathanHB">Nathan Habib</a>, <a href="https://clefourrier.github.io/">Clementine Fourrier</a>, and <a href="https://danielvanstrien.xyz/">Daniel van Strien</a> for their feedback on the <a href="https://huggingface.co/blog/filbench">official HuggingFace blog post</a>.
Furthermore, FilBench is now part of the official community evaluations on lighteval, so check it out there!</p>

<!-- screenshot of the header of the filbench HF blog post -->

<p>We have some interesting projects lined up and there’s definitely still some space on the bench. 
I’m very optimistic about the future of Filipino NLP research, and would love to broaden my collaborations more.
So if you’re interested in collaborating with us, then <a href="mailto:filbench-eval@googlegroups.com">reach out</a>!</p>

<h2 id="my-next-plans-for-filipino-nlp">My next plans for Filipino NLP</h2>

<p>For me, the next few months will bring about a lot of change and adjustment.
I’m vacationing in the Philippines until October, so I have yet to start new research projects for the latter half of the year.
However, I haven’t been idle: I’ve been shaping up ideas and talking to some people so hopefully I’m back in action by Q4.</p>

<ul>
  <li>
    <p><strong>On FilBench v2</strong>: I have a lot of new ideas for FilBench v2.
However, I want to maintain my key priority for developing FilBench, i.e., to improve the scope of Philippine languages in our benchmark.
More specifically, my metric of success for FilBench is to cover all major regional languages in the Philippines.</p>

    <p>We also have some plans regarding a cost-efficient and mini version of FilBench.
This allows model evaluators to easily get feedback on their FilBench scores without running the full set.
The details are being finalized, but the goal is to release FilBench-Mini by the end of the year.</p>
  </li>
  <li>
    <p><strong>On Filipino NLP</strong>: Most of what I’ve mentioned in <a href="/notebook/2024/12/17/filipino-llm/">my desiderata</a> still holds.
Now that I’m heading towards my PhD, I’m also figuring out how I can balance my interests in Filipino NLP and my general PhD research project.
Nevertheless, I believe our efforts on FilBench have proven that grassroots and volunteer-led research projects can be successful.</p>
  </li>
</ul>

<p>I’d like to collaborate with <strong>organizations who can provide compute.</strong>
I have some ideas on low-resource LLM training and data curation (with Filipino as a case study).
Training a Filipino-centric LLM is a conversation topic I hear a lot, but I think there should be some finesse in executing and framing this project, so that it becomes impactful and not “yet another” LLM that the next frontier LLM like GPT-4.1 can easily beat.</p>

<h2 id="final-thoughts">Final thoughts</h2>

<p>Overall, I’m very happy with the progress we made on FilBench.
It is a good project that let me collaborate with contemporaries in Filipino NLP, and it resulted in a main conference acceptance.
Projects like these don’t stop after publication; in fact, the whole “Filipino NLP” project is just beginning.
With FilBench, we now have a comprehensive and easy-to-run benchmark for LLMs.
The next step is to figure out what kind of model (or more specifically, what type of pre/post-training data, architecture, and training paradigm) can top this leaderboard.</p>

<p>If you’re interested in working on language-specific evals and Filipino NLP in general, just <a href="mailto:ljvmiranda@gmail.com">shoot me an e-mail</a> and let’s chat!</p>

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">miranda2025filbench</span><span class="p">,</span>
  <span class="na">title</span>   <span class="p">=</span> <span class="s">"Fil{B}ench: {C}an {LLM}s {U}nderstand and {G}enerate {F}ilipino?"</span><span class="p">,</span>
  <span class="na">author</span>  <span class="p">=</span> <span class="s">"Miranda, Lester James Validad and 
            Aco, Elyanah and 
            Manuel, Conner and 
            Cruz, Jan Christian Blaise and 
            Imperial, Joseph Marvin"</span><span class="p">,</span>
  <span class="na">journal</span> <span class="p">=</span> <span class="s">"arXiv preprint arXiv:2508.03523"</span><span class="p">,</span>
  <span class="na">year</span>    <span class="p">=</span> <span class="m">2025</span>
<span class="p">}</span>
</code></pre></div></div>]]></content><author><name>LJ MIRANDA</name></author><category term="projects" /><category term="nlp" /><category term="language technology" /><category term="natural language processing" /><category term="tagalog" /><category term="low resource" /><category term="llm" /><category term="machine learning" /><summary type="html"><![CDATA[This National Language Month, I'm proud to introduce FilBench, a big step forward in Filipino NLP evaluation. This work was also accepted at EMNLP Main! Read to learn more about this project.]]></summary></entry><entry><title type="html">Field Report: ACL 2025</title><link href="https://ljvmiranda921.github.io/notebook/2025/08/01/field-report-acl25/" rel="alternate" type="text/html" title="Field Report: ACL 2025" /><published>2025-08-01T00:00:00+08:00</published><updated>2025-08-01T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2025/08/01/field-report-acl25</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2025/08/01/field-report-acl25/"><![CDATA[<p><span class="firstcharacter">I</span> had an incredible time at the <a href="https://2025.aclweb.org/">ACL 2025 Conference</a> in Vienna.
ACL is one of the top conferences in NLP, with researchers presenting a wide range of work from computational linguistics to frontier large language models.
This was also the very first NLP conference I attended.
Although I had published in *CL venues before, I never had the chance to attend in person.
Attending ACL was also a great way to immerse myself in the broader NLP community before starting my Ph.D.</p>

<h2 id="works-i-presented-during-the-conference">Works I presented during the conference</h2>

<p style="text-align: center;"><img src="/assets/images/field-report-acl25/presentations.png" alt="" width="720px" /></p>

<p>Self-promotion: I had four papers accepted in the ACL Main Proceedings, three of which I was the first or co-first author on (all in the Resources &amp; Evaluation track):</p>

<ul>
  <li><a href="https://arxiv.org/abs/2410.19133">Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback</a><br />TLDR: We find that some preference instances are better annotated by humans than by language models (LMs). We use that information to train a hybrid preference router (HyPER) that allocates instances to either humans or LMs.</li>
  <li><a href="https://arxiv.org/abs/2410.15522">M-RewardBench: Evaluating Reward Models in Multilingual Settings.</a><br />TLDR: We introduce a new benchmark for evaluating reward models (RMs) in 23 diverse languages. By evaluating several RMs on M-RewardBench, we uncovered significant gaps in RM performance between English and non-English languages.</li>
  <li><a href="https://arxiv.org/abs/2505.20428">The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project</a><br />TLDR: We introduce the largest Tagalog treebank to date, 100x larger than previous treebanks.
By building this treebank, we stretch the limits of the Universal Dependencies framework and challenge its “universality.”</li>
</ul>

<p>And a big collab project from SEACrowd:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2503.07920">Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia</a><br />TLDR: We present SEA-VL, one of the largest open-source initiatives to develop high-quality and culturally-relevant data for SEA languages. We also highlight representation gaps in SEA, especially for vision-language models.</li>
</ul>

<p><em>Really, four papers?!</em> This seems like a big feat, but these works were completed at different times—it’s just that the timing led them to be published simultaneously.
In fact, HyPER was a reject from ICLR that we further refined, and my involvement in UD-NewsCrawl started way back in 2023.
Good research takes time, and I’m lucky to have great collaborators during these long periods of work.</p>

<h2 id="papers-that-piqued-my-attention">Papers that piqued my attention</h2>

<p>I enjoyed going through the poster presentations during the conference.
Since I was presenting, I wasn’t able to attend most of the Main posters.
However, I had a great time talking to other researchers in the other sessions!</p>

<ul>
  <li>
    <p><a href="https://aclanthology.org/2025.acl-long.1141/">Language Models Resist Alignment: Evidence from Data Compression</a>: It’s one of the best papers of ACL and presents a physics-inspired view of post-training behaviour.
According to the paper, post-trained LMs exhibit some kind of <em>elasticity</em>: they revert to the behaviour distribution formed during pretraining when finetuned further.
The rest of the work is an analysis of this elasticity, where they found that it positively correlates with increased model size and expansion of pre-training data.</p>
  </li>
  <li>
    <p><a href="https://aclanthology.org/2025.acl-long.435/">Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce</a>: A good paper that examines the patterns in language resource development from a survey of several NLP practitioners.
It also highlights some potential systemic issues with large-scale community efforts such as credit attribution.
I find this paper thought-provoking as it looks into the <em>meta</em> of language resource development and reveals issues that we tend to take for granted.</p>
  </li>
  <li>
    <p><a href="https://aclanthology.org/2025.acl-long.779/">Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?</a>: This is an interesting paper that examines how tools can improve annotation quality in LLM-as-a-judge scenarios.
I’ve been working on finetuning LMs for tool-use at Ai2, and I’m very curious about its applications in some of the domains I care about.
There is good empirical evidence for using tools during annotation, and I’m excited to see where else we can apply this framework.</p>
  </li>
</ul>

<p>Honorable mentions:</p>

<ul>
  <li><a href="https://aclanthology.org/2025.acl-industry.27/">A Perspective on LLM Data Generation with Few-shot Examples: from Intent to Kubernetes Manifest</a>: fun industry-track paper that solves the problem of writing annoying Kubernetes configs! Tickled the software engineer in me.</li>
  <li><a href="https://aclanthology.org/2025.findings-acl.1347/">PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play</a>: an interesting framework for refining tool instructions. Might be useful for improving the quality of synthetic tool-use data.</li>
  <li><a href="https://aclanthology.org/2025.findings-acl.873/">Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models</a>: a good addition to the literature of LLM routing. Can even be useful for some LLM-as-a-judge applications.</li>
</ul>

<h2 id="miscellaneous">Miscellaneous</h2>

<p style="text-align: center;"><img src="/assets/images/field-report-acl25/socials.png" alt="" width="720px" /></p>

<ul>
  <li>
    <p><strong>I attended the SEACrowd Birds-of-a-Feather (BoF) session!</strong> Finally, I’ve met my long-time collaborators.
I’ve been involved with the SEACrowd group during their initial project, so it’s nice to meet Holy, Samuel, and Aji in person.
During the BoF, we talked about what is “enough” for Southeast Asian NLP, what’s missing (it’s always money), and what are our current successes.
I honestly look forward to SEACrowd events.
I traveled to ACL alone, so being with some familiar faces gave me a feeling of having a <em>home base</em> in a sea of strangers.</p>
  </li>
  <li>
    <p><strong>Met with some Pinoy NLP researchers.</strong> I’m also glad to see Filipinos publishing in these top NLP conferences.
I met with some folks from the Asian Institute of Management (Sir Chris, Japer, K-Ann), SEACrowd (Blaise and Anton), AI Singapore (Railey), and Radboud University (Jane).
Data science in the Philippines is a small world: each of us was connected in one way or another (either we’re from the same high school, college, job, etc.).
I was very happy to see Pinoys and almost shouted <em>Uy pilipins!</em> upon meeting them.</p>
  </li>
  <li>
    <p><strong>Thank you for the water, kind stranger!</strong> During the second day of the conference, I was presenting two posters at the same time! It was quite intense as I jumped from one work to another.
Add Vienna’s summer heat and I was definitely sweaty and haggard.
Suddenly, a kind stranger offered me some water and I felt refreshed and energized.
It was a nice gesture and now a core memory of my conference experience.
I wasn’t able to catch their name, but if you’re reading this: thank you for the water, kind stranger!</p>
  </li>
</ul>

<h2 id="final-thoughts">Final thoughts</h2>

<p>Attending ACL 2025 was a great way to close my pre-PhD years and start my new journey as a PhD student.
As my first NLP conference, ACL gave me a favorable first impression of the community at large.
Perhaps I’ve been lucky with the people I’ve talked to, but everyone was warm, friendly, and welcoming.
Good vibes all throughout, no notes.</p>

<p>Also, I enjoyed writing this field report: it’s a good exercise for me to synthesize things I’ve learned during the conference, and hopefully it’s informative for non-attendees to get an inside look into these academic research events.
I admit I haven’t been writing on this blog recently (although in general, I’ve written way more during the past two years), so this might be another way to fill in those gaps during the year.
Hoping to write more field reports in the future!
<em>And finally, some photos of my post-conference trip:</em></p>

<p style="text-align: center;"><img src="/assets/images/field-report-acl25/places.png" alt="" width="800px" /></p>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="nlp" /><category term="acl" /><category term="conference" /><category term="research" /><category term="natural language processing" /><category term="ai" /><category term="llm" /><category term="reasoning" /><summary type="html"><![CDATA[Here is my field report from the ACL 2025 Conference in Vienna, Austria. Overall, it was a great experience: the vibes are good and I'm happy to have met the larger NLP community!]]></summary></entry><entry><title type="html">‘Draw me a swordsman’: Can tool-calling LLMs draw pixel art?</title><link href="https://ljvmiranda921.github.io/notebook/2025/07/20/draw-me-a-swordsman/" rel="alternate" type="text/html" title="‘Draw me a swordsman’: Can tool-calling LLMs draw pixel art?" /><published>2025-07-20T00:00:00+08:00</published><updated>2025-07-20T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2025/07/20/draw-me-a-swordsman</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2025/07/20/draw-me-a-swordsman/"><![CDATA[<p><span class="firstcharacter">R</span>ecently, we’ve witnessed language models become increasingly adept at using real-world tools through their function-calling capabilities.
Whenever I <a href="https://ljvmiranda921.itch.io">develop games</a>, I use <a href="https://www.aseprite.org/"><strong>Aseprite</strong></a> to create pixel art characters and environments.
Aseprite is powerful: I can prepare spritesheets, test color palettes, and preview animations—it’s not something any AI image generator can simply replace.
So I wondered: could I incorporate LLMs into my workflow?</p>

<!--Aseprite screenshot-->

<p>In this blog post, I tested LLMs with tool-calling capabilities on the following tasks:</p>

<ul>
  <li><strong>Task 1: Draw a swordsman</strong>: an easy task that generates a static image.</li>
</ul>

<blockquote>
  <p>Draw me a pixel art of a swordsman.</p>
</blockquote>

<ul>
  <li><strong>Task 2: Draw a swordsman performing a slashing sequence</strong>: a harder task.</li>
</ul>

<blockquote>
  <p>Draw a 4-frame spritesheet showing a swordsman performing a sword slash attack sequence, with each frame capturing a different stage of the slashing motion from windup to follow-through.</p>
</blockquote>

<p>The characters in Task 1 and Task 2 don’t have to be the same.
Task 2 is particularly interesting because creating sprite sheets is a common use-case in game development.
It’s also challenging for LLMs since it requires sequential understanding: each frame must logically follow the previous one to create believable animation.
Unlike generating a single image, sprite sheets demand consistency in character design, progression, and timing.</p>

<p>Here’s a <strong>human baseline</strong> from a <a href="https://ljvmiranda921.itch.io/abyss">game</a> I made a few years ago:</p>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/human_simple.png" alt="task 1 image" width="80" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/human_spritesheet.png" alt="task 2 image" width="500" /></td>
    </tr>
  </tbody>
</table>

<p>I evaluated each LLM’s output using two criteria, each scored from 0 to 3:</p>

<ul>
  <li><strong>Correctness</strong>: Did the LLM follow the instructions accurately?</li>
  <li><strong>Creativity</strong>: How original or artistic was the LLM’s approach?</li>
</ul>

<p>To get a single representative score, I compute the average of correctness and creativity for each task, then average those two task scores together. I’ll dub the result the <strong>SwordsBench score</strong>!</p>
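
<p>The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the ratings in the example are hypothetical, and they are not results reported in this post.</p>

```python
from statistics import mean


def task_score(correctness, creativity):
    """Average of the two criteria (each scored from 0 to 3) for one task."""
    return mean([correctness, creativity])


def swordsbench_score(task_scores):
    """Average the per-task scores into one representative number."""
    return mean(task_scores)


# Hypothetical ratings for a single model (made up for illustration).
task1 = task_score(correctness=3, creativity=2)  # 2.5
task2 = task_score(correctness=1, creativity=2)  # 1.5
print(swordsbench_score([task1, task2]))  # 2.0
```

<p>Since both tasks carry equal weight, a model that does well on the static image but poorly on the spritesheet ends up in the middle of the 0 to 3 range.</p>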

<p><strong>You can find the full code <a href="https://github.com/ljvmiranda921/scratch/tree/master/2025-07-11-aseprite-mcp">on GitHub</a>.</strong></p>

<h2 id="setting-up-the-agent-environment-interaction">Setting up the Agent-Environment Interaction</h2>

<p>This project also helped me understand the development workflow for MCPs and LLM agents.
I like to frame this setup in terms similar to reinforcement learning: we instruct an <strong>Agent</strong> (an LLM) to interact with the <strong>Environment</strong> (standardized via MCP) to accomplish a task, as shown in the diagram below:</p>

<p style="text-align: center;"><img src="/assets/images/draw-me-a-swordsman/testbed.svg" alt="" width="800px" /><br />
<em>In the tool-calling paradigm, we instruct an Agent to interact with the Environment in order to accomplish a task. The Agent can be implemented via the <a href="https://openai.github.io/openai-agents-python/">OpenAI Agents SDK</a> or natively in <a href="https://modelcontextprotocol.io/quickstart/user">Claude Desktop</a>, while the Environment is an MCP server that calls Aseprite commands.</em></p>

<h3 id="mcp-server-environment">MCP Server Environment</h3>

<p>The <strong>Environment</strong> receives commands from the Agent, executes them, and provides feedback that the agent can use for its next action.
Recently, Anthropic released the <a href="https://modelcontextprotocol.io/introduction">Model Context Protocol (MCP)</a>, which provides a common interface for how agents can interact with any given environment.
I think of MCP as a standardized set of affordances for the Agent.</p>

<p>For this project, I used the Aseprite MCP implementation from <a href="https://github.com/diivi/aseprite-mcp">diivi/aseprite-mcp</a> with some slight modifications and bug fixes.
The Aseprite MCP server exposes the following tools:</p>

<table>
  <thead>
    <tr>
      <th>Tool Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>create_canvas</strong></td>
      <td>Create a new Aseprite canvas with specific dimensions</td>
    </tr>
    <tr>
      <td><strong>add_layer</strong></td>
      <td>Add a new layer to an existing Aseprite file with a specified layer name</td>
    </tr>
    <tr>
      <td><strong>add_frame</strong></td>
      <td>Add a new frame to an existing Aseprite file for animation purposes</td>
    </tr>
    <tr>
      <td><strong>draw_pixels</strong></td>
      <td>Draw individual pixels on the canvas with coordinates and hex colors</td>
    </tr>
    <tr>
      <td><strong>draw_line</strong></td>
      <td>Draw a line between two points with customizable color and thickness</td>
    </tr>
    <tr>
      <td><strong>draw_rectangle</strong></td>
      <td>Draw a rectangle with specified position, dimensions, color, and fill</td>
    </tr>
    <tr>
      <td><strong>fill_area</strong></td>
      <td>Fill an area with color using a paint bucket tool from a starting coordinate</td>
    </tr>
    <tr>
      <td><strong>draw_circle</strong></td>
      <td>Draw a circle with specified center point, radius, color, and optional fill</td>
    </tr>
    <tr>
      <td><strong>export_sprite</strong></td>
      <td>Export the Aseprite file to other formats like PNG, GIF, JPG, etc.</td>
    </tr>
    <tr>
      <td><strong>preview_image</strong></td>
      <td>Read and display an image file as base64 data for preview purposes</td>
    </tr>
  </tbody>
</table>

<p>Under the hood, these tools are Lua commands that are sent to the <code class="language-plaintext highlighter-rouge">aseprite</code> executable.
For example, the <code class="language-plaintext highlighter-rouge">draw_pixels</code> tool call is simply a series of <code class="language-plaintext highlighter-rouge">img:putPixel</code> commands for drawing pixels:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">mcp</span><span class="p">.</span><span class="n">tool</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">draw_pixels</span><span class="p">(</span><span class="n">filename</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">pixels</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="p">...</span>
    <span class="k">for</span> <span class="n">pixel</span> <span class="ow">in</span> <span class="n">pixels</span><span class="p">:</span>
      <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">pixel</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"x"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">pixel</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"y"</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
      <span class="n">color</span> <span class="o">=</span> <span class="n">pixel</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"color"</span><span class="p">,</span> <span class="s">"#000000"</span><span class="p">)</span>
      <span class="c1"># Convert hex to RGB
</span>      <span class="n">color</span> <span class="o">=</span> <span class="n">color</span><span class="p">.</span><span class="n">lstrip</span><span class="p">(</span><span class="s">"#"</span><span class="p">)</span>
      <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">color</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">],</span> <span class="mi">16</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">color</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="mi">4</span><span class="p">],</span> <span class="mi">16</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="n">color</span><span class="p">[</span><span class="mi">4</span><span class="p">:</span><span class="mi">6</span><span class="p">],</span> <span class="mi">16</span><span class="p">)</span>

      <span class="n">script</span> <span class="o">+=</span> <span class="sa">f</span><span class="s">"""
      img:putPixel(</span><span class="si">{</span><span class="n">x</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">y</span><span class="si">}</span><span class="s">, Color(</span><span class="si">{</span><span class="n">r</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">g</span><span class="si">}</span><span class="s">, </span><span class="si">{</span><span class="n">b</span><span class="si">}</span><span class="s">, 255))
      """</span>

    <span class="c1"># Submit the script to Aseprite
</span>    <span class="n">execute_lua_command</span><span class="p">(</span><span class="n">script</span><span class="p">)</span>
    <span class="p">...</span>
</code></pre></div></div>

<p>Then, the function <code class="language-plaintext highlighter-rouge">execute_lua_command</code> calls the <code class="language-plaintext highlighter-rouge">aseprite</code> executable and passes the Lua script to the <code class="language-plaintext highlighter-rouge">--script</code> argument.
You can find the full implementation of the Aseprite MCP server in <a href="https://github.com/ljvmiranda921/scratch/tree/master/2025-07-11-aseprite-mcp/aseprite_mcp">this repository</a>.</p>
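<p>To make that hand-off concrete, here is a rough sketch of what such a helper could look like. It is not the implementation from the repository: the names <code class="language-plaintext highlighter-rouge">build_aseprite_command</code> and <code class="language-plaintext highlighter-rouge">run_lua_script</code> are hypothetical, and I am assuming the script is written to a temporary file and that Aseprite is run headlessly with its <code class="language-plaintext highlighter-rouge">--batch</code> flag.</p>

```python
import subprocess
import tempfile
from pathlib import Path


def build_aseprite_command(script_path, executable="aseprite"):
    """Build the CLI call: run without the UI and execute a Lua script."""
    return [executable, "--batch", "--script", str(script_path)]


def run_lua_script(script, executable="aseprite"):
    """Write the Lua source to a temp file, run it, and return (ok, output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lua", delete=False) as f:
        f.write(script)
        script_path = Path(f.name)
    try:
        result = subprocess.run(
            build_aseprite_command(script_path, executable),
            capture_output=True,
            text=True,
        )
        return result.returncode == 0, result.stdout or result.stderr
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        return False, f"Executable not found: {executable}"
    finally:
        # Clean up the temporary script whether or not the run succeeded.
        script_path.unlink(missing_ok=True)


print(build_aseprite_command("draw.lua"))
```

<p>Returning the output or error text, rather than raising, matters here because that text is the feedback the Agent reads before choosing its next action.</p>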

<h3 id="large-language-model-agent">Large Language Model Agent</h3>

<p>The <strong>Agent</strong> has access to tools that interact with the execution environment to accomplish a task.
We use language models like GPT-4.1 or Claude Sonnet 4—some of which were trained for tool-use—as agents.
Models with tool-use capabilities can output function calls based on the tools in their system prompt.
These function calls are parsed by an intermediary layer (such as an MCP server) and passed to the execution environment.</p>
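<p>The parse-and-dispatch step can be sketched in a few lines. This is a simplified illustration under my own assumptions: the JSON shape (<code class="language-plaintext highlighter-rouge">name</code> plus <code class="language-plaintext highlighter-rouge">arguments</code>), the <code class="language-plaintext highlighter-rouge">TOOLS</code> registry, and the toy <code class="language-plaintext highlighter-rouge">draw_line</code> body are hypothetical and are not the MCP wire format or the SDK internals.</p>

```python
import json

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {}


def tool(fn):
    """Register a function so the dispatcher can look it up by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def draw_line(x1, y1, x2, y2, color="#000000"):
    # A stand-in for the real tool: it only reports what it would draw.
    return f"Drew line from ({x1}, {y1}) to ({x2}, {y2}) in {color}"


def dispatch(raw_call):
    """Parse a model-emitted call and run the matching tool.

    Errors are returned as text so the model can read them and retry.
    """
    try:
        call = json.loads(raw_call)
        fn = TOOLS[call["name"]]
        return fn(**call.get("arguments", {}))
    except (json.JSONDecodeError, KeyError, TypeError) as err:
        return f"Error: {err!r}"


print(dispatch('{"name": "draw_line", "arguments": {"x1": 0, "y1": 0, "x2": 3, "y2": 3}}'))
```

<p>An unknown tool name or a malformed argument list comes back as an error string instead of crashing the loop, which is what lets an agent recover by trying another tool.</p>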

<p>I used the <a href="https://openai.github.io/openai-agents-python/">OpenAI Agents SDK</a> to build the Aseprite agent.
The implementation is quite straightforward: I just need to create an instance of an <code class="language-plaintext highlighter-rouge">Agent</code> class and let it interact with the execution environment (<code class="language-plaintext highlighter-rouge">mcp_servers=[server]</code>).
For open-weight models, I host them as an <a href="https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html">OpenAI-compatible vLLM server</a> and pass the server URL to a <a href="https://docs.litellm.ai/docs/providers/openai_compatible">LiteLLM proxy</a> when instantiating the <code class="language-plaintext highlighter-rouge">Agent</code> class.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="k">with</span> <span class="n">mcp_server</span> <span class="k">as</span> <span class="n">server</span><span class="p">:</span>
    <span class="k">with</span> <span class="n">trace</span><span class="p">(</span><span class="n">workflow_name</span><span class="o">=</span><span class="n">workflow_name</span><span class="p">):</span>
      <span class="n">model</span> <span class="o">=</span> <span class="n">get_model</span><span class="p">(</span><span class="n">model_name</span><span class="p">)</span>
      <span class="n">agent</span> <span class="o">=</span> <span class="n">Agent</span><span class="p">(</span>
          <span class="n">name</span><span class="o">=</span><span class="s">"Assistant"</span><span class="p">,</span>
          <span class="n">instructions</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
          <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
          <span class="n">mcp_servers</span><span class="o">=</span><span class="p">[</span><span class="n">server</span><span class="p">],</span>
      <span class="p">)</span>

      <span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">Runner</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
        <span class="n">starting_agent</span><span class="o">=</span><span class="n">agent</span><span class="p">,</span> 
        <span class="nb">input</span><span class="o">=</span><span class="n">request</span><span class="p">,</span>
      <span class="p">)</span>
</code></pre></div></div>

<p>When an <code class="language-plaintext highlighter-rouge">Agent</code> is instantiated, it contains information about the <code class="language-plaintext highlighter-rouge">model</code> it’s using, its <code class="language-plaintext highlighter-rouge">system_prompt</code>, and the <code class="language-plaintext highlighter-rouge">mcp_servers</code> it’s connected to.
I experimented with different system prompts—the one below gave me the best results:</p>

<blockquote>
  <p>You are a creative and artistic function-calling agent that can use pixel art
tools to perform a drawing task. You have a good knowledge of color, form, and
movement.  Your output must always be saved as an image file in the PNG format.
If you encounter an error, find a way to resolve it using other available tools.</p>
</blockquote>

<p>To initiate an interaction between the <code class="language-plaintext highlighter-rouge">Agent</code> and the MCP server, we simply pass the agent to a <code class="language-plaintext highlighter-rouge">Runner</code> class with our actual request as <code class="language-plaintext highlighter-rouge">input</code>.</p>

<h2 id="results">Results</h2>

<p>Below are the results of several models on the simple swordsman and spritesheet tasks.
The figures below show the best of three agent-environment interactions, selected based on correctness and creativity criteria.</p>

<h4 id="qwen-3-32b-swordsbench-025">Qwen 3 32B (SwordsBench: 0.25)</h4>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/qwen3-32b_simple.png" alt="task 1 image" width="80" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/qwen3-32b_spritesheet.png" alt="task 2 image" width="500" /></td>
    </tr>
  </tbody>
</table>

<p>Qwen 3 performed poorly on both tasks.
I can see the attempt to draw a swordsman in Task 1: the brown circle is the head, the cyan rectangle is the body, and the grey rectangle is the sword.
However, the Task 2 drawing is poor and doesn’t produce anything recognizable.
I actually ran Qwen 3 with reasoning both turned on and turned off.
Generally, <strong>using tools with reasoning resulted in better-looking figures</strong>, though it took much longer to complete.</p>

<p>Funnily enough, the reasoning trajectories suggest that Qwen 3 likes blaming the tools.
It will say something akin to: “the <code class="language-plaintext highlighter-rouge">draw_line</code> tool doesn’t work, so I have no way to perform this task” (it actually works).
I’m sorry Qwen, but this is a skill issue (minus points for the attitude too).</p>

<p><strong>Task 1</strong> - Creativity (0/3), Correctness (1/3)<br />
<strong>Task 2</strong> - Creativity (0/3), Correctness (0/3)</p>

<h4 id="gpt-4o-swordsbench-075">GPT-4o (SwordsBench: 0.75)</h4>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/gpt-4o_simple.png" alt="task 1 image" width="80" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/gpt-4o_spritesheet.png" alt="task 2 image" width="500" /></td>
    </tr>
  </tbody>
</table>

<p>GPT-4o’s intent is clearer than Qwen 3’s.
The Task 1 drawing can definitely be a swordsman if I squint hard enough.
There’s a recognizable sword shape, but the body is either missing or unclear.
The Task 2 spritesheet shows promise: there’s clear movement from one sprite to another, and the blue “swordsman” maintains consistency across frames.
However, it seems like GPT-4o has taken a lot of creative liberties in designing the swordsman, as the blue blob is hard to identify by itself.
This is a good and earnest try.</p>

<p><strong>Task 1</strong> - Creativity (1/3), Correctness (1/3)<br />
<strong>Task 2</strong> - Creativity (0/3), Correctness (1/3)</p>

<h4 id="gpt-41-swordsbench-125">GPT-4.1 (SwordsBench: 1.25)</h4>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/gpt-4.1_simple.png" alt="task 1 image" width="80" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/gpt-4.1_spritesheet.png" alt="task 2 image" width="500" /></td>
    </tr>
  </tbody>
</table>

<p>GPT-4.1 shows a marked improvement over GPT-4o.
The Task 1 swordsman is much more recognizable—there’s a clear humanoid figure with distinct head, body, and limbs, but no well-defined sword.
The proportions are reasonable and it actually looks like a character you might find in a classic pixel art game.
For Task 2, while the character design remains consistent across frames, the animation sequence doesn’t quite work.
The “sword” ends up looking more like a gun being held horizontally, and there’s no clear slashing motion or arc.
It’s a decent attempt at maintaining character consistency, but fails to capture the essence of a sword attack.</p>

<p><strong>Task 1</strong> - Creativity (2/3), Correctness (1/3)<br />
<strong>Task 2</strong> - Creativity (1/3), Correctness (1/3)</p>

<h4 id="claude-sonnet-4-swordsbench-20">Claude Sonnet 4 (SwordsBench: 2.0)</h4>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/claude-sonnet_simple.png" alt="task 1 image" width="80" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/claude-sonnet_spritesheet.png" alt="task 2 image" width="500" /></td>
    </tr>
  </tbody>
</table>

<p>Claude Sonnet 4 is getting there but isn’t quite the most polished.
The Task 1 swordsman does look like a proper pixel art character—you can make out the basic humanoid form and it has that classic retro game aesthetic.
It’s recognizable as a swordsman even if the details aren’t super crisp.</p>

<p>Task 2 is where things get weird: while there’s definitely a slash motion happening across the frames, the character completely loses all sense of body proportions.
The figure becomes distorted and inconsistent, though you can still trace the sword movement from windup to strike.
It’s like the model understood the motion but forgot about maintaining the character’s form.</p>

<p><strong>Task 1</strong> - Creativity (2/3), Correctness (3/3)<br />
<strong>Task 2</strong> - Creativity (1/3), Correctness (2/3)</p>

<h4 id="claude-opus-4-swordsbench-25">Claude Opus 4 (SwordsBench: 2.5)</h4>

<table>
  <thead>
    <tr>
      <th>Task 1: Swordsman</th>
      <th>Task 2: Draw a 4-frame spritesheet of a sword slash attack</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><img src="/assets/images/draw-me-a-swordsman/claude-opus_simple.png" alt="task 1 image" width="100" /></td>
      <td><img src="/assets/images/draw-me-a-swordsman/claude-opus_spritesheet.png" alt="task 2 image" width="550" /></td>
    </tr>
  </tbody>
</table>

<p>Claude Opus 4 produces solid results across both tasks.
The Task 1 swordsman shows good creativity with a distinctive character design—clear head, body, and sword, though the sword shape could be more refined.
What’s particularly interesting is that <strong>Opus seems to have developed a consistent character concept</strong>: even across different prompts, it drew essentially the same swordsman design.</p>

<p>Task 2 delivers a proper slash sequence where you can follow the sword’s motion from windup to strike, and the character maintains its form throughout the animation.
It’s a competent spritesheet that demonstrates understanding of both animation principles and character consistency.</p>

<p><strong>Task 1</strong> - Creativity (3/3), Correctness (2/3)<br />
<strong>Task 2</strong> - Creativity (2/3), Correctness (3/3)</p>

<h2 id="discussion">Discussion</h2>

<p>Recently, I’ve been conducting research on endowing models with tool-use capabilities.
I’ve been deep in the modeling side, so it’s interesting to see and address some knowledge gaps from a developer’s perspective.
Here are some of my learnings:</p>

<h3 id="models-are-constrained-by-the-tools-available-to-them">Models are constrained by the tools available to them</h3>

<blockquote>
  <p><em>“Give me a lever long enough and a fulcrum on which to place it, and I shall move the world.”</em></p>
</blockquote>

<p>The Aseprite MCP server exposes basic primitives—from drawing a single pixel to basic shapes.
In reality, pixel artists use many other Aseprite features: color pickers, multiple onion frames for animations, and more.
This might explain why the resulting artworks look blocky and basic compared to human-created art.</p>

<p>From a developer’s perspective, choosing <strong>which tools to expose and their granularity</strong> (e.g., drawing pixels → drawing shapes) is crucial for MCP server design.
Most models work within the constraints of provided tools, so determining which tools to build gives you the right leverage from a “very-smart-assistant.”
From a researcher’s perspective, it would be interesting to endow models with the capability to create their own tools or assess tool quality.
I’ve seen this in Claude Code, where it writes Python scripts to perform complex tasks.
There are many long-horizon tasks that make this area exciting!</p>

<h3 id="not-all-tasks-or-use-cases-require-a-tool-calling-llm">Not all tasks or use-cases require a tool-calling LLM</h3>

<p>When I started this project, I was excited about having a language model interact with a program I’ve been using.
Now I realize that drawing pixel art might not be the best use-case for a tool-calling LLM.
Creating an MCP server requires significant time investment—I could have just drawn inspiration from an image generation model or Pinterest and created the swordsman myself.
Better Aseprite MCP use-cases might include asking an AI assistant to export drawings into Godot-compatible spritesheets, recolor artwork with different palettes, or correct pixel dimensions for isometric art.
There are many possibilities for augmenting human workflows with tool-use LLMs.</p>

<p>From a researcher’s perspective, it’s worth <strong>assessing whether domains and use-cases in famous benchmarks like BFCL truly reflect how tool-calling LLMs are used.</strong>
Many of these were scraped from GitHub or API repositories, but these APIs’ affordances differ greatly from actual tool usage.
For example, the <code class="language-plaintext highlighter-rouge">draw_pixels</code> function from my MCP server might be a valid test case, but it doesn’t reflect the complexity and sequential nature of real tool-calling tasks.
Perhaps this explains why Claude Code has worked so well and gained traction in the developer community.
Instead of going broad, it went deep into a particular use-case—software development—and optimized for that.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<h2 id="final-thoughts">Final thoughts</h2>

<p>In this blog post, I learned about developer tooling for agents by having function-calling LLMs interact with Aseprite.
The results were mixed: Claude Opus 4 generated the most creative pixel art while following instructions.
I was surprised that models like GPT-4o and GPT-4.1 weren’t very creative.
This exercise revealed interesting avenues for future work.
I’m curious about endowing LLMs with the capability to create their own tools when necessary.
I’m also interested in evaluation, especially for complex and domain-specific tasks.
Finally, I’m starting to understand why MCP—or a general protocol for LLM interaction—is important.
Writing an MCP server feels more convenient than building REST endpoints.
It’s still early, but I hope MCP becomes more prevalent in the future.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Maybe instead of focusing on a domain (e.g. coding, general chat, graphic design) when developing and evaluating models, we focus on a certain profession or occupation (i.e., software developer, executive assistant, designer). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="pixel art" /><category term="aseprite" /><category term="llm" /><category term="nlp" /><category term="tool-calling" /><category term="mcp" /><category term="claude" /><category term="gpt" /><summary type="html"><![CDATA[Just a fun weekend experiment on model-context protocol (MCP): I asked several tool-calling LLMs to draw a 4-frame spritesheet of a swordsman performing a slash attack using an Aseprite MCP I built. The results were interesting!]]></summary></entry><entry><title type="html">Desiderata for Filipino NLP in the Age of LLMs</title><link href="https://ljvmiranda921.github.io/notebook/2024/12/17/filipino-llm/" rel="alternate" type="text/html" title="Desiderata for Filipino NLP in the Age of LLMs" /><published>2024-12-17T00:00:00+08:00</published><updated>2024-12-17T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2024/12/17/filipino-llm</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2024/12/17/filipino-llm/"><![CDATA[<p><span class="firstcharacter">B</span>ack when I started working in Filipino NLP, my standard approach to training models was to encode linguistic knowledge via meticulous data annotation, feature engineering, and extensive testing.
Take <a href="https://ljvmiranda921/calamanCy">calamanCy</a> for example: we spent countless hours annotating <a href="https://aclanthology.org/2023.sealp-1.2/">named-entity recognition (NER)</a> datasets and validating <a href="https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl">dependency parsing treebanks</a> to train statistical models using frameworks like <a href="https://spacy.io">spaCy</a>.
Back then, these components typically powered larger systems that handled tasks like information extraction and question answering. But lately, we’ve seen how LLMs can tackle these tasks end-to-end, without needing any of these components at all.</p>

<p>The pace of progress in multilingual LLMs has also been relentless this year.
We’ve seen several high-profile releases such as Meta’s <a href="https://ai.meta.com/blog/meta-llama-3-1/">Llama 3.1</a>, Cohere for AI’s <a href="https://cohere.com/blog/aya-expanse-connecting-our-world">Aya Expanse</a>, and Southeast Asian models such as AI Singapore’s <a href="https://sea-lion.ai/">SEA-LION</a> and Sea AI’s <a href="https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b">Sailor</a>.
Sure, one can argue that building custom-made pipelines is still viable because current LLMs don’t always work well on Filipino (and I’ve argued this before [<a href="/notebook/2023/08/04/llm-tagalog/">1</a>] [<a href="/notebook/2024/07/02/talk-dlsu/">2</a>]) and that there are <a href="https://speakerdeck.com/inesmontani/applied-nlp-with-llms-beyond-black-box-monoliths">advantages in building modular, non-black box components</a>,
but the <a href="https://huggingface.co/CohereForAI/aya-101">impressive gains in multilinguality</a> from <a href="https://arxiv.org/abs/2001.08361">data scale</a> have changed my perspective somewhat.</p>

<p>While I still believe in building artisanal Filipino NLP resources, I now see that <strong>we need to simultaneously support the development of multilingual LLMs by creating high-quality Filipino datasets and benchmarks.</strong>
This way, we can <strong>actively push for the inclusion of Philippine languages in the next generation of multilingual LLMs</strong>, rather than just waiting for improvements to happen on their own.</p>

<blockquote>
  <p>We need to focus on creating high-quality datasets and evaluation benchmarks [to] actively push for the inclusion of Philippine languages in the next generation of multilingual LLMs.</p>
</blockquote>

<p>Filipino is still left behind even in the age of LLMs (<a href="#final-thoughts-are-we-truly-low-resource">more on this later</a>).
Sometimes I feel a tinge of sadness when a research group releases a new multilingual LLM and Filipino is not supported.
You can’t blame them: there isn’t a lot of readily available Filipino data for LM training and evaluation.
This is even more true for other Philippine languages such as Hiligaynon, Kapampangan, and Ilokano.
There are still missing pieces ripe for research.</p>

<p>In this blog post, I want to talk about three actionable directions for Filipino NLP: (1) create resources that support LLM post-training, (2) build reliable benchmarks for Filipino, and (3) participate in grassroots research and annotation efforts.
<strong>Also, if you are interested in collaborating on these kinds of efforts, feel free to <a href="mailto:ljvmiranda@gmail.com">reach out</a>!</strong></p>

<h3 id="create-resources-that-enable-llm-post-training">Create resources that enable LLM post-training</h3>

<p style="border:3px; border-style:solid; border-color:#a00000; padding: 1em;">
<b>Key Insight:</b> 
Better to focus on post-training since it requires relatively lower investment than pretraining.
Prioritize collecting instruction finetuning datasets since it’s in this step where we usually observe significant performance gains. 
Better if they contain general chat, but domain-specific data should also work.
</p>

<p>Post-training is the stage in the large language modelling pipeline where we adapt a pretrained model to a specific style of input for chat interactions, such as following natural language instructions or responding in accordance with human preferences.
This stage usually involves two main steps: instruction finetuning (IFT) and preference finetuning (PreFT).
I want to focus on the former.
Most IFT data comes in question-answer pairs containing a <em>user instruction</em>, an optional <em>context</em>, and a given <em>response</em>.</p>

<!-- PreFT data, on the other hand, consists of human preferences on model outputs, which can be collected either [manually](https://arxiv.org/abs/2204.05862) or using [another language model](https://arxiv.org/abs/2310.01377) (or a [combination of both](https://arxiv.org/abs/2410.19133)). -->

<p>For the next year or so, I believe there’s a <strong>more urgent need for Filipino IFT datasets.</strong></p>

<p style="text-align: center;"><img src="/assets/images/filipino-llm/llm_training.png" alt="" width="700px" /><br />
<em>A simple language modelling pipeline (as seen in models like InstructGPT, Tulu 2, etc.).<br />
Currently, we lack quality Filipino data for post-training.</em></p>

<p>I want to focus on collecting IFT data because it can be <strong>tailored to specific domains</strong> and is <strong>more economical to run experiments with</strong>.
This means that NLP researchers interested in Filipino can still continue focusing on their own domains of interest while still contributing to this larger goal of improving our Filipino data pool.
Take <a href="https://arxiv.org/abs/2406.07835">SciRIFF</a> for example: it contains question-answer pairs over scientific literature that serve the authors’ own purposes, yet we were able to use it in <a href="https://arxiv.org/abs/2411.15124">Tülu 3</a> to build <em>generalist language models</em> that are capable of chat, reasoning, coding, and other skills.
In addition, <strong>IFT is computationally cheaper than pretraining</strong>; laboratories with a decent grant and cloud capacity can <a href="https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#hardware-requirement">easily finetune a 7B-parameter model</a>.
Preference data is also important, but collecting it requires more annotation effort and stronger multilingual models that <em>actually work</em> in Filipino (for that we need good evaluation, which I’ll discuss in the <a href="#build-reliable-benchmarks-for-filipino">next section</a>).</p>

<blockquote>
  <p>Philippine languages lack quality instruction finetuning data</p>
</blockquote>

<p><strong>As of now, Philippine languages lack quality IFT data.</strong>
The best we have so far is the Aya dataset, with around 1.46k samples for Tagalog and 4.12M for Cebuano.
At first glance, Cebuano looks promising with more than a million examples, but upon inspection, the majority of these examples were translated from another language (possibly English) or were derived from the Cebuano Wikipedia, which is <a href="https://en.wikipedia.org/wiki/Cebuano_Wikipedia">mostly synthetic and unnatural</a>.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
There are several ways to collect IFT data. We can (1) annotate our own datasets, (2) translate existing English IFT datasets into Filipino, or (3) repurpose existing Filipino datasets into a question-answering format.
It’s important to note that these datasets don’t need to focus on general chat.
Researchers can continue working in their domains of interest while reframing their problems as question-answering tasks.</p>
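<p>As a rough sketch of option (3), here is how a labeled example could be reframed as an instruction-context-response record. Everything here is an illustrative assumption: the <code class="language-plaintext highlighter-rouge">IFTExample</code> class, the <code class="language-plaintext highlighter-rouge">nli_to_ift</code> helper, the field names, the prompt wording, and the sample sentences are hypothetical and do not mirror any particular dataset’s schema.</p>

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class IFTExample:
    instruction: str  # what the user asks the model to do
    context: str      # optional supporting text (may be empty)
    response: str     # the target answer


def nli_to_ift(premise, hypothesis, label):
    """Reframe a natural language inference example as a question-answer pair."""
    instruction = (
        "Batay sa premise, tukuyin kung ang hypothesis ay "
        "entailment, contradiction, o neutral."
    )
    context = f"Premise: {premise}\nHypothesis: {hypothesis}"
    return IFTExample(instruction=instruction, context=context, response=label)


# Hypothetical sample, written only for illustration.
example = nli_to_ift(
    premise="Umulan nang malakas sa Maynila kahapon.",
    hypothesis="Maaraw sa Maynila kahapon.",
    label="contradiction",
)
print(json.dumps(asdict(example), ensure_ascii=False, indent=2))
```

<p>The same pattern applies to other task types: only the instruction template and the way the context is assembled need to change, so existing labeled corpora can be converted without new annotation.</p>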

<p>Ideally, we should collect Filipino IFT instances <strong>on the order of hundreds of thousands (100K-400K).</strong>
It would be even better if these instances were evenly distributed across our major languages (e.g., Tagalog, Cebuano, Ilokano, Hiligaynon).
Once we have this dataset, it will then be easier for us, the Filipino language community, to train our own generalist LLMs.
In addition, it makes it easier for other organizations to incorporate our dataset into their own data mixing pipelines, thereby increasing the representation of Filipino in these larger-scale LM projects.
Collecting 100K instances seems daunting, but I already have some ideas in mind.</p>

<h3 id="build-reliable-benchmarks-for-filipino">Build reliable benchmarks for Filipino</h3>

<p style="border:3px; border-style:solid; border-color:#a00000; padding: 1em;">
<b>Key Insight:</b> 
We need to systematically answer the question: <i>“Does this LLM work on Filipino?”</i>
Although we already have several datasets that examine various aspects of the Filipino language,
we need to curate which of these are relevant for LLM evaluation then build an evaluation suite.
</p>

<p>Even before we start training language models, it is important to measure how current state-of-the-art LLMs perform on Filipino.
Most of the evidence I see is anecdotal: someone will post a single ChatGPT screenshot in Filipino and claim that the model already <em>understands</em> the language.
However, it is easy to see how this performance can degrade during extended conversations, such as misunderstanding idioms and expressions or lacking cultural knowledge.
We need a <strong>systematic approach to evaluating these models.</strong></p>

<p>We have several promising benchmarks at hand.
For example, <a href="https://huggingface.co/datasets/aisingapore/kalahi">KALAHI</a> tests an LLM’s ability to discern the correct response in culturally-specific situations that Filipinos face in their day-to-day lives.
<a href="https://huggingface.co/datasets/jcblaise/newsph_nli">NewsPH NLI</a> and <a href="https://huggingface.co/datasets/NLPinas/EMoTES-3K">EMoTES-3k</a> are also relevant as they reflect some of the potential questions that one might ask an LLM.
I believe that through the years, we have developed several datasets that test different facets of the Filipino language.
We need to <strong>scour and curate</strong> them to filter those that are relevant for Filipino use-cases.
Several frameworks allow us to do this, such as <a href="https://github.com/EleutherAI/lm-evaluation-harness">Eleuther AI’s harness</a> and <a href="https://github.com/huggingface/lighteval">HuggingFace’s lighteval</a>, which enable us to seamlessly evaluate multiple LMs at scale.</p>
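
<p>As a rough illustration of what “systematic” means here, the sketch below aggregates per-example predictions into an accuracy score for every (model, task) pair. It is plain Python and deliberately does not use the lm-evaluation-harness or lighteval APIs; the record fields and the model and task names are hypothetical.</p>

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Compute accuracy for each (model, task) pair.

    Each record is a dict with the assumed keys:
    model, task, prediction, gold.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["model"], record["task"])
        total[key] += 1
        correct[key] += int(record["prediction"] == record["gold"])
    return {key: correct[key] / total[key] for key in total}


# Toy records, made up for illustration; these are not real results.
toy_records = [
    {"model": "model-a", "task": "nli", "prediction": "Oo", "gold": "Oo"},
    {"model": "model-a", "task": "nli", "prediction": "Hindi", "gold": "Oo"},
    {"model": "model-b", "task": "nli", "prediction": "Oo", "gold": "Oo"},
]
for (model, task), score in sorted(accuracy_by_task(toy_records).items()):
    print(model, task, score)
```

<p>A table like this, computed over many curated Filipino tasks, is what lets us move from screenshots to claims we can defend.</p>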

<p>Creating a language-specific benchmark is useful because it <strong>serves not just the academic community for that language but also the industry at large.</strong>
It allows us to say <em>“this LLM works on these specific Filipino tasks but fails at others”</em> in academia and helps advise industry practitioners on which LLMs work on their particular use cases.
This also opens up several potential avenues to advocate particular research directions for Filipino— focusing on language X, building a language-specific LLM, etc.— because we have metrics to hold ourselves to (while acknowledging <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart’s Law</a>).
Basically, we just need something instead of nothing, and that something is a huge step forward.</p>

<h3 id="participate-in-grassroots-research-efforts">Participate in grassroots research efforts</h3>

<p style="border:3px; border-style:solid; border-color:#a00000; padding: 1em;">
<b>Key Insight:</b> 
There are several large-scale grassroots research efforts happening as we speak.
We need a lot of Filipino representation in these initiatives.
Here’s an opportunity: join our <a href="https://grassroots.science">Grassroots Science effort</a> that we will launch next year!
</p>

<p>For the past few years, I’ve witnessed several grassroots NLP efforts that led to significant breakthroughs in the multilingual world.
<a href="https://seacrowd.github.io/">SEACrowd</a> is one example.
They were able to rally a community of researchers from Southeast Asia (SEA) and <a href="https://arxiv.org/abs/2406.10118">build a data hub for all SEA datasets</a>, which is very much needed today.
Other examples include <a href="https://share.hsforms.com/10OrjljwpQ52ILJA6ftENIwch5vw">Cohere for AI’s Aya</a> and HuggingFace’s <a href="https://huggingface.co/data-is-better-together">Data is Better Together</a> projects.
Right now, it’s nice to see familiar Filipino faces participating in these communities, but <strong>it would be nice if we can increase our involvement in these larger grassroots projects</strong>.</p>

<p style="text-align: center;"><img src="/assets/images/filipino-llm/seacrowd.jpg" alt="" width="500px" /></p>

<p>I strongly believe that we can achieve true impact in multilingual LLM research via a <strong>participatory approach</strong>.
This means actively collaborating with researchers and immersing ourselves in grassroots efforts focused on data annotation, data curation, and model testing.
This approach stands in contrast to the <a href="https://www.washingtonpost.com/world/2023/08/28/scale-ai-remotasks-philippines-artificial-intelligence/">sweatshop model</a>, where Filipino annotators, though recruited, are excluded from meaningful participation since they merely follow annotation guidelines set by their employers without input into the process.</p>

<p><strong>Here’s a call to action</strong>: we’re launching a <a href="https://grassroots.science/">year-long Grassroots effort</a> to collect preference data for LLM post-training in different languages.
It would be awesome to see fresh Filipino faces helping us out— so <a href="https://docs.google.com/forms/d/e/1FAIpQLSeLI-bwV0VYdwmRRqAzHtTSBMajNkUL-DG97LASSD2RmIZ1SQ/viewform">join us</a>!
This is important because collecting human preferences, even in English, is still tricky due to their subjectivity and diversity.
Things that may be harmless to some cultures might be harmful for us.
I believe it is important for us, the Filipino research community, to have a say on how the next generation of multilingual LLMs will be trained.</p>

<p style="text-align: center;"><img src="/assets/images/filipino-llm/grassroots.jpg" alt="" width="700px" /></p>

<h3 id="final-thoughts-are-we-truly-low-resource">Final thoughts: are we truly ‘low-resource’?</h3>

<p>“Low-resource” is a term the NLP research community uses to describe languages that lack sufficient resources for building language technologies.
Many indigenous and endangered languages fall into this category due to their limited number of speakers and dedicated NLP researchers.
Tagalog occupies an interesting middle ground: while we have a large speaker population and presumably extensive written content, there remains a scarcity of readily available datasets for downstream NLP tasks.</p>

<p>One of my favorite papers this year, <a href="https://arxiv.org/pdf/2410.20817"><em>The Zeno’s Paradox of Low-Resource Languages</em></a>, helped clarify these definitions by examining how we define “low-resource” across different axes: Artifacts, Resources, Socio-Political factors, and Agency.
Although Tagalog has millions of speakers (<strong>↑ Resources</strong>), it still lacks high-quality data for several core NLP and language modelling tasks (<strong>↓ Artifacts</strong>), and there remains significant room for growth in our participation in developing these language technologies (<strong>· Agency</strong>).
I appreciate this framework because it provides multiple dimensions for measuring a language’s low-resource status, eliminating the need to debate or bikeshed new definitions.</p>

<p>I maintain that Philippine languages remain low-resource across several dimensions.
Even Tagalog, our majority language, still lacks the necessary tools and datasets to produce robust NLP pipelines.
I believe the three research directions I described above can both increase the number of artifacts available for building language technologies and enhance our agency as a research community.
I admit that I haven’t done enough for Filipino NLP this year<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> and this blog post serves not just as a research statement but <strong>also a commitment to improve my involvement in this language.</strong>
I have some ideas (the ideas in this blog post are just a small part of it), so if you want to help out, <a href="mailto:ljvmiranda@gmail.com">feel free to reach out</a>!</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The Cebuano Wikipedia is the second-largest Wikipedia in terms of number of articles. Although this appears impressive, its size is due to an article-generating bot called <a href="https://en.wikipedia.org/wiki/Lsjbot">Lsjbot</a> rather than a dedicated group of Wikipedia volunteers. Unfortunately, the articles in Cebuano Wikipedia are unnatural and do not reflect how the language is actually used by native speakers. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>This year we published <a href="https://aclanthology.org/2024.emnlp-main.296/">SEACrowd</a>, <a href="https://aclanthology.org/2024.naacl-long.243/">Universal NER</a>, and the <a href="https://huggingface.co/collections/UD-Filipino/universal-dependencies-for-tagalog-67573d625baa5036fd59b317">largest Tagalog UD Treebank</a>, but most of these efforts started back in 2023. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="tagalog" /><category term="filipino" /><category term="nlp" /><category term="llm" /><summary type="html"><![CDATA[The rise of LLMs is forcing us to rethink Filipino NLP. But there's still a ton of work to do&mdash;just not the stuff you might think. Here's my take on what's worth doing, what's a waste of time, and where Filipino NLP research should be heading.]]></summary></entry><entry><title type="html">Guest lecture @ DLSU Manila: Artisanal Filipino NLP Resources in the time of Large Language Models</title><link href="https://ljvmiranda921.github.io/notebook/2024/07/02/talk-dlsu/" rel="alternate" type="text/html" title="Guest lecture @ DLSU Manila: Artisanal Filipino NLP Resources in the time of Large Language Models" /><published>2024-07-02T00:00:00+08:00</published><updated>2024-07-02T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2024/07/02/talk-dlsu</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2024/07/02/talk-dlsu/"><![CDATA[<p><span class="firstcharacter">I</span> was invited to give a talk to a graduate-level NLP class about my work on Filipino resources.
It was fun preparing and giving that talk because I was able to synthesize my thoughts and look back on my previous research.
This blog post is my lecture in text format.
<strong>You can find the slides in this <a href="https://docs.google.com/presentation/d/10wrKZoBouh3agrgkTLjvJN9g9WwqdrjiwWSSztoyWdA/edit?usp=sharing">link</a> (and here’s the <a href="https://youtu.be/oY-uQ66z3Ss?si=RxXyxb52TQ70356R">video</a>).</strong>
Finally, thank you to <a href="https://www.dlsu.edu.ph/colleges/ccs/faculty-profile/cheng-charibeth/">Dr. Charibeth Cheng</a> for the invitation!</p>

<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vTRVq0Lo3adiVDkcgJpsuz0RHDFvFWS-9gj2r2kg2dIi-33BvnRSYH1FyOiUQ0dSys_sT44f7uyHigz/embed?start=false&amp;loop=true&amp;delayms=3000" frameborder="0" width="720" height="434" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

<p> </p>

<hr />

<p> </p>

<p>Given all the rage in LLMs today, is it still worth it to build artisanal Filipino NLP resources?
Hopefully we can answer this question in the context of the work that I’ve done.
I’ve worked on large models in the 7B-70B parameter range, but I’ve also done several things for low-resource languages and built models in the ~100M parameter range.
Tonight, I want to juxtapose these two sides and call one as <em>artisanal</em> and the other as <em>large-scale</em>.</p>

<h2 id="what-is-artisanal-nlp">What is artisanal NLP?</h2>

<p>In this talk, I want to contrast two types of ideas when building language technologies.
You have <em>artisanal</em> on one end, as shown by this photo of handmade pottery, carefully constructed by its creator.
On the other end, you have these “mass-produced” objects made at <em>large scale</em>.
I want to differentiate them in terms of three dimensions.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide00.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<ol>
  <li>
    <p><strong>Effort</strong>: artisanal NLP resources often require specialized knowledge and effort.
For example, it is important to know something about Filipino grammar and morphology when building task-specific models for Filipino.
Large-scale models can get away with just providing large amounts of data (as we see in most web-scale datasets).</p>
  </li>
  <li>
    <p><strong>Size</strong>: currently, our definitions of what’s small or large change every month.
But in general, artisanal models are relatively smaller in terms of parameter size.
Large-scale models need to be bigger because they have to accommodate more tasks and domains.</p>
  </li>
  <li>
    <p><strong>Utility</strong>: artisanal models and datasets tend to be bespoke, i.e., built for a specific task or requirement.
On the other hand, most large-scale models we see today were made for general-purpose applications.</p>
  </li>
</ol>

<p>Notice that I’m a bit vague about whether I’m talking about models or datasets.
I like to think of artisanal vs. large-scale as an <em>attitude</em> for building language technologies or NLP artifacts.
Finally, this talk is not about one being better than the other.
However, I want to focus more on the merits of these artisanal NLP resources by discussing parts of my research and my work.</p>

<p>You can see the outline of my lecture below.
For the rest of this talk, I’ll fill the blanks and talk about the merits of artisanal NLP while discussing portions of my research.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide01.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<h2 id="artisanal-nlp-resources-are-high-effort-but-impactful-for-the-language-community">Artisanal NLP resources are high-effort, but impactful for the language community</h2>

<p>In this section, I want to talk about <a href="https://huggingface.co/datasets/ljvmiranda921/tlunified-ner">TLUnified-NER</a> (link to <a href="https://aclanthology.org/2023.sealp-1.2/">SEALP ‘23 paper</a>), a Named-Entity Recognition dataset that I’ve built.
NER is a classical NLP problem: given a text, you want to look for named-entities such as names of persons, locations, or organizations.</p>
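
<p>If you haven’t worked with NER data before, here’s a small sketch of what the task’s output looks like in the common BIO tagging scheme, and how tagged tokens are grouped back into entity spans. The sentence and its tags are made up for illustration; they are not drawn from TLUnified-NER.</p>

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O" tag or a stray "I-" tag: close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["Pumunta", "si", "Juan", "dela", "Cruz", "sa", "Maynila", "."]
tags = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('Juan dela Cruz', 'PER'), ('Maynila', 'LOC')]
```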

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide02.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>This is already an easy task for English NER.
However, NER resources for Tagalog are still lacking.
We don’t have a lot of labeled data, and in consequence we don’t have a lot of models.
There are many ways to get around this problem (e.g., cross-lingual transfer learning, zero-shot from an LLM),
but we still lack reliable test sets for evaluation.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide03.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>In my opinion, a good NER dataset should be open-access, high-quality, and standardized.
Most of the NER datasets available to us only fill two of these three criteria:
<a href="https://huggingface.co/datasets/unimelb-nlp/wikiann">WikiANN</a> has general-purpose tags that follow CoNLL and can be downloaded from HuggingFace, but the <a href="https://arxiv.org/pdf/2202.12288">quality of its annotations is pretty bad</a>.
<a href="https://catalog.ldc.upenn.edu/LDC2023T02">LORELEI</a> is a high-quality dataset, but it has strict license restrictions and is quite expensive!
Finally, we have several hand-annotated datasets for Filipino, but most of them were made for highly-specific tasks.
Back in 2022, there was an obvious gap to fill for Tagalog NER.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide04.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide05.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide06.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide07.png" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>And so we built TLUnified-NER.
It is publicly accessible, high-quality, and follows the CoNLL standard.
I also curated the texts to ensure that they represent how we commonly write Tagalog.
The annotation process was done over several rounds (or sprints).
I hired two more annotators, and for each round we annotated a batch of examples, evaluated the annotated batch, and updated the annotation guidelines to improve quality.
You can learn more about this process in <a href="https://aclanthology.org/2023.sealp-1.2.pdf"><strong>our paper</strong></a>.
I also wrote some of my thoughts on the annotation process <a href="/notebook/2023/07/03/devlog-calamancy/">in a blogpost</a>.</p>
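
<p>One common way to “evaluate the annotated batch” in a round like this is to measure how often two annotators agree beyond chance. The sketch below computes Cohen’s kappa over token-level labels. This is only an illustration of the general idea under that assumption; the exact agreement metric and procedure we used are described in the paper, and the labels here are made up.</p>

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["PER", "LOC", "O", "O"]
annotator_2 = ["PER", "O", "O", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

<p>A low score after a round is a signal that the guidelines are ambiguous somewhere, which is exactly what the guideline updates are meant to fix.</p>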

<p>After building the dataset, there are two questions that I want to answer:
first, is the NER task learnable from our annotations? Then, is it better than existing NER datasets?
For the first one, I created baseline approaches using various mixes of word embeddings and language coverage.
For all cases, we achieved decent performance.
Then for the second question, we compared a model trained on WikiANN and a model trained on TLUnified-NER.
In most cases, our model outperforms the WikiANN model, showing that it’s a better dataset to train models upon.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide07.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide08.png" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>To end this part of the talk, I want to show that <strong>NER is just one piece of the NLP puzzle.</strong>
There are still a lot of tasks to build resources on.
I believe that increasing the coverage of Filipino resources allows us to not only train models, but create comprehensive evaluation benchmarks for existing LLMs today.
Recently, we’ve seen a lot of claims that LLMs can “speak” Filipino, but most of these are cherry-picked examples and highly vibes-based.
If we can create a systematic benchmark that allows us to confidently claim performance, then that would be a big contribution to the field.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide09.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<h2 id="artisanal-nlp-resources-are-capable-of-doing-a-few-things-but-can-do-them-well">Artisanal NLP resources are capable of doing a few things, but can do them well</h2>

<p>In this section, I’ll talk about <a href="https://github.com/ljvmiranda921/calamanCy">calamanCy</a>, a spaCy-based toolkit that I built for Tagalog (<a href="https://aclanthology.org/2023.nlposs-1.1/">NLP-OSS ‘23 paper</a>).
As you already know, spaCy is a toolkit for core linguistic tasks such as dependency parsing, tokenization, and NER.
However, most of the models we provide in-house are focused on high-resource languages.</p>

<p>What most people do is they finetune spaCy pipelines on their own language or domain.
So you’ll see libraries in the <a href="https://spacy.io/universe">spaCy universe</a> for all kinds of applications.</p>

<!-- - Multilinguality: [DaCy](https://github.com/centre-for-humanities-computing/DaCy) for Danish and [HuSpaCy](https://github.com/huspacy/huspacy) for Hungarian.
- Domain-specific texts: [Hobbit-spaCy](https://github.com/wjbmattingly/hobbit-spacy), [scispaCy](https://github.com/allenai/scispacy), and [medspaCy](https://github.com/medspacy/medspacy) for fictional, scientific, and medical texts.
- Ancient languages: [latinCy](https://huggingface.co/latincy) for Latin, [greCy](https://github.com/jmyerston/greCy) for Greek, and [**my work on SIGTYP '24**](https://aclanthology.org/2024.sigtyp-1.18/) on several Ancient & Medieval languages. -->

<table>
  <thead>
    <tr>
      <th>Domain</th>
      <th>Example libraries</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Multilinguality</td>
      <td><a href="https://github.com/centre-for-humanities-computing/DaCy">DaCy</a> for Danish and <a href="https://github.com/huspacy/huspacy">HuSpaCy</a> for Hungarian.</td>
    </tr>
    <tr>
      <td>Scientific texts</td>
      <td><a href="https://github.com/allenai/scispacy">scispaCy</a>, and <a href="https://github.com/medspacy/medspacy">medspaCy</a> for scientific and medical texts.</td>
    </tr>
    <tr>
      <td>Old Languages</td>
      <td><a href="https://huggingface.co/latincy">latinCy</a> for Latin, <a href="https://github.com/jmyerston/greCy">greCy</a> for Greek, and <a href="https://aclanthology.org/2024.sigtyp-1.18/"><strong>my work on SIGTYP ‘24</strong></a> on several Ancient &amp; Medieval languages.</td>
    </tr>
  </tbody>
</table>

<p>So this prompted the question: why don’t we build a spaCy pipeline for Tagalog?</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide10.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>But what does it mean to build a spaCy pipeline?</p>

<p>First, think of a <a href="https://spacy.io/usage/processing-pipelines">spaCy pipeline</a> as a series of functions that identifies key linguistic features in a text.
So a tokenizer is a function that looks for tokens, a tagger is a function for parts-of-speech (POS) tags, and so on.
Then at the end, you obtain a <a href="https://spacy.io/api/doc"><code class="language-plaintext highlighter-rouge">Doc</code> object</a> that contains all these linguistic features.</p>
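
<p>To make the “series of functions” idea concrete, here’s a toy sketch in plain Python. It is not spaCy’s actual API: <code class="language-plaintext highlighter-rouge">ToyDoc</code>, the whitespace tokenizer, and the tiny gazetteer are hypothetical stand-ins I made up, and real components are trained models rather than lookups.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical container that accumulates linguistic features."""
    text: str
    tokens: list = field(default_factory=list)
    ents: list = field(default_factory=list)


def tokenizer(doc):
    # Simplification: split on whitespace only.
    doc.tokens = doc.text.split()
    return doc


# Made-up lookup table standing in for a trained NER component.
GAZETTEER = {"Maynila": "LOC", "Cebu": "LOC"}


def entity_lookup(doc):
    doc.ents = [(tok, GAZETTEER[tok]) for tok in doc.tokens if tok in GAZETTEER]
    return doc


def run_pipeline(text, components):
    """Pass one document through each component in order."""
    doc = ToyDoc(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("Lumipad siya mula Maynila papuntang Cebu", [tokenizer, entity_lookup])
print(doc.tokens)
print(doc.ents)
```

<p>The useful property is that each component only adds annotations to the shared document, so you can swap or retrain one function without touching the others.</p>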

<p>Building a spaCy pipeline means training these functions and composing them into a single package.
In fact, we have a well-documented master thread on <a href="https://github.com/explosion/spaCy/discussions/3056">how to add pretrained language support for spaCy</a>.
Most of these functions are based on <a href="https://spacy.io/api/architectures">neural network architectures</a>, and hence require some non-trivial amount of data to train.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide11.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>One of the hardest parts of building <a href="https://github.com/ljvmiranda921/calamanCy">calamanCy</a> is curating datasets to “train” these functions.
For example, the current Tagalog treebanks are <a href="/notebook/2022/04/24/low-resource-dep-parse/">too small to train a reliable dependency parser and POS tagger</a>.
Also, TLUnified-NER didn’t exist back then, so I still had to build it.
You can read more about this curation process in the <a href="https://aclanthology.org/2023.nlposs-1.1/">calamanCy paper</a>.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide12.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>It was a long process, but the most important question is: was it worth it?
To that I remember this figure from Matthew Honnibal’s blog post on <a href="https://explosion.ai/blog/against-llm-maximalism">LLM maximalism</a>.
I think there is value in curating these datasets and training these models, as it helps me understand which parts of the linguistic pipeline really require an LLM and which could be done more reliably by a simpler approach.
In addition, we were also able to show empirically that models trained on calamanCy perform pretty well, even compared to commercial LLM APIs.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide13.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>As shown in the charts below, we found that even commercial LLMs like GPT-4 and Claude don’t fare well on our test set in a zero-shot setting.
There are many possible reasons, of course.
One major reason is that these models aren’t optimized for multilinguality, and hence have a tiny amount of Tagalog texts in their corpora.
I’d love to revisit this experiment soon, especially with the release of multilingual LLMs such as <a href="https://huggingface.co/SeaLLMs">SeaLLM</a>, <a href="https://aisingapore.org/aiproducts/sea-lion/">Sea-LION</a>, and <a href="https://huggingface.co/CohereForAI/aya-101">Aya-101</a>.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide14.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide15.png" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>So…even if artisanal NLP models can do a few things (but do them well), I’m still optimistic that there are opportunities to do more things while doing them well.
I want you to remember this chart below.
There are a lot of opportunities to work on datasets used for training models with general-purpose capabilities and/or building task-specific models.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide16.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<h2 id="artisanal-nlp-resources-may-not-be-the-most-mainstream-but-fills-vital-research-gaps">Artisanal NLP resources may not be the most mainstream, but fill vital research gaps</h2>

<p>As we all know, LLMs are all the rage today.
In the past two years, there has been an explosion of open LLMs, and way more soon!
However, working on LLM research is costly— a high-spec consumer-grade machine might not be enough to finetune a 7B-parameter model.
So how do you contribute to the field if you are GPU-poor?</p>

<p>I will answer this in the context of my collaborations and other works.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide17.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>The first option is to continue building resources, but at scale.
It’s quite common to see researchers first work on a single language (mostly that of their native tongue) and then move on to multilinguality, as the latter presents a different opportunity for creativity.</p>

<p>For <a href="https://aclanthology.org/2024.naacl-long.243/">Universal NER</a>, we created multilingual NER corpora for several languages, all based on treebanks in the <a href="https://universaldependencies.org/">Universal Dependencies (UD)</a> framework.
This effort allows us to create a consistent and standardized annotation for 13 diverse languages— very important for multilingual research.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide18.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>As an illustration, we can do cross-lingual transfer learning as shown below, where we compare the performance of different models trained on a source language for different test set languages.
We can even show that for low-resource languages like Tagalog, we can get by with training an NER model from a Portuguese treebank as an alternative.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide19.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>Another option is to figure out areas where LLMs are lacking.
Most LLMs right now aren’t explicitly trained to be multilingual.
They’re incidentally multilingual, probably because of some stray artifacts in their pretraining data.
But for the past few months, we’ve seen the rise of multilingual LLMs such as Aya-101, Sea-LION, and more.
Are they actually better for multilingual data?</p>

<p>That’s one of the questions we sought to answer in the <a href="https://arxiv.org/abs/2406.10118">SEACrowd project</a>.
First, we crowdsourced datasets all over Southeast Asia.
This allows us to create a data hub for SEA-specific resources.
The data hub, in itself, is already a big contribution.
It also allowed me to see available datasets from the Philippines, and there’s actually quite a lot!</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide20.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>Then, we curated a multilingual SEA benchmark for both NLU (natural language understanding) and NLG (natural language generation) tasks.
Upon testing, we found that LLMs trained for multilinguality actually perform quite well (e.g., Aya-101 13B).
Even LLMs targeted for SEA languages are competitive with commercial APIs.</p>

<p>What I found very interesting is that when testing for generation “quality” (i.e., are the generated texts natural-sounding or translationese?), we found that even the best LLM is still only 58% natural-sounding.
There’s still a long way to go to improve the quality of generations.
As of now, everything is still vibes-based, but having an empirical benchmark certainly helps!</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide21.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-dlsu/slide22.png" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<h2 id="conclusion-what-types-of-artisanal-nlp-contributions-can-you-pursue">Conclusion: what types of artisanal NLP contributions can you pursue?</h2>

<p>I want to close this talk by showing you this chart.
There are a lot of ways to improve Filipino NLP; it just depends on what you’re interested in:</p>

<ul>
  <li><strong>Model-centric or data-centric</strong>: what types of artifacts are you excited to build? Do you want to curate datasets or do you want to build models?</li>
  <li><strong>General-purpose or specific</strong>: are you aiming to build towards generally-capable LLMs, or do you want to solve a specific task?</li>
  <li><strong>Training or evals</strong>: what is the purpose of the artifact you’re creating? Is it for training models or for evaluating existing models?</li>
</ul>

<p>There are opportunities for each quadrant of this chart.
I myself am interested in data-centric approaches for both task-specific and generally-capable models.
There are still a lot of domains that need foundational NLP resources (think Universal Dependencies treebanks), and several questions can still be asked regarding the datasets we use for training state-of-the-art LLMs.</p>

<p>Hopefully this inspires you to figure out what your project would be!</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-dlsu/slide23.png" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p> </p>

<hr />

<p> </p>

<h2 id="postscript">Postscript</h2>

<p>I really enjoyed preparing for this talk as it helped me synthesize my past work on Filipino NLP.
Right now, I’m finding ways to marry my current research (preference / alignment-tuning for LLMs) and my previous work on low-resource languages and multilinguality.
I’m quite inspired by works such as the <a href="https://arxiv.org/abs/2404.16019">PRISM Alignment Project</a>, <a href="https://arxiv.org/abs/2405.15032">Cohere’s Aya-23 dataset</a>, and the <a href="https://arxiv.org/abs/2406.18682">Multilingual Alignment prism</a>.
I think there are interesting questions at the intersection of multilinguality and preference data, but I don’t think our conclusions should be something like “Oh, people who speak language X prefer Y and Z” or “People from country A prefer B.”
Still a long way to go, but I’m excited to pursue these topics in the future!
If this caught your attention, feel free to <a href="mailto:ljvmiranda@gmail.com">reach out</a>!</p>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="tagalog" /><category term="nlp" /><category term="llm" /><category term="dlsu" /><category term="calamancy" /><category term="spacy" /><category term="natural language processing" /><category term="tagalog nlp" /><summary type="html"><![CDATA[Last month, I had another guest lecture, this time in Dr. Charibeth Cheng's graduate class in DLSU. Here, I talked about the craft of building small-scale yet effective NLP models for Filipino in the face of today's large language models.]]></summary></entry><entry><title type="html">A lexical view of contrast pairs in preference datasets</title><link href="https://ljvmiranda921.github.io/notebook/2024/03/12/contrast-pairs/" rel="alternate" type="text/html" title="A lexical view of contrast pairs in preference datasets" /><published>2024-03-12T00:00:00+08:00</published><updated>2024-03-12T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2024/03/12/contrast-pairs</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2024/03/12/contrast-pairs/"><![CDATA[<p><span class="firstcharacter">P</span>reference data is a staple in the final step of the LLM training pipeline.
During RLHF, we train a reward model by showing pairs of chosen and rejected model outputs so that it can teach a policy model how to generate more preferable responses.
The hope is that our reward model can capture the nuance and diversity of human judgment.</p>

<p>However, preference is subjective by nature, and few studies have tried articulating it.
For example, some looked into different aspects of a response’s helpfulness / harmlessness (<a href="https://arxiv.org/abs/2204.05862">Bai et al., 2022</a>) while others investigated surface-level characteristics like its length (<a href="https://arxiv.org/pdf/2310.03716.pdf">Singhal et al., 2023</a>).</p>

<p>In this blog post, I want to offer a different approach: <strong>what if instead of looking at qualitative aspects or token-level features, we use sentence embeddings?</strong>
Sentence embeddings capture a text’s lexical and semantic meaning in a high-dimensional vector space.
If so, can we ascertain lexical differences between chosen and rejected responses <em>just</em> by looking at text embeddings?</p>

<p>One reason this matters is synthetic data.
I think that it is easier to generate synthetic pairs conditioned on lexical distance (as opposed to some quality-based metric).
There may be some tasks and domains where generating with respect to cosine distances is plausible.</p>

<h2 id="getting-preference-data">Getting preference data</h2>

<p>First, I sampled preference data across different sources.
For bigger datasets such as SHP, I only took a particular subset I am interested in.
The table below shows the sources I used:</p>

<table>
  <thead>
    <tr>
      <th>Dataset</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>OpenAI’s Summarize from Human Feedback (<a href="https://arxiv.org/abs/2009.01325">Stiennon et al., 2020</a>)</td>
      <td>Dataset used to train a summarization reward model. I used the <code class="language-plaintext highlighter-rouge">comparisons</code> subset where each instance represents a matchup between two summaries.</td>
    </tr>
    <tr>
      <td>Stanford Human Preferences Dataset (<a href="https://proceedings.mlr.press/v162/ethayarajh22a.html">Ethayarajh et al., 2022</a>)</td>
      <td>Contains a collection of human preferences over responses to questions or instructions. I used the <code class="language-plaintext highlighter-rouge">explainlikeimfive_train</code> subset to represent OpenQA questions.</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/datasets/argilla/ultrafeedback-multi-binarized-quality-preferences-cleaned">Argilla’s Ultrafeedback Multi-Binarized Cleaned Dataset</a></td>
      <td>A clean version of the original Ultrafeedback dataset (<a href="https://arxiv.org/abs/2310.01377">Cui et al., 2023</a>). The cleanup process can be found <a href="https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences">in their writeup</a>.</td>
    </tr>
    <tr>
      <td>Tatsu Lab’s Alpaca Farm (<a href="https://arxiv.org/abs/2305.14387">Dubois et al., 2023</a>)</td>
      <td>The human preference subset of the Alpaca Farm dataset. The researchers used this subset to compare their LLM judge’s preferences.</td>
    </tr>
    <tr>
      <td><a href="https://huggingface.co/datasets/berkeley-nest/Nectar">Berkeley Nest Lab’s Nectar Dataset</a></td>
      <td>Preference ranking dataset for training the Starling 7B reward model (<a href="https://starling.cs.berkeley.edu/">Zhu et al., 2023</a>), and consequently, the Starling 7B language model.</td>
    </tr>
  </tbody>
</table>

<!-- talk about elo ranking for matchup-type datasets -->

<p>For OpenAI’s Summarize and SHP, the preferences are in the form of individual matchups.
To get the canonical chosen and rejected responses, I used the <a href="https://en.wikipedia.org/wiki/Elo_rating_system">Elo rating system</a> to obtain the top and bottom completions.</p>
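<p>To make the Elo step concrete, here is a minimal sketch of turning pairwise matchups into a chosen and rejected completion. The K-factor of 32, the starting rating of 1000, and the function names are assumptions for illustration; the post does not state the exact settings used.</p>

```python
from collections import defaultdict


def elo_ratings(matchups, k=32.0, initial=1000.0):
    """Compute Elo ratings from (winner, loser) pairs, processed in order.

    k and initial are illustrative defaults, not values from the post.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in matchups:
        # Expected score of the winner under the standard Elo logistic curve.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


def chosen_and_rejected(matchups):
    """Return the top-rated and bottom-rated completion for one prompt."""
    ratings = elo_ratings(matchups)
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    return ranked[0], ranked[-1]


# Toy matchups for a single prompt: each tuple is (preferred, dispreferred).
matchups = [("a", "b"), ("a", "c"), ("b", "c")]
chosen, rejected = chosen_and_rejected(matchups)
```

<p>Because Elo updates are order-dependent, shuffling the matchups can change the final ratings slightly, so a real pipeline may want to fix the order or average over several passes.</p>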

<h2 id="measuring-distance-between-pairs">Measuring distance between pairs</h2>

<p>Given a set of preference data, I split the completions based on whether they were chosen (\(\mathbf{y}_w\)) or rejected (\(\mathbf{y}_l\)) by an evaluator—human or GPT, depending on the dataset.
Then, I embedded them using <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">sentence-transformers/all-MiniLM-L6-v2</a> to produce 384-dimensional sentence embeddings.
Finally, for each row, I computed the distance (\(\mathbf{d}\)) between the chosen and rejected vectors.
The figure below illustrates this process.</p>

<p style="text-align: center;"><img src="/assets/images/contrast-pairs/process.png" alt="" width="700px" /></p>

<p>To compute the distances, I used the cosine distance from <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html"><code class="language-plaintext highlighter-rouge">scipy</code></a>.
Cosine distance measures the angle between two vectors rather than their magnitudes, allowing us to capture similarity even if the length of the sentences or overall frequency of the words differ.
It is represented by the following equation:</p>

\[\mathbf{d}(\mathbf{v}_w, \mathbf{v}_l) = 1 - \dfrac{\mathbf{v}_w \cdot \mathbf{v}_l}{\lVert\mathbf{v}_w\rVert_2 \lVert\mathbf{v}_l\rVert_2}
,\]

<p>where the distance value lies in the range \([0, 2]\).
Usually, when we talk about distances between preference pairs, we talk about <strong>quality-based distances</strong>.
They’re often in the form of rankings (i.e., get top-1 and top-N) based on an evaluator’s assessment.
Again, in this blog post we’re looking at <strong>lexical-based distances</strong> that are readily available from a text’s surface form.
In the next section, I’ll discuss some interesting findings from these distance calculations.</p>
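<p>As a quick sanity check of the formula, the sketch below computes the cosine distance in plain Python on toy two-dimensional vectors. The post itself used <code class="language-plaintext highlighter-rouge">scipy</code> on 384-dimensional embeddings; the helper name and the toy vectors here are made up for illustration.</p>

```python
import math


def cosine_distance(v_w, v_l):
    """1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(v_w, v_l))
    norm_w = math.sqrt(sum(a * a for a in v_w))
    norm_l = math.sqrt(sum(b * b for b in v_l))
    return 1.0 - dot / (norm_w * norm_l)


# Same direction, different magnitude: distance is approximately 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors: distance is approximately 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite directions: distance is approximately 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

<p>The first case shows why the metric is insensitive to magnitude: scaling a vector does not change its direction, so the distance stays near zero.</p>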

<h2 id="findings">Findings</h2>

<p>Most of the charts I’ll be showing below are histograms.
Here, the x-axis represents the cosine distance whereas the y-axis represents the probability density.
We compute the probability density by normalizing the fraction of samples in each bin so that the sum of all bar areas equals 1.
The best way to think about these values is in terms of <em>chance</em>, that is, how likely is a random preference pair to have a distance \(\mathbf{d}\) on the x-axis?</p>
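<p>The density normalization can be sketched in a few lines. This is an illustrative re-implementation with made-up distances and a hypothetical helper name, not the plotting code behind the charts.</p>

```python
def density_histogram(values, n_bins, lo, hi):
    """Bin values in [lo, hi] and normalize so the bar areas sum to 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    # Density = fraction of samples in the bin, divided by the bin width.
    densities = [c / (total * width) for c in counts]
    return densities, width


# Made-up cosine distances for eight preference pairs.
distances = [0.05, 0.12, 0.18, 0.31, 0.35, 0.62, 0.90, 1.10]
densities, width = density_histogram(distances, n_bins=4, lo=0.0, hi=2.0)
total_area = sum(d * width for d in densities)
```

<p>Note that a density value can exceed 1 when the bins are narrower than 1; only the total area is constrained.</p>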

<h3 id="lexical-differences-are-apparent-in-some-datasets">Lexical differences are apparent in some datasets</h3>

<p>The chart below shows the distribution across multiple preference datasets.
AlpacaFarm and Nectar lie on opposite extremes.
AlpacaFarm is particularly interesting because its completions were generated by API-based LLMs using prompts that replicate human variability and agreement.
I’m unfamiliar with how exactly they prompted the LLM, but does that mean their process resulted in similar-looking texts?</p>

<p>On the other hand, Nectar’s completions were a combination of LLM outputs (GPT-4, GPT-3.5-turbo, GPT-3.5-turbo-instruct, LLaMa-2-7B-chat, and Mistral-7B-Instruct) alongside other existing datasets.
Because Nectar formats its preferences in terms of ranking, the chosen and rejected pairs here represent the top and bottom choices.</p>

<iframe width="720" height="540" frameborder="0" scrolling="no" src="/assets/images/contrast-pairs/distance_hist_plot.html"></iframe>

<p>Other datasets have distributions that I expected.
For example, OpenAI’s summarization dataset should still have closer preference pairs because of the task’s inherent nature.
Summarization is about compressing a text while maintaining information.
Upon checking the actual preferences and corresponding <a href="https://openaipublic.blob.core.windows.net/summarize-from-feedback/website/index.html#/tldr_comparisons">evaluator notes</a>, I noticed that rejected completions are oftentimes a matter of recall.</p>

<h3 id="elo-ranking-correlates-with-cosine-distance">Elo ranking correlates with cosine distance</h3>

<p>Next, I looked into how Elo ranking corresponds to the cosine distance of the text embeddings.
Preference datasets like OpenAI’s Summarization, SHP, and Berkeley-Nest’s Nectar represent their preferences as individual matchups, allowing us to compute the Elo rating of individual completions.
Then, we can order these ratings to achieve a rank of completions from most preferable to least.</p>

<p style="text-align: center;"><img src="/assets/images/contrast-pairs/elo_ranking.png" alt="" width="720px" /></p>

<p>However, OpenAI’s Summarization and SHP have an unequal number of ranks per prompt \(\mathbf{x}\).
So to simplify the visualizations, I took the chosen completion \(\mathbf{y}_w\), the top-2 completion \(\mathbf{y}_{l,next}\), the middle-performer \(\mathbf{y}_{l,mid}\), and the bottom-performer \(\mathbf{y}_{l,last}\) (which is equivalent to \(\mathbf{y}_l\) in the previous section).
On the other hand, Berkeley-Nest’s Nectar provides a 7-rank scale of preferences.
This allowed me to compute the distance between the first choice and each of the remaining ones, from the second to the last: \(\mathbf{d}(\mathbf{y}_1, \mathbf{y}_{2\ldots7})\).
Then, I plotted these distances in a histogram (I only retained the curve so that the charts look cleaner) as seen below:</p>

<div style="text-align:center">
  <iframe width="360" height="600" frameborder="0" scrolling="no" src="/assets/images/contrast-pairs/distance_rank_plot_openai___summarize_from_feedback.html"></iframe>
  <iframe width="360" height="600" frameborder="0" scrolling="no" src="/assets/images/contrast-pairs/distance_rank_plot_stanford___SHP.html"></iframe>
  <iframe width="720" height="500" frameborder="0" scrolling="no" src="/assets/images/contrast-pairs/distance_rank_plot_berkeley-nest___Nectar_fine.html"></iframe>
</div>

<p>The cosine distances from the <strong>OpenAI Summarization</strong> preference dataset follow a certain pattern:
completions that are closer in ranking have smaller lexical distance. The average mid ranking is 2.042 (with a 4.109 average number of ranks) and the Pearson correlation between the distances and Elo ranking is 0.779.</p>

<p>For the <strong>Stanford Human Preferences (SHP)</strong> dataset, I chose the <code class="language-plaintext highlighter-rouge">explainlikeimfive</code> subset to simulate OpenQA tasks.
Interestingly, it has a less pronounced visual correlation even though its Pearson-r is 0.785, slightly higher than OpenAI Summarization’s 0.779.
The average mid ranking is 1.967 with an average rank number of 4.600.</p>

<p>For <strong>Berkeley-Nest’s Nectar</strong> dataset, the rankings were already given so I didn’t have to compute my own.
Here, the Pearson correlation is 0.818.
If you look at the “chosen and rejected (2)” red line, you’ll notice that the cosine distances start very small but fall off afterwards.
It is interesting that completions that performed similarly during matchups are quite similar to one another based on their embeddings.</p>

<table>
  <thead>
    <tr>
      <th>Dataset</th>
      <th>Number of ranks (avg)</th>
      <th>Mid rank</th>
      <th>Pearson-r Elo ranking</th>
      <th>Pearson-r Elo rating</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>openai/summarize_from_feedback</td>
      <td>4.109</td>
      <td>2.042</td>
      <td>0.779</td>
      <td>-0.534</td>
    </tr>
    <tr>
      <td>stanfordnlp/SHP</td>
      <td>4.600</td>
      <td>1.967</td>
      <td>0.785</td>
      <td>-0.458</td>
    </tr>
    <tr>
      <td>berkeley-nest/Nectar</td>
      <td>7.000</td>
      <td>4.000</td>
      <td>0.818</td>
      <td>-</td>
    </tr>
  </tbody>
</table>

<p>The table above shows the ranking statistics for each dataset.
I also measured the Pearson correlation between the rejected text’s ranking (and Elo rating) with respect to its embedding distance from the chosen text.
The sign (+/-) corresponds to the direction of the correlation. 
For example, the negative sign in the last column shows that as the text’s Elo rating increases, its lexical distance from the chosen text decreases (i.e., they become more similar).</p>
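<p>The opposite signs in the two correlation columns follow from how rank and rating relate: a better completion has a lower rank number but a higher Elo rating. The sketch below shows this on made-up numbers; the values and the helper name are illustrative and not taken from the datasets.</p>

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy rejected completions for one prompt (all values are made up).
ranks = [2, 3, 4, 5]                       # 2 = closest to the chosen text
elo = [1040.0, 1010.0, 980.0, 950.0]       # higher = more preferred
distances = [0.10, 0.25, 0.30, 0.55]       # cosine distance from the chosen text

r_rank = pearson_r(ranks, distances)       # positive: worse rank, larger distance
r_rating = pearson_r(elo, distances)       # negative: higher rating, smaller distance
```

<p>In this toy example the ratings are an exact linear function of the ranks, so the two coefficients are mirror images of each other. With real Elo ratings the relationship is not perfectly linear, which is one reason the magnitudes in the table differ.</p>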

<h3 id="lexical-distance-is-consistent-across-preference-attributes">Lexical distance is consistent across preference attributes</h3>

<p>Finally, I was curious how individual attributes of preference manifest in lexical distances using the <a href="https://huggingface.co/datasets/nvidia/HelpSteer">HelpSteer</a> dataset (<a href="https://arxiv.org/abs/2311.09528">Wang et al., 2023</a>).
Most datasets only give us a single view of human judgment, but HelpSteer provides finegrained preferences such as helpfulness, correctness, coherence, complexity, and verbosity.</p>

<p>So, I did the same experiments for each of these attributes and found that the distribution didn’t change much.
I’m not quite confident in how I preprocessed this dataset.
Unlike other preference datasets that use matchups, HelpSteer uses scores from 0 to 4, so some texts can end up having the same scores.
Here, I simply sorted the texts by their score, and designated the chosen text as the first one on the list (whichever one Python’s sort function placed there), and the rejected text as the last element.
You can see the figure below:</p>

<iframe width="720" height="550" frameborder="0" scrolling="no" src="/assets/images/contrast-pairs/distance_helpsteer_plot.html"></iframe>

<p>I think that there’s still a lot that can be done on this angle.
One way is to format the data in terms of individual matchups.
This process leads to a forced ranking, allowing us to easily designate the chosen and rejected pair.
Since HelpSteer is the only one we have (as far as I know), then I’ll leave my analysis as is for now.</p>
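<p>The score-based selection described above, and the tie problem that comes with it, can be sketched as follows. The record layout and function names are assumptions for illustration; the original preprocessing may differ in sort direction or field names.</p>

```python
def pick_pair(completions, attribute):
    """Sort by an attribute score (descending) and take the first and last texts.

    Python's sort is stable, so among tied scores the original order decides
    which text ends up first or last.
    """
    ranked = sorted(completions, key=lambda c: c[attribute], reverse=True)
    return ranked[0]["text"], ranked[-1]["text"]


def is_degenerate(completions, attribute):
    """True when every completion has the same score, so no real pair exists."""
    scores = [c[attribute] for c in completions]
    return max(scores) == min(scores)


# Made-up completions for one prompt, scored from 0 to 4.
completions = [
    {"text": "A", "helpfulness": 3},
    {"text": "B", "helpfulness": 3},
    {"text": "C", "helpfulness": 1},
    {"text": "D", "helpfulness": 1},
]
chosen, rejected = pick_pair(completions, "helpfulness")
all_tied = is_degenerate([{"text": "A", "helpfulness": 2}, {"text": "B", "helpfulness": 2}], "helpfulness")
```

<p>Here A and B are tied at the top, and A is picked only because it came first in the input. Filtering out prompts where all scores are tied is one simple way to avoid comparing a pair that has no actual preference between them.</p>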

<h2 id="final-thoughts">Final thoughts</h2>

<p>In this blog post, I presented a lexical view of preference pairs using embeddings.
Using different preference datasets, I computed their sentence embeddings, and then measured the cosine distance between chosen and rejected pairs.
I found that some datasets exhibit lexical differences and that these differences correlate with human judgment (i.e., Elo rating).
Finally, using the HelpSteer dataset, I saw that cosine distances are consistent even on different attributes of preference.</p>

<p>This experiment is really just a curiosity as I work on RLHF.
I’ve been doing some experiments at my job that are a little bit orthogonal to this work.
I think this is just my way of exploring interesting avenues and scratching my itch.
If you’re interested in this type of work, feel free to reach out and discuss!</p>

<p>You can find the <a href="https://github.com/ljvmiranda921/scratch/tree/master/2024-02-21-contrast-pairs">source code for this work</a> on GitHub!</p>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="rlhf" /><category term="preference data" /><category term="llm" /><category term="shp" /><category term="openai" /><category term="berkeley-nest" /><summary type="html"><![CDATA[Can we spot differences between preference pairs just by looking at their word embeddings? In this blog post, I want to share my findings from examining lexical distances between chosen and rejected responses in preference datasets.]]></summary></entry><entry><title type="html">Guest lecture @ UNC Charlotte: Labeling with LLMs</title><link href="https://ljvmiranda921.github.io/notebook/2024/02/21/talk-unc-charlotte/" rel="alternate" type="text/html" title="Guest lecture @ UNC Charlotte: Labeling with LLMs" /><published>2024-02-21T00:00:00+08:00</published><updated>2024-02-21T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2024/02/21/talk-unc-charlotte</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2024/02/21/talk-unc-charlotte/"><![CDATA[<p><span class="firstcharacter">A</span> few weeks ago, I held a guest lecture in the DSBA 6188: Text Mining and Information Retrieval Class at UNC Charlotte on using large language models (LLMs) for annotation. 
It was fun because I could expand my previous blog posts on LLM annotation into a full-fledged lecture.</p>

<p>This blog post is my (abridged) lecture in written format. 
<strong>You can find the slides in this <a href="https://docs.google.com/presentation/d/1uGoI8meg66gATzim03ZQlQsvQHYUFw2SDetiAosaNzA/edit?usp=sharing">link</a>.</strong>
Finally, thanks to <a href="https://twitter.com/ryanwesslen">Ryan Wesslen</a> and <a href="https://twitter.com/ChangLeeTW">Chang Hsin Lee</a> for inviting me!
<!-- You can also watch the live recording [here](https://youtu.be/bKKlx46MopQ?si=Iul6PnApqr6IknwI). --></p>

<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRbvp2rKHC9B5fm7IvDz_k0y4odjLsSnsTJPJMGsJnE51v1o1KTC7uGK9BcFPfiQ2JkiJwiGDyw_zM5/embed?start=false&amp;loop=true&amp;delayms=3000" frameborder="0" width="720" height="434" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

<p> </p>

<hr />

<p> </p>

<h2 id="case-in-point-automated-fact-checking">Case in point: automated fact checking</h2>

<p>One of the major problems of the 21st century is disinformation.
You’ll see it everywhere, from Facebook posts and X tweets to fake news websites!
Combating disinformation is labor-intensive. 
<a href="https://www.politifact.com/">Politifact</a>, a fact-checking website, relies on volunteer journalists to scour the internet and manually label each source.</p>

<p>There are several efforts to automate the fact-checking process.
A common approach is to treat it as an NLP pipeline composed of different tasks (<a href="https://aclanthology.org/2022.tacl-1.11/">Guo et al., 2022</a>).
Today, we will only focus on <strong>claim detection</strong>, the first step in an automated fact-checking pipeline.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide05.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide06.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide07.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide08.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>Detecting claims is usually a dual problem: you’d also want to find the premises that support it.
Together, the claim and its premises make up an <strong>argument.</strong>
Applying NLP to this domain is often called <strong>argument mining.</strong>
For this talk, I want to introduce two argument mining sub-tasks: (1) first, we want to highlight the claim and premise given a text (<em>claim &amp; premise extraction</em>), and then, (2) we want to determine if a text supports, opposes, or is neutral to a certain topic (<em>stance detection</em>).</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide11.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide13.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>So, our general approach is to reframe these two sub-tasks as NLP tasks.
First, we treat claim &amp; premise extraction as a span labeling problem.
We can use spaCy’s <a href="https://spacy.io/api/spancategorizer">SpanCategorizer</a> to obtain spans or arbitrary slices of text.
Then, we treat stance detection as a text categorization problem.
Similarly, we can use spaCy’s <a href="https://spacy.io/api/textcategorizer">TextCategorizer</a> to classify a text among our three stances (support, oppose, neutral).</p>
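<p>To see how the two framings differ at the data level, here is a small sketch of what one annotated example might carry: character-offset spans for claim and premise extraction, and a label-to-score mapping for stance detection. The class names, label names, and the example sentence are hypothetical and are not spaCy’s own data structures.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # e.g. "CLAIM" or "PREMISE" (hypothetical label names)


@dataclass
class Example:
    text: str
    spans: list = field(default_factory=list)  # span labeling target
    cats: dict = field(default_factory=dict)   # text categorization target


# Made-up statement for illustration.
text = "School uniforms should be mandatory because they reduce bullying."
claim = "School uniforms should be mandatory"
premise = "they reduce bullying"

example = Example(
    text=text,
    spans=[
        Span(text.index(claim), text.index(claim) + len(claim), "CLAIM"),
        Span(text.index(premise), text.index(premise) + len(premise), "PREMISE"),
    ],
    # One score per stance; a single-label task puts all the mass on one label.
    cats={"SUPPORT": 1.0, "OPPOSE": 0.0, "NEUTRAL": 0.0},
)

highlighted = [(text[s.start:s.end], s.label) for s in example.spans]
```

<p>Span labeling asks where in the text something is, while text categorization asks what the whole text is about, which is why the two sub-tasks map onto different components.</p>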

<p>Notice how we’ve decomposed this general problem of disinformation into tractable NLP tasks.
And it is an important muscle to train.
In computer science, we often learn about the <em>divide and conquer</em> approach to algorithm design, and this is a good application of it to a fuzzier and, admittedly, complex problem.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide15.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>As we already know, training NLP models such as a span or text categorizer requires a lot of data. 
I want to talk about different methods of collecting this dataset and emphasize how LLMs can fit into this workflow.</p>

<h2 id="annotating-argument-mining-data-with-llms">Annotating argument mining data with LLMs</h2>

<p>Before we get into LLMs, I want to talk about “traditional” ways of annotating data.
On the left end, we have manual processes involving much human effort and curation.
And then, on the right, we have more automated methods that rely heavily on a reference or base model.
LLMs, as advanced as they are, still fall in between.
They’re not fully manual but also not fully automated because writing a prompt still requires tuning and domain expertise.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/traditional.png" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide18.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>But why are we still interested in LLMs? 
It’s because LLMs provide something that most semi-automated methods can’t: a model pretrained on web-scale data, and a highly flexible zero-shot capability.
Let me put this in a Venn diagram— and for each space in this diagram, I’ll talk about how LLMs can specifically help in our annotation workflows.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide21.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<h3 id="bootstrapping-in-a-human-in-the-loop-workflow">Bootstrapping in a human-in-the-loop workflow</h3>

<p>One of the most straightforward applications of large language models is bootstrapping labeled data.
Here, an LLM is a drop-in replacement for a base model that you’d usually train.
LLMs differ because they were pretrained on web-scale data, giving them enough capacity even for your domain-specific task. 
So, how good is an LLM annotator?</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide24.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide25.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>To test this question, I worked on a portion of the UKP Sentential Argument Mining corpus (<a href="https://aclanthology.org/D18-1402/">Stab et al., 2018</a>). 
It contains several statements across various topics, and the task is to determine whether the statement supports, opposes, or is neutral to the topic— a text categorization problem.</p>

<p>The process was simple: I included each statement in a prompt and asked GPT-3.5 what the stance was. You can read more about my process in <a href="/notebook/2023/03/24/llm-annotation/">this blog post</a>. 
My findings show that LLMs, when prompted in a zero-shot manner, are competitive with a baseline that I trained on the original labels.
In addition, I also found myself annotating faster (and more correctly) when correcting LLM annotations compared to annotating from scratch.
The latter finding is important because correcting annotations induces less cognitive load and human effort (<a href="https://aclanthology.org/2023.emnlp-main.92/">Li et al., 2023</a>, <a href="https://arxiv.org/abs/2311.04345">Zhang et al., 2023</a>).</p>
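<p>A zero-shot stance prompt of this kind can be sketched as a template plus a strict parser for the reply. The wording of the prompt, the label names, and both function names are assumptions for illustration and are not the prompt used in the experiment; the LLM call itself is left out.</p>

```python
LABELS = ("support", "oppose", "neutral")


def build_prompt(statement, topic):
    """Format a single statement into a zero-shot stance question."""
    options = ", ".join(LABELS)
    return (
        f"Topic: {topic}\n"
        f"Statement: {statement}\n"
        "Does the statement support, oppose, or stay neutral to the topic?\n"
        f"Answer with exactly one word from: {options}."
    )


def parse_stance(response):
    """Map a free-text reply to a label, or None if it is unusable."""
    cleaned = response.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else None


prompt = build_prompt("Uniforms erase individuality.", "school uniforms")
parsed = parse_stance(" Oppose.\n")
unparsed = parse_stance("I think it depends on the school.")
```

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> for replies that do not match a label keeps malformed outputs out of the dataset and leaves them for a human annotator to resolve.</p>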

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide27.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide28.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>So, if LLMs can already provide competitive annotations, is our problem solved?
We don’t have to annotate anymore?
Remember, the reason why we collect these annotations is so that we can train a supervised model that can reliably approximate the task we’re interested in.
The operative word here is <em>reliable</em>.
There’s a huge variance in LLM performance, and one way to <em>thin out</em> that curve is to insert it in a human-in-the-loop workflow (<a href="https://arxiv.org/pdf/2310.15100.pdf">Dai et al., 2023</a>; <a href="https://arxiv.org/pdf/2310.14424.pdf">Boubdir et al., 2023</a>; <a href="https://arxiv.org/pdf/2305.17926.pdf">Wang et al., 2023</a>).</p>

<h3 id="directing-annotations-by-providing-extra-info-in-the-ui">Directing annotations by providing extra info in the UI</h3>

<p>Another way we can use LLMs for annotation is by taking advantage of their flexibility.
LLMs have zero-shot capabilities, i.e., we can always frame structured prediction tasks such as text categorization or named entity recognition as a question-answering problem.
Back then, you’d need to train separate supervised models to achieve multi-task skills.
I want to use an LLM’s flexibility to enhance the annotation experience.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide30.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>This time, I want to introduce two workflows. 
The first one is still a text categorization problem, but I want to ask an LLM to pre-highlight the claims and premises so I can reference them during annotation.
For the second one, I want to ask the LLM to do the reasoning for me. 
I’ll let it identify the claims and premises, then pre-annotate an answer, and then give me a reason for choosing that answer.
This exercise aims to explore creative ways we can harness LLMs.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide31.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide34.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>The process is similar to the first section, but I prompt for auxiliary information instead of prompting for the direct labels.
LLMs make this possible because we can formulate each task as a question-answer pair.
You’ll find examples of my prompt in the slides below.
The prompt on the left is a straightforward span labeling prompt, where we ask the LLM to provide the exact spans from a text.
On the other hand, the prompt on the right is a chain-of-thought prompt (<a href="https://arxiv.org/abs/2201.11903">Wei et al., 2023</a>). 
Here, we induce an LLM to perform a series of reasoning tasks to arrive at a final answer.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide32.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide35.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide33.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide37.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>
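<p>When an LLM is asked to return the exact spans from a text, its reply still has to be mapped back onto character offsets before it can be shown as highlights. The sketch below assumes the reply has already been parsed into (label, quoted text) pairs; the function name, the dictionary keys, and the example sentence are hypothetical.</p>

```python
def locate_spans(text, llm_spans):
    """Map quoted spans from an LLM reply onto character offsets.

    Spans that do not appear verbatim in the text (for example, paraphrases)
    are returned separately instead of being guessed at.
    """
    found, missing = [], []
    for label, quote in llm_spans:
        start = text.find(quote)
        if start == -1:
            missing.append((label, quote))
        else:
            found.append({"start": start, "end": start + len(quote), "label": label})
    return found, missing


# Made-up statement and a made-up model reply with one paraphrased span.
text = "We should ban plastic bags because they pollute the oceans."
llm_spans = [
    ("CLAIM", "We should ban plastic bags"),
    ("PREMISE", "they pollute the oceans"),
    ("PREMISE", "plastic harms marine life"),
]
found, missing = locate_spans(text, llm_spans)
```

<p>Keeping the unmatched spans visible is useful during annotation because it shows how often the model paraphrases instead of quoting.</p>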

<p>The good thing about <a href="https://prodigy.ai">Prodigy</a> is that you can easily incorporate this extra information in your annotation UI.
On the bottom left, you’ll see that it highlights the claims and premises for each statement, allowing you to focus on the relevant details when labeling.
On the bottom right, you’ll find that the UI metadata now contains the prompt’s reasoning steps.</p>

<p>There are many creative ways to improve annotation efficiency (and quality) using LLMs. 
One of my favorite papers from EMNLP was CoAnnotating (<a href="https://aclanthology.org/2023.emnlp-main.92.pdf">Li et al., 2023</a>), which uses an uncertainty metric to allocate annotation tasks between humans and a chat model such as ChatGPT.
We’ve seen a lot of LLM-as-an-assistant applications in the market for the past year, and I think that there’s an opportunity to apply the same perspective to the task of annotation.</p>
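<p>One way to picture uncertainty-based allocation is to sample several answers per item from the model and route items with high disagreement to a human. The sketch below uses Shannon entropy over repeated labels as a stand-in uncertainty measure; the threshold, function names, and data are assumptions, and this is not a reproduction of the CoAnnotating method.</p>

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one item."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate(samples, threshold=0.5):
    """Route each item to 'human' if the sampled labels disagree too much."""
    routes = {}
    for item_id, labels in samples.items():
        routes[item_id] = "human" if label_entropy(labels) > threshold else "llm"
    return routes


# Made-up repeated LLM answers for two statements.
samples = {
    "s1": ["support", "support", "support", "support"],
    "s2": ["support", "oppose", "neutral", "oppose"],
}
routes = allocate(samples)
```

<p>The threshold controls the trade-off between annotation cost and quality: lowering it sends more items to humans.</p>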

<h3 id="revealing-ambiguity-in-our-annotation-guidelines">Revealing ambiguity in our annotation guidelines</h3>

<p>Finally, I’m curious how LLMs parse information originally intended for humans. 
In most annotation projects, researchers write an <strong>annotation guideline</strong> to set up the parameters of the labeling task.
These guidelines aim to reduce uncertainty about the phenomenon we are annotating. 
We can even think of these as prompts for humans!</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide39.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>This time, I want to focus on a simple task: determine whether a statement is an argument. 
It sounds easy because it’s “just” a binary classification task.
However, after looking through various argument mining papers and their annotation guidelines, I realized that they each have their own definition of what makes an argument!</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide40.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>So, this got me thinking: what if we include the annotation guideline in the prompt?
You can check my entire experiment in this <a href="/notebook/2023/03/25/langchain-annotation/">blog post</a>.
Back then, you could not fit a whole document into an LLM’s limited context length, so I used a continuous prompting strategy that showed chunks of the document and let the LLM update their answer based on new information.
Langchain calls this a <a href="https://js.langchain.com/docs/modules/chains/document/refine">“refine chain”</a> in their docs.
As an aside, I’ve opted into using <a href="https://github.com/srush/MiniChain">minichain</a> in my recent projects as it is more lightweight and enough for my needs.</p>
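<p>The chunk-and-update idea can be sketched as a simple loop: split the guideline into pieces, show one piece at a time, and carry the running answer forward. This is a simplified illustration with a stub in place of a real model call; the prompt wording, the chunk size, and the function names are assumptions and do not reflect LangChain’s or MiniChain’s actual implementation.</p>

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive character chunks of at most chunk_size."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def refine(guideline, statement, llm, chunk_size=1000):
    """Show the guideline chunk by chunk and let the model revise its answer."""
    answer = None
    for chunk in chunk_text(guideline, chunk_size):
        prompt = (
            f"Guideline excerpt:\n{chunk}\n\n"
            f"Statement: {statement}\n"
            f"Current answer: {answer}\n"
            "Update the answer if the excerpt changes it; otherwise repeat it."
        )
        answer = llm(prompt)
    return answer


# Stub model that records prompts, standing in for a real LLM call.
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return f"answer after {len(calls)} chunk(s)"


final = refine("x" * 2500, "Uniforms reduce bullying.", fake_llm, chunk_size=1000)
```

<p>Each call only sees one excerpt plus the previous answer, so information from early chunks survives only through that answer. This is a plausible weak point of the strategy when the guideline’s key definition appears far from the relevant examples.</p>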

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide41.jpg" style="border: 1px solid black; padding: 2px; width: 360px" />
  <img src="/assets/images/talk-unc-charlotte/slide42.jpg" style="border: 1px solid black; padding: 2px; width: 360px" /></p>
</div>

<p>Including an annotation guideline in the prompt resulted in worse results—surprising. 
I couldn’t delve further, but I hypothesize that prompts for LLMs have a particular “dialect” vastly different from how we talk as humans.
Annotation guidelines were written with humans in mind, and perhaps some qualities don’t transfer properly into LLM prompts.
There are many confounding factors, of course. 
Maybe the refine strategy is not the best, or maybe I should’ve processed the text much better.
An LLM’s prompt sensitivity is still an open problem.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide44.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>But I learned one thing: we can use LLMs as a “first pass” when iterating over our annotation guidelines.
Typically, you’d start with a pilot annotation with a small group of annotators as you write the guidelines, but there’s an opportunity to incorporate LLMs into the mix.</p>

<h2 id="final-thoughts">Final thoughts</h2>

<p>Before we end, I want to share an important question to ask at the start of any annotation project.
You should always ask yourself: what is the label supposed to reflect?
Knowing what you want to use the collected dataset for is paramount.</p>

<p><a href="https://arxiv.org/pdf/2112.07475.pdf">Röttger et al. (2022)</a> named two paradigms for data annotation: prescriptive and descriptive.
<strong>Prescriptive</strong> annotation is usually found in linguistic tasks such as named entity recognition or part-of-speech tagging, where there is a “correct” answer for each instance.
Here, you already have a function in mind and need to collect enough data to train a reliable model.
On the other hand, <strong>descriptive</strong> annotation aims to capture the whole diversity of human judgment.
You’d usually find this in subjective tasks like hate speech detection or human preference collection.</p>

<div style="text-align: center;">
  <p><img src="/assets/images/talk-unc-charlotte/slide48.jpg" style="border: 1px solid black; padding: 10px; width: 700px" /></p>
</div>

<p>LLMs are pretty good at prescriptive annotation tasks.
There is some empirical evidence to support this (<a href="https://arxiv.org/pdf/2305.15444.pdf">Ashok et al., 2023</a>; <a href="https://arxiv.org/pdf/2311.08723.pdf">Chen et al., 2023</a>; <a href="https://arxiv.org/pdf/2305.08377.pdf">Sun et al., 2023</a>), and using LLMs this way lets us tap into the web-scale data they were pretrained on.</p>

<p>And now to my final point: despite their web-scale and zero-shot capabilities, LLMs are only as good as how well you prompt them.
During my early days in data science, there was a common adage: “garbage in, garbage out.”
Usually, we say this when we want to refer to bad data.
The problem with prompts is that the degree of freedom is much higher, which introduces ambiguity to our inputs.
Hence, I don’t recommend taking LLM outputs straight from the firehose and serving them immediately.
There should be an intermediary step that minimizes this uncertainty, and that step is human annotation.</p>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="argument mining" /><category term="fake news" /><category term="llm" /><category term="evaluation" /><category term="annotation" /><category term="prodigy" /><category term="ai" /><category term="large language models" /><summary type="html"><![CDATA[A few weeks ago, I held a guest lecture at University of North Carolina Charlotte on how we can use large language models for annotation in the context of argument mining and fact verification. Here are the contents of that lecture in blog post format.]]></summary></entry><entry><title type="html">How to set up Git and SSH when your org has enforced SAML SSO</title><link href="https://ljvmiranda921.github.io/notebook/2023/11/28/git-ssh-saml/" rel="alternate" type="text/html" title="How to set up Git and SSH when your org has enforced SAML SSO" /><published>2023-11-28T00:00:00+08:00</published><updated>2023-11-28T00:00:00+08:00</updated><id>https://ljvmiranda921.github.io/notebook/2023/11/28/git-ssh-saml</id><content type="html" xml:base="https://ljvmiranda921.github.io/notebook/2023/11/28/git-ssh-saml/"><![CDATA[<p><span class="firstcharacter">W</span>hile cloning a repository from an organization with SAML SSO, I encountered an SSH error. I’ve been using Git with SSH before, and I admit that this was new:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone git@github.com:myorg/repo.git
Cloning into <span class="s1">'repo'</span>...
ERROR: The <span class="s1">'myorg'</span> organization has enabled or enforced SAML SSO.  
To access this repository, you must use the HTTPS remote with a 
personal access token or SSH with an SSH key and passphrase that 
has been authorized <span class="k">for </span>this organization.

Visit https://docs.github.com/articles/authenticating-to-a-github-organ
ization-with-saml-single-sign-on/ <span class="k">for </span>more information.
</code></pre></div></div>

<h1 id="step-1-create-an-ssh-key-and-upload-it-to-your-github-account">Step 1: Create an SSH key and upload it to your GitHub account</h1>

<p>First, you need to generate your SSH key.
Sometimes, your organization will require you to generate a new one using your company email.
Either way, you will run the <code class="language-plaintext highlighter-rouge">ssh-keygen</code> command below:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh-keygen <span class="nt">-t</span> ed25519 <span class="nt">-C</span> lj@myorg.org
</code></pre></div></div>

<p>This will generate a key pair in the form of <code class="language-plaintext highlighter-rouge">id_ed25519</code> and <code class="language-plaintext highlighter-rouge">id_ed25519.pub</code>.
On Linux, you can find them in the <code class="language-plaintext highlighter-rouge">~/.ssh/</code> directory.
We need to upload the one with the <code class="language-plaintext highlighter-rouge">.pub</code> extension to GitHub.
Go to your GitHub <strong>Settings</strong> &gt; <strong>SSH and GPG Keys</strong> &gt; <strong>New SSH Key</strong> (or head to <a href="https://github.com/settings/keys">github.com/settings/keys</a>).</p>

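<p>Write a semi-descriptive title (I usually put the organization name), set the <strong>Key Type</strong> as “Authentication Key,” and paste the contents of <code class="language-plaintext highlighter-rouge">id_ed25519.pub</code> into the <strong>Key</strong> field.
If your organization enforces SAML SSO, you may also need to authorize the key for it: click <strong>Configure SSO</strong> next to the key on the same page and authorize it for your organization.</p>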
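
<p>To make the key-generation step concrete, here’s a minimal sketch that creates a key pair and prints its public half.
It assumes OpenSSH’s <code class="language-plaintext highlighter-rouge">ssh-keygen</code> is installed. The scratch directory and the empty passphrase are assumptions made only for this demo, so that it never touches your real <code class="language-plaintext highlighter-rouge">~/.ssh/</code> directory. For a real key, keep the default path and set a passphrase.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical scratch directory so the demo never touches ~/.ssh
demo_dir="$(mktemp -d)"
key="$demo_dir/id_ed25519"

# -N "" sets an empty passphrase and -q suppresses the usual output (demo only)
ssh-keygen -q -t ed25519 -C "lj@myorg.org" -N "" -f "$key"

# Two files are created: the private key and the .pub public key
ls "$demo_dir"

# The public key is a single line; this is what gets pasted into GitHub
cat "$key.pub"

# Print the fingerprint so you can compare it with the one GitHub shows
ssh-keygen -l -f "$key.pub"
</code></pre></div></div>

<p>The line printed by <code class="language-plaintext highlighter-rouge">cat</code> (it starts with <code class="language-plaintext highlighter-rouge">ssh-ed25519</code>) is what goes into the <strong>Key</strong> field.</p>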

<h1 id="step-2-add-your-ssh-key-to-the-ssh-agents-list">Step 2: Add your SSH key to the SSH agent’s list</h1>

<p>First, test the connection by running:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ssh -T git@github.com
Hi username! You've successfully authenticated, but GitHub does 
not provide shell access.
</code></pre></div></div>

<p>Then, start the SSH agent:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ eval "$(ssh-agent -s)"
Agent pid 16935
</code></pre></div></div>

<p>It starts a background daemon and displays its process ID (in this case, 16935). 
We can then add our private keys while this agent is running.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ssh-add .ssh/id_ed25519
Identity added: .ssh/id_ed25519 (some other info)
</code></pre></div></div>
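
<p>If you want to confirm that the agent actually holds your key, <code class="language-plaintext highlighter-rouge">ssh-add -l</code> lists the identities it has loaded.
Below is a minimal, self-contained sketch. The throwaway key and its path are assumptions for illustration, so point <code class="language-plaintext highlighter-rouge">ssh-add</code> at your own key in practice.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Throwaway key so the demo is self-contained (hypothetical path)
key="$(mktemp -d)/id_ed25519"
ssh-keygen -q -t ed25519 -N "" -f "$key"

# Start an agent for this shell and add the key to it
eval "$(ssh-agent -s)"
ssh-add "$key"

# List the identities the agent currently holds
loaded="$(ssh-add -l)"
echo "$loaded"

# Stop the demo agent and unset its environment variables
eval "$(ssh-agent -k)"
</code></pre></div></div>

<p>If the listing doesn’t include your key, the agent isn’t offering it to GitHub, and running <code class="language-plaintext highlighter-rouge">ssh-add</code> again is a reasonable first thing to try.</p>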

<p>At this point, you should now be able to clone your organization’s private repository. 
I haven’t really dug into why it errored out the first time; I had assumed that keys are automatically added to the agent whenever I create them.
Anyway, in case you also encountered this error, I hope this tutorial helps!</p>]]></content><author><name>LJ MIRANDA</name></author><category term="notebook" /><category term="git" /><category term="github" /><category term="saml" /><category term="sso" /><category term="ssh" /><category term="how to" /><summary type="html"><![CDATA[While cloning a repository from an organization, I encountered an SSH error that I've never seen before. It's something related to SAML SSO. I managed to solve it, so I'm documenting the steps here. Hope it helps you too!]]></summary></entry></feed>