Achieve 90% Results in Few-Shot Text Classification with Just 0.1% Data

Knowledgator Engineering · 5 min read · Dec 27, 2023


The zero-shot abilities of modern LLMs are truly inspiring and make it feel like AGI is close. However, they require large networks and pre-training on huge amounts of data, and even that is not enough: to tackle actual business problems with acceptable accuracy, you still need to fine-tune a model for your specific case. What makes the difference, then, is how few examples you need to achieve reasonable results. Our team developed a zero-shot text classification model that, with just 8 examples per label, can reach up to 90% accuracy and beat huge LLMs fine-tuned on thousands of examples. In this tutorial, we will show you how to reproduce these results with our open-source zero-shot text classification model.

Requirements and data


First of all, make sure that you have installed the following libraries:

pip install datasets transformers accelerate setfit

datasets: a unified interface for managing and accessing diverse machine learning datasets.

transformers: the Hugging Face library offering pre-trained models and tools for natural language processing tasks.

accelerate: a library that enables the same PyTorch code to run across any distributed configuration by adding just four lines of code.

setfit: an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers.

Okay, now we need to download a dataset: we will use the “emotion” dataset, which contains 6 classes of emotions describing a text. We will keep the provided test split as-is and, from the train split, randomly select 48 examples, 8 per label.

from datasets import load_dataset

def get_train_dataset(dataset, N):
    # Select up to N examples per label from a shuffled train split
    ids = []
    label2count = {}
    train_dataset = dataset['train'].shuffle(seed=41)
    for id, example in enumerate(train_dataset):
        if example['label'] not in label2count:
            label2count[example['label']] = 1
        elif label2count[example['label']] >= N:
            continue
        else:
            label2count[example['label']] += 1
        ids.append(id)
    return train_dataset.select(ids)

# emotion
emotion_dataset = load_dataset("dair-ai/emotion")
test_dataset = emotion_dataset['test']
classes = test_dataset.features["label"].names
N = 8
train_dataset = get_train_dataset(emotion_dataset, N)
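
As a quick optional sanity check (not strictly needed, it just reuses the variables above), you can verify that the sampling produced 8 examples per class:

from collections import Counter

label_counts = Counter(train_dataset['label'])
print({classes[label]: count for label, count in label_counts.items()})
# Expected: 8 examples for each of the 6 emotion classes, 48 in total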

SetFit
First, we will see what results we can achieve with SetFit, an alternative few-shot learning approach that uses text embeddings for classification. SetFit is one of the latest breakthroughs in this field: an open-source framework for few-shot fine-tuning of Sentence Transformers. Its creators claim that with just 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit surpasses RoBERTa Large fine-tuned on the full training set of 3k examples.

Then we’ll run the same task with our approach and compare the results (ours are much better).
from setfit import SetFitModel, Trainer, TrainingArguments
from sklearn.metrics import classification_report

model = SetFitModel.from_pretrained("BAAI/bge-base-en-v1.5")

args = TrainingArguments(
    batch_size=32,
    num_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

To test the model, we run the following command:

preds = model.predict(test_dataset['text'])
print(classification_report(test_dataset['label'], preds,
                            target_names=classes, digits=4))
SetFit results on emotion dataset

We got results slightly worse than what SetFit demonstrates even in a zero-shot setting. One reason is that the uniform label distribution of our training set does not reflect the real distribution; moreover, the SetFit approach usually requires more examples to separate the different classes in vector space. Our method is more universal, and fine-tuning the model does not require training an additional classification head.
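
To see how far the real distribution is from uniform, here is a quick optional check comparing the test split’s label frequencies against the uniform 8-per-class training subset:

from collections import Counter

test_counts = Counter(test_dataset['label'])
total = len(test_dataset)
for label_id in sorted(test_counts):
    print(f"{classes[label_id]}: {test_counts[label_id] / total:.2%} "
          f"(our training subset is uniform at {1 / len(classes):.2%})")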

Comprehend-it method
Now let’s try our approach. First, you need to initialize the model:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'knowledgator/comprehend_it-base'

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(model_name)

Our approach is based on a text classification model trained for natural language inference (NLI): given two statements, it predicts whether one entails the other, they contradict each other, or they are unrelated (neutral).
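
In practice, the input text serves as the premise and each candidate label as a hypothesis, and the label whose hypothesis is most strongly entailed wins. A minimal sketch of this formulation, reusing the tokenizer and model loaded above (the example text is made up):

premise = "I can't stop smiling, everything went perfectly today!"
hypothesis = "joy"  # each candidate label becomes a hypothesis

inputs = tokenizer(premise, hypothesis, return_tensors='pt')
logits = model(**inputs).logits
print(logits)  # one score per NLI class (contradiction / entailment / neutral)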

Now, let’s define all the data processing functions:

from transformers import TrainingArguments, Trainer
from transformers import DataCollatorWithPadding
from datasets import Dataset
import torch
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def transform_dataset(dataset, classes, template='{}'):
    # Turn each (text, label) example into NLI-style (premise, hypothesis)
    # pairs: positive pairs with the true label, negative pairs with
    # sampled wrong labels
    new_dataset = {'sources': [], 'targets': [], 'labels': []}

    texts = dataset['text']
    labels = dataset['label']

    # Estimate the label distribution so negatives are sampled proportionally
    label2count = {}
    for label in labels:
        if label not in label2count:
            label2count[label] = 1
        else:
            label2count[label] += 1
    count = len(labels)
    label2prob = {label: lc / count for label, lc in label2count.items()}
    unique_labels = list(label2prob)
    probs = list(label2prob.values())

    for text, label_id in zip(texts, labels):
        label = classes[label_id]
        # Repeat the positive pair so it balances the negatives below
        for i in range(len(classes) - 1):
            new_dataset['sources'].append(text)
            new_dataset['targets'].append(template.format(label))
            new_dataset['labels'].append(1.)

        # Sample len(classes)-1 wrong labels as negative pairs
        for i in range(len(classes) - 1):
            neg_class_ = label
            while neg_class_ == label:
                neg_lbl = np.random.choice(unique_labels, p=probs)
                neg_class_ = classes[neg_lbl]

            new_dataset['sources'].append(text)
            new_dataset['targets'].append(template.format(neg_class_))
            new_dataset['labels'].append(-1.)
    return Dataset.from_dict(new_dataset)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

def tokenize_and_align_label(example):
    # Encode the text (premise) and label (hypothesis) as a sentence pair,
    # matching the format the zero-shot classification pipeline expects
    tokenized_input = tokenizer(example['sources'], example['targets'],
                                truncation=True, max_length=512,
                                padding="max_length")

    # Map pair labels to the model's class ids:
    # 1.0 (positive) -> 1, -1.0 (negative) -> 0, 0.0 (neutral) -> 2
    label = example['labels']
    if label == 1.0:
        label = torch.tensor(1)
    elif label == 0.0:
        label = torch.tensor(2)
    else:
        label = torch.tensor(0)
    tokenized_input['label'] = label
    return tokenized_input
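
Before training, it can help to inspect what transform_dataset produces: every text appears both with its true label (label 1.0) and with sampled wrong labels (label -1.0). An optional peek at a few rows:

sample = transform_dataset(train_dataset, classes)
for row in sample.select(range(3)):
    print(row['sources'][:60], '|', row['targets'], '|', row['labels'])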

And let’s process the training dataset and run training:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

dataset = transform_dataset(train_dataset, classes)

tokenized_dataset = dataset.map(tokenize_and_align_label)
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)

training_args = TrainingArguments(
    output_dir='comprehendo',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

trainer.save_model('comprehender')

To use our model for inference, we can utilise the Hugging Face pipeline for zero-shot
classification:

from transformers import pipeline
from sklearn.metrics import classification_report
from tqdm import tqdm

# Load the fine-tuned model we saved above
classifier = pipeline("zero-shot-classification",
                      model='comprehender', tokenizer=tokenizer, device=device)
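
The pipeline returns the candidate labels sorted by score, so the first element of 'labels' is the prediction. For a single text (the example text here is made up):

result = classifier("I feel like everything is falling apart", classes)
print(result['labels'][0])  # top-ranked label, e.g. 'sadness'
print(result['scores'][0])  # its confidence score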

And let’s test the model:

preds = []
label2idx = {label: id for id, label in enumerate(classes)}

for example in tqdm(test_dataset):
    # Take the top-ranked label and map it back to its class id
    pred = classifier(example['text'], classes)['labels'][0]
    idx = label2idx[pred]
    preds.append(idx)

print(classification_report(test_dataset['label'], preds,
                            target_names=classes, digits=4))

We got impressive results, considering that not all labels were present among our training examples, and the score was 8% higher in terms of micro F1 than our model achieves in the zero-shot setting.

Comprehend-it results on emotion dataset


Conclusion
As a result, our approach significantly outperformed SetFit; however, it’s important to note that SetFit can be faster, depending on the model size and the number of labels. Our approach’s inference cost grows with the number of labels because it requires full attention between the text and each label, so the model has to be run N times, where N is the number of labels. The choice therefore depends on the balance between speed, accuracy, and the number of training examples you have.
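
To make this cost concrete: under the hood, the zero-shot pipeline builds one premise-hypothesis pair per candidate label for every input text, and each pair needs its own forward pass (a simplified sketch; the real pipeline batches these pairs):

text = "I was so happy to see them again"
pairs = [(text, label) for label in classes]
print(len(pairs))  # 6 pairs for 6 classes; SetFit embeds the text once instead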

Benchmark:
