Few-Shot Learning Tutorial
Zero-shot abilities of modern LLMs are truly inspiring and make us feel that AGI is pretty close.
However, they require large networks pre-trained on huge amounts of data. And still, it’s not
enough: to tackle actual business problems with acceptable accuracy, you need to fine-tune a
model specifically for your case. What makes a difference here is how few examples you need to
achieve reasonable results. In our team, we developed a zero-shot text classification model that,
with just 8 examples per label, can achieve up to 90% accuracy and beat huge LLMs fine-tuned
on thousands of examples. In this tutorial, we will show you how to achieve the same results
with our open-source zero-shot text classification model. First, let’s look at the libraries we will need:
datasets: unified interface for managing and accessing diverse machine learning datasets.
transformers: Hugging Face library offering pre-trained models and tools for natural
language processing tasks.
accelerate: library that enables the same PyTorch code to be run across any distributed
configuration by adding just four lines of code.
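If these libraries are not installed yet, they can be added with pip (setfit, evaluate, and scikit-learn are also used later in this tutorial):

pip install datasets transformers accelerate setfit evaluate scikit-learn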
Okay, now we need to download a dataset. We will use the “emotion” dataset, which contains 6
classes of emotions describing a text. Then we will split the dataset into test and train, and from
the train split we will randomly select 48 examples, on average 8 examples per label.
from datasets import load_dataset

# load the "emotion" dataset
emotion_dataset = load_dataset("dair-ai/emotion")
test_dataset = emotion_dataset['test']
classes = test_dataset.features["label"].names

# sample N examples per label from the train split (see the sketch below)
N = 8
train_dataset = get_train_dataset(emotion_dataset, N)
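The helper get_train_dataset is not defined in the snippet above; here is a minimal sketch of what it could look like, assuming it simply shuffles the train split and keeps up to N examples per label (the function body is illustrative, not the exact implementation):

from collections import defaultdict

def get_train_dataset(dataset_dict, N):
    # shuffle the train split and keep at most N examples per label
    train = dataset_dict['train'].shuffle(seed=42)
    ids_per_label = defaultdict(list)
    for idx, label in enumerate(train['label']):
        if len(ids_per_label[label]) < N:
            ids_per_label[label].append(idx)
    selected = [i for ids in ids_per_label.values() for i in ids]
    return train.select(selected)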
SetFit
Firstly, we will see what results we can achieve with SetFit, an alternative few-shot learning
approach that uses text embeddings for classification. SetFit is the latest breakthrough in this
field: an open-source framework for few-shot fine-tuning of Sentence Transformers. Its creators
claim that with just 8 labeled examples per class on the Customer Reviews (CR) sentiment
dataset, SetFit surpasses RoBERTa Large fine-tuned on the full training set of 3k examples.
Then, we’ll run the same task with our approach and compare the results (ours are much
better).
from setfit import SetFitModel, Trainer, TrainingArguments
from sklearn.metrics import classification_report

model = SetFitModel.from_pretrained("BAAI/bge-base-en-v1.5")

args = TrainingArguments(
    batch_size=32,
    num_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

preds = model.predict(test_dataset['text'])
print(classification_report(test_dataset['label'], preds,
                            target_names=classes, digits=4))
SetFit results on emotion dataset
With SetFit, we got results that are even slightly worse than it demonstrates in a zero-shot
setting. One of the reasons is that the uniform label distribution in the training set does not
reflect the real distribution, and the SetFit approach usually requires more examples to
separate different classes in the embedding space. Our method is more universal, and
fine-tuning the model does not require training an additional classification head.
Comprehend-it method
Let’s try our approach now. First, you need to initialize the model and tokenizer:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'knowledgator/comprehend_it-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
Our approach is based on a text classification model that was trained to distinguish whether two
statements entail each other, contradict each other, or are neutral (a natural language inference setup).
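As a quick illustration, each candidate label is turned into a hypothesis and scored against the input text. The snippet below is a sketch: the hypothesis template is an assumption used for illustration, not necessarily the exact one used during training.

import torch

text = "i feel like i finally accomplished something great"
hypothesis = "This example is joy."  # assumed template: "This example is {}."

inputs = tokenizer(text, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# probabilities over the model's three NLI classes;
# check model.config.id2label for their ordering
print(torch.softmax(logits, dim=-1))

To fine-tune the model on our few-shot examples, we first convert the classification dataset into such premise/hypothesis pairs: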
import evaluate
import numpy as np
from datasets import Dataset

accuracy = evaluate.load("accuracy")

# The function below turns a classification dataset into premise/hypothesis pairs.
# The function signature, hypothesis template, and new_dataset initialisation were
# missing from the original snippet and are reconstructed here as assumptions.
template = 'This example is {}.'

def prepare_nli_dataset(dataset):
    new_dataset = {'sources': [], 'targets': [], 'labels': []}
    texts = dataset['text']
    labels = dataset['label']
    # estimate the label distribution to sample negative classes proportionally
    label2count = {}
    for label in labels:
        if label not in label2count:
            label2count[label] = 1
        else:
            label2count[label] += 1
    count = len(labels)
    label2prob = {label: lc / count for label, lc in label2count.items()}
    unique_labels = list(label2prob)
    probs = list(label2prob.values())
    for text, label_id in zip(texts, labels):
        label = classes[label_id]
        # positive pairs with the true label
        for i in range(len(classes) - 1):
            new_dataset['sources'].append(text)
            new_dataset['targets'].append(template.format(label))
            new_dataset['labels'].append(1.)
        # negative pairs with randomly sampled wrong labels
        for i in range(len(classes) - 1):
            neg_class_ = label
            while neg_class_ == label:
                neg_lbl = np.random.choice(unique_labels, p=probs)
                neg_class_ = classes[neg_lbl]
            new_dataset['sources'].append(text)
            new_dataset['targets'].append(template.format(neg_class_))
            new_dataset['labels'].append(-1.)
    return Dataset.from_dict(new_dataset)

dataset = prepare_nli_dataset(train_dataset)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # take the argmax over logits
    return accuracy.compute(predictions=predictions, references=labels)
def tokenize_and_align_label(example):
    hypothesis = example['targets']
    premise = example["sources"]
    # encode premise and hypothesis as a sentence pair
    # (the tokenizer call was omitted in the original snippet)
    tokenized_input = tokenizer(premise, hypothesis, truncation=True)
    label = example['labels']
    # map the soft labels from the prepared dataset to the model's class ids
    if label == 1.0:
        label = torch.tensor(1)
    elif label == 0.0:
        label = torch.tensor(2)
    else:
        label = torch.tensor(0)
    tokenized_input['label'] = label
    return tokenized_input
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# remove raw text columns so only the tokenized features and 'label' reach the Trainer
tokenized_dataset = dataset.map(tokenize_and_align_label, remove_columns=['sources', 'targets', 'labels'])
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.1)
training_args = TrainingArguments(
    output_dir='comprehendo',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model('comprehender')
To use our model for inference, we can utilise the Hugging Face pipeline for zero-shot
classification:
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("zero-shot-classification",
                      model='comprehender', tokenizer=tokenizer, device=device)
# classify each test text and take the top-ranked candidate label
label2idx = {label: idx for idx, label in enumerate(classes)}
preds = []
for text in test_dataset['text']:
    preds.append(label2idx[classifier(text, classes)['labels'][0]])
print(classification_report(test_dataset['label'], preds,
                            target_names=classes, digits=4))
We got impressive results, considering that not all labels were present in our few-shot training
set, and the micro F1 score is 8% higher than what our model achieves in the zero-shot setting.