
Hugging Face Basics

April 2, 2025

1 What is the Hugging Face Ecosystem?


The Hugging Face ecosystem is a comprehensive and popular platform for natural language processing (NLP) and machine learning (ML). It provides a wide range of tools, libraries, and services designed to facilitate the development, deployment, and sharing of ML models.

1.0.1 Hugging Face Ecosystem


• Transformers Library:
  – Transformers: Provides thousands of pre-trained models for tasks like text classification, translation, question answering, and text generation.
  – Tokenizers: Specialized for tokenizing text, essential for NLP model preparation.
• Datasets:
  – Access to various datasets for NLP and ML, simplifying loading, processing, and management.
• Hugging Face Hub:
  – Platform for sharing and discovering pre-trained models and datasets.
• Inference API:
  – Cloud service for deploying models and obtaining predictions via API calls (see the sketch after this list).
• Spaces:
  – Create and share interactive ML applications using Gradio or Streamlit.
• Training and Deployment:
  – Tools for training models on custom datasets and deploying them using PyTorch, TensorFlow, and cloud services.
• Model Evaluation:
  – Tools for evaluating and improving ML model performance on various tasks.
• Community and Collaboration:
  – Active community sharing models, datasets, and knowledge, with forums and learning resources.
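
The Inference API mentioned above can be exercised with a plain HTTP call. The sketch below is not part of the original notebook; the model name and the HF_TOKEN environment variable are illustrative assumptions.

```python
# Minimal sketch: call the hosted Inference API over HTTP.
# Assumes an access token in the HF_TOKEN environment variable and uses an
# example sentiment model; swap in any model ID from the Hub.
import os
import requests

API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(API_URL, headers=headers,
                         json={"inputs": "I love using Hugging Face!"})
print(response.json())  # e.g. a list of label/score dictionaries
```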

2 Pipeline
The pipeline in Hugging Face’s Transformers library is a high-level abstraction that simplifies the
use of pre-trained models for various natural language processing (NLP) tasks. It allows users to
perform complex tasks with minimal code.

2.0.1 Hugging Face Pipeline API Tasks
• Text Classification: Sentiment analysis, spam detection, etc.
• Named Entity Recognition (NER): Identifying entities like names, dates, and locations
in text.
• Question Answering: Answering questions based on a given context.
• Text Generation: Generating text based on a given prompt (e.g., with GPT-2).
• Translation: Translating text from one language to another.
• Summarization: Generating a summary of a given text.
• Text2Text Generation: Tasks like summarization or translation using models like T5.
• Fill-Mask: Predicting masked words in a sentence (e.g., with BERT).
• Zero-Shot Classification: Classifying text into categories without explicit training on those
categories.
[1]: # ! pip show transformers

[2]: from transformers import pipeline

u:\hugging_face\venv\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[ ]: # Sentiment Analysis
classifier = pipeline('sentiment-analysis')
result = classifier("I hate using Hugging Face transformers!")
print(result)

[ ]: # Named Entity Recognition (NER)


ner = pipeline('ner')
result = ner("My name is atharva and I live in atharva.")
print(result)

[5]: # Question Answering


question_answerer = pipeline('question-answering')
result = question_answerer(question="What is Hugging Face?",
                           context="Hugging Face is a company that provides open-source NLP technologies.")
print(result)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
{'score': 0.6646773219108582, 'start': 16, 'end': 68, 'answer': 'a company that provides open-source NLP technologies'}

[ ]: # Text Generation
generator = pipeline('text-generation')
result = generator("Once upon a time,")
print(result)
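
The remaining tasks from the list above (zero-shot classification, fill-mask, summarization, translation) follow the same pattern. A minimal sketch, not run in this notebook, for two of them:

```python
# Zero-shot classification: classify text against labels the model never saw during training
zero_shot = pipeline('zero-shot-classification')
print(zero_shot("This course teaches Hugging Face basics.",
                candidate_labels=["education", "politics", "sports"]))

# Fill-mask: predict the masked word (BERT-style models use the [MASK] token)
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask("Transformers are incredibly [MASK]."))
```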

2.0.2 AutoTokenizer Class


• Purpose: Automatically selects the appropriate tokenizer for a given model.
• Model Agnostic: Works with any model in the Hugging Face library.
• Simplifies Tokenization: Automatically handles model-specific tokenization nuances.

Key Features:
• Auto Detection: Identifies the correct tokenizer based on the model name or path.
• Easy Initialization:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

• Automatic Addition: The AutoTokenizer automatically adds special tokens (like [CLS], [SEP], etc.) required by the model.
• Purpose: Special tokens are used for tasks like classification, separation of sentences, and padding.
[7]: from transformers import AutoTokenizer

# Specify the model checkpoint


model_checkpoint = 'bert-base-uncased'

# Load the tokenizer associated with the model


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Example text
text = ["Transformers are incredibly powerful.","Transformers are awesome"]

# Tokenize text
tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt')  # 'pt' for PyTorch tensors, 'tf' for TensorFlow tensors

[8]: tokens

[8]: {'input_ids': tensor([[  101, 19081,  2024, 11757,  3928,  1012,   102],
                           [  101, 19081,  2024, 12476,   102,     0,     0]]),
      'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
                                [0, 0, 0, 0, 0, 0, 0]]),
      'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
                                [1, 1, 1, 1, 1, 0, 0]])}

2.0.3 Tokenizer Output Explanation
input_ids:
• Description: These are the token IDs corresponding to the input text.
• Details:
– Each token in the vocabulary of the model is assigned a unique ID.
– Special tokens like [CLS] (start of sequence) and [SEP] (end of sequence) are included.
– Padding tokens (usually 0) are added to ensure all sequences in a batch have the same
length.
• Example:
  – [101, 19081, 2024, 11757, 3928, 1012, 102]: Represents "Transformers are incredibly powerful." with [CLS] (101) at the start and [SEP] (102) at the end.
  – [101, 19081, 2024, 12476, 102, 0, 0]: Represents "Transformers are awesome" with padding tokens (0) added to match the length of the longest sequence.

token_type_ids:
• Description: These indicate the segment to which each token belongs. Used primarily for
tasks involving sentence pairs (e.g., question answering).
• Details:
– For single sentences, all values are 0.
– For sentence pairs, the first sentence tokens are 0 and the second sentence tokens are 1.
• Example:
– [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]: Since both examples are single
sentences, all token type IDs are 0.

attention_mask:
• Description: This indicates which tokens should be attended to (1) and which are just
padding (0).
• Details:
– 1 for actual tokens and 0 for padding tokens.
– Helps the model to ignore the padding tokens during processing.
• Example:
  – [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0]]: Indicates that the first sequence is fully attended to, and the second sequence has padding that should be ignored.
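
To see non-zero token_type_ids, tokenize a sentence pair. This quick check is not one of the original cells; it reuses the bert-base-uncased tokenizer loaded above.

```python
# Sentence-pair tokenization: first-sentence tokens get segment ID 0, second-sentence tokens get 1
pair = tokenizer("What is Hugging Face?", "Hugging Face is an NLP company.")
print(pair['token_type_ids'])  # e.g. [0, 0, ..., 0, 1, 1, ..., 1]
```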
[9]: # Example text
text = "Transformers are incredibly powerful."

# Tokenize the text


tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)

Tokens: ['transformers', 'are', 'incredibly', 'powerful', '.']


Input IDs: [19081, 2024, 11757, 3928, 1012]

The decode method allows us to check how the final output of the tokenizer translates back into
text.
[10]: # Convert tokens to token IDs
token_ids = tokenizer(text, return_tensors='pt')
print("Token IDs:", token_ids)
print(tokenizer.decode(input_ids))

Token IDs: {'input_ids': tensor([[ 101, 19081, 2024, 11757, 3928, 1012,
102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1]])}
transformers are incredibly powerful.

[11]: # To get the token_id for the PAD


print(tokenizer.pad_token_id)
print(tokenizer.pad_token)

0
[PAD]

2.0.4 Model-Specific Tokenizers


Model-specific tokenizers are designed to work with specific pre-trained models, handling tokenization according to the requirements of each model.

Key Features:
• Customization: Tailored to the model’s architecture and vocabulary.
• Special Tokens: Automatically adds model-specific special tokens (e.g., [CLS], [SEP] for
BERT).
• Tokenization: Breaks down text into tokens that the model can process.
• BertTokenizer: Adds [CLS] and [SEP] tokens, lowercases the text, and splits words into
word pieces.
• GPT2Tokenizer: Does not add special tokens by default, uses Byte Pair Encoding (BPE)
for tokenization.
• RobertaTokenizer: Similar to BertTokenizer but designed for the RoBERTa model, which uses a different pre-training strategy and vocabulary.
[12]: from transformers import BertTokenizer

# Initialize the tokenizer


tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a sample text


inputs = tokenizer("Hello, this is an example using BertTokenizer.",␣
↪padding=True, truncation=True, return_tensors='pt')

print(inputs)

{'input_ids': tensor([[  101,  7592,  1010,  2023,  2003,  2019,  2742,  2478, 14324, 18715,
                       18595,  6290,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

[13]: statements = ["Hello, this is an example using GPT2Tokenizer.", "learning huggingface"]

[14]: from transformers import GPT2Tokenizer

# Initialize the tokenizer


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Tokenize a sample text


inputs = tokenizer(statements, padding=True, truncation=True, return_tensors='pt')

print(inputs)

{'input_ids': tensor([[15496,    11,   428,   318,   281,  1672,  1262,   402, 11571,    17,
                       30642,  7509,    13],
                      [40684, 46292,  2550, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
                       50256, 50256, 50256]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

[15]: from transformers import RobertaTokenizer

# Initialize the tokenizer


tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Tokenize a sample text


inputs = tokenizer("Hello, this is an example using RobertaTokenizer.",␣
↪padding=True, truncation=True, return_tensors='pt')

print(inputs)

{'input_ids': tensor([[    0, 31414,     6,    42,    16,    41,  1246,   634,  1738,   102,
                       45643,  6315,     4,     2]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

2.0.5 AutoModel Class


Purpose:

• Automatically selects the appropriate model architecture for a given pre-trained model.

Key Features:
• Model Agnostic: Works with any model in the Hugging Face library.
• Ease of Use: Simplifies loading pre-trained models with a single line of code.
[ ]: from transformers import AutoTokenizer, AutoModel

model_checkpoint = 'bert-base-uncased'

# Initialize the tokenizer


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenize a sample text


inputs = tokenizer("Hello, this is an example using AutoModel.",␣
↪return_tensors='pt')

print(inputs)
print("\n")
# Initialize the model
model = AutoModel.from_pretrained(model_checkpoint)

# Perform a forward pass


outputs = model(**inputs)

# Print the outputs


print(outputs)

2.0.6 1. Extract Word Embeddings


The number 768 represents the hidden size (or dimensionality) of BERT's internal representation.

[17]: token_embedding = outputs.last_hidden_state[0, 1] # 2nd token embedding

[18]: all_token_embeddings = outputs.last_hidden_state[0]  # Shape: (sequence_length, 768)

2.0.7 2. Use Sentence Embedding

[19]: sentence_embedding = outputs.pooler_output # Shape: (1, 768)

The output is not the final predictions here but rather the hidden states or embeddings produced
by the model. These outputs can be used as features for further processing or as input to other
layers (e.g., classification layers).
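
As one illustration of using these hidden states as features, the sketch below (not in the original notebook) compares two sentences by the cosine similarity of their mean-pooled token embeddings, reusing the tokenizer and AutoModel loaded above.

```python
import torch
import torch.nn.functional as F

def embed(text):
    # Mean-pool the last hidden state over the sequence dimension -> shape (1, 768)
    enc = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embed("I like machine learning."),
                                 embed("I enjoy studying ML."))
print(similarity.item())  # closer to 1.0 means more similar embeddings
```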

[20]: # from transformers import AutoModel

# bert_model = AutoModel.from_pretrained('bert-base-uncased')

# print(type(bert_model))
# print(bert_model)

# gpt_model = AutoModel.from_pretrained('gpt2')
# print(type(gpt_model))
# print(gpt_model)

# bart_model = AutoModel.from_pretrained('facebook/bart-large-cnn')
# print(type(bart_model))
# print(bart_model)

2.0.8 Custom Classification Model Using AutoModel


Purpose:
• Use AutoModel and add a custom classification head for specific tasks like sequence classification.
[21]: import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Define a custom classification model


class CustomClassificationModel(nn.Module):
    def __init__(self, model_checkpoint, num_labels):
        super(CustomClassificationModel, self).__init__()
        self.automodel = AutoModel.from_pretrained(model_checkpoint)
        self.classifier = nn.Linear(self.automodel.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.automodel(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

        # Get the hidden state of the [CLS] token
        cls_output = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(cls_output)
        return logits

# Model checkpoint and number of labels


model_checkpoint = 'bert-base-uncased'
num_labels = 2 # Example for binary classification

# Initialize the tokenizer


tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

[22]: inputs1 = tokenizer("Hello, this is an example using a custom classification head. and it is bad", return_tensors='pt')

inputs2 = tokenizer("Hello, this is an example using a custom classification head. and it is very good", return_tensors='pt')

[23]: # Initialize the custom model


model = CustomClassificationModel(model_checkpoint, num_labels)

# Perform a forward pass


logits = model(**inputs2)

# Apply softmax to get probabilities


probabilities = F.softmax(logits, dim=-1)

# Convert probabilities to predicted class labels


predictions = torch.argmax(probabilities, dim=-1)

# Print the probabilities and the predicted class


print(probabilities)

tensor([[0.4308, 0.5692]], grad_fn=<SoftmaxBackward0>)

[24]: predictions = torch.argmax(probabilities, dim=-1)

[25]: predictions

[25]: tensor([1])

2.0.9 AutoModelFor** Classes


Purpose:
• Automatically select the appropriate model architecture with a task-specific head for various
NLP tasks.

Key Features:
• Task-Specific: Each class is designed for a specific NLP task.
• Ease of Use: Simplifies loading and using pre-trained models with the appropriate heads.

Common AutoModelFor Classes:


1. AutoModelForSequenceClassification:
• Used for tasks like text classification and sentiment analysis.
2. AutoModelForCausalLM:
• Used for tasks like language modeling and text generation.
3. AutoModelForTokenClassification:
• Used for tasks like named entity recognition (NER).

AutoModelForSequenceClassification:
[26]: from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch

# Initialize the tokenizer and model


model_checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Tokenize input text


inputs = tokenizer("Hello, this is an example for sequence classification.",␣
↪return_tensors='pt')

# Perform a forward pass


outputs = model(**inputs)
logits = outputs.logits

# Apply softmax to get probabilities


probabilities = F.softmax(logits, dim=-1)
predictions = torch.argmax(probabilities, dim=-1)

# Print the probabilities and predicted class


print(probabilities)
print(predictions)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
tensor([[0.7151, 0.2849]], grad_fn=<SoftmaxBackward0>)
tensor([0])
AutoModelForCausalLM:
[ ]: # AutoModelForCausalLM

from transformers import AutoTokenizer, AutoModelForCausalLM

# Initialize the tokenizer and model


model_checkpoint = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

# Tokenize input text


inputs = tokenizer("Once upon a time", return_tensors='pt')

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)

# Decode and print the generated text


generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n")
print(generated_text)

AutoModelForTokenClassification:
[ ]: from transformers import AutoTokenizer, AutoModelForTokenClassification

# Initialize the tokenizer and model


model_checkpoint = 'dbmdz/bert-large-cased-finetuned-conll03-english'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

# Tokenize input text


inputs = tokenizer("John lives in New York City.", return_tensors='pt')

# Perform a forward pass


outputs = model(**inputs)
logits = outputs.logits

# Get the predictions


predictions = torch.argmax(logits, dim=-1)

# Print the predictions


predicted_tokens = [model.config.id2label[prediction.item()] for prediction in predictions[0]]

print("\n")
print(predicted_tokens)

2.0.10 Model-Specific Classes


Model-specific classes in Hugging Face Transformers are designed to handle specific NLP tasks by
adding the appropriate heads to pre-trained models. These classes simplify the use of models for
tasks like text classification, text generation, and token classification.
BertForSequenceClassification:
[29]: from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the tokenizer and model
model_checkpoint = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_checkpoint)
model = BertForSequenceClassification.from_pretrained(model_checkpoint)

# Define labels (these are examples; adjust based on your actual model's training)
labels = ["Negative", "Positive"]

# Input sentences
sentences = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this"
]

# Tokenize and encode the sentences


inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Perform a forward pass and get logits


outputs = model(**inputs).logits

# Apply softmax to get probabilities


probabilities = torch.nn.functional.softmax(outputs, dim=-1)

# Get the predicted class


predictions = torch.argmax(probabilities, dim=-1)

# Print the probabilities and predicted classes


for i, sentence in enumerate(sentences):
    print("\n")
    print(f"Sentence: {sentence}")
    print(f"Probabilities: {probabilities[i].tolist()}")
    print(f"Predicted Class: {labels[predictions[i]]}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Sentence: I've been waiting for a HuggingFace course my whole life.
Probabilities: [0.48447129130363464, 0.5155287384986877]
Predicted Class: Positive

Sentence: I hate this
Probabilities: [0.4586593806743622, 0.5413405895233154]
Predicted Class: Positive
GPT2LMHeadModel:
[ ]: # GPT2LMHeadModel

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the tokenizer and model


tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Input prompt
prompt = "Once upon a time"

# Tokenize and encode the prompt


inputs = tokenizer(prompt, return_tensors='pt')

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)

# Decode and print the generated text


generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n")
print("Generated Text:", generated_text)

2.0.11 AutoConfig Class


Purpose: The AutoConfig class in Hugging Face Transformers is used to automatically load the
configuration of a pre-trained model. This configuration includes model architecture details and
hyperparameters, which are essential for initializing models correctly.

Key Features:
• Automatic Configuration Loading: Load configurations without specifying the model
class explicitly.
• Customization: Modify model configurations to suit specific needs.
[31]: from transformers import AutoConfig

# Load configuration for a specific model checkpoint


model_checkpoint = 'bert-base-uncased'
config = AutoConfig.from_pretrained(model_checkpoint)

# Print the configuration


print(config)

BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.50.3",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}

[ ]: # Modify the configuration


config.num_labels = 5 # Change the number of labels for classification

# Print the modified configuration


print(config)
config.hidden_act

[ ]: # Using Configuration to Initialize a Model

from transformers import AutoConfig, AutoModelForSequenceClassification

# Load and modify the configuration


model_checkpoint = 'bert-base-uncased'
config = AutoConfig.from_pretrained(model_checkpoint)
config.num_labels = 5 # Change the number of labels for classification

# Initialize the model with the modified configuration


model = AutoModelForSequenceClassification.from_config(config)

# Print the model


print(model)

2.0.12 Model-Specific Configuration Classes
Purpose: Model-specific configuration classes in Hugging Face Transformers are used to define
the architecture and hyperparameters for specific models. These configurations are essential for
initializing models correctly and can be customized to suit specific needs.
[ ]: # BERT Configuration
from transformers import BertConfig, BertForSequenceClassification

# Load and modify the BERT configuration


config = BertConfig.from_pretrained('bert-base-uncased')
config.num_labels = 5 # Change the number of labels for classification

# Initialize the BERT model with the modified configuration


model = BertForSequenceClassification(config)

# Print the model


print(model)

[ ]: # GPT-2 Configuration

from transformers import GPT2Config, GPT2LMHeadModel

# Load and modify the GPT-2 configuration


config = GPT2Config.from_pretrained('gpt2')
config.output_hidden_states = True  # Change the configuration to output hidden states

# Initialize the GPT-2 model with the modified configuration


model = GPT2LMHeadModel(config)

# Print the model


print(model)

[ ]: # DistilBERT Configuration

from transformers import DistilBertConfig, DistilBertForTokenClassification

# Load and modify the DistilBERT configuration


config = DistilBertConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 9 # Change the number of labels for NER

# Initialize the DistilBERT model with the modified configuration


model = DistilBertForTokenClassification(config)

# Print the model


print(model)

2.0.13 Dataset Class
Purpose: The Dataset class in the Hugging Face datasets library is used to handle and manipulate datasets efficiently. It supports a wide range of operations for loading, processing, and transforming datasets, making it easier to prepare data for machine learning models.

Key Features:
• Loading Datasets: Load datasets from local files or the Hugging Face Hub.
• Processing: Apply various preprocessing and transformation functions.
• Splitting: Split datasets into training, validation, and test sets.
• Batching: Efficiently batch data for model training and evaluation.

Important Methods:
1. Loading a Dataset:
• load_dataset(): Loads a dataset from a local file or the Hugging Face Hub.
[37]: # %%capture
# !pip install datasets

[38]: from datasets import load_dataset

# Load a dataset from the Hugging Face Hub


dataset = load_dataset('imdb')

[39]: dataset

[39]: DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})

[40]: train_subset = dataset['train'].select(range(10000))

[41]: train_subset

[41]: Dataset({
features: ['text', 'label'],

num_rows: 10000
})

[42]: import pandas as pd

pd.DataFrame(dataset['train'][0], index=[0])

[42]:                                                text  label
      0  I rented I AM CURIOUS-YELLOW from my video sto…     0

[43]: import pandas as pd

[44]: pd.DataFrame(dataset['train'][0:3])

[44]:                                                text  label
      0  I rented I AM CURIOUS-YELLOW from my video sto…     0
      1  "I Am Curious: Yellow" is a risible and preten…     0
      2  If only to avoid making this type of film in t…     0

[45]: train_subset = dataset['train'].select(range(10000))

[46]: train_subset

[46]: Dataset({
features: ['text', 'label'],
num_rows: 10000
})

[47]: dataset['train'].features

[47]: {'text': Value(dtype='string', id=None),


'label': ClassLabel(names=['neg', 'pos'], id=None)}

train_test_split(): Splits a dataset into training and test sets.


[48]: # Split the dataset into training and test sets
split_dataset = dataset['train'].train_test_split(test_size=0.1)
train_data = split_dataset['train']
test_data = split_dataset['test']

map(): Applies a function to all examples in the dataset.


[49]: # Define a preprocessing function
def preprocess_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# Assign a padding token if not set


if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else '[PAD]'
    tokenizer.add_special_tokens({'pad_token': tokenizer.pad_token})

# Apply the preprocessing function to the dataset


tokenized_dataset = dataset['train'].map(preprocess_function, batched=True)

DataLoader: Used to create batches of data for training and evaluation.


[ ]: from torch.utils.data import DataLoader

# Create a DataLoader for the training data


train_dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True)

# Print the first batch


for batch in train_dataloader:
    print(batch)
    break

[ ]: import pprint
# Print the first batch
for batch in train_dataloader:
    pprint.pprint(batch)
    break

2.0.14 Dataset Class Examples


Purpose: The load_dataset function can be used to load datasets from local files or directly
from the internet. This flexibility makes it easy to work with various data sources.

Example 1: Loading a Dataset from a Local File Path

```python
from datasets import load_dataset

# Specify the file path (assuming a CSV file format)
file_path = '/path/to/your/local/file.csv'

# Load the dataset from the local file
dataset = load_dataset('csv', data_files=file_path)

# Print the first example
print(dataset['train'][0])
```

[52]: ## Loading a Dataset from the Internet

from datasets import load_dataset

# Specify the URL of the dataset (assuming a CSV file format)


url = 'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv'

# Load the dataset from the URL


dataset = load_dataset('csv', data_files=url)

# Print the first example


print(dataset['train'][0])

{'Month': 'JAN', ' "1958"': 340, ' "1959"': 360, ' "1960"': 417}

2.0.15 DataCollator Class


Purpose: The DataCollator class in the Hugging Face Transformers library is used to collate
data into batches and prepare them for model input. It handles various preprocessing tasks such
as padding, masking, and formatting to ensure that the input data is compatible with the model
requirements.

Key Features:
• Padding: Ensures that all sequences in a batch have the same length by adding padding
tokens.
• Masking: Creates attention masks to distinguish between real tokens and padding tokens.
• Formatting: Prepares data in the correct format required by the model.
1. DataCollatorWithPadding:
• Automatically pads the sequences in a batch to the same length.

3 What is dynamic padding?


Unlike static padding where every sequence in the dataset is padded to the maximum length found
in the dataset, dynamic padding adjusts the length of sequences within each batch to the longest
sequence in that batch. This minimizes the amount of padding used and can reduce computation
time significantly.
By only padding to the longest sequence in a batch, dynamic padding reduces the number of
unnecessary operations (like processing padding tokens) that the model has to perform. This can
lead to faster training times and more efficient memory usage, as less padding means fewer data
points to process.
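
To make the difference concrete, the small sketch below (not one of the original cells) tokenizes the same two sentences with static and with dynamic padding and compares the resulting shapes:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
texts = ["I've been waiting for a HuggingFace course my whole life.", "I hate this"]

# Static padding: every sequence is padded to max_length
static = tok(texts, padding='max_length', max_length=512, truncation=True, return_tensors='pt')

# Dynamic padding: sequences are padded only to the longest one in this batch
dynamic = tok(texts, padding=True, truncation=True, return_tensors='pt')

print(static['input_ids'].shape)   # torch.Size([2, 512])
print(dynamic['input_ids'].shape)  # torch.Size([2, 16]) -- 16 is the length of the longer sentence here
```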
[53]: from transformers import DataCollatorWithPadding, AutoTokenizer
from torch.utils.data import DataLoader

# Initialize the tokenizer


tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define a simple dataset
dataset = [
    {"text": "I've been waiting for a HuggingFace course my whole life."},
    {"text": "I hate this"}
]

# Tokenize the dataset


tokenized_dataset = [tokenizer(data['text'], truncation=True) for data in dataset]

# Initialize the DataCollator


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create a DataLoader
dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)

# Print the batch


for batch in dataloader:
    print(batch)
    print("\n")

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
                         2607,  2026,  2878,  2166,  1012,   102],
                      [  101,  1045,  5223,  2023,   102,     0,     0,     0,     0,     0,
                            0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}

[54]: # DataCollatorForLanguageModeling
# Prepares data for language modeling tasks by masking tokens.

from transformers import DataCollatorForLanguageModeling, AutoTokenizer


from torch.utils.data import DataLoader

# Initialize the tokenizer


tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define a simple dataset


dataset = [
    {"text": "I've been waiting for a HuggingFace course my whole life."},
    {"text": "I hate this"}
]

# Tokenize the dataset


tokenized_dataset = [tokenizer(data['text'], truncation=True) for data in dataset]

# Initialize the DataCollator with masking


data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Create a DataLoader
dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)

# Print the batch


for batch in dataloader:
    print(batch)

{'input_ids': tensor([[  101,  1045,  1005,  2310,   103,   103,  2005,  1037, 17662, 12172,
                       27589,   103,  2878,  2166,  1012,   102],
                      [  101,  1045,  5223,  2023,   102,     0,     0,     0,     0,     0,
                            0,     0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'labels': tensor([[-100, -100, -100, -100, 2042, 3403, -100, -100, -100, -100, 2607, 2026,
                    -100, -100, -100, -100],
                   [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                    -100, -100, -100, -100]])}

[55]: # # DataCollatorForSeq2Seq
# # Prepares data for sequence-to-sequence tasks such as translation and summarization.

# from transformers import DataCollatorForSeq2Seq, AutoTokenizer, AutoModelForSeq2SeqLM
# from torch.utils.data import DataLoader

# # Initialize the tokenizer and model
# tokenizer = AutoTokenizer.from_pretrained('t5-small')
# model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# # Define a simple dataset
# dataset = [
#     {"text": "translate English to French: HuggingFace is a great library."},
#     {"text": "translate English to French: I love programming."}
# ]

# # Tokenize the dataset
# tokenized_dataset = [tokenizer(data['text'], truncation=True) for data in dataset]

# # Initialize the DataCollator
# data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# # Create a DataLoader
# dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)

# # Print the batch
# for batch in dataloader:
#     print(batch)

3.0.1 TrainingArguments Class


Purpose: The TrainingArguments class in the Hugging Face Transformers library is used to specify various hyperparameters and configurations for the training process. It allows you to customize the training setup, including learning rate, batch size, number of epochs, and more.

Key Features:
• Learning Rate: Configure the learning rate for the optimizer.
• Batch Size: Set the batch size for training and evaluation.
• Number of Epochs: Specify the number of training epochs.
• Logging: Enable logging of training metrics and save logs to a specified directory.
• Evaluation Strategy: Configure when to perform evaluation during training.
• Checkpointing: Set up model checkpointing to save the model at specified intervals.

Example Parameters:
• output_dir: Directory to save the model checkpoints.
• evaluation_strategy: Evaluation strategy to use during training ("steps" or "epoch"); newer Transformers versions rename this to eval_strategy.
• learning_rate: Learning rate for the optimizer.
• per_device_train_batch_size: Batch size for training.
• per_device_eval_batch_size: Batch size for evaluation.
• num_train_epochs: Number of training epochs.
• weight_decay: Weight decay for the optimizer.
• logging_dir: Directory to save the logs.
• logging_steps: Log training metrics every specified number of steps.

Example Usage:

```python
from transformers import TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',              # Directory to save the model checkpoints
    evaluation_strategy='epoch',         # Evaluate at the end of every epoch
    learning_rate=2e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    num_train_epochs=3,                  # Number of training epochs
    weight_decay=0.01,                   # Weight decay for the optimizer
    logging_dir='./logs',                # Directory to save the logs
    logging_steps=10,                    # Log training metrics every 10 steps
    save_steps=500,                      # Save model checkpoint every 500 steps
    save_total_limit=2,                  # Limit the total number of checkpoints
)
```

3.0.2 Trainer API


Purpose:
• Simplifies the process of training and evaluating models.
• Provides a flexible and extensible framework for various training and evaluation tasks.

Key Features:
• Training: Handles the training loop, including forward and backward passes, optimizer steps,
and learning rate scheduling.
• Evaluation: Supports model evaluation on validation and test sets.
• Data Loading: Integrates seamlessly with DataLoader and Dataset objects.
• Logging: Provides logging and tracking of training metrics.
• Checkpointing: Supports model checkpointing to save and load models during training.

Important Methods:
• train(): Starts the training loop.
• evaluate(): Evaluates the model on a given dataset.
• predict(): Generates predictions on a given dataset.
• save_model(): Saves the model and tokenizer to disk.
• log(): Logs training metrics.
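
The train(), evaluate(), and save_model() methods are used in the fine-tuning notebook that follows; predict() is not, so here is a hedged sketch using the names from that notebook (trainer, test_tokenized_dataset):

```python
# predict() runs the model over a dataset and returns predictions, label_ids, and metrics
prediction_output = trainer.predict(test_tokenized_dataset)
print(prediction_output.metrics)            # e.g. {'test_loss': ..., 'test_runtime': ...}
print(prediction_output.predictions.shape)  # (num_examples, num_labels) logits

# save_model() writes the model (and tokenizer, if the Trainer was given one) to disk
trainer.save_model('./results/final-model')
```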

Fine-tuning BERT on IMDB with the Trainer API

April 2, 2025

[1]: import torch


torch.cuda.empty_cache() # Clears unused memory
torch.cuda.reset_peak_memory_stats()

[2]: from transformers import AutoTokenizer


from datasets import load_dataset

# Load the dataset
dataset = load_dataset('imdb')

# Split the dataset into training and test sets
split_dataset = dataset['train'].train_test_split(test_size=0.1)

# Keep a small subset to speed up training and evaluation
train_data = split_dataset["train"].select(range(1000))
test_data = split_dataset["test"].select(range(1000))

model_checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

# Apply tokenization and remove the original text column


train_tokenized_dataset = train_data.map(tokenize_function, batched=True, remove_columns=['text'])
test_tokenized_dataset = test_data.map(tokenize_function, batched=True, remove_columns=['text'])

# Check dataset structure
print(train_tokenized_dataset)  # Should contain input_ids, attention_mask, and label
print("\n")
print(test_tokenized_dataset)   # Should contain input_ids, attention_mask, and label

u:\hugging_face\venv\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Map: 100%|██████████| 1000/1000 [00:00<00:00, 1835.77 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 2982.79 examples/s]
Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1000
})

Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1000
})

[3]: print(train_tokenized_dataset.column_names)
print(test_tokenized_dataset.column_names)

['label', 'input_ids', 'token_type_ids', 'attention_mask']


['label', 'input_ids', 'token_type_ids', 'attention_mask']

[4]: train_tokenized_dataset = train_tokenized_dataset.select_columns(["input_ids", "attention_mask", "token_type_ids", "label"])

test_tokenized_dataset = test_tokenized_dataset.select_columns(["input_ids", "attention_mask", "token_type_ids", "label"])

[5]: train_tokenized_dataset

[5]: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'label'],
num_rows: 1000
})

[6]: test_tokenized_dataset

[6]: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'label'],
num_rows: 1000
})

[7]: # print(train_tokenized_dataset.keys())
# print(test_tokenized_dataset.keys())

[8]: import torch

torch.cuda.empty_cache()              # Clears unused memory
torch.cuda.reset_peak_memory_stats()  # Resets peak GPU memory statistics

[9]: from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

from datasets import load_dataset


import numpy as np
from sklearn.metrics import accuracy_score

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
model.to("cuda")

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    eval_steps=2,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=2,
    fp16=True,
    gradient_accumulation_steps=4
)

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=test_tokenized_dataset,
    compute_metrics=compute_metrics
)

# Train the model


trainer.train()

# Evaluate the model


eval_results = trainer.evaluate()

# Print evaluation results


print(f"Evaluation results: {eval_results}")
trainer.save_model()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
u:\hugging_face\venv\lib\site-packages\transformers\training_args.py:1611: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of Transformers. Use `eval_strategy` instead
  warnings.warn(
Evaluation results: {'eval_loss': 0.6541457772254944, 'eval_accuracy': 0.734,
'eval_runtime': 127.8558, 'eval_samples_per_second': 7.821,
'eval_steps_per_second': 0.25, 'epoch': 1.0}

[10]: trainer.save_model()

[15]: import matplotlib.pyplot as plt

# Extract training loss and evaluation loss from log history


train_steps = []
train_loss = []
eval_steps = []
eval_loss = []

for entry in trainer.state.log_history:
    if "loss" in entry:       # Training loss
        train_steps.append(entry["step"])
        train_loss.append(entry["loss"])
    if "eval_loss" in entry:  # Validation loss
        eval_steps.append(entry["step"])
        eval_loss.append(entry["eval_loss"])

# Plot Training Loss vs. Steps


plt.figure(figsize=(10, 5))
plt.plot(train_steps, train_loss, label="Training Loss", color="blue")
plt.plot(eval_steps, eval_loss, label="Validation Loss", color="red")
plt.xlabel("Steps")
plt.ylabel("Loss")
plt.title("Training & Validation Loss over Steps")
plt.legend()
plt.show()

