huggingface basics
April 2, 2025
2 Pipeline
The pipeline in Hugging Face’s Transformers library is a high-level abstraction that simplifies the
use of pre-trained models for various natural language processing (NLP) tasks. It allows users to
perform complex tasks with minimal code.
2.0.1 Hugging Face Pipeline API Tasks
• Text Classification: Sentiment analysis, spam detection, etc.
• Named Entity Recognition (NER): Identifying entities like names, dates, and locations
in text.
• Question Answering: Answering questions based on a given context.
• Text Generation: Generating text based on a given prompt (e.g., with GPT-2).
• Translation: Translating text from one language to another.
• Summarization: Generating a summary of a given text.
• Text2Text Generation: Tasks like summarization or translation using models like T5.
• Fill-Mask: Predicting masked words in a sentence (e.g., with BERT).
• Zero-Shot Classification: Classifying text into categories without explicit training on those
categories.
[1]: # ! pip show transformers
[ ]: # Sentiment Analysis
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier("I hate using Hugging Face transformers!")
print(result)
[ ]: # Text Generation
generator = pipeline('text-generation')
result = generator("Once upon a time,")
print(result)
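The task list above also includes zero-shot classification, which follows the same pipeline pattern; a minimal sketch (the input sentence and candidate labels below are illustrative):
[ ]: # Zero-Shot Classification
classifier = pipeline('zero-shot-classification')
result = classifier(
    "Hugging Face makes it easy to share models.",
    candidate_labels=["technology", "sports", "politics"]
)
print(result)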
AutoTokenizer:
Key Features:
• Auto Detection: Identifies the correct tokenizer based on the model name or path.
• Easy Initialization: from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
• Automatic Addition: The AutoTokenizer automatically adds special tokens (like [CLS],
[SEP], etc.) required by the model.
• Purpose: Special tokens are used for tasks like classification, separation of sentences, and
padding.
[7]: from transformers import AutoTokenizer

# Load the tokenizer that matches the pre-trained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text
text = ["Transformers are incredibly powerful.", "Transformers are awesome"]

# Tokenize text
tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt')  # 'pt' for PyTorch tensors, 'tf' for TensorFlow tensors
[8]: tokens
[8]: {'input_ids': tensor([[  101, 19081,  2024, 11757,  3928,  1012,   102],
        [  101, 19081,  2024, 12476,   102,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0]])}
2.0.3 Tokenizer Output Explanation
input_ids:
• Description: These are the token IDs corresponding to the input text.
• Details:
– Each token in the vocabulary of the model is assigned a unique ID.
– Special tokens like [CLS] (start of sequence) and [SEP] (end of sequence) are included.
– Padding tokens (usually 0) are added to ensure all sequences in a batch have the same
length.
• Example:
– [101, 19081, 2024, 11757, 3928, 1012, 102]: Represents “Transformers are incredibly powerful.” with [CLS] (101) at the start and [SEP] (102) at the end.
– [101, 19081, 2024, 12476, 102, 0, 0]: Represents “Transformers are awesome” with padding tokens (0) added to match the length of the longest sequence.
token_type_ids:
• Description: These indicate the segment to which each token belongs. Used primarily for
tasks involving sentence pairs (e.g., question answering).
• Details:
– For single sentences, all values are 0.
– For sentence pairs, the first sentence tokens are 0 and the second sentence tokens are 1.
• Example:
– [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]]: Since both examples are single
sentences, all token type IDs are 0.
attention_mask:
• Description: This indicates which tokens should be attended to (1) and which are just
padding (0).
• Details:
– 1 for actual tokens and 0 for padding tokens.
– Helps the model to ignore the padding tokens during processing.
• Example:
– [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0, 0]]: Indicates that the first sequence is fully attended to, and the second sequence has padding that should be ignored.
[9]: # Example text
text = "Transformers are incredibly powerful."
The decode method allows us to check how the final output of the tokenizer translates back into
text.
[10]: # Convert tokens to token IDs
token_ids = tokenizer(text, return_tensors='pt')
print("Token IDs:", token_ids)

# Decode back to text (special tokens skipped) and inspect the padding token
print(tokenizer.decode(token_ids['input_ids'][0], skip_special_tokens=True))
print(tokenizer.pad_token_id)
print(tokenizer.pad_token)
Token IDs: {'input_ids': tensor([[ 101, 19081, 2024, 11757, 3928, 1012,
102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1]])}
transformers are incredibly powerful.
0
[PAD]
Key Features:
• Customization: Tailored to the model’s architecture and vocabulary.
• Special Tokens: Automatically adds model-specific special tokens (e.g., [CLS], [SEP] for
BERT).
• Tokenization: Breaks down text into tokens that the model can process.
• BertTokenizer: Adds [CLS] and [SEP] tokens, lowercases the text, and splits words into
word pieces.
• GPT2Tokenizer: Does not add special tokens by default, uses Byte Pair Encoding (BPE)
for tokenization.
• RobertaTokenizer: Similar to BertTokenizer but designed for the RoBERTa model, which uses a different pre-training strategy and vocabulary.
[12]: from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Example sentence (assumed)
inputs = tokenizer("Hello, this is an example using BertTokenizer.", return_tensors='pt')
print(inputs)
{'input_ids': tensor([[  101,  7592,  1010,  2023,  2003,  2019,  2742,  2478, 14324, 18715,
         18595,  6290,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
AutoModel:
• Automatically selects the appropriate model architecture for a given pre-trained model.
Key Features:
• Model Agnostic: Works with any model in the Hugging Face library.
• Ease of Use: Simplifies loading pre-trained models with a single line of code.
[ ]: from transformers import AutoTokenizer, AutoModel

model_checkpoint = 'bert-base-uncased'

# Tokenize an example sentence (assumed input)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
inputs = tokenizer("Transformers are incredibly powerful.", return_tensors='pt')
print(inputs)
print("\n")

# Initialize the model
model = AutoModel.from_pretrained(model_checkpoint)
The output is not the final predictions here but rather the hidden states or embeddings produced
by the model. These outputs can be used as features for further processing or as input to other
layers (e.g., classification layers).
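A minimal sketch of retrieving those hidden states, assuming the model and inputs from the cell above:
[ ]: import torch

# Forward pass through the base model (no gradients needed for inspection)
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per input token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 7, 768])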
# bert_model = AutoModel.from_pretrained('bert-base-uncased')
# print(type(bert_model))
# print(bert_model)
# gpt_model = AutoModel.from_pretrained('gpt2')
# print(type(gpt_model))
# print(gpt_model)
# bart_model = AutoModel.from_pretrained('facebook/bart-large-cnn')
# print(type(bart_model))
# print(bart_model)
[22]: inputs1 = tokenizer("Hello, this is an example using a custom classification head. and it is bad", return_tensors='pt')
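The predictions shown in the next cell could be produced by a small, hypothetical classification head on top of the base model's [CLS] embedding; a minimal sketch (the head is untrained, so its predicted class is arbitrary):
[ ]: import torch
import torch.nn as nn

# Hypothetical custom head: a single linear layer over the [CLS] embedding
class CustomClassificationHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        return self.classifier(cls_embedding)

head = CustomClassificationHead()

with torch.no_grad():
    outputs = model(**inputs1)                       # hidden states from the base AutoModel
    cls_embedding = outputs.last_hidden_state[:, 0]  # embedding of the [CLS] token
    logits = head(cls_embedding)
    predictions = torch.argmax(logits, dim=-1)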
[25]: predictions
[25]: tensor([1])
Key Features:
• Task-Specific: Each class is designed for a specific NLP task.
• Ease of Use: Simplifies loading and using pre-trained models with the appropriate heads.
AutoModelForSequenceClassification:
[26]: from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch.nn.functional as F
import torch
# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)
AutoModelForTokenClassification:
[ ]: from transformers import AutoTokenizer, AutoModelForTokenClassification
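A minimal sketch of token classification (NER) with these classes, assuming a publicly available NER checkpoint; the input sentence is illustrative:
[ ]: import torch

# Assumed checkpoint fine-tuned for NER; any token-classification model works here
model_checkpoint = 'dbmdz/bert-large-cased-finetuned-conll03-english'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

inputs = tokenizer("Hugging Face is based in New York City.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring label for each token, mapped back to label names
predicted_ids = torch.argmax(outputs.logits, dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
predicted_tokens = [(tok, model.config.id2label[i.item()]) for tok, i in zip(tokens, predicted_ids)]
print("\n")
print(predicted_tokens)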
# Load the tokenizer and model
from transformers import BertTokenizer, BertForSequenceClassification

model_checkpoint = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_checkpoint)
model = BertForSequenceClassification.from_pretrained(model_checkpoint)

# Define labels (these are examples; adjust based on your actual model's training)
labels = ["Negative", "Positive"]

# Input sentences
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this"
]
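A minimal sketch of the forward pass that could produce the probabilities printed below, using the tokenizer, model, and labels defined above (the classification head is not fine-tuned, so the probabilities stay close to uniform):
[ ]: import torch
import torch.nn.functional as F

for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=-1)[0]
    print("Sentence:", sentence)
    print("Probabilities:", probs.tolist())
    print("Predicted Class:", labels[int(torch.argmax(probs))])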
Sentence: I hate this
Probabilities: [0.4586593806743622, 0.5413405895233154]
Predicted Class: Positive
GPT2LMHeadModel:
[ ]: # GPT2LMHeadModel
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Input prompt
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors='pt')

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
AutoConfig:
Key Features:
• Automatic Configuration Loading: Load configurations without specifying the model
class explicitly.
• Customization: Modify model configurations to suit specific needs.
[31]: from transformers import AutoConfig

config = AutoConfig.from_pretrained('bert-base-uncased')
print(config)
BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.50.3",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
2.0.12 Model-Specific Configuration Classes
Purpose: Model-specific configuration classes in Hugging Face Transformers are used to define
the architecture and hyperparameters for specific models. These configurations are essential for
initializing models correctly and can be customized to suit specific needs.
[ ]: # BERT Configuration
from transformers import BertConfig, BertForSequenceClassification
[ ]: # GPT-2 Configuration
[ ]: # DistilBERT Configuration
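A minimal sketch of customizing a model-specific configuration (the override values below are illustrative):
[ ]: from transformers import BertConfig, BertForSequenceClassification

# Load the default BERT config and override a few hyperparameters
config = BertConfig.from_pretrained('bert-base-uncased', num_labels=3, hidden_dropout_prob=0.2)

# Build a model from the customized config; the new classification head is randomly initialized
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)
print(model.config.num_labels)  # 3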
2.0.13 Dataset Class
Purpose: The Dataset class in the Hugging Face datasets library is used to handle and manipulate datasets efficiently. It supports a wide range of operations for loading, processing, and
transforming datasets, making it easier to prepare data for machine learning models.
Key Features:
• Loading Datasets: Load datasets from local files or the Hugging Face Hub.
• Processing: Apply various preprocessing and transformation functions.
• Splitting: Split datasets into training, validation, and test sets.
• Batching: Efficiently batch data for model training and evaluation.
Important Methods:
1. Loading a Dataset:
• load_dataset(): Loads a dataset from a local file or the Hugging Face Hub.
[37]: # %%capture
# !pip install datasets
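The DatasetDict inspected in the next cell matches the split sizes of the IMDB reviews dataset; a minimal sketch of loading it (the dataset name is an assumption):
[ ]: from datasets import load_dataset

# Assumed: the IMDB reviews dataset (25k train / 25k test / 50k unsupervised)
dataset = load_dataset("imdb")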
[39]: dataset
[39]: DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})
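One plausible way to build the 10,000-example subset inspected next (the shuffle seed and exact call are assumptions):
[ ]: # Take a shuffled 10,000-example subset of the training split
train_subset = dataset['train'].shuffle(seed=42).select(range(10000))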
[41]: train_subset
[41]: Dataset({
features: ['text', 'label'],
num_rows: 10000
})
[ ]: import pandas as pd

# View a single training example as a one-row DataFrame
pd.DataFrame(dataset['train'][0], index=[0])
[44]: pd.DataFrame(dataset['train'][0:3])
[46]: train_subset
[46]: Dataset({
features: ['text', 'label'],
num_rows: 10000
})
[47]: dataset['train'].features
tokenizer.pad_token = tokenizer.eos_token if tokenizer.eos_token else '[PAD]'
tokenizer.add_special_tokens({'pad_token': tokenizer.pad_token})
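The loop below iterates over a train_dataloader; a minimal sketch of building one from the subset above, assuming dynamic padding with DataCollatorWithPadding:
[ ]: from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Tokenize the subset, dropping the raw text column
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True)

tokenized_train = train_subset.map(tokenize_function, batched=True, remove_columns=['text'])

# Pad each batch dynamically and return PyTorch tensors
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(tokenized_train, batch_size=8, shuffle=True, collate_fn=data_collator)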
[ ]: import pprint
# Print the first batch
for batch in train_dataloader:
pprint.pprint(batch)
break
Example 1: Loading a Dataset from a Local File Path
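A minimal sketch, assuming a local CSV file with 'text' and 'label' columns (the path is a placeholder):
[ ]: from datasets import load_dataset

# Load a dataset from a local CSV file (placeholder path)
local_dataset = load_dataset('csv', data_files='path/to/your_data.csv')
print(local_dataset)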
[52]: ## Loading a Dataset from the Internet
{'Month': 'JAN', ' "1958"': 340, ' "1959"': 360, ' "1960"': 417}
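A minimal sketch of how a record like the one above could be loaded from a remote CSV (the URL is an assumption; its columns match the row shown):
[ ]: from datasets import load_dataset

# Monthly air travel totals, 1958-1960 (assumed source URL)
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
remote_dataset = load_dataset('csv', data_files=url)
print(remote_dataset['train'][0])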
Data Collators:
Key Features:
• Padding: Ensures that all sequences in a batch have the same length by adding padding
tokens.
• Masking: Creates attention masks to distinguish between real tokens and padding tokens.
• Formatting: Prepares data in the correct format required by the model.
1. DataCollatorWithPadding:
• Automatically pads the sequences in a batch to the same length.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define a simple dataset
dataset = [
    {"text": "I've been waiting for a HuggingFace course my whole life."},
    {"text": "I hate this"}
]

# Tokenize each example (no padding yet; the collator pads each batch)
tokenized_dataset = [tokenizer(example["text"]) for example in dataset]

# Create a DataLoader that pads every batch via the collator
dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)
{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
17662, 12172,
2607, 2026, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]]), 'token_type_ids':
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
[54]: # DataCollatorForLanguageModeling
# Prepares data for language modeling tasks by masking tokens.
from transformers import DataCollatorForLanguageModeling

# Randomly mask 15% of the tokens; labels hold the original IDs at masked positions
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Create a DataLoader (reuses the tokenized_dataset defined above)
dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)
{'input_ids': tensor([[ 101, 1045, 1005, 2310, 103, 103, 2005, 1037,
17662, 12172,
27589, 103, 2878, 2166, 1012, 102],
[ 101, 1045, 5223, 2023, 102, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]]), 'token_type_ids':
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask':
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels':
tensor([[-100, -100, -100, -100, 2042, 3403, -100, -100, -100, -100, 2607, 2026,
-100, -100, -100, -100],
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100]])}
[55]: # # DataCollatorForSeq2Seq
# # Prepares data for sequence-to-sequence tasks such as translation and summarization.
# {"text": "translate English to French: HuggingFace is a great library."},
# {"text": "translate English to French: I love programming."}
# ]
# # Create a DataLoader
# dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)
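A minimal sketch of the sequence-to-sequence collator, assuming a T5 checkpoint and illustrative translation pairs (the target strings are made up for the example):
[ ]: from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Tokenize source texts and attach target token IDs as 'labels'
examples = [
    {"text": "translate English to French: HuggingFace is a great library.",
     "target": "HuggingFace est une excellente bibliothèque."},
    {"text": "translate English to French: I love programming.",
     "target": "J'aime programmer."}
]
tokenized_dataset = [
    {**tokenizer(ex["text"]), "labels": tokenizer(ex["target"])["input_ids"]}
    for ex in examples
]

# The collator pads input_ids, attention_mask, and labels (labels are padded with -100)
dataloader = DataLoader(tokenized_dataset, batch_size=2, collate_fn=data_collator)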
TrainingArguments:
Key Features:
• Learning Rate: Configure the learning rate for the optimizer.
• Batch Size: Set the batch size for training and evaluation.
• Number of Epochs: Specify the number of training epochs.
• Logging: Enable logging of training metrics and save logs to a specified directory.
• Evaluation Strategy: Configure when to perform evaluation during training.
• Checkpointing: Set up model checkpointing to save the model at specified intervals.
Example Parameters:
• output_dir: Directory to save the model checkpoints.
• evaluation_strategy: Evaluation strategy to use during training ("steps" or "epoch").
• learning_rate: Learning rate for the optimizer.
• per_device_train_batch_size: Batch size for training.
• per_device_eval_batch_size: Batch size for evaluation.
• num_train_epochs: Number of training epochs.
• weight_decay: Weight decay for the optimizer.
• logging_dir: Directory to save the logs.
• logging_steps: Log training metrics every specified number of steps.
7 Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',              # Directory to save the model checkpoints
    evaluation_strategy='epoch',         # Evaluate at the end of every epoch
    learning_rate=2e-5,                  # Learning rate for the optimizer
    per_device_train_batch_size=8,       # Batch size for training
    per_device_eval_batch_size=8,        # Batch size for evaluation
    num_train_epochs=3,                  # Number of training epochs
    weight_decay=0.01,                   # Weight decay for the optimizer
    logging_dir='./logs',                # Directory to save the logs
    logging_steps=10,                    # Log training metrics every 10 steps
    save_steps=500,                      # Save model checkpoint every 500 steps
    save_total_limit=2,                  # Limit the total number of checkpoints
)
Trainer:
Key Features:
• Training: Handles the training loop, including forward and backward passes, optimizer steps,
and learning rate scheduling.
• Evaluation: Supports model evaluation on validation and test sets.
• Data Loading: Integrates seamlessly with DataLoader and Dataset objects.
• Logging: Provides logging and tracking of training metrics.
• Checkpointing: Supports model checkpointing to save and load models during training.
Important Methods:
• train(): Starts the training loop.
• evaluate(): Evaluates the model on a given dataset.
• predict(): Generates predictions on a given dataset.
• save_model(): Saves the model and tokenizer to disk.
• log(): Logs training metrics.
8jfrffycd
April 2, 2025
from transformers import AutoTokenizer

# split_dataset is assumed to be a DatasetDict with 'train' and 'test' splits
train_data = split_dataset["train"].select(range(1000))
test_data = split_dataset["test"].select(range(1000))

model_checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_tokenized_dataset = train_data.map(tokenize_function, batched=True, remove_columns=['text'])
test_tokenized_dataset = test_data.map(tokenize_function, batched=True, remove_columns=['text'])
print(train_tokenized_dataset)  # Should contain input_ids, attention_mask, and label
print("\n")
Dataset({
features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
num_rows: 1000
})
[3]: print(train_tokenized_dataset.column_names)
print(test_tokenized_dataset.column_names)
train_tokenized_dataset = train_tokenized_dataset.select_columns(["input_ids", "attention_mask", "token_type_ids", "label"])
test_tokenized_dataset = test_tokenized_dataset.select_columns(["input_ids", "attention_mask", "token_type_ids", "label"])
[5]: train_tokenized_dataset
[5]: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'label'],
num_rows: 1000
})
[6]: test_tokenized_dataset
[6]: Dataset({
features: ['input_ids', 'attention_mask', 'token_type_ids', 'label'],
num_rows: 1000
})
[7]: # print(train_tokenized_dataset.keys())
# print(test_tokenized_dataset.keys())
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
model.to("cuda")
training_args = TrainingArguments(
output_dir='./results',
evaluation_strategy='epoch',
eval_steps=2,
learning_rate=2e-5,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
num_train_epochs=1,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=2,
fp16=True,
gradient_accumulation_steps=4
)
import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {"accuracy": accuracy_score(p.label_ids, preds)}
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_tokenized_dataset,
eval_dataset=test_tokenized_dataset,
compute_metrics=compute_metrics
)
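With the Trainer configured, a run would typically be launched and evaluated before the model is saved below; a minimal sketch using the objects defined above:
[ ]: # Fine-tune and then evaluate on the test split
trainer.train()
eval_metrics = trainer.evaluate()
print(eval_metrics)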
[10]: trainer.save_model()
train_loss = []
eval_steps = []
eval_loss = []
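One way these lists could be filled for plotting is from the Trainer's log history (an assumed approach; trainer.state.log_history stores the metrics logged during training):
[ ]: # Collect losses recorded during training and evaluation
for entry in trainer.state.log_history:
    if 'loss' in entry:
        train_loss.append(entry['loss'])
    if 'eval_loss' in entry:
        eval_loss.append(entry['eval_loss'])
        eval_steps.append(entry['step'])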
[ ]: