DATA SCIENCE LAB
1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray
Here's how to create each of these types of NumPy arrays:
1. Basic ndarray
python
import numpy as np
# Basic ndarray
arr = np.array([1, 2, 3, 4, 5])
print(arr)
2. Array of Zeros
# Array of zeros
zeros_arr = np.zeros((3, 3)) # 3x3 matrix of zeros
print(zeros_arr)
3. Array of Ones
python
# Array of ones
ones_arr = np.ones((2, 4)) # 2x4 matrix of ones
print(ones_arr)
4. Random Numbers in ndarray
To create an array of random numbers drawn uniformly from [0, 1):
python
# Random numbers in ndarray
random_arr = np.random.rand(3, 3) # 3x3 matrix of random numbers
print(random_arr)
python
# Random integers in ndarray
random_int_arr = np.random.randint(0, 10, size=(2, 5)) # 2x5 matrix of integers from 0 to 9 (upper bound is exclusive)
print(random_int_arr)
5. An Array of Your Choice
You can create an array from any specific sequence or data you want. Here's an example of a custom array:
python
# An array of your choice
custom_arr = np.array([10, 20, 30, 40])
print(custom_arr)
6. Identity Matrix in NumPy
np.eye() creates an identity matrix: ones on the main diagonal and zeros elsewhere.
python
# Identity matrix
identity_matrix = np.eye(4) # 4x4 identity matrix
print(identity_matrix)
7. Evenly Spaced ndarray
np.linspace() returns a fixed number of evenly spaced values over a closed interval.
python
# Evenly spaced ndarray
evenly_spaced_arr = np.linspace(0, 10, 5) # 5 values from 0 to 10 (inclusive)
print(evenly_spaced_arr)
You can also use np.arange to create an array with a specified step:
python
# Evenly spaced ndarray using np.arange
even_arr = np.arange(0, 10, 2) # Values from 0 up to (but not including) 10 with step 2
print(even_arr)
Each of these snippets demonstrates how to create a different type of array in NumPy; adjust the dimensions and values to suit your requirements.
2. The Shape and Reshaping of NumPy Arrays
a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array
a. Dimensions of NumPy Array
The number of dimensions (axes) of a NumPy array can be obtained using the .ndim attribute.
python
import numpy as np
# A 2x3 array used throughout this section
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Dimensions of the array:", arr.ndim)
b. Shape of NumPy Array
The shape of a NumPy array is a tuple giving its size along each dimension (rows, columns, etc.). It can be accessed using the .shape attribute.
python
# Get the shape of the array
print("Shape of the array:", arr.shape)
For the above example, it will output (2, 3), indicating that the array has 2 rows and 3 columns.
c. Size of NumPy Array
The size of a NumPy array refers to the total number of elements in the array, which can be
obtained using the .size attribute.
python
# Get the size of the array
print("Size of the array:", arr.size)
This will give you the total number of elements in the array. For example, in a 2x3 array, the size
will be 6.
d. Reshaping a NumPy Array
You can reshape an array using the .reshape() method. This changes the shape of the array without changing its data.
python
# Reshape the array
reshaped_arr = arr.reshape(3, 2) # Reshaping to a 3x2 array
print("Reshaped array:")
print(reshaped_arr)
Note: The total number of elements must stay the same when reshaping. For example, if the
original array has 6 elements, the reshaped array must also have 6 elements (e.g., 2x3, 3x2).
e. Flattening a NumPy Array
Flattening converts a multidimensional array into a one-dimensional array. This can be done using .flatten() or .ravel().
python
# Flatten the array
flattened_arr = arr.flatten()
print("Flattened array:", flattened_arr)
Alternatively, .ravel() also flattens the array, but it returns a view of the original array whenever possible (avoiding a copy), whereas .flatten() always returns a copy.
python
# Flatten the array using ravel
raveled_arr = arr.ravel()
print("Raveled array:", raveled_arr)
f. Transpose of a NumPy Array
The transpose of an array is obtained by swapping rows and columns. This can be done using .T.
python
# Transpose of the array
transposed_arr = arr.T
print("Transposed array:")
print(transposed_arr)
Example Walkthrough:
python
import numpy as np
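# The rest of the walkthrough was lost in this copy; a sketch consistent with the output below
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Dimensions of the array:", arr.ndim)
print("Shape of the array:", arr.shape)
print("Size of the array:", arr.size)
print("Reshaped array:")
print(arr.reshape(3, 2))
print("Flattened array:", arr.flatten())
print("Transposed array:")
print(arr.T)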
Output:
Dimensions of the array: 2
Shape of the array: (2, 3)
Size of the array: 6
Reshaped array:
[[1 2]
[3 4]
[5 6]]
Flattened array: [1 2 3 4 5 6]
Transposed array:
[[1 4]
[2 5]
[3 6]]
3. Expanding, Squeezing and Sorting a NumPy Array
a. Expanding a NumPy Array
Expanding a NumPy array means increasing its number of dimensions by adding new axes. The function np.expand_dims() is commonly used for this; it adds a new axis at a specified position.
Example:
python
import numpy as np
# Creating a 1D array
arr = np.array([1, 2, 3])
# Add a new axis at position 0: shape changes from (3,) to (1, 3)
expanded_arr = np.expand_dims(arr, axis=0)
print("Expanded array shape:", expanded_arr.shape)
b. Squeezing a NumPy Array
Squeezing a NumPy array removes dimensions of size 1 from its shape. This is done using the np.squeeze() function.
Example:
python
import numpy as np
# A 3D array with two size-1 dimensions, shape (1, 3, 1)
arr = np.array([[[1], [2], [3]]])
print("Original array:")
print(arr)
print("Shape of original array:", arr.shape)
# Remove the size-1 dimensions
squeezed_arr = np.squeeze(arr)
print("Squeezed array:")
print(squeezed_arr)
print("Shape of squeezed array:", squeezed_arr.shape)
Output:
Original array:
[[[1]
[2]
[3]]]
Shape of original array: (1, 3, 1)
Squeezed array:
[1 2 3]
Shape of squeezed array: (3,)
c. Sorting a NumPy Array
np.sort() returns a sorted copy of an array, and np.argsort() returns the indices that would sort it.
python
import numpy as np
arr = np.array([3, 1, 5, 2, 4])  # example values; any permutation of 1-5 gives the output shown
# np.sort() returns a sorted copy
print("Sorted array:", np.sort(arr))
# Reversing the sorted copy gives descending order
print("Sorted array in descending order:", np.sort(arr)[::-1])
# Using np.argsort() to get the indices that would sort the array
sorted_indices = np.argsort(arr)
print("Indices that would sort the array:", sorted_indices)
Output:
Sorted array: [1 2 3 4 5]
Sorted array in descending order: [5 4 3 2 1]
4. Indexing and Slicing of NumPy Arrays
a. Slicing 1-D NumPy arrays
b. Slicing 2-D NumPy arrays
c. Slicing 3-D NumPy arrays
d. Negative slicing of NumPy arrays
a. Slicing 1-D NumPy Arrays
Slicing a 1D array extracts a portion of the array using start, stop, and step values.
Syntax for 1D array slicing:
python
arr[start:stop:step]
Example:
python
import numpy as np
# Creating a 1D array
arr = np.array([10, 20, 30, 40, 50, 60, 70])
print("Array sliced from index 2 to 5:", arr[2:5])
print("Array sliced with step 2:", arr[::2])
print("Array sliced with step -1 (reversed):", arr[::-1])
Output:
Array sliced from index 2 to 5: [30 40 50]
Array sliced with step 2: [10 30 50 70]
Array sliced with step -1 (reversed): [70 60 50 40 30 20 10]
b. Slicing 2-D NumPy Arrays
Slicing a 2D array selects subarrays along both axes (rows and columns).
Example:
python
import numpy as np
# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Slicing the 2D array (stop indices are exclusive)
print("Sliced 2D array (rows 1 to 2 and columns 1 to 2):")
print(arr_2d[1:3, 1:3])
Output:
Sliced 2D array (rows 1 to 2 and columns 1 to 2):
[[5 6]
[8 9]]
c. Slicing 3-D NumPy Arrays
For 3D arrays, you can slice across three dimensions: depth (axis 0), rows (axis 1), and columns (axis 2).
Example:
python
import numpy as np
# Creating a 3D array
arr_3d = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]],
                   [[9, 10], [11, 12]]])
# Slice the first two depth blocks, row 0, all columns
print("Sliced 3D array (depth 0 to 2, row 0, all columns):")
print(arr_3d[0:2, 0, :])
Output:
Sliced 3D array (depth 0 to 2, row 0, all columns):
[[1 2]
[5 6]]
d. Negative Slicing of NumPy Arrays
Negative slicing allows you to slice an array starting from the end rather than the beginning.
Negative indexing is useful when you want to select elements from the end without knowing the
exact size of the array.
Example:
python
import numpy as np
# Creating a 1D array
arr = np.array([10, 20, 30, 40, 50, 60, 70])
print("Last 3 elements using negative indexing:", arr[-3:])
print("All elements except the last 2:", arr[:-2])
print("Reverse the array using negative slicing:", arr[::-1])
Output:
Last 3 elements using negative indexing: [50 60 70]
All elements except the last 2: [10 20 30 40 50]
Reverse the array using negative slicing: [70 60 50 40 30 20 10]
5. Stacking and Concatenating NumPy Arrays
a. Stacking ndarrays
b. Concatenating ndarrays
c. Broadcasting in NumPy Arrays
a. Stacking ndarrays
Stacking joins a sequence of arrays along a new or existing axis: np.stack() adds a new axis, while np.hstack(), np.vstack(), and np.dstack() stack horizontally, vertically, and depth-wise.
Example:
python
import numpy as np
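# The stacking calls were lost in this copy; a sketch with values chosen to match the output below
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Stacked along axis 0:")
print(np.stack((a, b), axis=0))   # new leading axis -> shape (2, 3)
print("Horizontally stacked:")
print(np.hstack((a, b)))          # 1D arrays are joined end to end
print("Vertically stacked:")
print(np.vstack((a, b)))          # rows stacked on top of each other
# Depth-wise stacking uses two 2D arrays here
a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5, 6], [7, 8]])
print("Depth-wise stacked:")
print(np.dstack((a2, b2)))        # pairs elements along a third axis -> shape (2, 2, 2)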
Output:
Stacked along axis 0:
[[1 2 3]
[4 5 6]]
Horizontally stacked:
[1 2 3 4 5 6]
Vertically stacked:
[[1 2 3]
[4 5 6]]
Depth-wise stacked:
[[[1 5]
[2 6]]
[[3 7]
[4 8]]]
b. Concatenating ndarrays
Concatenation joins two or more arrays along an existing axis. NumPy provides np.concatenate() for this operation; you can concatenate along any axis, not just 0 or 1.
Example:
python
import numpy as np
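# The concatenation call was lost in this copy; a sketch matching the output below
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("Concatenated along axis 0:", np.concatenate((a, b), axis=0))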
Output:
Concatenated along axis 0: [1 2 3 4 5 6]
c. Broadcasting in NumPy Arrays
Broadcasting lets NumPy perform elementwise arithmetic on arrays of different shapes. It follows a set of rules to determine whether two arrays can be broadcast together. The key rules are:
1. If the arrays have different numbers of dimensions, the shape of the smaller array is
padded with 1s on the left side until they have the same number of dimensions.
2. If the dimensions of the arrays do not match, broadcasting is possible only if one of the
arrays has a dimension of size 1 in that position.
Example of Broadcasting:
python
import numpy as np
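# The original arrays were lost in this copy; a minimal sketch of the rules in action,
# adding a (3,) array B to a (3, 3) array A (B is broadcast across each row of A)
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([1, 2, 3])
print("Result of broadcasting and adding arrays A and B:")
print(A + B)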
Output:
Result of broadcasting and adding arrays A and B:
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]
6. Pandas DataFrames: Creating, Filtering and Adding Columns
a. Creating a DataFrame
A DataFrame is the primary data structure in Pandas, and you can create one from various data sources such as dictionaries, lists, or NumPy arrays.
Example:
python
import pandas as pd
# Data as a dictionary of columns (the original definition was lost; reconstructed from the output below)
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
print("DataFrame created from dictionary:")
print(df)
Output:
DataFrame created from dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
b. Filtering Data with Conditions
You can apply conditions to filter data within a DataFrame, selecting only rows that meet specific criteria, as in the sketch below.
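The filtering code itself was not preserved; a minimal sketch consistent with the output below, reusing the df defined above:
python
# Boolean indexing keeps only the rows where the condition is True
filtered_df = df[df['Age'] > 30]
print("Filtered DataFrame (Age > 30):")
print(filtered_df)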
Output:
Filtered DataFrame (Age > 30):
Name Age City
2 Charlie 35 Chicago
3 David 40 Houston
c. Adding a New Column
You can add a new column to an existing DataFrame by simply assigning values to a new column name.
python
# Creating a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
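# The assignment itself was lost in this copy; a sketch matching the output below
df['Salary'] = [50000, 60000, 70000, 80000]
print("DataFrame with a new column 'Salary':")
print(df)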
Output:
DataFrame with a new column 'Salary':
Name Age City Salary
0 Alice 25 New York 50000
1 Bob 30 Los Angeles 60000
2 Charlie 35 Chicago 70000
3 David 40 Houston 80000
7. Perform the following operations using Pandas
a. Filling NaN with string
b. Sorting based on column values
c. groupby()
a. Filling NaN with a String
In Pandas, you can fill NaN values using the fillna() method. To replace NaN with a string (or any other value):
Example:
python
import pandas as pd
import numpy as np
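# The example body was lost in this copy; a sketch consistent with the output below
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, np.nan],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan]
})
# Replace every NaN with the string 'Unknown'
df_filled = df.fillna('Unknown')
print("DataFrame after filling NaN with 'Unknown':")
print(df_filled)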
Output:
DataFrame after filling NaN with 'Unknown':
Name Age City
0 Alice 25 New York
1 Bob Unknown Los Angeles
2 Charlie 35 Chicago
3 David Unknown Unknown
b. Sorting Based on Column Values
You can sort a DataFrame by one or more columns using the sort_values() method.
Example:
python
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
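# The sorting call was lost in this copy; a sketch matching the output below
sorted_df = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by 'Age' in ascending order:")
print(sorted_df)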
Output:
DataFrame sorted by 'Age' in ascending order:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
c. groupby() in Pandas
The groupby() function in Pandas is used to group data based on one or more columns and then
apply an aggregate function to the grouped data. Common operations include summing,
averaging, or counting values in each group.
Example:
python
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Los Angeles']
})
# Grouping the DataFrame by 'City' and calculating the average age for each city
grouped_df = df.groupby('City')['Age'].mean()
print("Average age for each city:")
print(grouped_df)
Output:
Average age for each city:
City
Chicago 35.0
Houston 40.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64
8. Read the following file formats using Pandas
a. Text files
b. CSV files
c. Excel files
d. JSON files
a. Reading Text Files
You can read a text file into a DataFrame using pd.read_csv(). Text files can have custom delimiters (spaces, tabs, or others). If the text file is space-delimited, use the delim_whitespace parameter.
Example:
python
import pandas as pd
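# The example call was lost in this copy; a sketch assuming a whitespace-delimited file named 'file.txt'
df_text = pd.read_csv('file.txt', delim_whitespace=True)
print(df_text)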
Parameters:
o delim_whitespace=True: allows Pandas to treat any run of whitespace as a delimiter.
o You can also use sep=' ', sep='\t', or any other custom delimiter for more control.
b. Reading CSV Files
CSV (Comma-Separated Values) is the most common tabular data format. Pandas provides pd.read_csv() to read CSV files.
Example:
python
import pandas as pd
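# The example call was lost in this copy; a sketch assuming a file named 'file.csv'
df_csv = pd.read_csv('file.csv')
print(df_csv.head())  # Show the first five rows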
Parameters:
o sep=',': Specifies the delimiter (default is comma).
o header=0: Row number to use as column names (default is 0).
o index_col: To specify which column to use as the index.
c. Reading Excel Files
Pandas can read .xls and .xlsx files using the pd.read_excel() function. You'll need the openpyxl library for .xlsx files and xlrd for .xls.
Example:
python
import pandas as pd
# Reading an Excel file (default is sheet_name=0 for the first sheet)
df_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
print(df_excel)
Parameters:
o sheet_name: specifies the sheet to read by name or index. If sheet_name=None, all sheets are read.
o header: defines the row(s) to use as column names.
o usecols: selects specific columns.
d. Reading JSON Files
JSON (JavaScript Object Notation) is commonly used for hierarchical data. Pandas can read JSON files into DataFrames using pd.read_json().
Example:
python
import pandas as pd
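# The example call was lost in this copy; a sketch assuming a file named 'file.json'
df_json = pd.read_json('file.json')
print(df_json)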
Parameters:
o orient: specifies how to interpret the JSON structure. For instance, 'records' treats the file as a list of row dictionaries.
o lines=True: use this option if the file contains one JSON object per line.
Summary of the read calls:
1. Text Files:
python
df_text = pd.read_csv('file.txt', delim_whitespace=True)
2. CSV Files:
python
df_csv = pd.read_csv('file.csv') # For comma-separated
df_csv = pd.read_csv('file.csv', sep=';') # For semicolon-separated
3. Excel Files:
python
df_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
df_excel_all = pd.read_excel('file.xlsx', sheet_name=None) # Read all sheets
4. JSON Files:
python
df_json = pd.read_json('file.json')
df_json_nested = pd.read_json('file.json', orient='records', lines=True)
9. Read the following file formats
a. Pickle files
b. Image files using PIL
c. Multiple files using Glob
d. Importing data from database
a. Reading Pickle Files
Pickle files are used to serialize and deserialize Python objects, making them convenient for saving and loading complex data structures. You can read Pickle files using the pickle module or Pandas (for DataFrames); see the code summary at the end of this section.
b. Reading Image Files using PIL
The Pillow library (a fork of PIL, the Python Imaging Library) allows you to open, manipulate, and save various image formats like PNG, JPEG, and GIF.
Example:
python
from PIL import Image
# Open and display an image (hypothetical filename)
image = Image.open('file.jpg')
image.show()
# Optionally, you can convert the image to grayscale or perform other manipulations
image_gray = image.convert('L')
image_gray.show()
You can also save or manipulate images further using Pillow's various methods.
c. Reading Multiple Files using Glob
The glob module allows you to find all pathnames matching a specified pattern. It is useful for reading multiple files from a directory, such as all .txt or .csv files.
Example:
python
import glob
import pandas as pd
# Find all CSV files in the current directory (the original pattern was lost; '*.csv' is assumed)
csv_files = glob.glob('*.csv')
# Loop through all CSV files and read each into a DataFrame
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    print(df)
glob.glob('pattern') finds all files matching the pattern (e.g., *.txt for all text files); you can then read and process each file as needed.
d. Importing Data from a Database
To import data from a database like SQLite, MySQL, or PostgreSQL, you can use the pandas.read_sql() function. You'll need a database connection, and each type of database requires a different connection method.
Code summary:
1. Reading Pickle Files:
python
import pickle
with open('file.pkl', 'rb') as f:
    data = pickle.load(f)
print(data)
Or, for a pickled DataFrame:
python
import pandas as pd
df = pd.read_pickle('file.pkl')
2. Reading Image Files using PIL:
python
from PIL import Image
image = Image.open('file.jpg')
image.show()
3. Reading Multiple Files using glob:
python
import glob
files = glob.glob('path/to/folder/*.txt')
for file in files:
    with open(file, 'r') as f:
        content = f.read()
        print(content)
4. Importing Data from a Database (SQLite):
python
import sqlite3
import pandas as pd
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
Or with SQLAlchemy (for example, MySQL):
python
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://user:password@host/database')
df = pd.read_sql('SELECT * FROM table_name', engine)
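10. Web Scraping using Python
The scraper's setup was not preserved in this copy; a minimal sketch, assuming the requests and BeautifulSoup libraries, a placeholder URL, and article titles wrapped in <h2> tags:
python
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/articles'  # placeholder URL for illustration
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    article_titles = soup.find_all('h2')  # assumes each article title is inside an <h2> tag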
    # Loop through each article title and print the title and the link
    for title in article_titles:
        link = title.find('a')  # Get the link inside the <h2> tag
        if link:
            title_text = title.get_text()
            article_link = link.get('href')
            print(f"Title: {title_text}")
            print(f"Link: {article_link}")
            print("-" * 40)
else:
    print("Failed to fetch the webpage. Status code:", response.status_code)
11. Perform the following preprocessing techniques on a loan prediction dataset
a. Feature Scaling
b. Feature Standardization
c. Label Encoding
d. One-Hot Encoding
1. Feature Scaling
Feature scaling ensures that features have similar ranges, which is important for algorithms that
rely on the distance between points (e.g., k-nearest neighbors, support vector machines). The
most common techniques for feature scaling are Min-Max Scaling and Standardization.
Min-Max scaling scales the data to a specific range, often [0, 1].
python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Example dataset
data = {'Age': [25, 30, 35, 40, 45],
'Income': [40000, 50000, 60000, 70000, 80000],
'LoanAmount': [100000, 200000, 150000, 120000, 180000]}
df = pd.DataFrame(data)
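# The scaling step was lost in this copy; a sketch that scales Income and LoanAmount to [0, 1]
scaler = MinMaxScaler()
df[['Income', 'LoanAmount']] = scaler.fit_transform(df[['Income', 'LoanAmount']])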
print(df)
Output:
   Age  Income  LoanAmount
0   25    0.00         0.0
1   30    0.25         1.0
2   35    0.50         0.5
3   40    0.75         0.2
4   45    1.00         0.8
2. Feature Standardization
Standardization scales the data to have a mean of 0 and a standard deviation of 1. This is useful
for algorithms like Logistic Regression, SVM, or Linear Regression that assume the data is
normally distributed.
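The code for this step was not preserved; a sketch using scikit-learn's StandardScaler on the same Age/Income/LoanAmount df:
python
from sklearn.preprocessing import StandardScaler
# Standardize Income and LoanAmount to mean 0 and standard deviation 1
scaler = StandardScaler()
df[['Income', 'LoanAmount']] = scaler.fit_transform(df[['Income', 'LoanAmount']])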
print(df)
Output:
   Age    Income  LoanAmount
0   25 -1.414214   -1.355815
1   30 -0.707107    1.355815
2   35  0.000000    0.000000
3   40  0.707107   -0.813489
4   45  1.414214    0.813489
3. Label Encoding
Label encoding is used when the target variable is categorical and has a natural order (like
"Low", "Medium", "High"). It encodes categories as integers.
python
from sklearn.preprocessing import LabelEncoder
# Example loan-status labels (the original list was lost; these are illustrative)
status = ['Approved', 'Denied', 'Approved', 'Denied', 'Approved']
# Initialize LabelEncoder and encode the labels as integers
encoder = LabelEncoder()
encoded_status = encoder.fit_transform(status)
print(encoded_status) # Output: [0 1 0 1 0]
In this case (LabelEncoder assigns integer codes in alphabetical order of the classes):
Approved -> 0
Denied -> 1
4. One-Hot Encoding
One-hot encoding is used when categorical features have no ordinal relationship (like Gender or MaritalStatus). It converts a categorical variable into binary columns (1 or 0), one column for each category, as sketched below.
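The encoding code was not preserved; a sketch using pd.get_dummies() on a hypothetical Marital_Status column (older Pandas versions print 1/0 as below; recent ones print True/False unless dtype=int is passed):
python
import pandas as pd
df = pd.DataFrame({'Marital_Status': ['Single', 'Married', 'Single', 'Married', 'Divorced']})
# One binary column per category
df_encoded = pd.get_dummies(df, columns=['Marital_Status'])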
print(df_encoded)
Output:
Marital_Status_Divorced Marital_Status_Married Marital_Status_Single
0 0 0 1
1 0 1 0
2 0 0 1
3 0 1 0
4 1 0 0
The feature Marital_Status is converted into three binary columns (Marital_Status_Single,
Marital_Status_Married, Marital_Status_Divorced).
12. Perform the following visualizations using Matplotlib
a. Bar Graph
b. Pie Chart
c. Box Plot
d. Histogram
e. Line Chart and Subplots
f. Scatter Plot
1. Bar Graph
A bar graph is useful to represent categorical data with rectangular bars where the length of the
bar represents the value.
python
import matplotlib.pyplot as plt
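# The plotting code was lost in this copy; a minimal sketch with hypothetical category data
categories = ['A', 'B', 'C', 'D']
values = [10, 24, 36, 18]
plt.bar(categories, values, color='skyblue')
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Graph')
plt.show()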
2. Pie Chart
A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical
proportions.
python
# Data for the pie chart
labels = ['Apple', 'Banana', 'Cherry', 'Date']
sizes = [35, 25, 20, 20]
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
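# The plt.pie() call was lost in this copy; a sketch using the data above
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')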
# Adding a title
plt.title('Fruit Distribution')
# Displaying the plot
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle
plt.show()
3. Box Plot
A box plot (or box-and-whisker plot) is used to represent the distribution of numerical data based
on a five-number summary: minimum, first quartile, median, third quartile, and maximum.
python
import numpy as np
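# The plotting code was lost in this copy; a minimal sketch with random data
data = np.random.randn(100)
plt.boxplot(data)
plt.title('Box Plot')
plt.show()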
4. Histogram
A histogram is used to represent the distribution of numerical data. It groups the data into bins
and counts the number of data points in each bin.
python
# Random data for the histogram
data = np.random.randn(1000)
# Creating a histogram
plt.hist(data, bins=30, color='orange', edgecolor='black')
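# Labels and display (a plausible completion; the original lines were lost)
plt.title('Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()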
5. Line Chart and Subplots
A line chart is useful for showing data trends over a continuous range (e.g., a time series). Subplots allow multiple plots to be displayed in a single figure.
python
# Data for line chart
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
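# The subplot calls were lost in this copy; a sketch drawing sin and cos side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, y1)
ax1.set_title('sin(x)')
ax2.plot(x, y2)
ax2.set_title('cos(x)')
plt.show()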
6. Scatter Plot
A scatter plot is used to represent the relationship between two continuous variables.
python
# Data for scatter plot
x = np.random.rand(100)
y = np.random.rand(100)
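# The plotting call was lost in this copy; a minimal sketch
plt.scatter(x, y)
plt.title('Scatter Plot')
plt.show()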
13. Installing and Exploring NLTK
1. Install NLTK: Open a terminal (or command prompt) and run the following command to install NLTK via pip:
bash
pip install nltk
2. Verify Installation: After installation, you can verify that NLTK has been successfully
installed by importing it in a Python script or in an interactive Python session.
python
import nltk
print(nltk.__version__) # Print the NLTK version
3. Download NLTK Data: Many NLTK features require additional datasets and models, which you can fetch with the interactive downloader:
python
import nltk
nltk.download()
This will open a GUI window where you can select which datasets to download.
Alternatively, you can download specific resources like so:
python
nltk.download('punkt') # For tokenization
nltk.download('stopwords') # For stop words
Example Usage:
Once installed, you can begin using NLTK for tasks like tokenization, stemming, or part-of-
speech tagging. Here's an example to tokenize text into words:
python
import nltk
from nltk.tokenize import word_tokenize
# Sample text
text = ("NLTK is a leading platform for building Python programs "
        "to work with human language data.")
# Split the text into word tokens
tokens = word_tokenize(text)
print(tokens)
Output:
['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
14. Python program for text classification with Scikit-Learn and NLTK
Steps:
1. Install Required Libraries: First, make sure you have NLTK and Scikit-Learn
installed.
bash
pip install nltk scikit-learn
2. Download Necessary NLTK Data: For this example, we'll need NLTK's stopwords and
punkt for tokenization.
python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
3. Text Classification Program: We'll use the 20 Newsgroups dataset from Scikit-learn
for classification. The task is to classify text documents into one of several predefined
categories.
python
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
# Step 1: Load the 20 Newsgroups dataset (restricted to an illustrative subset of categories to keep the demo fast)
categories = ['sci.space', 'rec.sport.baseball', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)
docs, labels = newsgroups.data, newsgroups.target

# Step 2: Preprocess the text using NLTK (tokenization, stopword and punctuation removal)
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Keep only alphabetic tokens (removes punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(tokens)

processed_docs = [preprocess_text(doc) for doc in docs]

# Step 3: Convert the text data into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed_docs)
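# Steps 4-5 were lost in this copy; a sketch completing the split, training, and evaluation described below
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))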
1. Dataset:
o We use Scikit-learn's fetch_20newsgroups to load a dataset of newsgroup
documents categorized into 20 topics.
2. Text Preprocessing:
o We preprocess the text data using NLTK:
- Tokenization: using nltk.word_tokenize.
- Removing punctuation: we filter out any tokens that are not alphabetic.
- Removing stopwords: using the stopwords corpus from NLTK.
3. Vectorization:
o Scikit-learn's CountVectorizer is used to convert the processed text documents
into a matrix of token counts. This transforms the text data into a format that can
be used for machine learning.
4. Model Training:
o We use Multinomial Naive Bayes (MultinomialNB), a classifier well-suited for
text classification tasks.
5. Evaluation:
o We evaluate the model's performance using classification report, which shows
precision, recall, and F1-score for each category.
Sample Output:
The output will be a classification report that provides performance metrics for each category:
Classification Report:
precision recall f1-score support
15. Python program with NLTK, spaCy, and PyNLPI
To implement a Python program with NLTK, spaCy, and PyNLPI (a library for natural language processing in Python), we will cover the following aspects:
We'll combine all three libraries in a single Python program for some common NLP tasks, such
as text preprocessing, named entity recognition, and tokenization.
Before you run the code, make sure to install the necessary libraries using pip:
bash
pip install nltk spacy pynlpi
Also, for spaCy, download a pre-trained language model (e.g., en_core_web_sm for English):
bash
python -m spacy download en_core_web_sm
python
import nltk
import spacy
from pynlpi import Tokenizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from spacy import displacy
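# The setup and the first two steps were lost in this copy; a sketch consistent
# with the expected output shown below
text = "Apple is looking at buying U.K. startup for $1 billion. Steve Jobs co-founded Apple."

# 1. NLTK: Tokenization and stopword removal
print("NLTK Processing:")
tokens = word_tokenize(text)
print("Tokens using NLTK:", tokens)
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print("Filtered Tokens using NLTK:", filtered_tokens)

# 2. spaCy: Named Entity Recognition
print("\nspaCy Processing:")
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print("Named Entities using spaCy:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")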
# Dependency Parsing
print("\nDependency Parsing using spaCy:")
for token in doc:
print(f"{token.text} -> {token.dep_} -> {token.head.text}")
# 3. PyNLPI: Tokenization
print("\nPyNLPI Processing:")
1. NLTK:
o Tokenization: We use word_tokenize to break the text into individual words.
o Stopword Removal: We remove common English stopwords (like "the", "is",
"and") from the tokenized list using NLTK's stopwords corpus.
2. spaCy:
o Named Entity Recognition (NER): We extract named entities (e.g., "Apple",
"U.K.") from the text using spaCy's built-in ents attribute.
o Dependency Parsing: We analyze the syntactic structure of the sentence, printing
each word's syntactic role (e.g., subject, object).
o Optional Visualization: You can visualize the dependency parsing tree using
displacy.serve, which opens a visualization in the browser.
3. PyNLPI:
o Tokenization: We use PyNLPI's Tokenizer to split the text into tokens.
Expected Output:
NLTK Processing:
Tokens using NLTK: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.',
'Steve', 'Jobs', 'co-founded', 'Apple', '.']
Filtered Tokens using NLTK: ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.',
'Steve', 'Jobs', 'co-founded', 'Apple', '.']
spaCy Processing:
Named Entities using spaCy:
Apple - ORG
U.K. - GPE
$1 billion - MONEY
Steve Jobs - PERSON
Apple - ORG
PyNLPI Processing:
Tokens using PyNLPI: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion',
'.', 'Steve', 'Jobs', 'co-founded', 'Apple', '.']
Explanation of Output:
NLTK: The tokens are extracted and stopwords are removed from the text.
spaCy:
o Named Entity Recognition (NER) identifies entities like Apple, U.K., Steve
Jobs, and $1 billion.
o Dependency parsing shows the grammatical relationships between words in the
sentence.
PyNLPI: The tokens extracted by PyNLPI are similar to the ones from NLTK.