DATASCIENCE LAB

Uploaded by G Ravi Kumar

List of Experiments

1. Creating a NumPy Array
a. Basic ndarray
b. Array of zeros
c. Array of ones
d. Random numbers in ndarray
e. An array of your choice
f. Identity matrix in NumPy
g. Evenly spaced ndarray

Here's how to create different types of NumPy arrays:

1. Basic ndarray

To create a basic NumPy array from a list:

python
import numpy as np

# Basic ndarray
arr = np.array([1, 2, 3, 4, 5])
print(arr)

2. Array of Zeros

To create an array filled with zeros:

# Array of zeros
zeros_arr = np.zeros((3, 3)) # 3x3 matrix of zeros
print(zeros_arr)

3. Array of Ones

To create an array filled with ones:

python
# Array of ones
ones_arr = np.ones((2, 4)) # 2x4 matrix of ones
print(ones_arr)

4. Random Numbers in ndarray

To create an array with random numbers (uniform distribution between 0 and 1):

python
# Random numbers in ndarray
random_arr = np.random.rand(3, 3) # 3x3 matrix of random numbers
print(random_arr)

You can also generate random integers using np.random.randint:

python
# Random integers in ndarray
random_int_arr = np.random.randint(0, 10, size=(2, 5)) # 2x5 matrix of integers from 0 to 9 (upper bound exclusive)
print(random_int_arr)
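As a side note, recent NumPy documentation recommends the Generator API over the legacy np.random.* functions. A minimal sketch (the seed value 42 is arbitrary, chosen here only for reproducibility):

```python
import numpy as np

# Newer NumPy code often uses the Generator API instead of the
# legacy np.random.* functions; a fixed seed makes runs reproducible.
rng = np.random.default_rng(seed=42)

uniform_arr = rng.random((3, 3))            # 3x3 matrix, uniform in [0, 1)
int_arr = rng.integers(0, 10, size=(2, 5))  # integers from 0 to 9

print(uniform_arr)
print(int_arr)
```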

5. An Array of Your Choice

You can create an array based on a specific sequence or data you want. Here's an example of a
custom array:

python
# An array of your choice
custom_arr = np.array([10, 20, 30, 40])
print(custom_arr)

6. Identity Matrix in NumPy

To create an identity matrix (diagonal of ones, rest are zeros):

python
# Identity matrix
identity_matrix = np.eye(4) # 4x4 identity matrix
print(identity_matrix)

7. Evenly Spaced ndarray

To create an array with evenly spaced values using np.linspace:

python
# Evenly spaced ndarray
evenly_spaced_arr = np.linspace(0, 10, 5) # 5 values from 0 to 10 (inclusive)
print(evenly_spaced_arr)

You can also use np.arange to create an array with a specified step:

python
# Evenly spaced ndarray using np.arange
even_arr = np.arange(0, 10, 2) # Values from 0 to 10 with step size of 2
print(even_arr)
2. The Shape and Reshaping of NumPy Array

a. Dimensions of NumPy array
b. Shape of NumPy array
c. Size of NumPy array
d. Reshaping a NumPy array
e. Flattening a NumPy array
f. Transpose of a NumPy array

a. Dimensions of NumPy Array

The number of dimensions (axes) of a NumPy array can be obtained using the .ndim attribute.

python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])


# Get the number of dimensions
print("Dimensions of the array:", arr.ndim)

b. Shape of NumPy Array

The shape of a NumPy array refers to the dimensions of the array (rows, columns, etc.). It can be
accessed using the .shape attribute.

python
# Get the shape of the array
print("Shape of the array:", arr.shape)

For the above example, it will output (2, 3), indicating that the array has 2 rows and 3 columns.

c. Size of NumPy Array

The size of a NumPy array refers to the total number of elements in the array, which can be
obtained using the .size attribute.

python
# Get the size of the array
print("Size of the array:", arr.size)

This will give you the total number of elements in the array. For example, in a 2x3 array, the size
will be 6.

d. Reshaping a NumPy Array

You can reshape an array using the .reshape() method. This changes the shape of the array
without changing its data.

python
# Reshape the array
reshaped_arr = arr.reshape(3, 2) # Reshaping to a 3x2 array
print("Reshaped array:")
print(reshaped_arr)

Note: The total number of elements must stay the same when reshaping. For example, if the
original array has 6 elements, the reshaped array must also have 6 elements (e.g., 2x3, 3x2).
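Because of this constraint, NumPy can infer one dimension for you: pass -1 for an axis and it is computed from the total size. A small sketch:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Passing -1 lets NumPy infer that dimension from the total size:
# 6 elements with 2 columns -> 3 rows.
inferred = arr.reshape(-1, 2)
print(inferred.shape)  # (3, 2)
```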

e. Flattening a NumPy Array

Flattening refers to converting a multidimensional array into a one-dimensional array. This can
be done using .flatten() or .ravel().

python
# Flatten the array
flattened_arr = arr.flatten()
print("Flattened array:", flattened_arr)

Alternatively, .ravel() also flattens the array, but it returns a flattened view of the array whenever
possible.

python
# Flatten the array using ravel
raveled_arr = arr.ravel()
print("Raveled array:", raveled_arr)
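The practical difference between the two is that flatten() always copies, while ravel()'s view shares memory with the original array. A short sketch of what that means:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = arr.flatten()  # always an independent copy
flat_view = arr.ravel()    # a view when the memory layout allows it

flat_copy[0] = 99  # does not touch arr
flat_view[0] = 42  # writes through to arr

print(arr[0, 0])  # 42
```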
f. Transpose of a NumPy Array

The transpose of an array is obtained by swapping rows and columns. This can be done using .T.

python
# Transpose of the array
transposed_arr = arr.T
print("Transposed array:")
print(transposed_arr)

For a 2x3 array, the transpose will result in a 3x2 array.

Example Walkthrough:

python
import numpy as np

# Creating a 2x3 array


arr = np.array([[1, 2, 3], [4, 5, 6]])

# a. Dimensions of NumPy array


print("Dimensions of the array:", arr.ndim)

# b. Shape of NumPy array


print("Shape of the array:", arr.shape)

# c. Size of NumPy array


print("Size of the array:", arr.size)

# d. Reshaping the array to 3x2


reshaped_arr = arr.reshape(3, 2)
print("Reshaped array:\n", reshaped_arr)

# e. Flattening the array


flattened_arr = arr.flatten()
print("Flattened array:", flattened_arr)

# f. Transposing the array


transposed_arr = arr.T
print("Transposed array:\n", transposed_arr)

Output:

Dimensions of the array: 2
Shape of the array: (2, 3)
Size of the array: 6
Reshaped array:
[[1 2]
[3 4]
[5 6]]
Flattened array: [1 2 3 4 5 6]
Transposed array:
[[1 4]
[2 5]
[3 6]]

Key Points:

- arr.ndim: Gives the number of dimensions of the array.
- arr.shape: Returns the shape (dimensions) of the array (rows, columns, etc.).
- arr.size: Returns the total number of elements in the array.
- arr.reshape(): Changes the shape of the array.
- arr.flatten(): Converts the array into a one-dimensional array.
- arr.T: Returns the transpose of the array, switching rows and columns.

3. Expanding and Squeezing a NumPy Array a. Expanding a NumPy array b. Squeezing a NumPy array c. Sorting in NumPy Arrays

a. Expanding a NumPy array

Expanding a NumPy array refers to increasing its dimensions by adding new axes. The function
np.expand_dims() is commonly used for this. It adds a new axis at a specified position.

Example:
python
import numpy as np

# Creating a 1D array
arr = np.array([1, 2, 3])

# Expanding the array by adding a new axis at position 0


expanded_arr = np.expand_dims(arr, axis=0)
print("Expanded array along axis 0:")
print(expanded_arr)
print("Shape of expanded array:", expanded_arr.shape)

# Expanding the array by adding a new axis at position 1


expanded_arr = np.expand_dims(arr, axis=1)
print("Expanded array along axis 1:")
print(expanded_arr)
print("Shape of expanded array:", expanded_arr.shape)
Output:
Expanded array along axis 0:
[[1 2 3]]
Shape of expanded array: (1, 3)

Expanded array along axis 1:
[[1]
 [2]
 [3]]
Shape of expanded array: (3, 1)

b. Squeezing a NumPy array

Squeezing a NumPy array removes dimensions of size 1 from the shape of an array. This is done
using the np.squeeze() function.

Example:
python
import numpy as np

# Creating a 3D array with a single dimension of size 1


arr = np.array([[[1], [2], [3]]])

print("Original array:")
print(arr)
print("Shape of original array:", arr.shape)

# Squeezing the array to remove the single-dimensional axes


squeezed_arr = np.squeeze(arr)
print("Squeezed array:")
print(squeezed_arr)
print("Shape of squeezed array:", squeezed_arr.shape)

Output:
Original array:
[[[1]
[2]
[3]]]
Shape of original array: (1, 3, 1)

Squeezed array:
[1 2 3]
Shape of squeezed array: (3,)

c. Sorting in NumPy Arrays


NumPy provides functions to sort arrays in different ways. The np.sort() function returns a sorted
copy of the array, while np.argsort() returns the indices that would sort the array. Additionally,
np.sort() can sort the array along a specified axis.

Example of sorting an array:


python
import numpy as np

# Creating an unsorted array


arr = np.array([3, 1, 2, 5, 4])

# Sorting the array in ascending order


sorted_arr = np.sort(arr)
print("Sorted array:", sorted_arr)

# Sorting the array in descending order by reversing the sorted result


sorted_desc_arr = np.sort(arr)[::-1]
print("Sorted array in descending order:", sorted_desc_arr)

# Sorting along an axis (2D array example)


arr_2d = np.array([[3, 1, 2], [6, 5, 4]])
sorted_2d_arr = np.sort(arr_2d, axis=1) # Sort along each row
print("Sorted 2D array along axis 1:")
print(sorted_2d_arr)

# Using np.argsort() to get the indices that would sort the array
sorted_indices = np.argsort(arr)
print("Indices that would sort the array:", sorted_indices)

Output:
Sorted array: [1 2 3 4 5]
Sorted array in descending order: [5 4 3 2 1]

Sorted 2D array along axis 1:


[[1 2 3]
[4 5 6]]

Indices that would sort the array: [1 2 0 4 3]

4. Indexing and Slicing of NumPy Array a. Slicing 1-D NumPy arrays b. Slicing 2-D
NumPy arrays c. Slicing 3-D NumPy arrays d. Negative slicing of NumPy arrays

a. Slicing 1-D NumPy arrays

Slicing in a 1D array allows you to extract a portion of the array using a start, stop, and step
value.
Syntax for 1D array slicing:
python
arr[start:stop:step]

- start: The starting index (inclusive).
- stop: The stopping index (exclusive).
- step: The step size (default is 1).

Example:
python
import numpy as np

# Creating a 1D array
arr = np.array([10, 20, 30, 40, 50, 60, 70])

# Slicing the array


print("Array sliced from index 2 to 5:", arr[2:5]) # [30, 40, 50]
print("Array sliced with step 2:", arr[::2]) # [10, 30, 50, 70]
print("Array sliced with step -1 (reversed):", arr[::-1]) # [70, 60, 50, 40, 30, 20, 10]

Output:
Array sliced from index 2 to 5: [30 40 50]
Array sliced with step 2: [10 30 50 70]
Array sliced with step -1 (reversed): [70 60 50 40 30 20 10]
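One point worth knowing when slicing: unlike Python lists, a NumPy slice is a view on the original data, so writing through it modifies the source array. A minimal sketch:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70])

# A slice of a NumPy array is a view, not a copy:
# writing through the slice changes the original array.
view = arr[2:5]
view[0] = 99
print(arr)  # the 30 has become 99

# Use .copy() when an independent array is needed.
independent = arr[2:5].copy()
independent[0] = -1
print(arr[2])  # still 99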

b. Slicing 2-D NumPy arrays

Slicing a 2D array allows you to select subarrays along both axes (rows and columns).

Syntax for 2D array slicing:


python
arr[start_row:end_row, start_col:end_col]

- start_row:end_row: Row slicing.
- start_col:end_col: Column slicing.

Example:
python
import numpy as np

# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Slicing the 2D array
print("Sliced 2D array (rows 1 to 3 and columns 1 to 2):")
print(arr_2d[1:3, 1:3])

print("Sliced 2D array (all rows, columns 1 to 2):")


print(arr_2d[:, 1:3])

print("Sliced 2D array (row 2, all columns):")


print(arr_2d[2, :])

Output:
Sliced 2D array (rows 1 to 3 and columns 1 to 2):
[[5 6]
[8 9]]

Sliced 2D array (all rows, columns 1 to 2):


[[ 2 3]
[ 5 6]
[ 8 9]
[11 12]]

Sliced 2D array (row 2, all columns):


[7 8 9]

c. Slicing 3-D NumPy arrays

For 3D arrays, you can slice across three dimensions: depth (axis 0), rows (axis 1), and columns
(axis 2).

Syntax for 3D array slicing:


python
arr[start_depth:end_depth, start_row:end_row, start_col:end_col]

Example:
python
import numpy as np

# Creating a 3D array
arr_3d = np.array([[[1, 2], [3, 4]],
                   [[5, 6], [7, 8]],
                   [[9, 10], [11, 12]]])

# Slicing the 3D array
print("Sliced 3D array (depth 0 to 2, row 0, all columns):")
print(arr_3d[0:2, 0, :])
print("Sliced 3D array (depth 1, all rows and columns):")
print(arr_3d[1, :, :])

print("Sliced 3D array (depth 0 to 2, all rows, columns 0 to 1):")
print(arr_3d[0:2, :, 0:1])

Output:
Sliced 3D array (depth 0 to 2, row 0, all columns):
[[1 2]
[5 6]]

Sliced 3D array (depth 1, all rows and columns):


[[ 5 6]
[ 7 8]]

Sliced 3D array (depth 0 to 2, all rows, columns 0 to 1):


[[[ 1]
[ 3]]

[[ 5]
[ 7]]]

d. Negative slicing of NumPy arrays

Negative slicing allows you to slice an array starting from the end rather than the beginning.
Negative indexing is useful when you want to select elements from the end without knowing the
exact size of the array.

Example:
import numpy as np

# Creating a 1D array
arr = np.array([10, 20, 30, 40, 50, 60, 70])

# Negative indexing and slicing


print("Last 3 elements using negative indexing:", arr[-3:])
print("All elements except the last 2:", arr[:-2])
print("Reverse the array using negative slicing:", arr[::-1])

Output:
Last 3 elements using negative indexing: [50 60 70]
All elements except the last 2: [10 20 30 40 50]
Reverse the array using negative slicing: [70 60 50 40 30 20 10]
5. Stacking and Concatenating Numpy Arrays a. Stacking ndarrays b. Concatenating ndarrays c. Broadcasting in Numpy Arrays

a. Stacking NumPy arrays


Stacking refers to combining multiple arrays along a new axis. NumPy provides several
functions for stacking:

- np.stack(): Stacks arrays along a new axis.
- np.hstack(): Stacks arrays horizontally (column-wise; for 1-D inputs it simply concatenates them).
- np.vstack(): Stacks arrays vertically (along axis 0).
- np.dstack(): Stacks arrays depth-wise (along axis 2).

Example:
python
import numpy as np

# Creating two 1D arrays


arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Stacking arrays along a new axis (default axis=0)


stacked = np.stack((arr1, arr2))
print("Stacked along axis 0:\n", stacked)

# Stacking arrays horizontally (along axis=1)


hstacked = np.hstack((arr1, arr2))
print("Horizontally stacked:\n", hstacked)

# Stacking arrays vertically (along axis=0)


vstacked = np.vstack((arr1, arr2))
print("Vertically stacked:\n", vstacked)

# Stacking arrays depth-wise (along axis=2)


arr1_2d = np.array([[1, 2], [3, 4]])
arr2_2d = np.array([[5, 6], [7, 8]])
dstacked = np.dstack((arr1_2d, arr2_2d))
print("Depth-wise stacked:\n", dstacked)

Output:
Stacked along axis 0:
[[1 2 3]
[4 5 6]]

Horizontally stacked:
[1 2 3 4 5 6]

Vertically stacked:
[[1 2 3]
[4 5 6]]

Depth-wise stacked:
[[[1 5]
[2 6]]

[[3 7]
[4 8]]]

b. Concatenating NumPy arrays

Concatenation refers to joining two or more arrays along a specified axis. NumPy provides
np.concatenate() for this operation. You can concatenate arrays along any axis, not just 0 or 1.

Example:
python
import numpy as np

# Creating two 1D arrays


arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Concatenating arrays along axis 0 (default)


concatenated = np.concatenate((arr1, arr2))
print("Concatenated along axis 0:", concatenated)

# Concatenating 2D arrays along axis 1


arr1_2d = np.array([[1, 2], [3, 4]])
arr2_2d = np.array([[5, 6], [7, 8]])
concatenated_2d = np.concatenate((arr1_2d, arr2_2d), axis=1)
print("Concatenated 2D array along axis 1:\n", concatenated_2d)

# Concatenating along axis 0 (vertically)


concatenated_2d_v = np.concatenate((arr1_2d, arr2_2d), axis=0)
print("Concatenated 2D array along axis 0:\n", concatenated_2d_v)

Output:
Concatenated along axis 0: [1 2 3 4 5 6]

Concatenated 2D array along axis 1:


[[1 2 5 6]
[3 4 7 8]]

Concatenated 2D array along axis 0:


[[1 2]
[3 4]
[5 6]
[7 8]]

c. Broadcasting in NumPy Arrays


Broadcasting allows NumPy to perform element-wise operations on arrays of different shapes
by automatically expanding the smaller array to match the shape of the larger one. This
eliminates the need for explicit replication of arrays, making operations more efficient.

Broadcasting follows a set of rules to determine whether two arrays can be broadcast together.
The key rules are:

1. If the arrays have different numbers of dimensions, the shape of the smaller array is
padded with 1s on the left side until they have the same number of dimensions.
2. If the dimensions of the arrays do not match, broadcasting is possible only if one of the
arrays has a dimension of size 1 in that position.

Example of Broadcasting:
python
import numpy as np

# Array A: 3x3 matrix
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Array B: 1D array (broadcast against the columns of A)
B = np.array([1, 2, 3])

# Broadcasting B to the shape of A (adding B to each row of A)
result = A + B
print("Result of broadcasting and adding arrays A and B:\n", result)

# C: a scalar (broadcast across all elements of A)
C = 10
result_scalar = A + C
print("Result of broadcasting a scalar to A:\n", result_scalar)

Output:
Result of broadcasting and adding arrays A and B:
[[ 2  4  6]
 [ 5  7  9]
 [ 8 10 12]]

Result of broadcasting a scalar to A:


[[11 12 13]
[14 15 16]
[17 18 19]]
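The same rules also let two non-scalar arrays stretch each other: when each has a size-1 axis where the other does not, both are expanded. A small sketch (a hypothetical column-times-row example, effectively an outer product):

```python
import numpy as np

# A (3,1) column and a (1,4) row broadcast to a (3,4) result:
# each array is stretched along its size-1 axis (rule 2 above).
col = np.array([[1], [2], [3]])      # shape (3, 1)
row = np.array([[10, 20, 30, 40]])   # shape (1, 4)

table = col * row
print(table.shape)  # (3, 4)
print(table)
```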

6. Perform following operations using pandas a. Creating dataframe b. concat() c. Setting conditions d. Adding a new column

a. Creating a DataFrame in Pandas

A DataFrame is the primary data structure in Pandas, and you can create it from various data
sources like dictionaries, lists, or NumPy arrays.

Example:
python
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)
print("DataFrame created from dictionary:")
print(df)

Output:
DataFrame created from dictionary:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston

b. Concatenating DataFrames (concat())

Pandas' concat() function is used to concatenate DataFrames either vertically (row-wise) or horizontally (column-wise).

Example (Concatenating DataFrames vertically):


python
import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})
df2 = pd.DataFrame({
    'Name': ['Charlie', 'David'],
    'Age': [35, 40]
})

# Concatenating DataFrames vertically (along rows)
df_concat = pd.concat([df1, df2], ignore_index=True)
print("Concatenated DataFrame (vertically):")
print(df_concat)

Example (Concatenating DataFrames horizontally):


python
# Creating two DataFrames
df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})
df2 = pd.DataFrame({
    'City': ['New York', 'Los Angeles']
})

# Concatenating DataFrames horizontally (along columns)
df_concat_h = pd.concat([df1, df2], axis=1)
print("Concatenated DataFrame (horizontally):")
print(df_concat_h)

Output (Vertical Concatenation):


Concatenated DataFrame (vertically):
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
3 David 40

Output (Horizontal Concatenation):


Concatenated DataFrame (horizontally):
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles

c. Setting Conditions in Pandas

You can apply conditions to filter data within a DataFrame. This allows you to select rows that
meet specific criteria.

Example (Filtering based on conditions):


python
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Setting a condition to filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print("Filtered DataFrame (Age > 30):")
print(filtered_df)

# Using multiple conditions (Age > 30 and City is 'Chicago')
filtered_multiple_conditions = df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
print("Filtered DataFrame (Age > 30 and City is 'Chicago'):")
print(filtered_multiple_conditions)

Output:
Filtered DataFrame (Age > 30):
Name Age City
2 Charlie 35 Chicago
3 David 40 Houston

Filtered DataFrame (Age > 30 and City is 'Chicago'):


Name Age City
2 Charlie 35 Chicago

d. Adding a New Column to a DataFrame

You can add new columns to an existing DataFrame by simply assigning values to a new column
name.

Example (Adding a new column):


python
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Adding a new column 'Salary'
df['Salary'] = [50000, 60000, 70000, 80000]
print("DataFrame with a new column 'Salary':")
print(df)

Output:
DataFrame with a new column 'Salary':
Name Age City Salary
0 Alice 25 New York 50000
1 Bob 30 Los Angeles 60000
2 Charlie 35 Chicago 70000
3 David 40 Houston 80000

7. Perform following operations using pandas a. Filling NaN with string b. Sorting based on
column values c. groupby()

a. Filling NaN with a String

In Pandas, you can fill NaN values with a specific value using the fillna() method. If you want to
replace NaN with a string (or any other value), you can do so easily.

Example:
python
import pandas as pd
import numpy as np

# Creating a DataFrame with NaN values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, np.nan],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan]
})

# Filling NaN values with a specific string
df_filled = df.fillna('Unknown')
print("DataFrame after filling NaN with 'Unknown':")
print(df_filled)

Output:
DataFrame after filling NaN with 'Unknown':
Name Age City
0 Alice 25 New York
1 Bob Unknown Los Angeles
2 Charlie 35 Chicago
3 David Unknown Unknown
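Note that filling a numeric column with a string turns it into an object column. fillna() also accepts a dict mapping column names to replacement values, which lets each column keep a sensible type. A minimal sketch (the replacement values 0 and 'Unknown' are arbitrary choices):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, np.nan],
    'City': ['New York', np.nan]
})

# A dict fills each column with its own replacement value,
# so the numeric 'Age' column can stay numeric.
df_filled = df.fillna({'Age': 0, 'City': 'Unknown'})
print(df_filled)
```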

b. Sorting Based on Column Values


You can sort a DataFrame by one or more columns using the sort_values() method. You can sort
in ascending or descending order based on a column's values.

Example:
python
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Sorting the DataFrame by 'Age' in ascending order
sorted_df = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by 'Age' in ascending order:")
print(sorted_df)

# Sorting the DataFrame by 'Age' in descending order
sorted_df_desc = df.sort_values(by='Age', ascending=False)
print("\nDataFrame sorted by 'Age' in descending order:")
print(sorted_df_desc)

Output:
DataFrame sorted by 'Age' in ascending order:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston

DataFrame sorted by 'Age' in descending order:


Name Age City
3 David 40 Houston
2 Charlie 35 Chicago
1 Bob 30 Los Angeles
0 Alice 25 New York
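sort_values() also accepts lists, so you can sort by several columns with a per-column direction. A small sketch (the tie-breaking data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [30, 25, 30, 25],
    'City': ['Chicago', 'Houston', 'Boston', 'Austin']
})

# Sort by 'Age' ascending, breaking ties by 'City' descending.
sorted_multi = df.sort_values(by=['Age', 'City'], ascending=[True, False])
print(sorted_multi)
```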

c. groupby() in Pandas

The groupby() function in Pandas is used to group data based on one or more columns and then
apply an aggregate function to the grouped data. Common operations include summing,
averaging, or counting values in each group.

Example:
python
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 40, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Los Angeles']
})

# Grouping the DataFrame by 'City' and calculating the average age for each city
grouped_df = df.groupby('City')['Age'].mean()
print("Average age for each city:")
print(grouped_df)

# Grouping by multiple columns and counting the number of occurrences
grouped_count = df.groupby(['City', 'Age']).size()
print("\nCount of occurrences of each combination of City and Age:")
print(grouped_count)

Output:
Average age for each city:
City
Chicago 35.0
Houston 40.0
Los Angeles 30.0
New York 25.0
Name: Age, dtype: float64

Count of occurrences of each combination of City and Age:


City Age
Chicago 35 1
Houston 40 1
Los Angeles 30 2
New York 25 1
dtype: int64
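When you need several aggregates per group at once, groupby() combines with agg(). A minimal sketch (the short city labels are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NY', 'LA', 'NY', 'LA'],
    'Age': [25, 30, 35, 40]
})

# agg() applies several aggregate functions to each group at once,
# producing one column per function.
summary = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```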

8. Read the following file formats using pandas a. Text files b. CSV files c. Excel files d. JSON files

a. Reading Text Files

You can read a text file into a DataFrame using pd.read_csv(). Text files can have custom
delimiters (spaces, tabs, or others). If the text file is space-delimited, use the delim_whitespace
parameter.

Example:
python
import pandas as pd

# Reading a text file (space-delimited or tab-delimited)


df_text = pd.read_csv('file.txt', delim_whitespace=True) # Use 'sep="\t"' for tab-delimited
print(df_text)

- Parameters:
  - delim_whitespace=True: This allows Pandas to treat any whitespace as a delimiter. (In recent pandas versions this option is deprecated in favor of sep='\s+'.)
  - You can also use sep=' ', sep='\t', or any other custom delimiter for more control.

b. Reading CSV Files

CSV (Comma Separated Values) is the most common tabular data format. Pandas provides
pd.read_csv() to read CSV files.

Example:
python
import pandas as pd

# Reading a CSV file


df_csv = pd.read_csv('file.csv')
print(df_csv)

- Parameters:
  - sep=',': Specifies the delimiter (default is comma).
  - header=0: Row number to use as column names (default is 0).
  - index_col: Specifies which column to use as the index.

Example with custom delimiter (semicolon):


python
df_csv = pd.read_csv('file.csv', sep=';') # For semicolon-separated values
print(df_csv)

c. Reading Excel Files

Pandas can read .xls and .xlsx files using the pd.read_excel() function. You’ll need the openpyxl
library for .xlsx files and xlrd for .xls.

Example:
python
import pandas as pd
# Reading an Excel file (default is sheet_name=0 for the first sheet)
df_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
print(df_excel)

- Parameters:
  - sheet_name: Specifies the sheet to read by name or index. If sheet_name=None, it reads all sheets.
  - header: Defines the row(s) to use as column names.
  - usecols: Selects specific columns.

Example (reading all sheets):


python
df_excel_all = pd.read_excel('file.xlsx', sheet_name=None) # Read all sheets
print(df_excel_all)

d. Reading JSON Files

JSON (JavaScript Object Notation) is commonly used for hierarchical data. Pandas can easily
read JSON files into DataFrames using pd.read_json().

Example:
python
import pandas as pd

# Reading a JSON file


df_json = pd.read_json('file.json')
print(df_json)

- Parameters:
  - orient: Specifies how the JSON is laid out. For instance, 'records' treats the file as a list of row dictionaries; for tabular data the default is 'columns'.
  - lines=True: Use this if the file contains one JSON object per line.

Example with nested JSON structure:


python
# Read a nested JSON file
df_json_nested = pd.read_json('file.json', orient='records', lines=True)
print(df_json_nested)

Summary of Code Examples:

1. Text Files (space or tab-delimited):


python
df_text = pd.read_csv('file.txt', delim_whitespace=True)

2. CSV Files:

python
df_csv = pd.read_csv('file.csv') # For comma-separated
df_csv = pd.read_csv('file.csv', sep=';') # For semicolon-separated

3. Excel Files:

python
df_excel = pd.read_excel('file.xlsx', sheet_name='Sheet1')
df_excel_all = pd.read_excel('file.xlsx', sheet_name=None) # Read all sheets

4. JSON Files:

python
df_json = pd.read_json('file.json')
df_json_nested = pd.read_json('file.json', orient='records', lines=True)

9. Read the following file formats a. Pickle files b. Image files using PIL c. Multiple files
using Glob d. Importing data from database

a. Reading Pickle Files

Pickle files are used to serialize and deserialize Python objects, making them convenient for
saving and loading complex data structures. You can read Pickle files using the pickle module or
pandas (for DataFrames).

Example using pickle:


python
import pickle

# Load a Pickle file
with open('file.pkl', 'rb') as f:
    data = pickle.load(f)

print(data)

- 'rb': Opens the file in binary mode for reading.
- Use pickle.load() to deserialize the object stored in the file.

Example using pandas (for reading DataFrames stored as Pickle files):
python
import pandas as pd

# Load a Pickle file (DataFrame)


df = pd.read_pickle('file.pkl')
print(df)

- pd.read_pickle() directly loads Pickle files that store Pandas DataFrames.

b. Reading Image Files using PIL (Pillow)

The Pillow library (a fork of PIL, the Python Imaging Library) allows you to open, manipulate,
and save various image formats like PNG, JPEG, GIF, etc.

Example:
python
from PIL import Image

# Open an image file


image = Image.open('file.jpg')

# Show the image


image.show()

# Optionally, you can convert the image to grayscale or perform other manipulations
image_gray = image.convert('L')
image_gray.show()

- Image.open(): Opens the image file.
- image.show(): Displays the image.
- convert('L'): Converts the image to grayscale.

You can also save or manipulate images further using Pillow's various methods.

c. Reading Multiple Files using glob

The glob module allows you to find all pathnames matching a specified pattern. It is useful for
reading multiple files from a directory, such as all .txt files or .csv files.

Example:
python
import glob

# Get a list of all text files in the directory
files = glob.glob('path/to/folder/*.txt')

# Loop through all files and process them
for file in files:
    with open(file, 'r') as f:
        content = f.read()
        print(content)

- glob.glob('pattern'): Finds all files matching the pattern (e.g., *.txt for all text files).
- You can then read and process each file as needed.

Example for reading multiple CSV files:


python
Copy code
import pandas as pd
import glob

# Get a list of all CSV files in the folder


csv_files = glob.glob('path/to/folder/*.csv')

# Loop through all CSV files and read them into a DataFrame
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    print(df)
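
A common variation is stacking all matching CSV files into a single DataFrame with pd.concat. The sketch below creates its own input files so it is self-contained (the part*.csv file names are illustrative):

```python
import glob
import pandas as pd

# Create two small CSV files so the example runs on its own
pd.DataFrame({'id': [1, 2], 'amount': [100, 200]}).to_csv('part1.csv', index=False)
pd.DataFrame({'id': [3, 4], 'amount': [300, 400]}).to_csv('part2.csv', index=False)

# Read every matching file and stack them into a single DataFrame
frames = [pd.read_csv(path) for path in sorted(glob.glob('part*.csv'))]
combined = pd.concat(frames, ignore_index=True)  # ignore_index renumbers rows 0..n-1
print(combined)
```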

d. Importing Data from a Database

To import data from a database like SQLite, MySQL, PostgreSQL, etc., you can use the
pandas.read_sql() function. You'll need a database connection, and each type of database
requires a different connection method.

Example for SQLite:


python
import sqlite3
import pandas as pd

# Create a connection to the SQLite database


conn = sqlite3.connect('database.db')

# Query data from a table


df = pd.read_sql('SELECT * FROM table_name', conn)

# Display the DataFrame


print(df)
# Close the connection
conn.close()

Example for MySQL using SQLAlchemy:


python
from sqlalchemy import create_engine
import pandas as pd

# Create a connection to the MySQL database


engine = create_engine('mysql+pymysql://user:password@host/database')

# Query data from a table


df = pd.read_sql('SELECT * FROM table_name', engine)

# Display the DataFrame


print(df)

• SQLite: Use sqlite3.connect() to create a connection.
• MySQL/PostgreSQL: Use SQLAlchemy or a database-specific connector like pymysql or psycopg2.
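
When the query depends on user input, pass values through the params argument of pd.read_sql rather than building the SQL string yourself, to avoid SQL injection. A minimal sketch using an in-memory SQLite database (the loans table and its rows are made up for illustration):

```python
import sqlite3
import pandas as pd

# In-memory database so the example is self-contained
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE loans (id INTEGER, status TEXT)')
conn.executemany('INSERT INTO loans VALUES (?, ?)',
                 [(1, 'Approved'), (2, 'Denied'), (3, 'Approved')])

# Pass user-supplied values via params, never string formatting
df = pd.read_sql('SELECT * FROM loans WHERE status = ?', conn,
                 params=('Approved',))
print(df)
conn.close()
```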

Summary of Code Examples:

1. Reading Pickle Files:


o Using pickle:

python
import pickle
with open('file.pkl', 'rb') as f:
    data = pickle.load(f)

o Using pandas (for DataFrames):

python
import pandas as pd
df = pd.read_pickle('file.pkl')

2. Reading Image Files using PIL:

python
from PIL import Image
image = Image.open('file.jpg')
image.show()
3. Reading Multiple Files using glob:

python
import glob
files = glob.glob('path/to/folder/*.txt')
for file in files:
    with open(file, 'r') as f:
        content = f.read()
        print(content)

4. Importing Data from Database:


o SQLite:

python
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)

o MySQL using SQLAlchemy:

python
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://user:password@host/database')
df = pd.read_sql('SELECT * FROM table_name', engine)

10. Demonstrate web scraping using python


import requests
from bs4 import BeautifulSoup

# URL of the website to scrape


url = 'https://example.com'

# Send HTTP GET request


response = requests.get(url)

# Check if the request was successful (status code 200)


if response.status_code == 200:
    print("Successfully fetched the webpage!")

    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all <h2> tags
    article_titles = soup.find_all('h2')

    # Loop through each article title and print the title and the link
    for title in article_titles:
        link = title.find('a')  # Get the link inside the <h2> tag
        if link:
            title_text = title.get_text()
            article_link = link.get('href')
            print(f"Title: {title_text}")
            print(f"Link: {article_link}")
            print("-" * 40)
else:
    print("Failed to fetch the webpage. Status code:", response.status_code)
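
The find_all approach can also be written with CSS selectors via soup.select(). A minimal offline sketch (the HTML snippet, class names, and links are made up for illustration, so no network access is needed):

```python
from bs4 import BeautifulSoup

# Parse an inline HTML snippet instead of a fetched page
html = """
<div class="post"><h2><a href="/a1">First post</a></h2></div>
<div class="post"><h2><a href="/a2">Second post</a></h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector; here: <a> tags inside <h2> tags
links = soup.select('h2 a')
for link in links:
    print(link.get_text(), '->', link.get('href'))
```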
11. Perform following preprocessing techniques on loan prediction dataset: a. Feature Scaling
b. Feature Standardization c. Label Encoding d. One Hot Encoding

1. Feature Scaling

Feature scaling ensures that features have similar ranges, which is important for algorithms that
rely on the distance between points (e.g., k-nearest neighbors, support vector machines). The
most common techniques for feature scaling are Min-Max Scaling and Standardization.

Example: Min-Max Scaling

Min-Max scaling scales the data to a specific range, often [0, 1].

python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Example dataset
data = {'Age': [25, 30, 35, 40, 45],
        'Income': [40000, 50000, 60000, 70000, 80000],
        'LoanAmount': [100000, 200000, 150000, 120000, 180000]}
df = pd.DataFrame(data)

# Initialize the scaler


scaler = MinMaxScaler()

# Apply Min-Max scaling on 'Income' and 'LoanAmount'


df[['Income', 'LoanAmount']] = scaler.fit_transform(df[['Income', 'LoanAmount']])

print(df)

Output:

plaintext
   Age  Income  LoanAmount
0   25    0.00         0.0
1   30    0.25         1.0
2   35    0.50         0.5
3   40    0.75         0.2
4   45    1.00         0.8

2. Feature Standardization

Standardization scales the data to have a mean of 0 and a standard deviation of 1. This is useful
for algorithms like Logistic Regression, SVM, or Linear Regression that assume the data is
normally distributed.

Example: Standardization (Z-score normalization)


python
from sklearn.preprocessing import StandardScaler

# Initialize the standard scaler


scaler = StandardScaler()

# Apply standardization on 'Income' and 'LoanAmount'


df[['Income', 'LoanAmount']] = scaler.fit_transform(df[['Income', 'LoanAmount']])

print(df)

Output:

plaintext
   Age    Income  LoanAmount
0   25 -1.414214   -1.355815
1   30 -0.707107    1.355815
2   35  0.000000    0.000000
3   40  0.707107   -0.813489
4   45  1.414214    0.813489

(Standardizing the already Min-Max-scaled columns gives the same z-scores as standardizing the raw values, because z-scores are invariant under linear scaling.)
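
One practical caveat: in a real train/test workflow, the scaler should be fit on the training split only and then applied to the test split, so that no information from the test set leaks into the scaling statistics. A short sketch with illustrative income values:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array([[40000.], [50000.], [60000.], [70000.], [80000.], [90000.]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)  # statistics come from X_train, not from all of X
```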

3. Label Encoding

Label encoding converts categorical values into integers. Scikit-learn's LabelEncoder is intended for target variables; it assigns integer codes in alphabetical order of the categories, so the codes do not necessarily reflect any natural order (for ordered features like "Low", "Medium", "High", use OrdinalEncoder with an explicit category order instead).

Example: Label Encoding for a target variable Loan_Status


python
from sklearn.preprocessing import LabelEncoder

# Example target column (Loan_Status)


loan_status = ['Approved', 'Denied', 'Approved', 'Denied', 'Approved']

# Initialize LabelEncoder
encoder = LabelEncoder()

# Apply label encoding


encoded_status = encoder.fit_transform(loan_status)

print(encoded_status) # Output: [0 1 0 1 0]

In this case (LabelEncoder assigns codes in alphabetical order):

• Approved -> 0
• Denied -> 1

4. One-Hot Encoding

One-hot encoding is used when categorical features have no ordinal relationship (like Gender,
MaritalStatus). It converts categorical variables into binary columns (1 or 0), one column for
each category.

Example: One-Hot Encoding for a feature Marital_Status


python
# Example feature column (Marital_Status)
data = {'Marital_Status': ['Single', 'Married', 'Single', 'Married', 'Divorced']}
df = pd.DataFrame(data)

# Apply one-hot encoding


df_encoded = pd.get_dummies(df, columns=['Marital_Status'])

print(df_encoded)

Output:

plaintext
   Marital_Status_Divorced  Marital_Status_Married  Marital_Status_Single
0                        0                       0                      1
1                        0                       1                      0
2                        0                       0                      1
3                        0                       1                      0
4                        1                       0                      0

The feature Marital_Status is converted into three binary columns (Marital_Status_Single,
Marital_Status_Married, Marital_Status_Divorced).

12. Perform following visualizations using matplotlib a. Bar Graph b. Pie Chart c. Box Plot
d. Histogram e. Line Chart and Subplots f. Scatter Plot

1. Bar Graph

A bar graph is useful to represent categorical data with rectangular bars where the length of the
bar represents the value.

python
import matplotlib.pyplot as plt

# Sample data for the bar graph


categories = ['A', 'B', 'C', 'D']
values = [3, 7, 2, 5]

# Creating a bar graph


plt.bar(categories, values, color='skyblue')

# Adding title and labels


plt.title('Bar Graph Example')
plt.xlabel('Categories')
plt.ylabel('Values')

# Displaying the plot


plt.show()

2. Pie Chart

A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical
proportions.

python
# Data for the pie chart
labels = ['Apple', 'Banana', 'Cherry', 'Date']
sizes = [35, 25, 20, 20]
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']

# Creating the pie chart


plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)

# Adding a title
plt.title('Fruit Distribution')
# Displaying the plot
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle
plt.show()

3. Box Plot

A box plot (or box-and-whisker plot) is used to represent the distribution of numerical data based
on a five-number summary: minimum, first quartile, median, third quartile, and maximum.

python
import numpy as np

# Random data for the box plot


data = np.random.randn(100)

# Creating a box plot


plt.boxplot(data)

# Adding title and labels


plt.title('Box Plot Example')
plt.ylabel('Values')

# Displaying the plot


plt.show()

4. Histogram

A histogram is used to represent the distribution of numerical data. It groups the data into bins
and counts the number of data points in each bin.

python
# Random data for the histogram
data = np.random.randn(1000)

# Creating a histogram
plt.hist(data, bins=30, color='orange', edgecolor='black')

# Adding title and labels


plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Displaying the plot


plt.show()
5. Line Chart and Subplots

A line chart is useful for showing data trends over a continuous range (e.g., time series). Subplots
allow multiple plots to be displayed in a single figure.

python
# Data for line chart
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Creating subplots (2 rows, 1 column)


fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))

# Plotting the first line chart


ax1.plot(x, y1, label='sin(x)', color='blue')
ax1.set_title('Sine Wave')
ax1.set_xlabel('x')
ax1.set_ylabel('sin(x)')

# Plotting the second line chart


ax2.plot(x, y2, label='cos(x)', color='red')
ax2.set_title('Cosine Wave')
ax2.set_xlabel('x')
ax2.set_ylabel('cos(x)')

# Adding space between plots


plt.tight_layout()

# Displaying the plots


plt.show()

6. Scatter Plot

A scatter plot is used to represent the relationship between two continuous variables.

python
# Data for scatter plot
x = np.random.rand(100)
y = np.random.rand(100)

# Creating a scatter plot


plt.scatter(x, y, color='purple')

# Adding title and labels


plt.title('Scatter Plot Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Displaying the plot


plt.show()
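
Besides displaying plots with plt.show(), any of the figures above can be written to disk with plt.savefig(). A short sketch (the Agg backend and output filename are illustrative; Agg lets the script run without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; safe for headless scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.title('Saved Figure')

# Write the figure to a file instead of showing it interactively
plt.savefig('sine.png', dpi=150, bbox_inches='tight')
plt.close()
```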
13. Getting started with NLTK, install NLTK using PIP

Steps to Install NLTK using PIP:

1. Install NLTK: Open a terminal (or command prompt) and run the following command to
install NLTK via pip:

bash
pip install nltk

2. Verify Installation: After installation, you can verify that NLTK has been successfully
installed by importing it in a Python script or in an interactive Python session.

python
import nltk
print(nltk.__version__) # Print the NLTK version

3. Download NLTK Data: After installing NLTK, it is necessary to download various
datasets and models that NLTK uses (e.g., corpora, stopwords, punkt tokenizer). To
download the necessary data, use the following command:

python
import nltk
nltk.download()

This will open a GUI window where you can select which datasets to download.
Alternatively, you can download specific resources like so:

python
nltk.download('punkt') # For tokenization
nltk.download('stopwords') # For stop words

Example Usage:

Once installed, you can begin using NLTK for tasks like tokenization, stemming, or part-of-
speech tagging. Here's an example to tokenize text into words:

python
import nltk
from nltk.tokenize import word_tokenize
# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language
data."

# Tokenizing the text into words


tokens = word_tokenize(text)

print(tokens)

Output:

plaintext
['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']
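
Beyond tokenization, NLTK also ships stemmers such as PorterStemmer, which reduce words to a crude root form and need no extra downloaded data. A short sketch with illustrative words:

```python
from nltk.stem import PorterStemmer

# PorterStemmer needs no downloaded corpora, so it runs right after pip install
stemmer = PorterStemmer()
words = ['building', 'builds', 'platforms', 'leading', 'programs']
stems = [stemmer.stem(w) for w in words]
print(stems)
```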
14. Python program to implement with Python Scikit-Learn & NLTK

Steps:

1. Install Required Libraries: First, make sure you have NLTK and Scikit-Learn
installed.

bash
pip install nltk scikit-learn

2. Download Necessary NLTK Data: For this example, we'll need NLTK's stopwords and
punkt for tokenization.

python
import nltk
nltk.download('stopwords')
nltk.download('punkt')

3. Text Classification Program: We'll use the 20 Newsgroups dataset from Scikit-learn
for classification. The task is to classify text documents into one of several predefined
categories.

Python Program: Text Classification with NLTK & Scikit-Learn

python
import nltk
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Download necessary NLTK data


nltk.download('stopwords')
nltk.download('punkt')

# Step 1: Fetch the 20 Newsgroups dataset


newsgroups = fetch_20newsgroups(subset='all')

# Step 2: Preprocess the text using NLTK (tokenization, stopword removal,
# punctuation removal)
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation and convert tokens to lowercase
    tokens = [word.lower() for word in tokens if word.isalpha()]

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    return " ".join(tokens)

# Preprocess all the documents


processed_docs = [preprocess_text(text) for text in newsgroups.data]

# Step 3: Convert the text data into numerical features using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed_docs)

# Step 4: Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
    X, newsgroups.target, test_size=0.3, random_state=42)

# Step 5: Train a Naive Bayes classifier


clf = MultinomialNB()
clf.fit(X_train, y_train)

# Step 6: Make predictions on the test set


y_pred = clf.predict(X_test)

# Step 7: Evaluate the model


print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

Explanation of the Code:

1. Dataset:
o We use Scikit-learn's fetch_20newsgroups to load a dataset of newsgroup
documents categorized into 20 topics.
2. Text Preprocessing:
o We preprocess the text data using NLTK for:
• Tokenization: Using nltk.word_tokenize.
• Removing Punctuation: We filter out any tokens that are not alphabetic.
• Removing Stopwords: Using the stopwords corpus from NLTK.
3. Vectorization:
o Scikit-learn's CountVectorizer is used to convert the processed text documents
into a matrix of token counts. This transforms the text data into a format that can
be used for machine learning.
4. Model Training:
o We use Multinomial Naive Bayes (MultinomialNB), a classifier well-suited for
text classification tasks.
5. Evaluation:
o We evaluate the model's performance using classification report, which shows
precision, recall, and F1-score for each category.

Sample Output:

The output will be a classification report that provides performance metrics for each category:

plaintext
Classification Report:
                          precision    recall  f1-score   support

             alt.atheism       0.85      0.79      0.82       319
           comp.graphics       0.86      0.86      0.86       389
 comp.os.ms-windows.misc       0.70      0.77      0.73       394
...
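
When experimenting further, CountVectorizer can be swapped for TfidfVectorizer, which down-weights words that appear in many documents and often improves Naive Bayes text models. A minimal sketch with a made-up four-document corpus standing in for the 20 Newsgroups data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (stand-in for the 20 Newsgroups documents)
docs = ['the spacecraft launched into orbit',
        'the team won the hockey game',
        'a rocket reached the space station',
        'the goalie blocked every shot']
labels = ['space', 'sport', 'space', 'sport']

# Pipeline: TF-IDF features feeding a Multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(['the rocket is in orbit']))
```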

15. Python program to implement with Python NLTK/spaCy/PyNLPl

To implement a Python program with NLTK, spaCy, and PyNLPl (a Python library for natural
language processing), we will cover the following aspects:

1. NLTK: Used for text preprocessing, tokenization, and part-of-speech tagging.
2. spaCy: A more advanced NLP library, suitable for named entity recognition (NER),
dependency parsing, and other sophisticated NLP tasks.
3. PyNLPl: A library for natural language processing (less commonly used than NLTK
and spaCy).

We'll combine all three libraries in a single Python program for some common NLP tasks, such
as text preprocessing, named entity recognition, and tokenization.

Steps to Install Libraries:

Before you run the code, make sure to install the necessary libraries using pip:
bash
pip install nltk spacy pynlpl

Also, for spaCy, download a pre-trained language model (e.g., en_core_web_sm for English):

bash
python -m spacy download en_core_web_sm

Python Program Using NLTK, spaCy, and PyNLPI:

python
import nltk
import spacy
from pynlpl.textprocessors import tokenize  # PyNLPl's tokenizer (module name is 'pynlpl')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from spacy import displacy

# Download NLTK data


nltk.download('punkt')
nltk.download('stopwords')

# Load spaCy model


nlp = spacy.load("en_core_web_sm")

# Sample text for processing


text = "Apple is looking at buying U.K. startup for $1 billion. Steve Jobs co-founded Apple."

# 1. NLTK: Tokenization and stopword removal


print("NLTK Processing:")

# Tokenize text using NLTK


nltk_tokens = word_tokenize(text)
print(f"Tokens using NLTK: {nltk_tokens}")

# Remove stopwords using NLTK


stop_words = set(stopwords.words('english'))
filtered_nltk_tokens = [word for word in nltk_tokens if word.lower() not in stop_words]
print(f"Filtered Tokens using NLTK: {filtered_nltk_tokens}")

# 2. spaCy: Named Entity Recognition (NER) and Dependency Parsing


print("\nspaCy Processing:")

# Process the text with spaCy


doc = nlp(text)

# Named Entity Recognition (NER)


print("Named Entities using spaCy:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

# Dependency Parsing
print("\nDependency Parsing using spaCy:")
for token in doc:
    print(f"{token.text} -> {token.dep_} -> {token.head.text}")

# Visualize Dependency Parsing


# displacy.serve(doc, style="dep") # Uncomment this line to visualize in a browser

# 3. PyNLPl: Tokenization
print("\nPyNLPl Processing:")

# Tokenize text using PyNLPl's textprocessors module
pynlpl_tokens = tokenize(text)
print(f"Tokens using PyNLPl: {pynlpl_tokens}")

Breakdown of the Program:

1. NLTK:
o Tokenization: We use word_tokenize to break the text into individual words.
o Stopword Removal: We remove common English stopwords (like "the", "is",
"and") from the tokenized list using NLTK's stopwords corpus.
2. spaCy:
o Named Entity Recognition (NER): We extract named entities (e.g., "Apple",
"U.K.") from the text using spaCy's built-in ents attribute.
o Dependency Parsing: We analyze the syntactic structure of the sentence, printing
each word's syntactic role (e.g., subject, object).
o Optional Visualization: You can visualize the dependency parsing tree using
displacy.serve, which opens a visualization in the browser.
3. PyNLPl:
o Tokenization: We use PyNLPl's tokenize function (from pynlpl.textprocessors) to split the text into tokens.

Expected Output:

plaintext
NLTK Processing:
Tokens using NLTK: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.',
'Steve', 'Jobs', 'co-founded', 'Apple', '.']
Filtered Tokens using NLTK: ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion', '.',
'Steve', 'Jobs', 'co-founded', 'Apple', '.']

spaCy Processing:
Named Entities using spaCy:
Apple - ORG
U.K. - GPE
$1 billion - MONEY
Steve Jobs - PERSON
Apple - ORG

Dependency Parsing using spaCy:


Apple -> nsubj -> looking
is -> aux -> looking
looking -> ROOT -> looking
at -> prep -> looking
buying -> pcomp -> at
U.K. -> pobj -> at
startup -> attr -> buying
for -> prep -> buying
$ -> quantmod -> billion
1 -> compound -> billion
billion -> pobj -> for
. -> punct -> looking
Steve -> nsubj -> co-founded
Jobs -> appos -> Steve
co-founded -> ROOT -> co-founded
Apple -> dobj -> co-founded
. -> punct -> co-founded

PyNLPl Processing:
Tokens using PyNLPl: ['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion',
'.', 'Steve', 'Jobs', 'co-founded', 'Apple', '.']

Explanation of Output:

• NLTK: The tokens are extracted and stopwords are removed from the text.
• spaCy:
o Named Entity Recognition (NER) identifies entities like Apple, U.K., Steve Jobs, and $1 billion.
o Dependency parsing shows the grammatical relationships between words in the sentence.
• PyNLPl: The tokens extracted by PyNLPl are similar to the ones from NLTK.
