
SAGE University Indore

LAB MANUAL
Subject Code: ACTDEDSE001P
Subject Name: DATA SCIENCE LAB
Semester: VI (III Year)
Session: Dec - May 2023

Institute of Advance Computing

Submitted to: Prof. Rishi Yadav
Submitted by: Pranav Patidar (Enrollment No.: 21ADV3CSE0117)
INDEX

S.No. Name of the Experiment
1. Write the code to sort an array in NumPy by the nth column, and plot a histogram.
2. How can you copy objects in Python?
3. Write code to sort a DataFrame in Python in descending order.
4. Which method in pandas is used to create a scatter plot matrix?
5. Is it possible to plot a histogram in Pandas without calling Matplotlib?
6. Study of various types of charts for data visualization using Python programming.
7. Write a program to calculate Mean using Python programming.
8. Write a program to calculate Median using Python programming.
9. Write a program to calculate Mode using Python programming.
10. Write a program to calculate correlations using Python programming.
11. Write a program to calculate regression using Python programming.
12. Write a program to calculate Linear regression using Python programming.
13. Write a program to calculate Logistic regression using Python programming.
14. Implement a vector model of information retrieval using Python programming.
15. Implement a decision tree using Python programming.
16. Write a program to calculate different types of probability.
Experiment 1

Write the code to sort an array in NumPy by the nth column, and then write the code to
plot a histogram.

Certainly! Let’s break this down into two steps:

1. Sorting an Array in NumPy by the nth Column: To sort a NumPy array by its
nth column, index the array with the argsort() of that column. Assume you have an
array a and you want to sort it by the second column (index 1):

Python
import numpy as np

# Example array
a = np.array([[9, 2, 3], [4, 5, 6], [7, 0, 5]])

# Sort rows by the values in the second column
sorted_array = a[a[:, 1].argsort()]

print(sorted_array)

The output will be:

[[7 0 5]
 [9 2 3]
 [4 5 6]]

2. Plotting a Histogram using Matplotlib: To create a basic histogram, use
plt.hist(). Here we generate some random data and visualize its distribution:

Python
import matplotlib.pyplot as plt
import numpy as np

# Generate random data (you can replace this with your own data)
data = np.random.randn(1000)

# Create the histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.title('Basic Histogram')
plt.show()
OUTPUT: (histogram figure)
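To tie the two steps together, here is a minimal sketch that sorts by the nth column and then plots a histogram of that same column (the choice n = 1 and the bin count are illustrative assumptions):

Python
import numpy as np
import matplotlib.pyplot as plt

a = np.array([[9, 2, 3], [4, 5, 6], [7, 0, 5]])
n = 1  # column index to sort by (hypothetical choice)

# Sort rows by the nth column, then plot a histogram of that column
sorted_array = a[a[:, n].argsort()]
plt.hist(sorted_array[:, n], bins=5, edgecolor='black')
plt.xlabel('Values in column ' + str(n))
plt.ylabel('Frequency')
plt.title('Histogram of the nth column')
plt.show()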
Experiment 2

How can you copy objects in Python?

Copy in Python (Deep Copy and Shallow Copy)

In Python, assignment statements do not copy objects; they create bindings
between a target and an object. When we use the = operator, it only creates a new
variable that shares the reference of the original object. In order to create “real
copies” or “clones” of these objects, we can use the copy module in Python.

Syntax of Python Deepcopy

Syntax: copy.deepcopy(x)

Syntax of Python Shallowcopy

Syntax: copy.copy(x)

Example:

In order to make these copies, we use the copy module. copy.copy() returns a
shallow copy of the list, and copy.deepcopy() returns a deep copy of the list. As you
can see below, both copies have the same value as the original but different IDs.

Example: This code showcases the usage of the copy module to create both
shallow and deep copies of a nested list li1. A shallow copy, li2, is created using
copy.copy(), preserving the top-level structure but sharing references to the inner
lists. A deep copy, li3, is created using copy.deepcopy(), resulting in a
completely independent copy of li1, including all nested elements. The code
prints the IDs and values of li2 and li3, highlighting the distinction between
shallow and deep copies in terms of reference and independence.

import copy

li1 = [1, 2, [3, 5], 4]

li2 = copy.copy(li1)

print("li2 ID: ", id(li2), "Value: ", li2)

li3 = copy.deepcopy(li1)

print("li3 ID: ", id(li3), "Value: ", li3)

Output:

li2 ID: 2521878674624 Value: [1, 2, [3, 5], 4]


li3 ID: 2521878676160 Value: [1, 2, [3, 5], 4]

What is Deep copy in Python?

A deep copy constructs a new compound object and then, recursively, inserts
copies of the objects found in the original into it. Because every nested object is
itself copied, any changes made to the copy do not reflect in the original object.

Example:

In the following example, the change made in the copied list does not affect the
original list, indicating that the list is deeply copied.

This code illustrates deep copying of a list with nested elements using the copy
module. It initially prints the original elements of li1, then deep copies them to
create li2. A modification to an element in li2 does not affect li1, as
demonstrated by the separate printouts. This highlights how deep copying creates
an independent copy, preserving the original list’s contents even after changes to
the copy.

import copy

li1 = [1, 2, [3, 5], 4]
li2 = copy.deepcopy(li1)

print("The original elements before deep copying")
for i in range(0, len(li1)):
    print(li1[i], end=" ")
print()

# Modify a nested element of the deep copy
li2[2][0] = 7

print("The new list of elements after deep copying")
for i in range(0, len(li2)):
    print(li2[i], end=" ")
print()

print("The original elements after deep copying")
for i in range(0, len(li1)):
    print(li1[i], end=" ")

Output:

The original elements before deep copying


1 2 [3, 5] 4
The new list of elements after deep copying
1 2 [7, 5] 4
The original elements after deep copying
1 2 [3, 5] 4
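For contrast, a shallow copy shares the nested objects, so the same modification made through the copy also changes the original. A minimal sketch:

Python
import copy

li1 = [1, 2, [3, 5], 4]
li2 = copy.copy(li1)  # shallow copy: the inner list [3, 5] is shared

li2[2][0] = 7  # modify a nested element through the copy

print(li1)  # [1, 2, [7, 5], 4] -- the original changed too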
Experiment 3

Write code to sort a DataFrame in Python in descending order.

Certainly! To sort a DataFrame in Python in descending order, you can use the
sort_values() method from the pandas library. Let’s assume you have a
DataFrame called df with columns named 'one', 'two', and 'letter'. You want
to sort it based on the 'one' column in descending order. Here’s how you can do it:
Python
import pandas as pd

# Create a sample DataFrame
d = {'one': [2, 3, 1, 4, 5], 'two': [5, 4, 3, 2, 1], 'letter': ['a', 'a', 'b', 'b', 'c']}
df = pd.DataFrame(d)

# Sort the DataFrame by the 'one' column in descending order
sorted_df = df.sort_values(by=['one'], ascending=False)

# Display the sorted DataFrame
print(sorted_df)

OUTPUT:

   one  two letter
4    5    1      c
3    4    2      b
1    3    4      a
0    2    5      a
2    1    3      b
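sort_values() also accepts multiple columns with per-column sort directions. A small sketch, assuming you want 'letter' ascending and 'one' descending within each letter:

Python
# Sort by 'letter' ascending, then by 'one' descending within each letter
sorted_multi = df.sort_values(by=['letter', 'one'], ascending=[True, False])
print(sorted_multi)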
Experiment 4

Which method in pandas is used to create scatter plot matrix?

To create a scatter plot matrix in pandas, you can use the
pandas.plotting.scatter_matrix() function. This handy method allows you
to visualize the relationships between multiple variables in a dataset simultaneously.
Let’s dive into the details:

Scatter Plot Matrix with pandas.plotting.scatter_matrix()

The syntax for creating a scatter matrix is as follows:
Python
import pandas as pd

# Assume 'df' is your pandas DataFrame
pd.plotting.scatter_matrix(df, alpha=0.5, figsize=None, ax=None, grid=False,
diagonal='hist', marker='.', density_kwds=None, hist_kwds=None,
range_padding=0.05, **kwargs)


Here’s a brief explanation of the parameters:

● frame: The pandas DataFrame containing your data.
● alpha: Amount of transparency applied to the scatter plots (optional).
● figsize: A tuple specifying the width and height of the plot in inches (optional).
● ax: An optional Matplotlib axis object.
● grid: Set this to True if you want to display a grid.
● diagonal: Choose between ‘hist’ (histogram) or ‘kde’ (kernel density estimation) for the diagonal plots.
● marker: Matplotlib marker type for the scatter points (default is ‘.’).
● density_kwds: Keyword arguments for the kernel density estimate plot.
● hist_kwds: Keyword arguments for the histogram plot.
● range_padding: Relative extension of axis range in x and y with respect to the data range.
● **kwargs: Additional keyword arguments to be passed to the scatter function.
The resulting output is a matrix of plots: each off-diagonal cell shows a scatter plot of
one variable against another, while the diagonal shows each variable’s own distribution.
For example, if your DataFrame has columns ‘A’, ‘B’, ‘C’, and ‘D’, the matrix contains a
cell for every pair such as (‘A’, ‘B’), (‘A’, ‘C’), (‘A’, ‘D’), and so on.

Remember, this scatter matrix provides a visual overview of how your variables
interact, making it a powerful tool for exploratory data analysis!
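A minimal runnable sketch, using a small random DataFrame as stand-in data (the column names and sizes are illustrative assumptions):

Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data: four random columns
df = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])

pd.plotting.scatter_matrix(df, alpha=0.5, figsize=(8, 8), diagonal='hist')
plt.show()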
Experiment 5

Is it possible to plot histogram in Pandas without calling Matplotlib?

Certainly! You can create a histogram directly from a Pandas DataFrame without
explicitly calling Matplotlib. Here are a couple of approaches:

1. Using DataFrame.hist(): The df.hist() function in Pandas allows you to create
histograms for specific columns within your DataFrame. You can customize the
number of bins, grid, and other parameters. Here’s the syntax:

2. Python
df.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None,
ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None,
layout=None, bins=10, backend=None, legend=False, **kwargs)

3. You can specify the column you want to create a histogram for, and Pandas
will handle the plotting internally (a minimal runnable example appears at the end
of this experiment).
4. Using numpy.histogram(): If you need more control or want to retrieve the
histogram values without actually plotting, you can use
numpy.histogram(). This function computes the histogram and returns the
bin edges and counts. Here’s an example:
5. Python
import numpy as np

def function_hist(a, ini, final):
    # Define 12 equal-width bins between ini and final
    bins = np.linspace(ini, final, 13)
    # Weight each sample so the histogram values sum to 1
    weights_a = np.ones_like(a) / float(len(a))

    # Compute histogram values and bin edges (no plotting involved)
    hist, bin_edges = np.histogram(np.array(a), bins, weights=weights_a)
    return hist, bin_edges

# Usage
your_data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_values, bin_edges = function_hist(your_data, ini=1, final=10)
print("Histogram values:", hist_values)
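As promised in approach 1 above, here is a minimal df.hist() sketch (the column names and random data are illustrative assumptions):

Python
import numpy as np
import pandas as pd

# Stand-in numeric data
df = pd.DataFrame({'a': np.random.randn(100), 'b': np.random.rand(100)})

# Pandas draws the histograms itself; Matplotlib is used internally,
# but we never import or call it explicitly
df.hist(bins=15, grid=False)

In a Jupyter notebook the plot appears automatically; in a plain script you may still need a matplotlib.pyplot.show() call to display the window, since Matplotlib is the backend Pandas uses under the hood.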
Experiment 6

Study of various types of Charts for Data Visualization using Python Programming
Matplotlib
Matplotlib is an easy-to-use, low-level data visualization library that is
built on NumPy arrays. It provides various plots like scatter plots, line
plots, histograms, etc., and offers a lot of flexibility.
To install it, type the command below in the terminal.

pip install matplotlib
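The examples below read a tips.csv file (a restaurant-tips dataset with columns such as day, tip, size, and total_bill). If you do not have the file, a tiny stand-in with made-up values, just enough to run the examples, can be created like this:

Python
import pandas as pd

# Minimal stand-in for tips.csv (made-up values)
pd.DataFrame({
    'day': ['Sun', 'Sun', 'Mon', 'Tue', 'Tue', 'Wed'],
    'tip': [1.0, 3.5, 2.0, 4.1, 2.5, 3.0],
    'size': [2, 3, 2, 4, 2, 3],
    'total_bill': [10.3, 21.0, 15.8, 30.4, 17.2, 22.5],
}).to_csv("tips.csv", index=False)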

Scatter Plot
Scatter plots are used to observe relationships between variables, using
dots to represent the data points. The scatter() method in the
Matplotlib library is used to draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt

# reading the dataset
data = pd.read_csv("tips.csv")

# Scatter plot with day against tip
plt.scatter(data['day'], data['tip'])

# Adding Title to the Plot
plt.title("Scatter Plot")

# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')

plt.show()

Output:

Line Chart
A line chart is used to represent a relationship between two sets of data, X
and Y, on different axes. It is plotted using the plot() function. Let’s see the
example below.
import pandas as pd
import matplotlib.pyplot as plt

# reading the dataset
data = pd.read_csv("tips.csv")

# Line plot of tip and size against the row index
plt.plot(data['tip'])
plt.plot(data['size'])

# Adding Title to the Plot
plt.title("Line Chart")

# Setting the X and Y labels
plt.xlabel('Index')
plt.ylabel('Value')

plt.show()

Output:

Bar Chart
A bar plot or bar chart is a graph that represents categories of data with
rectangular bars whose lengths or heights are proportional to the values
they represent. It can be created using the bar() method.
import pandas as pd
import matplotlib.pyplot as plt

# reading the dataset
data = pd.read_csv("tips.csv")

# Bar chart with day against tip
plt.bar(data['day'], data['tip'])

plt.title("Bar Chart")

# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')

plt.show()
Output:
Histogram

A histogram is used to represent data in the form of groups. It is a type of
bar plot where the X-axis represents the bin ranges while the Y-axis gives
information about frequency. The hist() function is used to compute and
create a histogram: it automatically counts how often values fall into each
bin.
Example:
import pandas as pd
import matplotlib.pyplot as plt

# reading the dataset
data = pd.read_csv("tips.csv")

# histogram of total_bill
plt.hist(data['total_bill'])
plt.title("Histogram")

plt.show()

Output:
Experiment 7

Write a program to calculate Mean using Python Programming

Calculating the mean (or average) of a list of numbers in Python is straightforward.
You can achieve this using built-in functions or external libraries. Let’s explore a few
methods:

Using Python’s Built-in statistics Module: The statistics module provides a
convenient mean() function. Here’s an example:
Python
import statistics

data = [1, 3, 4, 5, 7, 9, 2]
mean_value = statistics.mean(data)
print(f"Mean is: {mean_value:.6f}")

Output:
Mean is: 4.428571

Using a Custom Function: If you prefer not to use external libraries, you can create
your own function to calculate the mean. Here’s a simple implementation:
Python
def calculate_mean(sample):
    return sum(sample) / len(sample)

data_sample = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
mean_result = calculate_mean(data_sample)
print(f"Mean is: {mean_result:.1f}")

Output:
Mean is: 5.2

Using NumPy (for larger datasets): If you’re working with large datasets, consider
using the NumPy library. It provides efficient array operations, including mean
calculation:
Python
import numpy as np
array_data = np.array([0, 1, 2, 3, 4, 5, 6, 7])
mean_value_np = np.mean(array_data)
print(f"Mean using NumPy: {mean_value_np:.2f}")

Output:
Mean using NumPy: 3.5
Experiment 8

Write a program to calculate Median using Python Programming

Calculating the median in Python is straightforward. Let’s explore a couple of ways to
achieve this:

Using the statistics Module:

Python’s built-in statistics module provides a convenient function to calculate the
median. Here’s an example using this approach:

Python
import statistics

# Create a list of numbers
numbers = [1, 2, 3, 4, 5, 6, 7]

# Calculate the median
med = statistics.median(numbers)

# Print the result
print(f"The median value is: {med}")

The output will be:

The median value is: 4

The statistics.median() function sorts the data in ascending order before
calculating the median. If the dataset has an even number of values, it takes the
average of the two middle values.

Using NumPy (for more complex scenarios):

If you’re dealing with larger datasets or need additional functionality, consider
using NumPy. Install NumPy if you haven’t already (pip install numpy).
Here’s an example using NumPy:


Python
import numpy as np
# Create an array
array1 = np.array([0, 1, 2, 3, 4, 5, 6, 7])

# Calculate the median
median1 = np.median(array1)

# Print the result
print(f"The median value using NumPy is: {median1}")

Output:

The median value using NumPy is: 3.5

Because this array has an even number of values, NumPy returns the average of the
two middle values (3 and 4).

Experiment 9

Write a program to calculate Mode using Python Programming

Let’s create a Python program to calculate the mode of a given dataset. The mode
represents the most frequently occurring value in the dataset.
Python
from statistics import mode, StatisticsError

def calculate_mode(data):
    """
    Calculates the mode of a list of numeric data.

    Args:
        data (list): A list of numeric values.

    Returns:
        float or str: The mode value in the dataset, or a message if none exists.
    """
    try:
        # Calculate the mode using Python's built-in mode() function
        return mode(data)
    except StatisticsError:
        # Raised on Python versions before 3.8 when no unique mode exists
        return "No unique mode found in the dataset."

# Example usage
sample_data = [4, 1, 2, 2, 3, 5]
mode_result = calculate_mode(sample_data)
print(f"The mode of the dataset is: {mode_result}")

OUTPUT:
The mode of the dataset is: 2

In this program:

● We define a function calculate_mode that takes a list of numeric values as input.
● Inside the function, we use Python’s statistics.mode() function to find the mode.
● If there is no unique mode (i.e., multiple values occur with the same highest
frequency), statistics.mode() raises StatisticsError on Python versions before 3.8,
and the function returns a message indicating that; from Python 3.8 onward, mode()
instead returns the first mode encountered (see the multimode() sketch below).
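When several values tie for the highest frequency, statistics.multimode() (available since Python 3.8) returns all of them. A small sketch:

Python
from statistics import multimode

# Two values (2 and 5) each occur twice, so both are modes
print(multimode([4, 1, 2, 2, 5, 5, 3]))  # [2, 5]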
Experiment 10

Write a program to calculate correlations using python programming.

Calculating correlations between variables is essential in data analysis. Let’s explore
how to calculate different correlation coefficients using Python. We’ll use libraries
like NumPy, SciPy, and pandas.

1. Pearson Correlation Coefficient:
○ The Pearson correlation measures the linear relationship between two continuous variables.
○ It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
○ Here’s an example using NumPy:
Python
import numpy as np

# Generate random data
np.random.seed(100)
var1 = np.random.randint(0, 10, 50)
var2 = var1 + np.random.normal(0, 10, 50)

# Calculate Pearson correlation (off-diagonal entry of the 2x2 correlation matrix)
correlation_matrix = np.corrcoef(var1, var2)
pearson_coefficient = correlation_matrix[0, 1]

print(f"Pearson Correlation Coefficient: {pearson_coefficient:.4f}")

OUTPUT:

Pearson Correlation Coefficient: 0.3350

2. Spearman and Kendall Correlation:
○ These coefficients are useful when dealing with non-linear relationships or ranked data.
○ SciPy provides functions for both Spearman and Kendall correlations.
○ Example using SciPy:
Python
from scipy.stats import spearmanr, kendalltau

# Calculate Spearman correlation
spearman_corr, _ = spearmanr(var1, var2)
print(f"Spearman Correlation Coefficient: {spearman_corr:.4f}")

# Calculate Kendall correlation
kendall_corr, _ = kendalltau(var1, var2)
print(f"Kendall Correlation Coefficient: {kendall_corr:.4f}")

OUTPUT:

Spearman Correlation Coefficient: 0.3844
Kendall Correlation Coefficient: 0.2644

3. Correlation Matrix with pandas:
○ You can compute the correlation matrix for multiple variables using pandas:
Python
import pandas as pd

# Create a DataFrame with your data
df = pd.DataFrame({"Variable1": var1, "Variable2": var2})

# Calculate the correlation matrix
correlation_matrix_df = df.corr()

print("Correlation Matrix:")
print(correlation_matrix_df)

OUTPUT:

Correlation Matrix:
           Variable1  Variable2
Variable1   1.000000   0.335018
Variable2   0.335018   1.000000

4. Visualization:
○ To visualize correlations, you can create scatter plots, regression lines, or
heatmaps using Matplotlib; a minimal heatmap sketch follows below.
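A minimal heatmap sketch using Matplotlib's imshow(), assuming the correlation_matrix_df computed above is available:

Python
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix computed above
fig, ax = plt.subplots()
im = ax.imshow(correlation_matrix_df, cmap='coolwarm', vmin=-1, vmax=1)
labels = list(correlation_matrix_df.columns)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(im)
ax.set_title("Correlation Heatmap")
plt.show()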
Experiment 11

Write a program to calculate regression using python programming.

Certainly! Below is a simple Python program using the `scikit-learn` library to perform linear
regression:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 9])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Coefficients
slope = model.coef_[0]
intercept = model.intercept_

print("Slope:", slope)
print("Intercept:", intercept)

# Predictions
y_pred = model.predict(X)

print("Predictions:", y_pred)
```
OUTPUT:
Slope: 1.8
Intercept: -0.20000000000000018
Predictions: [1.6 3.4 5.2 7. 8.8]

In this program:
- We first import the necessary libraries: `numpy` for numerical operations and
`LinearRegression` from `sklearn.linear_model` for linear regression.
- We define our sample data `X` and `y`, where `X` is a 2D array representing the features
and `y` is a 1D array representing the target variable.
- We create an instance of the `LinearRegression` model and fit it to our data using the `fit()`
method.
- We print the coefficients (slope and intercept) of the regression line using the `coef_` and
`intercept_` attributes of the model.
- Finally, we use the model to make predictions on the input data `X` and print the predicted
values.
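
Once fitted, the same model can also score inputs it has not seen. A small sketch, using a hypothetical new point x = 6:

```python
# Predict for a new, unseen input (expected: 1.8 * 6 - 0.2 = 10.6)
print(model.predict(np.array([[6]])))
```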
Experiment 12

Write a program to calculate Linear regression using python programming.

Certainly! Here's a Python program to calculate linear regression using the least squares
method without using any external libraries:

```python
# Sample data
X = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 9]

# Calculate mean of X and y
mean_X = sum(X) / len(X)
mean_y = sum(y) / len(y)

# Calculate slope (m) and y-intercept (b) via least squares
numerator = sum((x - mean_X) * (yi - mean_y) for x, yi in zip(X, y))
denominator = sum((x - mean_X) ** 2 for x in X)
slope = numerator / denominator
intercept = mean_y - slope * mean_X

# Print slope and y-intercept
print("Slope:", slope)
print("Intercept:", intercept)

# Predictions
y_pred = [slope * x + intercept for x in X]
print("Predictions:", y_pred)
```

OUTPUT:

Slope: 1.8
Intercept: -0.20000000000000018
Predictions: [1.5999999999999999, 3.4, 5.2, 7.0, 8.8]

In this program:
- We define our sample data `X` and `y`.
- We calculate the mean of `X` and `y`.
- We use the least squares method to calculate the slope (`m`) and y-intercept (`b`) of the
regression line.
- We print the slope and y-intercept.
- We make predictions for the target variable `y` based on the calculated slope and
y-intercept.

You can modify the `X` and `y` lists with your own data to perform linear regression on your
dataset. Keep in mind that this implementation is a basic example and may not be as efficient
or robust as using external libraries like `numpy` or `scikit-learn` for more complex tasks.
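
As a quick sanity check, the hand-computed coefficients can be compared against numpy.polyfit. Treat this purely as optional verification, since the exercise above deliberately avoids external libraries:

```python
import numpy as np

# Degree-1 polynomial fit returns [slope, intercept]
slope_np, intercept_np = np.polyfit(X, y, 1)
print("Slope:", slope_np, "Intercept:", intercept_np)  # expected: roughly 1.8 and -0.2
```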
Experiment 13

Write a program to calculate Logistic regression using python programming.


Here's a simple implementation of logistic regression using the `scikit-learn` library in
Python:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Create and fit the logistic regression model
# (max_iter raised from the default so the lbfgs solver converges cleanly)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

```
OUTPUT:

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

In this program:
- We first import necessary libraries from `scikit-learn`.
- We load the Iris dataset using `load_iris()` function from `sklearn.datasets`.
- We split the data into training and testing sets using `train_test_split()` function.
- We create an instance of the Logistic Regression model and fit it to the training data using
the `fit()` method.
- We make predictions on the test set using the `predict()` method.
- We calculate the accuracy of the model using `accuracy_score()` function from
`sklearn.metrics`.
- We print the classification report which includes precision, recall, f1-score, and support for
each class using `classification_report()` function from `sklearn.metrics`.

You can replace the `X` and `y` with your own dataset to perform logistic regression on your
data. This example uses the Iris dataset as a demonstration. Make sure your data is properly
preprocessed before applying logistic regression.
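
Beyond hard class labels, logistic regression also yields per-class probabilities. A small sketch using the model fitted above:

```python
# Per-class probabilities for the first two test samples
# (each row sums to 1 across the three Iris classes)
print(model.predict_proba(X_test[:2]))
```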
Experiment 14

Implement a vector model of information retrieval using python programming.

Implementing a vector space model of information retrieval involves several steps, including
preprocessing the text data, constructing a document-term matrix, calculating the TF-IDF
(Term Frequency-Inverse Document Frequency) weights, and performing similarity
calculations between documents and queries.

Below is a Python implementation of a simple vector space model of information retrieval
using the `scikit-learn` library:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
documents = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]

# Sample query
query = "This is the second document."

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the documents and transform them into a document-term matrix
tfidf_matrix = vectorizer.fit_transform(documents)

# Transform the query into a document-term matrix
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between the query vector and document vectors
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix)

# Get the document with the highest similarity score
most_similar_doc_index = cosine_similarities.argmax()
most_similar_doc = documents[most_similar_doc_index]

print("Most similar document:", most_similar_doc)


```

OUTPUT:

Most similar document: This document is the second document.


In this implementation:
- We have a list of sample documents and a sample query.
- We use the `TfidfVectorizer` class from `scikit-learn` to convert the text data into a TF-IDF
weighted document-term matrix.
- We transform the query into a document-term matrix using the same vectorizer.
- We calculate cosine similarity between the query vector and document vectors using
`cosine_similarity` function from `sklearn.metrics.pairwise`.
- We identify the document with the highest similarity score to the query based on cosine
similarity.

You can replace the `documents` list and `query` string with your own dataset and query to
perform information retrieval using the vector space model. Make sure your documents are
preprocessed and tokenized appropriately before applying the vectorizer.
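
To inspect the whole ranking rather than just the top hit, the similarity scores can be printed per document. A brief sketch using the variables defined above:

```python
# Rank all documents by similarity to the query
for score, doc in sorted(zip(cosine_similarities[0], documents), reverse=True):
    print(f"{score:.3f}  {doc}")
```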
Experiment 15

Implement a decision tree using python programming.

Sure! Here's an example of implementing a decision tree classifier using the popular library
`scikit-learn`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Create a decision tree classifier
clf = DecisionTreeClassifier()

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
```
OUTPUT:

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

In this example:
- We first import necessary modules from `scikit-learn`.
- We load the Iris dataset using `load_iris()` function from `sklearn.datasets`.
- We split the data into training and testing sets using `train_test_split()` function from
`sklearn.model_selection`.
- We create an instance of the Decision Tree classifier using `DecisionTreeClassifier()` class.
- We fit the classifier to the training data using the `fit()` method.
- We make predictions on the test data using the `predict()` method.
- We calculate the accuracy of the model using `accuracy_score()` function from
`sklearn.metrics`.
- We print the classification report which includes precision, recall, f1-score, and support for
each class using `classification_report()` function from `sklearn.metrics`.

You can replace the Iris dataset with your own dataset and adjust the parameters of the
decision tree classifier as needed. Additionally, you can visualize the decision tree, for
example with `graphviz` or scikit-learn's built-in plotting helper, to gain insights into the
decision-making process of the model; a sketch follows below.
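
Scikit-learn also ships a Matplotlib-based alternative to graphviz, `sklearn.tree.plot_tree()`. A minimal sketch using the classifier fitted above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree with feature and class names from the Iris dataset
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```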
Experiment 16

Write a program to calculate different types of probability.

Sure, here's a Python program to calculate different types of probabilities:

```python
# Importing required libraries
import numpy as np
from scipy.stats import binom, poisson, norm

# Probability of getting exactly k successes in n independent
# Bernoulli trials with probability p of success
def binomial_probability(n, k, p):
    return binom.pmf(k, n, p)

# Probability of getting k or fewer successes in n independent
# Bernoulli trials with probability p of success
def binomial_cumulative_probability(n, k, p):
    return binom.cdf(k, n, p)

# Probability of getting exactly k occurrences in a fixed interval of time
# or space, if events occur independently and at a constant rate
def poisson_probability(k, lambd):
    return poisson.pmf(k, lambd)

# Probability density of a random variable at a specified value
def normal_probability(x, mean, std):
    return norm.pdf(x, mean, std)

# Cumulative probability of a random variable falling below a specified value
def normal_cumulative_probability(x, mean, std):
    return norm.cdf(x, mean, std)

# Example usage
if __name__ == "__main__":
    # Binomial probability
    n = 10   # Number of trials
    k = 5    # Number of successes
    p = 0.5  # Probability of success
    print("Binomial Probability:", binomial_probability(n, k, p))

    # Binomial cumulative probability
    print("Binomial Cumulative Probability:", binomial_cumulative_probability(n, k, p))

    # Poisson probability
    k = 2      # Number of occurrences
    lambd = 3  # Average rate
    print("Poisson Probability:", poisson_probability(k, lambd))

    # Normal probability density
    x = 1     # Value of the random variable
    mean = 0  # Mean of the distribution
    std = 1   # Standard deviation of the distribution
    print("Normal Probability:", normal_probability(x, mean, std))

    # Normal cumulative probability
    print("Normal Cumulative Probability:", normal_cumulative_probability(x, mean, std))
```

OUTPUT:

Binomial Probability: 0.24609375000000003
Binomial Cumulative Probability: 0.623046875
Poisson Probability: 0.22404180765538775
Normal Probability: 0.24197072451914337
Normal Cumulative Probability: 0.8413447460685429

In this program:
- We use functions from the `scipy.stats` module to calculate various types of probabilities:
binomial probability, cumulative binomial probability, Poisson probability, normal
probability density, and cumulative normal probability.
- Each function takes input parameters specific to the probability distribution being calculated
and returns the corresponding probability value.
- We demonstrate the usage of each function with example inputs.

You can adjust the input parameters and probability distributions as needed for your specific
calculations.
