Pranav Data Science Lab
LAB MANUAL
Subject Code: ACTDEDSE001P
Subject Name: DATA SCIENCE LAB
Semester: VI Sem, III Year
Session: Dec - May 2023
Write the code to sort an array in NumPy by the nth column, and then write the code to plot a histogram.
import numpy as np
# Example array
a = np.array([[9, 2, 3], [4, 5, 6], [7, 0, 5]])
# Sort rows by the nth column (here n = 1, the second column)
sorted_array = a[a[:, 1].argsort()]
print(sorted_array)
The output will be:
[[7 0 5]
[9 2 3]
[4 5 6]]
Plotting a Histogram using Matplotlib: To create a basic histogram using Matplotlib, you can use the following code snippet. Here, we generate some random data and visualize its distribution:
Python
import matplotlib.pyplot as plt
import numpy as np
# Generate random data (you can replace this with your own data)
data = np.random.randn(1000)
# Plot the distribution as a histogram with 30 bins
plt.hist(data, bins=30)
plt.title("Histogram")
plt.show()
In order to make these copies, we use the copy module. copy() returns a shallow copy of the list, and deepcopy() returns a deep copy of the list. As the example below shows, both copies have the same value as the original but different IDs.
Syntax: copy.copy(x)
Syntax: copy.deepcopy(x)
Example: This code showcases the usage of the copy module to create both
shallow and deep copies of a nested list li1. A shallow copy, li2, is created using
copy.copy(), preserving the top-level structure but sharing references to the inner
lists. A deep copy, li3, is created using copy.deepcopy(), resulting in a
completely independent copy of li1, including all nested elements. The code
prints the IDs and values of li2 and li3, highlighting the distinction between
shallow and deep copies in terms of reference and independence.
import copy
li1 = [1, 2, [3, 5], 4]  # original nested list
li2 = copy.copy(li1)      # shallow copy
li3 = copy.deepcopy(li1)  # deep copy
print("li2 ID:", id(li2), "Value:", li2)
print("li3 ID:", id(li3), "Value:", li3)
Output:
A deep copy constructs a new compound object and then recursively populates it with copies of the child objects found in the original. Because the copy is fully independent of the original, any changes made to the copy do not reflect in the original object.
Example: As shown below, a change made to the deep copy does not affect the original list, indicating the list is deeply copied.
This code illustrates deep copying of a list with nested elements using the copy
module. It initially prints the original elements of li1, then deep copies them to
create li2. A modification to an element in li2 does not affect li1, as
demonstrated by the separate printouts. This highlights how deep copying creates
an independent copy, preserving the original list’s contents even after changes to
the copy.
import copy
li1 = [1, 2, [3, 5], 4]
li2 = copy.deepcopy(li1)
# Print the original elements
for i in range(0, len(li1)):
    print(li1[i], end=" ")
print()
# Change a nested element in the deep copy
li2[2][0] = 7
# li1 is unaffected by the change
for i in range(0, len(li1)):
    print(li1[i], end=" ")
print()
Output:
1 2 [3, 5] 4
1 2 [3, 5] 4
Certainly! To sort a DataFrame in Python in descending order, you can use the
sort_values() method from the pandas library. Let’s assume you have a
DataFrame called df with columns named 'one', 'two', and 'letter'. You want
to sort it based on the 'one' column in descending order. Here’s how you can do it:
Python
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({'one': [3, 1, 2], 'two': [6, 4, 5], 'letter': ['a', 'b', 'c']})
# Sort by the 'one' column in descending order
df_sorted = df.sort_values(by='one', ascending=False)
print(df_sorted)
OUTPUT:
   one  two letter
0    3    6      a
2    2    5      c
1    1    4      b
Remember, a scatter matrix provides a visual overview of how your variables interact, making it a powerful tool for exploratory data analysis; a minimal sketch follows.
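This sketch assumes a small, hypothetical numeric DataFrame and uses pandas.plotting.scatter_matrix to draw the pairwise plots:
Python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
# Hypothetical numeric DataFrame
df = pd.DataFrame({'one': [3, 1, 2, 5], 'two': [6, 4, 5, 2], 'three': [1, 8, 3, 7]})
# Pairwise scatter plots, with histograms on the diagonal
scatter_matrix(df, diagonal='hist')
plt.show()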
Experiment 5
Certainly! You can create a histogram directly from a Pandas DataFrame without
explicitly calling Matplotlib. Here are a couple of approaches:
1. Using DataFrame.hist(): Pandas can plot a histogram directly from the DataFrame. The full signature is:
Python
df.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None,
ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None,
layout=None, bins=10, backend=None, legend=False, **kwargs)
You can specify the column you want to create a histogram for, and Pandas will handle the plotting internally, as the sketch below shows.
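This is a minimal sketch on hypothetical data (the column name total_bill is an assumption for illustration):
Python
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical data
df = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59, 25.29]})
# Histogram of one column, plotted by Pandas itself
df.hist(column='total_bill', bins=5)
plt.show()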
2. Using numpy.histogram(): If you need more control or want to retrieve the histogram values without actually plotting, you can use numpy.histogram(). This function computes the histogram and returns the bin counts and bin edges. Here's an example:
Python
import numpy as np

def function_hist(a, ini, final, num_bins=9, weights_a=None):
    # Evenly spaced bin edges between ini and final
    bins = np.linspace(ini, final, num_bins + 1)
    # Compute histogram counts and bin edges
    hist, bin_edges = np.histogram(np.array(a), bins, weights=weights_a)
    return hist, bin_edges

# Usage
your_data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_values, bin_edges = function_hist(your_data, ini=1, final=10)
print(hist_values)
print(bin_edges)
Experiment 6
Study of various types of Charts for Data Visualization using Python Programming
Matplotlib
Matplotlib is an easy-to-use, low-level data visualization library that is
built on NumPy arrays. It consists of various plots like scatter plot, line
plot, histogram, etc. Matplotlib provides a lot of flexibility.
To install it, type the below command in the terminal:
pip install matplotlib
Scatter Plot
Scatter plots are used to observe relationships between variables, using dots to represent individual data points. The scatter() method in the matplotlib library is used to draw a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
# tips.csv is assumed to be in the working directory
data = pd.read_csv("tips.csv")
# Scatter plot of day vs. tip
plt.scatter(data['day'], data['tip'])
plt.title("Scatter Plot")
plt.show()
Output:
Line Chart
A line chart is used to represent a relationship between two sets of data, X and Y, on different axes. It is plotted using the plot() function. Let's see the below example.
import pandas as pd
import matplotlib.pyplot as plt
# tips.csv is assumed to be in the working directory
data = pd.read_csv("tips.csv")
# Line chart of the tip values in row order
plt.plot(data['tip'])
plt.title("Line Chart")
plt.show()
Output:
Bar Chart
A bar plot or bar chart is a graph that represents a category of data with rectangular bars whose lengths and heights are proportional to the values they represent. It can be created using the bar() method.
import pandas as pd
import matplotlib.pyplot as plt
# tips.csv is assumed to be in the working directory
data = pd.read_csv("tips.csv")
# Bar chart of day vs. tip
plt.bar(data['day'], data['tip'])
plt.title("Bar Chart")
plt.show()
# Histogram of total_bills
plt.hist(data['total_bill'])
plt.title("Histogram")
plt.show()
Output:
Experiment 7
Using the statistics module: Python's built-in statistics module provides a mean() function:
Python
import statistics
data = [1, 3, 4, 5, 7, 9, 2]
mean_value = statistics.mean(data)
print(f"Mean is: {mean_value:.6f}")
Output:
Mean is: 4.428571
Using a Custom Function: If you prefer not to use external libraries, you can create
your own function to calculate the mean. Here’s a simple implementation:
Python
def calculate_mean(sample):
    return sum(sample) / len(sample)

data_sample = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
mean_result = calculate_mean(data_sample)
print(f"Mean is: {mean_result:.1f}")
Output:
Mean is: 5.2
Using NumPy (for larger datasets): If you’re working with large datasets, consider
using the NumPy library. It provides efficient array operations, including mean
calculation:
Python
import numpy as np
array_data = np.array([0, 1, 2, 3, 4, 5, 6, 7])
mean_value_np = np.mean(array_data)
print(f"Mean using NumPy: {mean_value_np:.2f}")
Output:
Mean using NumPy: 3.5
Experiment 8
Let’s create a Python program to calculate the mode of a given dataset. The mode
represents the most frequently occurring value in the dataset.
Python
from statistics import mode, StatisticsError

def calculate_mode(data):
    """
    Calculates the mode of a list of numeric data.

    Args:
        data (list): A list of numeric values.

    Returns:
        float or str: The mode value in the dataset.
    """
    try:
        # Calculate the mode using the statistics module's mode() function
        result = mode(data)
        return result
    except StatisticsError:
        return "No unique mode found in the dataset."
# Example usage
sample_data = [4, 1, 2, 2, 3, 5]
mode_result = calculate_mode(sample_data)
print(f"The mode of the dataset is: {mode_result}")
OUTPUT:
The mode of the dataset is: 2
In this program:
- calculate_mode() wraps the statistics module's mode() function, which returns the most frequent value in the list.
- A StatisticsError, raised on older Python versions when there is no unique mode, is caught and reported.
print("Correlation Matrix:")
print(correlation_matrix_df)
OUTPUT:
Correlation Matrix:
Variable1 Variable2
Visualization: To visualize correlations, you can create scatter plots, regression lines, or heatmaps using Matplotlib, as in the sketch below.
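A minimal heatmap sketch with Matplotlib's matshow(), assuming the correlation_matrix_df computed above:
Python
import matplotlib.pyplot as plt
# Heatmap of the correlation matrix
plt.matshow(correlation_matrix_df, cmap='coolwarm')
plt.colorbar()
plt.xticks(range(len(correlation_matrix_df.columns)), correlation_matrix_df.columns)
plt.yticks(range(len(correlation_matrix_df.columns)), correlation_matrix_df.columns)
plt.show()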
Experiment 11
Certainly! Below is a simple Python program using the `scikit-learn` library to perform linear
regression:
```python
import numpy as np
from sklearn.linear_model import LinearRegression
# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 9])
# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, y)
# Coefficients
slope = model.coef_[0]
intercept = model.intercept_
print("Slope:", slope)
print("Intercept:", intercept)
# Predictions
y_pred = model.predict(X)
print("Predictions:", y_pred)
```
OUTPUT:
Slope: 1.8
Intercept: -0.20000000000000018
Predictions: [1.6 3.4 5.2 7. 8.8]
In this program:
- We first import the necessary libraries: `numpy` for numerical operations and
`LinearRegression` from `sklearn.linear_model` for linear regression.
- We define our sample data `X` and `y`, where `X` is a 2D array representing the features
and `y` is a 1D array representing the target variable.
- We create an instance of the `LinearRegression` model and fit it to our data using the `fit()`
method.
- We print the coefficients (slope and intercept) of the regression line using the `coef_` and
`intercept_` attributes of the model.
- Finally, we use the model to make predictions on the input data `X` and print the predicted
values.
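To visualize the fit, here is a minimal Matplotlib sketch plotting the data points and the fitted line (it assumes `model`, `X`, and `y` from the program above):
```python
import matplotlib.pyplot as plt

# Data points and the fitted regression line
plt.scatter(X, y, label="data")
plt.plot(X, model.predict(X), color="red", label="fitted line")
plt.legend()
plt.show()
```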
Experiment 12
Certainly! Here's a Python program to calculate linear regression using the least squares
method without using any external libraries:
```python
# Sample data
X = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 9]
# Means of X and y
mean_x = sum(X) / len(X)
mean_y = sum(y) / len(y)
# Least squares estimates of the slope and intercept
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, y)) / sum((xi - mean_x) ** 2 for xi in X)
intercept = mean_y - slope * mean_x
print("Slope:", slope)
print("Intercept:", intercept)
# Predictions
y_pred = [slope * x + intercept for x in X]
print("Predictions:", y_pred)
```
OUTPUT:
Slope: 1.8
Intercept: -0.20000000000000018
Predictions: [1.5999999999999999, 3.4, 5.2, 7.0, 8.8]
In this program:
- We define our sample data `X` and `y`.
- We calculate the mean of `X` and `y`.
- We use the least squares method to calculate the slope and y-intercept of the regression line.
- We print the slope and y-intercept.
- We make predictions for the target variable `y` based on the calculated slope and
y-intercept.
You can modify the `X` and `y` lists with your own data to perform linear regression on your
dataset. Keep in mind that this implementation is a basic example and may not be as efficient
or robust as using external libraries like `numpy` or `scikit-learn` for more complex tasks.
Experiment 13
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
X, y = load_iris(return_X_y=True)
# Split into training and testing sets (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
```
OUTPUT:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
In this program:
- We first import necessary libraries from `scikit-learn`.
- We load the Iris dataset using `load_iris()` function from `sklearn.datasets`.
- We split the data into training and testing sets using `train_test_split()` function.
- We create an instance of the Logistic Regression model and fit it to the training data using
the `fit()` method.
- We make predictions on the test set using the `predict()` method.
- We calculate the accuracy of the model using `accuracy_score()` function from
`sklearn.metrics`.
- We print the classification report which includes precision, recall, f1-score, and support for
each class using `classification_report()` function from `sklearn.metrics`.
You can replace the `X` and `y` with your own dataset to perform logistic regression on your
data. This example uses the Iris dataset as a demonstration. Make sure your data is properly
preprocessed before applying logistic regression.
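A common preprocessing step is feature scaling. Here is a minimal sketch with scikit-learn's `StandardScaler` (not required for this Iris demo; shown only as an illustration, reusing `X_train` and `X_test` from above):
```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on training data only, then apply the same transform to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```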
Experiment 14
Implementing a vector space model of information retrieval involves several steps, including
preprocessing the text data, constructing a document-term matrix, calculating the TF-IDF
(Term Frequency-Inverse Document Frequency) weights, and performing similarity
calculations between documents and queries.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Sample documents
documents = [
"This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?",
]
# Sample query
query = "This is the second document."
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Fit the vectorizer on the documents and transform them into a document-term matrix
tfidf_matrix = vectorizer.fit_transform(documents)
# Transform the query into the same TF-IDF vector space
query_vector = vectorizer.transform([query])
# Calculate cosine similarity between the query vector and document vectors
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix)
print("Cosine similarities:", cosine_similarities)
```
OUTPUT:
You can replace the `documents` list and `query` string with your own dataset and query to
perform information retrieval using the vector space model. Make sure your documents are
preprocessed and tokenized appropriately before applying the vectorizer.
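As a minimal preprocessing sketch (the `preprocess()` helper below is a hypothetical illustration, not part of the experiment), lowercasing and stripping punctuation before vectorizing:
```python
import string

def preprocess(text):
    # Lowercase and remove punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))

docs_clean = [preprocess(d) for d in ["This is the first document.", "Is this the FIRST document?"]]
print(docs_clean)
```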
Experiment 15
Sure! Here's an example of implementing a decision tree classifier using the popular library
`scikit-learn`:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset
X, y = load_iris(return_X_y=True)
# Split into training and testing sets (random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit the decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
```
OUTPUT:
Accuracy: 1.0
Classification Report:
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
In this example:
- We first import necessary modules from `scikit-learn`.
- We load the Iris dataset using `load_iris()` function from `sklearn.datasets`.
- We split the data into training and testing sets using `train_test_split()` function from
`sklearn.model_selection`.
- We create an instance of the Decision Tree classifier using `DecisionTreeClassifier()` class.
- We fit the classifier to the training data using the `fit()` method.
- We make predictions on the test data using the `predict()` method.
- We calculate the accuracy of the model using `accuracy_score()` function from
`sklearn.metrics`.
- We print the classification report which includes precision, recall, f1-score, and support for
each class using `classification_report()` function from `sklearn.metrics`.
You can replace the Iris dataset with your own dataset and adjust the parameters of the
decision tree classifier as needed. Additionally, you can visualize the decision tree using tools
like `graphviz` to gain insights into the decision-making process of the model.
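For a quick look without graphviz, scikit-learn's own `plot_tree()` can render the fitted tree; a minimal sketch, assuming the `clf` fitted above:
```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Render the fitted decision tree with colored (filled) nodes
plt.figure(figsize=(10, 6))
plot_tree(clf, filled=True)
plt.show()
```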
Experiment 16
```python
# Importing required libraries
import numpy as np
from scipy.stats import binom, poisson, norm
# Probability of exactly k successes in n Bernoulli trials
def binomial_probability(n, k, p):
    return binom.pmf(k, n, p)

# Probability of exactly k occurrences at average rate lambd
def poisson_probability(k, lambd):
    return poisson.pmf(k, lambd)

# Probability density of x under a normal distribution
def normal_probability(x, mean, std):
    return norm.pdf(x, mean, std)

# Example usage
if __name__ == "__main__":
# Binomial probability
n = 10 # Number of trials
k = 5 # Number of successes
p = 0.5 # Probability of success
print("Binomial Probability:", binomial_probability(n, k, p))
# Poisson probability
k = 2 # Number of occurrences
lambd = 3 # Average rate
print("Poisson Probability:", poisson_probability(k, lambd))
# Normal probability
x = 1 # Value of the random variable
mean = 0 # Mean of the distribution
std = 1 # Standard deviation of the distribution
print("Normal Probability:", normal_probability(x, mean, std))
OUTPUT:
In this program:
- We use functions from the `scipy.stats` module to calculate binomial, Poisson, and normal probabilities; the cumulative variants are sketched after this list.
- Each function takes input parameters specific to the probability distribution being calculated
and returns the corresponding probability value.
- We demonstrate the usage of each function with example inputs.
You can adjust the input parameters and probability distributions as needed for your specific
calculations.
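The cumulative variants mentioned above can be sketched the same way with the distributions' `cdf()` methods:
```python
from scipy.stats import binom, norm

# Cumulative binomial probability: P(X <= 5) for Binomial(n=10, p=0.5)
print("Cumulative Binomial Probability:", binom.cdf(5, 10, 0.5))
# Cumulative normal probability: P(X <= 1) for the standard normal
print("Cumulative Normal Probability:", norm.cdf(1, 0, 1))
```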