Data Science Practicals
. Step 3: Use data bars, because they compare all the values against each other
. Step 1: Select all the data -> click on Insert -> PivotTable -> select OK
Step 2: A new sheet gets created -> rename the sheet as Pivot Table, then drag and drop the fields you want pivoted
Step 3: Rename the pivot table -> explore the Field Settings option
Step 4: Group options -> Group Selection -> Insert Slicer -> apply the slicer on Item
Step 5: Filter field -> data above/below a certain range can be viewed
C. Use the VLOOKUP function to retrieve information from a different worksheet or table
Step 1: Create a CSV file -> Student.csv -> write the code below
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv(r"E:\Student.csv")
print("Our dataset:")
print(df)
Output:
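The heading mentions VLOOKUP, but the snippet above only reads the file; the pandas equivalent of a VLOOKUP is a merge. A minimal sketch, assuming a second file Marks.csv that shares a RollNo column with Student.csv (the file name and column name are assumptions):
marks = pd.read_csv(r"E:\Marks.csv")  # lookup table (assumed file)
# merge() acts like VLOOKUP: pull matching rows from marks into df by RollNo
result = df.merge(marks, on='RollNo', how='left')
print(result)
The next snippet reads a JSON file instead of a CSV: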
import pandas as pd
# Load the JSON file into a DataFrame
data = pd.read_json(r"E:\data.json")
print(data)
Output:
B. Perform basic data pre-processing tasks such as handling missing values and outliers.
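The listing that follows demonstrates filtering, sorting, and grouping; a minimal sketch of the pre-processing named in the heading, assuming the same Iris CSV (the median fill strategy and the 1.5 * IQR outlier rule are assumptions):
import pandas as pd
iris = pd.read_csv(r"E:\1\iris\iris.csv")
# Handle missing values: fill numeric gaps with each column's median
num_cols = iris.select_dtypes(include='number').columns
iris[num_cols] = iris[num_cols].fillna(iris[num_cols].median())
# Handle outliers: clip values outside 1.5 * IQR per numeric column
q1 = iris[num_cols].quantile(0.25)
q3 = iris[num_cols].quantile(0.75)
iqr = q3 - q1
iris[num_cols] = iris[num_cols].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
print(iris.describe())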
Code:
import pandas as pd
# Load iris dataset
iris = pd.read_csv(r"E:\1\iris\iris.csv")
# Filtering data based on a condition
setosa = iris[iris['species'] == 'setosa']
print("Setosa samples:")
print(setosa.head())
# Sorting data
sorted_iris = iris.sort_values(by='sepal_length', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())
# Grouping data
grouped_species = iris.groupby('species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
Output:
PRACTICAL NO : 3
Original Data: The dataset had raw values for Alcohol and Malic Acid in different ranges.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
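Only the imports appear above; a minimal sketch of the scaling described, assuming the Wine data is available as a CSV with Alcohol and Malic_Acid columns (the path and column names are assumptions):
wine = pd.read_csv(r"E:\wine.csv")  # file path is an assumption
features = wine[['Alcohol', 'Malic_Acid']]  # column names assumed from the description
# Min-max normalization rescales each column to the [0, 1] range
wine_minmax = pd.DataFrame(MinMaxScaler().fit_transform(features), columns=features.columns)
# Standardization rescales each column to zero mean and unit variance
wine_standard = pd.DataFrame(StandardScaler().fit_transform(features), columns=features.columns)
# Compare the distributions before and after scaling
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
features.plot(kind='kde', ax=axes[0], title='Original')
wine_minmax.plot(kind='kde', ax=axes[1], title='Min-Max Scaled')
wine_standard.plot(kind='kde', ax=axes[2], title='Standardized')
plt.show()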
We loaded the Iris dataset and converted the species names into numbers using Label Encoding, since most machine-learning algorithms require numeric inputs. We then added a new column called code containing these numbers and printed the updated dataset.
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
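A minimal sketch of the encoding step described above, assuming the Iris CSV used earlier with a species column (the path is an assumption):
iris = pd.read_csv(r"E:\1\iris\iris.csv")  # path reused from the earlier practical (assumption)
# Encode the species names as integers 0, 1, 2
le = LabelEncoder()
iris['code'] = le.fit_transform(iris['species'])
print(iris.head())
print("Classes:", list(le.classes_))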
Output:
PRACTICAL NO : 4
. Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test)
Code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Generate two illustrative samples (assumed parameters; the listing omits this setup)
np.random.seed(42)
sample1 = np.random.normal(loc=52, scale=10, size=100)
sample2 = np.random.normal(loc=50, scale=10, size=100)
# Two-sample t-test
t_stat, p_value = stats.ttest_ind(sample1, sample2)
alpha = 0.05
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# Draw Conclusions
if p_value < alpha:
    print("Conclusion: There is significant evidence to reject the null hypothesis.")
    if np.mean(sample1) > np.mean(sample2):
        print("Interpretation: Sample 1 has a significantly higher mean than Sample 2.")
    else:
        print("Interpretation: Sample 2 has a significantly higher mean than Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")
Output:
. Chi-square test
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats
warnings.filterwarnings('ignore')
# Load dataset
df = sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())
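The snippet above only loads and summarizes the data; a minimal sketch of the chi-square step itself, binning horsepower so it can be crossed with model_year (the binning scheme is an assumption):
# Bin horsepower into categories so both variables are categorical
df['hp_band'] = pd.cut(df['horsepower'], bins=[0, 75, 150, 250], labels=['low', 'medium', 'high'])
# Contingency table of model_year vs horsepower band
contingency = pd.crosstab(df['model_year'], df['hp_band'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"Chi-square statistic: {chi2:.4f}, p-value: {p:.4f}, dof: {dof}")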
Output:
PRACTICAL NO : 5
Code:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
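Only the imports appear above; a minimal one-way ANOVA sketch with a Tukey HSD follow-up, using illustrative group data (the values are assumptions):
# Illustrative scores for three groups (assumed data)
df = pd.DataFrame({
    'score': [85, 86, 88, 75, 78, 94, 98, 79, 71, 80, 86, 91],
    'group': ['A'] * 4 + ['B'] * 4 + ['C'] * 4,
})
# One-way ANOVA: do the group means differ?
f_stat, p_value = stats.f_oneway(
    df.loc[df['group'] == 'A', 'score'],
    df.loc[df['group'] == 'B', 'score'],
    df.loc[df['group'] == 'C', 'score'],
)
print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")
# Tukey HSD: which pairs of groups differ?
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['group'], alpha=0.05)
print(tukey)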
Output:
PRACTICAL NO : 6
. Explore and interpret the regression model coefficients and goodness-of-fit measures
. Extend the analysis to multiple linear regression and assess the impact of additional predictors
Code:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
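# (Reconstructed setup; these steps are missing from the listing. The choice of
# MedInc as the single predictor is an assumption.)
housing = fetch_california_housing(as_frame=True)
X = housing.data[['MedInc']]  # median income as the lone predictor
y = housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)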
# Make predictions
y_pred = model.predict(X_test)
# Evaluate with MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Display results
print("Simple Linear Regression Results:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R²): {r2:.4f}")
print(f"Intercept (β₀): {model.intercept_:.4f}")
print(f"Coefficient (β₁): {model.coef_[0]:.4f}")
print("\n---------------------------------------------------\n")
# Make predictions
y_pred = model.predict(X_test)
# Evaluate with MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Display results
print("Multiple Linear Regression Results:")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R²): {r2:.4f}")
print(f"Intercept (β₀): {model.intercept_:.4f}")
print(f"Coefficients (β₁, β₂, ...): {model.coef_}")
Output:
PRACTICAL NO : 7
. Construct a decision tree model and interpret the decision rules for classification
Code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
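The listing stops before the model is built; a minimal sketch of the remaining steps, assuming the binary problem is "setosa vs. not setosa" (that framing is an assumption):
# Binary target: 1 if setosa (class 0), else 0 (assumed framing)
iris_df['binary_target'] = (iris_df['target'] == 0).astype(int)
X = iris_df[iris['feature_names']]
y = iris_df['binary_target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a shallow tree so the decision rules stay readable
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
# Visualize the decision rules
plt.figure(figsize=(10, 6))
plot_tree(tree, feature_names=iris['feature_names'], class_names=['other', 'setosa'], filled=True)
plt.show()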
Output:
PRACTICAL NO : 8
. Apply the K-means algorithm to group similar data points into clusters
. Determine the optimal number of clusters using the elbow method or silhouette analysis
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
# Load data
data = pd.read_csv(r"D:\DOWNLOADS\Wholesale customers data.csv")
plt.figure(figsize=(8, 5))
plt.plot(K, sum_of_squared_distances, 'bo-', markersize=6)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Distances')
plt.title('Elbow Method for Optimal k')
plt.show()
# Silhouette Analysis
silhouette_scores = []
for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_labels = km.fit_predict(data_scaled)
    silhouette_scores.append(silhouette_score(data_scaled, cluster_labels))
plt.figure(figsize=(8, 5))
plt.plot(K, silhouette_scores, 'ro-', markersize=6)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal k')
plt.show()
# Choose the best k (based on elbow + silhouette score)
optimal_k = 4  # change this based on the elbow and silhouette graphs
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
data['Cluster'] = kmeans.fit_predict(data_scaled)
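# (Reconstructed: project to 2-D with PCA so the clusters can be plotted;
# this step is missing from the listing.)
pca = PCA(n_components=2)
components = pca.fit_transform(data_scaled)
df_pca = pd.DataFrame(components, columns=['PC1', 'PC2'])
df_pca['Cluster'] = data['Cluster'].values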
plt.figure(figsize=(8, 5))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', palette='viridis', data=df_pca, s=50)
plt.title('Cluster Visualization using PCA')
plt.show()
Output:
PRACTICAL NO : 9
. Evaluate the explained variance and select the appropriate number of principal components
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Load the Iris dataset
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']], columns=iris['feature_names'] + ['target'])
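# (Reconstructed: standardize the features before PCA; this step is missing
# from the listing.)
X = iris_df[iris['feature_names']]
X_scaled = StandardScaler().fit_transform(X)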
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
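The explained variance is computed above but never evaluated; a minimal sketch of selecting the component count, assuming a 95% cumulative-variance threshold (the threshold is an assumption):
print("Explained variance ratio per component:", explained_variance_ratio)
cumulative = np.cumsum(explained_variance_ratio)
# Plot cumulative explained variance against the number of components
plt.plot(range(1, len(cumulative) + 1), cumulative, 'bo-')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by Number of Components')
plt.show()
# Smallest number of components reaching the 95% threshold
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_components}")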
Output:
PRACTICAL NO : 10
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
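The listing stops after seeding the generator; a minimal sketch of generating and visualizing random data (the distributions and chart choices are assumptions):
# Two illustrative variables (assumed distributions)
x = np.random.normal(loc=50, scale=10, size=200)
y = x + np.random.normal(loc=0, scale=5, size=200)
df = pd.DataFrame({'x': x, 'y': y})
# Histogram with a KDE overlay for the distribution of x
sns.histplot(df['x'], kde=True)
plt.title('Distribution of x')
plt.show()
# Scatter plot for the relationship between x and y
sns.scatterplot(x='x', y='y', data=df)
plt.title('x vs y')
plt.show()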
Output: