Experiment No: 1 WORKING WITH PANDAS DATA FRAMES
Algorithm for the 1st Program
1. Start
2. Import the pandas library (import pandas as pd).
3. Create a dictionary named data with two keys:
o "calories" containing a list [420, 380, 390]
o "duration" containing a list [50, 40, 45]
4. Convert the dictionary into a Pandas DataFrame (df = pd.DataFrame(data)).
5. Access the row at index 0 using df.loc[0].
6. Print the retrieved row.
7. End
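The steps above can be sketched as the short program below. This is a minimal illustration that follows the algorithm line by line; the variable name first_row is an assumption added for readability.

```python
import pandas as pd

# Column data taken from step 3 of the algorithm.
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}

# Each dictionary key becomes a column; rows get the default index 0, 1, 2.
df = pd.DataFrame(data)

# df.loc[0] selects the row labelled 0 and returns it as a pandas Series.
first_row = df.loc[0]
print(first_row)
```

Because the DataFrame uses the default integer index, the label 0 passed to df.loc happens to coincide with the first position.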
Basic plots using Matplotlib
Algorithm for the Given Matplotlib Program
1. Start
2. Import the matplotlib.pyplot module (import matplotlib.pyplot as plt).
3. Initialize data lists:
o a = [1, 2, 3, 4, 5]
o b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
4. Plot list a using plt.plot(a).
5. Plot list b using red circles (plt.plot(b, "or")).
6. Plot a list of numbers generated using range(0, 22, 3).
7. Label the x-axis as 'Day ->' using plt.xlabel().
8. Label the y-axis as 'Temp ->' using plt.ylabel().
9. Initialize another data list:
o c = [4, 2, 6, 8, 3, 20, 13, 15]
10. Plot list c with a label '4th Rep'.
11. Get the current axis (ax = plt.gca()).
12. Modify graph boundary settings:
o Hide the right and top spines.
o Set bounds for the left spine (ax.spines['left'].set_bounds(-3, 40)).
13. Set x-axis ticks using plt.xticks(list(range(-3, 10))).
14. Set y-axis ticks using plt.yticks(list(range(-3, 20, 3))).
15. Add a legend to describe the plotted lines using ax.legend().
16. Annotate the graph with text 'Temperature V / s Days'.
17. Set the title of the graph as 'All Features Discussed'.
18. Display the plot using plt.show().
19. End
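A possible rendering of steps 2 to 17 is sketched below. The annotation position (the xy argument) is not given in the algorithm, so the value used here is an assumption; everything else follows the listed steps.

```python
import matplotlib.pyplot as plt

# Data lists from steps 3 and 9.
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
c = [4, 2, 6, 8, 3, 20, 13, 15]

plt.plot(a)                          # default line style
plt.plot(b, "or")                    # "o" = circle markers, "r" = red
plt.plot(list(range(0, 22, 3)))      # the values 0, 3, 6, ..., 21
plt.xlabel("Day ->")
plt.ylabel("Temp ->")
plt.plot(c, label="4th Rep")

# Adjust the axes: hide two spines and limit the extent of the left one.
ax = plt.gca()
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_bounds(-3, 40)

plt.xticks(list(range(-3, 10)))
plt.yticks(list(range(-3, 20, 3)))

# With no arguments, only lines that were given a label appear in the legend.
ax.legend()

# Assumption: the annotation position is not specified in the algorithm.
plt.annotate("Temperature V / s Days", xy=(1.01, -2.15))
plt.title("All Features Discussed")
```

Calling plt.show() afterwards (step 18) displays the figure.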
Experiment No: 3 FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY
# Python program to get average of a list
Algorithm for the Given Python Program
1. Start
2. Import the numpy module (import numpy as np).
3. Initialize a list of elements:
o list = [2, 40, 2, 502, 177, 7, 9]
4. Calculate the average of the list using np.average(list).
5. Print the calculated average.
6. End
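A minimal sketch of this algorithm follows. The list is named values here instead of list, an assumption made so that Python's built-in list type is not shadowed.

```python
import numpy as np

# Named `values` rather than `list` so the built-in list type stays usable.
values = [2, 40, 2, 502, 177, 7, 9]

# With no weights given, np.average returns the arithmetic mean.
average = np.average(values)
print("Average of the list =", average)
```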
# Python program to get variance of a list
Algorithm for Calculating Variance using NumPy
1. Start
2. Import the numpy module (import numpy as np).
3. Initialize a list of elements:
o list = [2, 4, 4, 4, 5, 5, 7, 9]
4. Calculate the variance of the list using np.var(list).
5. Print the calculated variance.
6. End
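The variance steps can be sketched as below. By default np.var divides by N (population variance); the ddof=1 line is an addition, not part of the algorithm, that shows the sample variance for comparison.

```python
import numpy as np

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Default ddof=0: sum of squared deviations divided by N.
population_variance = np.var(values)

# ddof=1: sum of squared deviations divided by N - 1.
sample_variance = np.var(values, ddof=1)

print("Population variance =", population_variance)
print("Sample variance     =", sample_variance)
```

For this list the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4.0.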
# Python program to get standard deviation of a list
Algorithm for Calculating Standard Deviation using NumPy
1. Start
2. Import the numpy module (import numpy as np).
3. Initialize a list of elements:
o list = [290, 124, 127, 899]
4. Calculate the standard deviation of the list using np.std(list).
5. Print the calculated standard deviation.
6. End
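A short sketch of the standard deviation steps is given below. The final check against the square root of the variance is an addition for illustration, not part of the algorithm.

```python
import numpy as np

values = [290, 124, 127, 899]

# np.std uses ddof=0 by default, i.e. the population standard deviation.
std_dev = np.std(values)
print("Standard deviation =", std_dev)

# Cross-check: the standard deviation is the square root of the variance.
matches_variance = bool(np.isclose(std_dev, np.sqrt(np.var(values))))
print("Equals sqrt(variance):", matches_variance)
```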
Experiment No: 4 NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT
Algorithm for Plotting a Normal Curve using NumPy and Matplotlib
1. Start
2. Import the necessary libraries:
o matplotlib.pyplot as plt for plotting.
o numpy as np for numerical operations.
3. Initialize the mean (mu) and standard deviation (sigma):
o mu = 0.5
o sigma = 0.1
4. Generate 1000 random values from a normal distribution using
np.random.normal(mu, sigma, 1000), and store them in s.
5. Create a histogram using plt.hist(s, 20, density=True):
o s: Data points.
o 20: Number of bins.
o density=True: Normalize the histogram so that its total area is 1 (the older normed=True argument was removed in Matplotlib 3.1).
6. Store the histogram data in variables:
o count: Heights of the histogram bars.
o bins: Bin edges.
o ignored: Unused returned value.
7. End
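The histogram steps can be sketched as follows, using density=True in place of the removed normed argument. Two things here are assumptions beyond the listed steps: the fixed random seed, added only for repeatability, and the final overlay of the normal density curve.

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0.5, 0.1

np.random.seed(0)                      # assumption: fixed seed for repeatability
s = np.random.normal(mu, sigma, 1000)

# density=True scales the bar heights so that the histogram area equals 1.
count, bins, ignored = plt.hist(s, 20, density=True)

# Extension beyond the listed steps: overlay the normal probability density
# evaluated at the bin edges.
pdf = np.exp(-((bins - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
plt.plot(bins, pdf, linewidth=2, color="r")
```

Calling plt.show() afterwards displays the histogram with the curve drawn over it.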
# Correlation and scatter plots
Algorithm for Calculating Correlation using Pandas
1. Start
2. Import the necessary libraries:
o sklearn (though it's not used in this program).
o numpy (np) for numerical operations.
o matplotlib.pyplot (plt) for visualization (not used here).
o pandas (pd) for handling data.
3. Create a Pandas Series y with values [1, 2, 3, 4, 3, 5, 4].
4. Create a Pandas Series x with values [1, 2, 3, 4, 5, 6, 7].
5. Calculate the correlation between x and y using y.corr(x).
6. Store the correlation result in the variable correlation.
7. End
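A minimal sketch of the correlation steps is shown below. Only pandas is imported, since the algorithm itself notes that the other libraries are not used.

```python
import pandas as pd

y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])

# Series.corr computes the Pearson correlation coefficient by default.
correlation = y.corr(x)
print("Correlation =", correlation)
```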
# Correlation coefficient
Algorithm for Calculating Correlation Coefficient Manually
Step 1: Start
Step 2: Import Required Library
Import the math module to perform mathematical operations.
Step 3: Define the Function correlationCoefficient(X, Y, n)
Initialize variables:
o sum_X = 0 → Stores the sum of elements in X.
o sum_Y = 0 → Stores the sum of elements in Y.
o sum_XY = 0 → Stores the sum of products of X[i] and Y[i].
o squareSum_X = 0 → Stores the sum of squares of elements in X.
o squareSum_Y = 0 → Stores the sum of squares of elements in Y.
o i = 0 → Iterator variable.
Step 4: Compute Required Sums using a Loop
While i < n (loop through all elements in X and Y):
o Add X[i] to sum_X.
o Add Y[i] to sum_Y.
o Compute X[i] * Y[i] and add to sum_XY.
o Compute X[i] * X[i] and add to squareSum_X.
o Compute Y[i] * Y[i] and add to squareSum_Y.
o Increment i by 1.
Step 5: Calculate Correlation Coefficient using the Formula
$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{(n \sum X^2 - (\sum X)^2) \times (n \sum Y^2 - (\sum Y)^2)}}$
Compute the numerator: $n \times \text{sum\_XY} - \text{sum\_X} \times \text{sum\_Y}$
Compute the denominator: $\sqrt{(n \times \text{squareSum\_X} - \text{sum\_X}^2) \times (n \times \text{squareSum\_Y} - \text{sum\_Y}^2)}$
Compute corr as the fraction of the numerator and denominator.
Return the correlation coefficient.
Step 6: Initialize Input Data
Create lists X = [15, 18, 21, 24, 27] and Y = [25, 25, 27, 31, 32].
Compute n = len(X) to find the number of elements.
Step 7: Call the Function and Print the Result
Call correlationCoefficient(X, Y, n) and print the result formatted to six
decimal places.
Step 8: End
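Steps 2 to 7 can be sketched as the program below. It keeps the variable names from the algorithm; the name result for the returned value is an assumption.

```python
import math


def correlationCoefficient(X, Y, n):
    """Return the Pearson correlation coefficient of the first n pairs."""
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0

    i = 0
    while i < n:
        sum_X += X[i]
        sum_Y += Y[i]
        sum_XY += X[i] * Y[i]
        squareSum_X += X[i] * X[i]
        squareSum_Y += Y[i] * Y[i]
        i += 1

    numerator = n * sum_XY - sum_X * sum_Y
    denominator = math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y)
    )
    return numerator / denominator


X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
n = len(X)

result = correlationCoefficient(X, Y, n)
print("{0:.6f}".format(result))   # prints 0.953463
```

For this data the numerator is 5 × 3000 − 105 × 140 = 300 and the denominator is the square root of 450 × 220 = 99000, which gives the printed value.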
Experiment No: 5 REGRESSION
Algorithm for Simple Linear Regression using NumPy & Matplotlib
Step 1: Start
Step 2: Import Required Libraries
Import numpy as np for numerical operations.
Import matplotlib.pyplot as plt for visualization.
Step 3: Define the Function estimate_coef(x, y)
Input: Arrays x and y, representing the independent and dependent variables.
Compute:
1. Find the number of observations n = np.size(x).
2. Compute the mean of x → m_x = np.mean(x).
3. Compute the mean of y → m_y = np.mean(y).
4. Compute cross-deviation: $SS_{xy} = \sum(y \times x) - n \times m_y \times m_x$
5. Compute deviation of x: $SS_{xx} = \sum(x \times x) - n \times m_x^2$
6. Compute regression coefficients:
Slope: $b_1 = SS_{xy} / SS_{xx}$
Intercept: $b_0 = m_y - b_1 \times m_x$
Return (b_0, b_1).
Step 4: Define the Function plot_regression_line(x, y, b)
Input: Arrays x, y, and the regression coefficients b.
Plot Data Points:
o Use plt.scatter(x, y) to plot actual data points.
Compute Predicted Values: $y_{pred} = b_0 + b_1 \times x$
Plot the Regression Line:
o Use plt.plot(x, y_pred, color='g') to plot the best-fit line.
Label Axes:
o plt.xlabel('x')
o plt.ylabel('y')
Show the Plot:
o plt.show().
Step 5: Define main() Function
Create Data Arrays:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
Call estimate_coef(x, y) to compute regression coefficients.
Print Estimated Coefficients.
Call plot_regression_line(x, y, b) to visualize the regression line.
Step 6: Run the Program
Check if the script is executed directly:
if __name__ == "__main__":
    main()
Step 7: End
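The coefficient estimation in Step 3 can be sketched as below. The plotting function is left out to keep the example short, and the comparison with np.polyfit is an added cross-check rather than part of the algorithm.

```python
import numpy as np


def estimate_coef(x, y):
    """Return (b_0, b_1) for the least-squares line y = b_0 + b_1 * x."""
    n = np.size(x)
    m_x = np.mean(x)
    m_y = np.mean(y)

    # Cross-deviation and deviation about x, as defined in Step 3.
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)


x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

b = estimate_coef(x, y)
print("Estimated coefficients: b_0 = {}, b_1 = {}".format(b[0], b[1]))

# Cross-check (not part of the algorithm): for degree 1,
# np.polyfit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
print("np.polyfit: intercept = {}, slope = {}".format(intercept, slope))
```

For this data SS_xy = 96.5 and SS_xx = 82.5, so the slope is about 1.17 and the intercept about 1.24.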
Experiment No: 6 Z-TEST
Algorithm for Z-Test for Hypothesis Testing
Step 1: Start
Step 2: Import Required Libraries
Import math for mathematical operations.
Import numpy as np for numerical operations.
Import randn from numpy.random to generate random numbers.
Import ztest from statsmodels.stats.weightstats for hypothesis testing.
Step 3: Define Parameters for Data Generation
Set the mean IQ score: mean_iq = 110.
Compute standard deviation for sample mean: $sd_{iq} = \frac{15}{\sqrt{50}}$
Set significance level (alpha): alpha = 0.05.
Define null hypothesis mean: null_mean = 100.
Step 4: Generate Random Data
Generate a random sample of 50 numbers from a normal distribution with the given
mean and standard deviation:
data = sd_iq * randn(50) + mean_iq
Step 5: Print Sample Statistics
Calculate and print sample mean and sample standard deviation using:
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
Step 6: Perform Z-Test
Call the ztest() function with parameters:
o data: data
o null hypothesis mean: value = null_mean
o alternative hypothesis: 'larger' (checks if the mean is significantly
greater).
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
Step 7: Compare p-value with Alpha
If p_value < alpha: Reject the Null Hypothesis, meaning the sample mean is
significantly greater.
Otherwise, Fail to Reject the Null Hypothesis, meaning there's not enough evidence
to support the claim.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")
Step 8: End
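To make the test statistic explicit, the sketch below computes a one-sided, one-sample z-test by hand instead of calling ztest() from statsmodels. It is an illustration under stated assumptions: the random seed is fixed for repeatability, the sample standard deviation is taken with ddof=1, and the names z_score, sample_mean and sample_sd are introduced here.

```python
import math
import numpy as np

mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100

np.random.seed(0)                       # assumption: fixed seed for repeatability
data = sd_iq * np.random.randn(50) + mean_iq

n = len(data)
sample_mean = np.mean(data)
sample_sd = np.std(data, ddof=1)        # sample standard deviation

# z = (sample mean - hypothesised mean) / standard error of the mean
z_score = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))

# One-sided ("larger") p-value: P(Z > z) for a standard normal Z,
# written with the complementary error function.
p_value = 0.5 * math.erfc(z_score / math.sqrt(2))

print("z = %.2f, p = %.4g" % (z_score, p_value))
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")
```

Because the generated data are centred near 110 with a small spread, the z-score is large and the null hypothesis of a mean of 100 is rejected.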
Experiment No: 7 T-TEST
Algorithm for Two-Sample T-Test
Step 1: Start
Step 2: Import Required Libraries
Import numpy as np for numerical operations.
Import stats from scipy for statistical calculations.
Step 3: Define Parameters for Data Generation
Set sample size: N = 10.
Generate two Gaussian-distributed samples:
o Sample x with mean = 2 and variance = 1.
o Sample y with mean = 0 and variance = 1.
x = np.random.randn(N) + 2
y = np.random.randn(N)
Step 4: Calculate Standard Deviation
Compute sample variance for both x and y using the formula: $\text{variance} = \frac{\sum (X_i - \bar{X})^2}{N - 1}$
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
Compute pooled standard deviation: $SD = \sqrt{\frac{\text{var}_x + \text{var}_y}{2}}$
SD = np.sqrt((var_x + var_y) / 2)
Print standard deviation.
Step 5: Calculate T-Statistic
Compute T-value using the formula: $t = \frac{\bar{x} - \bar{y}}{SD \times \sqrt{2/N}}$
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
Print T-Statistic.
Step 6: Compute p-Value
Compute Degrees of Freedom (dof): $dof = 2N - 2$
Compute one-tailed p-value using the cumulative distribution function (CDF) of the
t-distribution:
pval = 1 - stats.t.cdf(tval, df=dof)
Convert to two-tailed p-value:
pval = 2 * pval
Print T-value and p-value.
Step 7: Cross-Check Using SciPy’s Built-in Function
Use stats.ttest_ind() to validate results:
tval2, pval2 = stats.ttest_ind(x, y)
Print T-value and p-value from SciPy function.
Step 8: Compare p-value with Significance Level (α = 0.05)
If pval < 0.05: Reject Null Hypothesis → The means are significantly different.
Else: Fail to Reject Null Hypothesis → No significant difference between means.
if pval < 0.05:
    print("Reject Null Hypothesis: Significant Difference")
else:
    print("Fail to Reject Null Hypothesis: No Significant Difference")
Step 9: End
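The fragments above can be assembled into one runnable sketch, shown below. Two details are assumptions rather than part of the algorithm: the random seed is fixed for repeatability, and the two-tailed p-value is written as 2 * stats.t.sf(abs(tval), df=dof), which equals the 2 * (1 - cdf) form when tval is positive but also stays valid if tval turns out negative.

```python
import numpy as np
from scipy import stats

N = 10

np.random.seed(12345)                   # assumption: fixed seed for repeatability
x = np.random.randn(N) + 2              # mean 2, variance 1
y = np.random.randn(N)                  # mean 0, variance 1

# Sample variances (ddof=1) and pooled standard deviation for equal group sizes.
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
SD = np.sqrt((var_x + var_y) / 2)
print("Standard deviation =", SD)

# t statistic and degrees of freedom.
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
dof = 2 * N - 2

# Two-tailed p-value from the survival function of the t-distribution.
pval = 2 * stats.t.sf(abs(tval), df=dof)
print("Manual: t =", tval, " p =", pval)

# Cross-check with SciPy's two-sample t-test (equal variances by default).
tval2, pval2 = stats.ttest_ind(x, y)
print("SciPy:  t =", tval2, " p =", pval2)

if pval < 0.05:
    print("Reject Null Hypothesis: Significant Difference")
else:
    print("Fail to Reject Null Hypothesis: No Significant Difference")
```

The manual and SciPy results should agree up to floating-point rounding, because with equal sample sizes the pooled variance reduces to the average of the two sample variances.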
Experiment No: 8 ANOVA
Algorithm for ANOVA Test in R
Step 1: Install and Load Required Package
Install the dplyr package (if not already installed).
Load the dplyr package using the library() function.
install.packages("dplyr")
library(dplyr)
Step 2: Load and Visualize Data
Use the mtcars dataset (built into R).
Create a boxplot to compare the disp (displacement) across different gear groups.
boxplot(mtcars$disp ~ factor(mtcars$gear),
xlab = "Gear", ylab = "Displacement")
Step 3: Define Hypotheses
Null Hypothesis (H₀): The mean displacement is the same for all gear groups.
Alternative Hypothesis (H₁): At least one group has a different mean displacement.
Step 4: Perform ANOVA Test
Use the aov() function to perform the Analysis of Variance (ANOVA) test.
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
Step 5: Interpret the Results
The ANOVA test provides an F-statistic and a p-value.
If p-value < 0.05, reject the null hypothesis → Significant difference exists.
If p-value ≥ 0.05, fail to reject the null hypothesis → No significant difference.
Step 6: End
Experiment No: 9 BUILDING AND VALIDATING LINEAR MODELS
Algorithm for Loading and Exploring the Boston Housing Dataset
Step 1: Import Required Libraries
Import pandas, numpy, matplotlib.pyplot, and seaborn for data handling and
visualization.
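Import load_boston from sklearn.datasets to load the Boston housing dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this step requires an older scikit-learn version.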
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
Step 2: Set Visualization Styles
Configure Seaborn for better plotting.
Customize Matplotlib figure size and resolution.
sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8,5)
plt.rcParams['figure.dpi'] = 150
Step 3: Load the Boston Housing Dataset
Load the dataset using load_boston().
Store the dataset in a variable.
boston = load_boston()
Step 4: Display Dataset Keys
Print the available keys in the dataset.
print(boston.keys())
Step 5: Print Dataset Description
Print the dataset’s description (DESCR) to understand its contents.
print(boston.DESCR)
Step 6: End
Experiment No: 10 BUILDING AND VALIDATING LOGISTIC MODELS
Algorithm for Logistic Regression Model using StatsModels
Step 1: Import Required Libraries
Import statsmodels.api for logistic regression.
Import pandas for data handling.
import statsmodels.api as sm
import pandas as pd
Step 2: Load the Training Dataset
Read the dataset from a CSV file using pandas.
Set the first column as the index.
df = pd.read_csv('logit_train1.csv', index_col=0)
Step 3: Define Independent and Dependent Variables
Select independent variables: gmat, gpa, work_experience.
Select dependent variable: admitted.
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
Step 4: Build and Train Logistic Regression Model
Create a logistic regression model using sm.Logit().
Fit the model to the training data.
log_reg = sm.Logit(ytrain, Xtrain).fit()
Step 5: End
Algorithm for Testing Logistic Regression Model
Step 1: Load the Testing Dataset
Read the dataset from a CSV file using pandas.
Set the first column as the index.
df = pd.read_csv('logit_test1.csv', index_col=0)
Step 2: Define Independent and Dependent Variables
Select independent variables: gmat, gpa, work_experience.
Select dependent variable: admitted.
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
Step 3: Perform Predictions on the Test Dataset
Use the trained logistic regression model (log_reg) to make predictions on Xtest.
Apply the round function to convert probabilities into class labels (0 or 1).
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
Step 4: Compare Actual and Predicted Values
Print the actual values of admitted.
Print the predicted values.
print('Actual values:', list(ytest.values))
print('Predictions :', prediction)
Step 5: End
Algorithm for Evaluating Logistic Regression Model
Step 1: Import Required Libraries
Import confusion_matrix and accuracy_score from sklearn.metrics.
from sklearn.metrics import confusion_matrix, accuracy_score
Step 2: Compute the Confusion Matrix
Use the confusion_matrix() function with actual (ytest) and predicted
(prediction) values.
cm = confusion_matrix(ytest, prediction)
Print the confusion matrix.
print("Confusion Matrix : \n", cm)
Step 3: Calculate Accuracy Score
Use accuracy_score() to compute the model's accuracy.
accuracy = accuracy_score(ytest, prediction)
Print the accuracy score.
print('Test accuracy = ', accuracy)
Step 4: End
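To show what the confusion matrix and accuracy score measure, the sketch below computes both by hand with NumPy rather than with sklearn.metrics. The labels and probabilities are hypothetical values made up for illustration; they are not taken from logit_test1.csv. The 0.5 threshold is also an assumption: it stands in for the round() call used in the testing step, and differs from it only for a probability of exactly 0.5, which Python's round() maps to 0.

```python
import numpy as np

# Hypothetical values for illustration only; not from logit_test1.csv.
ytest = np.array([1, 0, 1, 1, 0, 0])               # actual classes
yhat = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1])    # predicted probabilities

# Assumption: classify as 1 when the predicted probability is at least 0.5.
prediction = (yhat >= 0.5).astype(int)

# Confusion matrix with rows = actual class and columns = predicted class,
# the same layout that sklearn.metrics.confusion_matrix uses.
cm = np.zeros((2, 2), dtype=int)
for actual, predicted in zip(ytest, prediction):
    cm[actual, predicted] += 1

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy = np.trace(cm) / cm.sum()

print("Confusion Matrix : \n", cm)
print("Test accuracy = ", accuracy)
```

With these made-up values, four of the six cases fall on the diagonal of the matrix, so the accuracy is 4 / 6.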
Experiment No: 11 TIME SERIES ANALYSIS