Python Codes Test 2
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.formula.api as smf
"""
Example:
Student's Roll Number = 012345
my_df = df.sample(n=2345, random_state=12345, ignore_index=True)
"""
A1. Identify the Nominal, Categorical, Ordinal & Continuous Variables in the Data.
# A2.
# Select non-categorical variables
non_categorical_vars = ['Age', 'Yearly_Average_Salary',
'Yearly_Average_Balance', 'Credit_Score', 'Tenure_Years', 'Customer_Id']
print(my_df[non_categorical_vars].describe())
Here, my_df[non_categorical_vars] selects only the columns corresponding to non-categorical variables from the DataFrame my_df.
The describe() function is then applied to this subset of the DataFrame.
The describe() function provides summary statistics of the selected variables, including count, mean, standard deviation, minimum, 25th
percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum.
# A3.
# Select two categorical variables
cat_var1 = 'Country'
cat_var2 = 'Gender'
print(my_df[cat_var1].value_counts())
print(my_df[cat_var2].value_counts())
These lines print the frequency distributions for the two categorical variables. The value_counts() function counts the
occurrences of each unique value in a column (for example, 'Country'), providing a frequency distribution.
B1. Calculate Maximum Value of the above Continuous Variable lying at the Bottom 10%.
# B1.
# Selecting the continuous variable 'Age'
continuous_variable = 'Age'
bottom_10_percent_value = my_df[continuous_variable].quantile(0.1)
print(f"Maximum value at the bottom 10% of {continuous_variable}: {bottom_10_percent_value}")
B1
The inference from the calculated maximum value at the bottom 10% of 'Age' (which is 28.0) is that within the lowest
10% of ages in your dataset, the oldest individual is 28 years old. This provides a specific data point that describes the
upper boundary of the age distribution for the bottom decile (10%) of your dataset.
B2. Calculate Minimum Value of the above Continuous Variable lying at the Top 10%.
# B2.
# Calculate Minimum Value of the Continuous Variable lying at the Top 10%
top_10_percent_value = my_df[continuous_variable].quantile(0.9)
print(f"Minimum value at the top 10% of {continuous_variable}: {top_10_percent_value}")
The quantile(0.9) function is used to calculate the value below which 90% of the data falls. In other words, it calculates the 90th percentile
of the 'Age' variable. This value represents the boundary above which the top 10% of the data points lie.
B2
The output indicates that the minimum value at the top 10% of the continuous variable 'Age' in your DataFrame (my_df) is 53.0. This
means that among the highest 10% of ages in your dataset, the youngest individual is 53 years old.
C1. Taking a Categorical Variable having 02 Categories and a Continuous Variable, Conduct a T-Test.
# C1.
# Extract 'Age' values for the two categories of 'Gender'
category_1 = my_df[my_df['Gender'] == 'Male']['Age']
category_2 = my_df[my_df['Gender'] == 'Female']['Age']
# Performing an independent two-sample t-test
t_statistic, p_value = sps.ttest_ind(category_1, category_2)
print(f"T-Test results - t-statistic: {t_statistic}, p-value: {p_value}")
These lines extract the 'Age' values for two categories of the 'Gender' variable. category_1 contains ages for males, and category_2
contains ages for females.
The ttest_ind function from the scipy.stats module is used to perform an independent two-sample t-test. It calculates the t-statistic and
the p-value associated with the test.
C1
The t-test results indicate that there is a statistically significant difference between the two groups (Male and Female) based on the
continuous variable 'Age'.
1. T-Statistic: The t-statistic is approximately -2.62. This value represents the difference between the means of the two groups in terms
of the number of standard deviations. A negative t-statistic suggests that, on average, the 'Male' group has a lower age than the
'Female' group.
2. P-Value: The p-value is approximately 0.0088. This p-value is below the commonly used significance level of 0.05. The p-value
represents the probability of observing such extreme results (or more extreme) under the assumption that the null hypothesis is true. In
this context, a p-value below 0.05 suggests that the observed difference in age between the 'Male' and 'Female' groups is unlikely to
have occurred by random chance alone. Therefore, you may reject the null hypothesis and conclude that there is a statistically
significant difference in age between the two gender groups.
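To check the direction of this difference directly, one can compare the group means; a minimal sketch, assuming my_df as above:
# Mean age per gender group (the sign of the t-statistic follows from these means)
print(my_df.groupby('Gender')['Age'].mean())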
C2. Taking a Categorical Variable having 03 or more Categories and a Continuous Variable, Conduct an ANOVA.
# C2.
import statsmodels.api as sm
from statsmodels.formula.api import ols
anova_model = ols('Age ~ C(Country)', data=my_df).fit()
anova_results = sm.stats.anova_lm(anova_model, typ=2)
print(anova_results)
This line specifies a linear model using Ordinary Least Squares (OLS) regression to predict the 'Age' variable based on the 'Country' variable.
The .fit() method fits the model to the data.
The sm.stats.anova_lm function is used to perform the ANOVA. The typ=2 argument specifies that Type II sums of squares are used for the ANOVA table.
Finally, the code prints the ANOVA results, which include various statistics such as the sum of squares, degrees of freedom, mean squares,
F-statistic, and p-value.
C2
The ANOVA results suggest that there is a statistically significant difference in the mean age across the categories of the 'Country'
variable. Here's a breakdown of the key components of the ANOVA table:
1. sum_sq (Sum of Squares): This column represents the sum of squared deviations from the mean. For 'Country', the sum of squares is
2095.841380.
2. df (Degrees of Freedom): The degrees of freedom associated with the 'Country' variable is 2. This is the number of categories in
'Country' minus 1.
3. F (F-statistic): The F-statistic is a ratio of the variance between group means to the variance within groups. In this case, the F-statistic
is 9.296834.
4. PR(>F) (p-value): The p-value associated with the F-statistic is 0.000093, which is much smaller than the commonly used significance
level of 0.05. This suggests that the difference in mean age across the categories of 'Country' is unlikely to be due to random chance
alone. Therefore, you may reject the null hypothesis and conclude that there is a statistically significant difference in the mean age
across different countries.
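C3. The question text and code for this part are not shown above; a minimal sketch consistent with the explanation below, assuming my_df and the two columns named there:
# C3.
correlation = my_df['Yearly_Average_Salary'].corr(my_df['Yearly_Average_Balance'])
print(f"Correlation between Yearly_Average_Salary and Yearly_Average_Balance: {correlation}")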
The .corr() method is used to calculate the correlation coefficient between 'Yearly_Average_Salary' and 'Yearly_Average_Balance'. The
result is stored in the variable correlation.
This line prints the computed correlation coefficient between the two variables.
A value of 1 indicates a perfect positive correlation (both variables increase or decrease together).
A value of -1 indicates a perfect negative correlation (one variable increases as the other decreases).
A value of 0 indicates no linear correlation.
The calculated correlation between 'Yearly_Average_Salary' and 'Yearly_Average_Balance' is approximately -0.0016. This correlation is
very close to zero, indicating a very weak or negligible linear relationship between the two continuous variables.
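A scatter plot makes this near-zero relationship visible; a minimal sketch using the seaborn and matplotlib imports at the top of the file:
# Visualize the (lack of) linear relationship between the two variables
sns.scatterplot(x='Yearly_Average_Salary', y='Yearly_Average_Balance', data=my_df)
plt.show()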
# C4.
# Assuming 'Credit_Score' is a continuous variable
statistic, p_value = sps.normaltest(my_df['Credit_Score'])
print(f"Normality Test results - statistic: {statistic}, p-value: {p_value}")
The normaltest function tests the null hypothesis that a sample comes from a normal distribution. It returns a test statistic and a p-value.
The statistic value is the test statistic, and the p_value is the probability of obtaining a statistic at least this extreme if the data really were normally distributed.
This line prints the results of the normality test, including the test statistic and the p-value.
Interpretation:
If the p-value is less than a chosen significance level (e.g., 0.05), you may reject the null hypothesis and conclude that the data is not
normally distributed.
If the p-value is greater than the significance level, you may fail to reject the null hypothesis, suggesting that there is no strong evidence
against the assumption of normality.
The normality test results indicate that the 'Credit_Score' variable does not follow a normal distribution. Here's an interpretation of the
results:
Statistic: The normality test statistic is 70.12. This statistic is used to assess how well the data follows a normal distribution. In this case,
the higher the statistic, the more evidence there is against normality.
P-Value: The p-value associated with the normality test is very close to zero (5.95e-16). A low p-value indicates strong evidence against
the null hypothesis that the data follows a normal distribution.
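A Q-Q plot is a common visual companion to this test; a minimal sketch, assuming the statsmodels and matplotlib imports above:
# Points far from the reference line indicate departures from normality
sm.qqplot(my_df['Credit_Score'], line='s')
plt.show()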
D1. Develop a Multiple Linear Regression Model (with at least 03 Continuous Variables as Input). Describe the Results. Predict the
Dependent Variable with artificial Inputs.
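Only the prediction step is shown below; a minimal sketch of the preceding steps, consistent with the explanation that follows (the random seed and the artificial input values are assumptions):
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
np.random.seed(42)  # assumed seed for reproducibility
df = pd.DataFrame({
'Dependent_Variable': np.random.rand(100),
'Continuous_Var1': np.random.randn(100),
'Continuous_Var2': np.random.randn(100),
'Continuous_Var3': np.random.randn(100),
})
mlr_model = ols('Dependent_Variable ~ Continuous_Var1 + Continuous_Var2 + Continuous_Var3', data=df).fit()
print(mlr_model.summary())
# Artificial inputs for prediction (values assumed for illustration)
new_data = pd.DataFrame({'Continuous_Var1': [1.5], 'Continuous_Var2': [0.8], 'Continuous_Var3': [-0.3]})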
mlr_prediction = mlr_model.predict(new_data)
print("\nMultiple Linear Regression Prediction:")
print(mlr_prediction)
This part creates a DataFrame 'df' with 100 rows, where 'Dependent_Variable' is a random variable, and 'Continuous_Var1',
'Continuous_Var2', and 'Continuous_Var3' are continuous variables with random values.
This line fits the MLR model using the ordinary least squares (OLS) method. The formula 'Dependent_Variable ~ Continuous_Var1 +
Continuous_Var2 + Continuous_Var3' specifies the relationship between the dependent variable and the three continuous predictor
variables.
This line prints a summary of the MLR model, including coefficients, standard errors, t-statistics, p-values, and other relevant statistics.
This part creates a new DataFrame 'new_data' with artificial values for the continuous predictor variables. The predict method is then used
to predict the dependent variable based on these artificial inputs.
D1
Model Fit: The R-squared value is low (0.025), indicating that the model does not explain a substantial proportion of the variability in
the dependent variable. It suggests that the current set of independent variables may not be strong predictors of the dependent
variable.
Variable Significance: The p-values for the individual coefficients suggest that none of the continuous variables ('Continuous_Var1,'
'Continuous_Var2,' 'Continuous_Var3') are statistically significant predictors of the dependent variable in this model.
Prediction: The predicted value for the new set of input values is provided, but the overall model's predictive power seems limited
based on the low R-squared value.
D2. Develop a Multiple Linear Regression Model (with at least 02 Continuous Variables & 01 Categorical Variable as Input). Describe the Results. Predict the Dependent Variable with artificial Inputs.
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np
# Generate an example dataset with continuous and categorical variables
np.random.seed(42)
df = pd.DataFrame({
'Dependent_Variable': np.random.rand(100),
'Continuous_Var1': np.random.randn(100),
'Continuous_Var2': np.random.randn(100),
'Categorical_Var': np.random.choice(['A', 'B', 'C'], size=100),
})
# Model with at least two continuous variables and one categorical variable as input
mlr_model = ols('Dependent_Variable ~ Continuous_Var1 + Continuous_Var2 + C(Categorical_Var)', data=df).fit()
print(mlr_model.summary())
new_data = pd.DataFrame({
'Continuous_Var1': [1.5],
'Continuous_Var2': [0.8],
'Categorical_Var': ['B']
})
# With the formula interface, C(Categorical_Var) is dummy-coded automatically when predicting
mlr_prediction = mlr_model.predict(new_data)
print(mlr_prediction)
The code uses NumPy to generate random data for a dependent variable ('Dependent_Variable'), two continuous independent variables
('Continuous_Var1' and 'Continuous_Var2'), and a categorical variable ('Categorical_Var') with three categories (A, B, C).
The ols function from the statsmodels.formula.api module is used to specify and fit the MLR model.
The formula 'Dependent_Variable ~ Continuous_Var1 + Continuous_Var2 + C(Categorical_Var)' expresses the relationship
between the dependent variable and the independent variables. The C() notation indicates that 'Categorical_Var' is a
categorical variable.
The .fit() method fits the model to the provided dataset.
The summary() method is called on the fitted model to display detailed statistics and information about the regression results.
The summary includes coefficients, standard errors, t-statistics, p-values, R-squared, and other relevant statistics.
A new DataFrame (new_data) is created with artificial values for the continuous and categorical predictor variables.
Because the model was specified through the formula interface with C(Categorical_Var), the dummy coding of the categorical variable is
handled automatically when predict is called; one category is dropped as the reference level to avoid multicollinearity. The manual
equivalent would be pd.get_dummies() with prefix='Categorical_Var' and drop_first=True.
The predict method is used to make predictions for the dependent variable based on the artificial input values.
The predicted values for the dependent variable are printed based on the artificial input values.
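For reference, the manual encoding mentioned above would look like this; a sketch assuming the same df, not needed when the formula interface is used:
# Manual dummy coding of the categorical variable (reference level dropped)
dummies = pd.get_dummies(df['Categorical_Var'], prefix='Categorical_Var', drop_first=True)
X = pd.concat([df[['Continuous_Var1', 'Continuous_Var2']], dummies], axis=1)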
Interpretation
The output from the Multiple Linear Regression (MLR) model summary provides various statistics and information that can be interpreted
to understand the relationships between the variables in the model. Here's an interpretation of some key parts of the output:
1. **Coefficients:**
- **Intercept (0.5010):** The intercept represents the estimated value of the dependent variable when all independent variables are
zero.
- **Categorical_Var[T.B] (-0.0653):** This is the change in the dependent variable associated with being in category B compared to the
reference category (assumed to be A for dummy variable coding). In this case, it suggests a decrease of 0.0653 in the dependent variable
when 'Categorical_Var' is B.
- **Categorical_Var[T.C] (-0.0500):** Similar interpretation for category C compared to the reference category.
2. **Continuous Variables:**
- **Continuous_Var1 (-0.0510):** A one-unit increase in 'Continuous_Var1' is associated with a decrease of 0.0510 in the dependent
variable.
- **Continuous_Var2 (0.0041):** A one-unit increase in 'Continuous_Var2' is associated with an increase of 0.0041 in the dependent
variable.
3. **P-values:**
- **P>|t|:** These p-values assess the statistical significance of each coefficient. A p-value less than the chosen significance level (e.g.,
0.05) indicates that the variable is statistically significant.
4. **R-squared (0.031):**
- R-squared measures the proportion of the variance in the dependent variable explained by the model. In this case, only 3.1% of the
variance in the dependent variable is explained by the model.
5. **F-statistic (0.7722):**
- The F-statistic tests the overall significance of the model. In this case, the value is relatively low, suggesting that the model may not be statistically significant.
6. **Prob (F-statistic):**
- The p-value associated with the F-statistic tests the null hypothesis that all coefficients in the model are equal to zero. A low p-value indicates that at least one coefficient is significantly different from zero.
7. **Omnibus and Jarque-Bera tests:**
- These tests assess the normality of the residuals. Low p-values may indicate departures from normality.
8. **Durbin-Watson (1.982):**
- The Durbin-Watson statistic tests for autocorrelation in the residuals. A value around 2 suggests no significant autocorrelation.
9. **Notes:**
It's important to carefully interpret each coefficient, considering its statistical significance, and keep in mind that statistical significance
does not imply practical significance. The low R-squared and F-statistic suggest that the model may not be a good fit for the data. Further
model refinement or additional variables may be necessary for better predictions.
D3. Develop a Logistic Regression Model (with at least 02 Continuous Variables as Input & 01 Categorical Variable as Output). Describe
the Results. Predict the Dependent Variable with artificial Inputs.
To develop a Logistic Regression model with at least two continuous variables as input and one categorical variable as output, you can use
the `statsmodels` library in Python. Below is an example code that demonstrates how to create a Logistic Regression model, describe the
results, and predict the dependent variable with artificial inputs:
```python
import statsmodels.api as sm
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'Continuous_Var1': np.random.randn(100),
    'Continuous_Var2': np.random.randn(100),
})
# Binary output variable (0 or 1); the name 'Binary_Outcome' is assumed for illustration
df['Binary_Outcome'] = np.random.randint(0, 2, size=100)

# Specify and fit the Logistic Regression model (a constant is added for the intercept)
X = sm.add_constant(df[['Continuous_Var1', 'Continuous_Var2']])
logit_model = sm.Logit(df['Binary_Outcome'], X).fit()
print(logit_model.summary())

# Predict with artificial inputs
new_data = pd.DataFrame({
    'Continuous_Var1': [1.5],
    'Continuous_Var2': [0.8],
})
new_data = sm.add_constant(new_data, has_constant='add')
logit_prediction = logit_model.predict(new_data)
print(logit_prediction)
```
- The `Logit` class from `statsmodels.api` is used to specify and fit the Logistic Regression model.
- `summary()` is called on the fitted model to display detailed statistics and information about the logistic regression results.
- A new DataFrame (`new_data`) is created with artificial values for the continuous variables.
- The predicted probabilities for the dependent variable are printed based on the artificial input values.
Note: Ensure that you have the required libraries installed (`statsmodels`, `pandas`, `numpy`) before running the code. Also, in logistic
regression, the dependent variable should be binary (0 or 1). Adjust the dataset and variable names as needed for your specific case.
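If a hard class label is needed rather than a probability, a common convention (an assumption here, not part of the original code) is to threshold the predicted probability at 0.5:

```python
# Convert the predicted probability into a 0/1 class label
predicted_class = (logit_prediction >= 0.5).astype(int)
print(predicted_class)
```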
Interpretation
The output of the Logistic Regression model summary provides information about the coefficients, odds ratios, and statistical significance.
Here's an interpretation of some key parts of the output:
1. **Coefficients:**
- **Intercept:** Represents the log-odds of the dependent variable being 1 when all independent variables are zero.
- **Continuous_Var1 and Continuous_Var2:** Represent the change in the log-odds of the dependent variable associated with a one-
unit increase in each respective continuous variable.
2. **Odds Ratios:**
- **Odds Ratio for Continuous_Var1:** It represents the multiplicative change in odds for a one-unit increase in Continuous_Var1.
- **Odds Ratio for Continuous_Var2:** It represents the multiplicative change in odds for a one-unit increase in Continuous_Var2.
3. **P-values:**
- Test the null hypothesis that each coefficient is equal to zero (no effect).
- Low p-values suggest that the corresponding independent variable is statistically significant.
4. **Log-Likelihood, AIC, and BIC:**
- Log-Likelihood measures how well the model explains the observed data.
- AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are measures of model fit, considering complexity.
5. **Pseudo R-squared:**
- Provides a measure of model fit. In logistic regression, pseudo R-squared values are used as they are different from traditional R-
squared used in linear regression.
(Logit model summary table omitted here; it reported the coefficient, standard error, z-value, and p-value for the intercept, Continuous_Var1, and Continuous_Var2.)
Interpretation:
- The intercept (-0.0216) represents the log-odds of the dependent variable being 1 when both continuous variables are zero.
- The coefficient for Continuous_Var1 (-0.4842) indicates that a one-unit increase in Continuous_Var1 is associated with a decrease in the
log-odds of the dependent variable being 1. The odds ratio can be obtained by exponentiating this coefficient.
- The coefficient for Continuous_Var2 (0.2923) indicates that a one-unit increase in Continuous_Var2 is associated with an increase in the
log-odds of the dependent variable being 1. The odds ratio can be obtained by exponentiating this coefficient.
- Pseudo R-squared is approximately 0.055, suggesting a limited ability of the model to explain the variation in the dependent variable.
- The Wald test (z-values) tests the significance of each coefficient. For example, the p-value for Continuous_Var1 is 0.080, suggesting it
may not be statistically significant at a conventional significance level (e.g., 0.05).
Remember that the interpretation of coefficients and odds ratios depends on the context of your specific problem and the nature of the
variables involved.
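As noted above, odds ratios are obtained by exponentiating the fitted coefficients; a minimal sketch, assuming the logit_model fitted in the code above:

```python
import numpy as np

# Exponentiate the log-odds coefficients to obtain odds ratios
odds_ratios = np.exp(logit_model.params)
print(odds_ratios)
```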