Lab 1 - Python - Excel
Lab 1 - Python - Excel
Lớp:CNTTK20P
MãSV: DTC21H4802010537
Access to Safe drinking water is eassently a global issue. The World Health Organization
(WHO) estimates that half of all people in the world are affected by the lack of safe drinking
water. With this assesment, we will explore the data and look for patterns in the data to analyze if
the given data is a good indicator of safe drinking water.
In [20]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
#sns.set_context('notebook')
Data Set
The dataset is downloaded from kaggle.com and is available for download at:
https://www.kaggle.com/adityakadiwal/water-potability
EDA - Exploratory Data Analysis
In this section we will explore the data and look for patterns in the data to analyze if the given
data is a good indicator of safe drinking water.
In [4]:
# Print the first 5 rows of the dataframe
display(df.head())
3.71
129.4 18630. 6.6352 592.88 15.1800 56.32907 4.50
1 608 NaN 0
22921 057858 46 5359 13 6 0656
0
8.09
224.2 19909. 9.2758 418.60 16.8686 66.42009 3.05
2 912 NaN 0
36259 541732 84 6213 37 3 5934
4
8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6
9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3
Hardness - Hardness is a measure of the physical properties of the water. It is a measure of the
ability of the water to support the roots and the leaves. The lower the hardness, the more support
the roots and leaves can have.
Solids (Total dissolved solids - TDS) - TDS is a measure of the solids in the water. The water
with high TDS value indicates that water is highly mineralized. Desirable limit for TDS is 500
mg/l and maximum limit is 1000 mg/l which prescribed for drinking purpose.
Chloramines - Chlorine and chloramine are the major disinfectants used in public water systems.
Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking
water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are
considered safe in drinking water.
Sulfate - Sulfate is a common disinfectant used in public water systems. Sulfate levels up to 2
milligrams per liter (mg/L or 2 parts per million (ppm)) are considered safe in drinking water.
Conductivity - Pure water is not a good conductor of electric current rather’s a good insulator.
Increase in ions concentration enhances the electrical conductivity of water. Generally, the
amount of dissolved solids in water determines the electrical conductivity. Electrical
conductivity (EC) actually measures the ionic process of a solution that enables it to transmit
current. According to WHO standards, EC value should not exceeded 400 μS/cm.
Organic_carbon - Total Organic Carbon (TOC) in source waters comes from decaying natural
organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of
carbon in organic compounds in pure water. According to US EPA < 2 mg/L as TOC in treated /
drinking water, and < 4 mg/Lit in source water which is use for treatment.
Trihalomethanes - THMs are chemicals which may be found in water treated with chlorine. The
concentration of THMs in drinking water varies according to the level of organic material in the
water, the amount of chlorine required to treat the water, and the temperature of the water that is
being treated. THM levels up to 80 ppm is considered safe in drinking water.
Turbidity - Turbidity is a measure of the water’s ability to absorb particulate matter. The lower
the turbidity, the more it can absorb particulate matter.
Potability (Target variable) - Indicates if water is safe for human consumption where 1 means
Potable and 0 means Not potable.
In [5]:
# datatypes of the columns
print(df.head())
ph float64
Hardness float64
Solids float64
Chloramines float64
Sulfate float64
Conductivity float64
Organic_carbon float64
Trihalomethanes float64
Turbidity float64
Potability int64
dtype: object
In [6]:
# Describe the data
print(df.describe())
Out[6]:
8768.
st 1.594 32.87 1.583 41.41 80.82 3.30816 16.1750 0.780 0.487
57082
d 320 9761 085 6840 4064 2 08 382 849
8
129.0
mi 0.000 47.43 320.9 0.352 181.4 2.20000 0.73800 1.450 0.000
0000
n 000 2000 42611 000 83754 0 0 000 000
0
In [7]:
# Check if there are any null columns
print(df.isnull().sum())
Out[7]:
ph 491
Hardness 0
Solids 0
Chloramines 0
Sulfate 781
Conductivity 0
Organic_carbon 0
Trihalomethanes 162
Turbidity 0
Potability 0
dtype: int64
In [8]:
# Lets try to plot misisng values
plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
plt.title('Number of Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()
Analyze ph column
In [9]:
# for ph column
# set the histogram, mean and median
#9 # Tính toán trung bình và trung vị của cột 'ph'
mean_ph = df['ph'].mean()
median_ph = df['ph'].median()
In [10]:
#10# Mô tả thống kê cơ bản của cột 'Sulphate'
sulfate_values = df['Sulfate'].dropna() # Drop NaN values for plotting
mean_sulfate = sulfate_values.mean()
median_sulfate = sulfate_values.median()
plt.figure(figsize=(10, 6))
sns.histplot(sulfate_values, bins=30, kde=False, color='skyblue', edgecolor='black')
In [11]:
#11# Mô tả thống kê cơ bản của cột 'Trihalomethanes'
print("Thống kê cơ bản của cột 'Trihalomethanes':")
print(df['Trihalomethanes'].describe())
In [13]:
# check outliers
# 13# Đọc dữ liệu vào filr, sau đó tính toán
# Tính toán các phân vị Q1 và Q3
Q1 = df['ph'].quantile(0.25)
Q3 = df['ph'].quantile(0.75)
In [14]:
# # Tính toán ma trận hệ số tương quan
correlation_matrix = df.corr()
Out[15]:
#15# Tính toán độ méo lệch của các biến số liên tục
skewness = df.skew()
Skew
There are no skew in our data :)
Lets see the distribution of Potability
In [31]:
#16(31)# Hiển thị phân phối của khả năng uống được
plt.figure(figsize=(10, 6))
sns.countplot(x='Potability', data=df, palette='viridis')
plt.title('Phân phối của khả năng uống được')
plt.xlabel('Khả năng uống được')
plt.ylabel('Số lượng')
plt.show()
Out[31]:
0 1998
1 1278
Name: Potability, dtype: int64
Feature Transformation
In [16]:
# print the dataframe head
# #16
print(df.head(5).to_string(index=False))
Out[16]:
7.08
204.8 20791. 7.3002 368.5 564.30 10.3797 86.99097 2.96
0 079 0
90455 318981 12 16441 8654 83 0 3135
5
3.71
129.4 18630. 6.6352 333.7 592.88 15.1800 56.32907 4.50
1 608 0
22921 057858 46 75777 5359 13 6 0656
0
8.09
224.2 19909. 9.2758 333.7 418.60 16.8686 66.42009 3.05
2 912 0
36259 541732 84 75777 6213 37 3 5934
4
8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6
9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3
In [17]:
# Feature transformation
# scale the numeric columns
# 17
import pandas as pd
from sklearn.preprocessing import StandardScaler
print(df.head())
In [18]:
# After transformation print the dataframe head
# #18
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
Out[18]:
- -
0.00 0.198 0.0854 1.04 1.2271 -
0 0.01 1.028759 0.935 0
0000 981 92 3542 78 0.854560
1702 210
- - - -
0.00 1.4734 0.514
1 2.11 1.696 0.19 0.2490 0.214093 -0.502884 0
0000 06 449
3014 382 6962 88
- - -
0.63 0.684 1.0795 0.00
2 0.08 0.0282 0.590024 0.001189 0.847 0
9503 850 58 0000
7287 51 715
-
0.77 0.437 0.09 0.4674 0.69 0.635
3 0.5050 0.939076 1.695662 0
6180 145 3483 46 4190 242
79
- - - - -
1.26 - 0.113
4 0.398 0.25 0.2936 0.71 0.2022 -1.718287 0
3161 0.592197 188
477 2771 90 0100 62
Hypothesis Testing
We define a hypothesis to test in our data set
Hypothesis 1:
Out[29]:
Ttest_indResult(statistic=-4.476932191647608, pvalue=7.705940306619221e-06)
the p value is less than 0.05 , so we are rejecting the null hypothesis at 5% significance level.
Next Step in analyzing the data
Write here
Quality of data
Write here
Key findings
Write here