0% found this document useful (0 votes)
10 views

Lab 1 - Python - Excel

Uploaded by

tfs23112003
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views

Lab 1 - Python - Excel

Uploaded by

tfs23112003
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 14

TRẦN THỊ LAN ANH

Lớp:CNTTK20P

MãSV: DTC21H4802010537

Access to Safe drinking water is eassently a global issue. The World Health Organization
(WHO) estimates that half of all people in the world are affected by the lack of safe drinking
water. With this assesment, we will explore the data and look for patterns in the data to analyze if
the given data is a good indicator of safe drinking water.

In [20]:
# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
#sns.set_context('notebook')
Data Set

The dataset is downloaded from kaggle.com and is available for download at:

https://www.kaggle.com/adityakadiwal/water-potability
EDA - Exploratory Data Analysis

In this section we will explore the data and look for patterns in the data to analyze if the given
data is a good indicator of safe drinking water.

1) Describe the data

2) Visualize the data

3) Identify the missing values and fill them

4) Identify the outliers and remove them

5) Identify the categorical variables and encode them (if any)

6) Identify the numerical variables and perform basic statistical analysis


In [2]:
#2 Đường dẫn tới file CSV
df = pd.read_csv(r'G:/python excel/water_potability.csv')
In [3]:
# Read the csv file from the url

In [4]:
# Print the first 5 rows of the dataframe
display(df.head())

Chlor Trihalo Tur Pota


Hard Sulfa Condu Organic
ph Solids amine methane bidit bilit
ness te ctivity _carbon
s s y y

204.8 20791. 7.3002 368.5 564.30 10.3797 86.99097 2.96


0 NaN 0
90455 318981 12 16441 8654 83 0 3135

3.71
129.4 18630. 6.6352 592.88 15.1800 56.32907 4.50
1 608 NaN 0
22921 057858 46 5359 13 6 0656
0

8.09
224.2 19909. 9.2758 418.60 16.8686 66.42009 3.05
2 912 NaN 0
36259 541732 84 6213 37 3 5934
4

8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6

9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3

More information about the data

ph - PH is an important parameter in evaluating the acid–base balance of water. It is also the


indicator of acidic or alkaline condition of water status. WHO has recommended maximum
permissible limit of pH from 6.5 to 8.5. The current investigation ranges were 6.52–6.83 which
are in the range of WHO standards.

Hardness - Hardness is a measure of the physical properties of the water. It is a measure of the
ability of the water to support the roots and the leaves. The lower the hardness, the more support
the roots and leaves can have.
Solids (Total dissolved solids - TDS) - TDS is a measure of the solids in the water. The water
with high TDS value indicates that water is highly mineralized. Desirable limit for TDS is 500
mg/l and maximum limit is 1000 mg/l which prescribed for drinking purpose.

Chloramines - Chlorine and chloramine are the major disinfectants used in public water systems.
Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking
water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are
considered safe in drinking water.

Sulfate - Sulfate is a common disinfectant used in public water systems. Sulfate levels up to 2
milligrams per liter (mg/L or 2 parts per million (ppm)) are considered safe in drinking water.

Conductivity - Pure water is not a good conductor of electric current rather’s a good insulator.
Increase in ions concentration enhances the electrical conductivity of water. Generally, the
amount of dissolved solids in water determines the electrical conductivity. Electrical
conductivity (EC) actually measures the ionic process of a solution that enables it to transmit
current. According to WHO standards, EC value should not exceeded 400 μS/cm.

Organic_carbon - Total Organic Carbon (TOC) in source waters comes from decaying natural
organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of
carbon in organic compounds in pure water. According to US EPA < 2 mg/L as TOC in treated /
drinking water, and < 4 mg/Lit in source water which is use for treatment.

Trihalomethanes - THMs are chemicals which may be found in water treated with chlorine. The
concentration of THMs in drinking water varies according to the level of organic material in the
water, the amount of chlorine required to treat the water, and the temperature of the water that is
being treated. THM levels up to 80 ppm is considered safe in drinking water.

Turbidity - Turbidity is a measure of the water’s ability to absorb particulate matter. The lower
the turbidity, the more it can absorb particulate matter.

Potability (Target variable) - Indicates if water is safe for human consumption where 1 means
Potable and 0 means Not potable.

In [5]:
# datatypes of the columns
print(df.head())
ph float64
Hardness float64
Solids float64
Chloramines float64
Sulfate float64
Conductivity float64
Organic_carbon float64
Trihalomethanes float64
Turbidity float64
Potability int64
dtype: object

In [6]:
# Describe the data
print(df.describe())
Out[6]:

Chlor Cond Organi Trihalo


Hard Sulfa Turb Pota
ph Solids amin uctivi c_carb methan
ness te idity bility
es ty on es

co 2785. 3276. 3276. 3276. 2495. 3276. 3276. 3276.


3276.00 3114.00
un 0000 0000 00000 00000 0000 00000 0000 0000
0000 0000
t 00 00 0 0 00 0 00 00

m 196.3 22014 333.7


7.080 7.122 426.2 14.2849 66.3962 3.966 0.390
ea 6949 .0925 7577
795 277 05111 70 93 786 110
n 6 26 7

8768.
st 1.594 32.87 1.583 41.41 80.82 3.30816 16.1750 0.780 0.487
57082
d 320 9761 085 6840 4064 2 08 382 849
8

129.0
mi 0.000 47.43 320.9 0.352 181.4 2.20000 0.73800 1.450 0.000
0000
n 000 2000 42611 000 83754 0 0 000 000
0

176.8 15666 307.6


25 6.093 6.127 365.7 12.0658 55.8445 3.439 0.000
5053 .6902 9949
% 092 421 34414 01 36 711 000
8 97 8

196.9 20927 333.0


50 7.036 7.130 421.8 14.2183 66.6224 3.955 0.000
6762 .8336 7354
% 752 299 84968 38 85 028 000
7 07 6

216.6 27332 359.9


75 8.062 8.114 481.7 16.5576 77.3374 4.500 1.000
6745 .7621 5017
% 066 887 92304 52 73 320 000
6 27 0
Chlor Cond Organi Trihalo
Hard Sulfa Turb Pota
ph Solids amin uctivi c_carb methan
ness te idity bility
es ty on es

323.1 61227 481.0


m 14.00 13.12 753.3 28.3000 124.000 6.739 1.000
2400 .1960 3064
ax 0000 7000 42620 00 000 000 000
0 08 2

In [7]:
# Check if there are any null columns
print(df.isnull().sum())
Out[7]:
ph 491
Hardness 0
Solids 0
Chloramines 0
Sulfate 781
Conductivity 0
Organic_carbon 0
Trihalomethanes 162
Turbidity 0
Potability 0
dtype: int64

In [8]:
# Lets try to plot misisng values
plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar', color='skyblue')
plt.title('Number of Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()

Analyze ph column

In [9]:
# for ph column
# set the histogram, mean and median
#9 # Tính toán trung bình và trung vị của cột 'ph'
mean_ph = df['ph'].mean()
median_ph = df['ph'].median()

# Hiển thị kết quả


print(f"Trung bình của cột 'ph': {mean_ph}")
print(f"Trung vị của cột 'ph': {median_ph}")

# Vẽ biểu đồ cột cho cột 'ph'


plt.figure(figsize=(10, 6))
sns.histplot(df['ph'], kde=False, bins=20, color='skyblue', edgecolor='black')
plt.axvline(mean_ph, color='r', linestyle='dashed', linewidth=1, label=f'Mean: {mean_ph:.2f}')
plt.axvline(median_ph, color='g', linestyle='dashed', linewidth=1, label=f'Median:
{median_ph:.2f}')
plt.title('Distribution of pH')
plt.xlabel('pH')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Based on the above data, we can impute ph with either mean or median. There is no skweness in
the data.
Analyze Sulfate column

In [10]:
#10# Mô tả thống kê cơ bản của cột 'Sulphate'
sulfate_values = df['Sulfate'].dropna() # Drop NaN values for plotting

mean_sulfate = sulfate_values.mean()
median_sulfate = sulfate_values.median()

plt.figure(figsize=(10, 6))
sns.histplot(sulfate_values, bins=30, kde=False, color='skyblue', edgecolor='black')

# Add mean and median lines


plt.axvline(mean_sulfate, color='red', linestyle='dashed', linewidth=2, label=f'Mean:
{mean_sulfate:.2f}')
plt.axvline(median_sulfate, color='green', linestyle='dashed', linewidth=2, label=f'Median:
{median_sulfate:.2f}')

plt.title('Histogram of Sulfate values')


plt.xlabel('Sulfate')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Based on the above data, we can impute Sulphate with either mean or median.

Analyze Trihalomethanes column

In [11]:
#11# Mô tả thống kê cơ bản của cột 'Trihalomethanes'
print("Thống kê cơ bản của cột 'Trihalomethanes':")
print(df['Trihalomethanes'].describe())

# Kiểm tra giá trị thiếu trong cột 'Trihalomethanes'


print("\nSố lượng giá trị thiếu trong cột 'Trihalomethanes':")
print(df['Trihalomethanes'].isnull().sum())

# Tính toán trung bình, trung vị, độ lệch chuẩn


mean_thm = df['Trihalomethanes'].mean()
median_thm = df['Trihalomethanes'].median()
std_thm = df['Trihalomethanes'].std()

# Hiển thị kết quả


print(f"\nTrung bình của cột 'Trihalomethanes': {mean_thm}")
print(f"Trung vị của cột 'Trihalomethanes': {median_thm}")
print(f"Độ lệch chuẩn của cột 'Trihalomethanes': {std_thm}")
Based on the above data, we can impute Trihalomethanes with either mean or median
# Vẽ biểu đồ phân phối của cột 'Trihalomethanes'
plt.figure(figsize=(10, 6))
sns.histplot(df['Trihalomethanes'], kde=True, bins=20, color='skyblue', edgecolor='black')
plt.axvline(mean_thm, color='r', linestyle='dashed', linewidth=1, label=f'Mean: {mean_thm:.2f}')
plt.axvline(median_thm, color='g', linestyle='dashed', linewidth=1, label=f'Median:
{median_thm:.2f}')
plt.title('Distribution of Trihalomethanes')
plt.xlabel('Trihalomethanes')
plt.ylabel('Frequency')
plt.legend()
plt.show().
Missing Value imputation

Missing values in ph column


In [12]:
# impute missing values with mean
# 12# Kiểm tra giá trị thiếu trong cột 'ph'
missing_ph_values = df['ph'].isnull().sum()

# Hiển thị số lượng giá trị thiếu trong cột 'ph'


print(f"Số lượng giá trị thiếu trong cột 'ph': {missing_ph_values}")
Identify outliers in the data

In [13]:
# check outliers
# 13# Đọc dữ liệu vào filr, sau đó tính toán
# Tính toán các phân vị Q1 và Q3
Q1 = df['ph'].quantile(0.25)
Q3 = df['ph'].quantile(0.75)

# Tính toán IQR


IQR = Q3 - Q1

# Xác định các điểm ngoại lai


outliers = df[(df['ph'] < (Q1 - 1.5 * IQR)) | (df['ph'] > (Q3 + 1.5 * IQR))]

# Hiển thị số lượng và các điểm ngoại lai


print(f"Số lượng điểm ngoại lai trong cột 'ph': {len(outliers)}")
print("Các điểm ngoại lai trong cột 'ph':")
print(outliers)

Identify corrleation between variables

In [14]:
# # Tính toán ma trận hệ số tương quan
correlation_matrix = df.corr()

# Hiển thị ma trận hệ số tương quan


print("Ma trận hệ số tương quan:")
print(correlation_matrix)

# Vẽ ma trận hệ số tương quan dưới dạng heatmap


plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

There are no categorical variables in the dataset.


Identify skweness in the data
In [15]:
# identify skewness
# Enter your code here
# Showing the skewed columns
# Enter your code here
Number of skewed columns : 0

Out[15]:
#15# Tính toán độ méo lệch của các biến số liên tục
skewness = df.skew()

# Xác định các cột có độ méo lệch dương (lệch phải)


right_skewed_columns = skewness[skewness > 0].index.tolist()

# Hiển thị số lượng và các cột bị lệch phải


print(f"Số lượng cột bị lệch phải: {len(right_skewed_columns)}")
print("Các cột bị lệch phải:")
print(right_skewed_columns)

Skew
There are no skew in our data :)
Lets see the distribution of Potability

In [31]:
#16(31)# Hiển thị phân phối của khả năng uống được
plt.figure(figsize=(10, 6))
sns.countplot(x='Potability', data=df, palette='viridis')
plt.title('Phân phối của khả năng uống được')
plt.xlabel('Khả năng uống được')
plt.ylabel('Số lượng')
plt.show()

Out[31]:
0 1998
1 1278
Name: Potability, dtype: int64
Feature Transformation

In [16]:
# print the dataframe head
# #16
print(df.head(5).to_string(index=False))

Out[16]:

Chlor Trihalo Tur Pota


Hard Sulfa Condu Organic
ph Solids amine methane bidit bilit
ness te ctivity _carbon
s s y y

7.08
204.8 20791. 7.3002 368.5 564.30 10.3797 86.99097 2.96
0 079 0
90455 318981 12 16441 8654 83 0 3135
5

3.71
129.4 18630. 6.6352 333.7 592.88 15.1800 56.32907 4.50
1 608 0
22921 057858 46 75777 5359 13 6 0656
0

8.09
224.2 19909. 9.2758 333.7 418.60 16.8686 66.42009 3.05
2 912 0
36259 541732 84 75777 6213 37 3 5934
4

8.31
214.3 22018. 8.0593 356.8 363.26 18.4365 100.3416 4.62
3 676 0
73394 417441 32 86136 6516 24 74 8771
6

9.09
181.1 17978. 6.5466 310.1 398.41 11.5582 31.99799 4.07
4 222 0
01509 986339 00 35738 0813 79 3 5075
3
In [17]:
# Feature transformation
# scale the numeric columns
# 17
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Đường dẫn tới file CSV


df = pd.read_csv(r'G:/python excel/water_potability.csv')

# Chọn các cột số liên tục


continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']

# Khởi tạo bộ chuyển đổi chuẩn hóa


scaler = StandardScaler()

# Áp dụng chuẩn hóa


df[continuous_columns] = scaler.fit_transform(df[continuous_columns])

print(df.head())
In [18]:
# After transformation print the dataframe head
# #18
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Đọc dữ liệu từ file CSV


df = pd.read_csv('G:/python excel/water_potability.csv')

# Chọn các cột số liên tục


continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']

# Khởi tạo bộ chuyển đổi chuẩn hóa


scaler = StandardScaler()

# Áp dụng chuẩn hóa


df[continuous_columns] = scaler.fit_transform(df[continuous_columns])
# Hiển thị phần đầu của DataFrame sau khi chuẩn hóa
print("DataFrame sau khi chuẩn hóa:")
print(df.head())

Out[18]:

Hard Soli Chlora Sulf Condu Organic Trihalom Turb Pota


ph
ness ds mines ate ctivity _carbon ethanes idity bility

- -
0.00 0.198 0.0854 1.04 1.2271 -
0 0.01 1.028759 0.935 0
0000 981 92 3542 78 0.854560
1702 210

- - - -
0.00 1.4734 0.514
1 2.11 1.696 0.19 0.2490 0.214093 -0.502884 0
0000 06 449
3014 382 6962 88

- - -
0.63 0.684 1.0795 0.00
2 0.08 0.0282 0.590024 0.001189 0.847 0
9503 850 58 0000
7287 51 715

-
0.77 0.437 0.09 0.4674 0.69 0.635
3 0.5050 0.939076 1.695662 0
6180 145 3483 46 4190 242
79

- - - - -
1.26 - 0.113
4 0.398 0.25 0.2936 0.71 0.2022 -1.718287 0
3161 0.592197 188
477 2771 90 0100 62

Save the cleaned data


In [30]:
# #19(30)
# Chọn các cột số liên tục
continuous_columns = ['ph', 'Sulfate', 'Trihalomethanes']

Hypothesis Testing
We define a hypothesis to test in our data set

Hypothesis 1:

Null: Increase in pH is associated with increase in Solids

Alternate : No relataion between ph and Solids


In [29]:
# import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

# Đọc dữ liệu từ file CSV


df = pd.read_csv('G:/python excel/water_potability.csv')

# Giả sử cột 'Solid' là tên cột tương ứng


# Tính toán hệ số tương quan giữa pH và Solid
stat, p_value = stats.ttest_ind(df['ph'], df['Solids'])

print("T-test result: statistic = {}, p-value = {}".format(stat, p_value))

if p_value < 0.05:


print("Giá trị p nhỏ hơn 0,05, vì vậy bác bỏ giả thuyết không ở mức ý nghĩa 5%.")
else:
print("Giá trị p không nhỏ hơn 0,05, vì vậy không thể bác bỏ giả thuyết khống ở mức ý nghĩa
5%.")

Out[29]:
Ttest_indResult(statistic=-4.476932191647608, pvalue=7.705940306619221e-06)
the p value is less than 0.05 , so we are rejecting the null hypothesis at 5% significance level.
Next Step in analyzing the data

Write here

Quality of data

Write here
Key findings
Write here

You might also like