Skip to content

Commit 41a6cef

Browse files
authored
Merge pull request animator#448 from AnushkaJMokashi/GroupBy_Pandas
GroupBy_Pandas
2 parents 44330af + 4fdaf92 commit 41a6cef

File tree

2 files changed

+392
-0
lines changed

2 files changed

+392
-0
lines changed
Lines changed: 391 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,391 @@
1+
## Group By Functions
2+
3+
GroupBy is a powerful function in pandas that allows you to split data into distinct groups based on one or more columns and perform operations on each group independently. It's a fundamental technique for data analysis and summarization.
4+
5+
Here's a step-by-step breakdown of how groupby functions work in pandas:
6+
7+
* __Splitting the Data:__ You can group your data based on one or more columns using the .groupby() method. This method takes a column name or a list of column names as input and splits the DataFrame into groups according to the values in those columns.
8+
9+
* __Applying a Function:__ Once the data is grouped, you can apply various functions to each group. Pandas offers a variety of built-in aggregation functions like sum(), mean(), count(), etc., that can be used to summarize the data within each group. You can also use custom functions or lambda functions for more specific operations.
10+
11+
* __Combining the Results:__ After applying the function to each group, the results are combined into a new DataFrame or Series, depending on the input data and the function used. This new data structure summarizes the data by group.
12+
13+
14+
```python
15+
import pandas as pd
16+
import seaborn as sns
17+
import numpy as np
18+
```
19+
20+
21+
```python
22+
iris_data = sns.load_dataset('iris')
23+
```
24+
25+
This code loads the built-in Iris dataset from seaborn and stores it in a pandas DataFrame named iris_data. The Iris dataset contains measurements of flower sepal and petal dimensions for three Iris species (Setosa, Versicolor, Virginica).
26+
27+
28+
```python
29+
iris_data
30+
```
31+
32+
| | sepal_length | sepal_width | petal_length | petal_width | species |
33+
|----|--------------|-------------|--------------|-------------|-----------|
34+
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
35+
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
36+
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
37+
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
38+
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
39+
| ...| ... | ... | ... | ... | ... |
40+
| 145| 6.7 | 3.0 | 5.2 | 2.3 | virginica |
41+
| 146| 6.3 | 2.5 | 5.0 | 1.9 | virginica |
42+
| 147| 6.5 | 3.0 | 5.2 | 2.0 | virginica |
43+
| 148| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
44+
| 149| 5.9 | 3.0 | 5.1 | 1.8 | virginica |
45+
46+
47+
48+
49+
50+
```python
51+
iris_data.groupby(['species']).count()
52+
```
53+
54+
55+
56+
57+
| species | sepal_length | sepal_width | petal_length | petal_width |
58+
|------------|--------------|-------------|--------------|-------------|
59+
| setosa | 50 | 50 | 50 | 50 |
60+
| versicolor | 50 | 50 | 50 | 50 |
61+
| virginica | 50 | 50 | 50 | 50 |
62+
63+
64+
65+
66+
* We group the data by the 'species' column.
67+
count() is applied to each group, which counts the number of occurrences (rows) in each species category.
68+
* The output (species_counts) is a DataFrame showing the count of each species in the dataset.
69+
70+
71+
```python
72+
iris_data.groupby(["species"])["sepal_length"].mean()
73+
```
74+
75+
76+
77+
78+
species
79+
setosa 5.006\
80+
versicolor 5.936\
81+
virginica 6.588\
82+
Name: sepal_length, dtype: float64
83+
84+
85+
86+
* This groups the data by 'species' and selects the 'sepal_length' column.
87+
mean() calculates the average sepal length for each species group.
88+
* The output (species_means) is a Series containing the mean sepal length for each species.
89+
90+
91+
```python
92+
iris_data.groupby(["species"])["sepal_length"].std()
93+
```
94+
95+
96+
97+
98+
species
99+
setosa 0.352490\
100+
versicolor 0.516171\
101+
virginica 0.635880\
102+
Name: sepal_length, dtype: float64
103+
104+
105+
106+
* Similar to the previous, this groups by 'species' and selects the 'sepal_length' column.
107+
However, it calculates the standard deviation (spread) of sepal length for each species group using std().
108+
* The output (species_std) is a Series containing the standard deviation of sepal length for each species
109+
110+
111+
```python
112+
iris_data.groupby(["species"])["sepal_length"].describe()
113+
```
114+
115+
116+
117+
| species | count | mean | std | min | 25% | 50% | 75% | max |
118+
|------------|-------|-------|----------|------|--------|------|------|------|
119+
| setosa | 50.0 | 5.006 | 0.352490 | 4.3 | 4.800 | 5.0 | 5.2 | 5.8 |
120+
| versicolor | 50.0 | 5.936 | 0.516171 | 4.9 | 5.600 | 5.9 | 6.3 | 7.0 |
121+
| virginica | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 |
122+
123+
124+
125+
126+
* We have used describe() to generate a more comprehensive summary of sepal length for each species group.
127+
* It provides statistics like count, mean, standard deviation, minimum, maximum, percentiles, etc.
128+
The output (species_descriptions) is a DataFrame containing these descriptive statistics for each species.
129+
130+
131+
```python
132+
iris_data.groupby(["species"])["sepal_length"].quantile(q=0.25)
133+
```
134+
135+
136+
137+
138+
species\
139+
setosa 4.800\
140+
versicolor 5.600\
141+
virginica 6.225\
142+
Name: sepal_length, dtype: float64
143+
144+
145+
146+
147+
```python
148+
iris_data.groupby(["species"])["sepal_length"].quantile(q=0.75)
149+
```
150+
151+
152+
153+
154+
species\
155+
setosa 5.2\
156+
versicolor 6.3\
157+
virginica 6.9\
158+
Name: sepal_length, dtype: float64
159+
160+
161+
162+
* To calculate the quartiles (25th percentile and 75th percentile) of sepal length for each species group.
163+
* quantile(q=0.25) gives the 25th percentile, which represents the value below which 25% of the data points lie.
164+
* quantile(q=0.75) gives the 75th percentile, which represents the value below which 75% of the data points lie.
165+
* The outputs (species_q1 and species_q3) are Series containing the respective quartile values for each species.
166+
167+
## Custom Function For Group By
168+
169+
170+
```python
171+
nc = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species']
172+
```
173+
174+
175+
```python
176+
nc
177+
```
178+
179+
180+
181+
182+
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
183+
184+
185+
186+
187+
```python
188+
nc = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
189+
def species_stats(species_data,species_name):
190+
print("Species Name: {}".format(species_name))
191+
print()
192+
print("Mean:\n",species_data[nc].mean())
193+
print()
194+
print("Median:\n",species_data[nc].median())
195+
print()
196+
print("std:\n",species_data[nc].std())
197+
print()
198+
print("25% percentile:\n",species_data[nc].quantile(0.25))
199+
print()
200+
print("75% percentile:\n",species_data[nc].quantile(0.75))
201+
print()
202+
print("Min:\n",species_data[nc].min())
203+
print()
204+
print("Max:\n",species_data[nc].max())
205+
print()
206+
```
207+
208+
209+
```python
210+
setosa_data = iris_data[iris_data['species'] == 'setosa']
211+
```
212+
213+
214+
```python
215+
versicolor_data = iris_data[iris_data['species'] == 'versicolor']
216+
```
217+
218+
219+
```python
220+
virginica_data = iris_data[iris_data['species'] == 'virginica']
221+
```
222+
223+
224+
```python
225+
species_data_names = ['setosa_data','viginica_data','versicolor_data']
226+
for data in species_data_names:
227+
print("************** Species name {} *****************".format(data))
228+
species_stats(setosa_data,data)
229+
print("------------------------------------")
230+
```
231+
232+
************** Species name setosa_data *****************\
233+
Species Name: setosa_data
234+
235+
Mean:\
236+
sepal_length 5.006\
237+
sepal_width 3.428\
238+
petal_length 1.462\
239+
petal_width 0.246\
240+
dtype: float64
241+
242+
Median:\
243+
sepal_length 5.0\
244+
sepal_width 3.4\
245+
petal_length 1.5\
246+
petal_width 0.2\
247+
dtype: float64
248+
249+
std:\
250+
sepal_length 0.352490\
251+
sepal_width 0.379064\
252+
petal_length 0.173664\
253+
petal_width 0.105386\
254+
dtype: float64
255+
256+
25% percentile:\
257+
sepal_length 4.8\
258+
sepal_width 3.2\
259+
petal_length 1.4\
260+
petal_width 0.2\
261+
Name: 0.25, dtype: float64
262+
263+
75% percentile:\
264+
sepal_length 5.200\
265+
sepal_width 3.675\
266+
petal_length 1.575\
267+
petal_width 0.300\
268+
Name: 0.75, dtype: float64
269+
270+
Min:\
271+
sepal_length 4.3\
272+
sepal_width 2.3\
273+
petal_length 1.0\
274+
petal_width 0.1\
275+
dtype: float64
276+
277+
Max:
278+
sepal_length 5.8\
279+
sepal_width 4.4\
280+
petal_length 1.9\
281+
petal_width 0.6\
282+
dtype: float64
283+
284+
------------------------------------\
285+
************** Species name viginica_data *****************\
286+
Species Name: viginica_data
287+
288+
Mean:\
289+
sepal_length 5.006\
290+
sepal_width 3.428\
291+
petal_length 1.462\
292+
petal_width 0.246\
293+
dtype: float64
294+
295+
Median:\
296+
sepal_length 5.0\
297+
sepal_width 3.4\
298+
petal_length 1.5\
299+
petal_width 0.2\
300+
dtype: float64
301+
302+
std:\
303+
sepal_length 0.352490\
304+
sepal_width 0.379064\
305+
petal_length 0.173664\
306+
petal_width 0.105386\
307+
dtype: float64
308+
309+
25% percentile:\
310+
sepal_length 4.8\
311+
sepal_width 3.2\
312+
petal_length 1.4\
313+
petal_width 0.2\
314+
Name: 0.25, dtype: float64
315+
316+
75% percentile:\
317+
sepal_length 5.200\
318+
sepal_width 3.675\
319+
petal_length 1.575\
320+
petal_width 0.300\
321+
Name: 0.75, dtype: float64
322+
323+
Min:\
324+
sepal_length 4.3\
325+
sepal_width 2.3\
326+
petal_length 1.0\
327+
petal_width 0.1\
328+
dtype: float64
329+
330+
Max:
331+
sepal_length 5.8
332+
sepal_width 4.4
333+
petal_length 1.9
334+
petal_width 0.6
335+
dtype: float64
336+
337+
------------------------------------\
338+
************** Species name versicolor_data *****************\
339+
Species Name: versicolor_data
340+
341+
Mean:\
342+
sepal_length 5.006\
343+
sepal_width 3.428\
344+
petal_length 1.462\
345+
petal_width 0.246\
346+
dtype: float64
347+
348+
Median:\
349+
sepal_length 5.0\
350+
sepal_width 3.4\
351+
petal_length 1.5\
352+
petal_width 0.2\
353+
dtype: float64
354+
355+
std:\
356+
sepal_length 0.352490\
357+
sepal_width 0.379064\
358+
petal_length 0.173664\
359+
petal_width 0.105386\
360+
dtype: float64
361+
362+
25% percentile:\
363+
sepal_length 4.8\
364+
sepal_width 3.2\
365+
petal_length 1.4\
366+
petal_width 0.2\
367+
Name: 0.25, dtype: float64
368+
369+
75% percentile:\
370+
sepal_length 5.200\
371+
sepal_width 3.675\
372+
petal_length 1.575\
373+
petal_width 0.300\
374+
Name: 0.75, dtype: float64
375+
376+
Min:
377+
sepal_length 4.3\
378+
sepal_width 2.3\
379+
petal_length 1.0\
380+
petal_width 0.1\
381+
dtype: float64
382+
383+
Max:\
384+
sepal_length 5.8\
385+
sepal_width 4.4\
386+
petal_length 1.9\
387+
petal_width 0.6\
388+
dtype: float64
389+
390+
------------------------------------
391+

contrib/pandas/index.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,4 +2,5 @@
22

33
- [Pandas Series Vs NumPy ndarray](pandas_series_vs_numpy_ndarray.md)
44
- [Pandas Descriptive Statistics](Descriptive_Statistics.md)
5+
- [Group By Functions with Pandas](GroupBy_Functions_Pandas.md)
56
- [Excel using Pandas DataFrame](excel_with_pandas.md)

0 commit comments

Comments
 (0)