Data Modelling and Visualization

Chapter 5 : Data Aggregation and Analysis

Syllabus : Data Aggregation and Group Operations : Group by Mechanics, Data Aggregation, General split-apply-combine, Pivot Tables and Cross Tabulation. Time Series Data Analysis : Date and Time Data Types and Tools, Time Series Basics, Date Ranges, Frequencies and Shifting, Time Zone Handling, Periods and Period Arithmetic, Resampling and Frequency Conversion, Moving Window Functions.
5.1 Introduction : Data Aggregation

•	Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis.
•	A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession or income.
•	The information about such groups can then be used for Web site personalization to choose content and advertising likely to appeal to an individual belonging to one or more groups for which data has been collected. For example, a site that sells music CDs might advertise certain CDs based on the age of the user and the data aggregate for their age group.
•	Online Analytic Processing (OLAP) is a simple type of data aggregation in which the marketer uses an online reporting mechanism to process the information.
5.2 Data Aggregation and Group Operations

•	Data aggregation and group operations are essential techniques in data analysis and are particularly useful when dealing with large datasets. These methods allow analysts to summarize and gain insights from the data efficiently.
•	Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.
•	One reason for the popularity of relational databases and SQL (which stands for Structured Query Language) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. With the expressiveness of Python and pandas, we can perform quite complex group operations by applying any function that accepts a pandas object or NumPy array. In this chapter, you will learn how to :
o	Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names).
o	Calculate group summary statistics, like count, mean, standard deviation, or a user-defined function that accepts a group.
o	Apply within-group transformations or other manipulations, like normalization, linear regression, or subset selection.
o	Compute pivot tables and cross-tabulations.
o	Perform quantile analysis and other statistical group analyses.
5.2.1 Group by Mechanics

Group by Mechanics in data modeling and visualization is a powerful technique used to group and summarize data based on specific criteria, enabling analysts to extract meaningful insights and make informed decisions. The "Group by" operation involves dividing a dataset into groups based on one or more key variables and then performing calculations or aggregations within each group. This concept is widely used in various data manipulation and analysis tools, including spreadsheet software like Microsoft Excel and data analysis libraries like pandas in Python.
1. Data Organization

Before applying the "Group by" operation, the data needs to be structured in a tabular format, typically represented as a data frame or a table with rows and columns. Each row in the dataset represents a single observation, and each column represents a specific attribute or variable associated with that observation.
2. Identifying Key Variables

To use the "Group by" operation, one or more key variables need to be selected to define the groups. These variables should have categorical or discrete values, such as product categories, geographical regions, or customer IDs. The "Group by" operation will group the data based on the unique values of these key variables.
3. Grouping Process

Once the key variables are chosen, the "Group by" operation will group the data according to the unique values of these variables. All rows with the same value(s) for the key variable(s) will be combined into a separate group. As a result, the dataset is partitioned into multiple subsets, each representing a distinct group.
4. Aggregation and Calculation

•	After the data is grouped, analysts can perform various aggregation functions and calculations on each group independently. Common aggregation functions include sum, mean, median, count, standard deviation, and more. These calculations provide insights into the characteristics and patterns within each group.
•	For example, consider a sales dataset with columns for product categories, sales dates, and sales amounts. By grouping the data based on the product categories, you can obtain separate groups for each category. You can then apply the sum function to calculate the total sales for each product category, providing a concise summary of the sales data for different products.
5. Multi-Level Grouping
Group by Mechanics also supports multi-level grouping, where you can use multiple key variables to
create a hierarchical grouping structure. This allows for deeper analysis by drilling down into subgroups
within larger groups.
For instance, you can group sales data by both product categories and sales regions, providing insights
into how different products perform in various regions.
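The single-key and multi-level grouping described above can be sketched in pandas. This is a minimal, hypothetical illustration: the DataFrame and its column names ('category', 'region', 'amount') are made up for this sketch and are not a dataset used elsewhere in this chapter.

```python
import pandas as pd

# Hypothetical sales data (columns are illustrative only)
sales = pd.DataFrame({
    "category": ["Fruit", "Fruit", "Vegetables", "Vegetables"],
    "region":   ["North", "South", "North", "South"],
    "amount":   [100, 200, 50, 150],
})

# Single-key grouping : total sales per product category
by_category = sales.groupby("category")["amount"].sum()
print(by_category)

# Multi-level grouping : total sales per category within each region
by_category_region = sales.groupby(["category", "region"])["amount"].sum()
print(by_category_region)
```

The multi-key result carries a hierarchical (MultiIndex) row index, one level per key, which is what enables drilling down into subgroups.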
6. Visualization

Once the data is grouped and aggregated, visualizing the results can be highly effective in communicating insights. Bar charts, line charts, pie charts, and other visualizations can help compare and contrast the data across different groups. Visualizations make it easier to identify trends, patterns, and outliers within the dataset.
7. Interpretation and Decision-Making

The insights gained from Group by Mechanics enable data-driven decision-making. Organizations can identify top-performing categories, regions with the highest sales, or trends in customer behavior, which can guide marketing strategies, inventory management, and resource allocation.
5.2.2 Data Aggregation

•	Data Aggregation is any process whereby data is gathered and expressed in a summary form. When data is aggregated, atomic data rows - typically gathered from multiple sources - are replaced with totals or summary statistics. Groups of observed aggregates are replaced with summary statistics based on those observations. Aggregate data is typically found in a data warehouse, as it can provide answers to analytical questions and also dramatically reduce the time to query large sets of data.
•	Data aggregation can enable analysts to access and examine large amounts of data in a reasonable time frame. A row of aggregate data can represent hundreds, thousands or even more atomic data records. When the data is aggregated, it can be queried quickly instead of requiring all of the processing cycles to access each underlying atomic data row and aggregate it in real time when it is queried or accessed.
•	As the amount of data stored by organizations continues to expand, the most important and frequently accessed data can benefit from aggregation, making it feasible to access efficiently.
What does data aggregation do?

Data aggregators summarize data from multiple sources. They provide capabilities for multiple aggregate measurements, such as sum, average and count.
Examples of aggregate data :

•	Voter turnout by state or county. Individual voter records are not presented, just the vote totals by candidate for the specific region.
•	Average age of customer by product. Each individual customer is not identified, but for each product the average age of the customer is saved.
•	Number of customers by country. Instead of examining each customer, a count of the customers in each country is presented.
•	An example of this is creating a summary that shows the aggregate average salary for each department, rather than browsing through individual employee records with salary data.
•	Aggregate data does not need to be numeric. You can, for example, count the number of occurrences of any non-numeric data element.
•	Before aggregating, it is crucial that the atomic data is analyzed for accuracy and that there is enough of it for the aggregation to be useful. For example, counting votes when only 5% of results are reported is not likely to produce a relevant aggregate for prediction.
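The point that aggregates need not be numeric can be sketched in pandas: counting occurrences of a categorical value is itself an aggregation (the 'country' column and its values are invented for this sketch).

```python
import pandas as pd

# A hypothetical customer table with a non-numeric attribute
customers = pd.DataFrame({"country": ["US", "US", "DE", "FR", "DE", "US"]})

# Number of customers by country; no individual customer is exposed,
# only the per-country counts
counts = customers["country"].value_counts()
print(counts)
```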
How do data aggregators work?

Data aggregators work by combining atomic data from multiple sources, processing the data for new insights and presenting the aggregate data in a summary view. Furthermore, data aggregators usually provide the ability to track data lineage and can trace back to the underlying atomic data that was aggregated.
1. Collection : Data aggregation tools may extract data from multiple sources, storing it in large databases as atomic data. The data may be extracted from Internet of Things (IoT) sources, such as the following :
•	social media communications;
•	news headlines;
•	personal data and browsing history from IoT devices; and
•	call centers, podcasts, etc. (through speech recognition).
2. Processing : Once the data is extracted, it is processed. The data aggregator will identify the atomic data that is to be aggregated. The data aggregator may apply predictive analytics, artificial intelligence (AI) or machine learning algorithms to the collected data for new insights. The aggregator then applies the specified statistical functions to aggregate the data.
3. Presentation : Users can present the aggregated data in a summarized format that itself provides new data. The statistical results are comprehensive and high quality.
•	Data aggregation may be performed manually or through the use of data aggregators. However, data aggregation is often performed on a large-scale basis, which makes manual aggregation less feasible. Furthermore, manual aggregation risks accidental omission of crucial data sources and patterns.
5.2.2(A) Uses for Data Aggregation

•	Data aggregation can be helpful for many disciplines, such as finance and business strategy decisions, product planning, product and service pricing, operations optimization and marketing strategy creation. Users may be data analysts, data scientists, data warehouse administrators and subject matter experts.
•	Aggregated data is commonly used for statistical analysis to obtain information about particular groups based on specific demographic or behavioral variables, such as age, profession, education level or income.
•	For business analysis purposes, data can be aggregated into summaries that help leaders make well-informed decisions. User data can be aggregated from multiple sources, such as social media communications, browsing history from IoT devices and other personal data, to give companies critical insights into consumers.
Data aggregation is a crucial technique in data modeling and visualization, used to summarize and condense large datasets into more manageable and insightful representations. This process involves grouping data and applying mathematical or statistical operations to combine data within each group, leading to concise and meaningful results. Data aggregation simplifies complex data, allowing analysts to identify patterns, trends, and essential characteristics, which are vital for data-driven decision-making. Here's a comprehensive explanation of data aggregation :
1. Aggregation Functions

Aggregation functions are mathematical or statistical operations that consolidate data within each group. Common aggregation functions include sum, mean (average), median, count, standard deviation, minimum, and maximum. These functions help create summary statistics, enabling users to understand the distribution and characteristics of the data effectively.
2. Data Organization

Before performing data aggregation, the data should be structured in a tabular format, typically represented as a data frame or a database table. Each row represents an individual observation, while each column contains attributes or variables associated with those observations.
3. Group by Mechanics

Data aggregation often goes hand in hand with the "Group by" operation. By grouping the data based on one or more key variables, the dataset is divided into distinct groups. Aggregation functions are then applied to each group independently, resulting in summary statistics for each group.

For instance, consider a sales dataset with columns for product categories, sales dates, and sales amounts. After grouping the data by product categories, you can apply the sum function to calculate the total sales for each category, providing a concise overview of sales performance across different products.
4. Multi-Level Aggregation

Similar to multi-level grouping, data aggregation also supports multi-level aggregation. In this case, you can apply aggregation functions to multiple key variables, creating a hierarchical summary. This enables deeper insights by examining subgroups within larger groups.

Continuing with the previous example, you could perform multi-level aggregation by both product categories and sales regions. This would provide a comprehensive view of total sales for each product category across different regions.
5. Time-based Aggregation

In time series data, data aggregation is crucial for summarizing and analyzing trends over time. Time-based aggregation involves grouping data into specific time intervals, such as days, weeks, or months, and applying aggregation functions to calculate relevant statistics for each interval.

Time-based aggregation can help identify seasonality, trends, and patterns in time series data. For instance, you can calculate the average daily sales for each month or the total monthly revenue for a specific product.
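Time-based aggregation can be sketched with pandas by grouping on a datetime index with pd.Grouper. The daily series below is synthetic (a constant 'amount' of 1 per day), chosen so the monthly results are easy to verify by hand.

```python
import pandas as pd
import numpy as np

# Synthetic daily sales with a datetime index (values invented for illustration)
idx = pd.date_range("2023-01-01", periods=60, freq="D")
sales = pd.DataFrame({"amount": np.ones(60)}, index=idx)

# Group rows into calendar-month intervals and aggregate each interval:
# total sales per month and average daily sales within each month
monthly = sales.groupby(pd.Grouper(freq="M")).agg(
    total=("amount", "sum"),
    daily_avg=("amount", "mean"),
)
print(monthly)
```

With 60 days starting on 1 January, the groups cover January (31 days), February (28 days), and one day of March, so the totals are 31, 28, and 1.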
6. Visualization

After performing data aggregation, visualizing the results can be highly effective in presenting the summarized information. Bar charts, line charts, area charts, and other visualizations can provide a clear representation of aggregated data, making it easier to compare and analyze trends across different groups.

7. Decision-Making and Analysis

Data aggregation plays a pivotal role in data-driven decision-making. By summarizing large datasets into meaningful insights, analysts can make informed choices, identify growth opportunities, optimize business processes, and address potential issues.
5.2.3 Split-Apply-Combine

In Python we do this using GroupBy, and it involves one or more of the three steps of the Split-Apply-Combine strategy. Let us start by defining each of the three steps.

Fig. 5.2.3 shows the Split-Apply-Combine using an aggregation function.

Fig. 5.2.3 : The Split-Apply-Combine using an aggregation function
1. Split : Split the data into groups based on some criteria, thereby creating a GroupBy object. (We can use a column or a combination of columns to split the data into groups.)
2. Apply : Apply a function to each group independently. (Aggregate, Transform, or Filter the data in this step.)
3. Combine : Combine the results into a data structure (pandas Series, pandas DataFrame).
Import libraries and create a small dataset to work on :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Create an example dataset in the form of a dictionary having key-value pairs :

# sales_dict holds the data shown in the table below
data_sales = pd.DataFrame(sales_dict)
    Colour   Sales   Transactions  Product
0   Yellow  100000           100   Type A
1   Black   150000           150   Type A
2   Blue     80000           820   Type A
3   Red      90000           920   Type A
4   Yellow  200000           230   Type A
5   Black   145000           120   Type A
6   Blue    120000            70   Type A
7   Red     300000           250   Type A
8   Yellow  250000           250   Type A
9   Black   200000           110   Type B
10  Blue    160000           130   Type B
11  Red      90000           360   Type B
12  Yellow   90100           980   Type B
13  Black   150000           300   Type B
14  Blue    142000           150   Type B
15  Red     130000           170   Type B
16  Blue    400000           230   Type B
17  Red     350000           280   Type B

To summarize the whole data graphically, the seaborn library has been used.

Fig. 5.2.4 : Visual data summary of Sales and Transactions by Product type
+ After creating and summarizing the data, asa first step let's move on to the first part of Split-Apply-Combine.
SPLIT : Create an Object

•	In this step we will create the groups from the dataframe 'data_sales' by grouping on the basis of the column 'colour'.

# Split : GroupBy the column 'colour'
data_gby = data_sales.groupby('colour')
print(type(data_gby))
<class 'pandas.core.groupby.groupby.DataFrameGroupBy'>
•	Once we apply the groupby() function on the dataframe, it creates a GroupBy object as a result. We can think of this object as a separate dataframe for each group. Each group has been created based on the categories in the grouped column (in our case, 4 groups will be created from the column 'colour' : 'Black', 'Blue', 'Red', 'Yellow').
•	A GroupBy object stores the data of the individual groups in the form of key-value pairs, as in a dictionary. To know the group names, we can either use the attribute 'keys' or the attribute 'groups' of the GroupBy object.
# Let's check the names of the groups
print(data_gby.groups)
{'Black': Int64Index([1, 5, 9, 13], dtype='int64'),
 'Blue': Int64Index([2, 6, 10, 14, 16], dtype='int64'),
 'Red': Int64Index([3, 7, 11, 15, 17], dtype='int64'),
 'Yellow': Int64Index([0, 4, 8, 12], dtype='int64')}
For further clarity on groups and their content, we can run a loop and print the key-value pairs.

# 'key' is the name of the group and 'value' is the segmented rows from the original dataframe
for key, value in data_gby:
    print('Group Name : ' + key)
    print(value)
    print()

Group Name : Black
    colour   sales  transactions product
1    Black  150000           150  Type A
5    Black  145000           120  Type A
9    Black  200000           110  Type B
13   Black  150000           300  Type B

(... similar output follows for the 'Blue' and 'Red' groups ...)

Group Name : Yellow
    colour   sales  transactions product
0   Yellow  100000           100  Type A
4   Yellow  200000           230  Type A
8   Yellow  250000           250  Type A
12  Yellow   90100           980  Type B
With the above example, I hope we have developed some clarity on the GroupBy object along with
its attributes and methods. With this, now let's move forward to the next stage, which is APPLY.
APPLY : Apply some function on the Object

•	The Apply step can be performed in three ways : Aggregation, Transformation, and Filtering. We all have some amount of experience in using Aggregation with GroupBy objects, but most of us might not have much experience with Transformation and Filtering. Here, we will discuss all three, with special emphasis on Transformation.
•	I am assuming that we are already comfortable with applying the aggregation functions with GroupBy objects, therefore I will start off with some interesting features of this function.

Aggregation
•	By choosing multiple columns to create the group, we increase the granularity of the aggregation. For instance, while splitting we created 4 groups based on the column 'colour', which has 4 categories of colours, so we had 4 groups. Now, if we include the 'product' column, having 2 categories ('Type A' and 'Type B'), along with the 'colour' column, then we will have 8 categories (e.g. 'Type A-Blue', 'Type A-Black', ...) in total (4 x 2). This would be clearer from the code below.
GroupBy two columns and aggregation

# Groups created by multiple columns
data_prod_colour_index = data_sales.groupby(['product', 'colour'], as_index=True).sum()
data_prod_colour_index

                 sales  transactions
product colour
Type A  Black   295000           270
        Blue    200000           890
        Red     390000          1170
        Yellow  550000           580
Type B  Black   350000           410
        Blue    702000           510
        Red     570000           810
        Yellow   90100           980
•	The above code used the aggregation function sum(), thus we get the sum of sales and transactions at the level of granularity defined by the combination of the 'product' and 'colour' columns.
•	It is to be noted that we have used the parameter as_index=True, therefore we see the 'product' and 'colour' columns as the index. On the contrary, if we take the same parameter as False, then in our output we will not get the 'product' and 'colour' columns as the index but as ordinary columns.
# GroupBy without index as grouped column
data_prod_colour_NoIndex = data_sales.groupby(['product', 'colour'], as_index=False).sum()
data_prod_colour_NoIndex

  product  colour   sales  transactions
0  Type A   Black  295000           270
1  Type A    Blue  200000           890
2  Type A     Red  390000          1170
3  Type A  Yellow  550000           580
4  Type B   Black  350000           410
5  Type B    Blue  702000           510
6  Type B     Red  570000           810
7  Type B  Yellow   90100           980
Custom Aggregation grouped by Multiple Columns

In the previous example we used only a single type of aggregation function for all the columns; however, if we want to aggregate different columns with different aggregation functions, then we can use the custom aggregation functionality of the agg() function. For doing this we can pass a dictionary to the aggregation function, stating the column name as 'key' and the function name as 'value'. Interestingly, we can also pass multiple aggregation functions to a column. Let us see an example code below for more clarity :
# Custom aggregation with GroupBy using a dictionary as a parameter inside the agg() function
data_sales.groupby(['product', 'colour'], as_index=True).agg({'sales': ['sum'], 'transactions': ['median', 'count']})

                 sales transactions
                   sum       median count
product colour
Type A  Black   295000        135.0     2
        Blue    200000        445.0     2
        Red     390000        585.0     2
        Yellow  550000        230.0     3
Type B  Black   350000        205.0     2
        Blue    702000        150.0     3
        Red     570000        280.0     3
        Yellow   90100        980.0     1
5.2.4 Pivot Tables

•	Pandas : Pandas is an open-source library that is built on top of the NumPy library. It is a Python package that offers various data structures and operations for manipulating numerical data and time series. It is mainly popular for making importing and analyzing data much easier. Pandas is fast and it has high performance and productivity for users.
•	Pivot Tables : A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
•	Steps Needed
o	Import Library (Pandas)
o	Import / Load / Create data
o	Use pandas.pivot_table() method with different variants
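The steps above can be sketched with pandas.pivot_table. The small orders table is invented for illustration, echoing the fields of the Excel example that follows (Product, Country, Amount); the amounts are made up.

```python
import pandas as pd

# Hypothetical order data (values invented for illustration)
orders = pd.DataFrame({
    "Product": ["Banana", "Banana", "Apple", "Apple", "Beans"],
    "Country": ["France", "Canada", "France", "France", "Canada"],
    "Amount":  [617, 8384, 2417, 1000, 2626],
})

# Rows = Product, values = total Amount (like Excel's "Sum of Amount")
pt = pd.pivot_table(orders, index="Product", values="Amount", aggfunc="sum")
print(pt)

# Two-dimensional pivot : Product rows, Country columns
pt2 = pd.pivot_table(orders, index="Product", columns="Country",
                     values="Amount", aggfunc="sum", fill_value=0)
print(pt2)
```

fill_value=0 replaces the NaN that would otherwise appear for Product/Country combinations with no orders.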
Pivot tables in Excel

•	Pivot tables are one of Excel's most powerful features. A pivot table allows you to extract the significance from a large, detailed data set.
•	Our data set consists of 213 records and 6 fields : Order ID, Product, Category, Amount, Date and Country.
Order ID  Product   Category    Amount  Date       Country
2         Broccoli  Vegetables  $3,239  1/7/2016   United Kingdom
3         Banana    Fruit         $617  1/8/2016   United States
4         Banana    Fruit       $8,384  1/10/2016  Canada
5         Beans     Vegetables  $2,626  1/10/2016  Germany
6         Orange    Fruit       $3,610  1/11/2016  United States
7         Broccoli  Vegetables  $9,062  1/11/2016  Australia
8         Banana    Fruit       $6,906  1/16/2016  New Zealand
9         Apple     Fruit       $2,417  1/16/2016  France
Insert a Pivot Table

To insert a pivot table, execute the following steps.

1. Click any single cell inside the data set.
2. On the Insert tab, in the Tables group, click PivotTable. The following dialog box appears. Excel automatically selects the data for you. The default location for a new pivot table is New Worksheet.
3. Click OK.
Drag fields

The PivotTable Fields pane appears. To get the total amount exported of each product, drag the following fields to the different areas.

1. Product field to the Rows area.
2. Amount field to the Values area.
3. Country field to the Filters area.
Below you can find the pivot table. Bananas are our main export product. That's how easy pivot tables can be!

Sort

To get Banana at the top of the list, sort the pivot table.
1. Click any cell inside the Sum of Amount column.
2. Right click and click on Sort, Sort Largest to Smallest.
Filter

Because we added the Country field to the Filters area, we can filter this pivot table by Country. For example, which products do we export the most to France?

1. Click the filter drop-down and select France.

Result : Apples are our main export product to France.
Change Summary Calculation

By default, Excel summarizes your data by either summing or counting the items. To change the type of calculation that you want to use, execute the following steps.

1. Click any cell inside the Sum of Amount column.
2. Right click and click on Value Field Settings.
3. Choose the type of calculation you want to use to summarize the data from the selected field. For example, click Count.
4. Click OK.

Result : 16 out of the 28 orders to France were 'Apple' orders.
Two-dimensional Pivot Table

If you drag a field to the Rows area and Columns area, you can create a two-dimensional pivot table. First, insert a pivot table. Next, to get the total amount exported to each country, of each product, drag the following fields to the different areas.

1. Country field to the Rows area.
2. Product field to the Columns area.
3. Amount field to the Values area.
4. Category field to the Filters area.
Below you can find the two-dimensional pivot table.

Sum of Amount    Apple  Banana  Beans  Broccoli  Carrots  Mango  Orange
Australia        20634   52721  14433     17953     8106   9186    8680
Canada           24867   33775  12407         …        …      …       …
France           80193   36094    680      5341     9104   7388    2256
Germany           9082   39686  29905     37197    21636   8775    8887
New Zealand      10332   40050   4390         …        …      …       …
United Kingdom   17534   42908   5100     38436    41815   5600   21744
United States    28615   95061  17163     26715    56284  22363   30932
Grand Total     191257  340295  57281    142439   136845  57079  104438
5.2.5 Cross Tabulation

•	Cross tabulation, also known as a contingency table, is another technique used to analyze the relationship between two categorical variables. It presents the frequency distribution of the data for each combination of the variables, providing a clear overview of their associations.
•	Cross tabulation is often visualized as a table, with one variable's categories forming the rows and the other variable's categories forming the columns. The cells of the table display the counts or percentages of observations falling into each category combination.
•	Cross-tabulation analysis has its unique language, using terms such as "banners", "stubs", "Chi-Square Statistic" and "Expected Values". A typical cross-tabulation table might compare the two hypothetical variables "City of Residence" and "Favorite Baseball Team". Are city of residence and being a fan of that team independent? The cells of the table report the frequency counts and percentages for the number of respondents in each cell.
•	You typically use cross tabulation when you have categorical variables or data - e.g. information that can be divided into mutually exclusive groups.
•	For example, a categorical variable could be customer reviews by region. You divide this into reviews per geographical area (North, South, East, West, or state) and then analyze the relationships within that data.
•	Another example of when to use cross-tabulation is with product surveys. You could ask a group of 50 people "Do you like our products?" and use cross-tabulation to get a more insightful answer. Rather than just recording the 50 responses, you can add another independent variable, such as gender, and use cross tabulation to understand how the male and female respondents view your product.
•	With this information, you might see that your female customers prefer your products more than your male customers. You can then use these insights to improve your products for your male customers.
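The survey example above can be sketched with pandas.crosstab. The responses are invented for illustration: one categorical variable for gender and one for the answer to the product question.

```python
import pandas as pd

# Hypothetical survey responses (invented data)
gender = pd.Series(["F", "F", "M", "M", "F", "M"])
likes  = pd.Series(["Yes", "Yes", "No", "Yes", "No", "No"])

# Contingency table of counts for each category combination
table = pd.crosstab(gender, likes)
print(table)

# Row percentages : share of Yes/No within each gender
pct = pd.crosstab(gender, likes, normalize="index")
print(pct)
```

normalize="index" converts each row's counts into proportions, which is the percentage view commonly shown in cross-tabulation reports.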
# Multiply the 3rd period by 4 to get the 12th period (one year)
twelfth_month_period = months_period[2] * 4
print(twelfth_month_period)  # Output: Period('2023', 'A-DEC')
In the context of mathematics and numerical operations, "periods" typically refer to different concepts depending on the specific field of mathematics or context in which they are used. Here are some common interpretations of "periods" and "period arithmetic" :
1. Periods in Trigonometry and Geometry

•	In trigonometry, a "period" refers to the interval over which a trigonometric function, such as sine or cosine, repeats itself. For example, the sine function has a period of 2π radians, which means it repeats its values every 2π radians.
•	In geometry, "periodic tessellation" refers to a repeating pattern of tiles or shapes that covers a plane without any gaps or overlaps. These tessellations often have periodic characteristics.
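The 2π-periodicity of sine stated above can be checked numerically with NumPy:

```python
import numpy as np

# sin(x) repeats every 2π radians : sin(x) == sin(x + 2π) for all x
x = np.linspace(0.0, 10.0, 101)
print(np.allclose(np.sin(x), np.sin(x + 2 * np.pi)))  # True
```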
2. Periods in Time and Frequency

In the context of time and frequency analysis, a "period" usually refers to the time it takes for a periodic event or wave to complete one full cycle. For example, the period of a simple harmonic oscillator such as a pendulum is the time it takes to swing back and forth once.
3. Periods in Financial Mathematics

In finance, "periods" often refer to discrete time intervals, such as months, quarters, or years. Interest calculations often involve compound interest, where the number of periods plays a crucial role in determining the final amount.
Now, let's briefly discuss "period arithmetic", which might refer to mathematical operations or calculations involving periods :
4. Periodic Functions Arithmetic

When working with periodic functions like sine and cosine, you can perform arithmetic operations involving these functions. For example, you can add, subtract, multiply, or divide periodic functions while considering their periods.
5. Time-Series Arithmetic

In time-series analysis, arithmetic operations can be performed on data points collected at regular time intervals (periods). These operations might include calculating averages, growth rates, or changes in values over specific periods of time.
6. Financial Periods Arithmetic :
*	In finance, calculations often involve periods, such as compounding interest over a specific number of periods, calculating the net present value of cash flows occurring at different periods, or determining the future value of investments over multiple periods.
*	To perform arithmetic operations with periods, it is important to understand the specific context and units involved, as different fields and situations may have their own conventions and formulas for working with periods.
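In pandas, this notion of periods and period arithmetic is exposed through the `pd.Period` type: adding an integer shifts a period by that many intervals, and subtracting two periods of the same frequency gives the number of intervals between them. A minimal sketch:

```python
import pandas as pd

# A pandas Period represents a span of time at a fixed frequency.
p = pd.Period('2023-01', freq='M')   # the month of January 2023

# Adding or subtracting an integer shifts the period by that many intervals.
print(p + 3)            # 2023-04
print(p - 1)            # 2022-12

# Subtracting two periods of the same frequency gives the offset
# between them (5 months in this case).
q = pd.Period('2023-06', freq='M')
print((q - p).n)        # 5
```

The same arithmetic works for any supported frequency ('D', 'Q', 'A', and so on), which is what makes periods convenient for financial calculations over quarters or years.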
5.3.8 Resampling and Frequency Conversion
Resampling and frequency conversion are important techniques used in time series data analysis to change the time intervals at which data is recorded or observed. These operations are useful for adjusting the granularity of time series data, aggregating data at different time intervals, and preparing data for specific time series analyses. Let's explore resampling and frequency conversion in more detail :
1. Resampling : Resampling involves changing the frequency of a time series by aggregating or interpolating data to different time intervals. Resampling is often necessary when the original data is collected at a different frequency than the desired analysis frequency. There are two primary types of resampling :
	a) Downsampling : Downsampling involves reducing the frequency of the data, converting it to a lower time resolution. This is typically done to summarize data over longer time periods. For example, converting daily data to weekly or monthly data involves downsampling.
	In downsampling, data is aggregated or combined within each new time interval. Common aggregation functions include sum, mean, median, or other statistical measures.
	b) Upsampling : Upsampling involves increasing the frequency of the data, converting it to a higher time resolution. This is often done to add more data points within a given time period. For example, converting monthly data to daily data involves upsampling.
	In upsampling, new data points are interpolated or filled in between existing data points, and interpolation methods may include linear interpolation, polynomial interpolation, or other techniques depending on the nature of the data.
2. Frequency Conversion :
*	Frequency conversion refers to changing the time intervals of a time series to a new frequency, which can be either higher or lower than the original frequency. Frequency conversion is a combination of both downsampling and upsampling operations.
*	For example, converting daily data to hourly data involves frequency conversion by upsampling, where new data points are inserted between the existing daily data points.
*	Frequency conversion is useful when analyzing data at a different time granularity, or when aligning multiple time series with different frequencies for comparison.
3. Resampling and Frequency Conversion Example :
Let's illustrate resampling and frequency conversion using Python with the pandas library :

import pandas as pd

# Create a daily time series with sample data
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
daily_data = pd.Series(range(10), index=dates)

# Downsample to weekly data by taking the sum of each week
weekly_data = daily_data.resample('W').sum()
print(weekly_data)

# Upsample to hourly data using linear interpolation
hourly_data = daily_data.resample('H').interpolate(method='linear')
print(hourly_data)
In this example, we first create a daily time series with sample data. We then downsample the data to weekly frequency by taking the sum of data points within each week. Next, we upsample the data to hourly frequency using linear interpolation to fill in new data points between the existing daily data points.
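Frequency conversion without aggregation can also be sketched with pandas' `asfreq` method, which simply realigns the index to the new frequency. This is a minimal sketch with made-up values; note that `asfreq` leaves the newly inserted timestamps as NaN unless a fill method is given:

```python
import pandas as pd

# Three days of daily observations
dates = pd.date_range(start='2023-01-01', periods=3, freq='D')
daily = pd.Series([1.0, 2.0, 3.0], index=dates)

# Convert to 12-hourly frequency: in-between timestamps appear as NaN
half_daily = daily.asfreq('12H')
print(half_daily.isna().sum())   # 2 newly inserted points are NaN

# Forward-fill the gaps instead of leaving NaN
filled = daily.asfreq('12H', method='ffill')
print(filled)
```

The difference between `asfreq` and `resample` is that `asfreq` only selects or inserts index points, while `resample` groups the data so an aggregation or interpolation can be applied.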
5.3.9 Moving Window Functions
Moving window functions, also known as rolling or sliding window functions, are essential tools in time series data analysis. They involve applying a specific function or operation to a fixed-size window of data that moves along the time axis. These functions are used to calculate rolling statistics, smooth data, identify trends, and perform other time-based computations. Moving window functions are particularly valuable when dealing with noisy data. Let's explore moving window functions in more detail :
Basic Concept : The basic concept of moving window functions involves defining a fixed-size window that spans a specific number of consecutive data points in the time series. The window moves one step at a time along the time axis, continuously updating the window's data points and recalculating the function's output.
Applications : Moving window functions have various applications in time series data analysis, including :
	a) Rolling Statistics : Calculating rolling statistics such as rolling mean, rolling median, rolling standard deviation, or other aggregated measures within the moving window. These statistics help smooth the data and identify underlying patterns and trends.
	b) Moving Averages : Computing moving averages by taking the mean of the data within the moving window. Moving averages are commonly used to reveal underlying trends or eliminate noise in the data.
	c) Exponential Moving Averages (EMA) : Similar to moving averages, EMA assigns exponentially decreasing weights to data points within the window, giving more weight to recent observations. EMA is widely used in financial analysis and trend detection.
	d) Rolling Sum : Calculating the sum of data points within the window to analyze cumulative values or trends.
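The application types above can be sketched with pandas; the series values below are made up for illustration:

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=6, freq='D')
data = pd.Series([10, 12, 11, 15, 14, 16], index=dates)

# a) Rolling statistics: mean and standard deviation over a 3-point window
rolling_mean = data.rolling(window=3).mean()
rolling_std = data.rolling(window=3).std()

# c) Exponential moving average: recent points weighted more heavily
ema = data.ewm(span=3).mean()

# d) Rolling sum over the same window
rolling_sum = data.rolling(window=3).sum()

print(rolling_mean.iloc[2])   # mean of 10, 12, 11 -> 11.0
print(rolling_sum.iloc[-1])   # 15 + 14 + 16 -> 45
```

Note that `ewm` does not use a hard window at all: every past observation contributes, but with exponentially decaying weight controlled by `span`.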
Window Size : The size of the moving window is an essential parameter in moving window functions. The window size determines the number of data points included in the computation at any given time. A larger window size results in a smoother output but may lead to slower responsiveness to changes in the data. On the other hand, a smaller window size provides a more responsive output but may be more sensitive to noise.
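This trade-off can be sketched by applying two different window sizes to the same noisy series (the values below are illustrative) and comparing how much the smoothed outputs still fluctuate:

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=8, freq='D')
data = pd.Series([10, 12, 30, 11, 13, 29, 12, 14], index=dates)

# A small window still tracks the spikes; a larger window averages them out
smooth_small = data.rolling(window=2).mean()
smooth_large = data.rolling(window=4).mean()

# The larger window's output varies less from point to point
print(smooth_small.std() > smooth_large.std())   # True
```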
Handling Boundary Effects : When applying moving window functions, boundary effects must be considered. At the beginning and end of the time series, there may not be enough data points to form a complete window. Different strategies can be used to handle this, such as padding missing values or using weighted windows that give more weight to available data points.
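In pandas, one common strategy at the start of the series is the `min_periods` parameter of `rolling`, which lets the window produce output before it is completely filled. A minimal sketch:

```python
import pandas as pd

data = pd.Series([10, 20, 30, 40])

# Default: the first two outputs are NaN because the window is incomplete
strict = data.rolling(window=3).mean()

# min_periods=1: partial windows at the boundary still yield a value
lenient = data.rolling(window=3, min_periods=1).mean()

print(strict.isna().sum())    # 2 incomplete positions at the start
print(lenient.iloc[0])        # mean of the single available point: 10.0
```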
5.3.9(A) Moving Window Functions Example
*	Time series data is a series of data points recorded with a time component (temporal) present. Most of the time these data points are recorded at a fixed time interval.
*	Many real-world datasets like stock market data, weather data, geography datasets, earthquake data, etc. are time series datasets.
*	While working with time series datasets, we need to perform various operations on them to analyze them from different perspectives. The two most common operations are resampling and moving window functions.
*	Time series resampling is the process of changing the frequency at which data points (observations) are recorded. Resampling is generally performed to analyze how time series data behaves under different frequencies.
*	Moving window functions are aggregate functions applied to time series datasets by moving a window of fixed or variable size through them. Moving window functions can be used to smooth time series to handle noise.
Let's illustrate moving window functions using Python with the pandas library :

import pandas as pd

# Create a time series with sample data
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
data = pd.Series([10, 15, 20, 25, 30, 35, 40, 45, 50, 55], index=dates)

# Calculate the rolling mean with a window size of 3
rolling_mean = data.rolling(window=3).mean()
print(rolling_mean)
In this example, we create a time series with sample data and calculate the rolling mean using a window size of 3. The rolling mean smooths the data by taking the average of every three consecutive data points.
What is meant by group by mechanics?
Explain data aggregation.
Discuss various uses of data aggregation.
Explain the three steps of the split-apply-combine strategy.
Explain various date and time tools.
What is time series data?
Discuss date ranges.