Data Analytics and Reporting - Notes Unit 1 and 2


Data Analytics and Reporting: An Introduction
Welcome to the World of Data!
Understanding the Importance of Data

In today's digital age, data is the new oil. It's the raw material that fuels
innovation, decision-making, and business growth. Data Analytics is the process
of examining, cleaning, transforming, and modeling data to discover useful
information, draw conclusions, and support decision-making.

Why Python?

Python has emerged as the language of choice for data scientists and analysts
due to its simplicity, readability, and powerful libraries. It's versatile, making it
suitable for both beginners and experienced programmers.

Python: A Brief Overview


● What is Python?
a. A high-level, interpreted programming language
b. Known for its readability and simplicity
c. Widely used in data science, machine learning, web development, and
more.

● History of Python:
a. Created by Guido van Rossum in the late 1980s
b. Named after the British comedy group Monty Python
c. Initially designed for scripting and automation
d. Grew in popularity due to its focus on code readability and efficiency
● Purpose of Python in Data Analytics:
a. Data manipulation and cleaning
b. Exploratory data analysis (EDA)
c. Data visualization
d. Machine learning and model building
e. Statistical analysis

Data Types in Python


Data types define the kind of data a variable can hold. Python supports various
data types:

● Numeric:
a. int: Integer values (e.g., 42, -10)
b. float: Floating-point numbers (e.g., 3.14, 2.5)
c. complex: Complex numbers (e.g., 2+3j)

● Text:
a. str: Strings (e.g., "Hello", 'World')

● Boolean:
a. bool: Boolean values (True or False)

● Sequence:
a. list: Ordered collection of items (mutable)
b. tuple: Ordered collection of items (immutable)

● Mapping:
a. dict: Collection of key-value pairs (insertion-ordered since Python 3.7)
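Each of these types can be checked with the built-in type() function; a quick sketch:

```python
# Inspect the type of one value from each category with type()
print(type(42))        # <class 'int'>
print(type(3.14))      # <class 'float'>
print(type(2 + 3j))    # <class 'complex'>
print(type("Hello"))   # <class 'str'>
print(type(True))      # <class 'bool'>
print(type([1, 2]))    # <class 'list'>
print(type((1, 2)))    # <class 'tuple'>
print(type({"a": 1}))  # <class 'dict'>
```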

Pandas: Your Data Analysis Toolkit


Pandas is a powerful Python library built on top of NumPy. It provides
high-performance, easy-to-use data structures and data analysis tools.

Installation:
1. Open your terminal or command prompt.
2. Type the following command and press Enter:
Bash
pip install pandas

Importing Pandas:

To use Pandas in your Python code, import it as follows:

Python

import pandas as pd

DataFrame: The Core Data Structure

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or SQL table.

Creating a DataFrame:

Python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
print(df)

In the next session, we will delve deeper into Pandas, exploring various
data manipulation techniques and visualization capabilities.
Remember: Practice is key to mastering Python and Pandas. Experiment with
different datasets and explore the vast functionalities offered by these libraries.

Unit-01: Introduction to Data Analytics and Reporting
Lecture 1: What is Data Analytics?
Data Analytics is the process of examining large data sets to discover trends
and patterns. It involves collecting, cleaning, transforming, and analyzing data to
extract meaningful insights. These insights can be used to make informed
decisions, identify opportunities, and solve problems.

Real-world example: A retailer might use data analytics to analyze customer purchasing behavior to determine which products to promote or to identify trends in customer preferences.

Lecture 2: Data Analysis and Data Processing


Data Analysis is the process of inspecting, cleansing, transforming, and
modeling data with the goal of discovering useful information, informing
conclusions, and supporting decision-making.

Data Processing is the conversion of raw data into a more organized format
suitable for analysis. This involves tasks like data cleaning, transformation, and
integration.

Real-world example: A telecom company might process customer call records to analyze call patterns, identify network congestion areas, and improve service quality.

Lecture 3: Types of Analysis

● Descriptive Analytics: Summarizes historical data to understand what happened.
○ Examples: Sales reports, customer demographics

● Diagnostic Analytics: Explores the reasons behind past occurrences.
○ Examples: Root cause analysis of product failures

● Predictive Analytics: Uses historical data to predict future outcomes.
○ Examples: Customer churn prediction, demand forecasting

● Prescriptive Analytics: Recommends actions based on predictive models.
○ Examples: Product recommendations, optimized pricing strategies

Lecture 4: Difference Between Data Science and Data Analysis

Data Science is a broader field that encompasses data analysis, machine learning, and data visualization. It focuses on extracting insights from data to solve complex problems.

Data Analysis is a subset of data science that focuses on exploring and understanding data to uncover patterns and trends.

Lecture 5: Different Data Preprocessing Techniques

Data Preprocessing is the process of transforming raw data into a clean and structured format suitable for analysis. Techniques include:

● Data Cleaning: Handling missing values, outliers, and inconsistencies.
● Data Integration: Combining data from multiple sources.
● Data Transformation: Normalization, standardization, and aggregation.
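The three techniques above can be sketched with pandas; the column names and values here are invented for illustration:

```python
import pandas as pd

# Two small sources: one has a missing value (hypothetical data)
sales = pd.DataFrame({'store': ['A', 'B', 'C'],
                      'revenue': [100.0, None, 300.0]})
regions = pd.DataFrame({'store': ['A', 'B', 'C'],
                        'region': ['East', 'West', 'East']})

# Data cleaning: fill the missing revenue with the column mean
sales['revenue'] = sales['revenue'].fillna(sales['revenue'].mean())

# Data integration: combine the two sources on the shared 'store' key
merged = sales.merge(regions, on='store')

# Data transformation: min-max normalization of revenue to [0, 1]
merged['revenue_norm'] = (merged['revenue'] - merged['revenue'].min()) / (
    merged['revenue'].max() - merged['revenue'].min())
print(merged)
```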
Lecture 6: Understanding Reporting and Use of Different Tools

Reporting is the process of communicating insights derived from data analysis to stakeholders. Effective reporting involves clear visualization and concise communication. Tools:

● Business Intelligence (BI) tools: Power BI, Tableau, IBM Cognos
● Data Visualization tools: Excel, Python libraries (Matplotlib, Seaborn)
● Statistical analysis: Pandas

Real-world example: A marketing team might use a BI tool to create a dashboard showing sales trends, customer demographics, and campaign performance.

Unit-02: Data Analysis Using Pandas
Pandas: A Powerful Tool for Data Manipulation
Pandas is a Python library specifically designed for data manipulation and
analysis. It provides high-performance, easy-to-use data structures and data
analysis tools. Think of it as a spreadsheet on steroids, offering much more
flexibility and capabilities.

Key Features of Pandas:


● Data Structures: Pandas introduces two primary data structures:
○ Series: One-dimensional labeled array holding any data type.
○ DataFrame: Two-dimensional labeled data structure with columns of
potentially different types.
● Data Import/Export: Easily handles various file formats like CSV, Excel, JSON, SQL databases, and more.
● Data Cleaning and Preparation: Offers functions to handle missing values, duplicates, outliers, and data normalization.
● Data Analysis: Provides tools for statistical calculations, data aggregation, and exploratory data analysis.

● Time Series: Excellent support for working with time-series data.

Why Pandas is Popular:


● Efficiency: It's optimized for performance on large datasets.
● Flexibility: Handles diverse data types and structures.
● Ease of Use: Intuitive syntax and clear documentation.
● Integration: Works seamlessly with other Python libraries like NumPy,
Matplotlib, and Scikit-learn.

Lecture 7: Types of Data and Different Sources of Data


● Structured Data: Organized in a predefined format (e.g., databases, CSV
files)
● Unstructured Data: No predefined format (e.g., text, images, audio)
● Semi-Structured Data: Hybrid of structured and unstructured (e.g., JSON,
XML)

Data Sources:

● Databases (SQL, NoSQL)
● Files (CSV, Excel, JSON)
● APIs (REST, GraphQL)
● Web scraping
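Pandas has a reader for most of these sources; a minimal sketch loading the same records from semi-structured JSON text and then from an in-memory SQLite database (the table and field names are invented):

```python
import json
import sqlite3
import pandas as pd

# Semi-structured data: a JSON string of records (hypothetical)
records = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df_json = pd.DataFrame(json.loads(records))

# Structured data: a SQL table (in-memory SQLite keeps the sketch self-contained)
conn = sqlite3.connect(':memory:')
df_json.to_sql('people', conn, index=False)
df_sql = pd.read_sql('SELECT * FROM people', conn)
print(df_sql)
```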
Lecture 8: Overview of Pandas Library
Pandas is a Python library for data manipulation and analysis. It provides
high-performance data structures and data analysis tools.

Lecture 9: Data Structures in Pandas: Series and DataFrame


● Series: One-dimensional labeled array
● DataFrame: Two-dimensional labeled data structure

Python

import pandas as pd

# Create a Series
data = [10, 20, 30, 40]
series = pd.Series(data)
print(series)

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

Lecture 10: Importing and Exporting Data Using Pandas

Python
import pandas as pd

# Import CSV data
df = pd.read_csv('data.csv')

# Export to CSV
df.to_csv('output.csv', index=False)
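The same pattern applies to other formats such as JSON and Excel; a sketch round-tripping a small DataFrame through JSON (the file name is made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Export to JSON as a list of records, then read it back
df.to_json('output.json', orient='records')
df_back = pd.read_json('output.json')
print(df_back)
```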

Lecture 11: Data Cleaning and Preparation with Pandas

● Handling missing values: dropna(), fillna()
● Removing duplicates
● Outlier detection and treatment

Python

import pandas as pd
import numpy as np

# Sample data with missing values and a duplicate row (illustrative)
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Bob'],
                   'Age': [25, np.nan, np.nan]})

# Handle missing values
df.fillna(0, inplace=True)  # Fill missing values with 0

# Remove duplicates
df.drop_duplicates(inplace=True)
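The outlier step listed above is not covered by the snippet; one common approach is the IQR rule, sketched here on an invented Age column:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 25, 27, 26, 24, 95]})  # 95 is an obvious outlier

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['Age'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean_df = df[mask]
print(clean_df)
```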

Lecture 12: Handling Missing Data: dropna(), fillna(), and Interpolation

● dropna(): Removes rows or columns with missing values
● fillna(): Fills missing values with specified values or methods
● Interpolation: Estimates missing values based on surrounding data
Python

import pandas as pd
import numpy as np

# Sample data with missing values (illustrative)
df = pd.DataFrame({'Age': [25, np.nan, 30], 'Score': [80.0, 90.0, np.nan]})

# Drop rows with missing values
dropped = df.dropna()

# Fill missing values with each column's mean (numeric columns only)
filled = df.fillna(df.mean(numeric_only=True))
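Interpolation, the third option listed, can be sketched on a small Series with gaps:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Linear interpolation estimates each missing value from its neighbours
filled = s.interpolate(method='linear')
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```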

Lecture 13: Data Transformation and Manipulation: Sorting, Filtering, and Grouping

● Sorting: sort_values()
● Filtering: Boolean indexing
● Grouping: groupby()

Python

import pandas as pd

# Sample data (illustrative)
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 35, 40],
                   'Gender': ['F', 'M', 'M']})

# Sort by age, oldest first
df.sort_values('Age', ascending=False, inplace=True)

# Filter for age greater than 30
filtered_df = df[df['Age'] > 30]

# Group by gender and calculate mean age
grouped_df = df.groupby('Gender')['Age'].mean()

Lecture 14: Descriptive Statistics with Pandas

● Count, mean, median, mode, standard deviation, min, max, quartiles
● Correlation and covariance
Python

import pandas as pd

# Sample data (illustrative)
df = pd.DataFrame({'Column1': [1, 2, 3, 4], 'Column2': [2, 4, 6, 8]})

# Calculate summary statistics
print(df.describe())

# Calculate correlation between columns
correlation = df['Column1'].corr(df['Column2'])
print(correlation)
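Covariance, mentioned alongside correlation in the list above, follows the same pattern (the columns here are again illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4], 'Column2': [2, 4, 6, 8]})

# Pairwise covariance matrix of the numeric columns
print(df.cov())

# Covariance between two specific columns
cov = df['Column1'].cov(df['Column2'])
print(cov)
```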
