The document provides an overview of data analytics, covering sources and types of data, classification, characteristics, and the importance of analytics in decision-making. It discusses various analytical techniques including regression modeling, multivariate analysis, and machine learning, as well as the data analytics lifecycle and modern tools. Additionally, it highlights the significance of stream data processing and methods for handling high-velocity data streams.
Data Analytics Notes
UNIT-1
Introduction to Data Analytics
1. Sources and Nature of Data
Data originates from various sources and can have different characteristics based on its structure and usage.
Sources of Data:
Internal Sources: Data generated within an organization, such as transaction records, employee information,
and customer databases.
Example: A company's CRM system captures customer interactions, purchases, and preferences.
External Sources: Data from outside the organization, including market research, social media, public records,
and third-party databases.
Example: Social media platforms like Twitter and Facebook provide data on customer sentiment and brand
perception.
Sensor Data: Data generated by devices and sensors, often used in IoT (Internet of Things) applications.
Example: Smart thermostats collect temperature and usage data to optimize energy consumption.
Nature of Data:
Qualitative Data: Non-numeric, descriptive data often in the form of text, images, or videos.
Example: Customer feedback, interviews, and social media posts.
Quantitative Data: Numeric data that can be measured and analyzed statistically.
Example: Sales figures, click-through rates, and number of products sold.
2. Classification of Data
Data can be categorized based on its structure and form:
Structured Data:
Data that is highly organized and easily searchable within databases.
Example: Excel spreadsheets or SQL databases containing customer information (e.g., name, age, purchase
history).
Semi-structured Data: Data that does not have a rigid structure but still contains some organizational
properties, making it easier to analyze.
Example: JSON, XML files, and email metadata (e.g., subject, sender, timestamp).
Unstructured Data: Data that lacks a predefined structure, making it more complex to analyze.
Example: Text documents, images, videos, social media posts.
3. Characteristics of Data
Understanding the characteristics of data is essential for effective data analysis.
Volume: The amount of data available, often large in big data contexts.
Example: Social media platforms generate terabytes of data daily.
Velocity: The speed at which new data is generated and needs to be processed.
Example: Financial trading systems generate and process millions of transactions per second.
Variety: The diversity of data types and formats.
Example: A company might deal with text data from emails, numerical data from sales, and multimedia data
from marketing videos.
Veracity: The trustworthiness and accuracy of the data.
Example: Data from a well-maintained database is more reliable than user-generated content on social media.
4. Introduction to Big Data Platform
Big Data platforms are designed to handle the vast amounts of data generated in today’s digital world.
Definition:
Big Data platforms provide tools and technologies for storing, processing, and analyzing large
datasets that traditional databases cannot handle.
Components:
Storage: Distributed storage systems like Hadoop’s HDFS.
Processing: Tools like Apache Spark and Hadoop MapReduce.
Visualization: Dashboards and tools like Tableau and Power BI for visualizing data.
Example: Netflix uses a Big Data platform to analyze viewing habits and recommend shows to its users.
5. Need for Data Analytics
Data analytics is crucial for extracting actionable insights from raw data.
Decision-Making: Helps businesses make informed decisions based on data-driven insights.
Example: Retailers use data analytics to optimize inventory management based on sales trends.
Competitive Advantage: Companies that leverage data analytics can gain a competitive edge by understanding
market trends and customer behavior.
Example: Amazon uses data analytics to recommend products and predict customer needs, driving sales.
6. Evolution of Analytic Scalability
As data grows, so does the need for scalable analytic solutions.
Early Stages: Initially, analytics were limited to small datasets and simple statistical tools.
Example: Early business intelligence tools focused on historical reporting from small datasets.
Modern Scalability: Today, analytics platforms can scale to handle massive datasets in real-time, using
distributed computing and cloud technologies.
Example: Google's BigQuery allows for the processing of petabytes of data with SQL queries.
7. Analytic Process and Tools
The analytic process involves several steps, supported by various tools.
Steps:
Data Collection: Gathering relevant data from various sources.
Data Cleaning: Removing or correcting inaccuracies in the data.
Data Analysis: Using statistical and machine learning models to extract insights.
Data Visualization: Presenting data in an understandable and actionable format.
Tools:
Data Collection: Apache Kafka, Google Dataflow.
Data Cleaning: OpenRefine, Python libraries like Pandas.
Data Analysis: R, Python (scikit-learn), SAS.
Data Visualization: Tableau, Power BI, Matplotlib in Python.
Example: A marketing team might use tools like Google Analytics to collect web traffic data, clean it with
Python, analyze it using R, and visualize it in Tableau.
8. Analysis vs Reporting
Analysis: A deep dive into the data to extract actionable insights, often involving statistical methods and
predictive modeling.
Example: Predicting customer churn based on usage patterns.
Reporting: Presenting data in a structured format, often for monitoring and tracking purposes.
Example: Monthly sales reports showing total revenue, profit margins, and other key metrics.
9. Modern Data Analytic Tools
Modern tools have transformed how data is analyzed, making it more accessible and powerful.
Self-Service Tools: Allow non-technical users to perform data analysis without needing deep technical skills.
Example: Tableau and Power BI enable users to create interactive dashboards with drag-and-drop interfaces.
Machine Learning Platforms: Tools that allow for the implementation of complex predictive models.
Example: Google's TensorFlow and Microsoft's Azure Machine Learning Studio.
Cloud-Based Tools: Provide scalable and flexible data analytics solutions.
Example: Amazon Web Services (AWS) offers a suite of tools for big data analytics, including Redshift, S3, and
EMR.
10. Applications of Data Analytics
Data analytics is used across various industries to improve processes, enhance customer experience, and drive
growth.
Healthcare: Analyzing patient data to predict disease outbreaks and personalize treatment.
Example: IBM Watson Health uses data analytics to help doctors make more informed decisions.
Finance: Detecting fraudulent transactions and assessing credit risk.
Example: Banks use data analytics to monitor transactions in real-time for signs of fraud.
Retail: Optimizing inventory, personalizing marketing campaigns, and predicting trends.
Example: Walmart uses data analytics to forecast demand and manage inventory efficiently.
Data Analytics Lifecycle
1. Need for Data Analytics Lifecycle
The lifecycle ensures that data analytics projects are carried out systematically, leading to reliable and
actionable results.
Purpose: Provides a structured approach to solving data-related problems, ensuring consistency and
effectiveness.
Example: A company launching a new product would use the data analytics lifecycle to analyze market trends,
customer feedback, and sales data.
2. Key Roles for Successful Analytic Projects
Data Analyst: Interprets data and provides insights.
Data Engineer: Prepares the data infrastructure, ensuring that data is accessible and clean.
Data Scientist: Develops models and algorithms to analyze data.
Project Manager: Coordinates the efforts of the team to ensure project success.
Example: In a data-driven marketing campaign, the data engineer sets up data pipelines, the analyst interprets
customer data, and the scientist models customer behavior.
3. Phases of Data Analytics Lifecycle
Discovery: Understanding the business problem and determining the data required.
Activities: Identify data sources, define project objectives.
Example: A retail company wanting to predict holiday sales would start by identifying relevant historical sales
data.
Data Preparation: Cleaning and transforming the data into a usable format.
Activities: Data cleaning, normalization, feature selection.
Example: Removing duplicates and handling missing values in a customer database.
Model Planning: Selecting appropriate modeling techniques and tools.
Activities: Determine the modeling approach (e.g., regression, classification), split data into training and
testing sets.
Example: Choosing a linear regression model to predict sales based on past data.
Model Building: Developing the model using the prepared data.
Activities: Training the model, fine-tuning parameters, validating with test data.
Example: Building a predictive model that estimates future sales based on previous trends.
Communicating Results: Presenting findings to stakeholders in a clear and actionable manner.
Activities: Creating visualizations, writing reports, giving presentations.
Example: A dashboard that shows projected sales and inventory needs for the next quarter.
Operationalization: Deploying the model into production and monitoring its performance.
Activities: Implementing the model in a live environment, setting up monitoring systems, updating the model as
needed.
Example: Integrating the sales prediction model into the company’s ERP system to automate inventory orders.
UNIT-2
Data Analysis
1. Regression Modeling
Definition: Regression modeling is a statistical technique used to identify the relationship between a dependent
variable and one or more independent variables. It helps in predicting the value of the dependent variable based
on the known values of the independent variables.
Types:
Linear Regression: Models a linear relationship between the dependent and independent variables.
Multiple Regression: Involves two or more independent variables
Logistic Regression: Used when the dependent variable is categorical (e.g. binary outcomes)
Example: Predicting house prices based on features like size, location, and number of rooms.
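As a minimal illustration of linear regression, the best-fit slope and intercept can be computed directly from the least-squares formulas; the size and price numbers below are invented for the example, not data from the notes:

```python
# Ordinary least squares for one predictor: find the slope and intercept
# that minimize the sum of squared errors.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical data: house size (100s of sq ft) vs price (lakhs)
sizes = [10, 15, 20, 25, 30]
prices = [50, 65, 80, 95, 110]
slope, intercept = fit_line(sizes, prices)
print(slope, intercept)  # 3.0 20.0
```

With these numbers the fit is exact, so predicting a new price is just `slope * size + intercept`.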
2. Multivariate Analysis
Definition: Multivariate analysis involves examining more than two variables simultaneously to understand the
relationships and patterns among them. It is used when data has multiple dimensions.
Techniques:
Principal Component Analysis (PCA): Reduces the dimensionality of data while preserving most of the
variability.
Factor Analysis: Identifies underlying factors that explain the patterns of correlations within a set of observed
variables.
Cluster Analysis: Groups similar data points together based on their characteristics.
Example: Market segmentation based on customer demographics, purchasing behavior, and preferences.
3. Bayesian Modeling
Definition: Bayesian modeling is a statistical approach that applies Bayes’ theorem to update the probability
estimate for a hypothesis as new evidence or information becomes available.
Key Concepts:
Prior Probability: Initial belief about the probability of an event.
Likelihood: Probability of the observed data given the hypothesis.
Posterior Probability: Updated probability after considering the new evidence.
Example: Medical diagnosis where the probability of a disease is updated as new test results are obtained.
4. Inference and Bayesian Networks
Definition: Bayesian networks are graphical models that represent the probabilistic relationships among a set of
variables. They are used for reasoning under uncertainty and for inference.
Components:
Nodes: Represent variables.
Edges: Represent dependencies between variables.
Conditional Probability Tables (CPTs): Define the probability of a variable given its parents.
Example: Predicting weather conditions where variables like temperature, humidity, and wind are
interdependent.
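The Bayesian update behind these models fits in a few lines. The prevalence, sensitivity, and false positive rate below are illustrative numbers, not figures from the notes:

```python
def posterior(prior, sensitivity, false_pos_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # evidence: total probability of observing a positive test
    evidence = sensitivity * prior + false_pos_rate * (1 - prior)
    return sensitivity * prior / evidence

# 1% prevalence, 95% sensitivity, 5% false positive rate
p = posterior(0.01, 0.95, 0.05)
print(round(p, 3))  # 0.161
```

Even with an accurate test, a low prior keeps the posterior modest, which is exactly the kind of reasoning a Bayesian network chains across many variables.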
5. Support Vector and Kernel Methods
Definition: Support Vector Machines (SVM) are supervised learning models used for classification and regression
tasks. Kernel methods are techniques that enable SVMs to handle non-linear relationships by mapping data to a
high-dimensional space.
Key Concepts:
Support Vectors: Data points that define the decision boundary.
Kernel Functions: Transform the data into a higher-dimensional space to make it linearly separable.
Example: Image recognition, where SVMs classify images into different categories (e.g., cats vs. dogs).
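A sketch of one common kernel function, the Gaussian (RBF) kernel, which measures similarity as if the points had been mapped to a high-dimensional space; the gamma value is an arbitrary illustrative choice:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: similarity corresponding to an
    implicit high-dimensional feature mapping."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel((0, 0), (0, 0)))          # 1.0 (identical points)
print(rbf_kernel((0, 0), (3, 4)) < 0.001)  # True (distant points)
```

An SVM never computes the mapping itself; it only evaluates this kernel between pairs of points (the "kernel trick").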
6. Analysis of Time Series
Definition: Time series analysis involves studying datasets where observations are collected over time at regular
intervals. It is used to identify trends, seasonality, and cyclic patterns.
Types:
Linear Systems Analysis: Models the relationship between time series data and other variables using linear
equations.
Nonlinear Dynamics: Captures complex, non-linear relationships in time series data.
Example: Stock market prediction where past prices are used to forecast future movements.
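A trailing moving average is one of the simplest time series smoothers and a useful first step before fitting more complex models; the price values below are invented for illustration:

```python
def moving_average(series, window):
    """Average of the most recent `window` observations at each step."""
    return [sum(series[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(series))]

prices = [10, 12, 11, 13, 15, 14]
print(moving_average(prices, 3))  # [11.0, 12.0, 13.0, 14.0]
```

The smoothed series reveals the upward trend while damping the short-term fluctuations.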
7. Rule Induction
Definition: Rule induction is a data mining technique used to extract useful if-then rules from data. It is often
used for classification tasks.
Techniques:
Decision Trees: Splits data into branches based on feature values, leading to a decision rule.
Association Rule Mining: Identifies relationships between variables in large datasets.
Example: Identifying purchasing patterns in retail, such as if a customer buys bread, they are likely to buy
butter.
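The bread-and-butter rule above can be quantified with the two standard association-rule measures, support and confidence; the toy transactions are invented for the example:

```python
transactions = [
    {"bread", "butter"}, {"bread", "milk"},
    {"bread", "butter", "jam"}, {"milk"}, {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the share that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(round(confidence({"bread"}, {"butter"}), 2))  # 0.75
```

So the rule "if bread then butter" holds in 75% of the bread-containing baskets here.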
8. Neural Networks: Learning and Generalization
Definition: Neural networks are computational models inspired by the human brain that are used to recognize
patterns and solve complex problems. They learn from data and generalize to new data.
Components:
Neurons: Basic units that receive input, process it, and produce output.
Layers: Arrangements of neurons, including input, hidden, and output layers.
Activation Functions: Determine the output of a neuron given an input.
Example: Handwriting recognition where neural networks learn to identify letters from handwritten samples.
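A single neuron's forward pass, as described under Components, is just a weighted sum pushed through an activation function; the weights here are arbitrary illustrative values:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# with zero weights and bias the sigmoid sits at its midpoint
print(neuron([1.0, 0.5], [0.0, 0.0], 0.0))  # 0.5
```

Training a network means adjusting these weights and biases so that the outputs across all layers match the labeled examples.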
9. Competitive Learning
Definition: Competitive learning is a type of unsupervised learning where neurons in a network compete with
each other to be activated. Only the winning neuron (the one closest to the input) gets to learn.
Applications:
Kohonen’s Self-Organizing Maps (SOM): Map high-dimensional input data to a lower-dimensional space.
Example: Clustering similar images in a large dataset.
10. Principal Component Analysis (PCA) and Neural Networks
Definition: PCA is a dimensionality reduction technique that transforms data into a set of orthogonal
components that explain the most variance. It is often used in conjunction with neural networks to reduce input
features.
Example: Reducing the number of input features in a facial recognition system while retaining important
information.
11. Fuzzy Logic: Extracting Fuzzy Models from Data
Definition: Fuzzy logic is an approach that allows for reasoning with uncertainty and imprecision. Fuzzy models
are extracted from data to handle vague or imprecise information.
Components:
Fuzzy Sets: Define membership functions that describe how much an element belongs to a set.
Fuzzy Rules: If-then rules that govern the behavior of the system.
Example: Controlling the temperature of an air conditioning system where the input (e.g., temperature) is not
precise.
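A triangular membership function is a common way to encode a fuzzy set such as "warm"; the breakpoints below (20, 25, 30 degrees) are illustrative choices, not values from the notes:

```python
def triangular(x, a, b, c):
    """Membership rises linearly from a to the peak b, then falls back to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# degree to which a temperature belongs to the fuzzy set "warm" (20-30 range)
print(triangular(25.0, 20, 25, 30))  # 1.0
print(triangular(22.5, 20, 25, 30))  # 0.5
```

A fuzzy controller combines such membership degrees through its if-then rules instead of making a hard warm/not-warm cut.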
12. Fuzzy Decision Trees
Definition: Fuzzy decision trees combine fuzzy logic with traditional decision trees, allowing for decisions in the
presence of uncertainty.
Application: Decision-making in environments where inputs are imprecise or uncertain.
Example: Risk assessment in financial portfolios where future returns are uncertain.
13. Stochastic Search Methods
Definition: Stochastic search methods are optimization techniques that use randomness to find solutions to
complex problems. These methods are useful when the search space is large and complex.
Techniques:
Simulated Annealing: A probabilistic technique that explores the search space and gradually refines the
solution.
Genetic Algorithms: Mimic natural selection by evolving a population of solutions over time.
Example: Optimizing the layout of components on a microchip to minimize interference and power consumption.
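A minimal sketch of simulated annealing on a toy one-dimensional objective; the temperature schedule, step size, and seed are arbitrary illustrative choices:

```python
import math
import random

def anneal(f, x0, temp=10.0, cooling=0.95, steps=1000, seed=42):
    """Minimize f by random moves, accepting worse moves with a
    probability that shrinks as the temperature cools."""
    rng = random.Random(seed)
    x = best = x0
    for _ in range(steps):
        candidate = x + rng.uniform(-1.0, 1.0)
        delta = f(candidate) - f(x)
        # always accept improvements; accept worse moves with prob exp(-delta/temp)
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
        if f(x) < f(best):
            best = x
        temp *= cooling
    return best

# toy objective with its minimum at x = 3
best = anneal(lambda x: (x - 3.0) ** 2, x0=0.0)
print(abs(best - 3.0) < 0.5)  # True
```

Early on, the high temperature lets the search escape local minima; as it cools, the process behaves more and more like greedy hill-climbing.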
UNIT-3
Mining Data Streams
1. Introduction to Streams Concepts
Definition: Data streams are continuous, rapid, and time-varying sequences of data elements generated by
various sources. Unlike static datasets, streams are unbounded and require real-time processing.
Characteristics:
Continuous Flow: Data arrives incessantly.
Unbounded Size: Potentially infinite data volume.
Time-sensitive: Data relevance may decrease over time.
Example: Sensor data from IoT devices, clickstreams from web users, or social media feeds.
2. Stream Data Model and Architecture
Stream Data Model: Represents data as sequences of tuples that arrive over time. Queries over streams need to
handle continuous data arrival.
Architecture Components:
Data Sources: Origin of the streams (e.g., sensors, logs).
Stream Ingestion: Systems that capture and pre-process streams (e.g., Apache Kafka).
Stream Processing Engine: Processes data in real-time (e.g., Apache Flink, Spark Streaming).
Storage: Temporary or permanent storage for processed data.
Output: Dashboards, alerts, or other applications consuming processed data.
Example: A real-time monitoring system for manufacturing processes where sensors feed data into a processing
engine that detects anomalies.
3. Stream Computing
Definition: Stream computing involves processing data streams in real-time to extract insights, detect patterns,
or trigger actions.
Key Challenges:
Latency: Ensuring minimal delay in processing.
Throughput: Handling high data arrival rates.
Scalability: Adapting to varying data volumes.
Technologies: Apache Storm, Apache Samza, Google Cloud Dataflow.
Example: Fraud detection in credit card transactions where each transaction is analyzed in real-time to identify
potential fraud.
4. Sampling Data in a Stream
Purpose: Since it's impractical to store or process all data in high-velocity streams, sampling techniques are used
to select representative subsets for analysis.
Techniques:
Reservoir Sampling: Maintains a fixed-size sample of the stream where each incoming element has an equal
probability of being included.
Bernoulli Sampling: Each element is included in the sample with a fixed probability.
Example: Estimating the average temperature from a stream of sensor readings by sampling a subset of the
data.
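Reservoir sampling as described above fits in a few lines; this sketch fixes the random seed only to make the run repeatable:

```python
import random

def reservoir_sample(stream, k, seed=7):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # keep item with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

readings = range(1000)                # stand-in for a sensor stream
sample = reservoir_sample(readings, 10)
print(len(sample))  # 10
```

The key property is that after any number of arrivals, every element seen so far had the same chance of being in the reservoir, without storing the stream.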
5. Filtering Streams
Definition: Filtering involves selecting data elements from a stream that meet certain criteria, effectively
reducing the volume of data to be processed.
Methods:
Content-based Filtering: Based on the content of the data (e.g., keywords in a tweet).
Time-based Filtering: Based on timestamps (e.g., events within the last hour).
Example: Monitoring social media for mentions of a brand by filtering tweets containing the brand's name.
6. Counting Distinct Elements in a Stream
Challenge: Determining the number of unique elements (e.g., unique visitors) in a data stream using limited
memory.
Algorithms:
HyperLogLog: Probabilistic algorithm that provides an approximate count of distinct elements with fixed
memory usage.
Bloom Filters: Space-efficient data structures that test whether an element is a member of a set.
Example: Estimating the number of unique IP addresses visiting a website in real-time.
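A small Bloom filter sketch using a salted SHA-256 hash; the bit-array size and hash count are illustrative, whereas real deployments size them from the expected element count and target false positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # derive several hash positions by salting one cryptographic hash
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False is definitive; True can be a false positive
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
seen.add("203.0.113.7")
print(seen.might_contain("203.0.113.7"))  # True
```

Membership tests cost a handful of hash lookups regardless of how many IPs have been added, which is what makes the structure stream-friendly.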
7. Estimating Moments
Definition: Moments are statistical measures (like mean, variance) that provide insights into the distribution of
data in a stream.
Algorithms:
Alon-Matias-Szegedy (AMS) Algorithm: Estimates higher-order moments (like second or third moments) in data
streams.
Example: Computing the variance of packet sizes in network traffic to detect anomalies.
8. Counting Ones in a Window
Definition: Counting the number of occurrences of a particular event or element within a specified time window
in the stream.
Sliding Windows: Time-based or count-based windows that move forward as new data arrives.
Techniques:
Damped Window Model: Recent data is given more weight than older data.
Example: Counting the number of times a user clicks a specific button in the last 10 minutes.
9. Decaying Window
Definition: A model where the importance of data decreases over time, ensuring that recent data has more
influence on the analysis.
Implementation:
Exponential Decay Functions: Assign weights to data points that decrease exponentially over time.
Example: In a recommendation system, giving more importance to a user's recent interactions than older ones.
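An exponentially decaying count can be maintained in constant memory, since each update only rescales the running total; the decay rate below is an arbitrary illustrative value:

```python
class DecayingCounter:
    """Running count in which each update first decays the old total,
    so recent observations carry more weight."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.value = 0.0

    def observe(self, weight=1.0):
        # old contributions shrink geometrically with every new arrival
        self.value = self.value * (1.0 - self.decay) + weight
        return self.value

clicks = DecayingCounter(decay=0.5)
print(clicks.observe())  # 1.0
print(clicks.observe())  # 1.5
print(clicks.observe())  # 1.75
```

The total converges toward a bound (here 2.0) instead of growing without limit, so old activity never dominates recent behavior.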
10. Real-time Analytics Platform (RTAP) Applications
Definition: RTAPs process and analyze data as it arrives, enabling immediate insights and actions.
Applications:
Real-time Monitoring: Tracking system health or user activity.
Dynamic Pricing: Adjusting prices based on current demand and supply.
Real-time Personalization: Updating user experiences based on recent behavior.
Example: Uber's surge pricing model adjusts fares in real-time based on demand.
11. Case Study – Real-time Sentiment Analysis
Objective: Analyzing public sentiment about a topic, brand, or event as it unfolds.
Process:
Data Ingestion: Collecting data from sources like Twitter, Facebook.
Pre-processing: Cleaning text data, removing noise.
Sentiment Analysis: Using Natural Language Processing (NLP) techniques to classify sentiments as positive,
negative, or neutral.
Visualization: Displaying sentiment trends on dashboards.
Example: Monitoring public reaction to a live event or product launch to gauge success.
12. Case Study – Stock Market Predictions
Objective: Predicting stock price movements in real-time to inform trading decisions.
Process:
Data Collection: Streaming data from stock exchanges, financial news, and social media.
Feature Extraction: Identifying relevant indicators like trading volume, market sentiment.
Modeling: Using machine learning models (e.g., LSTM networks) to predict future prices.
Execution: Automated trading based on predictions.
Example: High-frequency trading firms leveraging microsecond-level data to make trading decisions.
UNIT-4
Frequent Itemsets and Clustering
1. Mining Frequent Itemsets
Definition: Frequent itemsets are groups of items that often appear together in a dataset. Mining frequent
itemsets is a fundamental task in association rule mining, where the goal is to discover associations between
different items.
Applications:
Market Basket Analysis: Identifying products frequently bought together.
Fraud Detection: Detecting patterns in fraudulent transactions.
Example: In a supermarket, discovering that customers who buy bread also frequently buy butter.
2. Market Basket Modelling
Definition: Market basket modeling analyzes customer purchase behavior by identifying sets of products that
are frequently bought together. This helps in understanding consumer behavior and optimizing product
placement.
Process:
Transaction Data Collection: Gathering data on customer transactions.
Frequent Itemset Mining: Identifying common item combinations.
Rule Generation: Creating association rules from frequent itemsets.
Example: Offering discounts on butter when customers buy bread and jam together to increase overall sales.
3. Apriori Algorithm
Definition: The Apriori algorithm is a classic algorithm for mining frequent itemsets and generating association
rules. It operates on the principle that all non-empty subsets of a frequent itemset must also be frequent.
Steps:
Generate Candidate Itemsets: Identify all possible item combinations.
Prune Infrequent Itemsets: Remove itemsets that do not meet the minimum support threshold.
Iterate: Repeat the process by increasing the size of itemsets until no more frequent itemsets are found.
Example: Using Apriori to discover that in a retail store, customers who buy diapers and baby wipes are also
likely to buy baby formula.
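The generate-prune-iterate loop above can be sketched compactly over toy transactions (invented data; production implementations add smarter candidate generation and pruning):

```python
def apriori(transactions, min_support):
    """Return all itemsets whose support meets min_support."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # frequent 1-itemsets
    current = {frozenset([item]) for t in transactions for item in t}
    current = {s for s in current if sup(s) >= min_support}
    frequent = []
    while current:
        frequent.extend(current)
        # join step: merge pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # prune step: keep only candidates meeting the support threshold
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent

baskets = [{"bread", "butter"}, {"bread", "milk"},
           {"bread", "butter", "jam"}, {"bread", "butter"}]
result = apriori(baskets, min_support=0.5)
print(frozenset({"bread", "butter"}) in result)  # True
```

Because infrequent singletons like milk are discarded in the first pass, no larger itemset containing them is ever generated, which is the Apriori principle at work.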
4. Handling Large Datasets in Main Memory
Challenge: Processing large datasets that cannot fit into the main memory requires efficient algorithms and
data structures.
Techniques:
Partitioning: Dividing the dataset into smaller chunks that can be processed individually.
Data Compression: Reducing the size of the dataset using techniques like sampling or aggregation.
In-Memory Databases: Using databases that are optimized for memory storage to improve performance.
Example: Processing transaction data from a large e-commerce site to find frequent itemsets without exceeding
memory limits.
5. Limited Pass Algorithms
Definition: Limited pass algorithms are designed to minimize the number of passes over the dataset, making
them suitable for large datasets or data streams.
Example:
PCY Algorithm: An extension of the Apriori algorithm that uses a hash table to reduce memory usage and
requires only two passes over the data.
Multistage Algorithm: Reduces the number of candidate itemsets in each pass, thereby minimizing the overall
number of passes required.
Example: Using the PCY algorithm to find frequent itemsets in large-scale transaction data by reducing the
number of candidate pairs using hashing.
6. Counting Frequent Itemsets in a Stream
Challenge: In data streams, data arrives continuously, so counting frequent itemsets requires algorithms that
can handle real-time data and limited memory.
Algorithms:
Lossy Counting: Maintains a small summary of the stream and provides approximate counts with a guaranteed
error bound.
Frequent Pattern Mining: Extends frequent itemset mining to streaming data by continuously updating the
counts of itemsets.
Example: Identifying trending topics on social media by counting frequent word combinations in real-time.
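A sketch of the Lossy Counting idea: counts are pruned at bucket boundaries, so memory stays bounded while any undercount is limited by epsilon times the stream length (toy stream below, invented for illustration):

```python
from math import ceil

def lossy_count(stream, epsilon):
    """Approximate frequency counts; a true count is underestimated
    by at most epsilon * stream_length."""
    width = ceil(1 / epsilon)          # bucket width
    counts, deltas = {}, {}
    for n, item in enumerate(stream, start=1):
        bucket = ceil(n / width)
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1  # max count possibly missed so far
        if n % width == 0:
            # prune entries that cannot be frequent at this point
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
    return counts

stream = ["hot"] * 30 + [f"rare{i}" for i in range(30)] + ["hot"] * 40
print(lossy_count(stream, epsilon=0.1))  # {'hot': 70}
```

The thirty one-off items are all pruned at bucket boundaries, while the genuinely frequent item survives with its exact count, which is the guarantee the algorithm trades memory for.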
7. Clustering Techniques
Definition: Clustering is an unsupervised learning technique that groups data points into clusters based on their
similarity. Different clustering techniques are suited for different types of data.
Hierarchical Clustering
Definition: Hierarchical clustering creates a tree-like structure (dendrogram) of clusters by iteratively merging
or splitting clusters.
Types:
Agglomerative (Bottom-Up): Starts with individual data points and merges them into larger clusters.
Divisive (Top-Down): Starts with the entire dataset and splits it into smaller clusters.
Example: Grouping customers into hierarchical segments based on their purchasing behavior.
K-Means Clustering
Definition: K-means is a popular clustering algorithm that partitions data into k clusters by minimizing the
within-cluster variance.
Steps:
Initialize: Select k initial cluster centroids.
Assign: Assign each data point to the nearest centroid.
Update: Recalculate the centroids based on the assigned data points.
Iterate: Repeat the process until the centroids no longer change.
Example: Segmenting customers into different groups based on their purchasing patterns for targeted
marketing.
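The assign/update loop of k-means can be sketched directly; for simplicity this version seeds the centroids with the first k points and uses tiny invented 2-D data:

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(p) for p in points[:k]]  # naive deterministic init
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared distance)
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c
                     else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (1, 2), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Real implementations use better initialization (e.g., k-means++) and a convergence test instead of a fixed iteration count.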
Clustering High Dimensional Data
Challenge: High-dimensional data (e.g., text data, gene expression data) can be sparse and challenging to cluster
using traditional methods.
Techniques:
Dimensionality Reduction: Techniques like PCA or t-SNE are used to reduce the number of dimensions before
clustering.
Subspace Clustering: Identifies clusters in different subspaces of the high-dimensional data.
Example: Clustering gene expression data to identify groups of genes with similar expression patterns.
8, CLIQUE and ProCLUS
CLIQUE (Clustering in Quest)
Definition: CLIQUE is a subspace clustering algorithm that identifies dense regions in different subspaces of
high-dimensional data.
Process:
Partitioning: The data space is partitioned into non-overlapping rectangular units.
Density Calculation: Units with a high density of points are identified as clusters.
Subspace Identification: Clusters are identified in different subspaces.
Example: Clustering high-dimensional data in market research to identify segments of customers based on
specific combinations of features.
ProCLUS (Projected Clustering)
Definition: ProCLUS is a projected clustering algorithm that identifies clusters in subspaces by selecting a
subset of dimensions for each cluster.
Process:
Initialization: Select initial medoids (central points).
Dimension Selection: Identify relevant dimensions for each cluster.
Cluster Formation: Assign data points to the nearest medoid in the selected dimensions.
Example: Clustering customer data in e-commerce by identifying clusters in subsets of attributes like purchase
history and browsing patterns.
9. Frequent Pattern-Based Clustering Methods
Definition: Frequent pattern-based clustering identifies clusters based on frequent patterns or itemsets within
the data. This method is particularly useful for categorical data.
Techniques:
FP-Tree: A data structure that compresses the data and allows for efficient mining of frequent patterns.
FPC (Frequent Pattern Clustering): Uses frequent patterns to form clusters of similar data points.
Example: Clustering transaction data in retail to identify groups of customers with similar purchasing habits
based on frequent itemsets.
10. Clustering in Non-Euclidean Spaces
Definition: Non-Euclidean spaces involve distance measures other than the traditional Euclidean distance,
making clustering more suitable for complex data types like graphs, sequences, or categorical data.
Techniques:
Graph-Based Clustering: Uses graph structures to represent data and clusters based on connectivity or other
graph properties.
Edit Distance-Based Clustering: Measures the similarity between sequences (e.g., DNA, text) by calculating the
minimum number of edits required to transform one sequence into another.
Example: Clustering biological sequences (like DNA or protein sequences) using edit distance to identify similar
genetic patterns.
11. Clustering for Streams and Parallelism
Challenge: Clustering data streams requires algorithms that can process data in real-time with limited memory
and computational resources.
Techniques:
Micro-clusters: Summarize the data stream into small clusters that are updated continuously.
StreamKM++: An extension of the k-means algorithm for streaming data.
Parallelism: Distributes the clustering task across multiple processors or machines to handle large-scale data
streams.
Example: Real-time clustering of online customer behavior on an e-commerce site to provide personalized
recommendations.
UNIT-5
Frame Works and Visualization
1. MapReduce
Definition: MapReduce is a programming model used for processing large data sets in a parallel and distributed
manner. It splits tasks into two phases: the Map phase (processes input data and produces key-value pairs) and
the Reduce phase (aggregates the results).
Applications:
Word Count: Counting occurrences of each word in a large document.
Log Analysis: Analyzing server logs to extract useful information.
Example: Using MapReduce to count the frequency of each search term in search engine logs.
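The word-count flow above can be simulated on a single machine in Python. In a real Hadoop job the map and reduce functions run on many machines; here the shuffle step is stood in for by a dictionary that groups values by key:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    # Group all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for d in docs for pair in map_phase(d)]
print(reduce_phase(shuffle(pairs)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```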
Hadoop
Definition: Hadoop is an open-source framework that allows for the distributed storage and processing of large
data sets using the MapReduce programming model. It is designed to scale up from a single server to thousands
of machines.
Components:
HDFS (Hadoop Distributed File System): A distributed file system that stores data across multiple machines.
YARN (Yet Another Resource Negotiator): Manages and schedules resources for running applications.
Example: Analyzing large datasets of social media posts to detect trends and patterns.
Pig
Definition: Pig is a high-level platform for creating MapReduce programs used with Hadoop. It provides a
scripting language called Pig Latin, which simplifies the process of writing complex data transformations.
Features:
Ease of Use: Simplifies data processing tasks.
Extensibility: Supports user-defined functions.
Example: Processing large datasets to extract useful information, such as summarizing clickstream data for
web analytics.
Hive
Definition: Hive is a data warehouse infrastructure built on top of Hadoop. It allows users to query and manage
large datasets using a SQL-like language called HiveQL.
Features:
SQL Compatibility: Supports most SQL queries.
Scalability: Handles large datasets efficiently.
Example: Running SQL-like queries on a massive dataset stored in HDFS to perform data analysis and generate
reports.
HBase
Definition: HBase is a distributed, scalable, NoSQL database built on top of Hadoop and modeled after Google's
Bigtable. It is designed for storing and managing large amounts of sparse data.
Features:
Real-Time Read/Write: Supports real-time data access.
Scalability: Handles large amounts of data across many servers.
Example: Storing and retrieving time-series data from IoT sensors in a distributed environment.
MapR
Definition: MapR is a data platform that supports Hadoop and other big data technologies. It provides an
enterprise-grade distribution with added features like real-time analytics, high availability, and multi-tenancy.
Features:
Unified Data Platform: Integrates various big data technologies.
Real-Time Capabilities: Supports real-time data processing and analytics.
Example: Using MapR to handle real-time streaming data for financial trading applications.
Sharding
Definition: Sharding is a database architecture pattern in which a large dataset is divided into smaller, more
manageable pieces, called shards, which are distributed across multiple servers.
Advantages:
Scalability: Allows the database to handle larger datasets and more transactions by distributing the load.
Performance: Improves query performance by reducing the amount of data each server needs to manage.
Example: A social media platform shards its user data based on geographic regions to improve performance and
scalability.
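A common sharding scheme routes each record by hashing its key, sketched below in Python. The user names and the shard count of 4 are made up; a geographic scheme like the example above would look up a region code instead of hashing:

```python
import hashlib

def shard_for(key, num_shards):
    # Use a stable hash (unlike Python's built-in hash(), which is salted
    # per process), so the same key always routes to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

shards = {i: [] for i in range(4)}
for user in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user, 4)].append(user)
print(shards)  # every user lands on exactly one of the four shards
```

One design caveat worth noting: with plain modulo hashing, changing `num_shards` remaps most keys, which is why production systems often use consistent hashing instead.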
NoSQL Databases
Definition: NoSQL databases are designed for distributed data stores that need to handle large volumes of
unstructured, semi-structured, or structured data. They are schema-less and provide horizontal scalability.
Types:
Document Stores: (e.g., MongoDB) Store data as JSON-like documents.
Column Stores: (e.g., Cassandra) Store data in columns rather than rows.
Key-Value Stores: (e.g., Redis) Store data as key-value pairs.
Graph Databases: (e.g., Neo4j) Store data as graphs.
Example: Using MongoDB to store and retrieve JSON-like documents for a content management system.
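To illustrate the document model, here is a toy in-memory store with MongoDB-style insert/find semantics (minus persistence, indexing, and distribution); the class name and sample documents are invented:

```python
class DocumentStore:
    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Schema-less: any JSON-like dict is accepted as-is.
        self._docs.append(dict(doc))

    def find(self, query):
        # Return documents whose fields match every key/value in the query.
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]

db = DocumentStore()
db.insert({"title": "Intro to Hadoop", "tags": ["big data"], "views": 120})
db.insert({"title": "R for EDA", "tags": ["statistics"], "views": 80})
print(db.find({"title": "R for EDA"}))
# [{'title': 'R for EDA', 'tags': ['statistics'], 'views': 80}]
```

The key contrast with a relational table: the two documents need not share the same fields, and no schema is declared up front.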
S3 (Simple Storage Service)
Definition: S3 is an object storage service provided by Amazon Web Services (AWS) that offers scalability, data
availability, security, and performance. It is used to store and retrieve any amount of data at any time from
anywhere on the web.
Features:
Scalability: Stores large amounts of data.
Durability: Ensures data is replicated and available.
Accessibility: Accessible via an HTTP-based RESTful API.
Example: Storing and serving media files (like images and videos) for a large-scale web application.
Hadoop Distributed File Systems (HDFS)
Definition: HDFS is the primary storage system used by Hadoop applications. It stores data across multiple
machines in a distributed manner, ensuring fault tolerance and high throughput.
Features:
Replication: Data is replicated across multiple nodes to ensure reliability.
Scalability: Capable of storing petabytes of data.
Example: Storing large datasets such as logs, clickstreams, or sensor data that are processed by Hadoop
MapReduce jobs.
Visualization: Visual Data Analysis Techniques
Definition: Visual data analysis involves using graphical representations of data to identify patterns, trends, and
insights.
Techniques:
Scatter Plots: Shows the relationship between two variables.
Heat Maps: Visualizes data through color variations.
Bar Charts and Histograms: Represent frequency distributions.
Example: Using a scatter plot to visualize the correlation between sales revenue and marketing spend.
Visualization: Interaction Techniques
Definition: Interaction techniques in data visualization allow users to engage with the visual representation of
data, enabling them to explore, filter, and manipulate the data.
Examples:
Brushing: Highlighting specific data points by selecting them.
Zooming and Panning: Navigating through different levels of data granularity.
Linked Views: Connecting multiple visualizations so that interactions in one view affect others.
Example: Interactive dashboards that allow users to filter data by date range, geography, or other criteria to
explore trends and patterns.
Visualization: Systems and Applications
Definition: Visualization systems are tools and platforms that provide capabilities for creating, managing, and
sharing visualizations.
Popular Tools:
Tableau: A powerful data visualization tool used for creating interactive dashboards and visual analytics.
D3.js: A JavaScript library for creating dynamic and interactive data visualizations on the web.
Power BI: A Microsoft tool for creating interactive reports and dashboards.
Example: Using Tableau to create a real-time dashboard that tracks key performance indicators (KPIs) for a
business.
2. Introduction to R
R Graphical User Interfaces (GUIs)
Definition: R provides several graphical user interfaces (GUIs) that make it easier for users to interact with R
without needing to write code.
Examples:
RStudio: A popular IDE for R that provides an intuitive interface for writing code, visualizing data, and managing
projects.
R Commander: A basic GUI that provides point-and-click access to R's statistical functions.
Example: Using RStudio to load data, perform statistical analysis, and visualize results in an interactive
environment.
Data Import and Export in R
Definition: R supports importing data from various formats and exporting results to different formats for
sharing and reporting.
Import Methods:
read.csv(): Import CSV files.
readxl::read_excel(): Import Excel files.
DBI::dbConnect() (with the RSQLite driver): Connect to and import data from SQL databases.
Export Methods:
write.csv(): Export data to a CSV file.
Graphics devices such as png(), closed with dev.off(): Save plots to image files.
Example: Importing sales data from an Excel file, performing analysis, and exporting the results to a CSV file for
further use.
Attribute and Data Types in R
Definition: R has several data types and attributes that define the structure and behavior of data.
Data Types:
Numeric: Real numbers (e.g., 3.14, 42).
Integer: Whole numbers (e.g., 1L, 42L).
Character: Text strings (e.g., "Hello, World!").
Logical: Boolean values (TRUE, FALSE).
Factor: Categorical data with a fixed number of levels.
Attributes:
Names: Names of elements in a vector or columns in a data frame.
Dimensions: Define the shape of arrays or matrices.
Example: Creating a data frame in R where each column has a specific data type, such as integers for age and
factors for gender.
Descriptive Statistics in R
Definition: Descriptive statistics summarize and describe the main features of a dataset.
Functions:
mean(): Calculate the average of a numeric vector.
median(): Find the median value.
summary(): Provides a summary of statistics (mean, median, min, max, quartiles) for each column in a data
frame.
sd(): Compute the standard deviation.
Example: Using R to calculate the mean, median, and standard deviation of a dataset containing sales figures.
Exploratory Data Analysis (EDA) in R
Definition: EDA involves using statistical graphics and other techniques to explore and understand the data
before formal modeling.
Techniques:
Histograms: Visualize the distribution of a variable.
Boxplots: Identify outliers and understand the spread of data.
Scatter Plots: Explore relationships between two numeric variables.
Correlation Matrix: Assess the correlation between multiple variables.
Example: Performing EDA on a dataset of housing prices to understand the distribution of prices and identify
any correlations with features like square footage or location.
Visualization Before Analysis in R
Definition: Visualizing data before formal analysis helps in identifying patterns, trends,
and potential issues like
outliers or missing data.
Common Visualizations:
Bar Charts: Compare categorical data.
Line Charts: Show trends over time.
Heat Maps: Visualize the intensity of values across a matrix.
Example: Creating a heat map in R to visualize the correlation between different financial metrics before
building a predictive model.
Analytics for Unstructured Data in R
Definition: Unstructured data, such as text, images, and videos, lacks a predefined format or structure, making
it challenging to analyze using traditional methods.
Techniques:
Text Mining: Analyzing and extracting meaningful information from text data using packages like tm and
text2vec.
Sentiment Analysis: Assessing the sentiment of textual content, such as customer reviews, using libraries like
syuzhet.
Image Analysis: Processing and analyzing images using the magick and imager packages.
Example: Using R to perform sentiment analysis on customer reviews of a product to understand overall
customer satisfaction and identify areas for improvement.