5/11/2025
Lecture 12: Data Analysis in IoT
Seyed-Hosein Attarzadeh-Niaki
Fundamentals of IoT 1
Recap & Lecture Objectives
• Recap
– Previous Lecture 9: General survey of AI/ML methods applicable to IoT.
– Previous Lecture 8: Covered data management - databases (SQL/NoSQL),
cloud layers (ingestion, processing, storage).
• Today's Focus: Data Analytics in IoT
– Concepts, types, and importance of analytics for IoT data.
– Key technologies and frameworks enabling IoT analytics.
– The process of building and deploying analytics solutions.
• Learning Objectives
– Define Big Data characteristics in the IoT context.
– Describe the different classes of analytics (Descriptive, Diagnostic, Predictive,
Prescriptive).
– Understand the steps involved in building an analytics model (e.g., CRISP-DM,
OSA).
– Recognize key tools and platforms used for IoT data analytics (focus on
concepts).
Why Data Analytics in IoT?
The Data Deluge & Value Proposition
The IoT Data Challenge
• Massive data generation: Estimated 50 billion devices by 2020 (now likely
higher), generating zettabytes of data (1 ZB = 10^21 bytes, roughly 2^70 bytes).
• Data from diverse sources: Sensors, machines, wearables, smart buildings, etc.
• Complex, heterogeneous data: Structured, unstructured, semi-structured (text,
images, video, logs, sensor streams).
• Traditional data management falters with this scale and complexity.
The Value Proposition
• Data is a key resource for competitive advantage.
• Unlock insights: Reveal trends, find patterns, discover correlations.
• Enhance decision-making and automation.
• Improve efficiency, optimize operations, enable new services.
• Key Use Cases: Asset reliability, maintenance optimization, performance
management, compliance, safety, smart environments.
Defining Big Data in the IoT Context
• What is Big Data?
– Data sets so vast and complex they require new computational
resources.
– Data held in amounts difficult to process with traditional methods.
– Not just about size, but also speed, type, and trustworthiness.
• Why is IoT Data inherently "Big Data"?
– Scale: Billions of devices generating continuous streams.
– Speed: Real-time or near-real-time data arrival (milliseconds/seconds).
– Variety: Sensor readings, logs, images, video, GPS, etc.
– Complexity: Requires advanced techniques beyond traditional BI.
• Goal: Convert the vast amount of IoT data into actionable
knowledge and insights.
The Defining Characteristics of Big
Data (Volume, Velocity)
Volume
• Refers to the sheer amount of data collected.
• IoT Example: A single ICU patient's vitals monitored every second
generates massive data, traditionally summarized due to
storage/processing limits.
• Industrial Example: Thousands of sensors on a manufacturing line or
jet engine generating continuous data streams.
• Scale: Terabytes, Petabytes, Exabytes, Zettabytes.
Velocity
• Refers to the speed at which data is generated and needs to be
processed.
• Streaming data requires near real-time analysis and response.
• IoT Example: Real-time anomaly detection in sensor streams for
immediate alerts. Autonomous vehicle sensor processing.
• Industrial Example: High-frequency vibration data analysis for
predictive maintenance.
• Challenge: Handling data arriving in milliseconds, requiring processing
within seconds.
The Defining Characteristics of Big
Data (Variety, Veracity)
Variety
• Refers to the different types and structures of data.
• Structured: Relational databases (e.g., asset IDs, configuration).
• Semi-structured: JSON, XML, Log files.
• Unstructured: Images, video feeds, audio, text reports, social
media posts.
• IoT Challenge: Integrating and analyzing data from all these
different formats.
Veracity
• Refers to the accuracy, trustworthiness, and quality of data.
• Sources of uncertainty: Sensor drift/failure, noise, missing data,
latency, inconsistencies, incompleteness.
• IoT Challenge: Ensuring data quality for reliable analytics.
Dealing with 'bad data', spikes, frozen signals, out-of-order data.
• Need for data cleaning, validation, and potentially imputation.
The Big Data / IoT Analytics Lifecycle
• Analytics is a process, not a single step. Common
frameworks describe this iterative process.
• Generic Big Data Lifecycle
– Capture: Acquire data from sources.
– Organize: Structure and store data (e.g., Data Lakes).
– Integrate: Combine, clean, and prepare data.
– Analyze: Apply statistical methods, ML algorithms.
– Act: Make decisions, automate actions based on
insights.
• CRISP-DM (Cross Industry Standard Process for Data Mining)
– Business Understanding: Define objectives.
– Data Understanding: Collect and explore data.
– Data Preparation: Clean, transform, select features
(often most time-consuming).
– Modeling: Select and apply modeling techniques.
– Evaluation: Assess model performance against
objectives.
– Deployment: Integrate model into operational
systems.
• OSA (Open System Architecture) for CBM: More
specific to maintenance analytics. (Details in a later
slide).
• Key takeaway: It's an iterative process involving
understanding the problem, handling data,
modeling, and evaluating/deploying results.
Analytics Maturity Levels:
From Descriptive to Prescriptive
• Organizations progress through different levels of analytical
sophistication.
• Common tiers
– Descriptive Analytics: What happened?
(Basic reporting, visualization, KPIs).
– Diagnostic Analytics: Why did it happen?
(Root cause analysis, anomaly discovery).
– Predictive Analytics: What could happen?
(Forecasting, RUL estimation).
– Prescriptive Analytics: What should we do?
(Optimization, recommendations, automated actions).
[Figure: analytics maturity pyramid, from Descriptive Analytics at the base through Diagnostic and Predictive Analytics to Prescriptive Analytics at the top]
• Moving up the maturity scale provides increasing value but often
requires more complex techniques and better data quality.
• IoT enables moving beyond simple description towards prediction
and prescription.
Classes of IoT Analytics (1):
Descriptive Analytics – “What Happened?”
• Definition: The most basic class, analyzing past data to provide a
broad view or summary.
• Goal: Understand past performance and current status. Answers
"What happened?".
• Techniques
– Data aggregation and summarization.
– Basic statistical measures
(mean, median, mode, standard deviation, range).
– Reporting and Dashboards.
– Data Visualization (charts, graphs).
• IoT Examples
– Dashboard showing current temperature readings across sensors.
– Report summarizing device uptime/downtime over the last month.
– Chart showing average energy consumption per building floor.
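The summary statistics listed above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the floor names and energy readings are made-up sample values, and `summarize` is a hypothetical helper.

```python
from statistics import mean, median, pstdev

# Hypothetical hourly energy readings (kWh) per building floor.
readings = {
    "floor_1": [12.0, 14.0, 13.0, 15.0],
    "floor_2": [20.0, 22.0, 21.0, 25.0],
}

def summarize(values):
    """Return basic descriptive statistics for one series of readings."""
    return {
        "mean": mean(values),
        "median": median(values),
        "std_dev": pstdev(values),          # population standard deviation
        "range": max(values) - min(values),
    }

# One summary row per floor, e.g. as input for a report or dashboard.
summary = {floor: summarize(values) for floor, values in readings.items()}
for floor, stats in summary.items():
    print(floor, stats)
```

The resulting dictionary is the kind of aggregated view a dashboard or monthly report would present.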
Descriptive Analytics Techniques:
KPI & Health Monitoring
KPI Monitoring
• Key Performance Indicators (KPIs) track critical metrics related to
operational goals.
• Simplest way to monitor fleet health by aggregating or calculating
indicators from raw data.
• Examples: Overall Equipment Effectiveness (OEE), Mean Time
Between Failures (MTBF), Energy Efficiency Ratios.
Condition Monitoring
• Using rules (often simple thresholds) to extract potential
issues based on current or past data.
• Often uses thresholds or simple fuzzy rules.
• Can use simple moving averages or basic statistical features (variance,
std dev).
• Example: Alert if temperature > threshold, Alert if vibration
standard deviation exceeds baseline.
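The two example alerts (temperature above a threshold, vibration standard deviation above a baseline) could look roughly like the sketch below. The limits, the window length, and the `monitor` function are assumptions chosen for illustration only.

```python
from collections import deque
from statistics import pstdev

TEMP_LIMIT_C = 80.0      # assumed temperature threshold
VIB_STD_BASELINE = 0.5   # assumed baseline for vibration std dev
WINDOW = 5               # assumed rolling window length (samples)

def monitor(samples):
    """Yield (index, message) alerts for (temperature, vibration) samples."""
    recent_vib = deque(maxlen=WINDOW)   # keeps only the last WINDOW values
    for i, (temp_c, vib) in enumerate(samples):
        recent_vib.append(vib)
        if temp_c > TEMP_LIMIT_C:
            yield i, "temperature above threshold"
        # Only evaluate the rolling statistic once the window is full.
        if len(recent_vib) == WINDOW and pstdev(recent_vib) > VIB_STD_BASELINE:
            yield i, "vibration std dev above baseline"

samples = [(70.0, 1.0), (71.0, 1.1), (72.0, 0.9),
           (85.0, 1.0), (73.0, 1.0), (74.0, 3.0)]
alerts = list(monitor(samples))
print(alerts)
```

Rules like these are easy to read and audit, which is why they are a common first layer before any model-based analytics.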
Classes of IoT Analytics (2):
Diagnostic Analytics – “Why Did It Happen?”
• Definition: Aims to understand the root causes of past events or issues.
• Goal: Answers "Why did it happen?" by analyzing failure modes and
causes.
• Typical Steps
– Detect Anomalies (Identify deviations from normal).
– Discover/Isolate Anomalies (Pinpoint the specific nature of problem).
– Determine Root Cause (Find the underlying reason).
• Techniques
– Advanced modeling, feature extraction.
– Anomaly detection algorithms (statistical, ML-based).
– Root cause analysis (RCA) methodologies (e.g., Fishbone diagrams,
5 Whys - often involves human investigation aided by data).
– Correlation analysis.
[Figure: Event Occurs -> Detect Anomaly -> Discover Nature -> Determine Root Cause]
• IoT Example: If a machine fails (descriptive), diagnostic analytics might
identify that an unusual spike in vibration (anomaly) preceded the failure,
caused by a bearing defect (root cause).
Diagnostic Analytics Techniques:
Anomaly Detection & Root Cause Analysis
Anomaly Detection
• Identifies unusual patterns or deviations from expected behavior.
• Crucial first step in diagnostics.
• Can be rule-based (thresholds) or model-based (statistical, ML).
• Data-driven models often built on "normal" operational data.
• Output: Flags potential problems or impending failures.
Root Cause Analysis (RCA)
• Goes beyond what happened to why.
• Often requires domain expertise combined with data analysis.
• May involve analyzing sequences of events, correlating sensor data, using
fault trees, or applying ML classifiers trained on failure data.
• Can use deterministic, fuzzy, Bayesian, or ML-based rules
codifying cause-and-effect.
• Goal: Prevent recurrence by addressing the fundamental cause.
Understanding Anomaly Detection in
IoT
• Definition: Identifying data points, events, or observations
that deviate significantly from the expected pattern.
• Types of Anomalies
– Point Anomaly: Single data point is anomalous (e.g., sudden
temperature spike).
– Contextual Anomaly: Anomaly depends on the context (e.g., high
energy use at night is anomalous, but normal during the day).
– Collective Anomaly: A group of data points together indicate an
anomaly, though individual points may seem normal.
• Approaches
– Statistical Methods: Based on distributions (e.g., Z-score, IQR),
moving averages, ARIMA residuals.
– Machine Learning (Unsupervised): One-class SVM (OCSVM), clustering
(e.g., DBSCAN), Isolation Forest. Learns "normal" patterns from
unlabeled data.
– Machine Learning (Supervised): Classification algorithms trained
on labeled data (normal vs. specific anomaly types). Requires
historical failure data.
– Deep Learning: Autoencoders, VAEs, LSTMs for complex time-
series data.
• Challenges: Defining “normal,” handling noise, high
dimensionality, concept drift, dealing with false
positives/negatives.
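As a small illustration of the statistical approaches (Z-score and IQR), the sketch below flags a point anomaly in a short temperature series. The data and the function names are made up for this example. Note that on very small samples a single outlier inflates the standard deviation, so the Z-score cutoff is lowered to 2.5 here as an assumption.

```python
from statistics import mean, pstdev, quantiles

def zscore_anomalies(values, threshold=3.0):
    """Indices of points whose |z-score| exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []                      # constant series: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def iqr_anomalies(values, k=1.5):
    """Indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical temperature readings with one sudden spike at index 5.
temps = [20.1, 20.3, 19.9, 20.0, 20.2, 35.0, 20.1, 19.8, 20.0, 20.2]
print(zscore_anomalies(temps, threshold=2.5))
print(iqr_anomalies(temps))
```

Both functions only catch point anomalies. Contextual and collective anomalies need the context (time of day, neighbouring points) built into the features or the model.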
Classes of IoT Analytics (3):
Predictive Analytics – “What Could Happen?”
• Definition: Uses historical and current data to make predictions about future events or outcomes.
• Goal: Answers “What could happen in the future?” Anticipate potential issues or trends.
• Techniques
– Regression models (Linear, Logistic, etc.).
– Time series forecasting (ARIMA, Prophet, LSTMs).
– Machine learning classification models (predicting future states/failures).
– Survival analysis.
– Prognostics and Health Management (PHM) specific techniques.
• IoT Examples
– Predicting remaining useful life (RUL) of a component.
– Forecasting energy demand based on weather and historical usage.
– Predicting the likelihood of equipment failure within the next week.
[Figure: Degradation prediction in a prognostic model, including the RUL and the UQ]
Predictive Analytics Techniques:
Forecasting & Prognostics (RUL)
Forecasting
• Predicting future values of a variable based on historical
patterns.
• Common in predicting demand, load, resource consumption, market
trends.
• Methods: Statistical (ARIMA, Exponential Smoothing), ML
(Regression, NNs), Hybrid.
• Example Tool: Facebook Prophet library.
Prognostics (RUL Estimation)
• Special focus in IIoT/Maintenance.
• Estimates Time-to-Failure (TTF) or Remaining Useful Life (RUL) for
components/assets.
• Predicts future degradation based on current condition and historical
data.
• Crucial for planning maintenance proactively.
• Methods: Physics-based models (crack growth), Data-driven (ML
models trained on degradation data), Hybrid.
• Requires handling uncertainty (UQ - Uncertainty Quantification) in
predictions.
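A very simple data-driven RUL estimate can be sketched by fitting a straight line to a degradation indicator and extrapolating to a failure threshold. This assumes a linear degradation trend and a known threshold, both of which are simplifications; the wear values and function names are hypothetical, and uncertainty quantification is left out.

```python
def fit_line(times, values):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

def estimate_rul(times, values, failure_threshold):
    """Time remaining until the fitted trend reaches the failure threshold."""
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None                    # no upward degradation trend observed
    t_fail = (failure_threshold - intercept) / slope
    return max(0.0, t_fail - times[-1])

# Hypothetical wear indicator sampled every 100 operating hours.
hours = [0, 100, 200, 300, 400]
wear = [1.0, 1.5, 2.0, 2.5, 3.0]
rul_hours = estimate_rul(hours, wear, failure_threshold=5.0)
print(rul_hours)
```

In practice the point estimate would be accompanied by a confidence band, since the slide stresses that prognostics must handle uncertainty.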
Classes of IoT Analytics (4):
Prescriptive Analytics – “What Should We Do?”
• Definition: Represents the most advanced level,
suggesting actions to take based on predictive insights.
• Goal: Answers "How should we respond?" or "What
should we do?" by recommending optimal actions.
• Anticipates what will happen, when, and why, then
suggests decision options.
• Aims to optimize outcomes, mitigate future risks, or
capitalize on opportunities.
• Relies heavily on outputs from predictive analytics.
• Often involves optimization algorithms, simulation,
rule-based systems (expert systems), or AI/ML models.
• IoT Example: Based on predicted RUL (Predictive),
suggest the optimal time to schedule maintenance
(Prescriptive) to minimize downtime and cost. If
demand is predicted to spike (Predictive), recommend
increasing production rate (Prescriptive).
Prescriptive Analytics Techniques:
Optimization & CBM
• Optimization
– Aims to maximize efficiency, increase production, reduce costs, etc., based on data and
models.
– Can work proactively (user-driven what-if scenarios) or automatically provide insights.
– Often used on-premises/controller level in traditional industry for direct production
optimization.
– Examples: Optimizing energy consumption in a smart building, adjusting process parameters
in manufacturing for yield maximization.
• Condition-Based Maintenance (CBM)
– A key application of prescriptive analytics in IIoT.
– Maintenance strategy based on the actual health/condition of equipment, rather than fixed
schedules.
– Uses monitoring (Descriptive), diagnostics (Diagnostic), and prognostics (Predictive) to
determine when maintenance is truly needed.
– Answers: "Should we perform maintenance now, or can we safely continue operating?"
– Goal: Avoid unnecessary maintenance, prevent unexpected failures, extend asset life.
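The CBM question "maintain now or keep operating?" can be illustrated with a toy decision rule that compares a conservative RUL estimate against the lead time needed to organize maintenance. The rule, the factor of two, and all numbers are assumptions for illustration; real prescriptive systems typically optimize cost and risk explicitly.

```python
def maintenance_decision(rul_hours, rul_uncertainty_hours, lead_time_hours):
    """Return a recommended action from a predicted RUL.

    rul_uncertainty_hours is subtracted to obtain a conservative estimate;
    lead_time_hours is the time needed to prepare a maintenance intervention.
    """
    conservative_rul = rul_hours - rul_uncertainty_hours
    if conservative_rul <= lead_time_hours:
        return "maintain now"
    if conservative_rul <= 2 * lead_time_hours:   # assumed planning horizon
        return "schedule maintenance"
    return "continue operating"

# Hypothetical assets with different predicted RUL values (hours).
decisions = [
    maintenance_decision(400, 50, 72),
    maintenance_decision(150, 30, 72),
    maintenance_decision(90, 30, 72),
]
print(decisions)
```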
The Open System Architecture (OSA) for
Condition-Based Maintenance (CBM)
• A popular framework illustrating the layers involved in CBM.
• Shows the progression from raw data to actionable maintenance decisions.
• OSA-CBM Layers
1. Data Acquisition: Collect raw sensor data (vibration, temperature, pressure, etc.).
2. Signal Processing: Clean data, extract relevant features (e.g., FFT, statistical features). Often
done via stream analytics.
3. Condition Monitoring: Detect deviations from normal operation using rules or models
(Descriptive/Basic Diagnostic).
4. Health Assessment: Diagnose fault types and assess severity (Diagnostic). Requires
advanced analytics (ML/Physics).
5. Prognostics: Predict future degradation and estimate RUL (Predictive). Requires advanced
analytics.
6. Decision Support: Recommend optimal maintenance actions (Prescriptive). May involve
querying or on-demand analytics.
• These layers can be implemented across cloud, edge, or on-premises systems.
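To make the Signal Processing layer more concrete, the sketch below extracts one frequency-domain feature, the dominant frequency of a vibration signal. It uses a naive discrete Fourier transform in pure Python as a stand-in for an FFT (same result, much slower); the synthetic 8 Hz signal and the function names are assumptions.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(total))
    return mags

def dominant_frequency(signal, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC component."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])  # skip bin 0 (DC)
    return k * sample_rate_hz / len(signal)

# Synthetic vibration signal: 8 Hz sine sampled at 64 Hz for one second.
sample_rate_hz = 64
signal = [math.sin(2 * math.pi * 8 * t / sample_rate_hz) for t in range(64)]
peak_hz = dominant_frequency(signal, sample_rate_hz)
print(peak_hz)
```

A feature like this would then feed the Condition Monitoring and Health Assessment layers, for example to detect a shift in the dominant vibration frequency.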
Building IoT Analytics:
The Process Overview (CRISP-DM / EDA)
• Developing effective analytics requires a
structured approach.
• Recall CRISP-DM
– Business Understanding -> Data Understanding ->
Data Preparation -> Modeling -> Evaluation ->
Deployment.
• Similar Workflow (EDA-based)
– Problem Statement: Define scope, constraints,
success metrics. Align business & technical goals.
– Dataset Acquisition: Collect, integrate (wrangle),
clean data.
– Exploratory Data Analysis (EDA): Investigate data,
calculate statistics, visualize patterns, identify
correlations. (Based on Tukey's EDA.)
– Model Building: Feature engineering,
train/validation split, select technique
(physics/data-driven), train model.
– Packaging & Deploying (MLOps): Testing (unit,
regression, performance), wrap model, deploy.
– Monitoring: Continuously review performance.
• Emphasizes iteration and deep understanding of
both the business problem and the data.
Building IoT Analytics:
Data Acquisition & Understanding
• Data Acquisition/Collection
– Gathering raw data from diverse IoT sources (sensors, logs, devices, external APIs).
– Includes associated metadata, events, alerts if available.
– Consider data formats (CSV, JSON, Parquet, Avro, etc.). (Covered more later).
• Data Integration (Wrangling/Munging)
– Converting data formats.
– Combining data from multiple sources.
– Integrating with metadata or additional information (e.g., asset details, maintenance logs).
• Data Understanding (EDA Part 1)
– Initial exploration to understand data characteristics.
– What data is available? What does it represent?
– Assess data quantity, quality, types, distributions.
– Identify potential issues (missing values, outliers) early on.
– Use basic descriptive statistics and visualizations.
[Figure: Diverse IoT data sources (sensors, logs, video) -> Data acquisition and integration -> Initial data exploration]
Building IoT Analytics:
Data Preparation & Feature Engineering
• Data Cleaning
– Handling missing data (imputation, removal).
– Dealing with outliers/spikes (smoothing, removal).
– Correcting inconsistencies or errors.
– Addressing data quality issues (noise, frozen signals). Often the most time-consuming phase.
• Data Transformation
– Normalization / Standardization: Scaling data to a common range (important for many ML
algorithms).
– Encoding Categorical Variables: Converting text labels to numerical representations (e.g., One-
Hot Encoding).
• Feature Engineering
– Creating new input variables (features) from raw data that are more informative for the model.
– Feature Extraction: Building calculated values (e.g., statistical features like mean/std dev over a
window, frequency domain features via FFT).
• Feature Selection/Reduction: Reducing dimensionality by selecting the most
relevant features or transforming them (e.g., PCA).
– Aims to improve model performance and reduce complexity.
[Figure: Raw Data -> Cleaning (Handling Missing/Outliers) -> Transformation (Scaling/Encoding) -> Feature Engineering (Extraction/Selection) -> Prepared Data for Modeling]
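The transformation and feature extraction steps above can be sketched with small pure-Python helpers. This is only an illustration under simple assumptions (non-overlapping windows, categories sorted alphabetically); the function names are hypothetical, and ML libraries provide more complete equivalents.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]

def one_hot(labels):
    """One-hot encode text labels; columns follow sorted category order."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def window_features(values, size):
    """(mean, std dev) for consecutive non-overlapping windows."""
    return [(mean(values[i:i + size]), pstdev(values[i:i + size]))
            for i in range(0, len(values) - size + 1, size)]

print(min_max_scale([10, 20, 30]))
print(one_hot(["idle", "run", "idle"]))
print(window_features([1.0, 3.0, 2.0, 2.0], size=2))
```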
Data Visualization Techniques for High-
Dimensional IoT Data
• Visualization is crucial for understanding complex IoT data and model results.
• Challenges: High dimensionality (many sensors/features) makes standard charts
difficult.
• Dimensionality Reduction for Visualization
– Project high-dimensional data into 2D or 3D for plotting.
– Principal Component Analysis (PCA): Projects data onto directions of highest variance. Good
for global structure.
– t-Distributed Stochastic Neighbor Embedding (t-SNE): Preserves local neighborhood structure,
good for visualizing clusters.
– Graph-Based Methods: Model data points as nodes, connections as edges, visualize the graph
layout.
• Other Tools/Techniques
– Time-Series Plots: Essential for sensor data (potentially with anomaly overlays).
– Scatter Plots / Scatter Matrices: Show relationships between pairs of variables.
– Heatmaps: Visualize correlation matrices or other matrix data.
– Dashboards (e.g., Grafana): Combine multiple visualizations for operational monitoring.
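The idea behind PCA, projecting data onto the direction of highest variance, can be illustrated without any library by running power iteration on the covariance matrix. This is a teaching sketch under stated assumptions (only the first component, a start vector that is not orthogonal to it); in practice a library PCA implementation would normally be used. The three-sensor data set is made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (column means, unit vector of highest variance)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d       # arbitrary unit start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break                      # no variance at all in the data
        v = [x / norm for x in w]
    return means, v

def project(rows, means, direction):
    """1-D coordinate of each row along the given direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Hypothetical readings: sensors 1 and 2 move together, sensor 3 is nearly flat.
rows = [[1.0, 1.1, 5.0], [2.0, 1.9, 5.1], [3.0, 3.2, 4.9],
        [4.0, 3.9, 5.0], [5.0, 5.1, 5.0]]
means, pc1 = first_principal_component(rows)
coords = project(rows, means, pc1)
print(pc1, coords)
```

The first component weights the two correlated sensors about equally and almost ignores the flat one, which is the "global structure" PCA is good at exposing.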
Analytics Technologies:
Rule-Based vs. Model-Based Approaches
Rule-Based Analytics
• Uses pre-defined rules, often based on expert knowledge or empirical
observations.
• If-Then logic, thresholds, decision trees, fuzzy logic, Bayesian networks.
• Pros: Interpretable, relatively simple to implement for known conditions.
• Cons: Can be brittle, hard to maintain for complex systems, may not capture
unknown patterns.
• Example: IF Temperature > 100 °C AND Pressure > 2 bar THEN Trigger Alarm.
Model-Based Analytics
• Uses mathematical or statistical models learned from data or derived from
physical principles.
• Describes relationships between variables or system states.
• Can be Physics-Based or Data-Driven.
• Pros: Can capture complex relationships, adapt to changing
conditions (data-driven), handle larger state spaces.
• Cons: Can be less interpretable ("black box"), may require significant data
(data-driven) or deep domain knowledge (physics-based).
• Example: Using a neural network to predict RUL based on sensor inputs.
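The rule-based example from this slide translates almost directly into code. The function name is hypothetical; the limits are the ones given in the example rule.

```python
TEMP_LIMIT_C = 100.0       # from the example rule: Temperature > 100 C
PRESSURE_LIMIT_BAR = 2.0   # from the example rule: Pressure > 2 bar

def should_trigger_alarm(temperature_c, pressure_bar):
    """IF temperature > limit AND pressure > limit THEN trigger alarm."""
    return temperature_c > TEMP_LIMIT_C and pressure_bar > PRESSURE_LIMIT_BAR

print(should_trigger_alarm(105.0, 2.4))   # both limits exceeded
print(should_trigger_alarm(105.0, 1.8))   # pressure within limits
```

The rule is fully interpretable, but every new condition has to be added by hand, which is the brittleness noted under "Cons".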
Model-Based Analytics:
Physics-Based Models - Pros & Cons
• Definition: Based on understanding the underlying
physical laws and principles governing the system's
behavior.
• Uses mathematical equations (e.g., differential
equations) derived from physics, chemistry, engineering
principles.
• Combines this domain knowledge with measured data.
• Advantages
– Requires relatively less historical failure data compared to
purely data-driven methods.
– Often better for long-term prediction (extrapolation).
– Generally more reliable when the physics are well
understood.
– Highly interpretable (model parameters have physical
meaning).
• Disadvantages
– Requires deep, detailed domain knowledge of the system.
– Can be difficult or complex to develop accurate physical
models.
– May not capture unforeseen behaviors not included in the
model.
• When to Use? High reliability needed, limited failure data
available, system physics well-understood.
Model-Based Analytics:
Data-Driven Models (ML) - Pros & Cons
• Definition: Learns patterns and relationships directly from historical data without
explicit programming of physical laws.
• Uses statistics, Machine Learning (ML), Deep Learning (DL) techniques.
• Recall ML Types
– Supervised: Learns from labeled data (input-output pairs) for classification/regression. Needs
historical examples of outcomes (e.g., failures).
– Unsupervised: Finds patterns in unlabeled data (clustering, anomaly detection). Useful when
outcomes are unknown/rare.
– Reinforcement: Learns through trial-and-error in an environment (less common for direct IIoT
analytics, more for control/robotics).
• Advantages
– Can discover complex, non-obvious patterns.
– Does not require deep understanding of underlying physics.
– Can adapt as new data becomes available.
• Disadvantages
– Often requires large amounts of relevant historical data (especially failure data for supervised
learning).
– Can be less interpretable ("black box" models). Explainable AI (XAI) aims to address this.
– Performance depends heavily on data quality and quantity. May perform poorly with
insufficient or biased data.
• When to Use? Large amount of data available, system physics complex or
unknown, need to find hidden patterns.
Big Data Platforms for IoT Analytics:
Overview (Hadoop & Spark)
• Handling IoT Big Data requires scalable and
distributed platforms.
• Apache Hadoop
– Open-source framework for distributed storage and
processing of large datasets.
– Key Components (Original)
• HDFS (Hadoop Distributed File System): Stores large
files across clusters of commodity hardware, providing
reliability through replication.
• MapReduce: Programming model for parallel batch
processing (moves computation to data).
– YARN (Yet Another Resource Negotiator): Resource
management, allows running applications beyond
MapReduce (like Spark) on Hadoop clusters.
– Ecosystem: Includes tools like Hive (SQL-like
queries), Pig (scripting), HBase (NoSQL DB), Sqoop
(data transfer).
• Apache Spark
– Fast, general-purpose cluster computing framework.
– Often runs on Hadoop (using HDFS for storage, YARN
for resource management) but can run standalone.
– Key Advantage: In-memory computing, making it
much faster (up to 100x) than disk-based
MapReduce for iterative algorithms (like ML) and
interactive queries.
• These platforms provide the foundation for storing
massive IoT datasets and running complex
analytical workloads.
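The MapReduce programming model mentioned above can be mimicked on a single machine to show its three phases: map, shuffle (group by key), and reduce. This is a conceptual sketch in plain Python, not Hadoop code; the device IDs and readings are made up, and a real cluster would run the phases in parallel across nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs from one input record."""
    device_id, temperature = record
    yield device_id, temperature

def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key, here to the maximum."""
    return key, max(values)

# Hypothetical (device, temperature) readings.
records = [("dev-a", 21.5), ("dev-b", 30.1), ("dev-a", 24.0), ("dev-b", 28.7)]
mapped = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```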
Deploying IoT Analytics
• Deployment Considerations
• Infrastructure: Analytics implementation depends on infrastructure
support.
– Data Ingestion: Bulk vs. Micro-batch vs. Streaming.
• Affects analytic triggering.
– Data Quality: Handling late data, out-of-order data, gaps, frozen
signals is critical for real-world deployment.
– Triggering: Stream-triggered, micro-batch scheduled, or condition-
based.
– Location: Cloud vs. Edge vs. Controller
• Cloud: Scalable, good for large datasets/fleet-wide analysis, but
latency/bandwidth limitations for high-frequency data.
• Edge: For low latency needs, high-frequency data (e.g., vibration), offline
operation, data privacy/policy restrictions.
• Controller: Often for real-time control loops, security/robustness concerns
often limit external analytics from direct control.
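The data quality issues named above (out-of-order data, gaps, frozen signals) can be checked with simple helpers before analytics are triggered. The sketch below is illustrative only: the tolerance, the minimum run length, and the function names are assumptions.

```python
def sort_by_event_time(records):
    """Restore time order for (timestamp, value) records that arrived late."""
    return sorted(records, key=lambda record: record[0])

def find_gaps(timestamps, expected_interval, tolerance=0.5):
    """(start, end) pairs where consecutive timestamps are too far apart."""
    limit = expected_interval * (1 + tolerance)
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > limit]

def find_frozen_runs(values, min_run=5):
    """(start_index, length) of runs where the signal does not change."""
    runs, start = [], 0
    for i in range(1, len(values) + 1):
        # Close the current run at the end of data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                runs.append((start, i - start))
            start = i
    return runs

ordered = sort_by_event_time([(3, 20.2), (1, 20.0), (2, 20.1)])
gaps = find_gaps([0, 1, 2, 6, 7], expected_interval=1)
frozen = find_frozen_runs([1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0], min_run=5)
print(ordered, gaps, frozen)
```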
Conclusion
• IoT generates vast amounts of data requiring specialized
Big Data analytics techniques and platforms.
• Analytics progresses from Descriptive to Diagnostic,
Predictive, and Prescriptive, offering increasing value.
• Platforms like Hadoop and Spark provide scalable storage
and processing.
– Spark's in-memory capabilities are particularly suited for ML and
iterative tasks.
• Building and deploying analytics involves a structured
process (e.g., CRISP-DM) and careful consideration of data
quality and deployment architecture (Cloud/Edge).
• AI/ML is a key enabler for advanced IoT analytics (AIoT).
Next Lecture
• IoT in Smart Cities
AIoT: The Convergence of AI and IoT
for Enhanced Analytics
• AIoT = Artificial Intelligence of Things.
• The combination of AI technologies (specifically ML/DL) with IoT infrastructure.
• Goal: Create more efficient, intelligent IoT systems.
• How AI enhances IoT
– Smart Data Analysis: Moving beyond basic analytics to complex pattern recognition,
prediction, and prescription using ML on IoT data.
– Improved Decision Making: Enabling automated or semi-automated decisions based on real-
time data analysis.
– Enhanced Human-Machine Interaction: More natural interfaces (voice, gesture), personalized
responses.
– Operational Efficiency: Optimizing processes, predictive maintenance, resource management.
• How IoT enhances AI
– Provides the massive, real-world data streams needed to train and operate powerful AI
models.
– Enables AI to interact with and influence the physical world through connected actuators.
• AIoT is the engine driving advanced analytics use cases like complex anomaly
detection, prognostics, and prescriptive actions in modern IoT systems.
Apache Spark for IoT Analytics:
Core Concepts (RDDs, DataFrames)
• Spark Core Engine: Underlying execution engine, handles
scheduling, memory management, fault recovery.
• Resilient Distributed Datasets (RDDs)
– Spark's fundamental data structure.
– Immutable, distributed collection of objects partitioned
across cluster nodes.
– Fault-tolerant: Can be recomputed from lineage
(dependency graph) if a partition is lost.
– Supports two types of operations
• Transformations: Create new RDDs from existing ones (e.g., map,
filter). Lazy evaluation.
• Actions: Trigger computation and return a result (e.g., count,
collect, save).
– Can be persisted (cached) in memory for fast reuse in
iterative/interactive tasks.
[Figure: Spark's approach to fast data sharing for iterative operation]
• DataFrames & Datasets
– Higher-level abstraction built on RDDs.
– DataFrame: Distributed collection of data organized into
named columns (like a database table or pandas DataFrame).
– Dataset: Strongly-typed version of DataFrame (Java/Scala).
– Provide benefits of RDDs + optimizations from Spark SQL
engine (Catalyst optimizer).
– Can be created from various sources (JSON, Parquet, Hive,
RDDs).
– Evaluated lazily like RDDs. Preferred API for most structured
data analysis in Spark.
[Figure: Spark's approach to fast data sharing for queries]
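The distinction between lazy transformations and actions can be demonstrated with a toy class in plain Python. `LazyDataset` is a hypothetical teaching stand-in, not the Spark API: it only records a lineage of steps and computes nothing until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a lineage of transformations."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage        # tuple of (kind, function) steps

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the lineage over the source, one item at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):     # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []                             # records when the map function runs

def to_fahrenheit(celsius):
    calls.append(celsius)
    return celsius * 9 / 5 + 32

hot = LazyDataset([20.0, 25.0, 30.0]).map(to_fahrenheit).filter(lambda f: f > 75.0)
calls_before_action = len(calls)       # still 0: nothing has been computed
result = hot.collect()                 # the action runs the whole lineage
print(calls_before_action, result)
```

Because the lineage is kept, a lost result can be recomputed from the source, which mirrors how RDDs achieve fault tolerance.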
Spark Ecosystem for Analytics:
Spark SQL, MLlib, Streaming
• Spark provides libraries built on Spark Core for common analytics tasks.
• Spark SQL
– Module for structured data processing.
– Allows querying data via SQL or HiveQL.
– Works with DataFrames and Datasets.
– Can read/write various structured formats (JSON, Parquet, JDBC, Hive tables).
– Includes Catalyst optimizer for query optimization.
• MLlib (Machine Learning Library)
– Spark's ML library, designed for scalability.
– Provides common algorithms: Classification, Regression, Clustering, Collaborative Filtering, etc.
– Includes tools for feature extraction, transformation, pipeline construction, model evaluation,
hyperparameter tuning.
– Often significantly faster than older Hadoop ML libraries (like Mahout) due to in-memory processing.
• Spark Streaming
– Enables scalable, high-throughput, fault-tolerant processing of live data streams.
– Processes data in small micro-batches (DStreams - Discretized Streams).
– DStream is a sequence of RDDs.
– Can ingest data from sources like Kafka, Flume, Kinesis.
– Allows applying Spark transformations/actions and MLlib models to streaming data.
• GraphX: Library for graph processing (less common for typical sensor analytics, but relevant for
network/relationship analysis).
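The micro-batch model used by Spark Streaming can be illustrated in plain Python: a continuous stream of events is cut into small batches at a fixed interval, and each batch is then processed like a small static dataset. This is a conceptual sketch, not the Spark API; the events, the 1000 ms interval, and the function name are assumptions.

```python
def micro_batches(events, interval_ms):
    """Group time-ordered (timestamp_ms, value) events into fixed intervals."""
    batch = []
    batch_end = None
    for ts, value in events:
        if batch_end is None:
            # Align the first batch boundary to a multiple of the interval.
            batch_end = (ts // interval_ms + 1) * interval_ms
        while ts >= batch_end:
            yield batch                # may be empty if no events arrived
            batch = []
            batch_end += interval_ms
        batch.append(value)
    if batch:
        yield batch

# Hypothetical sensor events: (timestamp in ms, reading).
events = [(200, 10.0), (700, 12.0), (1100, 11.0), (1900, 13.0), (3400, 50.0)]
batch_means = [sum(b) / len(b) if b else None
               for b in micro_batches(events, interval_ms=1000)]
print(batch_means)
```

Each yielded batch plays the role of one RDD in a DStream; the per-batch mean stands in for whatever transformation or model would be applied.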
Apache Kafka for Data Streaming
Apache Kafka is a distributed streaming platform
for real-time IoT data feeds.
• Features
– High throughput.
– Fault-tolerant storage.
– Scalable across clusters.
• Use Case: Streaming sensor data to analytics
platforms.
Apache Storm and Flink
• Apache Storm
– Real-time computation for unbounded streams.
• Apache Flink
– Handles both stream and batch processing with high
throughput and low latency.
• Both support complex event processing in IoT.
Feature             Apache Storm         Apache Flink
Processing Model    Stream-only          Stream and batch
Latency             Very low             Low
State Management    Basic                Advanced
Windowing           Time, count-based    Flexible