
UNIT-2

Big Data Technologies

Big data technologies refer to the tools, frameworks, and platforms used to
manage, process, analyze, and derive insights from large volumes of data. These
technologies are essential for organizations dealing with massive datasets that
traditional data processing and analysis methods cannot handle efficiently. Some
key components and technologies within the big data ecosystem include:

1. Distributed File Systems: Distributed file systems such as the Hadoop
Distributed File System (HDFS) allow large datasets to be stored across
clusters of commodity hardware.
2. Data Processing Frameworks: Frameworks such as Apache Hadoop,
Apache Spark, and Apache Flink enable distributed processing of large
datasets across clusters of computers.
3. Data Warehousing: Technologies like Apache Hive and Amazon Redshift
provide platforms for storing and querying structured data in a distributed
environment.
4. NoSQL Databases: NoSQL databases like MongoDB, Cassandra, and
Couchbase are designed to handle unstructured or semi-structured data at
scale and offer flexible schema designs.
5. In-Memory Data Grids: Technologies such as Apache Ignite and Hazelcast
provide distributed, in-memory data storage and processing capabilities,
allowing for high-speed data access and computation.
6. Stream Processing Systems: Frameworks like Apache Kafka and Apache
Storm enable real-time processing and analysis of streaming data from
various sources (a small Kafka sketch appears after this list).
7. Data Lakes: Data lakes, built using platforms like Amazon S3 and Azure
Data Lake Storage, serve as centralized repositories for storing structured,
semi-structured, and unstructured data at scale.
8. Machine Learning and AI: Integration of machine learning and artificial
intelligence technologies into big data platforms enables predictive analytics,
anomaly detection, and other advanced data analysis tasks.
9. Containerization and Orchestration: Technologies like Docker and
Kubernetes facilitate the deployment, scaling, and management of big data
applications and services in containerized environments.
10.Data Governance and Security: Solutions for data governance,
compliance, and security, such as Apache Ranger and Apache Atlas, help
organizations manage and protect sensitive data in big data environments.
11.Data Visualization and BI Tools: Tools like Tableau, Power BI, and
Apache Superset allow users to visualize and explore data stored in big data
platforms, enabling better decision-making and insights discovery.
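To make the streaming idea in item 6 concrete, here is a minimal sketch of
publishing and reading events with Apache Kafka from Python. It assumes a
Kafka broker running locally on port 9092, the third-party kafka-python
package, and a hypothetical topic named clickstream; treat it as an
illustration rather than a production pipeline.

    # Minimal Kafka produce/consume sketch (assumes a local broker at
    # localhost:9092 and the kafka-python package: pip install kafka-python).
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publish a few JSON-encoded click events to a hypothetical "clickstream" topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for user_id in ["u1", "u2", "u3"]:
        producer.send("clickstream", {"user": user_id, "action": "page_view"})
    producer.flush()

    # Read the events back; a real stream processor would run continuously.
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)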

Hadoop’s Parallel World:

"Hadoop's Parallel World" refers to the distributed computing paradigm embodied


by the Apache Hadoop ecosystem. Hadoop revolutionized the way large-scale data
processing and analysis are conducted, particularly for unstructured and semi-
structured data.

Here's a breakdown of what constitutes Hadoop's parallel world:

1. Distributed Storage: Hadoop Distributed File System (HDFS) forms the
backbone of Hadoop's storage layer. It breaks down large files into smaller
blocks and distributes them across a cluster of commodity hardware. This
distributed storage architecture enables high availability and fault tolerance.
2. MapReduce Paradigm: MapReduce is a programming model and
processing framework for distributed computing, popularized by Hadoop. It
allows developers to write parallelizable algorithms for processing vast
amounts of data across a Hadoop cluster. MapReduce operates in two
phases: the Map phase for data processing and the Reduce phase for
aggregation (a small word-count sketch appears after this list).
3. Scalability and Fault Tolerance: Hadoop's parallel world is designed to
scale horizontally, meaning additional commodity hardware can be added to
the cluster to handle increasing data volumes and processing demands.
Furthermore, Hadoop's fault-tolerant architecture ensures that data and
processing tasks are resilient to node failures within the cluster.
4. Ecosystem Components: Hadoop's parallel world extends beyond HDFS
and MapReduce to include a rich ecosystem of complementary tools and
frameworks. Apache projects such as Hive, Pig, HBase, Spark, and Kafka
offer diverse functionalities for data storage, processing, querying, real-time
analytics, and streaming data processing.
5. Data Lakes and Batch Processing: Hadoop's parallel world enables the
creation of data lakes, centralized repositories for storing vast amounts of
structured, semi-structured, and unstructured data. Batch processing,
facilitated by Hadoop's MapReduce paradigm, allows organizations to
analyze historical data efficiently and derive valuable insights.
6. Enterprise Adoption: Hadoop's parallel world has seen widespread
adoption across various industries, including technology, finance, healthcare,
and retail. Organizations leverage Hadoop to address diverse use cases such
as log processing, ETL (Extract, Transform, Load) pipelines, predictive
analytics, recommendation engines, and more.
7. Challenges and Evolutions: While Hadoop's parallel world offers
compelling advantages, it also poses challenges related to complexity,
performance optimization, and integration with existing IT infrastructure.
Moreover, the emergence of cloud-based alternatives, like managed Hadoop
services and serverless computing platforms, has influenced the evolution of
big data technologies beyond traditional Hadoop deployments.
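As a concrete illustration of the Map and Reduce phases described above, here
is a minimal word-count sketch written as two Python scripts in the style used
with Hadoop Streaming. The file names (mapper.py, reducer.py) and the input
data are assumptions for illustration; the same split/shuffle/aggregate logic
is what a real Hadoop job distributes across the cluster.

    # mapper.py -- Map phase: emit (word, 1) for every word on standard input.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Reduce phase: sum the counts for each word. Hadoop Streaming
    # sorts the mapper output by key before it reaches the reducer, so identical
    # words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

Locally the pipeline can be simulated with something like
"cat input.txt | python mapper.py | sort | python reducer.py"; on a cluster the
same scripts would typically be passed to the Hadoop Streaming jar, which
handles distribution, shuffling, and fault tolerance.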

Data discovery:

Data discovery refers to the process of identifying, exploring, and understanding
the data assets within an organization. It involves discovering data sources,
assessing data quality, understanding data relationships, and uncovering valuable
insights that can drive decision-making and innovation.

Here are key aspects and steps involved in the data discovery process:

1. Identifying Data Sources: Data discovery begins with identifying all
potential sources of data within an organization. These sources may include
databases, data warehouses, data lakes, spreadsheets, cloud storage, APIs,
and third-party data providers.
2. Cataloging Metadata: Metadata, which provides information about the
structure, content, and context of data, plays a crucial role in data discovery.
Metadata cataloging involves capturing and organizing metadata attributes
such as data types, schema definitions, data lineage, ownership, and access
permissions.
3. Exploratory Data Analysis (EDA): EDA involves analyzing and
visualizing data to gain insights into its characteristics, patterns,
distributions, and relationships. Techniques such as statistical analysis, data
profiling, data visualization, and clustering can help uncover hidden patterns
and anomalies in the data.
4. Data Profiling and Quality Assessment: Data profiling involves examining
the quality, completeness, accuracy, and consistency of data across different
sources. It helps identify data anomalies, missing values, duplicate records,
outliers, and other data quality issues that may affect the reliability of
analysis and decision-making (a small profiling sketch appears after this list).
5. Understanding Data Relationships: Understanding the relationships
between different datasets, attributes, and entities is essential for data
discovery. Techniques such as entity-relationship modeling, schema
mapping, and graph analysis help uncover the underlying structure and
dependencies within the data.
6. Data Governance and Compliance: Data discovery also involves ensuring
compliance with regulatory requirements, data governance policies, and
privacy regulations. It requires documenting data usage policies, defining
data classification standards, and implementing access controls to protect
sensitive information.
7. Collaboration and Knowledge Sharing: Effective data discovery requires
collaboration and knowledge sharing among data stakeholders, including
data analysts, data scientists, domain experts, and business users.
Collaborative tools and platforms facilitate sharing insights, documenting
data lineage, and capturing domain knowledge.
8. Automated Data Discovery: With the increasing volume and complexity of
data, organizations are turning to automated data discovery tools and
platforms to streamline the process. These tools leverage machine learning,
natural language processing, and data mining techniques to automate data
profiling, data lineage analysis, and pattern recognition.
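The profiling step in item 4 can be sketched with pandas. The file name
customers.csv and its columns are assumptions used only for illustration; the
checks themselves (missing values, duplicates, summary statistics) are the
generic first pass of most data discovery work.

    import pandas as pd

    # Load one candidate data source; the file and columns are hypothetical.
    df = pd.read_csv("customers.csv")

    # Basic profile: shape, data types, and per-column missing-value counts.
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())

    # Duplicate records and summary statistics for numeric columns.
    print("duplicate rows:", df.duplicated().sum())
    print(df.describe())

    # A crude outlier check: values more than 3 standard deviations from the mean.
    numeric = df.select_dtypes("number")
    outliers = (numeric - numeric.mean()).abs() > 3 * numeric.std()
    print(outliers.sum())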

Open source technology for Big Data Analytics:

Open-source technologies play a vital role in enabling big data analytics, providing
cost-effective and flexible solutions for processing, storing, analyzing, and
visualizing large volumes of data. Here are some key open-source technologies
commonly used in big data analytics:

1. Apache Hadoop: Hadoop is one of the foundational technologies in the big
data ecosystem. It includes the Hadoop Distributed File System (HDFS) for
distributed storage and MapReduce for distributed processing. Hadoop also
supports various higher-level frameworks like Apache Hive, Apache Pig,
and Apache Spark for data processing.
2. Apache Spark: Spark is a fast and general-purpose distributed computing
system that provides in-memory data processing capabilities. It supports
batch processing, interactive querying, machine learning, and streaming
analytics. Spark's flexible API and rich ecosystem make it suitable for a
wide range of big data analytics tasks (a small PySpark sketch appears after
this list).
3. Apache Kafka: Kafka is a distributed streaming platform used for building
real-time data pipelines and streaming applications. It enables high-
throughput, fault-tolerant messaging between systems and applications,
making it well-suited for event-driven architectures and real-time analytics.
4. Apache HBase: HBase is a distributed, scalable, and consistent NoSQL
database built on top of Hadoop HDFS. It provides random read and write
access to large volumes of structured data, making it suitable for
applications requiring low-latency access to big data.
5. Apache Flink: Flink is a stream processing framework that provides event-
driven, fault-tolerant processing of real-time data streams. It offers support
for event time processing, exactly-once semantics, and stateful
computations, making it suitable for complex event processing and real-time
analytics.
6. Elasticsearch: Elasticsearch is a distributed search and analytics engine
built on top of the Apache Lucene library. It provides real-time indexing,
search, and analysis capabilities for structured and unstructured data.
Elasticsearch is commonly used for log analytics, full-text search, and
monitoring applications.
7. Apache Druid: Druid is a distributed, column-oriented database designed
for real-time analytics on large datasets. It provides sub-second query
response times and supports high concurrency and scalable data ingestion.
Druid is often used for interactive analytics, OLAP (Online Analytical
Processing), and time-series data analysis.
8. Apache Airflow: Airflow is a platform for orchestrating complex data
workflows and data pipelines. It allows users to define, schedule, and
monitor workflows as directed acyclic graphs (DAGs). Airflow's extensible
architecture and rich ecosystem of operators make it suitable for managing
big data pipelines and ETL (Extract, Transform, Load) processes.
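A minimal PySpark sketch of a batch aggregation, assuming the pyspark package
is installed and a hypothetical sales.csv file with region and amount columns;
in a real deployment the same code would run unchanged against a cluster rather
than the local machine.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start a local Spark session; on a cluster the master URL would differ.
    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Read a (hypothetical) CSV file and compute total sales per region.
    sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
    summary = (
        sales.groupBy("region")
             .agg(F.sum("amount").alias("total_amount"),
                  F.count("*").alias("num_orders"))
             .orderBy(F.desc("total_amount"))
    )
    summary.show()

    spark.stop()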

Cloud and Big Data:

Cloud computing and big data technologies often go hand in hand, as the cloud
provides scalable infrastructure and resources for storing, processing, and
analyzing large volumes of data. Here's how cloud computing and big data
intersect and complement each other:

1. Scalability: Cloud platforms such as Amazon Web Services (AWS),
Microsoft Azure, and Google Cloud Platform (GCP) offer elastic and
scalable infrastructure resources, allowing organizations to scale their
computing and storage capacity up or down based on demand. This
scalability is crucial for handling the massive volumes of data generated in
big data analytics.
2. Storage: Cloud storage services like Amazon S3, Azure Blob Storage, and
Google Cloud Storage provide highly durable and scalable storage solutions
for big data. Organizations can store petabytes of data in the cloud without
worrying about managing physical hardware or infrastructure (a small S3
sketch appears after this list).
3. Compute Power: Cloud computing platforms offer a variety of compute
services, including virtual machines (VMs), containers, and serverless
computing, which can be leveraged for processing big data workloads.
Services like AWS EC2, Azure Virtual Machines, and Google Compute
Engine enable organizations to deploy and scale compute resources
dynamically to meet the needs of big data analytics applications.
4. Managed Big Data Services: Cloud providers offer managed big data
services and platforms that abstract the complexities of deploying and
managing big data infrastructure. For example, AWS offers services like
Amazon EMR (Elastic MapReduce) for running Apache Hadoop and Spark
clusters, while Azure provides Azure HDInsight and Google Cloud offers
Dataproc for managed big data processing.
5. Data Warehousing: Cloud data warehouses such as Amazon Redshift,
Azure Synapse Analytics (formerly SQL Data Warehouse), and Google
BigQuery enable organizations to analyze large datasets using SQL queries.
These platforms offer high-performance analytics, scalability, and
integration with other cloud services and big data tools.
6. Serverless Computing: Serverless computing platforms like AWS Lambda,
Azure Functions, and Google Cloud Functions allow organizations to run
code in response to events without provisioning or managing servers.
Serverless architectures can be used to build real-time data processing
pipelines and event-driven applications, making them well-suited for big
data analytics.
7. AI and Machine Learning: Cloud providers offer AI and machine learning
services that enable organizations to analyze big data and derive valuable
insights. Services like AWS SageMaker, Azure Machine Learning, and
Google AI Platform provide tools and frameworks for training and
deploying machine learning models at scale, leveraging big data analytics.
8. Data Lakes and Analytics: Cloud data lake solutions such as AWS Lake
Formation, Azure Data Lake Storage, and Google Cloud Storage provide
centralized repositories for storing structured, semi-structured, and
unstructured data at scale. These platforms integrate with big data analytics
tools and services, enabling organizations to perform advanced analytics,
machine learning, and data visualization on large datasets.
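
As a small illustration of cloud object storage from item 2, the sketch below
uploads and retrieves a file from Amazon S3 using the boto3 SDK. The bucket
name and file names are placeholders, and it assumes AWS credentials are
already configured in the environment.

    import boto3

    # Create an S3 client; credentials come from the environment or AWS config files.
    s3 = boto3.client("s3")

    # Upload a local dataset to a (hypothetical) bucket, then download it again.
    bucket = "example-analytics-bucket"   # placeholder bucket name
    s3.upload_file("sales.csv", bucket, "raw/sales.csv")
    s3.download_file(bucket, "raw/sales.csv", "sales_copy.csv")

    # List the objects stored under the raw/ prefix.
    response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])
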
Predictive Analytics:

Predictive analytics is a branch of advanced analytics that utilizes historical data,
statistical algorithms, and machine learning techniques to forecast future outcomes,
trends, and behaviors. It involves extracting insights from past data patterns and
using them to make predictions about future events or behaviors.

Here's how predictive analytics works and its key components:

1. Data Collection and Preparation: Predictive analytics begins with
collecting relevant data from various sources, including databases, data
warehouses, transactional systems, sensors, and external sources. This data
may include historical records, customer interactions, financial transactions,
sensor readings, and more. Data cleaning, preprocessing, and transformation
are essential steps to ensure the quality and consistency of the data before
analysis.
2. Feature Selection and Engineering: In predictive analytics, features are the
variables or attributes used to make predictions. Feature selection involves
identifying the most relevant features that contribute to the predictive
model's accuracy and removing irrelevant or redundant ones. Feature
engineering may also involve creating new features or transforming existing
ones to improve model performance.
3. Model Selection and Training: Predictive analytics employs various
statistical and machine learning models to analyze data and make
predictions. Common predictive modeling techniques include linear
regression, logistic regression, decision trees, random forests, support vector
machines, neural networks, and ensemble methods. Model selection involves
choosing the appropriate algorithm based on the problem domain, data
characteristics, and performance metrics. Once selected, the model is trained
using historical data to learn patterns and relationships between features and
outcomes (a small scikit-learn sketch appears after this list).
4. Evaluation and Validation: After training the predictive model, it is
evaluated using validation techniques to assess its performance and
generalization ability on unseen data. Common evaluation metrics include
accuracy, precision, recall, F1 score, area under the ROC curve (AUC), and
mean squared error (MSE). Cross-validation, holdout validation, and
resampling techniques are used to estimate the model's performance and
detect overfitting or underfitting issues.
5. Deployment and Integration: Once validated, the predictive model is
deployed into production environments where it can generate predictions in
real-time or batch mode. Integration with existing systems, applications, or
workflows is essential to operationalize the predictive insights and automate
decision-making processes. APIs, microservices, and cloud-based platforms
facilitate seamless integration and scalability of predictive analytics
solutions.
6. Continuous Monitoring and Optimization: Predictive models require
continuous monitoring and optimization to adapt to changing data patterns
and maintain their predictive accuracy over time. Monitoring involves
tracking model performance, detecting drifts or deviations in data
distributions, and retraining the model periodically with new data.
Techniques such as model retraining, feature updating, and ensemble
learning can help improve the model's performance and reliability in
dynamic environments.
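
A minimal scikit-learn sketch of the model training and evaluation steps above,
assuming a hypothetical churn.csv file with numeric feature columns and a
binary churned label; it is meant to show the train/validate workflow, not a
tuned production model.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, roc_auc_score

    # Load prepared data; the file and column names are placeholders.
    data = pd.read_csv("churn.csv")
    X = data.drop(columns=["churned"])
    y = data["churned"]

    # Hold out a validation set to estimate generalization performance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train a simple baseline model and evaluate it on unseen data.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    probabilities = model.predict_proba(X_test)[:, 1]
    print("accuracy:", accuracy_score(y_test, predictions))
    print("AUC:", roc_auc_score(y_test, probabilities))
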
Mobile Business Intelligence and Big Data:

Mobile Business Intelligence (BI) and Big Data are two interrelated concepts that
converge to empower organizations with data-driven decision-making capabilities,
especially in the increasingly mobile-centric business landscape. Here's how
Mobile BI and Big Data intersect and contribute to organizational success:

1. Accessibility and Real-Time Insights: Mobile BI leverages the capabilities
of mobile devices such as smartphones and tablets to provide stakeholders
with anytime, anywhere access to critical business insights. By integrating
with Big Data platforms and analytics tools, Mobile BI solutions enable
users to access real-time data dashboards, reports, and analytics on their
mobile devices, facilitating faster decision-making and responsiveness to
changing business conditions.
2. Data Visualization and Interactivity: Mobile BI applications leverage
interactive and intuitive data visualization techniques to present complex Big
Data analytics in a user-friendly format optimized for mobile screens.
Features like interactive charts, graphs, maps, and drill-down capabilities
allow users to explore and analyze large datasets on the go, uncovering
actionable insights and trends without being tethered to desktop computers.
3. Personalization and Customization: Mobile BI solutions enable
personalized and customized experiences tailored to the specific needs and
preferences of individual users. Advanced analytics capabilities powered by
Big Data platforms enable dynamic content generation, user segmentation,
and predictive analytics, allowing Mobile BI applications to deliver relevant
and context-aware insights to users based on their roles, interests, and
historical interactions.
4. Offline Access and Synchronization: Mobile BI applications support
offline access and data synchronization features, allowing users to access
and interact with BI content even when they are not connected to the
internet. Big Data technologies facilitate the replication and caching of
relevant data subsets on mobile devices, ensuring seamless access to critical
insights and analytics regardless of network connectivity or bandwidth
limitations.
5. Security and Compliance: Security is a paramount concern in Mobile BI
deployments, especially when dealing with sensitive business data and Big
Data analytics. Mobile BI solutions integrate robust security features such as
data encryption, authentication, authorization, and remote wipe capabilities
to protect against unauthorized access, data breaches, and compliance
violations. Big Data platforms also offer enterprise-grade security controls
and auditing capabilities to safeguard sensitive information and ensure
regulatory compliance.
6. Integration with Enterprise Systems: Mobile BI solutions seamlessly
integrate with existing enterprise systems, data warehouses, and Big Data
repositories to leverage the organization's data assets and analytics
infrastructure. Integration with enterprise resource planning (ERP) systems,
customer relationship management (CRM) platforms, and other business
applications enables Mobile BI users to access holistic views of business
performance and operational metrics, driving alignment, collaboration, and
informed decision-making across departments and functions.
7. Continuous Innovation and Evolution: Mobile BI and Big Data
technologies are continuously evolving to keep pace with the rapidly
changing business and technology landscape. Advances in mobile
computing, cloud infrastructure, artificial intelligence, and machine learning
are driving innovation in Mobile BI and Big Data analytics, enabling
organizations to unlock new insights, discover hidden opportunities, and stay
ahead of competition in today's data-driven economy.
Differences between Business Intelligence and Big Data Analytics:

Business intelligence (BI) and big data analytics are related but distinct concepts in
the realm of data analysis and decision-making within organizations.

1. Scope and Purpose:
o Business Intelligence (BI): BI focuses on analyzing structured data
from various sources within an organization to provide historical,
current, and predictive views of business operations. It aims to support
decision-making by offering insights into key performance indicators
(KPIs), trends, and patterns to help optimize business processes,
improve efficiency, and drive strategic planning.
o Big Data Analytics: Big data analytics deals with large volumes of
structured, semi-structured, and unstructured data. It involves
advanced analytical techniques to uncover hidden patterns,
correlations, and other insights within vast datasets that traditional BI
tools may struggle to handle. Big data analytics aims to extract value
from diverse data sources, often in real-time, to inform strategic
decisions and gain competitive advantages.
2. Data Types and Sources:
o BI typically deals with structured data stored in relational databases or
data warehouses, such as sales figures, financial data, and customer
information.
o Big data analytics handles a broader spectrum of data types, including
structured, semi-structured, and unstructured data from sources such
as social media, sensor data, log files, and multimedia content.
3. Technologies and Tools:
o BI tools often include reporting and querying software, online
analytical processing (OLAP), dashboards, and data visualization
tools designed to analyze structured data efficiently.
o Big data analytics relies on specialized technologies and tools capable
of processing and analyzing large and complex datasets, including
distributed computing frameworks like Hadoop and Spark, NoSQL
databases, machine learning algorithms, and data mining techniques.
4. Timeliness and Agility:
o BI solutions typically focus on providing timely insights into
historical and current data trends to support operational and strategic
decision-making.
o Big data analytics emphasizes real-time or near-real-time analysis of
large datasets to enable agile decision-making and respond quickly to
changing market conditions or emerging trends.
5. Business Impact:
o BI enables organizations to gain better visibility into their operations,
identify areas for improvement, and make data-driven decisions to
optimize performance and achieve business objectives.
o Big data analytics can uncover deeper insights, facilitate predictive
modeling, and support innovation by revealing new opportunities,
enhancing customer experiences, and driving competitive advantage
through data-driven strategies.

Role of Data Analyst:

The role of a data analyst is crucial in today's data-driven world, where
organizations rely heavily on data to make informed decisions and gain
competitive advantages. Here are some key aspects of the role:

1. Data Collection and Cleaning:
o Data analysts are responsible for collecting data from various sources,
including databases, spreadsheets, APIs, and more.
o They also clean and preprocess data to ensure accuracy, consistency,
and reliability, which involves tasks such as handling missing values,
removing duplicates, and standardizing formats.
2. Data Analysis and Interpretation:
o Data analysts use statistical and analytical techniques to analyze
datasets and extract meaningful insights.
o They identify trends, patterns, correlations, and outliers within the
data to provide actionable recommendations to stakeholders.
o This may involve using tools like SQL for querying databases,
statistical software like R or Python for analysis, and data
visualization tools like Tableau or Power BI to communicate findings
effectively (a small pandas sketch appears after this list).
3. Reporting and Visualization:
o Data analysts create reports, dashboards, and visualizations to present
their findings in a clear and understandable manner.
o They use charts, graphs, and other visual aids to convey complex data
insights to non-technical stakeholders, such as executives or business
managers.
o Effective visualization helps stakeholders grasp key insights quickly
and facilitates data-driven decision-making.
4. Predictive Modeling and Forecasting:
o Some data analysts are involved in predictive modeling and
forecasting, where they build statistical models to predict future trends
or outcomes based on historical data.
o This may include techniques like regression analysis, time series
analysis, machine learning, and data mining to forecast sales,
customer behavior, or other business metrics.
5. Continuous Improvement and Optimization:
o Data analysts play a role in identifying opportunities for process
improvement and optimization based on data insights.
o They monitor key performance indicators (KPIs) and metrics to track
performance over time, identify areas of underperformance or
inefficiency, and propose solutions for improvement.
6. Collaboration and Communication:
o Data analysts often collaborate with cross-functional teams, including
business stakeholders, data engineers, data scientists, and IT
professionals.
o Effective communication skills are essential for translating technical
findings into actionable insights and facilitating collaboration across
departments.
o They also work closely with data scientists to refine models and
algorithms or with data engineers to ensure data quality and integrity.
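A brief pandas sketch of the collect-clean-analyze cycle described above. The
orders.csv file and its columns are assumptions for illustration; the point is
the sequence of cleaning, aggregation, and a simple chart handed to
stakeholders.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Collect: load raw data from a (hypothetical) export.
    orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

    # Clean: drop exact duplicates and rows missing the revenue figure.
    orders = orders.drop_duplicates()
    orders = orders.dropna(subset=["revenue"])

    # Analyze: monthly revenue and the top product categories.
    monthly_revenue = (
        orders.groupby(orders["order_date"].dt.to_period("M"))["revenue"].sum()
    )
    top_categories = (
        orders.groupby("category")["revenue"]
              .sum()
              .sort_values(ascending=False)
              .head(5)
    )
    print(top_categories)

    # Report: a simple chart for non-technical stakeholders.
    monthly_revenue.plot(kind="line", title="Monthly revenue")
    plt.tight_layout()
    plt.savefig("monthly_revenue.png")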

Classification of Analytics:

Analytics can be broadly classified into several categories based on the type of data
being analyzed, the objectives of the analysis, and the techniques used. Here's a
detailed elucidation of these classifications:

1. Descriptive Analytics:
o Descriptive analytics focuses on summarizing historical data to
understand what has happened in the past. It involves analyzing data
to uncover patterns, trends, and relationships.
o This type of analytics provides insights into key performance
indicators (KPIs) and metrics, such as sales figures, website traffic,
customer demographics, and product performance.
o Common techniques used in descriptive analytics include data
aggregation, data visualization, and basic statistical analysis (a small
sketch contrasting descriptive and predictive analytics appears after this list).
2. Diagnostic Analytics:
o Diagnostic analytics aims to answer the question "Why did it
happen?" by drilling down into the factors that influenced past events
or outcomes.
o It involves deeper analysis of data to identify root causes, correlations,
and relationships between different variables.
o Techniques used in diagnostic analytics include regression analysis,
correlation analysis, and hypothesis testing to uncover causal
relationships and understand the factors driving specific outcomes.
3. Predictive Analytics:
o Predictive analytics focuses on forecasting future trends and outcomes
based on historical data and statistical models.
o It involves using advanced statistical and machine learning algorithms
to analyze historical data and identify patterns that can be used to
predict future behavior.
o Predictive analytics is used in various applications such as demand
forecasting, risk management, churn prediction, and fraud detection.
o Common techniques used in predictive analytics include regression
analysis, time series forecasting, decision trees, and machine learning
algorithms like logistic regression, random forests, and neural
networks.
4. Prescriptive Analytics:
o Prescriptive analytics goes beyond predicting future outcomes to
recommend actions that can optimize future results.
o It involves using optimization and simulation techniques to evaluate
various possible actions and their potential impact on business
objectives.
o Prescriptive analytics helps decision-makers make informed choices
by providing recommendations based on predictive models and
business constraints.
o Techniques used in prescriptive analytics include linear programming,
simulation modeling, and decision analysis.
5. Diagnostic vs Predictive vs Prescriptive:
o Diagnostic analytics looks at past data to understand why something
happened.
o Predictive analytics forecasts what is likely to happen in the future
based on historical data.
o Prescriptive analytics recommends actions to take advantage of future
opportunities or mitigate future risks based on predictive models.
6. Other Classifications:
o Apart from the above categories, analytics can also be classified based
on the type of data being analyzed, such as text analytics for
unstructured data like customer reviews or social media posts, or
spatial analytics for geographical data.
o Additionally, analytics can be categorized based on industry focus,
such as healthcare analytics, financial analytics, marketing analytics,
etc.
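
To contrast descriptive and predictive analytics in code, the sketch below
first summarizes historical sales (descriptive) and then fits a simple trend
line to project the next month (predictive). The monthly_sales.csv file is
hypothetical, and a real forecast would use a proper time-series or
machine-learning model rather than a straight line.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Historical data: one row per month with a total_sales column (hypothetical file).
    sales = pd.read_csv("monthly_sales.csv")

    # Descriptive analytics: what has happened so far?
    print("mean monthly sales:", sales["total_sales"].mean())
    print("best month:", sales.loc[sales["total_sales"].idxmax()])

    # Predictive analytics: fit a simple trend over the month index and
    # extrapolate one month ahead.
    X = sales.index.to_numpy().reshape(-1, 1)
    y = sales["total_sales"].to_numpy()
    trend = LinearRegression().fit(X, y)
    next_month = [[len(sales)]]
    print("projected next month:", trend.predict(next_month)[0])
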
Types of NoSQL Databases:

NoSQL (Not Only SQL) databases are a family of database management systems
that diverge from traditional relational databases (SQL databases) in favor of more
flexible data models, better scalability, and higher performance for certain types of
applications. There are several types of NoSQL databases, each designed to handle
specific data storage and retrieval requirements. Here are the main types:

1. Key-Value Stores:
o Key-value stores are the simplest form of NoSQL databases, where
each data item is stored as a key-value pair.
o Data retrieval is fast, as it involves a simple lookup based on the key.
o Examples include Redis, Amazon DynamoDB, and Riak (a small Redis
and MongoDB sketch appears after this list).
2. Document Stores:
o Document stores store data in semi-structured formats like JSON or
XML documents.
o Each document can have a different structure, allowing for flexibility
in data representation.
o Queries are typically performed on the document structure or specific
fields within documents.
o Examples include MongoDB, Couchbase, and CouchDB.
3. Column-Family Stores (Wide-Column Stores):
o Column-family stores organize data into columns rather than rows,
making them suitable for storing and querying large datasets with
dynamic schemas.
o Data is grouped into column families, which can have different
columns.
o Queries can be performed on individual columns or across column
families.
o Examples include Apache Cassandra, HBase, and Google Bigtable.
4. Graph Databases:
o Graph databases are designed to represent and store data as graphs,
consisting of nodes (entities) and edges (relationships).
o They excel at handling complex relationships and interconnected data.
o Queries are expressed in graph-based query languages like Cypher or
SPARQL.
o Examples include Neo4j, Amazon Neptune, and JanusGraph.
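
As a brief illustration of the key-value and document models, the sketch below
stores and retrieves data with Redis (via the redis package) and MongoDB (via
pymongo). It assumes both servers are running locally on their default ports;
names such as session:42 and the users collection are placeholders.

    import redis
    from pymongo import MongoClient

    # Key-value store: simple set/get against a local Redis server.
    kv = redis.Redis(host="localhost", port=6379, decode_responses=True)
    kv.set("session:42", "user=alice;cart=3")
    print(kv.get("session:42"))

    # Document store: insert and query a flexible JSON-like document in MongoDB.
    client = MongoClient("mongodb://localhost:27017")
    users = client["analytics_demo"]["users"]
    users.insert_one({"name": "Alice", "interests": ["big data", "chess"], "visits": 12})
    print(users.find_one({"name": "Alice"}))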
