Big Data Analytics-Digital Notes
2024-25
BIG DATA ANALYTICS
(R20A0520 )
LECTURE NOTES
Department of Computational Intelligence
CSE (Artificial Intelligence and Machine Learning)
COURSE OBJECTIVES:
The objectives of this course are,
1. To learn the need for Big Data and the various challenges involved, and to acquire knowledge about different analytical architectures.
2. To understand the Hadoop architecture and its ecosystem.
3. To understand the Hadoop ecosystem and acquire knowledge about NoSQL databases.
4. To acquire knowledge about the NewSQL, MongoDB, and Cassandra databases.
5. To learn the processing of Big Data with advanced architectures like Spark.
UNIT – I
Introduction to big data: Data, Characteristics of data and Types of digital data: Unstructured, Semi-
structured and Structured - Sources of data. Big Data Evolution -Definition of big data- Characteristics and
Need of big data-Challenges of big data. Big data analytics, Overview of business intelligence.
UNIT – II
Big data technologies and Databases: Hadoop – Requirement of Hadoop Framework - Design principle
of Hadoop –Comparison with other system SQL and RDBMS- Hadoop Components – Architecture -
Hadoop 1 vs Hadoop 2.
UNIT – III
MapReduce and YARN framework: Introduction to MapReduce , Processing data with Hadoop using
MapReduce, Introduction to YARN, Architecture, Managing Resources and Applications with Hadoop
YARN.
Big data technologies and Databases: NoSQL: Introduction to NoSQL - Features and Types-
Advantages & Disadvantages -Application of NoSQL.
UNIT - IV
New SQL: Overview of New SQL - Comparing SQL, NoSQL and NewSQL.
Mongo DB: Introduction – Features – Data types – Mongo DB Query language – CRUD
operations – Arrays – Functions: Count – Sort – Limit – Skip – Aggregate – Map Reduce. Cursors –
Indexes – Mongo Import – Mongo Export.
Cassandra: Introduction – Features – Data types – CQLSH – Key spaces – CRUD operations –
Collections – Counter – TTL – Alter commands – Import and Export – Querying System tables.
UNIT - V
(Big Data Frame Works for Analytics) Hadoop Frame Work: Map Reduce Programming: I/O formats,
Map side join-Reduce Side Join-Secondary Sorting-Pipelining MapReduce jobs
Spark Frame Work: Introduction to Apache Spark - How Spark works, Programming with RDDs: Create RDD - Spark Operations - Data Frame.
TEXT BOOKS:
1. Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt. Ltd.,
2016.
2. Mike Frampton, “Mastering Apache Spark”, Packt Publishing, 2015.
REFERENCE BOOKS:
1. Tom White, “Hadoop: The Definitive Guide”, O'Reilly, 4th Edition, 2015.
2. Mohammed Guller, “Big Data Analytics with Spark”, Apress, 2015.
3. Donald Miner, Adam Shook, “MapReduce Design Patterns”, O'Reilly, 2012.
COURSE OUTCOMES:
On successful completion of the course, students will be able to,
1. Demonstrate knowledge of Big Data, data analytics, and the challenges and their solutions in Big Data.
2. Analyze the Hadoop framework and its ecosystem.
3. Compare and work with NoSQL environments, MongoDB, and Cassandra.
4. Apply MapReduce programming to Big Data in both the Hadoop and Spark frameworks.
5. Analyze data analytics algorithms in Spark.
UNIT ‐ I
Introduction to big data: Data, Characteristics of data and Types of digital data:
Unstructured, Semi- structured and Structured - Sources of data. Big Data Evolution
-Definition of big data-Characteristics and Need of big data-Challenges of big data. Big data
analytics, Overview of business intelligence.
What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations or
just descriptions of things.
For example, data might include individual prices, weights, addresses, ages, names,
temperatures, dates, or distances.
Characteristics of Data
The following are six key characteristics of data, which are discussed below:
1. Accuracy
2. Validity
3. Reliability
4. Timeliness
5. Relevance
6. Completeness
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only once,
although it may have multiple uses. Data should be captured at the point of activity.
2. Validity
Data should be recorded and used in compliance with relevant requirements, including the
correct application of any rules or definitions. This will ensure consistency between periods
and with similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection points and
over time. Progress toward performance targets should reflect real changes rather than
variations in data collection approaches or methods. Source data is clearly identified and
readily available from manual, automated, or other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period. Data must be available quickly
and frequently enough to support information needs and to influence service or management
decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will require a
periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or file.
It has a particular data model.
It is meaningful data.
Data is arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried and analysed.
E.g.: relational databases, spreadsheets.
Structured data is often managed using Structured Query Language (SQL).
Unstructured Data:
Unstructured data cannot be readily classified and fitted into a neat box.
It is also called unclassified data.
It does not conform to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data, webpages, Pdf
files, PowerPoint presentations, emails, blog entries, wikis and word processing
documents.
Semi-structured Data:
Semi-structured data does not conform to a fixed relational schema, but it carries tags or markers (for example, XML and JSON documents) that separate and label elements within the data. It offers several advantages:
Faster data processing: Semi-structured data can be processed more quickly than
traditional structured data, as it can be indexed and queried in a more flexible way.
This makes it easier to retrieve specific subsets of data for analysis and reporting.
Improved data integration: Semi-structured data can be more easily integrated with
other types of data, such as unstructured data, making it easier to combine and analyze
data from multiple sources.
Richer data analysis: Semi-structured data often contains more contextual information
than traditional structured data, such as metadata or tags. This can provide additional
insights and context that can improve the accuracy and relevance of data analysis.
However, semi-structured data also has some drawbacks:
Data security: Semi-structured data can be more difficult to secure than structured
data, as it may contain sensitive information in unstructured or less- visible parts of
the data. This can make it more challenging to identify and protect sensitive
information from unauthorized access.
Overall, while semi-structured data offers many advantages in terms of flexibility and
scalability, it also presents some challenges and limitations that need to be carefully
considered when designing and implementing data processing and analysis pipelines.
Big Data
Big Data is a collection of data that is huge in volume and yet grows exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is data, but of very large size.
Examples of Big Data:
New York Stock Exchange: The New York Stock Exchange is an example of Big Data; it generates about one terabyte of new trade data per day.
Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Characteristics of Big Data (the 5 Vs)
Volume:
The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Variety:
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but today it arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Veracity
Veracity refers to how reliable the data is. Because data arrives with noise and inconsistencies, it must be filtered and translated so that it can be handled and managed efficiently.
Value
Value is an essential characteristic of big data. Raw data by itself is not useful; it is the valuable and reliable data that we store, process, and analyze that delivers value.
Velocity
Velocity refers to the speed at which data is created, collected, and processed in real time. It covers the arrival rate of incoming data sets, their rate of change, and activity bursts. A primary aspect of Big Data is providing demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Why is Big Data Important?
Companies use Big Data to learn what their customers want, who their best customers are, and why people choose different products. The more a company knows about its customers, the more competitive it becomes.
We can use it with Machine Learning for creating market strategies based on predictions
about customers. Leveraging big data makes companies customer-centric.
Companies can use historical and real-time data to assess evolving consumer preferences. This enables businesses to improve and update their marketing strategies, which makes them more responsive to customer needs.
Big Data's importance does not revolve around how much data a company has; it lies in how the company utilizes the gathered data. Every company uses its collected data in its own way: the more effectively a company uses its data, the more rapidly it grows.
Companies in today's market need to collect and analyze data because of:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when
they have to store large amounts of data. These tools help organizations in identifying more
effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools
like Hadoop help them to analyze data immediately thus helping in making quick decisions
based on the learnings.
3. Understanding the market conditions
Big Data analysis helps businesses get a better understanding of market situations. For example, analysis of customer purchasing behavior helps a company identify the products that sell the most and produce those products accordingly. This helps it get ahead of its competitors.
Challenges of Big Data
1. Storing huge volumes of data
It is in the name: big data is big. Most companies are increasing the amount of data they collect daily. Eventually, the storage capacity a traditional data center can provide will be inadequate, which worries many business leaders. Forty-three percent of IT decision-makers in the technology sector worry about this data influx overwhelming their infrastructure [2].
To handle this challenge, companies are migrating their IT infrastructure to the cloud.
Cloud storage solutions can scale dynamically as more storage is needed. Big data
software is designed to store large volumes of data that can be accessed and queried
quickly.
2. Integrating data from a variety of sources
The data itself presents another challenge to businesses. There is a lot of it, and it is also diverse because it can come from many different sources. A business could have analytics data from multiple websites, sharing data from social media, user information from CRM software, email data, and more. None of this data is structured the same way, yet it may have to be integrated and reconciled to gather the necessary insights and create reports.
To deal with this challenge, businesses use data integration software, ETL software, and
business intelligence software to map disparate data sources into a common structure
and combine them so they can generate accurate reports.
3. Ensuring data quality
Data arriving from many different sources is often incomplete, inconsistent, or corrupted, so its quality has to be checked before analysis. Fortunately, there are solutions for this. Data governance applications will help organize, manage, and secure the data you use in your big data projects while also validating data sources against what you expect them to be and cleaning up corrupted and incomplete data sets. Data quality software can also be used specifically for the task of validating and cleaning your data before it is processed.
4. Choosing the right big data tools
Selecting from the many available big data technologies can also be a challenge. Big data software comes in many varieties, and their capabilities often overlap. How do you make sure you are choosing the right big data tools? Often, the best option is to hire a consultant who can determine which tools will fit best with what your business wants to do with big data. A big data professional can look at your current and future needs and choose an enterprise data streaming or ETL solution that will collect data from all your data sources and aggregate it. They can configure your cloud services to scale dynamically based on workloads. Once your system is set up with big data tools that fit your needs, the system will run seamlessly with very little maintenance.
5. Planning the data strategy
When your business begins a data project, start with goals in mind and strategies for how you will use the data you have available to reach those goals. The team involved in implementing a solution needs to plan the type of data they need and the schemas they will use before they start building the system, so the project doesn't go in the wrong direction. They also need to create policies for purging old data from the system once it is no longer useful.
6. Shortage of big data skills
Big data projects need people who understand the tools and the data, and such specialists are in short supply. There are a few ways to solve this problem. One is to hire a big data specialist and have that specialist manage and train your data team until they are up to speed. The specialist can either be hired as a full-time employee or as a consultant who trains your team and moves on, depending on your budget.
Another option, if you have time to prepare ahead, is to offer training to your current team members so they will have the skills once your big data project is in motion.
7. Becoming a data-driven organization
Getting an organization to base its decisions on data can be a hard challenge to tackle, but it can be done. You can start with a smaller project and a small team, let the results of that project prove the value of big data to other leaders, and gradually become a data-driven business. Another option is placing big data experts in leadership roles so they can guide your business towards transformation.
Overview of Business Intelligence
Business Intelligence (BI) has a direct impact on an organization's strategic, tactical, and operational business decisions. BI supports fact-based decision making using historical data rather than assumptions and gut feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and charts to provide users with detailed intelligence about the nature of the business.
How does a BI system work?
Step 1) Raw data is extracted from the organization's corporate databases. The data could be spread across multiple heterogeneous systems.
Step 2) The data is cleaned and transformed into the data warehouse. The tables can be linked, and data cubes are formed.
Step 3) Using the BI system, the user can ask queries, request ad-hoc reports, or conduct any other analysis.
BI System Advantages
Improved visibility
BI helps to improve the visibility of business processes and makes it possible to identify any areas that need attention.
Fixed accountability
A BI system assigns accountability in the organization, as there must be someone who owns accountability and ownership for the organization's performance against its set goals.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as medium-sized enterprises. The use of such a system may be expensive for routine business transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the data warehouse. It can be so complex that it makes business techniques rigid to deal with.
3. Limited use
Like all improved technologies, BI was first established with the buying capacity of rich firms in mind. Therefore, a BI system is still not affordable for many small and medium-sized companies.
UNIT ‐ II
Big data technologies and Databases: Hadoop – Requirement of Hadoop Framework
- Design principle of Hadoop –Comparison with other system SQL and RDBMS- Hadoop
Components – Architecture -Hadoop 1 vs Hadoop 2.
Requirement of Hadoop Framework
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Comparison of Hadoop with SQL and RDBMS

Fault Tolerance: Hadoop is highly fault-tolerant. SQL has good fault tolerance.

Availability: Because Hadoop uses the notion of distributed computing and the principle of map-reduce, it handles data availability on multiple systems across multiple geo-locations. SQL-supporting databases are usually available on-premises or in the cloud, so they cannot utilize the benefits of distributed computing.

Integrity: Hadoop has low integrity. SQL has high integrity.

Scaling: Scaling a Hadoop-based system requires connecting computers over the network; horizontal scaling with Hadoop is cheap and flexible. Scaling in SQL requires purchasing additional SQL servers and configuration, which is expensive and time-consuming.

Data Processing: Hadoop supports large-scale batch data processing known as Online Analytical Processing (OLAP), making it batch-oriented. SQL supports real-time data processing known as Online Transaction Processing (OLTP), making it interactive.

Execution Time: Statements in Hadoop are executed very quickly even when millions of queries are executed at once. SQL syntax can be slow when executed over millions of rows.

Interaction: Hadoop uses Java Database Connectivity (JDBC) to interact with SQL systems to transfer and receive data between them. SQL systems can read and write data to Hadoop systems.

Support for ML and AI: Hadoop supports advanced machine learning and artificial intelligence techniques. SQL's support for ML and AI is limited compared to Hadoop.

Skill Level: Hadoop requires an advanced skill level to use proficiently. The SQL skill level required is intermediate.

Comparison of RDBMS with Hadoop (Hive)

Data Storage: RDBMS is used for average-sized data (GBs). Hadoop is used for large data sets (TBs and PBs).

Querying: RDBMS uses the SQL language. Hadoop uses HQL (Hive Query Language).

Schema: RDBMS requires the schema on write (static schema). Hadoop requires the schema on read (dynamic schema).

Speed: In an RDBMS, reads are fast. In Hadoop, both reads and writes are fast.

Cost: RDBMS requires a license. Hadoop is free.
Components of Hadoop
There are three core components of Hadoop, as mentioned earlier: HDFS, MapReduce, and YARN. Together these form the Hadoop framework architecture.
1. HDFS (Hadoop Distributed File System):
It is the storage unit of Hadoop. Data is split into blocks and stored in a distributed manner across the DataNodes of the cluster.
2. MapReduce:
It is the data-processing unit of Hadoop. The MasterNode distributes the work to the SlaveNodes; the SlaveNodes do the processing and send the results back to the MasterNode.
Features:
Consists of two phases, Map Phase and Reduce Phase.
Processes big data faster, with multiple nodes working in parallel.
3. YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework. The data stored in HDFS can be processed with the help of YARN using data-processing engines such as interactive processing. It can be used for any sort of data analysis.
Features:
It acts like an operating system for the cluster, managing resources for the data stored in HDFS.
It helps to schedule tasks to avoid overloading any system.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker and NameNode, whereas the slave nodes include the TaskTracker and DataNode.
NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the Job Tracker.
Task Tracker
o It works as a slave node for the Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce
job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task
Trackers. Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Hadoop 1 vs Hadoop 2
1. Components:
Hadoop 1: HDFS (storage) and MapReduce (processing and resource management).
Hadoop 2: HDFS (storage), YARN (resource management), and MapReduce (processing).
2. Daemons:
Hadoop 1: NameNode, DataNode, Secondary NameNode, Job Tracker, Task Tracker.
Hadoop 2: NameNode, DataNode, Secondary NameNode, Resource Manager, Node Manager.
3. Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both the resource manager and the data-processing engine. Because of this double workload on MapReduce, performance is affected.
In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which works as the resource manager. It allocates the resources and keeps everything running.
4. Limitations:
Hadoop 1 has a single master, so if that master crashes the cluster is unavailable; the master is a single point of failure.
Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (active NameNodes and standby NameNodes) and multiple slaves. If the active master node crashes, a standby master node takes over. You can configure multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
5. Ecosystem
Oozie is basically a workflow scheduler. It decides when jobs should execute according to their dependencies.
Pig, Hive and Mahout are data-processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and export data between HDFS and SQL databases.
Flume is used to import and export unstructured data and streaming data.
UNIT ‐ III
MapReduce and YARN framework: Introduction to MapReduce , Processing data with
Hadoop using MapReduce, Introduction to YARN, Architecture, Managing Resources and
Applications with Hadoop YARN.
Big data technologies and Databases: NoSQL: Introduction to NoSQL - Features and
Types- Advantages & Disadvantages -Application of NoSQL.
MapReduce is a software framework and programming model used for processing huge amounts of data. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster.
The input to each phase is key-value pairs. In addition, every programmer needs to specify
two functions: map function and reduce function.
Let us understand more about MapReduce and its components. MapReduce majorly has the
following three Classes. They are,
Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the corresponding key-value pair. Hadoop stores this intermediate data on the local disk.
Input Split
It is the logical representation of data. It represents a block of work that contains a single map
task in the MapReduce Program.
RecordReader
It interacts with the Input split and converts the obtained data in the form of Key- Value
Pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which processes it
and generates the final output which is then saved in the HDFS.
Driver Class
The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names.
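To make the Mapper and Reducer roles concrete, the following is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce logic be plain Python scripts that read from standard input and emit tab-separated key-value pairs. The script names and the submission command are illustrative assumptions only; the driver role is played here by the hadoop streaming command that wires the two scripts together.

# mapper.py - plays the Mapper role: emit (word, 1) for every word in the input split
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - plays the Reducer role: Hadoop delivers the mapper output sorted by key,
# so equal words arrive together and can be summed into the final count
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A typical (illustrative) submission would pass both scripts to the Hadoop Streaming jar shipped with the local Hadoop installation, for example: hadoop jar hadoop-streaming.jar -input /input -output /output -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py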
Introduction to YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java-only MapReduce and lets other applications such as HBase and Spark work on the cluster. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
The JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling resources and checking progress. Hadoop 2.0 has the ResourceManager and NodeManager to overcome the shortfalls of the JobTracker and TaskTracker.
Apache Yarn Framework consists of a master daemon known as “Resource Manager”, slave
daemon called node manager (one per slave node) and Application Master (one per
application).
1. Resource Manager
It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and memory) among all the applications and arbitrates system resources between competing applications. The Resource Manager has two main components:
Scheduler
Application Manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications. It is a pure scheduler: it performs no monitoring or tracking of the applications and gives no guarantee about restarting failed tasks, whether they fail due to application failure or hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of
failures.
2. Node Manager
It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource usage and reporting the same to the ResourceManager. It manages the user processes on that machine. The NodeManager also tracks the health of the node on which it is running. The design also allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup. Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
3. Application Master
One Application Master runs per application. It negotiates resources from the Resource Manager and works with the Node Manager. It manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to start the application's individual tasks.
NoSQL
Introduction to NoSQL
We know that MongoDB is a NoSQL database, so it is necessary to understand NoSQL databases before studying MongoDB thoroughly.
Features of NoSQL
Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn’t require object-relational mapping and data normalization
No complex features like query languages, query planners, referential integrity, joins, or ACID
Schema-free
NoSQL databases are either schema-free or have relaxed schemas
Do not require any sort of definition of the schema of the data
Offers heterogeneous structures of data in the same domain
Simple API
Offers easy-to-use interfaces for storing and querying the data provided
APIs allow low-level data manipulation & selection methods
Text-based protocols mostly used with HTTP REST and JSON
Mostly no standards-based NoSQL query language is used
Web-enabled databases running as internet-facing services
Distributed
Multiple NoSQL databases can be executed in a distributed fashion
Offers auto-scaling and fail-over capabilities
Often ACID concept can be sacrificed for scalability and throughput
Mostly no synchronous replication between distributed nodes; instead asynchronous multi-master replication, peer-to-peer, or HDFS-style replication
Only providing eventual consistency
Shared Nothing Architecture. This enables less coordination and higher
distribution.
Types of NoSQL Databases
Key-Value Pair Based
This is one of the most basic NoSQL database examples. This kind of NoSQL database is used as a collection, dictionary, associative array, etc. Key-value stores help the developer to store schema-less data. They work best for shopping cart contents.
Redis, Dynamo, and Riak are some NoSQL examples of key-value store databases. They are all based on Amazon's Dynamo paper.
Column-based
Column-oriented databases work on columns and are based on BigTable paper by Google.
Every column is treated separately. Values of single column databases are stored
contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc.
as the data is readily available in a column.
Document-Oriented
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value
part is stored as a document. The document is stored in JSON or XML formats. The value is
understood by the DB and can be queried.
In a relational database you have rows and columns, and you have to know in advance what columns you have. A document database, by contrast, stores data as JSON-like objects, so you do not need to define the structure up front, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics & e-commerce applications. It should not be used for complex transactions which require multiple operations or queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented DBMS systems.
Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. An entity is stored as a node and the relationships as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast because they are already captured in the DB, and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
No Need for Separate Caching Layer
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with equal effect
Object-oriented programming which is easy to use and flexible
NoSQL databases don’t need a dedicated high-performance server
Support Key Developer Languages and Platforms
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency when multiple transactions are performed simultaneously.
When the volume of data increases, it becomes difficult to maintain unique values as keys.
Doesn't work as well with relational data
The learning curve is steep for new developers
Open-source options are not so popular for enterprises.
Applications of NoSQL
1. Data Mining and Data Storage
When a user wishes to mine a particular dataset from large amounts of data, NoSQL databases are a good place to begin. Data is the building block of technology that has led mankind to such great heights. Therefore, one of the most essential fields where NoSQL databases can be put to use is data mining and data storage.
2. Social Media Networking Sites
Social media sites like Facebook and Instagram often use open-source NoSQL databases to extract data that helps them keep track of their users and the activities going on around their platforms.
3. Software Development
The third application that we will be looking at is software development. Software
development requires extensive research on users and the needs of the masses that are met
through software development.
NoSQL databases are very useful in helping software developers keep tabs on their users, their details, and other user-related data that is important to note. That said, NoSQL databases are surely helpful in software development.
UNIT ‐ IV
New SQL: Overview of New SQL - Comparing SQL, NoSQL and NewSQL.
Mongo DB: Introduction – Features – Data types – Mongo DB Query language –
CRUD operations – Arrays – Functions: Count – Sort – Limit – Skip – Aggregate – Map
Reduce. Cursors – Indexes – Mongo Import – Mongo Export.
Cassandra: Introduction – Features – Data types – CQLSH – Key spaces – CRUD
operations – Collections – Counter – TTL – Alter commands – Import and Export –
Querying System tables.
NewSQL
Introduction to NewSQL
NewSQL is a modern relational database system that bridges the gap between SQL and
NoSQL. NewSQL databases aim to scale and stay consistent.
NoSQL databases scale while standard SQL databases are consistent. NewSQL attempts to
produce both features and find a middle ground. As a result, the database type solves the
problems in big data fields.
What is NewSQL?
NewSQL is a unique database system that combines ACID compliance with horizontal scaling. The database system strives to keep the best of both worlds: OLTP-based transactions and the high performance of NoSQL combine in a single solution.
Enterprises expect both high data quality and data integrity on large data volumes. When either becomes a problem, an enterprise chooses to:
Improve hardware, or
Create custom software for distributed databases
Both solutions are expensive on both a software and hardware level. NewSQL strives to
improve these faults by creating consistent databases that scale.
MongoDB
Introduction to MongoDB
MongoDB is an open-source document database that provides high performance, high
availability, and automatic scaling.
In simple words, you can say that MongoDB is a document-oriented database. It is an open-source product, developed and supported by a company named 10gen.
MongoDB is available for free under a General Public License, and it is also available under a commercial license from the manufacturer.
The manufacturing company 10gen has defined MongoDB as:
"MongoDB is a scalable, open source, high performance, document-oriented database." - 10gen
MongoDB was designed to work with commodity servers. Now it is used by companies of all sizes, across all industries.
Features of MongoDB
These are some important features of MongoDB:
1. Support ad hoc queries
In MongoDB, you can search by field, range query and it also supports regular expression
searches.
2. Indexing
You can index any field in a document.
3. Replication
MongoDB supports master-slave replication.
A master can perform reads and writes, while a slave copies data from the master and can only be used for reads or backup (not writes).
4. Duplication of data
MongoDB can run over multiple servers. The data is duplicated to keep the system up and to keep it running in case of hardware failure.
5. Load balancing
It has an automatic load balancing configuration because of data placed in shards.
6. Supports map reduce and aggregation tools.
7. Uses JavaScript instead of stored procedures.
8. It is a schema-less database written in C++.
9. Provides high performance.
10. Stores files of any size easily without complicating your stack.
11. Easy to administer in the case of failures.
12. It also supports:
o JSON data model with dynamic schemas
o Auto-sharding for horizontal scalability
MongoDB Datatypes
Following is a list of usable data types in MongoDB.
Data Type: Description
String: The most commonly used datatype. It is used to store text. A string must be valid UTF-8 in MongoDB.
Integer: Used to store numeric values. It can be 32-bit or 64-bit depending on the server you are using.
Boolean: Used to store boolean (true/false) values.
Min/Max Keys: Used to compare a value against the lowest and highest BSON elements.
Arrays: Used to store a list of multiple values in a single key.
Date: Stores the current date or time in UNIX time format. You can also specify your own date and time by creating a Date object and passing the day, month, and year into it.
Create Operations
For MongoDB CRUD, if the specified collection doesn’t exist, the create operation will create
the collection when it’s executed. Create operations in MongoDB target a single collection, not
multiple collections. Insert operations in MongoDB are atomic on a single document level.
MongoDB provides two different create operations that you can use to insert documents into
a collection:
db.collection.insertOne()
db.collection.insertMany()
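As an illustration, the same two create operations can be issued from Python with the PyMongo driver; the database, collection, and field names below are hypothetical and used only for the example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect to a local MongoDB instance
db = client["school"]                              # hypothetical database name

# insertOne: add a single document to the "students" collection
db.students.insert_one({"name": "Asha", "grade": "A", "marks": 91})

# insertMany: add several documents in one call
db.students.insert_many([
    {"name": "Ravi", "grade": "B", "marks": 78},
    {"name": "Meena", "grade": "A", "marks": 88},
])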
Read Operations
The read operations allow you to supply special query filters and criteria that let you specify
which documents you want. The MongoDB documentation contains more information on the
available query filters. Query modifiers may also be used to change how many results are
returned.
MongoDB has two methods of reading documents from a collection:
db.collection.find()
db.collection.findOne()
Update Operations
Like create operations, update operations operate on a single collection, and they are atomic at
a single document level. An update operation takes filters and criteria to select the documents
you want to update.
You should be careful when updating documents, as updates are permanent and can’t be rolled
back. This applies to delete operations as well.
For MongoDB CRUD, there are three different methods of updating documents:
db.collection.updateOne()
db.collection.updateMany()
db.collection.replaceOne()
Delete Operations
Delete operations operate on a single collection, like update and create operations. Delete
operations are also atomic for a single document. You can provide delete operations with
filters and criteria in order to specify which documents you would like to delete from a
collection. The filter options rely on the same syntax that read operations utilize.
MongoDB has two different methods of deleting records from a collection:
db.collection.deleteOne()
db.collection.deleteMany()
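Continuing the same hypothetical students collection, here is a short PyMongo sketch of the read, update, and delete methods listed above:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["school"]

# Read: find() returns a cursor over all matching documents, find_one() a single document
for doc in db.students.find({"grade": "A"}):
    print(doc)
print(db.students.find_one({"name": "Ravi"}))

# Update: a filter selects the documents, an operator such as $set changes them
db.students.update_one({"name": "Ravi"}, {"$set": {"grade": "A"}})
db.students.update_many({"marks": {"$lt": 40}}, {"$set": {"grade": "F"}})

# Delete: the filter syntax is the same as for read operations
db.students.delete_one({"name": "Meena"})
db.students.delete_many({"grade": "F"})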
Arrays in MongoDB
Arrays can take many forms in MongoDB. We can define an array of strings, integers, embedded documents, and other JSON/BSON data types; an array field can hold values of any type. For example, in a student collection we might store the grades of each student as an array field holding multiple values. Arrays are essential and useful in MongoDB.
Syntax:
{ <array field>: { <operator1>: <value1>, <operator2>: <value2>, <operator3>: <value3>, ... } }
1. Array field: the name of the field that holds the array.
2. Operator: the operator that determines which values are matched or built in the array. Each operator name is paired with its value in the array expression.
3. Value: the actual value of the array element on which we are operating. The value is significant while defining an array.
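For example, a hypothetical student document with a grades array can be stored and queried with array operators such as $all and $size (a PyMongo sketch; the collection and values are made up):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["school"]

# A document whose "grades" field is an array of three values
db.students.insert_one({"name": "Kiran", "grades": [82, 90, 75]})

# $all matches documents whose grades array contains every listed value
print(db.students.find_one({"grades": {"$all": [82, 90]}}))

# $size matches documents whose grades array has exactly three elements
print(db.students.find_one({"grades": {"$size": 3}}))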
Count
The count() method counts the number of documents that match the selection criteria of a query.
On a sharded cluster, if you use this method without a query predicate, it can return an inaccurate count if orphaned documents exist or if a chunk migration is in progress. To avoid this situation, use the db.collection.aggregate() method instead.
Syntax:
db.collection.count(query)
Parameters:
query: the query selection criteria (type: document, required).
Sort
Syntax:
>db.COLLECTION_NAME.find().sort({KEY:1})
Limit
Syntax:
db.COLLECTION_NAME.find().limit(NUMBER)
Skip
Syntax:
cursor.skip(<offset>)
Aggregate
Syntax: the basic syntax of the aggregate() method is as follows:
>db.COLLECTION_NAME.aggregate(AGGREGATE_OPERATION)
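The same functions can be combined from PyMongo; note that the driver exposes the count through count_documents() and chains sort, skip, and limit on the cursor (the collection and field names are again hypothetical):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["school"]

# Count: number of documents that match the selection criteria
print(db.students.count_documents({"grade": "A"}))

# Sort (1 ascending, -1 descending), skip, and limit chained on a cursor
for doc in db.students.find().sort("marks", -1).skip(5).limit(10):
    print(doc)

# Aggregate: a pipeline that groups by grade and averages the marks
pipeline = [
    {"$group": {"_id": "$grade", "avg_marks": {"$avg": "$marks"}}},
    {"$sort": {"avg_marks": -1}},
]
for row in db.students.aggregate(pipeline):
    print(row)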
Map Reduce
Now we shall learn to use the mapReduce() function for performing aggregation operations on a MongoDB collection.
Syntax:
> db.collection.mapReduce(
      mapFunction,
      reduceFunction,
      { out: <output collection> }
  )
Cursors in MongoDB
Commonly used cursor methods include:
1. cursor.addOption(flag)
2. cursor.batchSize(size)
3. cursor.close()
4. cursor.collation(<collation document>)
5. cursor.forEach(function)
7. cursor.limit()
8. cursor.map(function)
9. cursor.max()
10. cursor.min()
11. cursor.tailable()
12. cursor.toArray()
Indexing in MongoDB
MongoDB uses indexing to make query processing more efficient. Without indexing, MongoDB must scan every document in the collection to retrieve only those documents that match the query. Indexes are special data structures that store some information related to the documents so that it becomes easy for MongoDB to find the right data file. The indexes are ordered by the value of the field specified in the index.
Creating an Index:
MongoDB provides a method called createIndex() that allows a user to create an index.
Syntax:
db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on which you want to create the index, and 1 (or -1) determines the order in which the index will be arranged (ascending or descending).
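A brief PyMongo sketch of creating and inspecting indexes on the hypothetical students collection:

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient("mongodb://localhost:27017")["school"]

# Single-field ascending index, equivalent to createIndex({name: 1})
db.students.create_index([("name", ASCENDING)])

# Compound index: grade ascending, marks descending
db.students.create_index([("grade", ASCENDING), ("marks", DESCENDING)])

# List the indexes that now exist on the collection
print(db.students.index_information())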
What is mongoimport
The mongoimport command is used to import your content from an extended JSON, CSV, or
TSV export created by mongoexport. It also supports restoring or importing data from other
third-party export tools.
This command is of immense help when it comes to managing your MongoDB database. It’s
super-fast and multi-threaded, more so than any custom script you might write to do your
import operation. The mongoimport command can be combined with other MongoDB
command-line tools, such as jq for JSON manipulation, csvkit for CSV manipulation,
or even curl for dynamically downloading data files from servers on the internet.
Syntax:
The mongoimport command has the following general form:
mongoimport --db <database> --collection <collection> --type <json|csv|tsv> --file <filename> [--headerline]
What is Mongoexport
The mongoexport command is used to export MongoDB collection data to a CSV or JSON file. By default, the mongoexport command connects to a mongod instance running on localhost port 27017.
Syntax:
mongoexport --db <database> --collection <collection> --fields <field1[,field2]> --type <json|csv> --out <filename>
Field_name(s): the name of the field (or multiple fields separated by commas) to be exported. It is optional in the case of a JSON file; if not specified, all the fields of the collection will be exported to the JSON file. It is suggested to properly specify the name and path of the exported file.
When exporting to CSV format, you must specify the fields in the documents to be exported.
When exporting to JSON format, the _id field is also exported by default, whereas it will not be exported to a CSV file unless it is listed in the field list.
Cassandra
What is Cassandra
Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database. Cassandra is designed to handle huge amounts of data across many commodity servers, providing high availability without a single point of failure.
Cassandra has a distributed architecture that is capable of handling a huge amount of data. Data is placed on different machines with a replication factor greater than one to attain high availability without a single point of failure.
Features of Cassandra
There are a lot of outstanding technical features which makes Cassandra very popular.
Following is a list of some popular features of Cassandra:
High Scalability
Cassandra is highly scalable; it lets you add more hardware to serve more customers and more data as per requirement.
Rigid Architecture
Cassandra has no single point of failure and is continuously available for business-critical applications that cannot afford a failure.
Fast Linear-scale Performance
Cassandra is linearly scalable. It increases your throughput as you increase the number of nodes in the cluster, and therefore maintains a quick response time.
Fault Tolerant
Cassandra is fault-tolerant. Suppose there are 4 nodes in a cluster and each node has a copy of the same data. If one node is no longer serving, the other three nodes can serve requests.
Flexible Data Storage
Cassandra supports all possible data formats like structured, semi-structured, and
unstructured. It facilitates you to make changes to your data structures according to your
need.
Easy Data Distribution
Data distribution in Cassandra is very easy because it provides the flexibility to distribute
data where you need by replicating data across multiple data centers.
Transaction Support
Cassandra supports properties like Atomicity, Consistency, Isolation, and Durability (ACID).
Fast writes
Cassandra was designed to run on cheap commodity hardware. It performs blazingly fast
writes and can store hundreds of terabytes of data, without sacrificing the read efficiency.
Cassandra CQLsh
Cassandra CQLsh stands for Cassandra CQL shell. CQLsh specifies how to use Cassandra
commands. After installation, Cassandra provides a prompt Cassandra query language shell
(cqlsh). It facilitates users to communicate with it.
Start CQLsh:
CQLsh is started by running the cqlsh command from the bin directory of the Cassandra installation. CQLsh provides a number of options, listed below:
help: Shows help topics about the options of CQLsh commands.
version: Shows the version of the CQLsh you are using.
execute: Directs the shell to accept and execute a CQL command.
file="file name": Cassandra executes the commands in the given file and exits.
u "user name": Authenticates a user. The default user name is: cassandra.
p "password": Authenticates a user with a password. The default password is: cassandra.
What is Keyspace?
A keyspace is an object that is used to hold column families and user-defined types. A keyspace is like an RDBMS database: it contains column families, indexes, user-defined types, data center awareness, the strategy used in the keyspace, the replication factor, etc.
Syntax:
CREATE KEYSPACE <keyspace_name> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': <N>};
Or
CREATE KEYSPACE <keyspace_name> WITH replication = {'class': 'NetworkTopologyStrategy', '<datacenter_name>': <N>};
Cassandra Collections
Cassandra collections are used to store multiple elements in a single column. There are three types of collection supported by Cassandra:
o Set
o List
o Map
Cassandra Counter
The counter is a special column used to store a number that is changed in increments. For example, you might use a counter column to count the number of times a page is viewed. A counter can be defined only in a dedicated table, using the counter data type.
Restrictions on the counter column: a counter column cannot be part of the primary key, and a table that contains a counter can have only counter columns apart from the primary key columns.
Now, we are going to create a table with a counter column. Let's have a look.
Create table View_Counts (
  count_view counter,
  name varchar,
  blog_name text,
  primary key(name, blog_name)
);
Let’s see the table schema.
describe table View_Counts;
Output:
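A counter column can only be changed by incrementing or decrementing it with UPDATE. The sketch below does this through the DataStax Python driver; the notes themselves use cqlsh, so the driver, the keyspace name, and the sample row values are assumptions for illustration.

from cassandra.cluster import Cluster

# Connect to a local Cassandra node and a hypothetical keyspace
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Increment the counter for one (name, blog_name) row
session.execute(
    "UPDATE View_Counts SET count_view = count_view + 1 "
    "WHERE name = 'cassandra' AND blog_name = 'counter_demo'"
)

# Read back the current value of the counter
row = session.execute(
    "SELECT count_view FROM View_Counts "
    "WHERE name = 'cassandra' AND blog_name = 'counter_demo'"
).one()
print(row.count_view)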
Cassandra TTL (Time To Live)
1. In Cassandra, both the INSERT and UPDATE commands support setting a time for data in a column to expire.
2. It is used to set a time limit for a specific period of time. With the USING TTL clause we can set the TTL value at the time of insertion.
3. We can use the TTL function to get the time remaining for a specific selected query.
4. At the point of insertion, we can set the expiry limit of the inserted data by using the TTL clause. If we want to set the expiry limit to two days, then we need to define the corresponding TTL value.
5. With an expiration period of two days, the TTL value will be 172800 seconds. Let's understand this with an example.
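A minimal sketch of the TTL idea through the DataStax Python driver, assuming a hypothetical student_Registration table with Id, Name, and Event columns in an existing keyspace; the CQL strings inside the quotes are what would otherwise be typed at the cqlsh prompt in the example that follows.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

session.execute(
    "CREATE TABLE IF NOT EXISTS student_Registration "
    "(Id int PRIMARY KEY, Name text, Event text)"
)

# Insert a row that expires after two days (172800 seconds)
session.execute(
    "INSERT INTO student_Registration (Id, Name, Event) "
    "VALUES (101, 'Asha', 'Workshop') USING TTL 172800"
)

# TTL(column) returns the remaining seconds before the value expires
row = session.execute(
    "SELECT TTL(Name) FROM student_Registration WHERE Id = 101"
).one()
print(row[0])  # a value slightly below 172800, shrinking on every check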
Table : student_Registration
To create the table, the following CQL query was used.
Output:
Id Name Event
Now, to determine the remaining time to expire for a specific column, use the following CQL query.
Output:
ttl(Name)
172700
The value will decrease each time you check its TTL again, because of the TTL time limit. Now, use the following CQL query to check again.
Output:
ttl(Name)
172500
Cassandra Alter Commands
Syntax:
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Using the ALTER command, you can:
o Add a column
o Drop a column
Adding a Column
Using ALTER command, you can add a column to a table. While adding columns, you have to
take care that the column name is not conflicting with the existing column names and that the
table is not defined with compact storage option.
Dropping a Column
Using the ALTER command, you can delete a column from a table. Before dropping a column from a table, check that the table is not defined with the compact storage option. The syntax to delete a column from a table using the ALTER command is:
ALTER TABLE <table_name> DROP <column_name>;
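For illustration, adding and then dropping a column on the hypothetical student_Registration table through the DataStax Python driver; the CQL inside the quotes is exactly what would be typed at the cqlsh prompt.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Add a new column; the name must not clash with an existing column
session.execute("ALTER TABLE student_Registration ADD email text")

# Drop the column again (not allowed on tables defined with COMPACT STORAGE)
session.execute("ALTER TABLE student_Registration DROP email")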
Cassandra Import and Export
To import data into a table from a CSV file:
COPY <table_name> [(<column_list>)] FROM '<file_name.csv>' WITH HEADER = TRUE;
To export data from a table to a CSV file:
COPY <table_name> [(<column_list>)] TO '<file_name.csv>' WITH HEADER = TRUE;
Querying System Tables
The system keyspace includes a number of tables that contain details about your Cassandra database objects and cluster configuration. Some examples:
local: columns include schema_version, thrift_version, tokens (set), and truncated_at (map). It stores the information a node has about itself.
peers: columns include peer, data_center, rack, release_version, ring_id, rpc_address, schema_version, and tokens. Each node records what other nodes tell it about themselves over gossip.
schema_columns: columns include keyspace_name, columnfamily_name, column_name, component_index, index_name, index_options, index_type, and validator. It is used internally with compound primary keys.
UNIT – V
(Big Data Frame Works for Analytics) Hadoop Frame Work: Map Reduce Programming:
I/O formats, Map side join-Reduce Side Join-Secondary Sorting- Pipelining MapReduce jobs
Spark Frame Work: Introduction to Apache Spark - How Spark works, Programming with RDDs: Create RDD - Spark Operations - Data Frame.
Anatomy of a File Read in HDFS
Let's get an idea of how data flows between the client interacting with HDFS, the name node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the File System
Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls
(RPCs), to determine the locations of the first few blocks in the file. For each block, the name
node returns the addresses of the data nodes that have a copy of that block. The DFS returns
an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and
name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly
on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Anatomy of a File Write in HDFS
Next, we'll look at how files are written to HDFS. Consider figure 1.2 to get a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we cannot edit the
files which are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on the Distributed File System object.
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s
namespace, with no blocks associated with it. The name node performs various checks to
make sure the file doesn’t already exist and that the client has the right permissions to create
the file. If these checks pass, the name node prepares a record of the new file; otherwise, the
file can’t be created and therefore the client is thrown an error i.e. IOException. The DFS
returns an FSDataOutputStream for the client to start out writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and
last) data node in the pipeline.
Step 5: The DFSOutputStream sustains an internal queue of packets that are waiting to be
acknowledged by data nodes, called an “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This action flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
Types of Join:
Depending upon the place where the actual join is performed, joins in Hadoop are classified
into-
1. Map-side join – When the join is performed by the mapper, it is called a map-side join. In this type, the join is performed before the data is actually consumed by the map function. It is mandatory that the input to each map is in the form of a partition and is in sorted order. Also, there must be an equal number of partitions, and the data must be sorted by the join key.
2. Reduce-side join – When the join is performed by the reducer, it is called a reduce-side join. There is no necessity in this join to have the datasets in a structured (or partitioned) form. Here, map-side processing emits the join key and the corresponding tuples of both tables. As an effect of this processing, all the tuples with the same join key fall into the same reducer, which then joins the records with that join key.
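The data flow of a reduce-side join can be sketched in plain Python: the "map" step tags each record with its source table and emits it under the join key, the "shuffle" step groups the tagged records by key, and the "reduce" step joins the records that share a key. This is only a single-process illustration of the idea with made-up data, not Hadoop code.

from collections import defaultdict

# Two hypothetical datasets that share the join key "emp_id"
employees = [{"emp_id": 1, "name": "Asha"}, {"emp_id": 2, "name": "Ravi"}]
salaries = [{"emp_id": 1, "salary": 50000}, {"emp_id": 2, "salary": 42000}]

# Map phase: emit (join_key, (table_tag, record)) for every record
mapped = [(e["emp_id"], ("EMP", e)) for e in employees]
mapped += [(s["emp_id"], ("SAL", s)) for s in salaries]

# Shuffle phase: group all tagged records by the join key
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reduce phase: within each key, combine the records of the two tables
for key, records in groups.items():
    emps = [r for tag, r in records if tag == "EMP"]
    sals = [r for tag, r in records if tag == "SAL"]
    for e in emps:
        for s in sals:
            print({**e, **s})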
Secondary Sorting:
Secondary sort is a technique that allows the MapReduce programmer to control the order in which the values show up within a reduce function call.
Assume, for example, that the secondary sorting is on a composite key made out of Last Name and First Name, with Last Name as the natural key.
The partitioner and the group comparator use only the natural key: the partitioner uses it to channel all records with the same natural key to a single reducer. This partitioning happens in the Map phase; data from the various Map tasks are received by reducers, where they are grouped and then sent to the reduce method. This grouping is where the group comparator comes into the picture: if we had not specified a custom group comparator, Hadoop would have used the default implementation, which considers the entire composite key and would have led to incorrect results.
Finally, reviewing the steps involved in an MR job and relating them to secondary sorting should help clear out any lingering doubts.
What is Spark?
Spark was built on top of Hadoop MapReduce. It was optimized to run in memory, whereas alternative approaches like Hadoop's MapReduce write data to and from disk. As a result, Spark processes data much more quickly than the alternatives.
o Fast - It provides high performance for both batch and streaming data, using a state-
of-the-art DAG scheduler, a query optimizer, and a physical execution engine.
o Easy to Use - It lets you write applications in Java, Scala, Python, R, and SQL. It also provides more than 80 high-level operators.
o Generality - It provides a collection of libraries including SQL and DataFrames, MLlib
for machine learning, GraphX, and Spark Streaming.
o Lightweight - It is a light unified analytics engine which is used for large scale data
processing.
o Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud.
How Spark Works
Spark has a small code base and the system is divided into various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications. As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph. When the user runs an action (like collect), the graph is submitted to a DAG Scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages. A stage is comprised of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages. The stages are passed on to the Task Scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
What is RDD?
The RDD (Resilient Distributed Dataset) is the Spark's core abstraction. It is a collection of
elements, partitioned across the nodes of the cluster so that we can execute various parallel
operations on it.
RDD Operations
o Transformation
o Action
Transformation
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program.
Transformations and their descriptions:
flatMap(func): Each input item can be mapped to zero or more output items, so func should return a sequence rather than a single item.
union(otherDataset): Returns a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset): Returns a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions]): Returns a new dataset that contains the distinct elements of the source dataset.
join(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable, Iterable)) tuples. This operation is also called groupWith.
pipe(command, [envVars]): Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script.
repartition(numPartitions): Reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them.
Action
In Spark, the role of action is to return a value to the driver program after running a computation on
the dataset.
Actions and their descriptions:
collect(): Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
takeOrdered(n, [ordering]): Returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path): Used to write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.
saveAsObjectFile(path) (Java and Scala): Used to write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
foreach(func): Runs a function func on each element of the dataset for side effects such as updating an Accumulator or interacting with external storage systems.
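A short PySpark sketch that ties the two tables together: the flatMap, map, and reduceByKey transformations only build the lazy operator graph, and nothing runs until the collect() action is called. The local[*] master and the sample lines are assumptions used only for illustration.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from an in-memory collection
lines = sc.parallelize(["big data analytics", "spark and big data"])

# Transformations: build the DAG lazily, nothing is computed yet
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

# Action: triggers the actual computation and returns the result to the driver
print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('analytics', 1), ...]

sc.stop()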
*****