Big Data and Hadoop Self Notes
UNIT I
Introduction to big data; big data: definition and taxonomy; big data value for the enterprise; setting up the demo environment; Hadoop architecture; Hadoop Distributed File System; MapReduce and HDFS; first steps with Hadoop; understanding the fundamentals of MapReduce.
1. Volume: Big data involves vast amounts of data, often reaching terabytes, petabytes, or exabytes.
2. Velocity: Data is generated and updated rapidly, requiring real-time or near-real-time processing.
3. Variety: It encompasses structured, semi-structured, and unstructured data in various formats, such as text,
images, videos, and more.
4. Value: The goal is to extract valuable insights and benefits from the data to inform decision-making and
innovation.
5. Veracity: Concerns data quality and accuracy, as big data often contains noisy or unreliable information.
Complexity: Big data projects are inherently complex due to the combination of volume, variety, velocity, and veracity. Managing and processing these datasets requires specialized tools, technologies, and skills. Data scientists and engineers often work with distributed computing frameworks like Hadoop and Spark, as well as machine learning and artificial intelligence algorithms, to make sense of big data.
Applications: Big data is applied across various industries and domains, including healthcare, finance, retail, manufacturing, energy, entertainment, and more. It can be used for purposes such as fraud detection, predictive maintenance, customer sentiment analysis, personalized marketing, and scientific research.
Challenges: While big data offers immense opportunities, it also presents significant challenges. These include data
privacy concerns, security risks, ethical considerations, and the need for talent with expertise in data analysis, machine
learning, and data engineering.
Technologies: Big data technologies and tools continue to evolve rapidly. Cloud computing platforms, NoSQL databases,
data lakes, and data warehousing solutions are just a few examples of the infrastructure and tools used to store,
process, and analyze big data.
Big data: definition and taxonomy; Big data value for the enterprise; Setting up the demo environment
Big Data, as the name itself suggests, refers to the huge amounts of data that are difficult to capture, manage or process,
even with the help of various software tools. Big Data requires the use of techniques and technologies such as predictive analytics of user behaviour and other advanced data analytics to obtain useful insights from it, which can be leveraged
further. According to Wikipedia, Big Data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate. It needs to be acquired, organised and analysed computationally to identify
certain patterns or trends that further facilitate the processing, updating or management of such huge amounts of data.
Data source: These are the datasets on which different Big Data techniques are implemented. They can exist in an
unstructured, semi-structured or structured format. There are unstructured datasets which are extracted from several
social media applications in the form of images, audio/video clips or text. The semi-structured datasets are generated by
different machines and require less effort to convert them to the structured form. Some data sets are already in the
structured form, as in the case of transaction information from several online applications or other master data.
Acquire: After various types of data sets are taken from several sources and inserted, they can either be written straight
away to real-time memory processes or can be written as messages to disk, database transactions or files. Once they are
received, there are various options for the persistence of these data. The data can either be written to several file
systems, to RDBMS or even various distributed-clustered systems like NoSQL and Hadoop Distributed File System.
Organise: This is the process of organising various acquired data sets so that they are in the appropriate form to be
analysed further. The quality and format of the data are changed at this stage by using various techniques to quickly evaluate
unstructured data, like running the map-reduce process (Hadoop) in batch or map-reduce process (Spark) in memory.
There are other evaluation options available for real-time streaming data as well. These are basically extensive processes
which enable an open ingest, data warehouse, data reservoir and analytical model. They extend across all types of data
and domains by managing the bi-directional gap between the new and traditional data processing environments. One of
their most important features is that they meet the criteria of the four Vs — a large volume and velocity, a variety of
data sets, and they also help in finding value wherever our analytics operate. In addition to that, they also provide all
sorts of data quality services, which help in maintaining metadata and keeping a track of transformation lineage as well.
Analyse: After the data sets are converted to an organised form, they are further analysed. So the processing output of
Big Data, after having been converted from low density data to high density data, is loaded into a foundation data layer.
Apart from the foundation data layer, it can also be loaded to various data warehouses, data discovery labs (sets of data
stores, processing engines and their analysis tools), data marts or back into the reservoir. As the discovery lab requires
fast connections to the event processing, data reservoir and data warehouse, a high speed network like InfiniBand is
required for data transport. This is where the reduction-results are basically loaded from processing the output of Big
Data into the data warehouse for further analysis.
We can see that both the reservoir and the data warehouse offer
in-situ analytics, which indicates that analytical processing can take place at the source system without the extra step
needed to move the data to some other analytical environment. SQL analytics allows for all sorts of simple and complex
analytical queries at each data store, independently. Hence, it is the point where the performance of the system plays a
big role as the faster the data is processed or analysed, the quicker is the decision-making process. There are many
options like columnar databases, in-memory databases or flash memory, using which performance can be improved by
several orders of magnitude.
Decide: This is where the various decision-making processes take place by using several advanced techniques in order to
come to a final outcome. This layer consists of several real-time, interactive and data modelling tools. They are able to
query, report and model data while leaving the large amount of data in place. These tools include different advanced
analytics, in-reservoir and in-database statistical analysis, advanced visualisation, as well as the traditional components
such as reports, alerts, dashboards and queries.
1. The analysis and distillation of Big Data in combination with various traditional enterprise data, leads to the
development of a more thorough and insightful understanding of the business, for enterprises. It can lead to greater
productivity, greater innovation and a stronger competitive position.
2. Big Data plays an increasingly important role in healthcare services. It helps in the management of chronic or other
long-term conditions of patients by using in-home monitoring devices, which measure vital signs and check the progress
of patients to improve their health and reduce both hospital admissions and visits to doctors’ clinics.
3. Manufacturing companies also deploy sensors in their products to gather data remotely, as in the case of General Motors' OnStar or Renault's R-Link. These help in delivering communications, navigation and security services. They also
reveal usage patterns, rates of failure and other such opportunities for product improvement that can further reduce
assembly and development costs.
4. The phenomenal increase in the use of smartphones and other GPS devices provides advertisers an opportunity to
target their consumers when they are in close proximity to a store, restaurant or coffee shop. Retailers get to know the avid buyers of their products better. The use of social media and Web log files from their e-commerce sites helps them get information about those who didn't buy their products and the reasons why they chose not to. This can lead to more effective micro-targeted, customer-specific marketing campaigns as well as improved supply chain efficiencies, as a
result of more accurate demand planning.
5. Finally, different social media websites like Facebook, Instagram, Twitter and LinkedIn wouldn’t have existed without
Big Data. The personalised experience provided by them to their different users can only be delivered by storing and
using all the available data about that user or member.
To set up a demo environment for big data, work through the following prerequisites and steps:
Prerequisites:
Hardware or Cloud Platform: You can use physical servers or a cloud platform like AWS, Azure, or Google Cloud.
Cloud platforms offer scalability and ease of use for demos.
Operating System: Install a compatible operating system (e.g., Linux) on your hardware or virtual machines
(VMs).
Access to Big Data Tools: Ensure you have access to the necessary big data tools like Hadoop, Spark, and data storage
systems.
Sample Data: Acquire or generate sample datasets that demonstrate the characteristics of big data.
Step-by-Step Guide:
Set Up Your Environment: Install and configure your chosen operating system on your hardware or VMs.
Install Big Data Tools: Install and configure the required big data tools. You can use package managers or download the software directly from their official websites; a download sketch for Hadoop follows below.
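For example, a minimal sketch of downloading and unpacking Hadoop on Linux (the version number and mirror URL shown are placeholders; check the Apache Hadoop downloads page for the current release):
# Download and extract a Hadoop release (the version is an assumption)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz -C /opt    # extract into /opt (may need root); any directory works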
Common big data tools to consider:
Hadoop ecosystem
Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems. It includes
Apache projects and various commercial tools and solutions. There are four major elements of Hadoop i.e. HDFS,
MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major
elements. All these tools work collectively to provide services such as the absorption (ingestion), analysis, storage and maintenance of data.
All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-flavored OS. The following simple steps are executed for Hive installation:
Here are the high-level steps for integrating Hadoop with Hive:
Download the latest stable release of Hadoop from the Apache Hadoop website.
Configure Hadoop by editing the necessary XML configuration files (e.g., core-site.xml, hdfs-site.xml, yarn-site.xml) to set
up properties such as cluster name, data directories, and other parameters.
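Before verification, the Hadoop daemons need to be started. A minimal sketch for a standard single-node installation (script locations may differ by version and distribution):
# Format the NameNode (first run only; this erases existing HDFS metadata)
bin/hdfs namenode -format
# Start the HDFS and YARN daemons
sbin/start-dfs.sh
sbin/start-yarn.sh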
Verify that Hadoop services are running by accessing the ResourceManager and NameNode web UIs.
Download the latest stable release of Hive from the Apache Hive website.
Configure Hive by editing the hive-site.xml configuration file located in the conf/ directory. Configure properties such as
the Hive metastore, database connection settings, and Hadoop configuration.
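Before first use, the Hive metastore schema typically needs to be initialized. A hedged sketch, assuming the embedded Derby metastore:
# Initialize the metastore schema (substitute your actual metastore database type)
bin/schematool -dbType derby -initSchema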
You can interact with Hive using the command-line interface (CLI) by running:
bin/hive
Alternatively, you can use Beeline, a JDBC client for Hive, by running:
bin/beeline -u jdbc:hive2://localhost:10000
Step 6: Create and Query Tables
With Hive set up, you can create tables, import data, and run SQL-like queries on your data.
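For instance, a minimal sketch run from the shell (the table name, columns, and input file are hypothetical):
# Create a comma-delimited table, load a local CSV file, and run a query
bin/hive -e "CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
bin/hive -e "LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;"
bin/hive -e "SELECT COUNT(*) FROM employees;"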
These steps provide a high-level overview of integrating Hadoop with Hive. For a production environment, make sure to
refer to official documentation and consider additional configurations and optimizations based on your specific
requirements.
Prerequisites:
Hadoop Installation: Ensure you have Hadoop installed and configured as described in the previous answer.
Visit the Apache Pig website (https://pig.apache.org/) and download the latest stable release of Pig.
Set the PIG_HOME environment variable to point to the directory where Pig is installed. You can do this by adding the
following line to your .bashrc or .bash_profile file (or equivalent for your shell and OS):
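A sketch of the lines to add (the installation path shown is hypothetical; point it at the directory where you unpacked Pig):
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin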
To verify the installation, start Pig's Grunt shell by running:
pig
Pig Latin is the scripting language used with Pig. You can use Pig Latin to load, transform, and analyze data in Hadoop.
Here's a simple example:
-- Example Pig Latin script (save as script.pig)
data = LOAD 'hdfs://localhost:9000/input_data' USING PigStorage(',') AS (field1:chararray, field2:int);
filtered_data = FILTER data BY field2 > 100;
grouped_data = GROUP filtered_data BY field1;
result = FOREACH grouped_data GENERATE group, COUNT(filtered_data);
STORE result INTO 'hdfs://localhost:9000/output_data';
To run the Pig script:
pig -f script.pig
This script loads data from HDFS, filters it, groups it by a field, and counts the records in each group. The result is stored
back in HDFS.
You can monitor the progress of your Pig jobs through the Hadoop ResourceManager web UI (usually at
http://localhost:8088/). Pig will submit MapReduce jobs to Hadoop for execution.
Download the latest stable release of ZooKeeper from the Apache ZooKeeper website.
Rename the zoo_sample.cfg configuration file to zoo.cfg in the ZooKeeper configuration directory (usually conf/).
Configure zoo.cfg with your ZooKeeper settings, including data directory and client port.
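A hedged sketch of a minimal zoo.cfg for a single-node setup (the values shown are common defaults and the data directory is an assumption):
# Write a basic zoo.cfg with tick time, data directory, and client port
cat > conf/zoo.cfg <<'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
EOF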
Start ZooKeeper on each node where you want to run it using the following command:
bin/zkServer.sh start
You can check that ZooKeeper is running with:
bin/zkServer.sh status
Edit Hadoop configuration files such as core-site.xml and hdfs-site.xml to specify ZooKeeper settings. Adjust the
properties according to your cluster configuration.
Monitor Hadoop services through the ResourceManager and NameNode web UIs.
These steps will integrate ZooKeeper with your Hadoop ecosystem for distributed coordination and high availability (if
configured). Make sure to refer to official documentation for Hadoop and ZooKeeper for more detailed configuration
options and troubleshooting information.
Installing the Hadoop Ecosystem and Integrating HBase
Here are the high-level steps for integrating Hadoop with HBase:
Download the latest stable release of Hadoop from the Apache Hadoop website.
Configure Hadoop by editing the necessary XML configuration files (e.g., core-site.xml, hdfs-site.xml, yarn-site.xml) to set
up properties such as cluster name, data directories, and other parameters.
Start the Hadoop daemons (HDFS and YARN) as shown earlier.
Verify that Hadoop services are running by accessing the ResourceManager and NameNode web UIs.
Download the latest stable release of HBase from the Apache HBase website.
Configure HBase by editing the hbase-site.xml configuration file located in the conf/ directory. Configure properties such
as ZooKeeper quorum, data directories, and Hadoop configuration.
Start HBase by running:
bin/start-hbase.sh
You can verify that HBase is running by accessing the HBase Master web UI (usually at http://localhost:16010).
You can also use the HBase shell to interact with HBase by running:
bin/hbase shell
Step 6: Create and Manage HBase Tables
With HBase set up, you can create tables, insert data, and perform various operations using the HBase shell or your
preferred programming language.
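A minimal sketch using the HBase shell from the command line (the table and column family names are hypothetical):
# Create a table, insert a row, and scan it
bin/hbase shell <<'EOF'
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
scan 'users'
exit
EOF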
These steps provide a high-level overview of integrating Hadoop with HBase. For a production environment, make sure
to refer to official documentation and consider additional configurations and optimizations based on your specific
requirements.
To integrate Sqoop with Hadoop, first download the latest stable release of Hadoop from the Apache Hadoop website.
Configure Hadoop by editing the necessary XML configuration files (e.g., core-site.xml, hdfs-site.xml, yarn-site.xml) to set
up properties such as cluster name, data directories, and other parameters.
Start the Hadoop daemons (HDFS and YARN) as shown earlier.
Verify that Hadoop services are running by accessing the ResourceManager and NameNode web UIs.
Download the latest stable release of Sqoop from the Apache Sqoop website.
Configure Sqoop by editing the sqoop-env.sh configuration file located in the conf/ directory. Set environment variables
such as Java home, Hadoop home, and any other necessary settings.
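A hedged sketch of typical entries in sqoop-env.sh (the paths are hypothetical and must match your own installation):
export HADOOP_COMMON_HOME=/opt/hadoop
export HADOOP_MAPRED_HOME=/opt/hadoop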
You can also configure connection properties for your database sources in Sqoop using the sqoop-site.xml configuration
file.
Step 5: Verify Sqoop
Confirm the installation by running:
bin/sqoop version
With Sqoop set up, you can use it to import data from external sources (e.g., relational databases) into your Hadoop
cluster or export data from Hadoop to external sources. The specific commands for importing or exporting data depend
on your use case and data source/target.
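As an illustration, a hedged sketch of importing a MySQL table into HDFS (the connection string, database, table, user, and target directory are all hypothetical):
# Import a relational table into HDFS using a single mapper
bin/sqoop import \
  --connect jdbc:mysql://localhost:3306/salesdb \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 1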
Before you can use Mahout with Hadoop, you need to have Hadoop installed and configured on your cluster. Follow the
Hadoop installation steps mentioned earlier in this conversation.
Download the latest stable release of Apache Mahout from the Apache Mahout website.
Configure Mahout by editing the Mahout configuration files as needed. These files are typically located in the conf/
directory of your Mahout installation.
Prepare your data in a format suitable for machine learning tasks. Mahout supports various data formats, including text,
sequence files, and vector representations. Ensure your data is available and accessible within your Hadoop HDFS
cluster.
You can run Mahout jobs on your Hadoop cluster using the Hadoop command-line interface. Here's a general command
template for running Mahout jobs:
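A hedged sketch of what such a template can look like (the jar name and class path are assumptions; adjust them for your Mahout version and chosen algorithm):
hadoop jar mahout-examples-<version>-job.jar org.apache.mahout.<algorithm>.<AlgorithmName> \
  --input <input-path-on-HDFS> --output <output-path-on-HDFS> [algorithm-specific options]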
Replace <version> with your Mahout version, <algorithm> with the specific algorithm you want to run,
<AlgorithmName> with the name of the algorithm, input with the input path on HDFS, output with the output path on
HDFS, and any other relevant options for the specific algorithm.
Step 5: Explore Mahout Algorithms
Mahout provides various machine learning algorithms for tasks such as clustering, classification, recommendation, and
more. Explore the documentation and examples provided with Mahout to understand how to use these algorithms for
your specific use case.
Monitor the progress of your Mahout jobs using the Hadoop ResourceManager and other monitoring tools. Depending
on your data and machine learning task, you may need to tune parameters and configurations for optimal performance.
Remember that the exact steps and commands may vary depending on the Mahout version and the specific machine
learning task you want to perform. Always refer to the official Mahout documentation and examples for detailed
information and best practices.
By using distributed and parallel algorithms, MapReduce makes it possible to move processing logic to the data and helps developers write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
Map() performs filtering and sorting of the data, organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
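As a concrete illustration, a minimal sketch of running the classic word-count MapReduce job that ships with Hadoop (the jar location varies by version; the input and output paths are hypothetical):
# Run the built-in word-count example on text files already stored in HDFS
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /data/input /data/wordcount_output
# Inspect the reducer output
hadoop fs -cat /data/wordcount_output/part-r-00000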
PIG:
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig executes the commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time (interactive) and batch processing. All the SQL data types are supported by Hive, making query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC Drivers and HIVE Command
Line.
JDBC, along with ODBC drivers, works on establishing data storage permissions and connections, whereas the HIVE command line helps in the processing of queries.
Apache HBase:
It's a NoSQL database that supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable, so it can work on big data sets effectively.
At times when we need to search for or retrieve a few small records in a huge database, the request must be processed within a short span of time. At such times HBase comes in handy, as it gives us a fault-tolerant way of storing and quickly retrieving this limited data.
Load data into Hadoop to leverage its distributed storage and processing capabilities, enabling scalable storage, efficient
data analysis, and advanced analytics on large datasets. Hadoop provides a cost-effective and flexible platform for
handling diverse data types, making it suitable for big data storage, processing, and analytics.
Choose Method: Select the method (e.g., HDFS commands, Sqoop, Hive, custom jobs) based on data source and use
case.
Execute: Run the chosen method's command, script, or job, specifying source and destination.
HDFS provides several commands for managing files and directories within the file system. To load data into Hadoop
using HDFS commands, follow these steps:
Create a directory in HDFS where you want to store the data. You can use the hadoop fs -mkdir command to create a
directory. For example, if you want to create a directory called data in the root directory of HDFS, you would use the
following command: hadoop fs -mkdir /data.
Copy the data from your local file system to the directory you just created in HDFS. You can use the hadoop fs -
put command to copy a file or directory from your local file system to HDFS. For example, if you have a file
called data.csv in your local file system and you want to copy it to the data directory in HDFS, you would use the
following command: hadoop fs -put data.csv /data.
Verify that the data was loaded into Hadoop successfully. You can use the hadoop fs -ls command to list the contents of
a directory in HDFS. For example, if you want to verify that the data.csv file was copied to the data directory in HDFS, you
would use the following command: hadoop fs -ls /data.
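Putting these steps together, a minimal sketch of the sequence (using the directory and file names from the example above):
# Create a target directory in HDFS, copy a local file in, and confirm it is there
hadoop fs -mkdir /data
hadoop fs -put data.csv /data
hadoop fs -ls /data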
Identify the data you want to retrieve from Hadoop. Know the HDFS path or location of the data you intend to extract.
Select the appropriate method for retrieving data from Hadoop based on your use case and destination. Common
methods include:
HDFS Commands: Use hadoop fs or hdfs dfs commands to copy data from HDFS to a local file system or another HDFS location (see the sketch after this list).
Hadoop DistCp: For efficient large-scale data copying, use the Hadoop Distributed Copy (hadoop distcp) tool.
Hive Data Export: If the data is stored in Hive tables, you can export it using SQL-like queries (INSERT OVERWRITE LOCAL
DIRECTORY or INSERT INTO).
Sqoop Data Export: For relational databases, use Sqoop to export data from Hadoop to a database.
Custom MapReduce or Spark Jobs: Write custom MapReduce or Spark jobs to extract and transform data for export.
ETL Tools: Employ ETL (Extract, Transform, Load) tools, such as Apache NiFi, to facilitate data retrieval and transformation.
Execute the chosen data retrieval method by running the corresponding command, script, or job. Specify the source and
destination paths or locations accurately.
Monitor the data retrieval process to ensure it completes successfully. Use Hadoop cluster monitoring tools, job
tracking, or log files to check progress and status.
After retrieving data from Hadoop, validate its integrity and correctness to ensure it matches your expectations and
adheres to any required data quality standards.
Step 6: Use or Store Data
Depending on your use case, you can use the retrieved data for analysis, reporting, or further processing, or store it in
your destination system.
These steps provide a general guideline for getting data from Hadoop. The specific commands, tools, and configurations
may vary based on your data source, Hadoop distribution, and destination. Always refer to the documentation and best
practices for the specific methods and tools you are using.