DA ANSWERS


3(A) HDFS Design Goals

1. Reliability

a. We should not lose data in any scenario.


b. We use many hardware devices and inevitably something will fail
(Hard Disk, Network Cards, Server Rack, and so on) at some point or
another.
 
2. Scalability

a. We should be able to add more hardware (workers) to get the job done.

3. Cost-Effective

a. The system needs to be cheap; after all, we are building a poor man's
supercomputer and do not have a budget for fancy hardware.

Data Assumptions

1. Process large files, both horizontally and vertically (gigabytes and beyond).
2. Data is append-only.
3. Access to data is large and sequential.
4. Write once, read many times.

Throughput vs. Random Seek


 
Since we are working with large datasets and sequential reads, Hadoop and
MapReduce are optimized for throughput rather than random seek.

In other words, if one must move a big house from one city to another, then a slow
big truck will do a better job than a small, fast, fancy car.
 
Data Locality
 
It is cheaper to move the compute logic than the data. In Hadoop, we move the
computation code to where the data is present, instead of moving the data
back and forth to the compute server, as typically happens in a client/server
model.
 
Data Storage in HDFS
  

1. HDFS will split the file into 64 MB blocks.

a. The size of the blocks can be configured.


b. An entire block of data will be used in the computation.
c. Think of it as a sector on a hard disk.

2. Each block will be sent to 3 machines (data nodes) for storage.

a. This provides reliability and efficient data processing.


b. The replication factor of 3 is configurable.
c. A RAID configuration to store the data is not required.
d. Since data is replicated 3 times, the overall usable storage space is reduced
to a third.

3. The accounting of each block is stored in a central server, called a Name
Node.

 
a. A Name Node is a master node that keeps track of each file and its
corresponding blocks and the data node locations.
b. MapReduce will talk with the Name Node and send the computation
to the corresponding data nodes.

c. The Name Node is the key to all the data, and hence a Secondary
Name Node is used to improve the reliability of the cluster.
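
As a rough, illustrative sketch of the numbers above (the 1 GB file size is a made-up
example; the 64 MB block size and replication factor of 3 are the defaults described in
these notes), the following Python snippet works out how many blocks a file is split
into and how much raw disk space the replicas consume:

    # Back-of-the-envelope HDFS storage math (illustrative only).
    import math

    file_size_mb = 1024      # hypothetical 1 GB file
    block_size_mb = 64       # default block size described above (configurable)
    replication = 3          # default replication factor (configurable)

    num_blocks = math.ceil(file_size_mb / block_size_mb)   # 16 blocks
    raw_storage_mb = file_size_mb * replication            # ~3072 MB on disk
    usable_fraction = 1 / replication                      # usable capacity drops to a third

    print(num_blocks, raw_storage_mb, usable_fraction)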

3(B)

The characteristics of Big Data are commonly referred to as the four Vs:

Volume of Big Data


The volume of data refers to the size of the data sets that need to be analyzed and
processed, which are now frequently larger than terabytes and petabytes. The sheer
volume of the data requires distinct and different processing technologies than
traditional storage and processing capabilities. In other words, this means that the data
sets in Big Data are too large to process with a regular laptop or desktop processor. An
example of a high-volume data set would be all credit card transactions made within
Europe on a single day.

Velocity of Big Data


Velocity refers to the speed with which data is generated. High velocity data is
generated with such a pace that it requires distinct (distributed) processing techniques.
An example of data that is generated with high velocity would be Twitter messages or
Facebook posts.

Variety of Big Data


Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally is one of three types: structured, semi-structured, and unstructured data.
The variety in data types frequently requires distinct processing capabilities and
specialist algorithms. An example of high variety data sets would be the CCTV audio
and video files that are generated at various locations in a city.

Veracity of Big Data


Veracity refers to the quality of the data that is being analyzed. High veracity data has
many records that are valuable to analyze and that contribute in a meaningful way to
the overall results. Low veracity data, on the other hand, contains a high percentage of
meaningless data. The non-valuable data in these data sets is referred to as noise. An
example of a high veracity data set would be data from a medical experiment or trial.

Data that is high volume, high velocity and high variety must be processed with
advanced tools (analytics and algorithms) to reveal meaningful information. Because of
these characteristics of the data, the knowledge domain that deals with the storage,
processing, and analysis of these data sets has been labeled Big Data.

4(A)

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It
resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing
easy.
Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it
up and developed it further as open source under the name Apache Hive. It is used
by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive
• It stores schema in a database and processes data into HDFS.
• It is designed for OLAP.
• It provides a SQL-type language for querying called HiveQL or HQL (a short
example is sketched after this list).
• It is familiar, fast, scalable, and extensible.
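
The following is a minimal, hedged sketch (not part of the original notes) of issuing a
HiveQL query from Python through the PyHive client; the HiveServer2 host and port, the
username, and the employees table are assumptions.

    # Minimal sketch: run a HiveQL query via the PyHive client.
    # Connection details and the "employees" table are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, username="hadoop")
    cursor = conn.cursor()

    # HiveQL looks like SQL; Hive compiles it into MapReduce jobs, as described above.
    cursor.execute("SELECT department, COUNT(*) FROM employees GROUP BY department")
    for department, headcount in cursor.fetchall():
        print(department, headcount)

    conn.close()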

Architecture of Hive
The following component diagram depicts the architecture of Hive:

This component diagram contains different units. The following table describes each
unit:

Unit Name: Operation

User Interface: Hive is data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports
are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows
Server).

Meta Store: Hive chooses respective database servers to store the schema or
metadata of tables, databases, columns in a table, their data types, and the
HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema info in the
Metastore. It is one of the replacements of the traditional approach of writing
a MapReduce program. Instead of writing a MapReduce program in Java, we can
write a query for the MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and
MapReduce is the Hive Execution Engine. The execution engine processes the
query and generates results the same as MapReduce results. It uses the flavor
of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBASE are the data storage
techniques used to store data in the file system.

4(B)

The Five Key Differences of Apache Spark vs Hadoop MapReduce:

1. Apache Spark is potentially 100 times faster than Hadoop
MapReduce.
2. Apache Spark utilizes RAM and isn't tied to Hadoop's two-stage
paradigm.
3. Apache Spark works well for smaller data sets that can all fit
into a server's RAM.
4. Hadoop is more cost-effective for processing massive data sets.
5. Apache Spark is now more popular than Hadoop MapReduce.
For years, Hadoop was the undisputed champion of big data—until
Spark came along.
Since its initial release in 2014, Apache Spark has been setting the
world of big data on fire. With Spark's convenient APIs and
promised speeds up to 100 times faster than Hadoop MapReduce,
some analysts believe that Spark has signaled the arrival of a new
era in big data.

How can Spark, an open-source data processing framework, crunch all this
information so fast? The secret is that Spark runs in-memory on the cluster,
and it isn't tied to Hadoop's MapReduce two-stage paradigm. This makes repeated
access to the same data much faster.
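
A minimal PySpark sketch of that in-memory reuse (not from the original text; the HDFS
path is hypothetical): once the data is cached, repeated actions over it avoid
re-reading from disk.

    # Sketch of Spark's in-memory reuse: cache a dataset once, then run
    # several actions over it without re-reading from HDFS. Path is hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    logs = spark.read.text("hdfs:///data/access_logs")  # assumed HDFS path
    logs.cache()                                         # keep the data in cluster RAM

    errors = logs.filter(logs.value.contains("ERROR")).count()   # first action reads from HDFS
    warnings = logs.filter(logs.value.contains("WARN")).count()  # reuses cached partitions

    print(errors, warnings)
    spark.stop()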

Spark can run as a standalone application or on top of Hadoop YARN, where it
can read data directly from HDFS. Dozens of major tech companies such as Yahoo,
Intel, Baidu, Yelp, and Zillow are already using Spark as part of their
technology stacks.

While Spark seems like it's bound to replace Hadoop MapReduce, you shouldn't
count out MapReduce just yet. Below, we compare the two platforms and see if
Spark truly comes out on top.

5(A)

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to
analyze large sets of data, representing them as data flows. Pig is generally used
with Hadoop; we can perform all the data manipulation operations in Hadoop using
Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig
Latin. This language provides various operators using which programmers can
develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin
language. All these scripts are internally converted to Map and Reduce tasks. Apache
Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input
and converts those scripts into MapReduce jobs.
Programmers who are not so good at Java normally used to struggle when working
with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for
all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without
having to type complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the length of code.
For example, an operation that would require you to type 200 lines of code
(LoC) in Java can easily be done by typing as few as 10 LoC in Apache
Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are
familiar with SQL.
• Apache Pig provides many built-in operators to support data operations like
joins, filters, ordering, etc. In addition, it also provides nested data types like
tuples, bags, and maps that are missing from MapReduce.
5(B)

Weather data analytics has a lot of importance in human life. Accurate prediction of weather is very helpful to the
agriculture sector and tourism, and also to planning for natural calamities like floods and droughts. Weather prediction
has a lot of commercial value for news agencies, the government sector, and industrial farming. Weather has an enormous
effect on the psyche of a human being. Human mood can change to positive, negative, or tired based on changes in the
weather [1]. Prediction of climatic conditions is a very important challenge for every living being to sustain itself. To study
the climate there is a need to study meteorology. Meteorology is the interdisciplinary scientific study of the atmosphere,
i.e., temperature, pressure, humidity, wind, etc. Usually, temperature, pressure, wind, and humidity are the variables
that are measured by a thermometer, barometer, anemometer, and hygrometer respectively [2]. Observations of
these parameters are collected from various sensors deployed at different geographical locations. This data is
accumulated at the meteorological departments of various countries and is known as weather data. At each location
the values of the various weather parameters are collected at a frequency of 3-4 times per hour. This data is stored in an
unstructured format along with location, date, and time. The structure of these files is a flat file separated by
commas, tabs, or sometimes semicolons.
It is difficult to process this unstructured data directly. The collective data is becoming very large considering the
various parameters, their frequency of recording, and the number of locations. Day by day this data grows and
accumulates at enormous speed. Hence, processing this data using conventional methods and tools is becoming a
challenge. The Hadoop platform with the MapReduce programming paradigm has proven to be very useful in processing
huge volumes of unstructured data. Spark, with in-memory computing, also gives very good performance for the
analysis of unstructured data. As various other Big Data technologies like Storm and NoSQL are also claiming their
usefulness in the storage and processing of huge data, it is important to study their relative performance and
usefulness in various domains. In the current project, the Big Data technologies MapReduce and Spark are studied
and compared for weather data analytics.
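
As a hedged sketch of the kind of analysis just described (the comma-separated column
layout and the HDFS path are assumptions, not the project's actual data format), the
snippet below reads flat-file weather records with PySpark and computes a per-location
maximum temperature:

    # Illustrative PySpark pass over comma-separated weather records of the form
    # location,date,time,temperature,pressure,humidity,wind_speed
    # (a hypothetical layout; real meteorological feeds will differ).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("weather-analytics").getOrCreate()

    weather = (spark.read
               .option("header", "false")
               .csv("hdfs:///data/weather/*.csv")
               .toDF("location", "date", "time", "temperature",
                     "pressure", "humidity", "wind_speed"))

    max_temp = (weather
                .withColumn("temperature", F.col("temperature").cast("double"))
                .groupBy("location")
                .agg(F.max("temperature").alias("max_temperature")))

    max_temp.show()
    spark.stop()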

7(A)

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:

Step No.  Operation

1    Execute Query
     The Hive interface, such as the command line or Web UI, sends the query to
     the Driver (any database driver such as JDBC, ODBC, etc.) to execute.

2    Get Plan
     The driver takes the help of the query compiler, which parses the query to
     check the syntax and the query plan or the requirement of the query.

3    Get Metadata
     The compiler sends a metadata request to the Metastore (any database).

4    Send Metadata
     The Metastore sends the metadata as a response to the compiler.

5    Send Plan
     The compiler checks the requirement and resends the plan to the driver.
     Up to here, the parsing and compiling of the query is complete.

6    Execute Plan
     The driver sends the execute plan to the execution engine.

7    Execute Job
     Internally, the process of executing the job is a MapReduce job. The
     execution engine sends the job to the JobTracker, which is in the Name node,
     and it assigns this job to the TaskTracker, which is in the Data node. Here,
     the query executes the MapReduce job.

7.1  Metadata Ops
     Meanwhile, during execution, the execution engine can execute metadata
     operations with the Metastore.

8    Fetch Result
     The execution engine receives the results from the Data nodes.

9    Send Results
     The execution engine sends those resultant values to the driver.

10   Send Results
     The driver sends the results to the Hive interfaces.

7(B)

Indexing:

Indexes support the efficient resolution of queries. Without indexes, MongoDB must
scan every document of a collection to select those documents that match the query
statement. This scan is highly inefficient and requires MongoDB to process a large
volume of data.
Indexes are special data structures that store a small portion of the data set in an
easy-to-traverse form. The index stores the value of a specific field or set of fields,
ordered by the value of the field as specified in the index.
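
A brief PyMongo sketch of this idea (the database, collection, and field names are
hypothetical): without the index, a query on user_id would scan every document; with
it, MongoDB can walk the ordered index instead.

    # Creating and using a single-field index with PyMongo.
    # Database, collection, and field names are illustrative only.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    orders = client["shop"]["orders"]

    # Build an ascending index on user_id so equality/range queries on that
    # field no longer require a full collection scan.
    orders.create_index([("user_id", ASCENDING)])

    # This query can now be resolved through the index.
    recent = orders.find({"user_id": 42}).sort("order_date", -1).limit(10)
    for doc in recent:
        print(doc)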
Aggregation:

Aggregation operations process data records and return computed results. Aggregation
operations group values from multiple documents together, and can perform a variety of
operations on the grouped data to return a single result. In SQL, count(*) with GROUP BY is
the equivalent of MongoDB aggregation.
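
Continuing the same hypothetical orders collection, the sketch below mirrors SQL's
count(*) with GROUP BY using a $group pipeline stage:

    # Grouping and counting documents with the aggregation pipeline, the MongoDB
    # counterpart of SQL's COUNT(*) ... GROUP BY (same hypothetical collection).
    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    pipeline = [
        {"$group": {"_id": "$user_id", "order_count": {"$sum": 1}}},
        {"$sort": {"order_count": -1}},
    ]

    for row in orders.aggregate(pipeline):
        print(row["_id"], row["order_count"])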

8 (2)

5 P's that play a significant part in data science activities.

• Purpose: The purpose refers to the challenge or set of challenges
defined by your big data strategy. The purpose can be related to a
scientific analysis with a hypothesis or a business metric that needs to
be analyzed, often based on Big Data.
• People: Data scientists are often seen as people who possess skills
on a variety of topics including: science or business domain
knowledge; analysis using statistics, machine learning, and
mathematical knowledge; and data management, programming, and
computing. In practice, this is generally a group of researchers
composed of people with complementary skills.

• Process: Since there is a predefined team with a purpose, a great place
for this team to start is a process they can iterate on. We can
simply say, People with Purpose will define a Process to collaborate
and communicate around! The process of data science includes
techniques for statistics, machine learning, programming, computing,
and data management. A process is conceptual in the beginning and
defines the coarse set of steps and how everyone can contribute to it.
Note that similar reusable processes can be applicable to many
applications with different purposes when employed within different
workflows. Data science workflows combine such steps in executable
graphs. We believe that process-oriented thinking is a transformative
way of conducting data science to connect people and techniques to
applications. Execution of such a data science process requires access
to many datasets, big and small, bringing new opportunities and
challenges to data science. There are many data science steps or
tasks, such as data collection, data cleaning, data
processing/analysis, and result visualization, resulting in a data science
workflow (a toy sketch of such a workflow is given at the end of this
list). Data science processes may need user interaction and other
manual operations, or be fully automated. Challenges for the data
science process include 1) how to easily integrate all needed tasks to
build such a process; and 2) how to find the best computing resources and
efficiently schedule process executions on those resources based on the
process definition, parameter settings, and user preferences.

• Platforms: Based on the needs of an application-driven purpose and
the amount of data and computing required to perform this application,
different computing and data platforms can be used as a part of the data
science process. This scalability should be made part of any data
science solution architecture.

• Programmability: Capturing a scalable data science process requires
aid from programming languages, e.g., R, and patterns, e.g.,
MapReduce. Tools that provide access to such programming techniques
are key to making the data science process programmable on a variety
of platforms.
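
As a toy, hedged sketch of the process idea above (the step names, the
measurements.csv file, and the value column are placeholders, not a prescribed
pipeline), the snippet below chains collection, cleaning, analysis, and reporting
steps into one small, repeatable Python workflow:

    # A toy data science workflow: each step is a plain function, and the
    # workflow is just their composition. File name and columns are placeholders.
    import csv
    from statistics import mean

    def collect(path):
        """Data collection: read raw records from a CSV file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def clean(rows):
        """Data cleaning: drop rows with a missing or non-numeric 'value' field."""
        cleaned = []
        for row in rows:
            try:
                row["value"] = float(row["value"])
                cleaned.append(row)
            except (KeyError, ValueError):
                continue
        return cleaned

    def analyze(rows):
        """Processing/analysis: compute a simple summary statistic."""
        return {"count": len(rows), "mean_value": mean(r["value"] for r in rows)}

    def report(summary):
        """Result visualization stand-in: print the summary."""
        print(summary)

    if __name__ == "__main__":
        report(analyze(clean(collect("measurements.csv"))))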
8 (4)

ANOVA for Regression


Analysis of Variance (ANOVA) consists of calculations that provide information about
levels of variability within a regression model and form a basis for tests of
significance. The basic regression line concept, DATA = FIT + RESIDUAL, is rewritten
as follows:
(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i).
The first term is the total variation in the response y, the second term is the variation
in mean response, and the third term is the residual value. Squaring each of these
terms and adding over all of the n observations gives the equation
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
This equation may also be written as SST = SSM + SSE, where SS is notation for sum
of squares and T, M, and E are notation for total, model, and error, respectively.
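
A small numerical check of SST = SSM + SSE (the x and y values below are made up
purely for illustration), using numpy's least-squares line fit:

    # Verifying the ANOVA identity SST = SSM + SSE on a toy simple-linear-regression
    # fit. The x/y values are arbitrary illustrative numbers.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y = a*x + b
    y_hat = slope * x + intercept            # fitted values
    y_bar = y.mean()

    sst = np.sum((y - y_bar) ** 2)       # total sum of squares
    ssm = np.sum((y_hat - y_bar) ** 2)   # model (regression) sum of squares
    sse = np.sum((y - y_hat) ** 2)       # error (residual) sum of squares

    print(sst, ssm + sse)   # the two values agree up to floating-point error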
8 (5)
Term frequency (TF) means how often a term occurs in a document. In the
context of natural language, terms correspond to words or phrases. But terms
could also represent any token in text. It’s all about how you define it. Term
frequency is commonly used in Text Mining, Machine Learning, and Information
Retrieval tasks.

As documents can have different lengths, it’s possible that a term would appear
more frequently in longer documents versus shorter ones. Because of this, it will
seem like a term is more important to a longer document than to a shorter one.
To reduce this effect, term frequency is often divided by the total number of terms
in the document as a way of normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document).
