DA ANSWERS
1. Reliability
3. Cost-Effective
a. The system needs to be cheap; after all, we are building a poor man's
supercomputer and we do not have a budget for fancy hardware.
Data Assumptions
a. The NameNode is a master node that keeps track of each file, its
corresponding blocks, and the DataNode locations.
b. MapReduce talks with the NameNode and sends the computation to the
corresponding DataNodes (see the sketch after this list).
c. The NameNode is the key to all the data, and hence the Secondary
NameNode is used to improve the reliability of the cluster.
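As an illustration of item (b), a minimal Hadoop Streaming sketch is given below; it is not part of the original answer, and the word-count task and the file names mapper.py and reducer.py are assumptions made only for illustration. Hadoop pipes each input split, stored on a DataNode, through the mapper and then through the reducer.

#!/usr/bin/env python3
# mapper.py - reads lines of an input split from stdin and emits "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py - receives the pairs sorted by key and sums the counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The pair would be submitted with the hadoop-streaming jar using the -input, -output, -mapper, and -reducer options, so the computation runs on the DataNodes that already hold the input blocks.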
3(B)
The characteristics of Big Data are commonly referred to as the four Vs: volume, velocity, variety, and veracity.
Data that is high volume, high velocity, and high variety must be processed with
advanced tools (analytics and algorithms) to reveal meaningful information. Because of
these characteristics, the knowledge domain that deals with the storage,
processing, and analysis of such data sets has been labeled Big Data.
4(A)
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type language for querying called HiveQL or HQL (see the sketch after this list).
It is familiar, fast, scalable, and extensible.
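A small, hedged HiveQL sketch (not part of the original answer) using the PyHive client from Python; the host, port, and the employee table are assumptions made only for illustration.

from pyhive import hive  # assumes a running HiveServer2 instance

# Connection details are assumptions for illustration.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; Hive compiles it into MapReduce jobs behind the scenes.
cursor.execute("SELECT dept, COUNT(*) FROM employee GROUP BY dept")
for dept, cnt in cursor.fetchall():
    print(dept, cnt)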
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each
unit:
HDFS or HBASE: the Hadoop Distributed File System or HBASE is the data storage
technique used to store the data in the file system.
4(B)
5(A)
Weather data analytics is of great importance in human life. Accurate weather prediction is very helpful to the
agriculture sector, tourism, and planning for natural calamities such as floods and droughts. Weather prediction
also has a lot of commercial value for news agencies, the government sector, and industrial farming. Weather has an
enormous effect on the human psyche; a person's mood can turn positive, negative, or tired based on changes in the
weather [1]. Predicting climatic conditions is therefore an important challenge for every living being. To study
the climate, one needs to study meteorology, the interdisciplinary scientific study of the atmosphere,
i.e. temperature, pressure, humidity, wind, etc. Usually, temperature, pressure, wind, and humidity are the variables
measured by a thermometer, barometer, anemometer, and hygrometer respectively [2]. Observations of
these parameters are collected from various sensors deployed at different geographical locations and
accumulated at the meteorological departments of various countries. This data is known as weather data. At each
location, the values of the various weather parameters are collected 3-4 times per hour. The data is stored in an
unstructured format along with the location, date, and time, typically as flat files with fields separated by
commas, tabs, or semicolons.
It is difficult to process this unstructured data directly. The collected data becomes very large considering the
number of parameters, their recording frequency, and the number of locations, and it grows and accumulates at
enormous speed. Hence, processing this data with conventional methods and tools is becoming a challenge. The Hadoop
platform with the MapReduce programming paradigm has proven very useful for processing huge amounts of unstructured
data, and Spark with in-memory computing also gives very good performance for the analysis of unstructured data. As
other Big Data technologies such as Storm and NoSQL databases also claim usefulness for the storage and processing
of huge data, it is important to study their relative performance and usefulness in various domains. In the current
project, the Big Data technologies MapReduce and Spark are studied and compared for weather data analytics.
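A minimal Spark sketch of the kind of weather analysis described above, assuming a comma-separated file weather.csv with station and temperature columns (the file name and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("WeatherAnalytics").getOrCreate()

# Read the flat-file observations (comma separated, header row assumed).
weather = spark.read.csv("weather.csv", header=True, inferSchema=True)

# Maximum and average temperature per station - the same aggregation could be
# written as a MapReduce job for the comparison described above.
summary = (weather.groupBy("station")
                  .agg(F.max("temperature").alias("max_temp"),
                       F.avg("temperature").alias("avg_temp")))
summary.show()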
7(A)
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No. Operation
1 Execute Query
The Hive interface (Command Line or Web UI) sends the query to the driver to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.
3 Get Metadata
The compiler sends a metadata request to the Metastore (any database).
4 Send Metadata
The Metastore sends the metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirements and resends the plan to the driver. Up to this point, the parsing and compiling of the query is complete.
6 Execute Plan
The driver sends the execution plan to the execution engine.
7 Execute Job
Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the NameNode, and it assigns this job to the TaskTracker, which resides on the DataNode. Here, the query executes as a MapReduce job.
8 Fetch Result
The execution engine receives the results from the DataNodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to the Hive interfaces.
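To observe steps 2-5 directly, Hive's EXPLAIN statement prints the plan the compiler returns to the driver. The sketch below is a hedged illustration using PyHive; the connection details and the employee table are assumptions.

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# EXPLAIN returns the compiled plan, including the MapReduce stages
# the execution engine will submit as jobs.
cursor.execute("EXPLAIN SELECT dept, COUNT(*) FROM employee GROUP BY dept")
for (plan_line,) in cursor.fetchall():
    print(plan_line)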
7(B)
Indexing :-
Indexes support the efficient resolution of queries. Without indexes, MongoDB must
scan every document of a collection to select those documents that match the query
statement. This scan is highly inefficient and requires MongoDB to process a large
volume of data.
Indexes are special data structures that store a small portion of the data set in an
easy-to-traverse form. The index stores the value of a specific field or set of fields,
ordered by the value of the field as specified in the index.
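A small, hedged PyMongo sketch of creating and using an index; the database, collection, and field names are assumptions made for illustration.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
users = client["testdb"]["users"]

# Create an ascending index on "name" so queries on that field
# no longer need to scan every document in the collection.
users.create_index([("name", ASCENDING)])

# This query can now be resolved through the index.
for doc in users.find({"name": "Alice"}):
    print(doc)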
Aggregation :-
Aggregation operations process data records and return computed results. Aggregation
operations group values from multiple documents together, and can perform a variety of
operations on the grouped data to return a single result. In SQL, count(*) with GROUP BY is
the equivalent of MongoDB aggregation.
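A hedged sketch of the count(*)/GROUP BY equivalence mentioned above; the collection and field names are assumptions.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
orders = client["testdb"]["orders"]

# Equivalent of: SELECT status, COUNT(*) FROM orders GROUP BY status
pipeline = [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["count"])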
8 (2)
As documents can have different lengths, it’s possible that a term would appear
more frequently in longer documents versus shorter ones. Because of this, it will
seem like a term is more important to a longer document than to a shorter one.
To reduce this effect, term frequency is often divided by the total number of terms
in the document as a way of normalization.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the
document).
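A small worked example of the normalized term frequency above (the sample sentence is made up for illustration):

from collections import Counter

document = "the cat sat on the mat".split()
counts = Counter(document)
total_terms = len(document)

# TF(t) = (occurrences of t in the document) / (total terms in the document)
tf = {term: count / total_terms for term, count in counts.items()}
print(tf["the"])  # 2 / 6 = 0.333...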