DA Assignment 20241015 091512 0000
6. a) Apache Spark
1. Unified Engine: Supports batch
processing, streaming, machine
learning, and graph processing.
2. In-Memory Processing: Speeds up
repeated and iterative work by
keeping intermediate data in
memory instead of writing it to
disk between steps.
3. Scalability: Easily scales across
clusters of computers for
handling large datasets.
4. Support for Multiple
Languages: Compatible with
Python, Java, Scala, and R.
5. Extensive Libraries: Includes
built-in libraries for SQL
(Spark SQL), stream processing
(Structured Streaming), machine
learning (MLlib), and graph
processing (GraphX).
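
The in-memory processing idea in point 2 can be illustrated with a small sketch. This is a toy in plain Python, not the Spark API: ToyDataset, read_source, and source_reads are hypothetical names made up for the example. It only shows why keeping an intermediate result in memory avoids re-reading the source when the same data feeds several computations.

```python
# Toy illustration only: plain Python, not the Spark API.
# ToyDataset, read_source and source_reads are hypothetical names.

class ToyDataset:
    """A lazily evaluated dataset that can keep its rows in memory."""

    def __init__(self, compute):
        self._compute = compute    # function that produces the rows
        self._cached = None        # in-memory copy, filled on first use
        self._use_cache = False

    def cache(self):
        """Mark the dataset so its rows stay in memory after first use."""
        self._use_cache = True
        return self

    def map(self, fn):
        """Return a new lazy dataset; nothing runs until collect()."""
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        """Run the computation, or reuse the in-memory copy if present."""
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows


source_reads = 0


def read_source():
    """Stand-in for an expensive read from disk."""
    global source_reads
    source_reads += 1
    return [1, 2, 3, 4]


base = ToyDataset(read_source).cache()
doubled = base.map(lambda x: x * 2).collect()
squared = base.map(lambda x: x * x).collect()

print(doubled, squared, source_reads)
```

Both results are derived from the same base data, yet the source is read only once because the second computation reuses the cached rows. Without the call to cache(), each collect() would read the source again.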
6. b) Cloudera Impala
1. SQL-Based Engine: Allows
users to run SQL queries on large
datasets in Hadoop.
2. Low Latency: Designed for fast
query execution, enabling
interactive, near-real-time
analytics.
3. Integration with Hadoop:
Reads data directly from HDFS and
HBase and shares table metadata
with Hive.
4. Columnar Storage: Performs
best on data stored in columnar
file formats such as Parquet,
since a query reads only the
columns it needs.
5. Compatibility: Supports
various BI tools for data
visualization and reporting.
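
The benefit of columnar storage in point 4 can be sketched in a few lines. This is a toy in plain Python, not Impala or Parquet internals: the sample table and the function names are assumptions made up for the example. It counts how many stored values an aggregate over one column has to touch under each layout.

```python
# Toy illustration only: plain Python lists, not Impala or Parquet internals.
# The table contents and function names are hypothetical.

rows = [
    {"id": 1, "region": "north", "sales": 120},
    {"id": 2, "region": "south", "sales": 80},
    {"id": 3, "region": "north", "sales": 50},
]

# Row layout keeps each record together.
# Column layout keeps each column's values together.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_sales_row_layout(records):
    """Sum one column, touching every field of every record."""
    values_touched = 0
    total = 0
    for record in records:
        values_touched += len(record)
        total += record["sales"]
    return total, values_touched


def total_sales_column_layout(cols):
    """Sum one column, touching only that column's values."""
    sales = cols["sales"]
    return sum(sales), len(sales)


row_total, row_touched = total_sales_row_layout(rows)
col_total, col_touched = total_sales_column_layout(columns)

print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts give the same total, but the column layout touches three values where the row layout touches nine. On wide tables with many columns, that gap is the main reason analytic queries tend to run faster on columnar data.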