
DATA ANALYTICS

1. Data Reduction as a Data Pre-processing Step

1. Improved Efficiency: Reduces the amount of data to be processed, speeding up analysis.
2. Memory Management: Helps in fitting large datasets into memory for analysis.
3. Enhanced Interpretability: Simplifies data, making it easier to visualize and understand.
4. Noise Reduction: Eliminates redundant or irrelevant data, improving model accuracy.
5. Scalability: Facilitates the handling of larger datasets by compressing them into smaller, manageable forms.
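
A minimal sketch of one common reduction technique, dimensionality reduction with PCA in scikit-learn; the synthetic dataset and the choice of five components are illustrative assumptions, not part of the assignment:

    # Dimensionality reduction with PCA: compress many correlated
    # features into a few principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))        # 1,000 rows, 50 features (synthetic)

    pca = PCA(n_components=5)              # keep 5 components (assumed choice)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                 # (1000, 5): smaller, faster to analyze
    print(pca.explained_variance_ratio_)   # variance retained per component

The n_components setting is the size-versus-information trade-off: fewer components mean a smaller dataset but less retained variance.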

2. Identifying Data Quality and Quality Measures

1. Accuracy: Ensures data values are correct and reliable.
2. Completeness: Measures the extent to which all required data is present.
3. Consistency: Checks for uniformity of data across different sources and systems.
4. Timeliness: Evaluates whether data is up-to-date and available when needed.
5. Uniqueness: Assesses the presence of duplicates to maintain data integrity.
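
Several of these measures can be probed directly in code; a small pandas sketch, where the table and column names are hypothetical:

    # Quick data-quality checks with pandas.
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],                                   # one duplicate
        "signup_date": ["2024-01-05", None, "2024-02-10", "2024-03-01"],
    })

    completeness = 1 - df.isna().mean()                  # non-missing share per column
    duplicates = df["customer_id"].duplicated().sum()    # uniqueness check
    latest = pd.to_datetime(df["signup_date"]).max()     # timeliness: newest record

    print(completeness, duplicates, latest, sep="\n")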

3. Various Analytics Techniques

1. Descriptive Analytics: Summarizes historical data to understand past performance.
2. Inferential Analytics: Makes predictions or generalizations about a population based on a sample.
3. Predictive Analytics: Uses historical data to forecast future outcomes and trends.
4. Prescriptive Analytics: Recommends specific actions based on data-driven insights.
5. Exploratory Data Analysis (EDA): Investigates datasets to discover patterns and relationships.
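
To make the descriptive end of this spectrum concrete, a short pandas sketch that summarizes historical monthly sales; the figures are synthetic:

    # Descriptive analytics: summarize past performance.
    import pandas as pd

    sales = pd.Series([120, 135, 150, 110, 160],
                      index=pd.period_range("2024-01", periods=5, freq="M"))

    print(sales.describe())       # count, mean, std, min, quartiles, max
    print(sales.pct_change())     # month-over-month growth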

4. Data Imputation Techniques

1. Mean/Median/Mode Imputation: Replaces missing values with statistical measures of central tendency.
2. K-Nearest Neighbors (KNN): Estimates missing values based on the values of similar data points.
3. Regression Imputation: Predicts missing values using regression models based on other variables.
4. Multiple Imputation: Creates multiple datasets with different imputed values to reflect uncertainty.
5. Last Observation Carried Forward (LOCF): Uses the last observed value to fill in missing data for time series.
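
A minimal sketch of the first two techniques using scikit-learn; the toy matrix is an assumption for illustration:

    # Mean imputation and KNN imputation with scikit-learn.
    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, 6.0]])

    mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
    knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

    print(mean_filled)   # nan replaced by the column mean (4.0)
    print(knn_filled)    # nan replaced by the average of the 2 nearest rows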

5. Need for Business Modeling

1. Strategic Alignment: Ensures business processes support overall organizational goals.
2. Operational Efficiency: Identifies inefficiencies and streamlines processes for better performance.
3. Enhanced Decision-Making: Provides a clear framework for analyzing data and making informed decisions.
4. Risk Management: Helps in identifying potential risks and formulating mitigation strategies.
5. Communication Tool: Acts as a visual representation to facilitate communication among stakeholders.

6. a) Apache Spark

1. Unified Engine: Supports batch processing, streaming, machine learning, and graph processing.
2. In-Memory Processing: Offers faster data processing by storing data in memory rather than on disk.
3. Scalability: Easily scales across clusters of computers for handling large datasets.
4. Support for Multiple Languages: Compatible with Python, Java, Scala, and R.
5. Extensive Libraries: Includes built-in libraries for SQL, machine learning (MLlib), and graph processing (GraphX).
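
A minimal PySpark sketch of the unified DataFrame API; the file path and the region/amount columns are hypothetical:

    # PySpark: read, transform, and aggregate with one unified API.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("demo").getOrCreate()

    df = spark.read.csv("sales.csv", header=True, inferSchema=True)  # assumed file
    (df.groupBy("region")                       # assumed column
       .agg(F.sum("amount").alias("total"))     # assumed column
       .orderBy(F.desc("total"))
       .show())

    spark.stop()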

6. b) Cloudera Impala

1. SQL-Based Engine: Allows users to run SQL queries on large datasets in Hadoop.
2. Low Latency: Designed for fast query execution, enabling real-time analytics.
3. Integration with Hadoop: Works seamlessly with HDFS and HBase for efficient data access.
4. Columnar Storage: Optimizes performance by storing data in a columnar format.
5. Compatibility: Supports various BI tools for data visualization and reporting.
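
A sketch of querying Impala from Python through its SQL interface, here via the impyla client; the host, port, and table are assumptions:

    # Running a SQL query against Impala using the impyla package.
    from impala.dbapi import connect

    conn = connect(host="impala-host.example.com", port=21050)  # assumed endpoint
    cur = conn.cursor()
    cur.execute("SELECT region, COUNT(*) FROM sales GROUP BY region")  # assumed table
    for row in cur.fetchall():
        print(row)
    conn.close()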

7. What is Data? Handling Large Collections of Data

1. Definition: Data is a collection of facts, statistics, or information that can be analyzed.
2. Types of Data: Includes structured, unstructured, and semi-structured data.
3. Storage Solutions: Utilizes databases (SQL/NoSQL) for effective data storage and retrieval.
4. Distributed Processing: Employs frameworks like Hadoop and Spark for large-scale data processing.
5. Data Pipeline Management: Implements ETL (Extract, Transform, Load) processes to handle data flows efficiently.
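
A tiny ETL sketch in pandas, assuming a CSV source and a SQLite target; the file, table, and column names are hypothetical:

    # Tiny ETL pipeline: Extract -> Transform -> Load.
    import sqlite3
    import pandas as pd

    df = pd.read_csv("orders.csv")                       # Extract (assumed file)
    df["order_date"] = pd.to_datetime(df["order_date"])  # Transform: fix types
    df = df.dropna(subset=["customer_id"])               # Transform: drop bad rows

    with sqlite3.connect("warehouse.db") as conn:        # Load into SQLite
        df.to_sql("orders", conn, if_exists="replace", index=False)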

8. Constraints and Influences on Data Architecture Design

1. Data Volume: High data volumes require scalable storage solutions.
2. Data Variety: Diverse data types necessitate flexible and adaptable architecture.
3. Data Velocity: Real-time data processing needs impact design choices.
4. Compliance Regulations: Legal and regulatory requirements influence data governance and security measures.
5. Technology Trends: Emerging technologies can shape architectural decisions for efficiency and innovation.

9. Analytics Applications in Various Business Domains

1. Healthcare: Analyzes patient data for improved treatment outcomes and operational efficiency.
2. Finance: Risk assessment and fraud detection using historical transaction data.
3. Retail: Customer behavior analysis for targeted marketing and inventory management.
4. Manufacturing: Predictive maintenance to minimize equipment downtime and improve production efficiency.
5. Telecommunications: Churn prediction to retain customers and optimize service offerings.

10. Data Management and Steps in Data Analysis

1. Definition: Data management encompasses practices for collecting, storing, and using data securely and efficiently.
2. Data Collection: Gathering relevant data from various sources.
3. Data Cleaning: Identifying and correcting errors and inconsistencies in the dataset.
4. Data Exploration: Understanding data characteristics through visualizations and summary statistics.
5. Data Interpretation: Analyzing results to draw meaningful conclusions and inform decisions.
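
The cleaning and exploration steps translate directly to code; a short pandas sketch, where the dataset and its age/income columns are assumed:

    # Steps 3-4: cleaning and exploration in pandas.
    import pandas as pd

    df = pd.read_csv("survey.csv")            # assumed dataset
    df = df.drop_duplicates()                 # cleaning: remove duplicate rows
    df["age"] = df["age"].clip(lower=0)       # cleaning: fix impossible values

    print(df.describe())                      # exploration: summary statistics
    print(df["age"].corr(df["income"]))       # exploration: relationship check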

11. Applications of Data Modeling in Business

1. Database Design: Structuring data for efficient storage and retrieval.
2. Business Process Mapping: Visualizing processes to identify bottlenecks and improve workflows.
3. Data Integration: Aligning data from multiple sources for comprehensive analysis.
4. Decision Support: Providing a framework for analyzing data to support strategic decisions.
5. Regulatory Compliance: Ensuring data structures adhere to industry regulations and standards.

12. Differentiating SQL and NoSQL

1. Structure: SQL databases are relational; NoSQL databases are non-relational or semi-structured.
2. Schema: SQL databases have fixed schemas; NoSQL databases allow flexible, dynamic schemas.
3. Query Language: SQL uses Structured Query Language; NoSQL uses various query languages or APIs.
4. Data Integrity: SQL focuses on ACID compliance; NoSQL often prioritizes availability and partition tolerance (CAP theorem).
5. Scalability: SQL databases typically scale vertically; NoSQL databases are designed for horizontal scaling.
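
A small sketch contrasting the two models in Python: a fixed-schema SQLite table versus a schema-flexible document-style record; the records themselves are illustrative:

    # SQL: the schema is declared and enforced up front.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Ada')")
    print(conn.execute("SELECT * FROM users").fetchall())

    # NoSQL (document style): each record can carry its own shape.
    doc_store = [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace", "tags": ["admin"]},  # extra field, no migration
    ]
    print([d["name"] for d in doc_store])

Adding the extra "tags" field to the document store requires no schema change, which is exactly the flexibility point 2 describes.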

13. Definition of a Database and Its Types

1. Database Definition: A structured collection of data stored electronically.
2. Relational Databases: Store data in tables with predefined schemas (e.g., MySQL, Oracle).
3. NoSQL Databases: Handle unstructured or semi-structured data; include document, key-value, and graph databases (e.g., MongoDB, Cassandra).
4. Data Warehouses: Centralized repositories designed for analysis and reporting, integrating data from multiple sources.
5. Data Variables: Refer to characteristics of the data, such as categorical (qualitative) and numerical (quantitative) variables.
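
A brief sketch of the categorical/numerical distinction from point 5, using pandas dtypes; the columns are illustrative:

    # Categorical vs. numerical variables in pandas.
    import pandas as pd

    df = pd.DataFrame({
        "city": pd.Categorical(["Pune", "Delhi", "Pune"]),  # categorical (qualitative)
        "revenue": [120.5, 98.0, 143.2],                    # numerical (quantitative)
    })

    print(df.dtypes)                    # category vs. float64
    print(df["city"].value_counts())    # counts suit a categorical variable
    print(df["revenue"].mean())         # means suit a numerical variable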
