HAI411 - HCC411 Assignment 2


Assignment 2

Group 1 & 2

You are provided with a large dataset containing sales data for a multinational retail
company over the past five years. The dataset includes information on product
categories, sales figures, customer demographics, and regional sales.

Task:

1. Data Cleaning and Preparation:

 Identify and handle missing values, outliers, and inconsistencies in the data.
 Normalize the data and ensure data types are appropriate for analysis.
 Create a data model using Power Pivot to optimize data analysis.
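
The same cleaning steps can be prototyped outside Excel; a minimal pandas sketch (the column names, sample values, and the z-score threshold are illustrative assumptions, not part of the provided dataset):

```python
import pandas as pd
import numpy as np

# Hypothetical slice of the sales dataset; field names are assumptions.
df = pd.DataFrame({
    "category": ["Electronics", "Clothing", None, "Electronics"],
    "sales": [1200.0, 340.0, 560.0, np.nan],
    "region": ["North", "South", "South", "north"],
})

# Missing values: drop rows with no category, impute sales with the median.
df = df.dropna(subset=["category"])
df["sales"] = df["sales"].fillna(df["sales"].median())

# Inconsistencies: normalise region labels to one case.
df["region"] = df["region"].str.title()

# Outliers: a simple z-score rule (the cut-off of 3 is a judgment call).
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
df = df[z.abs() < 3]

# Appropriate dtypes for analysis.
df["category"] = df["category"].astype("category")
```

The same decisions (drop vs. impute, which outlier rule, which canonical spelling) apply whether the work is done in Excel, Power Query, or code.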

2. Exploratory Data Analysis (EDA):

 Utilize PivotTables and Pivot Charts to summarize and visualize key trends and
patterns in the data.
 Calculate relevant metrics such as sales growth, customer retention rates, and
product profitability.
 Identify the top-performing product categories and regions.
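
As a rough analogue of the PivotTable summaries above, a hedged pandas sketch (the years, categories, and revenue figures are invented for illustration):

```python
import pandas as pd

# Hypothetical yearly sales per category.
sales = pd.DataFrame({
    "year":     [2022, 2022, 2023, 2023],
    "category": ["Electronics", "Clothing", "Electronics", "Clothing"],
    "revenue":  [1000.0, 500.0, 1300.0, 450.0],
})

# PivotTable analogue: total revenue per category per year.
pivot = sales.pivot_table(index="category", columns="year",
                          values="revenue", aggfunc="sum")

# Year-over-year sales growth per category.
growth = (pivot[2023] - pivot[2022]) / pivot[2022]

# Top-performing category by total revenue.
top = pivot.sum(axis=1).idxmax()
```

Customer retention and profitability follow the same pattern: pivot on the relevant key, then compute a ratio between two aggregated columns.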

3. Predictive Modeling:

 Use Excel's forecasting tools (e.g., Exponential Smoothing, Linear Regression) to
predict future sales trends for specific product categories.
 Create a machine learning model (using tools like Power BI or Python integrated
with Excel) to predict customer churn based on their purchase behavior.
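
The idea behind Excel's Exponential Smoothing tool can be mimicked in a few lines of Python. This is a sketch of *simple* exponential smoothing with an assumed smoothing factor, not Excel's exact FORECAST.ETS algorithm (which also models trend and seasonality):

```python
def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing; returns the one-step-ahead forecast.

    alpha is the smoothing factor (0 < alpha <= 1); the value here is an
    illustrative assumption, not a tuned parameter.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

monthly_sales = [120, 130, 125, 140, 150]  # hypothetical figures
forecast = exponential_smoothing(monthly_sales)
```

With alpha = 1 the forecast collapses to the last observation; smaller values give older observations more weight and smooth out noise.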

Group 3 & 4

Develop a robust Hadoop architecture to efficiently process and analyze the provided
dataset. For each phase of the data pipeline, justify your choice of Hadoop components
(HDFS, YARN, MapReduce, Spark, Hive, Pig, etc.) based on their suitability for
handling large-scale data, complex data processing tasks, and real-time analytics.
Consider the trade-offs between batch processing and streaming, and the importance of
data quality and consistency.
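
The batch-versus-streaming trade-off can be seen in miniature outside Hadoop itself; a toy Python sketch (the readings are invented):

```python
from functools import reduce

readings = [3.0, 5.0, 4.0, 6.0]  # invented values standing in for the dataset

# Batch style (MapReduce, Hive): the whole dataset sits in storage before the
# job runs, so one pass produces the exact final aggregate.
batch_total = reduce(lambda acc, x: acc + x, readings, 0.0)

# Streaming style (Spark Structured Streaming): maintain a running aggregate
# as each record arrives; results are available immediately, but each interim
# value only approximates the final answer until the stream is drained.
running = []
total = 0.0
for x in readings:
    total += x
    running.append(total)
```

The two converge on the same answer; what differs is latency (streaming answers early) versus simplicity and consistency (batch answers once, over a complete, quality-checked dataset).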

Group 5 & 6

Analyze the factors hindering the widespread adoption of big data technologies in
Zimbabwe. Provide a specific case study where big data could significantly benefit the
country, but has not been fully utilized. Discuss the potential advantages of big data
adoption in this context.

Group 7 & 8

Given the customer data, create a Power BI report applying the Gestalt principles of data
visualization.

Group 9 & 10

CASE STUDY
Since it was founded in 1975 by Bill Gates and Paul Allen, Microsoft has been a key
player in just about every major advance in the use of computers, at home and in
business. Just as it anticipated the rise of the personal computer, the graphical operating
system and the internet, it wasn't taken by surprise by the dawn of the big data era. It
might not always be the principal source of innovation, but it has always excelled at
bringing innovation to the masses, and packaging it into a user-friendly product (even
though many would argue against this). It has caused controversy along the way, though,
and at one time was called an "abusive monopoly" by the US Department of Justice, over
its packaging of Internet Explorer with Windows operating systems. And in 2004 it was
fined over $600m by the European Union following anti-trust action.

The company's fortunes have wavered in recent years - notably, it was slow to come
up with a solid plan for capturing a significant share of the booming mobile market,
causing it to lose ground (and brand recognition) to competitors Apple and Google.
However, it remains a market leader in business and home computer operating systems,
office productivity software, web browsers, games consoles and search - Bing having
overtaken Yahoo as the second most-used search engine. It is now angling to become a
key player in big data, too - offering a suite of services and tools including data hosting
and analytics services based on Hadoop to businesses. But Microsoft had a substantial
head-start over the competition - in fact their first forays into the world of big data started
way before even the first version of MS-DOS. Gates and Allen's first business venture,
founded two years before Microsoft, was a service providing real-time reports for traffic
engineers using data from roadside traffic counters. It's clear that the founders of what would grow into
the world's biggest software company knew how important information (specifically,
getting the right information to the right people, at the right time) would become in the
digital age. Microsoft competed in the search engine wars from the beginning, rebranding
its engine along the way from MSN Search, to Windows Live Search and Live Search
before finally arriving at Bing in 2009. Although most of the changes it brought in appeared
designed to ape the undisputed champion of search Google (such as incorporating
various indexes, public records and relevant paid advertising into its results) there are
differences. Bing places more importance on how well-shared information is on social
networks when ranking it, as well as geographical locations associated with the data.
Microsoft's Kinect device for the Xbox aims to capture more data than ever from our own
living rooms. It uses an array of sensors to capture minute movements and is already
able to monitor and record the heart rate of users, as well as activity levels. Patent
applications suggest there are plans for much wider use, including monitoring the
behavior of television viewers, to provide a more interactive watching experience. The
move fits in with Microsoft's strategy of rebranding the Xbox - generally thought of as a
games console - into an intelligent living room activity hub which monitors, records and
adapts to users' behavior. No, you are not the only person who finds that idea a little bit
scary! In the business-to-business market, where Microsoft made its first fortunes with its
OS and office software, it is now throwing all of its considerable weight into big data-
related services for enterprise. Like Google with its AdWords, Bing Ads provides pay-per-
click advertising services which are targeted at a precise audience segment, identified
through data collected about our browsing habits. And like competitors Google and
Amazon it offers its own "big data in a box" solutions, combining open-source with
proprietary software to offer large-scale data analytics operations to businesses of all
sizes. Its Analytics Platform System marries Hadoop with its industry standard SQL
Server database management technology, while its ubiquitous Office 365 will soon make
data analytics available to an even wider audience, with the inclusion of PowerBI - adding
basic analytics functions to the world's most widely used office productivity software.

It is also looking to stake its claim on the Internet of Things with Azure Intelligent Systems
Service. This is a cloud-based framework built to handle streaming information from the
growing number of online-enabled industrial and domestic devices, from manufacturing
machinery to bathroom scales. It may have missed a trick with mobile - prompting many
premature declarations that Microsoft was falling behind the competition - but its keen
embrace of data and analytics services show that it is still a key player. When CEO Satya
Nadella took up his post at the start of this year he emailed all employees letting them
know he expected huge change in the industry, and the wider world, very soon, prompted
by "an ever growing network of connected devices, incredible computing capacity from
the cloud, insights from big data and intelligence from machine learning." So it's clear that
Microsoft aims to put big data at the heart of its business activities for the foreseeable
future, and provide (relatively) simple software solutions to help the rest of us do the
same.

a) In relation to Big Data, why was Microsoft labelled a monopoly? [5]


b) How did the CEO of Microsoft plan to put big data at the heart of its business
operations and how would this convert into profit for Microsoft? [5]
c) Explain how they adopted the Big Data Analytics life cycle to shift their business
focus [10]

Group 11 & 12

You’re tasked with analyzing a massive dataset of sensor readings from thousands of
IoT devices deployed across a city. The data includes timestamped readings of
temperature, humidity, and air quality.
Question:

1. Data Processing Pipeline:


 Describe how you would design a MapReduce pipeline to process this data.
 Specify the roles of the Map and Reduce tasks in this context.
 How would you handle data partitioning and shuffling in this scenario?
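
The Map, shuffle, and Reduce roles above can be sketched in plain Python; this toy simulation (the record layout and the per-device average metric are assumptions) mirrors in miniature what the framework does across a cluster:

```python
from collections import defaultdict

# Hypothetical sensor records: (device_id, metric, value).
readings = [
    ("dev1", "temp", 21.0),
    ("dev2", "temp", 19.0),
    ("dev1", "temp", 23.0),
    ("dev2", "humidity", 0.55),
]

def map_task(record):
    """Emit a (key, value) pair keyed by device and metric."""
    device, metric, value = record
    yield ((device, metric), value)

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Average all readings for one (device, metric) key."""
    return key, sum(values) / len(values)

mapped = [pair for record in readings for pair in map_task(record)]
averages = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
```

In a real job the key choice doubles as the partitioning strategy: records sharing a (device, metric) key are routed to the same reducer, so time-based or device-based keys directly control shuffle volume and reducer skew.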

2. Resource Management:

 Explain how YARN would manage the resources required for this MapReduce
job.
 What factors would influence the allocation of resources (e.g., CPU, memory) to
different Map and Reduce tasks?

3. Scalability and Fault Tolerance:

 How would you ensure the scalability of your MapReduce pipeline to handle
increasing data volumes and device numbers?
 What mechanisms would you implement to make the pipeline fault-tolerant, such
as handling node failures or data loss?
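
One such mechanism, re-executing a failed task against a replicated copy of its input (as YARN re-schedules failed attempts on another node holding an HDFS replica), can be sketched as follows; `run_with_retries` is a hypothetical helper, not a Hadoop API:

```python
def run_with_retries(task, replicas, max_attempts=3):
    """Run `task` on successive input replicas until one attempt succeeds.

    Mirrors the framework's behaviour: a failed map/reduce attempt is
    re-scheduled elsewhere, reading a different replica of the same block.
    """
    last_error = None
    for _attempt, data in zip(range(max_attempts), replicas):
        try:
            return task(data)
        except Exception as exc:  # a real scheduler filters failure types
            last_error = exc
    raise RuntimeError("task failed on all replicas") from last_error

replicas = [None, [1, 2, 3]]          # first copy is unreadable in this sketch
result = run_with_retries(sum, replicas)  # falls back to the second copy
```

HDFS's default replication factor of three is what makes this retry strategy cheap: losing a node loses no data, only locality.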

4. Real-Time Processing:

 Discuss the limitations of MapReduce for real-time processing of sensor data.
 Suggest alternative approaches or modifications to the MapReduce framework to
enable near-real-time analysis.
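
One common near-real-time alternative is windowed aggregation over the incoming stream, the kind of operator offered by Spark Streaming or Flink; a minimal Python sketch (window size and readings are arbitrary):

```python
from collections import deque

class SlidingWindowAverage:
    """Running average over the last `size` readings.

    Unlike a MapReduce job, which must wait for a complete input batch,
    this emits an updated result as every record arrives.
    """
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old readings fall out automatically

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

temps = SlidingWindowAverage(size=3)
latest = [temps.add(v) for v in [20.0, 22.0, 24.0, 30.0]]
```

The trade-off is the mirror image of batch MapReduce: answers are immediate and bounded in memory, but each one reflects only a recent window rather than the full, quality-checked dataset.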
