HAI411 - HCC411 Assignment 2
Group 1 & 2
You are provided with a large dataset containing sales data for a multinational retail
company over the past five years. The dataset includes information on product
categories, sales figures, customer demographics, and regional sales.
Task:
Identify and handle missing values, outliers, and inconsistencies in the data.
Normalize the data and ensure data types are appropriate for analysis.
Create a data model using Power Pivot to optimize data analysis.
Use PivotTables and PivotCharts to summarize and visualize key trends and
patterns in the data.
Calculate relevant metrics such as sales growth, customer retention rates, and
product profitability.
Identify the top-performing product categories and regions.
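As a rough illustration of the cleaning and metric steps listed above, the sketch below uses plain Python rather than Power Pivot, which is what the task itself asks for. It fills missing values with the median, flags outliers with the interquartile-range (IQR) rule, and computes a period-over-period growth rate. The sample figures, the function names, and the choice of median imputation and the 1.5 x IQR threshold are all assumptions made for the example, not requirements of the assignment.

```python
from statistics import median, quantiles


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def growth_rate(previous, current):
    """Growth as a fraction: (current - previous) / previous."""
    return (current - previous) / previous


# Hypothetical monthly sales figures; None marks a missing value.
sales = [120.0, 130.0, None, 125.0, 128.0, 900.0, 122.0, 127.0]
cleaned = fill_missing(sales)
outliers = iqr_outliers(cleaned)

# Hypothetical yearly totals used to illustrate sales growth.
yearly_totals = {2022: 1500.0, 2023: 1800.0}
growth = growth_rate(yearly_totals[2022], yearly_totals[2023])

print(cleaned)
print(outliers)
print(f"{growth:.1%}")
```

The same logic maps onto Excel: median imputation and the IQR bounds can be computed with worksheet functions before loading the table into the data model, and the growth rate can be defined as a measure.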
3. Predictive Modeling:
Group 3 & 4
Develop a robust Hadoop architecture to efficiently process and analyze the provided
dataset. For each phase of the data pipeline, justify your choice of Hadoop components
(HDFS, YARN, MapReduce, Spark, Hive, Pig, etc.) based on their suitability for
handling large-scale data, complex data processing tasks, and real-time analytics.
Consider the trade-offs between batch processing and streaming, and the importance of
data quality and consistency.
Group 5 & 6
Analyze the factors hindering the widespread adoption of big data technologies in
Zimbabwe. Provide a specific case study where big data could significantly benefit the
country, but has not been fully utilized. Discuss the potential advantages of big data
adoption in this context.
Group 7 & 8
Given the customer data, create a Power BI report applying the Gestalt principles of data
visualization.
Group 9 & 10
CASE STUDY
Since it was founded in 1975 by Bill Gates and Paul Allen, Microsoft has been a key
player in just about every major advance in the use of computers, at home and in
business. Just as it anticipated the rise of the personal computer, the graphical operating
system and the internet, it wasn't taken by surprise by the dawn of the big data era. It
might not always be the principal source of innovation, but it has always excelled at
bringing innovation to the masses, and packaging it into a user-friendly product (even
though many would argue against this). It has caused controversy along the way, though,
and at one time was called an "abusive monopoly" by the US Department of Justice, over
its packaging of Internet Explorer with Windows operating systems. And in 2004 it was
fined over $600m by the European Union following anti-trust action.
The company's fortunes have wavered in recent years - notably, it was slow to come
up with a solid plan for capturing a significant share of the booming mobile market,
causing it to lose ground (and brand recognition) to competitors Apple and Google.
However, it remains a market leader in business and home computer operating systems,
office productivity software, web browsers, games consoles and search - Bing having
overtaken Yahoo as the second most-used search engine. It is now angling to become a
key player in big data, too - offering a suite of services and tools including data hosting
and analytics services based on Hadoop to businesses. But Microsoft had a substantial
head start over the competition - in fact, its founders' first forays into the world of big
data started way before even the first version of MS-DOS. Gates and Allen's first business
venture, two years before Microsoft, was a service providing real-time reports for traffic
engineers using data from roadside traffic counters. It's clear that the founders of what would grow into
the world's biggest software company knew how important information (specifically,
getting the right information to the right people, at the right time) would become in the
digital age. Microsoft competed in the search engine wars from the beginning, rebranding
its engine along the way from MSN Search, to Windows Live Search and Live Search
before finally arriving at Bing in 2009. Although most of the changes it brought in appeared
designed to ape the undisputed champion of search, Google (such as incorporating
various indexes, public records and relevant paid advertising into its results), there are
differences. Bing places more importance on how widely information is shared on social
networks when ranking it, as well as geographical locations associated with the data.
Microsoft's Kinect device for the Xbox aims to capture more data than ever from our own
living rooms. It uses an array of sensors to capture minute movements and is already
able to monitor and record the heart rate of users, as well as activity levels. Patent
applications suggest there are plans for much wider use, including monitoring the
behavior of television viewers, to provide a more interactive watching experience. The
move fits in with Microsoft's strategy of rebranding the Xbox - generally thought of as a
games console - into an intelligent living room activity hub which monitors, records and
adapts to users' behavior. No, you are not the only person who finds that idea a little bit
scary! In the business-to-business market, where Microsoft made its first fortunes with its
OS and office software, it is now throwing all of its considerable weight into big
data-related services for enterprise. Like Google with its AdWords, Bing Ads provides
pay-per-click advertising services which are targeted at a precise audience segment, identified
through data collected about our browsing habits. And like competitors Google and
Amazon it offers its own "big data in a box" solutions, combining open-source with
proprietary software to offer large-scale data analytics operations to businesses of all
sizes. Its Analytics Platform System marries Hadoop with its industry-standard SQL
Server database management technology, while its ubiquitous Office 365 will soon make
data analytics available to an even wider audience, with the inclusion of Power BI - adding
basic analytics functions to the world's most widely used office productivity software.
It is also looking to stake its claim on the Internet of Things with Azure Intelligent Systems
Service. This is a cloud-based framework built to handle streaming information from the
growing number of online-enabled industrial and domestic devices, from manufacturing
machinery to bathroom scales. It may have missed a trick with mobile - prompting many
premature declarations that Microsoft was falling behind the competition - but its keen
embrace of data and analytics services shows that it is still a key player. When CEO Satya
Nadella took up his post at the start of this year, he emailed all employees letting them
know he expected huge change in the industry, and the wider world, very soon, prompted
by "an ever growing network of connected devices, incredible computing capacity from
the cloud, insights from big data and intelligence from machine learning." So it's clear that
Microsoft aims to put big data at the heart of its business activities for the foreseeable
future, and provide (relatively) simple software solutions to help the rest of us do the
same.
Group 11 & 12
You’re tasked with analyzing a massive dataset of sensor readings from thousands of
IoT devices deployed across a city. The data includes timestamped readings of
temperature, humidity, and air quality.
Question:
1. Resource Management:
Explain how YARN would manage the resources required for this MapReduce
job.
What factors would influence the allocation of resources (e.g., CPU, memory) to
different Map and Reduce tasks?
2. Scalability and Fault Tolerance:
How would you ensure the scalability of your MapReduce pipeline to handle
increasing data volumes and device numbers?
What mechanisms would you implement to make the pipeline fault-tolerant, such
as handling node failures or data loss?
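To make the resource and scalability questions above concrete, the following sketch estimates how many task containers fit on one node, how many map tasks an input of a given size produces, and how many "waves" of map tasks the cluster needs. It is a back-of-the-envelope model under stated assumptions, not a description of how any particular cluster is configured: the cluster sizes and input volume are hypothetical, the scheduler is assumed to account for both memory and vcores, and the container used by the ApplicationMaster is ignored. In a real deployment the inputs would come from settings such as `yarn.nodemanager.resource.memory-mb`, `yarn.nodemanager.resource.cpu-vcores`, `mapreduce.map.memory.mb` and `mapreduce.map.cpu.vcores`.

```python
import math


def containers_per_node(node_memory_mb, node_vcores, task_memory_mb, task_vcores):
    """Containers that fit on one node, limited by the scarcer resource."""
    by_memory = node_memory_mb // task_memory_mb
    by_cpu = node_vcores // task_vcores
    return min(by_memory, by_cpu)


def map_task_count(input_size_mb, split_size_mb=128):
    """Roughly one map task per input split (128 MB split size assumed)."""
    return math.ceil(input_size_mb / split_size_mb)


def map_waves(tasks, nodes, per_node):
    """Rounds of map tasks needed when all containers run in parallel."""
    return math.ceil(tasks / (nodes * per_node))


# Hypothetical cluster: 20 nodes, each offering 64 GB and 16 vcores to YARN.
per_node = containers_per_node(
    node_memory_mb=65536, node_vcores=16, task_memory_mb=2048, task_vcores=1
)
# Hypothetical input: about 1 TB of sensor readings.
tasks = map_task_count(input_size_mb=1_000_000)
waves = map_waves(tasks, nodes=20, per_node=per_node)

print(per_node, tasks, waves)
```

In this example the nodes are CPU-bound (16 containers by vcores against 32 by memory), so adding memory alone would not raise parallelism. Doubling the node count roughly halves the number of waves, which is one way to reason about scaling as data volumes and device numbers grow.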
3. Real-Time Processing: