Big Data (Unit 1)
Digital Data
In the computing world, digital data is a collection of facts that is transmitted and stored in an electronic format and processed by software systems.
Digital data is data that represents other forms of data using specific machine language
systems that can be interpreted by various technologies.
Digital Data is generated by various devices, like desktops, laptops, tablets, mobile
phones, and electronic sensors.
Digital data is stored as strings of binary values (0s and 1s) on a storage medium that’s
either internal or external to the devices generating or accessing the information.
For Example - Whenever you send an email, read a social media post, or take pictures
with your digital camera, you are working with digital data.
Structured Data
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format.
This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.
It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search-engine algorithms.
For example, the employee table in a company database is structured: the employee details, job positions, salaries, etc., are present in an organized manner.
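As a small illustration, structured data such as the employee table above can be queried directly with SQL. The sketch below uses Python's built-in sqlite3 module; the table, columns, and values are hypothetical.

# Minimal sketch: storing and querying structured data (a hypothetical
# employee table) with SQL. Uses only Python's standard library.
import sqlite3

conn = sqlite3.connect(":memory:")   # throw-away in-memory database
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, position TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Analyst", 55000.0), (2, "Ravi", "Manager", 80000.0)],
)

# Because the data sits in a fixed row/column format, a simple query retrieves it.
for row in conn.execute("SELECT name, salary FROM employee WHERE salary > 60000"):
    print(row)                       # -> ('Ravi', 80000.0)
conn.close()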
Unstructured Data
Unstructured data refers to data that lacks any specific form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data.
About 80-90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.
For example, to play a video file it is essential that the correct codec (coder-decoder) is available. Unstructured data cannot be directly processed or queried using SQL. Email is an example of unstructured data.
Semi-structured Data
This is data that does not conform to a strict data model but has some structure.
Semi-structured data does not reside in a relational database, yet it has some organizational properties (such as tags or markers) that make it easier to analyze. XML and JSON documents are common examples.
• Data analysis, data analytics, and Big Data originate from the longstanding domain of database management. They rely heavily on the storage, extraction, and optimization techniques that are common for data stored in Relational Database Management Systems (RDBMS).
• Database management and data warehousing are considered the core components of Big Data Phase 1. They provide the foundation of modern data analysis as we know it today, using well-known techniques such as database queries, online analytical processing (OLAP), and standard reporting tools.
• In the early 2000s, the Internet and the Web began to offer unique data collections and data analysis opportunities. With the expansion of web traffic and online stores, companies such as Yahoo, Amazon, and eBay started to analyse customer behaviour by analysing click rates, IP-specific location data, and search logs. This opened a whole new world of possibilities.
• Although web-based unstructured content is still the main focus for many organizations in data analysis, data analytics, and big data, new possibilities to retrieve valuable information are now emerging from mobile devices.
• Mobile devices not only give the possibility to analyze behavioral data (such as clicks
and search queries), but also give the possibility to store and analyze location-based data
(GPS-data). With the advancement of these mobile devices, it is possible to track
movement, analyze physical behavior and even health-related data (number of steps you
take per day). This data provides a whole new range of opportunities, from
transportation, to city design and health care.
The primary benefit of a big data platform is that it reduces the complexity of multiple vendors and solutions into one cohesive solution. Big data platforms are also delivered through the cloud, where the provider offers all-inclusive big data solutions and services. A big data platform should offer:
• The ability to accommodate new applications and tools as business needs evolve.
• Tools for searching through massive data sets.
1. The digitization of society: - Big Data is largely consumer driven and consumer
oriented. Most of the data in the world is generated by consumers, who are nowadays
‘always-on’. Most people now spend 4-6 hours per day consuming and generating data
through a variety of devices and (social) applications. With every click, swipe or
message, new data is created in a database somewhere around the world. Because
everyone now has a smartphone in their pocket, the data creation sums to
incomprehensible amounts. Some studies estimate that 60% of data was generated within
the last two years, which is a good indication of the rate with which society has digitized.
Besides the plummeting of storage costs, a second key contributing factor to the affordability of Big Data has been the development of open source Big Data software frameworks. The most popular software framework (nowadays considered the standard for Big Data) is Apache Hadoop, for distributed storage and processing. Because these software frameworks are freely available as open source, it has become increasingly inexpensive to start Big Data projects in organizations.
4. Increased knowledge about data science :- The demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science. As a result, knowledge about and education in data science have become greatly professionalized, and more information becomes available every day. While statistics and data analysis previously remained mostly an academic field, they are quickly becoming a popular subject among students and the working population.
5. Social media applications: - Everyone understands the impact that social media has on daily life. However, in the study of Big Data, social media plays a role of paramount importance, not only because of the sheer volume of data that is produced every day through platforms such as Twitter, Facebook, LinkedIn, and Instagram, but also because social media provides nearly real-time data about human behaviour. Social media data provides insights into the behaviours, preferences, and opinions of 'the public' on a scale that has never been known before. Due to this, it is immensely valuable to anyone who is able to derive meaning from these large quantities of data. Social media data can be used, for example, to identify customer preferences for product development and to target new customers.
6. The upcoming internet of things (IoT) :- The Internet of things (IoT) is the network
of physical devices, vehicles, home appliances and other items embedded with
electronics, software, sensors, actuators, and network connectivity which enables these
objects to connect and exchange data. It is increasingly gaining popularity as consumer
goods providers start including ‘smart’ sensors in household appliances. Whereas the
average household in 2010 had around 10 devices that connected to the internet, this
number is expected to rise to 50 per household by 2020. Examples of these devices
include thermostats, smoke detectors, televisions, audio systems and even smart
refrigerators.
Data Sources :- Data is sourced from multiple inputs in a variety of formats, both structured and unstructured. Data sources, open and third-party, play a significant role in the architecture. Typical sources include relational databases, data warehouses, cloud-based data warehouses, SaaS applications, real-time data from company servers and sensors such as IoT devices, third-party data providers, and static files such as Windows logs. The data managed can be processed in both batch and real-time modes.
Data Storage :- Data is stored in distributed file stores that can hold large files in a variety of formats. Large numbers of files in different formats can also be stored in a data lake. This storage holds the data that is managed for batch operations. Common options include HDFS and blob containers on Microsoft Azure, AWS, and GCP.
Batch Processing :- Each chunk of data is split into different categories using long-running jobs, which filter, aggregate, and prepare the data for analysis. These jobs typically read from the data sources, process the data, and deliver the output to new files. Multiple approaches to batch processing are employed, including Hive jobs, U-SQL jobs, Sqoop or Pig jobs, and custom MapReduce jobs written in Java, Scala, or other languages such as Python.
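The following minimal sketch illustrates the map, shuffle, and reduce steps that such batch jobs follow, written in plain Python rather than as an actual Hadoop job; the input lines are illustrative.

# Minimal, self-contained sketch of the map -> shuffle -> reduce pattern
# followed by batch word-count jobs. Plain Python, not a real Hadoop job.
from collections import defaultdict

lines = ["big data batch processing", "batch jobs filter and aggregate data"]

# Map phase: emit (key, value) pairs for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # e.g. {'big': 1, 'data': 2, 'batch': 2, ...}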
1. Volume:
The name ‘Big Data’ itself is related to a size which is enormous.
Volume is a huge amount of data.
To determine the value of data, size of data plays a very crucial role. If the volume of
data is very large then it is actually considered as a ‘Big Data’. This means whether a
particular data can actually be considered as a Big Data or not, is dependent upon the
volume of data.
Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.
Example: In the year 2016, the estimated global mobile traffic was 6.2 Exabytes (6.2 billion GB) per month. It was also estimated that by the year 2020 there would be almost 40,000 Exabytes of data.
2. Velocity:
Velocity refers to the high speed at which data is generated, collected, and processed, for example the continuous streams of posts, clicks, and sensor readings that arrive every second.
3. Variety:
It refers to the nature of data: structured, semi-structured, and unstructured.
It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources that are both inside and outside
of an enterprise. It can be structured, semi-structured and unstructured.
4. Veracity:
It refers to inconsistencies and uncertainty in the data: the data that is available can sometimes be messy, and its quality and accuracy are difficult to control.
Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk can create confusion, whereas a smaller amount of data may convey only half or incomplete information.
5. Value:
After taking the other four V's into account, there is one more V, which stands for Value.
Bulk data that has no value is of no good to the company unless it is turned into something useful.
o Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information.
Big data has found many applications in various fields today. The major fields where
big data is being used are as follows.
Government :- Big data analytics has proven to be very useful in the government sector. Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign. More recently, big data analysis played a major role in the victory of the BJP and its allies in the 2014 Indian General Election. The Indian Government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.
Social Media Analytics :- The advent of social media has led to an outburst of big data. Various solutions have been built to analyse social media activity; for example, IBM's Cognos Consumer Insights, a point solution running on IBM's BigInsights Big Data platform, can make sense of the chatter. Social media can provide valuable real-time insights into how the market is responding to products and campaigns. With the help of these insights, companies can adjust their pricing, promotion, and campaign placements accordingly. Before big data can be utilized, some pre-processing needs to be done on it in order to derive intelligent and valuable results. Thus, to understand the consumer mindset, applying intelligent decisions derived from big data is necessary.
Technology :- The technological applications of big data include companies that deal with huge amounts of data every day and put them to use for business decisions as well. For example, eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising, all inside eBay's 90 PB data warehouse. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 it had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB. Facebook handles 50 billion photos from its user base. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
Fraud Detection :- For businesses whose operations involve any type of claims or
transaction processing, fraud detection is one of the most compelling Big Data
application examples. Historically, fraud detection on the fly has proven an elusive goal.
In most cases, fraud is discovered long after the fact, at which point the damage has been
done and all that’s left is to minimize the harm and adjust policies to prevent it from
happening again. Big Data platforms that can analyze claims and transactions in real
time, identifying large-scale patterns across many transactions or detecting anomalous
behavior from an individual user, can change the fraud detection game.
1). Easy Result Formats :- Results are an imperative part of a big data analytics model, as they support the decision-making process that shapes future strategy and goals. Analysts prefer to get results in real time so that they can make better and more appropriate decisions based on the analysis results. The tools must be able to produce results in such a way that they provide insights for the data analysis and decision-making platform. The platform should be able to provide real-time streams that help in making instant and quick decisions.
2). Raw data Processing :- Here, data processing means collecting and organizing data in a meaningful manner. Data modeling takes complex data sets and displays them in a visual form such as a diagram or chart. The data should be interpretable and digestible so that it can be used in making decisions. Big data analytics tools must be able to import data from various data sources such as Microsoft Access, text files, Microsoft Excel, and other files. Tools must be able to collect data from multiple data sources and in multiple formats; this reduces the need for data conversion and improves overall process speed. Export capabilities and the ability to visualize data sets and handle formats such as PDF, Excel, or Word also help in collecting and transferring data (a small import/export sketch follows the list below). The features listed below are essential for data processing tools:
Data Mining
Data Modeling
File Exporting
Data File Sources
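A minimal sketch of this import/export capability is shown below, assuming the pandas library is available; the file and column names are hypothetical.

# Minimal sketch: importing data from different sources and exporting results.
# File and column names are hypothetical; requires pandas (and openpyxl for Excel).
import pandas as pd

sales = pd.read_csv("sales.csv")           # text/CSV source
targets = pd.read_excel("targets.xlsx")    # Microsoft Excel source

# Combine the two sources on a shared column and summarise.
combined = sales.merge(targets, on="region")
summary = combined.groupby("region")[["revenue", "target"]].sum()

# Export the result so it can be shared or visualised elsewhere.
summary.to_excel("summary.xlsx")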
3). Prediction apps or Identity Management :- Identity management is also a required and essential feature for any data analytics tool. The tool should be able to access any system and all related information, whether related to computer hardware, software, or any other individual computer. The identity management system is also concerned with managing all issues related to identity, data protection, and access, so that it can support system and network passwords and protocols. It should be clear whether a user can access the system and at which level access permission is granted. Identity management applications and systems ensure that only authenticated users can access system information, and the tool or system must be able to organize a security plan that includes fraud analytics and real-time security.
4). Reporting Feature :- Businesses stay on top with the help of reporting features. Data should be fetched from time to time and represented in a well-organized manner. This way, decision-makers can take timely decisions and handle critical situations, especially in a rapidly moving environment. Data tools use dashboards to present KPIs and metrics. Reports must be customizable and oriented toward the target data set. The expected capabilities of reporting tools are real-time reporting, dashboard management, and location-based insights.
5). Security Features :- For any successful business, it is essential to keep its data safe. The tools used for big data analytics should offer safety and security for the data. For this, there should be an SSO (single sign-on) feature, so that the user does not need to sign in multiple times during the same session; with a single login, user activities and accounts can be monitored. Moreover, data encryption is also an imperative feature that should be provided by big data analytics tools: it changes data from a readable form into an unreadable one using algorithms and keys. Sometimes automatic encryption is also offered by web browsers, and data analytics tools offer comprehensive encryption capabilities as well. Single sign-on and data encryption are two of the most used and popular security features.
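As an illustration of data encryption, the sketch below uses the third-party Python 'cryptography' package (assumed to be installed); the message is illustrative and, in practice, the key must be stored securely.

# Minimal sketch of symmetric data encryption using the 'cryptography' package
# (assumed to be installed). In practice the key must be kept secret and safe.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # secret key
cipher = Fernet(key)

token = cipher.encrypt(b"customer record: name, phone, email")
print(token)                         # unreadable ciphertext

plain = cipher.decrypt(token)        # only possible with the same key
print(plain.decode())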
6). Fraud management :- A variety of fraud detection functionalities are involved in fraud analytics. When it comes to fraud, businesses often focus on how they will deal with fraud after it happens rather than on preventing it. Fraud detection can be performed by data analytics tools: the tools should be able to run repeated tests on the data at any time to ensure that nothing is amiss. In this way, threats can be identified quickly and efficiently with effective fraud analytics and identity management capabilities.
7). Technologies Support :- Your data analytics tool must support the latest tools and technologies, especially those that are important for your organization. One of the most important is A/B testing (also known as bucket or split testing), in which two versions of a webpage are compared to determine which performs better. Both versions are compared on the basis of how users interact with the page, and the better one is chosen (a small sketch of such a comparison follows the module list below). Moreover, as far as technical support is concerned, your tool must be able to integrate with Hadoop, a set of open-source programs that serves as the backbone of data-analytics activities. Hadoop mainly involves the following four modules, with which integration is expected:
MapReduce: reads data from the file system and processes it in parallel so that the results can be interpreted and visualized.
Hadoop Common: the collection of Java libraries and utilities required by the other modules, for example to read data stored in the user's file system.
YARN: manages the cluster's system resources so that data can be stored and analysis can be performed.
Hadoop Distributed File System (HDFS): allows data to be stored across machines in an easily accessible format.
If the results of a tool are integrated with these Hadoop modules, the user can easily send the results back to the user's system. In this way flexibility, interoperability, and two-way communication can be ensured between organizations.
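As promised above, here is a minimal sketch of an A/B (split) test comparing the conversion rates of two webpage versions with a two-proportion z-test; the visitor and conversion counts are illustrative.

# Minimal sketch of A/B (split) testing: comparing conversion rates of two
# webpage versions with a two-proportion z-test. Counts are illustrative.
from math import sqrt, erfc

conversions_a, visitors_a = 120, 2400    # version A
conversions_b, visitors_b = 150, 2350    # version B

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))         # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the difference between versions is real.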
8). Version Control :- Most data analytics work involves adjusting the parameters of analytics models, but this may cause problems when models are pushed into production. A version control feature in big data analytics tools improves the ability to track changes and makes it possible to roll back to previous versions whenever needed.
9). Scalability :- Data will not stay the same over time; it will grow as your organization grows. With big data tools, it is easy to scale up as soon as new data is collected for the company, and the data can be analyzed as expected. The meaningful insights derived from the new data can also be integrated with the existing data successfully.
10). Quick Integrations :- With integration capabilities, it is easy to share data and results with developers and data scientists. Big data tools should support quick integration with cloud apps, data warehouses, other databases, etc.
1. Informed Consent
To consent means that you give uncoerced permission for something to happen to
you. Informed consent is the most careful, respectful and ethical form of consent. It requires the
data collector to make a significant effort to give participants a reasonable and accurate
understanding of how their data will be used. In the past, informed consent for data collection
was typically taken for participation in a single study. Big data makes this form of consent
impossible as the entire purpose of big data studies, mining and analytics is to reveal patterns
and trends between data points that were previously inconceivable. In this way, consent cannot
possibly be ‘informed’ as neither the data collector nor the study participant can reasonably
know or understand what will be garnered from the data or how it will be used.
2. Privacy
The ethics of privacy involve many different concepts such as liberty, autonomy, security, and in
a more modern sense, data protection and data exposure. You can understand the concept of big
data privacy by breaking it down into three categories:
The scale and velocity of big data pose a serious concern, as many traditional privacy processes cannot protect sensitive data, which has led to an exponential increase in cybercrime and data leaks. In one such breach, a hacker was able to access and scrape a database which stored:
Names
Phone numbers
Email addresses
Profile descriptions
Follower and engagement data
Locations
LinkedIn profile links
Connected social media account login names
A further concern is the growing analytical power of big data, i.e. how this can impact privacy
when personal information from various digital platforms can be mined to create a full picture of
a person without their explicit consent. For example, if someone applies for a job, information
can be gained about them via their digital data footprint to identify political leanings, sexual
orientation, social life, etc. All of this data could be used as a reason to reject an employment
application even though the information was not offered up for judgement by the applicant.
3. Ownership
When we talk about ownership in big data terms, we steer away from the traditional or legal
understanding of the word as the exclusive right to use, possess, and dispose of property. Rather,
in this context, ownership refers to the redistribution of data, the modification of data, and the
ability to benefit from data innovations.
The right to control data - edit, manage, share and delete data
The right to benefit from data - profit from the use or sale of data
Contrary to common belief, those who generate data, for example, Facebook users, do not
automatically own the data. Some even argue that the data we provide to use ‘free’ online
platforms is in fact a payment for that platform. But big data is big money in today’s world.
Many internet users feel that the current balance is tilted against them when it comes to
ownership of data and the transparency of companies who use and profit from the data we share.
Algorithms are designed by humans, the data sets they study are selected and prepared by
humans, and humans have bias. So far, there is significant evidence to suggest that human prejudices are infecting technology and algorithms and negatively impacting the lives and freedoms of humans, particularly those who belong to minorities in our societies.
Algorithm biases have become such an ingrained part of everyday life that they have also been
documented as impacting our personal psyches and thought processes. The phenomenon occurs
when we perceive our reality to be a reflection of what we see online. However, what we view is
often a tailored reality created by algorithms and personalised using our previous viewing habits.
The algorithm shows us content that we are most likely to enjoy or agree with and discards the
rest.
The big data divide refers to the current state of data access: the understanding and mining capabilities of big data are concentrated in the hands of a few major corporations. These divides create 'haves' and 'have nots' in big data and exclude those who lack the necessary financial, educational, and technological resources to access and analyse big datasets.
The data divide creates further problems when we consider algorithm biases that place
individuals in categories based on a culmination of data that individuals themselves cannot
access. For example, profiling software can mark a person as a high-risk potential for
committing criminal activity, causing them to be legally stop-and-searched by authorities or
even denied housing in certain areas. The big data divide means that the ‘data poor’ cannot
understand the data or methods used to make these decisions about them and their lives.
Big data affects the compliance process directly because you will be expected to account for its flow inside your organization. Regulatory bodies are keen to examine every stage of data handling, including the collection, processing, and storage of data. The primary reason for this comprehensive evaluation is to make sure that the data is safe from cyberattacks. In order to obtain compliance status, you will need to build security measures to secure the data. During the analysis, you are expected to show how each of the risk-mitigation techniques works and its level of effectiveness. This thorough report on the data protection programs will make the organization's certification easier. In this way, big data assists the creation of a comprehensive risk assessment framework.
Data protection is a set of strategies and processes you can use to secure the privacy,
availability, and integrity of your data. It is sometimes also called data security. A data
protection strategy is vital for any organization that collects, handles, or stores sensitive
data. A successful strategy can help prevent data loss, theft, or corruption and can help
minimize damage caused in the event of a breach or disaster.
Data protection principles help protect data and make it available under any
circumstances. It covers operational data backup and business continuity/disaster
recovery (BCDR) and involves implementing aspects of data management and data
availability.
Data availability—ensuring users can access and use the data required to perform
business even when this data is lost or damaged.
Data lifecycle management—involves automating the transmission of critical data to
offline and online storage.
Information lifecycle management—involves the valuation, cataloguing, and
protection of information assets from various sources, including facility outages and disruptions,
application and user errors, machine failure, and malware and virus attacks.
Data privacy is a guideline for how data should be collected or handled, based on its
sensitivity and importance. Data privacy is typically applied to personal health
information (PHI) and personally identifiable information (PII). This includes financial
information, medical records, social security or ID numbers, names, birthdates, and
contact information. Data privacy concerns apply to all sensitive information that
organizations handle, including that of customers, shareholders, and employees. Often,
this information plays a vital role in business operations, development, and finances.
Data privacy helps ensure that sensitive data is only accessible to approved parties. It
prevents criminals from being able to maliciously use data and helps ensure that
organizations meet regulatory requirements.
Big data ethics also known as simply data ethics refers to systemizing, defending, and
recommending concepts of right and wrong conduct in relation to data, in particular
personal data. Since the dawn of the Internet the sheer quantity and quality of data has
dramatically increased and is continuing to do so exponentially. Big data describes this
large amount of data that is so voluminous and complex that traditional data processing
application software is inadequate to deal with them. Recent innovations in medical
research and healthcare, such as high-throughput genome sequencing, high-resolution
imaging, electronic medical patient records and a plethora of internet-connected health
devices have triggered a data deluge that will reach the exabyte range in the near future.
Data ethics is of increasing relevance as the quantity of data increases because of the
scale of the impact.
Big data ethics serves as a branch of ethics that evaluates data practices: the collection, generation, analysis, and distribution of data. As the world expands its digital footprint, the data collected has the potential to impact people and thus society. Much of this data consists of personally identifiable information (PII):
Full name
Birthdate
Street address
Phone number
Social security number
Credit card information
Bank account information
Passport number
When an organization fails to act ethically, it is no secret that this damages the company's brand and reputation. Similarly, after the many data scandals that occurred in the past couple of years, people lost trust in companies that manipulated customer data. However, these scandals do not consist only of data manipulation and sale: housing data and keeping it safe from harm is also part of big data ethics. Some of the biggest data breaches ever to occur had lasting effects on brand trustworthiness. Therefore, adopting a concrete big data ethics framework is essential for the success of any large organization. Companies must act as protectors of information as long as they choose to collect it.
The bigger the data central to the business, the higher its risk of violating customer
privacy and individual rights. In 2022, the responsibility to actively manage data privacy
and security falls on roles within the large organization.
Privacy
When users submit their information, it’s with the expectation that companies will keep it
to themselves. Two common scenarios exist when this information is no longer private:
o A data breach
o A sale of information to a third party
With the growth of data collection, users expect talented IT professionals to be able to protect their data. If a data breach occurs, the company has failed to meet privacy expectations. Furthermore, in the 21st century, consumers expect large companies to have the means to protect data if they choose to collect it.
Lack of transparency
Many users are unaware that their information is being collected, and companies often go to great lengths to make the collection inconspicuous. Websites add cookie opt-ins to pop-ups so that users accept quickly just to see the page. After getting a user to submit information, some companies do not disclose how they use that person's data. Long legal documents are formatted in a way that no user is realistically expected to read through. It is only after scandals or media reporting that people discover that a company's data collection methods are unsatisfactory.
Lack of governance
Before big data, the method of collecting information was simple: people either gave you a physical copy of their information or they did not, and companies stored the physical files under lock and key. Although someone could always potentially steal an identity, criminals had difficulty doing so to masses of people at one time. Now, in a new age of information abundance, users unknowingly submit heaps of information, and the possibilities of using that information with AI and algorithms to someone's advantage are endless.
Discrimination
Algorithms make assumptions about users and, depending on the assumption, can begin to discriminate. For example, court systems have started using algorithms to evaluate the criminal risk of defendants and have used this data while sentencing. If the training data over-represents a certain gender, nationality, or race, then the results carry bias against groups outside of those represented.
Big Data Ethics Framework and Other Ways to Resolve Ethical Issues
Big data privacy involves properly managing big data to minimize risk and protect sensitive
data. Because big data comprises large and complex data sets, many traditional privacy
processes cannot handle the scale and velocity required. To safeguard big data and ensure it
can be used for analytics, you need to create a framework for privacy protection that can
handle the volume, velocity, variety, and value of big data as it is moved between
environments, processed, analyzed, and shared.
In an era of multi-cloud computing, data owners must keep up with both the pace of data growth and the proliferation of regulations that govern it, especially regulations protecting the privacy of sensitive data and personally identifiable information (PII). With more data spread across more locations, the business risk of a privacy breach has never been higher, and with it come consequences ranging from high fines to loss of market share. Big data privacy is also a matter of customer trust. The more data you collect about users, the easier it gets to "connect the dots": to understand their current behavior, draw inferences about their future behavior, and eventually develop deep and detailed profiles of their lives and preferences.
The more data you collect, the more important it is to be transparent with your customers
about what you're doing with their data, how you're storing it, and what steps you're taking
to comply with regulations that govern privacy and data protection. The volume and
velocity of data from existing sources, such as legacy applications and e-commerce, is
expanding fast. You also have new (and growing) varieties of data types and sources, such
as social networks and IoT device streams.
Prediction 1: Data privacy mandates will become more common. As organizations store more types of sensitive data in larger amounts over longer periods of time, they will be under increasing pressure to be transparent about what data they collect, how they analyze and use it, and why they need to retain it. The European Union's General Data Protection Regulation (GDPR) is a high-profile example, and more government agencies and regulatory organizations are following suit. To respond to these growing demands, companies need reliable, scalable big data privacy tools that encourage and help people to access, review, correct, anonymize, and even purge some or all of their personal and sensitive information.
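As one simple illustration of such tooling, the sketch below pseudonymises personal identifiers by replacing them with salted hashes before analysis. This is only one basic technique, not a complete anonymisation solution; the field names and salt are hypothetical.

# Minimal sketch of pseudonymising personal identifiers before analysis by
# replacing them with salted hashes. One simple technique, not a complete
# anonymisation solution; field names and salt are illustrative.
import hashlib

SALT = b"rotate-this-secret-salt"

def pseudonymise(value: str) -> str:
    """Return a stable, non-reversible token for a personal identifier."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

record = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 7}
safe_record = {
    "name": pseudonymise(record["name"]),
    "email": pseudonymise(record["email"]),
    "purchases": record["purchases"],      # non-identifying fields kept as-is
}
print(safe_record)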
Prediction 2: New big data analytic tools will enable organizations to perform deeper
analysis of legacy data, discover uses for which the data wasn't originally intended, and
combine it with new data sources. Big data analytics tools and solutions can now dig into
data sources that were previously unavailable, and identify new relationships hidden in
legacy data. That’s a great advantage when it comes to getting a complete view of your
enterprise data especially for customer 360 and analytics initiatives. But it also raises
questions about the accuracy of aging data and the ability to track down entities for
consent to use their information in new ways.
Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer preferences
that can help organizations make informed business decisions. Big Data analytics deals
with collection, storage, processing and analysis of this massive scale data. Specialized
tools and frameworks are required for big data analysis when:
(1) the volume of data involved is so large that it is difficult to store, process and
analyse data on a single machine,
(2) the velocity of data is very high and the data needs to be analysed in real-time,
(3) there is a variety of data involved, which can be structured, unstructured or
semi-structured, and is collected from multiple data sources,
(4) various types of analytics need to be performed to extract value from the data
such as descriptive, diagnostic, predictive and prescriptive analytics.
Big data analytics involves several steps, starting from data cleansing, data munging (or wrangling), and data processing through to visualization. The big data analytics life-cycle starts with the collection of data from multiple data sources. Specialized tools and frameworks are required to ingest the data from different sources into the big data analytics backend. The data is stored in specialized storage solutions (such as distributed file systems and non-relational databases) which are designed to scale.
Based on the analysis requirements (batch or real-time) and the type of analysis to be performed (descriptive, diagnostic, predictive, or prescriptive), specialized frameworks are used. Big data analytics is enabled by several technologies such as cloud computing, distributed and parallel processing frameworks, non-relational databases, and in-memory computing. Some examples of big data are listed as follows:
Data generated by social networks including text, images, audio and video data
Click-stream data generated by web applications such as e-Commerce to analyse user
behaviour
Machine sensor data collected from sensors embedded in industrial and energy systems
for monitoring their health and detecting failures
Healthcare data collected in electronic health record (EHR) systems
Logs generated by web applications
Stock markets data
Transactional data generated by banking and financial applications
➨Big data analysis derives innovative solutions. It helps in understanding and targeting customers and in optimizing business processes.
➨It helps in improving science and research.
➨It improves healthcare and public health through the availability of patient records.
➨It helps in financial trading, sports, polling, security/law enforcement, etc.
➨Anyone can access vast amounts of information via surveys and obtain answers to queries.
➨New data is added every second.
➨A single platform can carry virtually unlimited information.
The challenges in Big Data are the real implementation hurdles. These require immediate attention and need to be handled, because if they are not handled, the technology may fail, which can also lead to unpleasant results.
IDA (Intelligent Data Analysis), in general, includes three stages: (1) preparation of data; (2) data mining; (3) data validation and explanation. The preparation of data involves selecting the required data from the relevant data source and incorporating it into a data set that can be used for data mining. The main goal of intelligent data analysis is to obtain knowledge. Data analysis is a process combining the extraction of data from a data set, analysis, classification of data, organization, reasoning, and so on. It is challenging to choose suitable methods to
resolve the complexity of the process. Regarding the term visualization, we have moved
Nature of Data
What is Data?
Properties of data
• 1. Consistency
The element of consistency removes room for contradictory data. Rules will have to be
set around consistency metrics, which include range, variance, and standard deviation.
• 2. Accuracy
It is a necessity for DQ data to remain error-free and precise, which means it should be
free of erroneous information, redundancy, and typing errors. Error ratio and deviation
are two examples of accuracy metrics.
• 3. Completeness
The data should be complete, with no missing values. For cloud data quality tools, all data entries should be complete, with no room for lapses. The completeness metric is defined as the percentage of complete data records (a small sketch computing this metric appears after this list).
• 4. Auditability
The ability to trace data and analyse changes over time adds to the auditability dimension of Data Quality. Examples of auditability metrics are the percentage of gaps in data sets, of modified data, and of untraceable or disconnected data.
• 5. Validity
Quality data in terms of validity indicates that all data is aligned with the existing
formatting rules. An example of a validity metric is the percentage of data records in the
required format.
• 6. Uniqueness
Each entity or event should be recorded only once; duplicate records reduce data quality. An example of a uniqueness metric is the percentage of duplicate records in a data set.
• 7. Timeliness
For data to retain its quality, it should be recorded promptly so that changes are captured. Tracking data weekly rather than annually helps maintain timeliness. An example of a timeliness metric is time variance.
• 8. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
• 9. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time. Progress toward performance targets should reflect real changes
rather than variations in data collection approaches or methods. Source data is clearly
identified and readily available from manual, automated, or other systems and records.
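As referenced under Completeness above, a minimal sketch of computing the completeness metric is shown below, assuming the pandas library; the sample records are illustrative.

# Minimal sketch of the completeness metric: the percentage of records with no
# missing values. Requires pandas; the sample data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", None, "Meena"],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
    "age":   [34, 29, 41, None],
})

complete_rows = df.notna().all(axis=1).sum()   # rows with every field present
completeness = 100 * complete_rows / len(df)
print(f"Completeness: {completeness:.1f}%")    # -> 25.0% for this sample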
Types of data
Categorical Data
Nominal Data
Nominal values represent discrete units and are used to label variables that have no quantitative value; think of them simply as 'labels'. Nominal data has no order, so if you change the order of its values, the meaning remains the same.
Thus, nominal data are observed but not measured, are unordered and non-equidistant, and have no meaningful zero. The only numerical operations you can perform on nominal data are to state that one observation is (or is not) equal to another (equality or inequality) and to group observations accordingly. You cannot perform arithmetic operations on them, as those are reserved for numerical data. With nominal data, you can calculate frequencies, proportions, percentages, and central points.
Examples of Nominal data:
English
German
French
Punjabi
American
Indian
Japanese
German
Ordinal Data
Ordinal values represent discrete and ordered units. Ordinal data is therefore nearly the same as nominal data, except that its ordering matters: the categories can be ordered, for example 1st, 2nd, and so on. However, the relative distances between adjacent categories are not necessarily equal.
Examples of Ordinal data:
Opinion
o Agree
o Mostly agree
o Neutral
o Mostly disagree
o Disagree
Time of day
o Morning
o Noon
o Night
Numerical Data
This data type quantifies things by taking on numerical values that make it countable in nature; the price of a smartphone, the discount offered, or a count of items are examples.
For example, the height of a person can vary from x cm to y cm and can be further broken down into fractional values.
Interval Data
Interval data are measured and ordered with equidistant items but have no meaningful zero.
The central point of an interval scale is that the word 'interval' signifies 'space in between', which is the significant thing to recall: interval scales tell us not only about the order but also about the value between each item.
Even though interval data can appear fundamentally the same as ratio data, the difference lies in their defined zero points. If the zero point of the scale has been chosen arbitrarily, then the data cannot be ratio data and must be interval data. Examples of interval data: temperature in Celsius or Fahrenheit, and calendar dates.
Ratio Data
Ratio data are measured and ordered with equidistant items and have a meaningful zero; unlike interval data, they can never be negative. An outstanding example of ratio data is the measurement of height: it can be measured in centimetres, inches, metres, or feet, and it is not possible to have a negative height. Ratio data tells us about the order of variables and the differences among them, and it has an absolute zero. This permits a wide range of calculations and inferences to be performed and drawn. Examples of ratio data: height, weight, and duration.
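A minimal sketch contrasting these data types in code is shown below, assuming the pandas library; the values are illustrative.

# Minimal sketch contrasting nominal, ordinal, and ratio data using pandas.
import pandas as pd

# Nominal: labels with no order; you can only count and compare for equality.
languages = pd.Series(["English", "German", "French", "German"], dtype="category")
print(languages.value_counts())

# Ordinal: categories with an explicit, meaningful order.
opinions = pd.Series(
    pd.Categorical(
        ["Agree", "Neutral", "Mostly agree", "Disagree"],
        categories=["Disagree", "Mostly disagree", "Neutral",
                    "Mostly agree", "Agree"],
        ordered=True,
    )
)
print(opinions.min(), "->", opinions.max())   # order comparisons are valid

# Ratio data (e.g. heights) has a meaningful zero and supports full arithmetic.
heights_cm = pd.Series([162.0, 175.5, 181.2])
print(heights_cm.mean())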
In general, data is any set of characters that is gathered and translated for some purpose, usually analysis. If data is not put into context, it is of no use to a human or a computer. There are multiple types of data, such as the categorical and numerical types described above.
In a computer's storage, digital data is a sequence of bits (binary digits) that have the value
one or zero. Data is processed by the CPU, which uses logical operations to produce new
data (output) from source data (input).
Analytic processes
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
– Deployment: In this phase, we deploy the results of the analysis.
– Business Understanding: Whenever any requirement occurs, we first need to determine the business objective.
– Data Exploration: Data collected from the various sources is described in terms of its application and the need for the project in this phase.
– Data Preparation: Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
– Data Modeling and Evaluation: Test cases are built for assessing the model, and the model is tested and implemented on the data in this phase.
Analytic tools
Thus, BDA (Big Data Analytics) tools are used throughout the development of BDA applications.
Analysis vs Reporting
Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.
• Data reporting: Gathering data into one place and presenting it in visual representations
• A firm may be focused on the general area of analytics (strategy, implementation, reporting,
etc.)
– but not necessarily on the specific aspect of analysis.
• It’s almost like some organizations run out of gas after the initial set-up-related activities and
don’t make it to the analysis stage
Reports are like robots: they monitor and alert you. Analysis is like a parent: it can figure out what is actually going on (hungry, dirty diaper, no pacifier, teething, tired, ear infection, etc.).
Reporting and analysis can go hand-in-hand:
Reporting provides little or no context about what is happening in the data, and context is critical to good analysis.
Reporting translates raw data into information.
Reporting usually raises the question: What is happening?
Analysis transforms data into insights: Why is it happening? What can you do about it?
Thus, analysis and reporting go together; each is needed and used in its own context.
The characteristics of the data analysis depend on different aspects such as volume, velocity, and
variety.
1. Programmatic
Because of the scale of the data, there might be a need to write a program for data analysis, using code to manipulate the data or to do any kind of exploration.
2. Data-driven
Many data scientists rely on a hypothesis-driven approach to data analysis. However, one can also let the data itself drive the analysis, which can be a significant advantage when there is a large amount of data. For example, machine learning approaches can be used in place of hypothesis-driven analysis (a minimal sketch follows).
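The minimal sketch below illustrates this data-driven approach by letting a model learn a pattern from the data, assuming scikit-learn and NumPy are available; the tiny data set is illustrative.

# Minimal sketch of a data-driven approach: letting a model learn a pattern
# from the data instead of testing a pre-stated hypothesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [pages_viewed, minutes_on_site]; label: 1 = made a purchase.
X = np.array([[1, 2.0], [2, 3.5], [8, 12.0], [9, 15.0], [3, 4.0], [7, 11.0]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[6, 9.0]]))        # predicted class for a new visitor
print(model.predict_proba([[6, 9.0]]))  # and the associated probabilities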
3. Attributes usage
Proper and accurate analysis of data can make use of many attributes. In the past, analysts dealt with hundreds of attributes or characteristics of the data source. With Big Data, there are now thousands of attributes and millions of observations.
4. Iterative
Since the whole data set is broken into samples and the samples are then analyzed, data analytics can be iterative in nature. Better compute power enables iteration of the models until data analysts are satisfied. This has led to the development of new applications designed for addressing analysis requirements and time frames.
a) Apache Hadoop
• Apache Hadoop provides infrastructures and platforms for other specific Big Data applications.
b) Apache Flink
• Apache Flink is
– an open source platform,
– a streaming dataflow engine that provides communication, fault tolerance, and data distribution for computation over data streams.
– Flink is a top-level Apache project and a scalable data analytics framework that is fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.
c) Kinesis
– Kinesis is an out-of-the-box streaming data tool.
– Kinesis comprises shards, which Kafka calls partitions.
– For organizations that want to take advantage of real-time or near real-time access to large stores of data, Amazon Kinesis is a great choice.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data which is followed by loading the
aggregate data into a data warehouse.
– Data is put into Kinesis streams.
– This ensures durability and elasticity.
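A minimal sketch of putting records into a Kinesis stream is shown below, using the boto3 AWS SDK for Python (assumed to be installed and configured with credentials); the stream name, region, and payload are illustrative.

# Minimal sketch of writing records into an Amazon Kinesis data stream with
# boto3. The stream name, region, and payload below are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"user_id": "u-42", "action": "click", "page": "/products/123"}

response = kinesis.put_record(
    StreamName="clickstream-demo",           # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),  # payload bytes
    PartitionKey=event["user_id"],           # routes the record to a shard
)
print(response["SequenceNumber"])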