Ds Unit-1
With the growth of online services and the extensive use of the Internet, businesses, stock markets, economies, and government organizations have changed the way they operate. This has, in turn, changed the way people live and use technology. Alongside this growth, the volume of information flowing and being collected every day has increased more than ever before.
Such an explosion of data is relatively new, because every user and organization can now store information in digital form. Handling this exponential increase in data requires a suitable mechanism and approach, and Big Data is one way to handle it. In this lesson, you will learn what Big Data is, why it is important, and how it contributes to large-scale data handling.
Big data can be defined as a concept used to describe a large volume of data, both structured and unstructured, that grows day by day in any system or business. However, it is not the quantity of data that is essential; what matters is what a firm or organization can do with that data. Big data can be analysed for insights and predictions, which lead to better decisions and more reliable business strategies.
Big Data refers to amounts of data so large that they cannot be handled by traditional data storage or processing units. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data generated every day from many sources, such as business processes, machines, social media platforms, networks, and human interactions.
Facebook, for example, generates approximately a billion messages, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from many different sources. In the past, data was collected only from databases and spreadsheets, but these days it arrives in a wide array of forms: PDFs, emails, audio, social media posts, photos, videos, and so on.
Example: web server logs, i.e., log files created and maintained by a server that record a list of its activities.
Veracity
Veracity refers to how reliable the data is, since data from many sources must be filtered and translated before it can be trusted. It also covers being able to handle and manage such data efficiently, which is essential for business development.
Value
Value is an essential characteristic of big data. What matters is not simply the data we store or process, but the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is providing the demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices.
• Big Data Analytics examines large and different types of data to uncover hidden
patterns, insights, and correlations.
• Big Data Analytics is helping large companies facilitate their growth and development.
• Big data analytics is important because it helps companies leverage their data to identify
opportunities for improvement and optimization.
• Big data analytics helps companies reduce costs and develop better, customer-centric
products and services.
When you merge big data with high-powered data analytics, it is possible to achieve business-related tasks like:
• Reduction in cost.
• Time reductions.
• New product development with optimized offers.
• Well-groomed decision making.
Advantages of Big Data Analytics
1. Improved decision-making
Businesses use big data to enhance B2B operations, advertising, and communication. Many industries, such as travel, real estate, finance, and insurance, primarily use big data to enhance decision-making. Because big data reveals more information in a usable format, businesses can accurately predict what customers want and don't want, as well as their behavioural tendencies.
Big data provides business intelligence and cutting-edge analytical insights that help with
decision-making. A company can get a more in-depth picture of its target market by collecting
more customer data.
Business trends and behaviours are revealed by data-driven insights, which also help businesses
compete and grow by enhancing their decision-making. Additionally, these insights help
companies develop more specialised goods and services, strategies, and intelligent marketing
campaigns to compete in their sector.
2. Cost reduction
According to surveys by NewVantage and Syncsort (now Precisely), big data analytics has helped businesses significantly cut their costs. 66.7% of NewVantage survey participants reported using big data to cut costs, and 59.4% of Syncsort survey participants stated that using big data tools improved operational efficiency and reduced costs. Popular big data analytics tools such as Hadoop and cloud-based analytics can also help lower the cost of storing big data.
3. Detection of Fraud
Financial companies especially use big data to identify fraud. To find anomalies and transaction
patterns, data analysts use artificial intelligence and machine learning algorithms. These
irregularities in transaction patterns show that something is out of place or that there is a
mismatch, providing us with hints about potential fraud.
For credit unions, banks, and credit card companies, fraud detection is crucial for protecting account information, materials, and product access. By spotting fraud before it causes problems, any industry, including finance, can provide better customer service.
For instance, using big data analytics, banks and credit card companies can identify fraudulent
purchases or credit cards that have been stolen even before the cardholder becomes aware of the
issue.
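To make this concrete, below is a minimal, illustrative Python sketch of the kind of anomaly detection described above, assuming pandas and scikit-learn are available; the column names, sample values, and contamination rate are hypothetical and not taken from any real bank's system.

# Minimal anomaly-detection sketch for transaction data (illustrative only).
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical transaction log: amount and hour of day for each card swipe.
transactions = pd.DataFrame({
    "amount":      [12.5, 9.9, 11.2, 14.0, 10.5, 2500.0, 13.1],
    "hour_of_day": [10,   12,  11,   13,   12,   3,      14],
})

# Train an Isolation Forest; 'contamination' is the assumed share of fraud.
model = IsolationForest(contamination=0.15, random_state=42)
model.fit(transactions)

# predict() returns -1 for transactions flagged as anomalous (potential fraud).
transactions["flag"] = model.predict(transactions)
print(transactions[transactions["flag"] == -1])

In practice, banks train such models on millions of historical transactions and use many more features (merchant, location, device), but the flagging idea is the same.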
4. A rise in productivity
A survey by Syncsort found that 59.9% of respondents said they were using big data analytics
tools like Spark and Hadoop to boost productivity. They have been able to increase sales and
improve customer retention as a result of this rise in productivity. Modern big data tools make it
possible for data scientists and analysts to analyse a lot of data quickly and effectively, giving
them an overview of more data.
They become more productive as a result of this. Additionally, big data analytics aids data
scientists and analysts in learning more about themselves to figure out how to be more effective
in their tasks and job responsibilities. As a result, investing in big data analytics gives businesses
across all sectors a chance to stand out through improved productivity.
5. Improved customer service
As part of their marketing strategies, businesses must improve customer interactions. Since big data analytics gives businesses access to more information, they can use it to make more specialised, highly personalised offers to each individual customer, as well as more targeted marketing campaigns.
Social media, email exchanges, customer CRM (customer relationship management) systems,
and other major data sources are the main sources of big data. As a result, it provides businesses
with access to a wealth of data about the needs, interests, and trends of their target market.
Big data also enables businesses to better understand the thoughts and feelings of their clients and to provide them with more individualised goods and services. Providing a personalised
experience can increase client satisfaction, strengthen bonds with clients, and, most importantly,
foster loyalty.
6. Increased business agility
Increased business agility is another competitive benefit of big data. Big data analytics can help businesses become more innovative and adaptable in the marketplace. Analysing large customer data sets helps businesses gain insights ahead of the competition and address customer pain points more effectively.
Additionally, having a wealth of data at their disposal enables businesses to assess risks, enhance products and services, and improve communications. Big data also helps businesses strengthen their tactics and strategies, which is crucial for coordinating their operations to support frequent and rapid changes in the industry.
7. Greater innovation
Innovation is another common benefit of big data, and the NewVantage survey found that 11.6
per cent of executives are investing in analytics primarily as a means to innovate and disrupt
their markets. They reason that if they can glean insights that their competitors don't have, they
may be able to get out ahead of the rest of the market with new products and services.
Disadvantages of Big Data Analytics
1. A talent gap
A study by AtScale found that for the past three years, the biggest challenge in this industry has
been a lack of big data specialists and data scientists. Given that it requires a different skill set,
big data analytics is currently beyond the scope of many IT professionals. Finding data scientists
who are also knowledgeable about big data can be difficult.
Data scientists and big data specialists are two well-paid professions in the data science industry.
As a result, hiring big data analysts can be very costly for businesses, particularly for start-ups.
Some businesses must wait a long time to hire the necessary personnel to carry out their big data
analytics tasks.
2. Security hazard
For big data analytics, businesses frequently collect sensitive data. This data needs to be protected, and the security risks can be detrimental if it is not properly managed.
Additionally, having access to enormous data sets can attract the unwanted attention of hackers,
and your company could become the target of a potential cyber-attack. You are aware that for
many businesses today, data breaches are the biggest threat. Unless you take all necessary
precautions, important information could be leaked to rivals, which is another risk associated
with big data.
3. Adherence
Another disadvantage of big data is the requirement for legal compliance with governmental
regulations. To store, handle, maintain, and process big data that contains sensitive or private
information, a company must make sure that they adhere to all applicable laws and industry
standards. As a result, managing data governance tasks, transmission, and storage will become
more challenging as big data volumes grow.
4. High Cost
Given that it is a science that is constantly evolving and has as its goal the processing of ever-
increasing amounts of data, only large companies can sustain the investment in the development
of their Big Data techniques.
5. Data quality
Dealing with data quality issues is a major drawback of working with big data. Before big data can be used for analytics efforts, data scientists and analysts must ensure the data they are using is accurate, relevant, and in the right format for analysis.
This significantly slows down the reporting process, but if businesses don't address data quality
problems, they may discover that the insights their analytics produce are useless or even harmful
if used.
6. Rapid Change
The fact that technology is evolving quickly is another potential disadvantage of big data
analytics. Businesses must deal with the possibility of spending money on one technology only
to see something better emerge a few months later. This big data drawback was ranked fourth
among all the potential difficulties by Syncsort respondents.
Challenges of Big Data
Rapid Data Growth: Data growing at such a high rate makes it hard to extract insights from it. There is no 100% efficient way to filter out the relevant data.
Storage: Generating such a massive amount of data requires space for storage, and organizations struggle to handle such extensive data without suitable tools and technologies.
Unreliable Data: It cannot be guaranteed that the big data collected and analyzed is completely (100%) accurate. Redundant, contradictory, or incomplete data remain ongoing challenges.
Data Security: Firms and organizations storing such massive data (of users) can be a
target of cybercriminals, and there is a risk of data getting stolen. Hence, encrypting such
colossal data is also a challenge for firms and organizations.
Applications of Big Data
The term Big Data refers to large amounts of complex and unprocessed data. Nowadays, companies use Big Data to make their business more informative and to support business decisions by enabling data scientists, analytical modelers, and other professionals to analyse large volumes of transactional data. Big data is the valuable and powerful fuel that drives the large IT industries of the 21st century, and it is spreading into every business sector. In this section, we will discuss the applications of Big Data.
Travel and tourism
Travel and tourism are major users of Big Data. It enables providers to forecast the travel facilities required at multiple locations, improve business through dynamic pricing, and much more.
Financial and banking sector
The financial and banking sectors use big data technology extensively. Big data analytics helps banks understand customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.
Healthcare
Big data has started making a massive difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel can now provide personalized healthcare to individual patients.
Government
The government and military also use this technology at high rates. Consider the volume of figures the government keeps on record; in the military, a single fighter plane needs to process petabytes of data.
Government agencies use Big Data to run their many agencies, manage utilities, deal with traffic jams, and limit the effects of crimes such as hacking and online fraud.
Aadhar Card: The government holds records of 1.21 billion citizens. This vast data is analyzed and stored to find things such as the number of youth in the country, and schemes are built to target the maximum population. Such big data cannot be stored in a traditional database, so it is stored and analyzed using Big Data Analytics tools.
E-commerce
E-commerce is also an application of Big Data. Maintaining relationships with customers is essential for the e-commerce industry, and e-commerce websites use Big Data to generate marketing ideas for retailing merchandise to customers, manage transactions, and implement better and more innovative strategies to improve their businesses.
o Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic daily. When there is a pre-announced sale on Amazon, traffic increases so rapidly that it could crash the website. To handle this kind of traffic and data, Amazon uses Big Data, which helps in organizing and analyzing the data for future use.
Social Media
Social media is the largest data generator. Statistics show that around 500+ terabytes of fresh data are generated on social media every day, particularly on Facebook. The data mainly consists of videos, photos, message exchanges, etc. A single activity on a social media site generates a great deal of data that is stored and processed when required. Because the stored data runs into terabytes (TB), processing it takes a lot of time, and Big Data is the solution to this problem.
Before we start with the list of big data technologies, let us first discuss this technology's broad classification. Big Data technology is primarily classified into the following two types: Operational Big Data Technologies and Analytical Big Data Technologies.
Operational Big Data Technologies
This type of big data technology mainly covers the basic day-to-day data that people process. Typically, operational big data includes daily data such as online transactions, social media activity, and data from a particular organization or firm, which is usually needed for analysis by software based on big data technologies. This data can also be viewed as the raw data used as input for several Analytical Big Data Technologies.
Some specific examples that include the Operational Big Data Technologies can be listed as
below:
o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart, Walmart,
etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Analytical Big Data Technologies
Analytical Big Data is commonly referred to as an improved version of Big Data Technologies. This type of big data technology is a bit more complicated than operational big data. Analytical big data is mainly used when performance criteria matter and important real-time business decisions are made based on reports created by analyzing operational big data. This means that the actual investigation of big data that is important for business decisions falls under this type of big data technology.
Some common examples that involve the Analytical Big Data Technologies can be listed as
below:
We can categorize the leading big data technologies into the following four sections:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:
o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based entirely on the map-reduce architecture and is mainly used to process information in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel on commodity hardware with a simple programming execution model.
Apart from this, Hadoop is also well suited for storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released Hadoop 1.0 in December 2011. Hadoop is written in the Java programming language.
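To illustrate the map-reduce model mentioned above, here is a minimal word-count sketch written as a plain Python simulation of Hadoop's map, shuffle, and reduce phases; it does not use Hadoop's actual Java API, and the input lines are made up.

# Local, pure-Python simulation of the MapReduce word-count pattern.
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum all partial counts for one word.
    return word, sum(counts)

lines = ["big data needs big tools", "hadoop processes big data in batches"]

# Shuffle phase: group mapped values by key, as Hadoop does between phases.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

for word in sorted(grouped):
    print(reducer(word, grouped[word]))

In a real Hadoop job, the mapper and reducer run in parallel on many commodity machines, and the framework handles the grouping, distribution, and fault tolerance.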
o MongoDB: MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database; unlike traditional RDBMS databases, it does not use a structured query language. Instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from that of traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple cross-platform, document-oriented design. A MongoDB database stores documents similar to JSON with flexible schemas. This ultimately supports the operational data storage needs seen in most financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures.
MongoDB Inc. introduced MongoDB in February 2009. It is written with a combination of C++, Python, JavaScript, and Go.
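As a hedged example of the document model described above, the sketch below uses the PyMongo driver; it assumes a MongoDB server running on localhost:27017, and the database, collection, and document fields are purely illustrative.

# Minimal sketch of storing JSON-like documents in MongoDB via PyMongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]        # hypothetical database name
orders = db["orders"]         # hypothetical collection name

# Documents need no fixed schema: fields can differ between records.
orders.insert_one({"customer": "A101", "amount": 250.0, "items": ["pen", "book"]})
orders.insert_one({"customer": "A102", "amount": 80.0, "channel": "mobile"})

# Query documents with a JSON-style filter instead of SQL.
for doc in orders.find({"amount": {"$gt": 100}}):
    print(doc)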
o RainStor: RainStor is a database management system designed to manage and analyze organizations' Big Data requirements. It uses deduplication strategies that help manage the storage and handling of vast amounts of data kept for reference.
RainStor was designed in 2004 by the RainStor software company. It can be queried much like SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data needs.
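To make the idea of deduplication concrete, here is a toy Python sketch of content-based deduplication; it only illustrates the general technique, not RainStor's actual implementation, and the sample records are made up.

# Toy in-memory sketch of content-based deduplication.
import hashlib

store = {}        # maps content hash -> record payload (stored once)
references = []   # every incoming record keeps only a small hash reference

records = [
    "2024-01-01,alice,login",
    "2024-01-01,bob,login",
    "2024-01-01,alice,login",   # exact duplicate: must not be stored twice
]

for record in records:
    digest = hashlib.sha256(record.encode()).hexdigest()
    if digest not in store:
        store[digest] = record   # store the payload only the first time
    references.append(digest)    # duplicates cost only the hash reference

print(f"incoming records: {len(references)}, unique payloads stored: {len(store)}")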
o Hunk: Hunk is mainly helpful when data in remote Hadoop clusters needs to be accessed using virtual indexes. It lets us use the Splunk Search Processing Language (SPL) to analyze the data. Also, Hunk allows us to report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming language.
o Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open-source, distributed, and offers wide-column storage. It is freely available and provides high availability without a single point of failure, which ultimately helps it handle data efficiently on large clusters of commodity hardware. Cassandra's essential features include a fault-tolerant mechanism, scalability, MapReduce support, a distributed architecture, tunable (eventual) consistency, its own query language, and multi-datacenter replication.
Cassandra was developed at Facebook in 2008 for its inbox search feature and later became an Apache Software Foundation project. It is based on the Java programming language.
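The sketch below shows, under stated assumptions, how an application might write and read a row in Cassandra using the DataStax Python driver (cassandra-driver); it assumes a single Cassandra node on localhost, and the keyspace, table, and sample values are illustrative.

# Minimal Cassandra read/write sketch using the DataStax Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Create an illustrative keyspace and table (idempotent on reruns).
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
session.set_keyspace("demo")
session.execute(
    "CREATE TABLE IF NOT EXISTS messages ("
    "user_id text, msg_id int, body text, PRIMARY KEY (user_id, msg_id))"
)

# Rows are distributed across the cluster by the partition key (user_id).
session.execute(
    "INSERT INTO messages (user_id, msg_id, body) VALUES (%s, %s, %s)",
    ("alice", 1, "hello"),
)
for row in session.execute("SELECT * FROM messages WHERE user_id = %s", ("alice",)):
    print(row.user_id, row.msg_id, row.body)

cluster.shutdown()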
Four main types of big data analytics support and inform different business decisions; a small contrast between the descriptive and predictive styles is sketched after this list.
• Descriptive Analytics (what happened)
• Diagnostic Analytics (why it happened)
• Predictive Analytics (what is likely to happen)
• Prescriptive Analytics (what should be done about it)
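As a small illustration of the difference between the descriptive and predictive styles, the sketch below uses made-up monthly sales figures with pandas and scikit-learn; the numbers and column names are hypothetical.

# Tiny contrast between descriptive and predictive analytics on made-up data.
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({"month": [1, 2, 3, 4, 5, 6],
                      "revenue": [100, 120, 135, 150, 170, 185]})

# Descriptive analytics: summarise what has already happened.
print("average monthly revenue:", sales["revenue"].mean())

# Predictive analytics: fit a simple trend and estimate next month's revenue.
model = LinearRegression().fit(sales[["month"]], sales["revenue"])
next_month = pd.DataFrame({"month": [7]})
print("forecast for month 7:", model.predict(next_month)[0])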
Harnessing all of that data requires tools. Thankfully, technology has advanced so that many
intuitive software systems are available for data analysts to use.
Hadoop: An open-source framework that stores and processes big data sets. Hadoop can handle
and analyse structured and unstructured data.
Spark: An open-source cluster computing framework for real-time processing and data analysis (a minimal PySpark sketch appears after this list).
Data integration software: Programs that allow big data to be streamlined across different platforms, such as MongoDB, Apache Hadoop, and Amazon EMR.
Stream analytics tools: Systems that filter, aggregate, and analyse data that might be stored in
different platforms and formats, such as Kafka.
Distributed storage: Databases that can split data across multiple servers and can identify lost or
corrupt data, such as Cassandra.
Predictive analytics hardware and software: Systems that process large amounts of complex
data, using machine learning and algorithms to predict future outcomes, such as fraud detection,
marketing, and risk assessments.
Data mining tools: Programs that allow users to search within structured and unstructured big
data.
NoSQL databases: Non-relational data management systems ideal for dealing with raw and
unstructured data.
Data warehouses: Storage for large amounts of data collected from many different sources,
typically using predefined schemas.
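As referenced in the Spark entry above, here is a minimal PySpark sketch of the kind of distributed aggregation these tools perform; it assumes pyspark is installed, and the sales data is made up.

# Minimal PySpark aggregation sketch (illustrative data only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# A tiny, hypothetical sales dataset; real jobs read from HDFS, S3, Kafka, etc.
sales = spark.createDataFrame(
    [("north", 120.0), ("south", 75.5), ("north", 60.0), ("east", 200.0)],
    ["region", "amount"],
)

# Group and aggregate in parallel across the cluster, then print the summary.
summary = sales.groupBy("region").agg(F.sum("amount").alias("total_sales"))
summary.show()

spark.stop()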