Data Science


UNIT-1

Data Science Introduction


Data Science is a combination of multiple disciplines that uses statistics,
data analysis, and machine learning to analyze data and to extract
knowledge and insights from it.

Data Science is used in many industries in the world today, e.g. banking,
consultancy, healthcare, and manufacturing.

Examples of where Data Science is needed:

o For route planning: To discover the best routes to ship
o To foresee delays for flight/ship/train etc. (through predictive analysis)
o To create promotional offers
o To find the best suited time to deliver goods
o To forecast the next year's revenue for a company
o To analyze the health benefits of training
o To predict who will win elections

Data Science can be applied in nearly every part of a business where data is
available. Examples are:

o Consumer goods
o Stock markets
o Industry
o Politics
o Logistics companies
o E-commerce

How Does a Data Scientist Work?


A Data Scientist requires expertise in several areas:

o Machine Learning
o Statistics
o Programming (Python or R)
o Mathematics
o Databases

A Data Scientist must find patterns within the data. Before they can find
the patterns, they must organize the data in a standard format.

Here is how a Data Scientist works:

1. Ask the right questions - To understand the business problem.
2. Explore and collect data - From databases, web logs, customer
feedback, etc.
3. Extract the data - Transform the data to a standardized format.
4. Clean the data - Remove erroneous values from the data.
5. Find and replace missing values - Check for missing values and
replace them with a suitable value (e.g. an average value).
6. Normalize data - Scale the values to a practical range (e.g. 140 cm is
smaller than 1.8 m. However, the number 140 is larger than 1.8, so
scaling is important).
7. Analyze data, find patterns and make future predictions.
8. Represent the result - Present the result with useful insights in a way
the "company" can understand.
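
Steps 3 to 6 above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the height values, the plausible
range of 50 to 250 cm, and all function names are made up for the example,
and min-max scaling is just one of several ways to normalize data.

```python
from statistics import mean

# Hypothetical raw heights: mixed units, one missing value, one erroneous value.
raw_heights = ["140 cm", "1.8 m", None, "175 cm", "-5 cm"]

def to_cm(value):
    """Step 3: convert a '<number> <unit>' string to centimetres."""
    if value is None:
        return None
    number, unit = value.split()
    return float(number) * 100 if unit == "m" else float(number)

def clean(values, low=50.0, high=250.0):
    """Step 4: treat values outside an assumed plausible range as missing."""
    return [v if v is not None and low <= v <= high else None for v in values]

def fill_missing(values):
    """Step 5: replace missing values with the average of the known ones."""
    average = mean(v for v in values if v is not None)
    return [average if v is None else v for v in values]

def min_max_normalize(values):
    """Step 6: scale values into the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # all values equal, nothing to scale
    return [(v - lowest) / (highest - lowest) for v in values]

heights_cm = fill_missing(clean([to_cm(v) for v in raw_heights]))
normalized = min_max_normalize(heights_cm)
print(heights_cm)   # all values in one unit, gaps filled with the average
print(normalized)   # the same values scaled between 0 and 1
```

After the conversion, 140 cm and 1.8 m are compared in the same unit, so
the smaller height also has the smaller number.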

Big Data Characteristics


Big Data refers to amounts of data so large that they cannot be processed
by traditional data storage or processing systems. Many multinational
companies use Big Data technologies to process data and run their
business. By some estimates, the data flow would exceed 150 exabytes per
day before replication.

Five V's describe the characteristics of Big Data.

5 V's of Big Data


o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself is related to an enormous size. Big Data is the
vast 'volume' of data generated from many sources daily, such as business
processes, machines, social media platforms, networks, human
interactions, and many more.

For example, Facebook reportedly handles approximately a billion
messages, 4.5 billion clicks of the "Like" button, and more than 350
million new posts each day. Big Data technologies can handle such large
amounts of data.

Variety
Big Data can be structured, unstructured, or semi-structured, and it is
collected from different sources. In the past, data was collected only
from databases and spreadsheets, but these days it arrives in an array of
forms, such as PDFs, emails, audio, social media posts, photos, videos,
etc.

The data is categorized as below:

a. Structured data: Structured data follows a defined schema with all the
required columns. It is in tabular form and is stored in a relational
database management system. OLTP (Online Transaction Processing) systems
are built to work with structured data, which is stored in relations,
i.e., tables.

b. Semi-structured data: In semi-structured data, the schema is not
strictly defined, e.g., JSON, XML, CSV, TSV, and email.

c. Unstructured Data: All the unstructured files, log files, audio files,
and image files are included in the unstructured data. Some organizations
have much data available, but they did not know how to derive the value
of data since the data is raw.

d. Quasi-structured data: Textual data with inconsistent formats that can
be brought into a usable format with some effort, time, and tools.

Example: Web server logs, i.e., a log file created and maintained by a
server that contains a list of its activities.
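
As an illustration of turning quasi-structured data into structured
records, the sketch below parses one web server log line with a regular
expression. It assumes the line follows the Common Log Format; the sample
line, the pattern, and the function name are made up for the example, and
real logs often need a more tolerant pattern.

```python
import re

# Assumed layout: host, identity, user, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326')
record = parse_log_line(sample)
print(record)
```

Each parsed line becomes a row with the same named columns, which is the
form a relational table or a spreadsheet expects.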
Veracity
Veracity means how reliable the data is. There are many ways to filter or
translate the data. Veracity also concerns being able to handle and
manage data efficiently. Big Data is also essential in business
development.

For example, Facebook posts with hashtags.

Value
Value is an essential characteristic of Big Data. What matters is not
merely that we process or store data, but that the data we store,
process, and analyze is valuable and reliable.

Velocity
Velocity plays an important role compared to the other characteristics.
It refers to the speed at which data is created in real time. It covers
the speed of incoming data sets, their rate of change, and bursts of
activity. A primary aspect of Big Data is to provide data rapidly on
demand.

Big Data velocity deals with the speed at which data flows from sources
like application logs, business processes, networks, social media sites,
sensors, mobile devices, etc.
Web Scraping
Web scraping is an automatic method to obtain large amounts of data
from websites. Most of this data is unstructured data in an HTML
format, which is then converted into structured data in a
spreadsheet or a database so that it can be used in various
applications. There are many different ways to perform web
scraping to obtain data from websites. These include using online
services, particular APIs, or even writing your own code for web
scraping from scratch. Many large websites, like Google, Twitter,
Facebook, StackOverflow, etc. have APIs that allow you to access their
data in a structured format.

How Do Web Scrapers Work?

Web Scrapers can extract all the data on particular sites or only the
specific data that a user wants. Ideally, you specify the data you
want so that the web scraper extracts only that data, and does so
quickly. For example, you might want to scrape an Amazon page
for the types of juicers available, but you might only want the data
about the models of different juicers and not the customer
reviews.
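
A minimal sketch of this idea, using only Python's standard library, is
shown below. It assumes a made-up HTML snippet in which model names carry
the class "model" and reviews carry the class "review"; the class names,
the product names, and the ModelNameScraper class are hypothetical, and a
real page would need its own selectors and should be scraped only where
the site's terms allow it.

```python
from html.parser import HTMLParser

class ModelNameScraper(HTMLParser):
    """Collects the text of every element whose class is exactly 'model'."""

    def __init__(self):
        super().__init__()
        self._inside_model = False
        self.models = []

    def handle_starttag(self, tag, attrs):
        if ("class", "model") in attrs:
            self._inside_model = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current model element,
        # so nested tags inside a model name are not supported.
        self._inside_model = False

    def handle_data(self, data):
        if self._inside_model and data.strip():
            self.models.append(data.strip())

# Hypothetical page content; a real scraper would download this first.
page = """
<ul>
  <li><span class="model">Juicer A100</span><p class="review">Great!</p></li>
  <li><span class="model">Juicer B200</span><p class="review">Too loud.</p></li>
</ul>
"""

scraper = ModelNameScraper()
scraper.feed(page)
print(scraper.models)
```

The reviews are present in the page but are never collected, because the
scraper was told to keep only the model names.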
Types of Web Scrapers
Web Scrapers can be divided on the basis of many different
criteria, including Self-built or Pre-built Web Scrapers, Browser
extension or Software Web Scrapers, and Cloud or Local Web
Scrapers.
You can have Self-built Web Scrapers but that requires
advanced knowledge of programming. And if you want more
features in your Web Scraper, then you need even more
knowledge. On the other hand, pre-built Web Scrapers are
previously created scrapers that you can download and run easily.
These also have more advanced options that you can customize.
Browser extension Web Scrapers are extensions that can be
added to your browser. These are easy to run as they are
integrated with your browser, but at the same time, they are also
limited because of this. Any advanced features that are outside the
scope of your browser are impossible to run on Browser extension
Web Scrapers. But Software Web Scrapers don’t have these
limitations as they can be downloaded and installed on your
computer. These are more complex than Browser web scrapers, but
they also have advanced features that are not limited by the scope
of your browser.
Cloud Web Scrapers run on the cloud, which is an off-site server
mostly provided by the company that you buy the scraper from.
These allow your computer to focus on other tasks as the computer
resources are not required to scrape data from websites. Local
Web Scrapers, on the other hand, run on your computer using
local resources. So, if the Web scrapers require more CPU or RAM,
then your computer will become slow and not be able to perform
other tasks.
What is Web Scraping Used for?
Web Scraping has multiple applications across various industries.
Let’s check out some of these now!
1. Price Monitoring
2. Market Research
3. News Monitoring
4. Sentiment Analysis
5. Email Marketing

Reporting vs Analytics
Key Differences Between Reporting and Analytics
Reporting is the process of gathering and presenting data in a structured
format such as graphs and tables. Organizing information in predefined
KPIs and metrics makes it easier for you to understand what is
happening. Analytics is the process of analyzing your data to identify
patterns and gain insights. Using techniques such as predictive and
prescriptive analytics helps you understand why things are happening and
what to do next.

What is Reporting?
Reporting primarily involves the presentation of data in a structured format. Its
purpose is to provide a snapshot of specific metrics or KPIs over a defined
period. Reports are instrumental in summarizing information for stakeholders
and are often automated and scheduled on a regular basis. Ad hoc reports,
created on-demand, can address specific inquiries or issues promptly. Data
visualizations help identify trends, patterns, and anomalies more intuitively.
Dashboards play a crucial role in presenting real-time data to stakeholders for
quick decision-making.
Benefits

Reporting enables informed decision-making, tracks performance trends, and
fosters transparency and accountability. By aligning activities with strategic goals
and optimizing resource allocation, reporting enhances operational efficiency. It
ensures compliance with regulatory requirements and aids in identifying and
mitigating risks. Effective reporting also facilitates communication with
stakeholders, supports competitive advantage, and guides long-term strategic
planning.

Reporting Process

Data is sourced from operational systems such as transactional, supply chain,
and CRM applications. This raw data is extracted, transformed, and combined
into a repository such as a data warehouse or data lake. Bringing together data
from all your systems gives you a holistic view of your business.

Your reporting tool uses this data to let you create visualizations,
dashboards, and KPI reports via automation. These make it easier for you to
know what has happened or what is happening in your business.
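
The reporting process described above can be sketched as a tiny
aggregation. This assumes two hypothetical source systems that have
already been extracted into lists of records; the system names, regions,
and revenue figures are invented for illustration, and a real pipeline
would read from a data warehouse rather than from in-memory lists.

```python
from collections import defaultdict

# Hypothetical records extracted from two operational systems.
store_sales = [
    {"region": "North", "revenue": 1200.0},
    {"region": "South", "revenue": 800.0},
]
online_sales = [
    {"region": "North", "revenue": 300.0},
    {"region": "South", "revenue": 450.0},
]

def build_kpi_report(*sources):
    """Combine all sources and total the revenue KPI per region."""
    totals = defaultdict(float)
    for source in sources:
        for row in source:
            totals[row["region"]] += row["revenue"]
    return dict(totals)

report = build_kpi_report(store_sales, online_sales)
for region, revenue in sorted(report.items()):
    print(f"{region:<6} {revenue:>10.2f}")
```

The printed table answers "what happened"; explaining why one region
outperforms another would be the job of analytics.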

Types of Reporting
Reporting takes various forms in organizations, serving specific functions. Operational
reports offer day-to-day insights into activities like sales and inventory management.
Financial reports detail a company's financial health, including balance sheets and income
statements. Management reports provide summarized data for internal decision-making,
while strategic reports guide long-term planning. Compliance reports ensure adherence to
legal requirements, while ad hoc reports address specific queries.

Dashboard Report Example

This marketing dashboard report showcases the number of responders to a campaign by
region, segment, and product category. This dashboard presents data in an easy-to-digest
manner.

Data Collection & Its Methods


What is Data Collection?
Data Collection is the process of collecting information from
relevant sources in order to find a solution to the given statistical
enquiry. Collection of Data is the first and foremost step in a
statistical investigation.
Here, statistical enquiry means an investigation made by any
agency on a topic in which the investigator collects the relevant
quantitative information. In simple terms, a statistical enquiry is the
search for truth using statistical methods of collection, compilation,
analysis, interpretation, etc.
Important Terms related to Data Collection:
1. Investigator: An investigator is a person who conducts the
statistical enquiry.
2. Enumerators: In order to collect information for statistical
enquiry, an investigator needs the help of some people. These
people are known as enumerators.
3. Respondents: A respondent is a person from whom the
statistical information required for the enquiry is collected.
4. Survey: It is a method of collecting information from individuals.
The basic purpose of a survey is to collect data to describe different
characteristics such as usefulness, quality, price, kindness, etc. It
involves asking questions about a product or service from a large
number of people.
Example:
The table below shows the production of rice in India.

The above table contains the production of rice in India in different
years. It can be seen that these values vary from one year to
another. Therefore, they are known as variables. A variable is a
quantity or attribute, the value of which varies from one
investigation to another. In general, variables are represented
by letters such as X, Y, or Z. In the above example, years are
represented by variable X, and the production of rice is represented
by variable Y. The values of variable X and variable Y are the data from
which an investigator and enumerator collect information regarding
the trends of rice production in India.
Thus, Data is a tool that helps an investigator understand the
problem by providing them with the information required. Data can
be classified into two types; viz., Primary Data and Secondary
Data. Primary Data is the data collected by the investigator from
primary sources for the first time from scratch. However,
Secondary Data is the data already in existence that has been
previously collected by someone else for other purposes. It does
not include any real-time data as the research has already been
done on that information.
There are two different methods of collecting data: Primary Data
Collection and Secondary Data Collection.
Storage
Data storage, or data keeping, means storing information and making it as readily available
as possible using technology designed for that purpose. It is a simple method of keeping
data in digital form on computer devices, and having data on hand makes many digital
processes more effective.

Storage devices may use electromagnetic, optical, or other media to keep the data safe and
recover it when necessary. Data storage also simplifies file recovery and backup procedures
in the case of an unforeseen computer failure or cyberattack.

When setting up storage, every organization should consider these factors: dependability,
affordability of the storage structure, and safety features.


Why Do We Need Data Storage?

Innovative technologies like data analysis, the Internet of Things, and AI produce and utilize
enormous amounts of data. Therefore, data storage plays a major role in the growth of any
organization now more than ever. Some of the benefits are as follows:

1. Electronic data storage makes it simple to gather large amounts of records and keep them
for a long time.
2. Making duplicates of stored data makes it simple to back it up, so files can be recovered
more quickly and easily after loss or corruption.
3. With today’s cutting-edge security technologies and capabilities, plenty of techniques exist
to safely store and safeguard particularly sensitive data digitally.
4. Every authorized individual has access to centrally stored data, which can be viewed and
shared between teams whenever they collaborate.
5. Digital data can be more easily categorized and organized, and it can be accessed using a
desktop computer or similar connected device.
6. Digital data storage is faster than printing hard copies of data and keeping them in file
cabinets.

Types of Data Storage

There are three major types: primary, secondary, and tertiary.

Primary Storage

Primary storage is the computer system’s main memory. It is mostly temporary (volatile)
memory and includes RAM, ROM (Read-only Memory), and cache memory. Primary memory is
typically smaller than secondary memory and offers comparatively less storage. It is the
only form of storage directly accessible to the CPU. The CPU continuously reads the
instructions stored in primary storage and executes them as needed. All data that is actively
worked on is kept there in an organized manner.


Secondary Storage

A secondary storage system can keep data longer and has additional storage space. Examples
include internal or external components such as hard drives, USB drives, CDs, and other
media. A computer typically accesses secondary storage through its input/output channels
and transfers the needed data using an intermediate area in primary storage.

Tertiary Storage

It is an extensive electronic storage system that is typically quite slow; hence, it stores
data that is retrieved only occasionally. This method often incorporates a robotic mechanism
that mounts and dismounts removable media in storage drives according to the system
requirements. It helps access extremely large databases without the assistance of any
human operators.

Forms of Storage
Data storage comes in three primary forms:

1. File Storage: Data is organized into files and directories in file storage. It is suitable for
storing structured data and is accessible through network protocols like NFS and SMB. File
storage is commonly used for documents, media files, and user data.

2. Block Storage: Block storage breaks data into fixed-sized blocks and is often used in
scenarios where raw storage volumes are needed. It is highly efficient for database systems
and can be accessed via protocols like iSCSI. Block storage provides low-level storage access.

3. Object Storage: Object storage stores data as objects, each with its unique identifier
and metadata. It is ideal for unstructured data, scalable storage, and cloud-based
applications. Object storage is accessible via RESTful APIs and is well-suited for backup,
archival, and content distribution.

Choosing the right storage type depends on the specific requirements of your application and
data.

Traditional Storage Technologies

The conventional data storage systems are as follows:

Magnetic Storage

A magnetic storage device, such as an HDD, comprises circular platters made of non-magnetic
material and coated with a thin film of magnetic material, where data is stored. The
platters spin inside the drive, while a read-write head, consisting of a magnetic yoke and a
magnetizing coil, moves in close range of the disk surface.

Optical Storage

An optical drive is a device that uses optical storage techniques for data processing functions
such as read/write/access. Laser light is used to read and store data on an optical disk. An
optical disk is made of a resin such as polycarbonate, and the data is stored in tiny
pits on the polycarbonate layer.

Modern Data Storage Technologies

The modern data storage systems are as follows:


Flash Storage

Flash storage uses solid-state drives (SSDs) with flash memory for large-scale data or file
archiving. It can replace HDDs and other forms of storage. A multi-terabyte dataset can be
kept “in memory” using an all-flash array, which can offer read/write speeds roughly four
times faster than HDDs. Compared to HDDs, flash storage has a higher density.

Cloud Storage

By replicating the capability of physical storage devices, cloud storage enables you to save or
retrieve various content types whenever you need to from a virtual setting. Any data uploaded
to the cloud is kept off-site in reliable data centers, and an on-site operator or an off-site
third-party service often handles it. Users can access cloud storage using a computer with an
internet connection, web portal, intranet, cloud storage apps, or additional application
programming interfaces (APIs).

Object Storage

Object storage is a technique that manages data storage in distinct components or objects. A
framework on which data analytics software can run queries on objects is known as an object
store. By adopting a flat address space, object storage removes the need for the hierarchical
structure that other systems need. This enables easy scaling up or down to accommodate
variations in storage workload, including quick expansions and contractions.
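
The flat address space can be pictured with a small in-memory sketch.
This is not how any particular object storage product is implemented; the
ObjectStore class, its method names, and the sample metadata are
assumptions made for illustration only.

```python
import uuid

class ObjectStore:
    """In-memory stand-in for an object store with a flat address space."""

    def __init__(self):
        # One flat mapping: object id -> data plus metadata, no directories.
        self._objects = {}

    def put(self, data, **metadata):
        """Store data under a newly generated unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": metadata}
        return object_id

    def get(self, object_id):
        """Fetch the data of one object by its identifier."""
        return self._objects[object_id]["data"]

    def find(self, **criteria):
        """Return the ids of all objects whose metadata matches the criteria."""
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj["metadata"].get(k) == v for k, v in criteria.items())
        ]

store = ObjectStore()
photo_id = store.put(b"...image bytes...", content_type="image/png", owner="alice")
report_id = store.put(b"...pdf bytes...", content_type="application/pdf", owner="alice")
print(store.find(content_type="image/png") == [photo_id])
```

Objects are located by identifier or by metadata rather than by a path,
which is why adding more objects does not require reorganizing a folder
hierarchy.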

Software-Defined Storage

SDS is a storage system that separates the hardware and software used for storage. Unlike
conventional NAS or SAN systems, SDS runs on any x86 or industry-standard system,
eliminating the software’s reliance on specific hardware. Software-defined storage is a
method of managing data that makes data storage resources more flexible by abstracting them
from the supporting physical storage hardware.

How Does Data Storage Work?

Whenever you save digital data to a personal computer, it is written to a storage device,
where it stays until it is deleted or the device is damaged. Storage is fundamentally
different from computer memory: although information can be retrieved swiftly from your
computer’s RAM, such data remains in RAM only while your computer is on.

Modern computers or devices may connect to storage devices directly or via a network
connection. Users give computers instructions for storing data on and retrieving data from
various storage devices. On a basic level, data storage depends on two principles: the form
the data takes and the hardware that it is captured and stored on.

Data Storage Architectures and Concepts

The distinct data storage architectures and concepts are as follows:

RAID (Redundant Array of Independent Disks)

RAID is a method that uses several drives in tandem rather than just one to boost
performance, provide data redundancy, or both. It protects data during a drive crash by
maintaining the same data in different places on multiple hard drives or solid-state
storage devices. A RAID array has two or more disks operating in parallel, and the RAID
level describes how the disks are arranged.
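
One way some RAID levels (for example RAID 5) provide redundancy is XOR
parity, which the text above does not cover in detail. The sketch below
illustrates only that idea on in-memory byte strings; the block contents
and the function name are made up, and a real array also handles striping
across disks, parity rotation, and failure detection.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, each imagined to live on its own drive.
data_blocks = [b"DATA", b"SCI_", b"ENCE"]

# The parity block would be stored on a further drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive and rebuilding it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any single missing block can be rebuilt
from the remaining blocks and the parity block.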

NAS (Network-Attached Storage)

A network-attached storage (NAS) server is a dedicated storage platform that links to devices
over a LAN. The server’s connectivity features allow for the retrieval and storage of data
from many external devices, and NAS storage also offers extensive sharing capabilities. The
system utilizes the features of a file-storage technology and the clustering of a redundant
array of drives (RAID).



SAN (Storage Area Network)

A storage area network (SAN) is network-based storage that provides access to data at the
block level. This kind of storage consists of several data storage units connected by a
network. It combines aspects of NAS and DAS (direct-attached storage). A SAN transfers data
between servers and storage using specific networking protocols, like Fibre Channel.

Object Storage vs Block Storage

The differences between object storage and block storage are as follows:

Object Storage | Block Storage
---------------|--------------
Data is held in a flat structure as different, distinctive, and recognizable components called objects. | Fixed-sized blocks divide the data into sections and rearrange it when necessary.
Cost-effective | Expensive
Unlimited scalability | Limited scalability
A single central or decentralized system that maintains data in a private, public, or hybrid cloud. | A centralized system for on-site or private cloud data storage. If the program and its data storage are located far away from one another, latencies could pose a concern.
Suitable for large amounts of raw data. Large files yield the greatest results. | Best for storing databases and data related to transactions. It performs best with compact files.

Best Practices for Data Storage Management

Some of the best practices for data storage management are as follows:

Data Backup and Recovery

After you transfer your data from regular, operational systems for immediate and future
storage, a reliable data backup and recovery plan will ensure it is constantly kept secure.
Backup copies enable data to be recovered from an earlier date, enabling the organization to
recover from unforeseen circumstances. Maintaining another copy of the data on another
storage device is important for protecting against original data loss or corruption.
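
A minimal sketch of keeping a verified second copy is shown below. It
works inside a temporary directory so it leaves nothing behind; the file
name, its contents, and the helper names are invented for illustration,
and a real backup plan would copy to a separate device and keep several
dated versions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_file(source, backup_dir):
    """Copy source into backup_dir and verify the copy by checksum."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copies contents and file metadata
    if sha256_of(source) != sha256_of(target):
        raise IOError("backup copy does not match the original")
    return target

with tempfile.TemporaryDirectory() as workspace:
    source = Path(workspace) / "customers.csv"   # hypothetical data file
    source.write_text("id,name\n1,Asha\n")
    copy = backup_file(source, Path(workspace) / "backup")
    backup_ok = copy.read_text() == source.read_text()

print(backup_ok)
```

Comparing checksums after the copy is a cheap way to catch a corrupted
backup before it is needed for recovery.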

Data Deduplication

There are instances where identical data is produced due to repeated operations. You can
improve data management and reduce storage costs by setting up a manual or automated
procedure that regularly evaluates data and eliminates duplicates. Your data will remain
clean and prepared for analysis and queries.
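
An automated procedure of this kind can be sketched with content hashes.
The example assumes that records are plain strings and that only exact
duplicates should be removed; the sample records and the function name
are hypothetical.

```python
import hashlib

def deduplicate(records):
    """Keep the first occurrence of each record, comparing content hashes."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

# Hypothetical order records; the first one was written twice.
records = ["order-1,apple,3", "order-2,pear,1", "order-1,apple,3"]
unique_records = deduplicate(records)
print(unique_records)
```

Storing only the fixed-size digests in the lookup set keeps memory use
modest even when individual records are large.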


Data Compression

Data compression makes files take up less room on a drive and take less time to
transfer or download. The decrease in space and time could lead to major savings in
expenses. It makes it possible to transport data objects and files quickly through networks and
the Internet while maximizing the use of physical storage space.
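
The effect is easy to observe with Python's built-in zlib module. The
sample data below is invented and deliberately repetitive, so it
compresses very well; real files vary widely in how much they shrink.

```python
import zlib

# Hypothetical sensor log with a lot of repetition.
original = b"temperature=21.5;" * 1000

compressed = zlib.compress(original, 9)   # 9 = strongest compression level
restored = zlib.decompress(compressed)

print(len(original), "bytes before compression")
print(len(compressed), "bytes after compression")
print(restored == original)   # lossless: the data comes back unchanged
```

Compression of this kind is lossless, so the stored or transferred copy
can always be restored to the exact original.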

Data Security and Encryption

Data security enables you to identify sensitive data and essential assets and to establish
robust security measures that monitor and protect every stage of data storage. Encryption
converts the data you store into unreadable ciphertext; only the owner’s key can decode it.
This ensures the data cannot be read, even when unauthorized people gain access to it.

Data Storage in Mobile Devices

The physical storage capacity of mobile devices is limited and typically ranges from just a
few gigabytes up to a few hundred gigabytes. File sizes have grown considerably as
technology has advanced. High-resolution images and movies, graphically demanding
software, and resource-intensive games take up an extensive amount of storage space.

Privacy and security are key factors to consider because the data kept on mobile devices
could contain personal and sensitive data. Malicious attacks, unauthorized access, and
information theft pose considerable challenges, highlighting the significance of having
robust security measures.

Internal Storage

The internal memory space of the device is the internal storage. Files an application keeps
here are private to that application, so other applications cannot access them regardless of
their permissions. Android OEMs and app developers utilize internal storage to store
private data, app data, user settings, and additional system files.

External Storage

Any storage not part of the device’s internal memory, including an attached SD card, is called
external storage. Any app with the appropriate permissions can access this region, which
serves as a free-for-all area. There are two kinds of external storage: built-in external
storage, which is the primary external storage, and SD cards, commonly called memory cards,
which represent the secondary external storage.

How Can Businesses Use Data Storage?

Businesses employ storage in diverse ways to manage data effectively:

1. Data Backup: Businesses safeguard vital data and backups on storage devices or in the cloud
to ensure data recovery during hardware failures or unforeseen disasters. A comprehensive
data backup strategy is essential to prevent data loss and maintain business continuity.

2. Inventory Management: Warehouses utilize storage systems to optimize inventory
management. These systems help maximize space utilization, minimize storage costs, and
streamline order fulfillment processes. Efficient inventory management is crucial for meeting
customer demands and reducing operational expenses.

3. Application Hosting: Companies leverage storage infrastructure to host applications and
databases, ensuring secure access and reliability for employees and customers. Robust
storage solutions underpin seamless application performance, data availability, and
scalability.

4. Cloud Storage: Cloud storage services offer scalable and cost-effective data storage options,
enabling businesses to securely store and access data from anywhere with an internet
connection. This flexibility enhances data accessibility, reduces infrastructure costs, and
supports remote work arrangements.

5. Collaboration and File Sharing: Businesses employ storage systems to facilitate team
collaboration. They securely store and share documents, presentations, and media files,
enabling employees to work together efficiently, whether in the office or remotely. Secure
file storage is essential for protecting sensitive business information.

What’s Next?

Data storage is an essential component of our digital life. Increasing storage capacities, cloud
storage options, and strong security protocols are just a few of the challenges involved in data
storage that are being addressed by innovative technologies and enhanced storage
administration approaches. As technology advances, it becomes increasingly important for
producers and customers to be cautious and flexible in data storage management.


Frequently Asked Questions

What are some examples of tertiary storage?
A. Magnetic tape, optical discs, and optical tapes are a few examples of tertiary storage.
These devices have removable storage media and fixed drives.
