Chapter 2 Introduction To Data Science
habtamu.abune@aau.edu.et
Learning outcomes
After completing this lesson, you should be able to:
❑ Describe what data science is and the role of data scientists.
What is data?
❑ Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
What is Information?
❑ Organized or classified data that has some meaningful value for the receiver.
❑ Processed data on which decisions and actions are based.
❑ Plain collected data, as raw facts, cannot help much in decision-making.
❑ Interpreted data, created from organized, structured, and processed data in a particular context.
What is knowledge?
➢ Includes facts about real-world entities and the relationships between them.
➢ It is an understanding gained through experience.
➢ Knowledge is the appropriate collection of information, the intent of which is usefulness.
➢ Answers the question "how".
What is Wisdom?
● Wisdom embodies an understanding of fundamental principles, insight, and ethical and moral codes, gained by integrating knowledge.
○ Answers the "why" question.
● It rests on the principles that are essentially the basis for knowledge being what it is.
Data→Information→Knowledge→Wisdom
Data Processing Cycle
❑ Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
❑ The data processing cycle consists of three steps: input, processing, and output.
Input
➢ The input data is prepared in some convenient form for processing.
➢ For example, when electronic computers are used, the input data can be recorded on any of several types of input media, such as flash disks, hard disks, and so on.
Processing
➢ In this step, the input data is changed to produce data in a more useful form.
➢ For example, interest can be calculated on deposits to a bank, or a summary of sales for the month can be calculated from the sales orders data.
Output
➢ At this stage, the result of the preceding processing step is collected.
➢ The particular form of the output data depends on the use of the data.
➢ For example, the output data may be the total sales for a month or the payroll for employees (sketched below).
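To make the cycle concrete, here is a minimal Python sketch of the three steps, reusing the monthly-sales example above; the order amounts are invented for illustration.

    # Input: collected sales order amounts for the month (made-up values)
    sales_orders = [120.50, 75.00, 310.25, 42.80]

    # Processing: transform the raw orders into a more useful form (a summary)
    total_sales = sum(sales_orders)

    # Output: the result, in the form the user needs
    print(f"Total sales for the month: {total_sales:.2f}")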
Data types and their representation
Data types from Computer programming perspective
➢ A data type is simply an attribute that tells the compiler or interpreter how the programmer intends to use the data.
➢ Common data types include (see the short sketch after this list):
➢ Integers (int) - to store whole numbers
➢ Booleans (bool) - true or false
➢ Characters (char) - to store a single character
➢ Floating-point numbers (float) - to store real numbers
➢ Alphanumeric strings (string) - to store a combination of characters and numbers
➢ A data type defines:
➢ the operations that can be done on the data,
➢ the meaning of the data, and
➢ the way values of that type can be stored.
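As a small illustration, the Python sketch below assigns values of these common types; Python infers each type, and the built-in type() function reports it (Python has no separate char type, so a one-character string stands in).

    count = 42          # int: a whole number
    is_valid = True     # bool: true or false
    letter = "A"        # char: a single character (a 1-character string in Python)
    price = 19.99       # float: a real number
    label = "Item42"    # string: a combination of characters and numbers

    print(type(count), type(is_valid), type(letter), type(price), type(label))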
Data types from Data Analytics perspective
● From a data analytics perspective, there are three common types (structures) of data:
I. Structured
II. Semi-structured
III. Unstructured
Structured Data
➢ Data that adheres to a predefined data model and is therefore straightforward to analyze.
➢ Common examples are tables in relational (SQL) databases and Excel spreadsheets.
Unstructured Data
➢ Data that either has no predefined data model or is not organized in a predefined manner.
➢ Examples: PDF files, images, audio and video files, word documents (the three shapes are contrasted in the sketch below).
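A small sketch contrasting the three shapes, using only Python's standard library; the sample names and values are invented.

    import json

    # Structured: fits a fixed, predefined schema (think: one row of a SQL table)
    structured_row = (1, "Abebe", 30)  # (id, name, age)

    # Semi-structured: self-describing tags (JSON/XML style) but no rigid schema
    semi_structured = '{"name": "Abebe", "skills": ["Python", "SQL"]}'
    record = json.loads(semi_structured)
    print(record["skills"])  # field names travel with the data itself

    # Unstructured: free-form content with no data model at all
    unstructured = "Meeting notes: Abebe will send the report next week."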
Metadata
➢ Data about data.
➢ It provides additional information about a specific set of data.
➢ It is one of the most important elements for big data analysis and solutions.
➢ It is the last category of data types considered here.
For example
➢ The metadata of a photo could describe when and where the photo was taken.
➢ The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data
Metadata -- example
● Metadata about an image
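As a hedged sketch, the Python snippet below reads the EXIF metadata embedded in an image using the Pillow library (assumed to be installed); the file name photo.jpg is hypothetical.

    from PIL import Image
    from PIL.ExifTags import TAGS

    img = Image.open("photo.jpg")  # hypothetical image file
    exif = img.getexif()           # mapping of numeric EXIF tag ids to values

    for tag_id, value in exif.items():
        tag_name = TAGS.get(tag_id, tag_id)  # translate numeric ids to readable names
        print(f"{tag_name}: {value}")        # e.g. DateTime, Model, GPSInfo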
Data Value Chain
➢ Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
➢ The Big Data Value Chain identifies the following key high-level activities:
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage and
5. Data Usage
Data Acquisition
➢ It is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage solution on which data analysis
can be carried out.
➢ Data cleaning is one of the key tasks carried out at this stage, as sketched below.
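A minimal cleaning sketch, assuming the pandas library is installed; the column names and values are invented for illustration.

    import pandas as pd

    # Raw acquired records, some with missing fields (made-up data)
    raw = pd.DataFrame({
        "customer": ["Abebe", "Sara", None, "Kebede"],
        "amount":   [120.5, None, 310.25, 42.8],
    })

    clean = raw.dropna()  # drop incomplete rows before loading into storage
    print(clean)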
Data Curation
➢ It is the active management of data over its life cycle to ensure it meets the data quality requirements for effective usage.
➢ A key trend for the curation of big data utilizes community and crowdsourcing approaches.
Data Storage
➢ It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data
➢ Relational Database Management Systems (RDBMS) have been the main,
and almost unique, solution to the storage paradigm for nearly 40 years.
➢ Relational databases guarantee database transactions through the ACID properties (Atomicity, Consistency, Isolation, and Durability), but they lack flexibility with regard to schema changes, and their performance and fault tolerance suffer as data volumes and complexity grow, making them unsuitable for big data scenarios.
➢ NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models, as sketched below.
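A small sketch of this schema-flexibility contrast, using only Python's standard library; the table, names, and values are invented for illustration.

    import sqlite3, json

    # Relational: the schema is fixed up front, and every row must conform to it.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)",
                 ("Abebe", "abebe@example.com"))

    # Document-style (NoSQL flavor): each record is self-describing,
    # so records can carry different fields without any schema change.
    docs = [
        {"name": "Abebe", "email": "abebe@example.com"},
        {"name": "Sara", "phone": "+251-911-000000", "tags": ["vip"]},
    ]
    print(json.dumps(docs, indent=2))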
Data Usage
➢ It covers the data-driven business activities that need access to the
curated data, its analysis, and the tools needed to integrate the data
analysis within the business activity
➢ In business decision-making, it can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
What Is Big Data?
➢ Big data refers to the large, diverse sets of information that grow at ever-
increasing rates.
➢ The term big data is used for massive-scale data that is difficult to store, manage, and process using traditional databases and data processing architectures.
➢ It is so difficult to process using on-hand database management tools that it requires special tools.
➢ Big data can be structured (often numeric, easily formatted and stored) or
unstructured (more free-form, less quantifiable).
➢ Nearly every department in a company can utilize findings from big data
analysis, but handling its clutter and noise can pose problems.
● Nowadays, systems and services generate huge amounts of data, from terabytes (TB) to petabytes (PB) and even zettabytes (ZB) of information.
● Examples:
○ Google (processes 20 PB a day), Facebook (15 TB/day), eBay (50 TB/day), Walmart, Twitter (500M tweets/day), traffic surveillance cameras, fraud detection, identity theft detection...
• Some examples of big data are listed as follows:
o Data generated by social networks including text, images, audio and video
data
o Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
o Machine sensor data collected from sensors embedded in industrial and energy systems, used for monitoring their health and detecting failures
o Healthcare data collected in electronic health record (EHR) systems
o Logs generated by web applications
o Stock markets data
o Transactional data generated by banking and financial applications
The 4 V’s Characterizing Big Data
Volume: Massive scale of data
○ Large amounts of data, in zettabytes or even yottabytes (massive datasets)
Velocity: How fast the data is generated
○ Data generated by certain sources can arrive at very high velocities, for example, social media data or sensor data.
Variety: Different forms of the data
○ Data comes in many different forms, from diverse sources and in diverse formats
Veracity: How accurate is the data?
○ Can we trust the data? How accurate is it? Is there doubt or uncertainty in the data?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
● Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
● To better address the high storage and computational needs of
big data, computer clusters are a better fit.
● Big data clustering software provides a number of benefits:
○ Resource Pooling
○ High Availability
○ Easy Scalability
● Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
● Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
Hadoop and its Ecosystem
● Hadoop is an open-source framework intended to make interaction with
big data easier.
○ It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
● The four key characteristics of Hadoop are:
○ Economical: Its systems are highly economical as ordinary computers can be
used for data processing.
○ Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
○ Scalable: It is easily scalable, both horizontally and vertically.
■ A few extra nodes help in scaling up the framework.
○ Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
● The Hadoop ecosystem has four core components: data management, data access, data processing, and data storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
○ HDFS: Hadoop Distributed File System
○ YARN: Yet Another Resource Negotiator
○ MapReduce: Programming-based data processing (see the sketch after this list)
○ Spark: In-Memory data processing
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL Database
○ Mahout, Spark MLLib: Machine Learning algorithm libraries
○ Solr, Lucene: Searching and indexing
○ ZooKeeper: Managing the cluster
○ Oozie: Job Scheduling
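To give a feel for the programming model behind the MapReduce component, here is a minimal single-machine word-count sketch in plain Python; real Hadoop jobs distribute these same map and reduce phases across the cluster.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: group the pairs by word and sum the counts.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["big data is big", "data science uses big data"]
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}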
Big Data Life Cycle with Hadoop
❑ Ingesting data into the system: The first stage of Big Data processing is
Ingest.
❑ The data is ingested or transferred to Hadoop from various sources such as
relational databases, systems, or local files.
❑ Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event
data.
❑ Processing the data in storage: The second stage is Processing.
● In this stage, the data is stored and processed.
❑ The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
❑ Computing and analyzing data: The third stage is Analyze.
○ Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
○ Pig converts the data using map and reduce operations and then analyzes it.
○ Hive is also based on map and reduce programming and is most suitable for structured data.
❑ Visualizing the results: The fourth stage is Access, which is performed by
tools such as Hue and Cloudera Search.
○ In this stage, the analyzed data can be accessed and communicated by users.
THANK YOU!