Chapter 2 Introduction To Data Science
habtamu.abune@aau.edu.et
Learning outcomes
After completing this lesson, you should be able to:
❑ Describe what data science is and the role of data scientists.
What is data?
❑ Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
What is Information?
❑ Organized or classified data that has some meaningful value for the receiver.
❑ Processed data on which decisions and actions are based.
❑ Plain collected data, as raw facts, cannot help much in decision-making.
❑ Interpreted data, created from organized, structured, and processed data in a particular context.
What is knowledge?
➢ Includes facts about real-world entities and the relationships between them.
➢ It is an understanding gained through experience.
➢ Knowledge is the appropriate collection of information, the intent of which is usefulness.
➢ Answers the question "how".
What is Wisdom?
● Wisdom embodies an understanding of fundamental principles, insight, and ethical and moral codes, gained by integrating knowledge.
○ Answers the "why" question.
● It rests on the principles that are essentially the basis for knowledge being what it is.
Data→Information→Knowledge→Wisdom
Data Processing Cycle
❑ Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
❑ The data processing cycle consists of three steps: input, processing, and output.
Input
➢ The input data is prepared in some convenient form for processing.
➢ For example, when electronic computers are used, the input data can be recorded on any of several types of input media, such as flash disks, hard disks, and so on.
Processing
➢ In this step, the input data is changed to produce data in a more useful form.
➢ For example, interest can be calculated on deposits to a bank, or a summary of sales for the month can be calculated from the sales orders data.
Output
➢ At this stage, the result of the preceding processing step is collected.
➢ The particular form of the output data depends on the use of the data.
➢ For example, the output data may be the total sales for a month or the payroll for employees (sketched below).
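To make the cycle concrete, here is a minimal Python sketch of the three steps, reusing the monthly-sales example above; the order amounts are invented for illustration.

    # Input: collected sales order amounts for the month (made-up values)
    sales_orders = [120.50, 75.00, 310.25, 42.80]

    # Processing: transform the raw orders into a more useful form (a summary)
    total_sales = sum(sales_orders)

    # Output: the result, in the form the user needs
    print(f"Total sales for the month: {total_sales:.2f}")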
Data types and their representation
Data types from Computer programming perspective
➢ A data type is simply an attribute that tells the compiler or interpreter how the programmer intends to use the data.
➢ Common data types include (see the short sketch after this list):
➢ Integers (int) - to store whole numbers
➢ Booleans (bool) - true or false
➢ Characters (char) - to store a single character
➢ Floating-point numbers (float) - to store real numbers
➢ Alphanumeric strings (string) - to store a combination of characters and numbers
➢ A data type defines:
➢ the operations that can be done on the data,
➢ the meaning of the data, and
➢ the way values of that type can be stored.
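As a small illustration, the Python sketch below assigns values of these common types; Python infers each type, and the built-in type() function reports it (Python has no separate char type, so a one-character string stands in).

    count = 42          # int: a whole number
    is_valid = True     # bool: true or false
    letter = "A"        # char: a single character (a 1-character string in Python)
    price = 19.99       # float: a real number
    label = "Item42"    # string: a combination of characters and numbers

    print(type(count), type(is_valid), type(letter), type(price), type(label))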
Data types from Data Analytics perspective
● From a data analytics perspective, there are three common types (structures) of data:
I. Structured
II. Semi-structured
III. Unstructured
Structured Data
➢ Data that adheres to a predefined data model and is therefore straightforward to analyze.
➢ Common examples are tables in relational (SQL) databases and Excel spreadsheets.
Unstructured Data
➢ Data that either has no predefined data model or is not organized in a predefined manner.
➢ Examples: PDF files, images, audio and video files, word documents (the three shapes are contrasted in the sketch below).
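A small sketch contrasting the three shapes, using only Python's standard library; the sample names and values are invented.

    import json

    # Structured: fits a fixed, predefined schema (think: one row of a SQL table)
    structured_row = (1, "Abebe", 30)  # (id, name, age)

    # Semi-structured: self-describing tags (JSON/XML style) but no rigid schema
    semi_structured = '{"name": "Abebe", "skills": ["Python", "SQL"]}'
    record = json.loads(semi_structured)
    print(record["skills"])  # field names travel with the data itself

    # Unstructured: free-form content with no data model at all
    unstructured = "Meeting notes: Abebe will send the report next week."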
Metadata
➢ Data about data.
➢ It provides additional information about a specific set of data.
➢ It is one of the most important elements for big data analysis and solutions.
➢ It is the last category of data types considered here.
For example
➢ The metadata of a photo could describe when and where the photo was taken.
➢ The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data
Metadata -- example
● Metadata about an image
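As a hedged sketch, the Python snippet below reads the EXIF metadata embedded in an image using the Pillow library (assumed to be installed); the file name photo.jpg is hypothetical.

    from PIL import Image
    from PIL.ExifTags import TAGS

    img = Image.open("photo.jpg")  # hypothetical image file
    exif = img.getexif()           # mapping of numeric EXIF tag ids to values

    for tag_id, value in exif.items():
        tag_name = TAGS.get(tag_id, tag_id)  # translate numeric ids to readable names
        print(f"{tag_name}: {value}")        # e.g. DateTime, Model, GPSInfo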
Data Value Chain
➢ Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
➢ The Big Data Value Chain identifies the following key high-level activities:
1. Data Acquisition
2. Data Analysis
3. Data Curation
4. Data Storage and
5. Data Usage
Data Acquisition
➢ It is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage solution on which data analysis
can be carried out.
➢ Data cleaning is one of the key tasks carried out at this stage, as sketched below.
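A minimal cleaning sketch, assuming the pandas library is installed; the column names and values are invented for illustration.

    import pandas as pd

    # Raw acquired records, some with missing fields (made-up data)
    raw = pd.DataFrame({
        "customer": ["Abebe", "Sara", None, "Kebede"],
        "amount":   [120.5, None, 310.25, 42.8],
    })

    clean = raw.dropna()  # drop incomplete rows before loading into storage
    print(clean)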
Data Curation
➢ It is the active management of data over its life cycle to ensure it meets the data quality requirements for effective usage.
➢ A key trend for the curation of big data utilizes community and crowdsourcing approaches.
Data Storage
➢ It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data
➢ Relational Database Management Systems (RDBMS) have been the main,
and almost unique, solution to the storage paradigm for nearly 40 years.
➢ Relational databases guarantee database transactions through the ACID properties (Atomicity, Consistency, Isolation, and Durability), but they lack flexibility with regard to schema changes, and their performance and fault tolerance suffer as data volumes and complexity grow, making them unsuitable for big data scenarios.
➢ NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models, as sketched below.
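A small sketch of this schema-flexibility contrast, using only Python's standard library; the table, names, and values are invented for illustration.

    import sqlite3, json

    # Relational: the schema is fixed up front, and every row must conform to it.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)",
                 ("Abebe", "abebe@example.com"))

    # Document-style (NoSQL flavor): each record is self-describing,
    # so records can carry different fields without any schema change.
    docs = [
        {"name": "Abebe", "email": "abebe@example.com"},
        {"name": "Sara", "phone": "+251-911-000000", "tags": ["vip"]},
    ]
    print(json.dumps(docs, indent=2))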
Data Usage
➢ It covers the data-driven business activities that need access to the
curated data, its analysis, and the tools needed to integrate the data
analysis within the business activity
➢ In business decision-making, it can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
What Is Big Data?
➢ Big data refers to the large, diverse sets of information that grow at ever-
increasing rates.
➢ The term big data is used for massive-scale data that is difficult to store, manage, and process using traditional databases and data processing architectures.
➢ It is so difficult to process using on-hand database management tools that it requires special tools.
➢ Big data can be structured (often numeric, easily formatted and stored) or
unstructured (more free-form, less quantifiable).
➢ Nearly every department in a company can utilize findings from big data
analysis, but handling its clutter and noise can pose problems.
● Nowadays, systems and services generate huge amounts of data, from terabytes (TB) to petabytes (PB) and even zettabytes (ZB) of information.
● Examples:
○ Google (processes 20 PB a day), Facebook (15 TB/day), eBay (50 TB/day), Walmart, Twitter (500M tweets/day), traffic surveillance cameras, fraud detection, identity theft detection...
• Some examples of big data are listed as follows:
o Data generated by social networks including text, images, audio and video
data
o Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
o Machine sensor data collected from sensors embedded in industrial and energy systems, used for monitoring their health and detecting failures
o Healthcare data collected in electronic health record (EHR) systems
o Logs generated by web applications
o Stock markets data
o Transactional data generated by banking and financial applications
The 4 V’s Characterizing Big Data
Volume: Massive scale of data
○ Large amounts of data, in zettabytes or even yottabytes (massive datasets)
Velocity: How fast the data is generated
○ Data generated by certain sources can arrive at very high velocities, for example, social media data or sensor data.
Variety: Different forms of the data
○ Data comes in many different forms, from diverse sources and in diverse formats
Veracity: How accurate is the data?
○ Can we trust the data? How accurate is it? Is there doubt or uncertainty in the data?
Clustered Computing and Hadoop Ecosystem
Clustered Computing
● Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
● To better address the high storage and computational needs of
big data, computer clusters are a better fit.
● Big data clustering software provides a number of benefits:
○ Resource Pooling
○ High Availability
○ Easy Scalability
● Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
● Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet Another Resource
Negotiator).
Hadoop and its Ecosystem
● Hadoop is an open-source framework intended to make interaction with
big data easier.
○ It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
● The four key characteristics of Hadoop are:
○ Economical: Its systems are highly economical as ordinary computers can be
used for data processing.
○ Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
○ Scalable: It is easily scalable, both horizontally and vertically.
■ A few extra nodes help in scaling up the framework.
○ Flexible: It is flexible; you can store as much structured and unstructured data as you need and decide how to use it later.
● The Hadoop ecosystem has four core components: data management, data access, data processing, and data storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
○ HDFS: Hadoop Distributed File System
○ YARN: Yet Another Resource Negotiator
○ MapReduce: Programming-based data processing (see the sketch after this list)
○ Spark: In-Memory data processing
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL Database
○ Mahout, Spark MLLib: Machine Learning algorithm libraries
○ Solr, Lucene: Searching and indexing
○ ZooKeeper: Managing the cluster
○ Oozie: Job Scheduling
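To give a feel for the programming model behind the MapReduce component, here is a minimal single-machine word-count sketch in plain Python; real Hadoop jobs distribute these same map and reduce phases across the cluster.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit a (word, 1) pair for every word in the document.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Reduce: group the pairs by word and sum the counts.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["big data is big", "data science uses big data"]
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}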
Big Data Life Cycle with Hadoop
❑ Ingesting data into the system: The first stage of Big Data processing is
Ingest.
❑ The data is ingested or transferred to Hadoop from various sources such as
relational databases, systems, or local files.
❑ Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event
data.
❑ Processing the data in storage: The second stage is Processing.
● In this stage, the data is stored and processed.
❑ The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
❑ Computing and analyzing data: The third stage is Analyze.
○ Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
○ Pig converts the data using map and reduce operations and then analyzes it.
○ Hive is also based on map and reduce programming and is most suitable for structured data.
❑ Visualizing the results: The fourth stage is Access, which is performed by
tools such as Hue and Cloudera Search.
○ In this stage, the analyzed data can be accessed and communicated by users.
THANK YOU!