Data Science

This document introduces the key concepts and steps of data science, from collecting and managing data to building models and deploying solutions. It discusses the roles of data engineers, data analysts, and data scientists, and how exploratory analysis, visualization, pre-processing, and mathematical modelling turn data into knowledge. It also outlines the requirements for working in data science: programming skills, mathematics, machine learning expertise, and experience with data analytics tools.

Introduction to Data Science

Rathinaraja Jeyaraj Ph.D., RJ

Post-doctoral fellow,
University of Houston - Victoria,
Texas, USA.
FROM DATA TO KNOWLEDGE
1. Domain knowledge and problem formulation
2. Data engineering – Data engineer
2.1 Capturing (collecting) the data from a device/software/application
2.2 Ingesting (transporting) the data to the storage location
2.3 Managing the data (storing and retrieving data from databases/files)
3. Exploratory data analysis (to summarize the main characteristics and behaviour of data) – Data analyst
4. Visualization (answering the questions – table, chart, plot, graph, statistics, rules (if-else), trees) – Data visualization
5. Data pre-processing (preparing the data for feeding into the algorithm)
6. Mathematical modelling (machine learning) – Data scientist, data analytics
6.1 Building the model
6.2 Evaluating (testing) the model
6.3 Is the model good? If not, go to step 3 or step 4
7. Deploy the model for production – ML (MLOps) engineer

1. Abstract science
2. Social science
3. Natural science
4. Applied science

END-TO-END (E2E) IMPLEMENTATION
1. Domain knowledge and problem formulation for the questions.

Example: For a web-series recommender system at Netflix,

Domain knowledge – the function of the social network, user activities, objectives of e-commerce companies, etc.

Question – Can you recommend a new web series “W” to subscriber “X” based on their past browsing history?

Problem formulation – identify the list of variables and objectives for this problem to build an equation to be solved.

2. Data engineering

2.1 Capturing (collecting) the data from devices/software/application

From smartphones – Teamscope, Open Data Kit, KoboToolbox, REDCap, Magpi, Jotform Mobile, CommCare, etc.

Logging tools – Log4j, Loggly, Splunk, Sumo Logic, Sematext, Logstash, Graylog, Papertrail, etc.

IoT tools – Raspberry Pi, sensors, actuators, RFID readers, scanners, temperature recorders, CCTV, etc.

Any applications – Facebook, Instagram, WhatsApp, etc.

2.2 Ingesting (transporting) the data – Kafka, NiFi, Kinesis, Spark, Storm, Syncsort, Flume, Chukwa, Sqoop, Samza, etc.

2.3 Managing the data (storing and retrieving from databases/files)

SQL – MySQL, Oracle, MariaDB, PostgreSQL, Microsoft SQL Server, DB2, etc.

NoSQL – HBase, MongoDB, Cassandra, DynamoDB, Neo4j, etc.

File formats – CSV, XML, JSON, images, videos, etc.
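
As a small illustration of step 2.3, the sketch below stores a few records in a SQL database, retrieves an aggregate, and exports it as CSV and JSON. It is only a sketch: Python's built-in sqlite3 module stands in for the SQL servers listed above, and the table name `views`, its columns, and the records are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical viewing records: (subscriber, series, minutes watched).
records = [("X", "W1", 42), ("X", "W2", 15), ("Y", "W1", 60)]

# Store: an in-memory SQLite database stands in for a production SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (subscriber TEXT, series TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", records)
conn.commit()

# Retrieve: total minutes watched per subscriber.
totals = conn.execute(
    "SELECT subscriber, SUM(minutes) FROM views "
    "GROUP BY subscriber ORDER BY subscriber"
).fetchall()
conn.close()

# Export the same result as CSV text and as JSON text.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["subscriber", "total_minutes"])
writer.writerows(totals)
csv_text = csv_buffer.getvalue()

json_text = json.dumps([{"subscriber": s, "total_minutes": m} for s, m in totals])

print(totals)
print(csv_text)
print(json_text)
```

The same store, query, and export pattern would carry over to a server database such as MySQL or PostgreSQL with the appropriate driver.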

3. Exploratory data analysis – EDA (to summarize the main characteristics and behaviour of data)

Statistical measures of centre and variation, probability distributions, graphs, charts, plots, etc.
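
A minimal sketch of these summaries, assuming a small hypothetical sample and using only Python's standard library (in practice, a library such as Pandas would typically do this on a full dataset):

```python
import statistics
from collections import Counter

# Hypothetical sample: minutes watched per session by one subscriber.
minutes = [12, 45, 30, 45, 60, 18, 45, 30]

# Measures of centre.
mean = statistics.mean(minutes)
median = statistics.median(minutes)
mode = statistics.mode(minutes)

# Measures of variation.
value_range = max(minutes) - min(minutes)
sample_std = statistics.stdev(minutes)

# Empirical distribution: relative frequency of each observed value.
counts = Counter(minutes)
distribution = {value: count / len(minutes) for value, count in sorted(counts.items())}

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, sample std={sample_std:.2f}")
print(distribution)
```

Comparing the mean with the median gives a quick hint about skew, and the relative frequencies are a first, rough look at the distribution of the data.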

4. Data pre-processing (preparing the data for modelling) – data wrangling

Data cleaning – Binning, clustering, regression, normalization, aggregation, etc.

Data transformation – Smoothing, aggregation, normalization, feature extraction, etc.

Data integration – Correlation analysis, etc.

Data reduction – Data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization.

Feature engineering – Imputation, categorical encoding, binning, scaling, log transform, feature selection and grouping.
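
Several of the feature-engineering steps named above can be sketched in a few lines. The example below is illustrative only: the records, the column names (`age`, `plan`, `minutes`), and the bin boundaries are hypothetical, and it uses plain Python rather than Pandas or Scikit-learn so that each step stays visible.

```python
import math
import statistics

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 25, "plan": "basic", "minutes": 120},
    {"age": None, "plan": "premium", "minutes": 3000},
    {"age": 41, "plan": "basic", "minutes": 45},
    {"age": 33, "plan": "standard", "minutes": 600},
]

# Imputation: replace a missing age with the median of the observed ages.
observed_ages = [r["age"] for r in rows if r["age"] is not None]
median_age = statistics.median(observed_ages)
ages = [r["age"] if r["age"] is not None else median_age for r in rows]

# Scaling: min-max normalization maps ages onto the interval [0, 1].
lo, hi = min(ages), max(ages)
scaled_ages = [(a - lo) / (hi - lo) for a in ages]

# Log transform: log(1 + x) compresses the skewed 'minutes' column.
log_minutes = [math.log1p(r["minutes"]) for r in rows]

# Categorical encoding: one-hot encode the 'plan' column.
categories = sorted({r["plan"] for r in rows})
one_hot = [[1 if r["plan"] == c else 0 for c in categories] for r in rows]


# Binning: discretize age into coarse groups (boundaries are arbitrary).
def age_bin(age):
    if age < 30:
        return "under_30"
    if age < 40:
        return "30_to_39"
    return "40_plus"


age_bins = [age_bin(a) for a in ages]

print(ages, scaled_ages, log_minutes, categories, one_hot, age_bins, sep="\n")
```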

5. Visualization (answering the questions) – Python libraries, Tableau, Power BI, Infogram, ChartBlocks, Datawrapper, etc.

The discovered knowledge can be presented as a table, chart, plot, graph, statistics, rules (if-else), or trees.

6. Mathematical modelling – machine learning (Python libraries, R, Weka, MATLAB)

6.1 Building the model from pre-processed data

6.2 Evaluating (testing) the model

6.3 Is the model good? If not, go to step 3 or step 4
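
The build, evaluate, and decide loop of steps 6.1 to 6.3 can be sketched as follows. This is a minimal illustration under stated assumptions: the data are synthetic, the model is a simple least-squares line rather than any particular machine learning library's estimator, and the acceptance threshold of 0.9 on R² is an arbitrary choice.

```python
import random

# Hypothetical data: y is roughly 2*x + 1 with a little noise.
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.uniform(-0.5, 0.5) for x in xs]

# Hold out every fourth point for testing; train on the rest.
test_idx = set(range(0, len(xs), 4))
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]


def fit_line(points):
    """6.1 Build: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(points, slope, intercept):
    """6.2 Evaluate: coefficient of determination on the given points."""
    mean_y = sum(y for _, y in points) / len(points)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return 1.0 - ss_res / ss_tot


slope, intercept = fit_line(train)
score = r_squared(test, slope, intercept)

# 6.3 Decide: the 0.9 threshold is an arbitrary choice for this sketch.
model_is_good = score >= 0.9
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={score:.3f}, good={model_is_good}")
```

Evaluating on held-out points rather than on the training data is what makes the check in step 6.3 meaningful; if the score falls short, the earlier analysis and pre-processing steps are revisited.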

7. Deploying the model for production – cloud (AWS, Google), personal computer, smartwatch, etc.
WHAT DO YOU NEED FOR DATA SCIENCE?
Single machine vs distributed system platform for data science

▪ To work in data science on a single machine – Python, Excel, MATLAB, SAS, R, Weka, SQL databases, etc.

▪ To work in data science on a distributed system – Hadoop, Spark, Storm, etc.

To get into data science using Python

▪ Invest your time and gain respective domain/subject knowledge.

▪ Get a grip on the basics of statistics, probability, mathematics (calculus, linear algebra), machine learning, optimization techniques, etc.

▪ Python distribution (Anaconda).

▪ Python programming and IDEs (Jupyter/Spyder, Google Colab).

▪ Math and scientific computing libraries (NumPy/SciPy).

▪ Data pre-processing and managing library (Pandas).

▪ Graphing and visualization library (Matplotlib/Plotly/Seaborn).

▪ Machine learning and deep learning libraries (Scikit-learn, TensorFlow, PyTorch, Keras, Caffe, Theano).

▪ To work on an image dataset for computer vision (OpenCV).

▪ To work on a text dataset for NLP (NLTK).

REQUIREMENTS FOR DATA SCIENCE JOBS

Data science job – expectation: “Now, I am an expert in data science.”

Data science job – reality: “What?”

Data scientist role

▪ An analytical mind and critical thinking to define and work on a wide variety of problems in different domains.

▪ Strong familiarity with algorithm design techniques for a given problem.

▪ Good at statistics, probability, discrete mathematics, calculus, linear algebra, machine learning, optimization techniques, etc.

▪ Good programming knowledge.

▪ Experience in data analytics.

▪ Working knowledge of data science E2E implementation tools.

▪ A PhD is expected, as PhD holders have accumulated domain knowledge.

▪ Ultimately, the role focuses more on building models (algorithms) in data analytics.

Data analyst role

▪ Sufficient knowledge of exploratory data analysis tasks.

▪ Hands-on experience in using algorithms (pre-built models), sometimes building algorithms.

▪ A graduate degree is preferred.

Data engineer role

▪ Knowledge of ETL tools and of databases, data warehouses, and distributed file systems for designing data storage plans.

▪ An undergraduate degree is sufficient.

Any QUESTIONS?

You can reach me at: jrathinaraja@gmail.com

Personal website: https://jrathinaraja.co.in/
