Data Science

This document provides an introduction and overview of the key concepts and steps involved in data science, from collecting and managing data to building models and deploying solutions. It discusses the roles of data engineers, analysts, scientists, and how exploratory analysis, visualization, preprocessing, and mathematical modeling are used to turn data into knowledge. Requirements for working in data science such as programming skills, mathematics, machine learning expertise, and experience with data analytics tools are also outlined.

Uploaded by

scientist01234

Introduction to Data Science

Rathinaraja Jeyaraj Ph.D., RJ


Post-doctoral fellow,
University of Houston - Victoria,
Texas, USA.
FROM DATA TO KNOWLEDGE
1. Domain knowledge and problem formulation
2. Data engineering – Data engineer
2.1 Capturing (collecting) the data from device/software/application
2.2 Ingesting (transporting) the data to the storage location
2.3 Managing the data (storing and retrieving data from databases/files)
3. Exploratory data analysis (to summarize the main characteristics and behaviour of data) – Data analyst
4. Visualization (answering the questions – table, chart, plot, graph, statistics, rules (if-else), trees) – Data visualization
5. Data pre-processing (preparing the data for feeding into the algorithm)
6. Mathematical modelling (machine learning) – Data scientist, data analytics
6.1 Building the model
6.2 Evaluating (testing) the model
6.3 Is the model good? If not, go to step 3 or step 4
7. Deploy the model for production – ML (MLOps) engineer


1. Abstract science
2. Social science
3. Natural science
4. Applied science

END-TO-END (E2E) IMPLEMENTATION
1. Domain knowledge and problem formulation for the questions.

Example: For a web-series recommender system at Netflix,

Domain knowledge – the function of the social network, user activities, the objectives of e-commerce companies, etc.

Question – Can you recommend a new web series “W” to subscriber “X” based on their past browsing history?

Problem formulation – identify the list of variables and objectives for this problem to build an equation to be solved.

2. Data engineering

2.1 Capturing (collecting) the data from devices/software/application

From smartphones – Teamscope, Open Data Kit, KoboToolbox, REDCap, Magpi, Jotform Mobile, CommCare, etc.

Logging tools – Log4j, Loggly, Splunk, Sumo Logic, Sematext, Logstash, Graylog, Papertrail, etc.

IoT tools – Raspberry Pi, sensors, actuators, RFID readers, scanners, temperature recorders, CCTV, etc.

Any applications – Facebook, Instagram, WhatsApp, etc.


2.2 Ingesting (transporting) the data – Kafka, NiFi, Kinesis, Spark, Storm, Syncsort, Flume, Chukwa, Sqoop, Samza, etc.

2.3 Managing the data (storing and retrieving from databases/files)

SQL – MySQL, Oracle, MariaDB, PostgreSQL, Microsoft SQL Server, DB2, etc.

NoSQL – HBase, MongoDB, Cassandra, DynamoDB, Neo4j, etc.

File formats – CSV, XML, JSON, images, videos, etc.
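
As a minimal sketch of step 2.3, the snippet below stores a few records in an SQL database, retrieves an aggregate, and exports it as CSV and JSON using only Python's standard library. The `readings` table, its columns, and the sample values are hypothetical and chosen only for illustration; SQLite stands in here for the server databases listed above.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sensor readings: (device_id, temperature in Celsius).
rows = [("dev-1", 21.5), ("dev-2", 19.0), ("dev-1", 22.1)]

# Store: an in-memory SQLite database stands in for a production SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()

# Retrieve: average temperature per device, ordered by device id.
query = (
    "SELECT device_id, AVG(temp_c) FROM readings "
    "GROUP BY device_id ORDER BY device_id"
)
averages = conn.execute(query).fetchall()
conn.close()

# Export the same result as CSV text and as JSON text.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["device_id", "avg_temp_c"])
writer.writerows(averages)
csv_text = csv_buffer.getvalue()

json_text = json.dumps([{"device_id": d, "avg_temp_c": t} for d, t in averages])

print(csv_text)
print(json_text)
```

The same store, query, and export pattern carries over to MySQL or PostgreSQL through their own Python drivers, which are not shown here.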

3. Exploratory data analysis – EDA (to summarize the main characteristics and behaviour of data)

Statistical measures of centre and variation, probability distributions, graphs, charts, plots, etc.
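
The measures of centre and variation mentioned for EDA can be sketched with Python's built-in `statistics` module. The viewing-time sample below is made up for illustration; in practice the values would come from the stored data.

```python
import statistics

# Hypothetical sample: daily viewing minutes for one subscriber.
minutes = [30, 45, 45, 60, 90, 120, 45]

# Measures of centre.
mean = statistics.mean(minutes)
median = statistics.median(minutes)
mode = statistics.mode(minutes)

# Measures of variation (sample variance and sample standard deviation).
value_range = max(minutes) - min(minutes)
sample_variance = statistics.variance(minutes)
sample_std = statistics.stdev(minutes)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} variance={sample_variance:.2f} std={sample_std:.2f}")
```

Here the mean sits well above the median, which hints that a few long viewing days skew the distribution to the right.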

4. Data pre-processing (preparing the data for modelling) - Data wrangling

Data cleaning – Binning, clustering, regression, normalization, aggregation, etc.

Data transformation – Smoothing, aggregation, normalization, feature extraction, etc.

Data integration – Correlation analysis, etc.


Data reduction – Data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization

Feature engineering – Imputation, categorical encoding, binning, scaling, log transform, feature selection and grouping.
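
Several of the pre-processing steps named above (imputation, scaling, categorical encoding, log transform, binning) can be sketched in plain Python. The records, category names, and bin boundaries below are hypothetical; libraries such as Pandas and Scikit-learn offer equivalent operations, which this sketch deliberately does not rely on.

```python
import math
import statistics

# Hypothetical raw records; None marks a missing value.
ages = [25, None, 40, 35]
plans = ["basic", "premium", "basic", "standard"]
incomes = [1000.0, 10000.0, 100000.0, 1000.0]

# Imputation: replace missing ages with the mean of the observed ages.
observed = [a for a in ages if a is not None]
age_mean = statistics.mean(observed)
ages_imputed = [age_mean if a is None else a for a in ages]

# Scaling: min-max normalization maps ages onto the interval [0, 1].
lo, hi = min(ages_imputed), max(ages_imputed)
ages_scaled = [(a - lo) / (hi - lo) for a in ages_imputed]

# Categorical encoding: one-hot vectors over the sorted category names.
categories = sorted(set(plans))
plans_onehot = [[1 if p == c else 0 for c in categories] for p in plans]

# Log transform: compress the heavily skewed income values.
incomes_log = [math.log10(x) for x in incomes]


# Binning: assign each imputed age to a coarse, hand-picked age group.
def age_bin(age):
    if age < 30:
        return "under_30"
    if age < 40:
        return "30_to_39"
    return "40_plus"


age_groups = [age_bin(a) for a in ages_imputed]

print(ages_scaled, plans_onehot, incomes_log, age_groups)
```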

5. Visualization (answering the questions) – Python libraries, Tableau, Power BI, Infogram, ChartBlocks, Datawrapper

The discovered knowledge can be presented as a table, chart, plot, graph, statistics, rules (if-else), or trees.

6. Mathematical modelling – machine learning (Python libraries, R, Weka, MATLAB)

6.1 Building the model from pre-processed data

6.2 Evaluating (testing) the model

6.3 Is the model good? If not, go back to step 3 or step 4
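
The build, evaluate, decide loop of step 6 can be illustrated with a very small model. The sketch below fits a straight line by ordinary least squares on synthetic data, holds out some points for testing, and checks the error against a threshold. The data, the split, and the threshold of 0.5 are all assumptions made for illustration, and simple linear regression stands in for whatever model a real project would use.

```python
import random

# Hypothetical data: y is roughly 2*x + 1 with a little uniform noise.
random.seed(0)
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + random.uniform(-0.5, 0.5) for x in xs]

# Hold out every fourth point for testing; train on the rest.
test_idx = set(range(0, 20, 4))
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]

# 6.1 Build: fit y = slope * x + intercept by ordinary least squares.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
sxx = sum((x - mean_x) ** 2 for x, _ in train)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# 6.2 Evaluate: mean squared error on the held-out test points.
mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)

# 6.3 Decide: the acceptance threshold is an arbitrary illustrative choice.
model_is_good = mse < 0.5
print(f"slope={slope:.3f} intercept={intercept:.3f} mse={mse:.3f} good={model_is_good}")
```

If `model_is_good` came out false, the workflow would return to the earlier analysis and pre-processing steps rather than tuning the model in isolation.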

7. Deploying the model for production – cloud (AWS, Google), personal computer, smartwatch, etc.
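
One simple way to picture deployment is to export the trained parameters to a file and have the target application reload them to make predictions. The sketch below assumes a linear model with hypothetical `slope` and `intercept` values and a temporary `model.json` file; real deployments on cloud platforms or devices involve packaging and serving steps not shown here.

```python
import json
import os
import tempfile

# Hypothetical parameters of an already trained linear model.
model = {"slope": 2.0, "intercept": 1.0}

# Export: write the parameters to a file that ships with the application.
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(model, f)


# Serve: the deployed application reloads the parameters and predicts.
def predict(x, params):
    return params["slope"] * x + params["intercept"]


with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

prediction = predict(3.0, loaded)
print(prediction)
```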
WHAT DO YOU NEED FOR DATA SCIENCE?
Single machine vs distributed system platform for data science

▪ To work in data science on a single machine – Python, Excel, MATLAB, SAS, R, Weka, SQL databases, etc.

▪ To work in data science on a distributed system – Hadoop, Spark, Storm, etc.

To get into data science using Python

▪ Invest time in gaining the relevant domain/subject knowledge.

▪ Get a grip on the basics of statistics, probability, mathematics (calculus, linear algebra), machine learning, optimization techniques, etc.

▪ Python distribution (Anaconda).

▪ Python programming and IDEs (Jupyter/Spyder, Google Colab).

▪ Math and scientific computing libraries (NumPy/SciPy).

▪ Data pre-processing and managing library (Pandas).

▪ Graphing and visualization library (Matplotlib/Plotly/Seaborn).

▪ Machine learning and deep learning libraries (Scikit-learn, TensorFlow, PyTorch, Keras, Caffe, Theano).

▪ To work on an image dataset for computer vision (OpenCV).

▪ To work on a text dataset for NLP (NLTK).

REQUIREMENTS FOR DATA SCIENCE JOBS

Data science job, expectation: "Now, I am an expert in data science."

Data science job, reality: "What?"

Data scientist role

▪ An analytical mind and critical thinking to define and work on a wide variety of problems in different domains.

▪ Strong familiarity with algorithm design techniques for a given problem.

▪ Good at statistics, probability, discrete mathematics, calculus, linear algebra, machine learning, optimization techniques, etc.

▪ Good programming knowledge.

▪ Experience in data analytics.

▪ Working knowledge of data science E2E implementation tools.

▪ A PhD is often expected, as doctoral work builds up domain knowledge.

▪ Ultimately, more focused on building models (algorithms) in data analytics.

Data analyst role

▪ Sufficient knowledge of exploratory data analysis tasks.

▪ Hands-on experience in using algorithms (pre-built models), and sometimes in building algorithms.

▪ A graduate degree is preferred.

Data engineer role

▪ Knowledge of ETL tools and of storage systems such as databases, data warehouses, and distributed file systems, for designing data storage plans.

▪ An undergraduate degree is good enough.


Any QUESTIONS?

You can reach me at: jrathinaraja@gmail.com


Personal website: https://jrathinaraja.co.in/
