Data Science
END-TO-END (E2E) IMPLEMENTATION
1. Domain knowledge and problem formulation for the questions.
Domain knowledge – the function of the social network, user activities, objectives of e-commerce companies, etc.
Question – Can you recommend a new web series “W” to subscriber “X” based on their past browsing history?
Problem formulation – identify the list of variables and objectives for this problem to build an equation to be solved.
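One possible formulation of the question above, shown only as a minimal sketch: treat the subscriber's browsing history and the series as vectors over genres (the variables) and recommend when their similarity exceeds a threshold (the objective). The genre names, hours, weights, and the 0.5 threshold are all hypothetical values chosen for illustration; the slides do not prescribe this method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Variables (hypothetical): hours subscriber X spent watching each genre.
history_x = {"thriller": 12.0, "comedy": 3.0, "documentary": 1.0}

# Variables (hypothetical): how strongly series W belongs to each genre.
series_w = {"thriller": 1.0, "drama": 0.5}

# Objective (assumed): recommend W if the similarity reaches a threshold.
THRESHOLD = 0.5
score = cosine_similarity(history_x, series_w)
recommend = score >= THRESHOLD

print(f"score={score:.3f}, recommend={recommend}")
```

In practice the variables and the objective come from the domain knowledge gathered in this step; the similarity measure here is just one simple choice.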
2. Data engineering
From smartphones – Teamscope, Open Data Kit, KoboToolbox, REDCap, Magpi, Jotform Mobile Forms, CommCare, etc.
Logging tools – Log4j, Loggly, Splunk, Sumo Logic, Sematext, Logstash, Graylog, Papertrail, etc.
IoT tools – Raspberry Pi, sensors, actuators, RFID readers, scanners, temperature recorders, CCTV, etc.
SQL – MySQL, Oracle, MariaDB, PostgreSQL, Microsoft SQL Server, DB2, etc.
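As a rough illustration of the storage side of data engineering, the sketch below loads a few collected records into a SQL table and aggregates them. It uses Python's built-in sqlite3 module as a stand-in for the database servers listed above (SQLite is not on that list); the table name, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here as a stand-in for a server
# such as MySQL or PostgreSQL (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Hypothetical table of viewing events collected from client devices.
conn.execute(
    "CREATE TABLE views (subscriber TEXT, series TEXT, minutes INTEGER)"
)

rows = [
    ("X", "A", 40),
    ("X", "B", 25),
    ("Y", "A", 10),
]
conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)
conn.commit()

# Aggregate total viewing time per subscriber.
totals = conn.execute(
    "SELECT subscriber, SUM(minutes) FROM views "
    "GROUP BY subscriber ORDER BY subscriber"
).fetchall()
conn.close()

print(totals)
```

The same SQL statements would carry over to the other databases with little or no change; only the connection library differs.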
3. Exploratory data analysis – EDA (to summarize the main characteristics and behaviour of data)
Statistical measures of centre and variation, graphs, charts, plots, etc., probability distribution.
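The measures of centre and variation mentioned above can be computed with Python's standard statistics module. This is a minimal sketch on a made-up sample; the data values are hypothetical.

```python
import statistics

# Hypothetical sample, e.g. episodes watched per week by eight subscribers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of centre.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variation.
value_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

print(mean, median, mode, value_range, variance, std_dev)
```

For a sample rather than a whole population, statistics.variance and statistics.stdev apply the n - 1 correction instead.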
Feature engineering – Imputation, categorical encoding, binning, scaling, log transform, feature selection and grouping.
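Several of the feature engineering steps named above can be sketched in plain Python. The columns, values, bin boundaries, and the choice of median imputation and min-max scaling are assumptions made for illustration; feature selection and grouping are not shown.

```python
import math
import statistics

# Hypothetical subscriber records (invented for illustration).
ages = [22, None, 35, 58, None, 41]
plans = ["basic", "premium", "basic", "family", "premium", "basic"]
watch_minutes = [0, 9, 99, 999, 9999, 99]

# Imputation: replace missing ages with the median of the known ages.
known_ages = [a for a in ages if a is not None]
median_age = statistics.median(known_ages)
ages_imputed = [median_age if a is None else a for a in ages]

# Categorical encoding: one-hot encode the plan column.
categories = sorted(set(plans))
one_hot = [[1 if p == c else 0 for c in categories] for p in plans]


# Binning: map each age to a coarse group (boundaries are assumed).
def age_bin(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


age_bins = [age_bin(a) for a in ages_imputed]

# Scaling: min-max scale the ages to the interval [0, 1].
lo, hi = min(ages_imputed), max(ages_imputed)
ages_scaled = [(a - lo) / (hi - lo) for a in ages_imputed]

# Log transform: compress the heavily skewed watch-time column.
# Adding 1 keeps zero minutes defined.
log_minutes = [math.log10(m + 1) for m in watch_minutes]

print(ages_imputed, one_hot, age_bins, ages_scaled, log_minutes, sep="\n")
```

Libraries such as Scikit-learn provide ready-made versions of these transformations; the plain-Python form is used here only to make each step visible.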
5. Visualization (answering the questions) – Python libraries, Tableau, Power BI, Infogram, ChartBlocks, Datawrapper
The discovered knowledge can be presented as a table, chart, plot, graph, statistics, rules (if-else), or trees.
7. Deploying the model for production – cloud (AWS, Google), personal computer, smartwatch, etc.
WHAT DO YOU NEED FOR DATA SCIENCE?
Single machine vs distributed system platform for data science
▪ To work in data science on a single machine – Python, Excel, MATLAB, SAS, R, Weka, SQL databases, etc.
▪ To work in data science on a distributed system – Hadoop, Spark, Storm, etc.
▪ Get a grip on the basics of statistics, probability, mathematics (calculus, linear algebra), machine learning, optimization techniques, etc.
▪ Math and scientific computing libraries (NumPy/SciPy).
▪ Machine learning and deep learning libraries (Scikit-learn, TensorFlow, PyTorch, Keras, Caffe, Theano).
REQUIREMENTS FOR DATA SCIENCE JOBS
Data science job - reality What?
Data scientist role
▪ An analytical mind and critical thinking to define and work on a wide variety of problems in different domains.
▪ Good at statistics, probability, discrete mathematics, calculus, linear algebra, machine learning, optimization techniques, etc.
▪ A PhD is often expected, since PhD holders accumulate domain knowledge.
▪ Familiarity with ETL tools and storage systems such as databases, data warehouses, and distributed file systems, to design plans for storing data.