
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES


Digital
Part A: Content Design
Course Title Introduction to Data Science

Course No(s)
Credit Units 5

Content Authors Ms. Seetha Parameswaran

Version 2.0c (May 24)

Date August 5th 2022

Course Objectives
No Course Objective

CO1 Gain a basic understanding of the role of Data Science in various real-world scenarios
in business, industry and government.

CO2 Understand various roles and stages in a Data Science Project and ethical issues to
be considered.

CO3 Explore the processes, tools and technologies for collection and analysis of structured
and unstructured data.

CO4 Appreciate the importance of techniques like data visualization and storytelling with data
for effectively presenting outcomes to stakeholders.

CO5 Understand techniques of preparing real-world data for data analytics.

CO6 Implement data analytic techniques for discovering interesting patterns from data.
Text Book(s)
T1 Introduction to Data Mining, 2nd Edition, by Tan, Steinbach and Vipin Kumar; Pearson, 2021

T2 Introducing Data Science, by Cielen, Meysman and Ali

T3 Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley

T4 Data Mining: Concepts and Techniques, 4th Edition, by Jiawei Han and others; Morgan Kaufmann Publishers, 2023

Reference Book(s) & other resources


R1 The Art of Data Science, by Roger D Peng and Elizabeth Matsui
(https://bookdown.org/rdpeng/artofdatascience/)

R2 Ethics and Data Science, by DJ Patil, Hilary Mason and Mike Loukides

R3 Python Data Science Handbook: Essential Tools for Working with Data, by Jake VanderPlas

R4 KDD, SEMMA and CRISP-DM: A Parallel Overview, by Ana Azevedo and M.F. Santos; IADS-DM, 2008

Content Structure
1. Fundamentals of Data Science (2 hrs)
1.1 Real World applications
1.2 Data Science Challenges
1.3 Data Science Teams and Roles
1.4 Data Science Process
a) CRISP-DM Methodology
b) SEMMA
c) BIG DATA LIFE CYCLE
d) SMAM
1.5 Software Engineering for Data Science
1.5.1 DataOps
1.5.2 MLOps

2. Data Quality and Data Infrastructure (2 hrs)


2.1. Types of Data and Datasets
2.2. Data Quality and Issues: An overview
2.3. Data Models
2.4. Data Pipelines and patterns
2.5. Data Pipeline Stages
2.6 Modern Data Infrastructure
2.6.1 Diverse data sources
2.6.2 Cloud data warehouses and lakes
3. Data Preprocessing (6 hrs)
3.1 Data cleaning
3.2 Data Aggregation, Sampling
3.3 Statistical descriptions of data
3.4 Measuring data similarity & dissimilarity
3.5. Handling Numeric Data
3.5.1 Discretization, Binarization
3.5.2 Normalization
3.5.3 Data Smoothing
3.6 Feature Engineering
3.7 Managing Categorical Attributes
3.7.1 Transforming Categorical to Numerical Values
3.7.2 Encoding techniques
3.8 Overview of visualization techniques for Exploratory Data Analysis
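
The numeric and categorical transformations listed in 3.5 and 3.7 can be illustrated with a short sketch. This is a minimal standard-library example, not the course's lab code; the function names, the `ages` sample and the choice of equal-width binning are assumptions made for illustration.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Centre values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, k):
    """Discretize values into k equal-width bins, returning bin indices 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def binarize(values, threshold):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 indicator vectors (one column per category)."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical sample attribute used only for illustration.
ages = [18, 22, 35, 47, 60]
print(min_max_normalize(ages))
print(equal_width_bins(ages, 3))
print(binarize(ages, 30))
print(one_hot(["red", "green", "red"]))
```

Equal-width binning is only one discretization strategy; equal-frequency binning is a common alternative when the attribute is skewed.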

4. Classification and Prediction (6 hrs)


4.1. Concepts of classification and prediction
4.2. Decision trees for classification - ID3 algorithm using entropy and Gini Index
4.3 Rule based classification
4.4. Feature Subset Selection Methods
4.5. Evaluation of classification algorithms
4.6 Prediction using Regression
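
Section 4.2 names entropy and the Gini index as impurity measures for decision trees. As a minimal sketch (the function names and toy labels are illustrative assumptions), both measures and the information gain that ID3 uses to choose a split could be computed as follows. Note that ID3 classically selects splits by information gain, while the Gini index is the criterion usually associated with CART.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction obtained by partitioning `labels` into `groups`."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted


# Toy example: four records, two per class, split perfectly by some attribute.
parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]
print(entropy(parent), gini(parent), information_gain(parent, split))
```

A split that separates the classes perfectly removes all impurity, so its information gain equals the entropy of the parent node.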

5. Association Analysis (4 hrs)


5.1. Association analysis concepts
5.2. Apriori Algorithm for frequent itemsets
5.3 FP Growth for frequent itemsets
5.4. Mining association rules
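
The level-wise idea behind the Apriori algorithm (5.2) and the confidence of an association rule (5.4) can be sketched in a few lines. This is a simplified illustration rather than an optimized implementation; the function names and the market-basket transactions are assumptions introduced for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        scored = {c: support(c) for c in current}
        level = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical market-basket data used only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.6)
print(sorted(tuple(sorted(s)) for s in frequent))
print(confidence(frequent, {"diapers"}, {"beer"}))
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is discarded without counting as soon as one of its subsets is known to be infrequent.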

6. Clustering (6 hrs)
6.1. Cluster analysis concepts.
6.2. Partitioning methods – k-Means algorithm
6.3. Hierarchical methods for cluster analysis
6.4. Density based methods for cluster analysis - DBSCAN
6.5. Evaluation of clustering algorithms
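
The k-Means algorithm listed in 6.2 alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster. The sketch below shows that loop under simplifying assumptions: the initial centroids are passed in by the caller (practical implementations usually choose them randomly or with a seeding heuristic), and the function name and sample points are illustrative.

```python
from math import dist


def k_means(points, centroids, max_iter=100):
    """Cluster points with Lloyd's iteration, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments can no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Because the result depends on the starting centroids, k-Means is typically run several times with different initializations and the best clustering is kept.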

7. Anomaly Detection (2 hrs)


7.1. Concepts of Outliers
7.2. Statistical approaches
7.3. Proximity and Density based outlier detection
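
Two common statistical approaches to outlier detection (7.2) are the z-score rule and the interquartile-range (IQR) rule. The sketch below is a minimal illustration; the function names, the thresholds (2.5 standard deviations, 1.5 times the IQR) and the sample readings are assumptions chosen for the example, not values prescribed by the course.

```python
from statistics import mean, pstdev, quantiles


def z_score_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(z_score_outliers(readings, threshold=2.5))
print(iqr_outliers(readings))
```

In small samples an extreme value inflates the standard deviation and can mask itself under the z-score rule, which is one reason the quartile-based rule is often preferred for exploratory work.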

8. Storytelling with Data (1 hr)


8.1. The final deliverable
8.2. The Narrative - report / presentation structure
8.3. Building narrative with Data
8.4. Effective storytelling
9. Ethics for Data Science (1 hr)
9.1. Bias and Fairness in Data
9.2 Being a data skeptic – examples of misuse of Data
9.3 Five C’s
9.4 Ethical guidelines for Data Scientist
9.5 Ethics of data scraping and storage
9.6 Case Study
Part B: Learning Plan
Academic Term

Course Title Introduction to Data Science

Course No
Lead Instructor

Session No.   Topic Title   Resource Reference

Session 1: Introduction to Data Science
• Fundamentals of Data Science
• Real World applications
• Data Science Challenges
• Data Science Teams and Roles
• Data Science Process
  ◦ CRISP-DM Methodology
  ◦ SEMMA
  ◦ BIG DATA LIFE CYCLE
  ◦ SMAM
• Software Engineering for Data Science
  ◦ DataOps
  ◦ MLOps (intro)
Resource Reference: T3 – Ch 1; T4 – Ch 1; T1 – Ch 1; Class Room Discussion; Class Notes; Additional Reading (AR) material provided on LMS

Session 2: Data Quality and Data Infrastructure
• Types of Data and Datasets
• Data Quality and Issues: An overview
• Data Models
• Data Pipelines and patterns
• Data Pipeline Stages
• Modern Data Infrastructure
  ◦ Diverse data sources
  ◦ Cloud data warehouses and lakes
Resource Reference: T1 – Ch 2.1, 2.2; R1 – Ch 2, Ch 7; Class room discussions

Sessions 3-5: Data Preprocessing
• Data cleaning
• Data Aggregation, Sampling
• Statistical descriptions of data
• Measuring data similarity & dissimilarity
• Handling Numeric Data
  ◦ Discretization, Binarization
  ◦ Normalization
  ◦ Data Smoothing
• Feature Engineering
• Managing Categorical Attributes
  ◦ Transforming Categorical to Numerical Values
  ◦ Encoding techniques
• Overview of visualization techniques for Exploratory Data Analysis
Resource Reference: T1 – Ch 2.3, 2.4; T4 – Ch 2

Session 6: Classification and Prediction (2 hrs)
• Concepts of classification and prediction
• Decision trees for classification - ID3 algorithm using entropy and Gini Index, Occam’s razor
• (Mutual Information and Gini Index are used as Feature subset selection techniques.)
Resource Reference: T4 – Ch 6.1, 6.2, 6.3, 6.6, 6.7

Session 7: Classification and Prediction (2 hrs)
• Rule Based Classification
• Feature subset selection methods
Resource Reference: T4 – Ch 6.1, 6.2, 6.3, 6.6, 6.7

Session 8: Classification and Prediction (2 hrs)
• Evaluation of classification algorithms
• Prediction Approaches
Resource Reference: T4 – Ch 6.1, 6.2, 6.3, 6.6, 6.7; Class Notes

Session 9: Association Analysis (2 hrs)
• Association analysis concepts
• Apriori Algorithm for frequent itemsets
Resource Reference: T1 – Ch 4; T4 – Ch 4

Session 10: Association Analysis (2 hrs)
• FP Growth for frequent itemsets
• Mining association rules
Resource Reference: T1 – Ch 4; T4 – Ch 4
Session 11: Clustering
• Cluster analysis concepts
• Partitioning methods – k-Means algorithm
Resource Reference: T1 – Ch 5; T4 – Ch 8

Session 12: Clustering
• Density based methods for cluster analysis – DBSCAN
• Hierarchical methods for cluster analysis
Resource Reference: T1 – Ch 5; T4 – Ch 8

Session 13: Clustering
• Evaluation of clustering algorithms
Resource Reference: T1 – Ch 5

Session 14: Anomaly Detection
• Concepts of Outliers
• Statistical approaches
• Proximity and Density based outlier detection
Resource Reference: T1 – Ch 9; T4 – Ch 11

Session 15: Storytelling with Data
• The final deliverable
• The Narrative - report / presentation structure
• Building narrative with Data
• Effective storytelling
Resource Reference: T3 – Ch 10

Session 16: Ethics for Data Science
• Bias and Fairness
  ◦ Types of Bias
  ◦ Identifying Bias
  ◦ Evaluating Bias
• Being a data skeptic – examples of misuse of Data
• Five C’s
• Ethical guidelines for Data Scientist
• Ethics of data scraping and storage
• Case Study: IBM AI Fairness 360 (PS: Ethics for Data is the focus.)
Resource Reference:
https://hbr.org/2013/04/the-hidden-biases-in-big-data
https://www.oreilly.com/data/free/files/being-a-data-skeptic.pdf
T4 – Ch 12.4
R2 – Ch 1, Ch 3
Detailed Plan for Lab work

| Lab No. | Lab Objective | Lab Sheet Access URL | Session Reference |
|---|---|---|---|
| 1 | Introduction to Python, NumPy, SciPy, Pandas | | 2 |
| 2 | Data ingestion and extraction, data aggregation techniques | | 3 |
| 3 | Exploration and Visualizing using Matplotlib, Seaborn | | 4 |
| 4 | Data pre-processing in Python - Discretization, Binarization, Normalization, Data Smoothing, Managing Categorical Attributes | | 5 |
| 5 | Feature Engineering using Filter methods, wrapper methods, PCA | | 7 |
| 6 | Data pre-processing and Feature Engineering techniques for text, images, audio, video | | 8 |
| 7 | Decision trees for classification using scikit-learn | | 9 |
| 8 | Association Analysis using scikit-learn | | 11 |
| 9 | Clustering analysis by k-Means, hierarchical methods, DBSCAN using scikit-learn | | 13 |

Evaluation Scheme:
Legend: EC = Evaluation Component; AN = After Noon Session; FN = Fore Noon Session

| No | Name | Type | Duration | Weight | Day, Date, Session, Time |
|---|---|---|---|---|---|
| EC-1(a) | Quizzes | Online | | 10% | |
| EC-1(b) | Assignments | Take Home | | 20% | |
| EC-2 | Mid-Semester Test | Closed Book | | 25% | |
| EC-3 | Comprehensive Exam | Open Book | | 45% | |
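
As a worked illustration of how the four weights in the evaluation scheme combine, the sketch below computes a final percentage from per-component scores. The weights come from the scheme; the scores are hypothetical, and the assumption that each component is first expressed as a percentage is made only for this example, since the handout does not state how component marks are scaled.

```python
# Weights from the evaluation scheme, in percent of the final grade.
WEIGHTS = {
    "EC-1(a) Quizzes": 10,
    "EC-1(b) Assignments": 20,
    "EC-2 Mid-Semester Test": 25,
    "EC-3 Comprehensive Exam": 45,
}


def weighted_total(percent_scores):
    """Combine per-component scores (each on a 0-100 scale) into a final percentage."""
    assert sum(WEIGHTS.values()) == 100  # the scheme's weights cover the whole grade
    return sum(WEIGHTS[name] * percent_scores[name] / 100 for name in WEIGHTS)


# Hypothetical scores, for illustration only.
hypothetical_scores = {
    "EC-1(a) Quizzes": 80,
    "EC-1(b) Assignments": 70,
    "EC-2 Mid-Semester Test": 60,
    "EC-3 Comprehensive Exam": 50,
}
print(weighted_total(hypothetical_scores))
```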

Note:
Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 8
Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)
Important links and information:

Elearn portal: https://elearn.bits-pilani.ac.in or Canvas


Students are expected to visit the Elearn portal on a regular basis and stay up to date with
the latest announcements and deadlines.

Contact sessions: Students should attend the online lectures as per the schedule provided
on the Elearn portal.

Evaluation Guidelines:
1 EC-1(a) consists of two Quizzes. Students will attempt them through the course pages
on the Elearn portal. Announcements will be made on the portal in a timely manner.
2 EC-1(b) consists of either one or two Assignments. Students will attempt them through
the course pages on the Elearn portal. Announcements will be made on the portal in
a timely manner.
3 For Closed Book tests: No books or reference material of any kind will be permitted.
4 For Open Book exams: Use of books and any printed / written reference material
(filed or bound) is permitted. However, loose sheets of paper will not be allowed.
Use of calculators is permitted in all exams. Laptops/Mobiles of any kind are not
allowed. Exchange of any material is not allowed.
5 If a student is unable to appear for the Regular Test/Exam due to genuine exigencies,
the student should follow the procedure to apply for the Make-Up Test/Exam which
will be made available on the Elearn portal. The Make-Up Test/Exam will be
conducted only at selected exam centres on the dates to be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self-
study schedule as given in the course hand-out, attend the online lectures, and take all the
prescribed evaluation components such as Assignment/Quiz, Mid-Semester Test and
Comprehensive Exam according to the evaluation scheme provided in the hand-out.
