INTRODUCTION TO DATA SCIENCE
Introduction and Administration
Plan
Why is data science important?
• “Why are you here”
What is data science?
• Mashup of disciplines
What is this course about?
• Hopefully the right mix of theory and practical skills
• Syllabus
Course requirements
• Homepage, contact details
• Grade, exam, homework assignments
1. Why are you here?
Introduction: Media Buzz
Data Scientists are in high demand
Also in Academia
Demand will outpace the supply
Israel
Pays well
2. What is data science?
Technology and rising expectations
Data Science
New Discipline
Few or no textbooks or courses cover the discipline as a whole
Compare to Software Engineering/Computer Science in the
1970s and 1980s
Data Science is what data scientists do
Why are data science and data scientists needed?
Development of enabling technology
Rising expectations from customers
2. What is data science?
Technological developments
Declining cost of storage
Declining cost of computing
Surpassing the brain
More data can be stored and processed
Value of Big Data
Devices vs. People
Internet of Things
Next frontier: IoT
2. What is data science?
Rising expectations
Cognitive Computing
People expect systems to behave like humans
Be Adaptive
Learn as information and goals change
Be Interactive
Interact easily with people and other systems
Be Contextual
Understand meaning, exploit additional sources of information
Need to process large quantities of uncertain data of
different types (text, speech, sensors, images etc.)
Cognitive and Data Science
People want their systems/devices to
behave smarter
Personal devices
Industrial systems
More data to acquire and analyze using
more complex algorithms and technologies
2. What is data science?
Some examples
Example I: Marketing
Predicting Lifetime Value (LTV)
what for: if you can predict the characteristics
of high LTV customers, this supports customer
segmentation, identifies upsell opportunities and
supports other marketing initiatives
usage: can be both an online algorithm and a static
report showing the characteristics of high LTV customers
Example II: Logistics
Demand forecasting
How many of which items will we need, and where
will we need them? (Enables lean inventory and
prevents out-of-stock situations.)
revenue impact: supports growth and militates
against revenue leakage
usage: online algorithm and static report
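A minimal Python sketch of the demand-forecasting idea (the course's own labs use R). It assumes the simplest possible baseline, a moving average over recent periods; the function name `moving_average_forecast`, the window size, and the sales figures are all hypothetical and only illustrate the "how many will we need" question.

```python
from statistics import mean


def moving_average_forecast(history, window=3):
    """Forecast next-period demand as the mean of the last `window` periods."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return mean(history[-window:])


# Hypothetical weekly unit sales for one product at one warehouse.
weekly_sales = [120, 135, 128, 140, 150, 145]
forecast = moving_average_forecast(weekly_sales, window=3)
print(f"Forecast for next week: {forecast:.1f} units")
```

A real forecast would usually account for trend and seasonality; the moving average is only a baseline to compare against.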
Example III: Healthcare
Survival analysis
Analyze survival statistics for different patient attributes
(age, blood type, gender, etc.) and treatments
Medication (dosage) effectiveness
Analyze effects of administering different types and dosages
of medication for a disease
Readmission risk
Predict risk of readmission based on patient
attributes, medical history, diagnosis, and treatment
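One common way to summarize survival statistics is the Kaplan-Meier estimator. The Python sketch below is an illustration under that assumption, not necessarily the method the course uses; the function name `kaplan_meier` and the patient data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient
    observed:  True if the event occurred, False if the patient was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) that happen exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times in months; False marks a censored patient.
durations = [5, 8, 8, 12, 15, 20]
observed = [True, True, False, True, False, True]

for time, prob in kaplan_meier(durations, observed):
    print(f"month {time}: estimated survival {prob:.3f}")
```

To compare patient attributes or treatments, the same function can be applied to each subgroup separately and the resulting curves compared.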
Example IV: Wearable Health and Fitness
Example V: Brain Computer Interface
2. What is data science?
A Mashup of disciplines
Math and Theory • Statistics, Linear Algebra, Optimization, Time Series, etc.
Applied Algorithms • Machine Learning, Data Structures, Parallel Algorithms, etc.
Engineering and Technologies • Storage and computing platforms, statistical tools, etc.
Domain Expertise • Text, Finance, Images, Econometrics, etc.
Art • Visualization, Infographics
Best practices and hacks • Handle missing values in data, transform and represent data, etc.
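As an illustration of the "handle missing values" practice, here is a minimal Python sketch of mean imputation (the course's own labs use R). Mean imputation is only one of several possible strategies and is chosen here as an assumption; the function name `impute_mean` and the sample column are hypothetical.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    return [fill if x is None else x for x in column]


# Hypothetical column of customer ages with two missing entries.
ages = [34, None, 29, 41, None]
print(impute_mean(ages))
```

Filling with the mean keeps the column average unchanged but shrinks its variance, which is one reason the choice of strategy depends on the analysis that follows.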
Yet Another View
Types of Data Scientists
Roles and Paycheck
3. About this course
A mix of theory and practice
General
Introductory course
But for advanced undergrads
Broad overview of subjects
But deep enough to have an exam
Focus on practical aspects
But not on ever-changing technology and tools
Tentative content (subject to change)
70% Statistical Machine Learning (7 weeks)
Focus on practical aspects
Classes
Necessary theoretical background
Basic R programming lab
20% Big Data Algorithms (2 weeks)
Focus on algorithms not on big data technologies
10% Data Visualization (1 week)
Grammar of graphics in R
This course is not
About big data tools or technologies
No: Hadoop technical details
Yes: Basic R programming
About statistical learning theory
No: Theoretical lower bounds or other proofs
Yes: Some theory is necessary
About a specific domain
No: Deep discussions on Text, Finance, BI etc.
Yes: Some examples will be presented
Some case studies we will cover
PREDICTION OF FUTURE MOVEMENTS IN THE STOCK MARKET • What is the next move of the S&P 500?
PREDICTING INSURANCE PURCHASE • Will a potential customer purchase?
DIRECT MARKETING • Who will respond?
HOUSING VALUATIONS • What affects the price of a house?
MARKETING OF ORANGE JUICE • Which brand will a customer buy?
EMAIL SPAM • Is this a spam message?
The course’s language of choice: R
What you are expected to know
Data is represented as a matrix
Basic linear algebra
Most problems are not well-defined/uncertain
Basic probability and statistics
Big data requires non-trivial data structures and algorithms
Basic data structures and algorithms concepts
Practical means programming
Basic Programming skills
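To make "data is represented as a matrix" concrete, here is a small Python sketch (the course's own labs use R). Each row is one observation and each column is one feature; the feature names, the values, and the weight vector `w` are hypothetical and not fitted to any data.

```python
# Each row is one observation (a house); each column is one feature.
# Hypothetical features: [area_m2, rooms, age_years]
X = [
    [70.0, 3.0, 20.0],
    [120.0, 5.0, 5.0],
    [55.0, 2.0, 40.0],
]


def column_means(matrix):
    """Mean of every feature (column) across all observations (rows)."""
    return [sum(col) / len(col) for col in zip(*matrix)]


def mat_vec(matrix, vector):
    """Matrix-vector product: one weighted sum of features per observation."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


w = [2.0, 10.0, -0.5]  # hypothetical weights, not fitted to any data

print("feature means:", column_means(X))
print("linear scores:", mat_vec(X, w))
```

Many models reduce to operations of this kind, which is why basic linear algebra is listed as a prerequisite.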
Textbooks are available online
Machine Learning and R
Big Data Algorithms
Visualization
Introduction from On-going examples
For curious minds
More on Machine Learning
More on R Programming
Becoming a data scientist
Data Scientist Skills
Quick Hacks/Examples
4. Course requirements
Requirements
Grade
100% closed-book exam
No exams from previous years are available
Both textbooks have end-of-chapter exercises
Exam questions (and HW assignments) will
be very similar to these questions
See course homepage for HW submission guidelines
A Few More Disclaimers
A very inaccurate explanation
Statistics: take a sample (data), answer questions about the
process that produced this sample
Is it a normal distribution? Estimate its mean.
Machine Learning: take a sample (data), build a model to answer
questions about future samples
Given a sample of named faces, design a model for naming a new unseen face.
Data Mining: mine a huge data store for interesting patterns or relationships
Given a DB of transactions, apply tools and algorithms to find frequent product
bundles
Data Science: do whatever is necessary to extract value from the data
Use data to improve book sales: mine patterns, engineer recommender
systems, suggest improvements, estimate impact
No clear-cut boundaries!
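The data mining example above (finding frequent product bundles) can be sketched in a few lines of Python (the course's own labs use R). This is a brute-force pair count under stated assumptions, not a scalable algorithm; the function name `frequent_pairs`, the `min_support` threshold, and the transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count product pairs and keep those bought together at least min_support times."""
    counts = Counter()
    for basket in transactions:
        # Sort and deduplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transaction database: one list of products per purchase.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "eggs"],
]

print(frequent_pairs(transactions, min_support=3))
```

Counting every pair grows quadratically with basket size, which is one reason large transaction stores call for the non-trivial algorithms mentioned in the course prerequisites.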
Disclaimer: Math in the course
All the computations are performed by the computer
You are in charge of interpreting the numbers
So you’ll have to understand the logic behind the numbers
You’ll see a significant number of formulas during the course
Mostly arithmetic, matrices and probability
You are not expected to memorize or derive each
formula (with exceptions), but you are expected to
understand its meaning and use