Introduction to Data Science


Introduction to Data Science
What is Data Science?

Data science is an interdisciplinary field that combines mathematics, statistics, computer science, and domain knowledge to extract meaningful insights from data. It involves processes such as data collection, cleaning, analysis, visualization, and predictive modeling to solve problems and make data-driven decisions.

Example: Predicting customer churn using historical data.


What is Digital Transformation?

Digital transformation refers to the integration of digital technologies into all areas of a business, fundamentally changing how it operates and delivers value to customers. Data science plays a key role by providing insights and enabling data-driven automation.
Example: A bank using AI-powered chatbots for customer service.
How Do Companies Get Started in Data Science?

 Companies typically follow these steps:


 Gather and organize relevant data.
 Identify key business problems where data science can add value.
 Use tools like Python, R, and visualization software to analyze data.
 Start with small, measurable projects before scaling up.
 Build a team of data professionals or outsource to consultants.

Example: A retail company predicting inventory needs using sales data.
What Are the Primary and Secondary Skills of a Data Scientist?

 Primary Skills:
 Programming (Python, R, SQL).
 Statistics and mathematics (probability, linear algebra).
 Machine learning and AI techniques.
 Data visualization (Tableau, Power BI, Matplotlib).
 Secondary Skills:
 Domain expertise (e.g., finance, healthcare).
 Communication skills for presenting findings.
 Business acumen for understanding problems and objectives.
Example: Creating a machine learning model to forecast sales trends.
What is the Relationship Between Data Science and Industry 4.0?

Industry 4.0 refers to the fourth industrial revolution, driven by IoT, AI,
robotics, and smart technologies. Data science is the backbone of Industry
4.0 as it analyzes data collected by IoT devices, optimizes processes, and
enables automation.

Example: Real-time analytics for predictive maintenance in manufacturing.


What is the Data Science Workflow?

The data science workflow includes these stages:


 Problem Definition: Understanding the business problem.
 Data Collection: Gathering data from relevant sources.
 Data Cleaning: Removing inconsistencies and errors.
 Exploratory Data Analysis (EDA): Identifying patterns and trends.
 Model Building: Using machine learning algorithms.
 Evaluation: Testing the model’s accuracy.
 Deployment: Applying the model in a real-world scenario.
 Monitoring: Ensuring the model’s ongoing effectiveness.

Example: Creating a customer segmentation model for marketing.
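The workflow can be illustrated with a short, hedged sketch in Python, reusing the earlier customer-churn example. The customers.csv file and the column names (tenure, monthly_spend, churn) are hypothetical placeholders; pandas and scikit-learn stand in for the cleaning, model-building, and evaluation stages.

    # Minimal sketch of workflow stages 2-6 (collection through evaluation).
    # File name and column names are hypothetical placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    df = pd.read_csv("customers.csv")                 # Data Collection
    df = df.dropna()                                  # Data Cleaning: drop incomplete rows
    X = df[["tenure", "monthly_spend"]]               # features chosen during EDA
    y = df["churn"]                                   # binary target: did the customer churn?

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression().fit(X_train, y_train)     # Model Building
    print(accuracy_score(y_test, model.predict(X_test)))   # Evaluation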


How is Data Science Applied to Real-World Problems?

 Data science is applied across industries to solve problems such as:


 Fraud detection in banking.
 Personalizing recommendations in e-commerce.
 Diagnosing diseases in healthcare.
 Optimizing routes in logistics.

Example: Uber using data science for route optimization and pricing
strategies.
What Are the Different Roles Within the Data Science Field?

 Data Analyst: Interprets and visualizes data to provide insights.
 Data Scientist: Designs models and algorithms to solve complex problems.
 Data Engineer: Manages data pipelines and infrastructure.
 Machine Learning Engineer: Builds and deploys machine learning models.
 Business Intelligence Specialist: Focuses on strategic data visualization for decision-making.
Example: A data engineer creating a robust pipeline to feed data into a
real-time dashboard.
Data Collection and Storage
What is Data Collection?

 Data collection is a foundational step in the data science workflow, involving the gathering of information from diverse sources. This process aims to acquire relevant and accurate data, which is critical for making informed decisions, creating models, and extracting meaningful insights.
 Manual Data Gathering: Collecting data manually, such as conducting
surveys or recording observations.
 Web Scraping: Automated extraction of data from websites using tools
like BeautifulSoup or Selenium.
 APIs (Application Programming Interfaces): Facilitating data
retrieval from platforms like Twitter, Google Maps, or weather services.
 IoT Devices (Internet of Things): Capturing data from interconnected
devices, such as sensors in smart homes or wearable health monitors.
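As a hedged illustration of API-based collection, the sketch below calls a hypothetical JSON endpoint with the requests library; the URL and query parameter are placeholders, not a real service.

    # Collecting data from a (hypothetical) REST API with the requests library.
    import requests

    response = requests.get(
        "https://api.example.com/v1/weather",   # placeholder endpoint
        params={"city": "London"},              # illustrative query parameter
        timeout=10,
    )
    response.raise_for_status()                 # stop early on HTTP errors
    data = response.json()                      # most APIs return JSON
    print(data)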
Types of Data Sources

1. Structured Data

Definition: Data that is well-organized in a tabular format with rows and columns.

Characteristics:
 Easily searchable in relational databases.
 Typically used for numerical and categorical data.
Examples:
 Employee records in an Excel sheet.
 Sales data stored in a CSV file.
Advantages: Simpler to store, query, and analyze using SQL or data analysis tools.
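A minimal sketch of loading structured data with pandas; the file name and the revenue column are illustrative assumptions.

    # Structured data: rows and columns, loaded directly into a DataFrame.
    import pandas as pd

    sales = pd.read_csv("sales.csv")     # e.g. the sales data mentioned above
    print(sales.head())                  # inspect the first rows
    print(sales["revenue"].sum())        # simple aggregate over a (hypothetical) column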
Types of Data Sources

2. Unstructured Data
Definition: Data that lacks a predefined format or organization.
Characteristics:
 Includes media files, natural language text, or sensor data.
 Difficult to analyze without preprocessing.

Examples:
 Social media posts (images, videos, captions).
 Audio recordings or PDFs.

Challenges: Requires advanced tools for processing, such as image recognition algorithms or natural language processing (NLP).
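As a small, hedged example of the preprocessing unstructured text needs before any NLP step, the sketch below tokenizes a made-up social media caption using only the standard library.

    # Crude preprocessing of unstructured text (illustrative caption).
    import re
    from collections import Counter

    post = "Loving the new phone!! Battery life is great :)"
    tokens = re.findall(r"[a-z']+", post.lower())    # lowercase and split into words
    print(Counter(tokens).most_common(3))            # simple word-frequency summary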
Types of Data Sources

3. Semi-Structured Data
Definition: Data that is partially organized, often containing tags or markers
for elements.
Characteristics:
 Falls between structured and unstructured data.
 Easier to store and query than unstructured data.

Examples:
 JSON files from web APIs.
 XML files for data transfer between systems.

Uses: Frequently used in web development and integration between systems.
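A minimal sketch of flattening a semi-structured JSON payload (as returned by many web APIs) into a table with pandas; the payload is illustrative.

    # Semi-structured data: JSON with nested fields, flattened for analysis.
    import json
    import pandas as pd

    payload = '[{"id": 1, "user": {"name": "Ana"}, "amount": 19.9}, {"id": 2, "user": {"name": "Ben"}, "amount": 5.0}]'
    records = json.loads(payload)
    df = pd.json_normalize(records)      # nested user.name becomes its own column
    print(df.columns.tolist())           # ['id', 'amount', 'user.name']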


How to Store Collected Data?

File-Based Storage:
• For smaller datasets like flat files (CSV, text files).
• Tools: Local storage, cloud storage (Google Drive, Dropbox).

Databases:
• For larger datasets and complex queries.
• Relational Databases (SQL): MySQL, PostgreSQL.
• NoSQL Databases: MongoDB, Cassandra.

Data Warehouses:
• For integrating and analyzing large datasets from multiple
sources.
• Examples: Snowflake, Google BigQuery.

Data Lakes:
• For raw and unstructured data storage.
• Examples: AWS S3, Azure Data Lake.
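The sketch below shows the database option in miniature, using SQLite from the standard library as a lightweight stand-in for MySQL or PostgreSQL; the table and column names are illustrative.

    # Writing a small DataFrame to a relational database and querying it back.
    import sqlite3
    import pandas as pd

    df = pd.DataFrame({"product": ["A", "B"], "units": [10, 4]})
    conn = sqlite3.connect("shop.db")                              # local database file
    df.to_sql("sales", conn, if_exists="replace", index=False)     # store the table
    print(pd.read_sql("SELECT product, units FROM sales", conn))   # query it with SQL
    conn.close()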
Data Pipelines

A data pipeline is an automated system that manages the flow of data from collection to storage and transformation. It ensures data is processed efficiently and delivered to where it is needed.
Steps in a Data Pipeline

 Data Ingestion:
• Collecting raw data from multiple sources, such as APIs, files, or IoT sensors.
• Tools: Kafka, Apache NiFi.

 Data Transformation:
• Cleaning: Removing missing values, duplicates, or irrelevant data.
• Formatting: Standardizing data types or creating derived variables.
• Tools: Python with pandas, Apache Spark.

 Data Storage:
• Storing the processed data in a database, data warehouse, or data lake.
• Tools: SQL databases, cloud platforms like AWS or Azure.
Tools for Data Pipelines

 Apache Airflow: Orchestrates and schedules workflows.
 Apache NiFi: Automates the flow of data between systems.
 Talend: Integrates and processes data from various sources.
Real-World Example:

 Scenario: An e-commerce company wants to analyze its sales trends.
 Data Ingestion: Fetch raw sales data from the website and API.
 Data Transformation: Standardize date formats, remove duplicate transactions, and calculate monthly revenue.
 Data Storage: Store the transformed data in Snowflake for business intelligence analysis.
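A hedged sketch of those three steps with pandas; the file names, column names, and the local SQLite database (standing in for a warehouse such as Snowflake) are assumptions for illustration.

    # Data Ingestion: read raw sales exported from the website/API (hypothetical file).
    import sqlite3
    import pandas as pd

    raw = pd.read_csv("raw_sales.csv")

    # Data Transformation: standardize dates, drop duplicate transactions,
    # and aggregate revenue by month.
    raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
    clean = raw.drop_duplicates(subset="transaction_id")
    monthly = clean.set_index("order_date")["amount"].resample("M").sum()

    # Data Storage: write the result to a database table for BI tools to query.
    with sqlite3.connect("analytics.db") as conn:
        monthly.rename("monthly_revenue").to_frame().to_sql(
            "monthly_revenue", conn, if_exists="replace")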
Introduction to Data Cleaning and Preprocessing
Key Challenges in Data

 Missing Data
 Duplicate Data
 Noisy Data
 Inconsistent Data Formats
Missing Data

Missing data occurs due to data entry errors, data loss during
extraction, or system issues. When dealing with missing data, it’s
important to understand the types of missing data and why they
occur, as this can influence how we handle them. There are three
main types of missing data:
1. MCAR (Missing Completely at Random):
Definition: In MCAR, the missing data occurs purely by chance and is
independent of both the observed and unobserved data. It means that
the reason for the missing data is completely unrelated to the data
itself.
Missing Data

2. MAR (Missing at Random):
Definition: In MAR, the probability that a value is missing is related to the observed data but not to the missing value itself. The missingness can be explained by other known variables in the dataset.

3. MNAR (Missing Not at Random):
Definition: In MNAR, the missing data is related to the value of the missing data itself. In other words, the fact that data is missing is directly related to the unobserved value. This is the most challenging type to deal with.
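Whatever the mechanism, the first practical step is to detect and treat the gaps. A minimal sketch with pandas, on an illustrative DataFrame:

    # Detecting and handling missing values (illustrative data).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 31], "income": [40000, 52000, np.nan]})
    print(df.isna().sum())                              # missing values per column
    df["age"] = df["age"].fillna(df["age"].median())    # impute age with the median
    df = df.dropna(subset=["income"])                   # drop rows still missing income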
Duplicate Data

 Definition: Duplicate data refers to records or data points that appear more
than once in your dataset. These are exact or near-identical copies of the
same information.
 Why it happens: Duplicates can occur due to various reasons such as
multiple submissions of forms, errors in data collection systems, or merging
data from different sources.
 Example: If you have a customer dataset, the same customer’s information
might appear twice due to repeated registration or system errors.
 Impact: Duplicates can distort analysis, leading to inaccurate results like
inflated counts, biased averages, or incorrect correlations.
 How to handle:
 Remove exact duplicates: You can use methods like the drop_duplicates() function in
Python (Pandas) to eliminate duplicate rows.
 Detect near-duplicates: Use techniques like fuzzy matching or manual review if
duplicates are not exact but are slightly different, such as due to spelling mistakes.
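A short sketch of the drop_duplicates() approach mentioned above; the customer records are illustrative.

    # Removing exact and key-based duplicates with pandas.
    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [101, 102, 101],
        "name": ["Ana", "Ben", "Ana"],
    })
    print(customers.duplicated().sum())                       # count exact duplicate rows
    deduped = customers.drop_duplicates()                     # drop exact copies
    deduped = deduped.drop_duplicates(subset="customer_id")   # keep one row per customer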
Noisy Data

 Definition: Noisy data refers to data that contains random errors, inconsistencies, or outliers that do not reflect the true characteristics of the variable being measured.
 Why it happens: Noisy data can result from sensor errors, human
mistakes, or environmental interference during data collection.
 Example: In a dataset of house prices, an entry showing a house
price as $1 when the typical range is $100,000–$500,000 would be
considered noise.
 Impact: Noisy data can lead to skewed results, obscure patterns, and
weaken the performance of machine learning models by introducing
outliers that influence predictions.
Noisy Data

 How to handle noisy data:

 Smoothing techniques: You can apply methods like binning, moving averages, or clustering to smooth noisy data.
 Remove outliers: Identifying and either removing or correcting outliers
using statistical methods (like Z-score) helps reduce the noise in your
data.
 Imputation: In some cases, you can replace noisy values with estimates
based on the surrounding data (e.g., replacing an outlier with the
median).
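A hedged sketch of the smoothing and outlier-removal ideas above, assuming a small pandas Series of house prices (the values are made up; with so few points the Z-score cutoff is set to 2 rather than the usual 3):

    # Smoothing and Z-score-based outlier removal (illustrative prices).
    import pandas as pd

    prices = pd.Series([120000, 125000, 118000, 130000, 1, 128000, 122000])
    smoothed = prices.rolling(window=3, center=True).median()   # moving-window smoothing

    z = (prices - prices.mean()) / prices.std()   # Z-score of each value
    cleaned = prices[z.abs() < 2]                 # drops the $1 entry in this sample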
Inconsistent Data Formats

 Definition: Inconsistent data formats occur when data entries are recorded in different formats, making it difficult to analyze or process the data in a uniform way.
 Why it happens: Inconsistent formats arise from different data
sources, human errors, or varying standards used by data collectors.
This can include mismatched date formats, different units of
measurement, or varied categorical labels.
 Example: In a dataset where the date is recorded, some records
may follow the DD/MM/YYYY format, while others might follow the
MM/DD/YYYY format. Similarly, some entries might use kg for
weight, while others use pounds.
 Impact: Inconsistent formats can lead to misinterpretation or errors
when analyzing data. For instance, incorrect date formats may lead
to wrong time-series analysis or chronological sorting.
Inconsistent Data Formats

 How to handle:
 Standardize formats: Convert all values to a common format (e.g.,
standardizing all dates to YYYY-MM-DD format).
 Unit conversion: Ensure all measurements are converted to the same
unit to maintain consistency and comparability.
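A minimal sketch of both fixes with pandas; the date strings and weights below are illustrative.

    # Standardizing dates to YYYY-MM-DD and converting pounds to kilograms.
    import pandas as pd

    dates = pd.Series(["31/12/2024", "01/11/2024"])             # recorded as DD/MM/YYYY
    standard = pd.to_datetime(dates, format="%d/%m/%Y")
    print(standard.dt.strftime("%Y-%m-%d").tolist())            # ['2024-12-31', '2024-11-01']

    pounds = pd.Series([150.0, 200.0])
    kilograms = pounds * 0.453592                               # 1 lb = 0.453592 kg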
Identifying Outliers

Definition of Outlier
An outlier is an observation or data point that lies significantly outside
the general pattern or distribution of a dataset. Outliers can be
unusually high or low values compared to the rest of the data. They
often indicate variability, errors in data collection, or special cases that
require further investigation.
Example of Outlier
Consider the following dataset of ages in a group of people:
Ages: 22, 25, 29, 30, 35, 40, 41, 44, 50, 200.

 Normal Range: Most ages are within 22 to 50 years.
 Outlier: The age "200" is far outside the normal range and is an outlier.
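The outlier in this example can also be flagged programmatically; the sketch below applies the common 1.5 × IQR rule (one convention among several) to the ages listed above.

    # Flagging outliers in the ages example with the interquartile-range rule.
    import pandas as pd

    ages = pd.Series([22, 25, 29, 30, 35, 40, 41, 44, 50, 200])
    q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
    iqr = q3 - q1
    outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]
    print(outliers.tolist())    # [200]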
