Introduction to Data Science
What is Data Science?
Data science is an interdisciplinary field that combines programming,
statistics, and domain expertise to extract insights and predictions from
data.
Primary Skills:
Programming (Python, R, SQL).
Statistics and mathematics (probability, linear algebra).
Machine learning and AI techniques.
Data visualization (Tableau, Power BI, Matplotlib).
Secondary Skills:
Domain expertise (e.g., finance, healthcare).
Communication skills for presenting findings.
Business acumen for understanding problems and objectives.
Example: Creating a machine learning model to forecast sales trends.
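As a minimal sketch of that example, the following fits a simple linear trend
to toy monthly sales figures with scikit-learn; the data and the choice of a
plain linear model are illustrative assumptions, not part of the original
example:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical data: month index vs. monthly sales
    months = np.arange(1, 13).reshape(-1, 1)   # months 1..12 as the feature
    sales = np.array([200, 210, 225, 240, 238, 260,
                      275, 280, 295, 310, 320, 335])

    model = LinearRegression()
    model.fit(months, sales)

    # Forecast sales for the next three months
    future = np.array([[13], [14], [15]])
    print(model.predict(future))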
What is the Relationship Between Data Science and Industry?
Industry 4.0 refers to the fourth industrial revolution, driven by IoT, AI,
robotics, and smart technologies. Data science is the backbone of Industry
4.0 as it analyzes data collected by IoT devices, optimizes processes, and
enables automation.
Example: Uber using data science for route optimization and pricing
strategies.
Types of Data Sources
1. Structured Data
Definition: Data that is well-organized in a tabular format with rows and columns.
Characteristics:
Easily searchable in relational databases.
Typically used for numerical and categorical data.
Examples:
Employee records in an Excel sheet.
Sales data stored in a CSV file.
Advantages: Simpler to store, query, and analyze using SQL or data analysis tools.
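As a small illustration, structured records like these can be loaded and
queried directly with pandas; the employee data below is hypothetical:

    import pandas as pd

    # Hypothetical employee records in tabular (structured) form
    df = pd.DataFrame({
        "name": ["Asha", "Ben", "Carla"],
        "department": ["Sales", "IT", "Sales"],
        "salary": [52000, 61000, 58000],
    })

    # A simple query: average salary per department
    print(df.groupby("department")["salary"].mean())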
2. Unstructured Data
Definition: Data that lacks a predefined format or organization.
Characteristics:
Includes media files, natural language text, or sensor data.
Difficult to analyze without preprocessing.
Examples:
Social media posts (images, videos, captions).
Audio recordings or PDFs.
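As a tiny illustration of the preprocessing such data needs, here is a sketch
that normalizes a hypothetical social media caption before analysis:

    import re

    caption = "Loving the new phone!!! 😍 #tech #upgrade   visit http://example.com"

    text = caption.lower()
    text = re.sub(r"http\S+", "", text)        # strip URLs
    text = re.sub(r"[^a-z\s#]", "", text)      # drop emojis and punctuation
    tokens = text.split()
    print(tokens)  # ['loving', 'the', 'new', 'phone', '#tech', '#upgrade', 'visit']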
3. Semi-Structured Data
Definition: Data that is partially organized, often containing tags or markers
for elements.
Characteristics:
Falls between structured and unstructured data.
Easier to store and query than unstructured data.
Examples:
JSON files from web APIs.
XML files for data transfer between systems.
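For instance, a JSON payload from a web API can be flattened into a table
with pandas; the payload below is a made-up example:

    import json
    import pandas as pd

    # Hypothetical JSON returned by a web API
    payload = '[{"id": 1, "user": {"name": "Asha"}}, {"id": 2, "user": {"name": "Ben"}}]'
    records = json.loads(payload)

    # json_normalize flattens the nested "user" object into columns
    df = pd.json_normalize(records)
    print(df)  # columns: id, user.name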
Data Storage Options
File-Based Storage:
• For smaller datasets stored as flat files (CSV, text files).
• Tools: Local storage, cloud storage (Google Drive, Dropbox).
Databases:
• For larger datasets and complex queries.
• Relational Databases (SQL): MySQL, PostgreSQL.
• NoSQL Databases: MongoDB, Cassandra.
Data Warehouses:
• For integrating and analyzing large datasets from multiple
sources.
• Examples: Snowflake, Google BigQuery.
Data Lakes:
• For raw and unstructured data storage.
• Examples: AWS S3, Azure Data Lake.
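As a minimal sketch, a table can be written to and read back from a
relational database; SQLite stands in here for a server database like MySQL
or PostgreSQL, and the orders data is hypothetical:

    import sqlite3
    import pandas as pd

    df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [19.9, 5.0, 32.5]})

    # SQLite as a lightweight stand-in for MySQL/PostgreSQL
    con = sqlite3.connect("sales.db")
    df.to_sql("orders", con, if_exists="replace", index=False)

    # Query it back with plain SQL
    print(pd.read_sql("SELECT SUM(amount) AS total FROM orders", con))
    con.close()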
Data Pipelines
Data Ingestion:
• Collecting raw data from multiple sources, such as APIs, files, or IoT sensors.
• Tools: Kafka, Apache NiFi.
Data Transformation:
• Cleaning: Removing missing values, duplicates, or irrelevant data.
• Formatting: Standardizing data types or creating derived variables.
• Tools: Python with pandas, Apache Spark (see the sketch after this list).
Data Storage:
• Storing the processed data in a database, data warehouse, or data lake.
• Tools: SQL databases, cloud platforms like AWS or Azure.
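A minimal sketch of the transformation step with pandas, assuming
hypothetical raw sales records from the ingestion stage:

    import pandas as pd

    # Hypothetical raw records collected during ingestion
    raw = pd.DataFrame({
        "date": ["2024-01-03", "2024-01-03", None, "2024-01-05"],
        "amount": [100.0, 100.0, 80.0, None],
    })

    clean = (
        raw.drop_duplicates()          # remove duplicate rows
           .dropna(subset=["date"])    # drop rows missing a date
           .assign(date=lambda d: pd.to_datetime(d["date"]))  # standardize types
    )
    print(clean)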
Common Data Quality Issues
Missing Data
Duplicate Data
Noisy Data
Inconsistent Data Formats
Missing Data
Missing data occurs due to data entry errors, data loss during
extraction, or system issues. When dealing with missing data, it’s
important to understand the types of missing data and why they
occur, as this can influence how we handle them. There are three
main types of missing data:
1. MCAR (Missing Completely at Random):
Definition: In MCAR, the missing data occurs purely by chance and is
independent of both the observed and unobserved data. The reason a value
is missing is completely unrelated to the data itself.
2. MAR (Missing at Random):
Definition: In MAR, the probability that a value is missing depends on
other observed variables, but not on the missing value itself (e.g.,
younger respondents being more likely to skip an income question).
3. MNAR (Missing Not at Random):
Definition: In MNAR, the missingness depends on the unobserved value
itself (e.g., people with very high incomes declining to report income).
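A minimal sketch of inspecting and handling missing values with pandas; the
toy data and the median-imputation choice are assumptions for illustration:

    import pandas as pd

    df = pd.DataFrame({"age": [22, None, 30, 41],
                       "city": ["Pune", "Delhi", None, "Pune"]})

    print(df.isna().sum())  # count missing values per column

    # Impute a numeric column with its median...
    df["age"] = df["age"].fillna(df["age"].median())
    # ...or drop rows missing a categorical value
    df = df.dropna(subset=["city"])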
Duplicate Data
Definition: Duplicate data refers to records or data points that appear more
than once in your dataset. These are exact or near-identical copies of the
same information.
Why it happens: Duplicates can occur due to various reasons such as
multiple submissions of forms, errors in data collection systems, or merging
data from different sources.
Example: If you have a customer dataset, the same customer’s information
might appear twice due to repeated registration or system errors.
Impact: Duplicates can distort analysis, leading to inaccurate results like
inflated counts, biased averages, or incorrect correlations.
How to handle:
Remove exact duplicates: You can use methods like the drop_duplicates() function in
Python (Pandas) to eliminate duplicate rows.
Detect near-duplicates: Use techniques like fuzzy matching or manual review if
duplicates are not exact but are slightly different, such as due to spelling mistakes.
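A short sketch of both approaches: exact duplicates use drop_duplicates()
directly, while the near-duplicate check uses difflib from the Python
standard library as one simple fuzzy-matching option (the customer names
are hypothetical):

    import difflib
    import pandas as pd

    df = pd.DataFrame({"customer": ["Asha Rao", "Asha Rao", "Asha Raoo", "Ben Das"]})

    # 1. Remove exact duplicate rows
    df = df.drop_duplicates()

    # 2. Flag near-duplicates, e.g. spelling variants, for manual review
    names = df["customer"].tolist()
    for name in names:
        matches = difflib.get_close_matches(name, [n for n in names if n != name],
                                            cutoff=0.9)
        if matches:
            print(name, "may duplicate", matches)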
Noisy Data
Definition: Noisy data contains random errors or fluctuations, such as
sensor glitches or typos, that obscure the true signal in a dataset.
How to handle: Smooth the data (e.g., with binning or moving averages) or
correct clear errors after investigation, as sketched below.
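A minimal smoothing sketch using a rolling mean in pandas; the readings and
the window size are arbitrary choices for illustration:

    import pandas as pd

    readings = pd.Series([20.1, 20.3, 35.0, 20.2, 20.4, 20.1])  # 35.0 looks like noise

    smoothed = readings.rolling(window=3, center=True).mean()  # 3-point moving average
    print(smoothed)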
Inconsistent Data Formats
Definition: The same information is recorded in different formats or units
across records, for example mixed date styles or mixed measurement units.
How to handle:
Standardize formats: Convert all values to a common format (e.g.,
standardizing all dates to YYYY-MM-DD format).
Unit conversion: Ensure all measurements are converted to the same
unit to maintain consistency and comparability.
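A small sketch of both fixes with pandas (the input values are hypothetical,
and format="mixed" requires pandas 2.0 or newer):

    import pandas as pd

    # Standardize mixed date strings to YYYY-MM-DD
    dates = pd.Series(["2024-01-05", "05/01/2024", "Jan 5, 2024"])
    parsed = pd.to_datetime(dates, format="mixed")  # pandas >= 2.0
    print(parsed.dt.strftime("%Y-%m-%d"))

    # Unit conversion: weights recorded in pounds converted to kilograms
    weights_lb = pd.Series([150, 180])
    weights_kg = weights_lb * 0.4536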
Identifying Outliers
Definition of Outlier
An outlier is an observation or data point that lies significantly outside
the general pattern or distribution of a dataset. Outliers can be
unusually high or low values compared to the rest of the data. They
often indicate variability, errors in data collection, or special cases that
require further investigation.
Example of Outlier
Consider the following dataset of ages in a group of people:
Ages: 22, 25, 29, 30, 35, 40, 41, 44, 50, 200.
Here, 200 lies far outside the rest of the values and is an obvious
outlier, most likely a data entry error rather than a genuine age.
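One common way to flag such a point is the interquartile range (IQR) rule,
sketched here on the ages above:

    import numpy as np

    ages = np.array([22, 25, 29, 30, 35, 40, 41, 44, 50, 200])

    q1, q3 = np.percentile(ages, [25, 75])
    iqr = q3 - q1

    # The 1.5 * IQR rule: values outside the fences are flagged as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(ages[(ages < lower) | (ages > upper)])  # -> [200]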