Unit 1-Part3-Compressed

Sources of data in data science

4. Logs and Server Data:
• Analyzing server logs, application logs, and other system-generated data to gain insights into system performance, user behavior, and other relevant metrics.
Sources of data in data science

5. Sensor Data:
• Data generated by sensors, IoT devices, and other monitoring tools.
• This can include data from smart devices, industrial sensors, environmental sensors, etc.

6. Text and Documents:
• Analyzing unstructured text data from sources such as books, articles, emails, and social media.
• Natural Language Processing (NLP) techniques are commonly applied to extract insights from textual data.
Sources of data in data science

7. Open Data Repositories:
• Publicly available datasets from sources like Kaggle, the UCI Machine Learning Repository, and other data repositories that provide datasets for research and analysis.

8. Government and Institutional Data:
• Data published by government agencies, research institutions, and other organizations.
• This can include demographic data, economic indicators, health statistics, and more.
Sources of data in data science

9. Image and Video Data:
• Analyzing image and video data for computer vision applications. Image datasets like ImageNet and video datasets like YouTube-8M are examples.

10. Social Media Data:
• Analyzing data from social media platforms to understand trends, user behavior, and sentiments.
• APIs from platforms like Twitter, Facebook, and Instagram are commonly used.
Sources of data in data science
11. Machine-Generated Data:
• Data generated by machines and devices, such as logs, telemetry data, and performance metrics.

12. Surveys and Questionnaires:
• Data collected through surveys, questionnaires, and feedback forms. This can include both structured and unstructured responses.

13. Historical Data:
• Time-series data collected over time, which is often used for forecasting and trend analysis.
Steps Used in Data Science

• Data collection
• Data cleaning
• Exploratory data analysis
• Modeling
• Deployment
Steps Used in Data Science
Data collection
After formulating the problem statement, the main task is to collect data that can support our analysis and manipulation.
• Sometimes data is collected by conducting a survey, and at other times it is gathered by web scraping.
• Gather relevant data from various sources, which may include databases, APIs, files, or external datasets (a minimal sketch follows below).
• Ensure the data collected is sufficient and appropriate for addressing the defined problem.
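
A minimal sketch of gathering data from a file and a REST API, assuming pandas and requests are available; the file name, URL, and fields are hypothetical:

    import pandas as pd
    import requests

    # Load a local CSV file (path is hypothetical).
    sales = pd.read_csv("after_sales_requests.csv")

    # Pull additional records from a REST API (URL is hypothetical and assumed
    # to return a JSON list of records).
    response = requests.get("https://api.example.com/service-requests", timeout=30)
    response.raise_for_status()
    api_records = pd.DataFrame(response.json())

    # Combine both sources into a single DataFrame for the later cleaning steps.
    raw_data = pd.concat([sales, api_records], ignore_index=True)
    print(raw_data.shape)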
Data cleaning
Step 1: Remove Duplicates

When you are working with large datasets, combining multiple data sources, or have not implemented any quality checks before adding an entry, your data will likely contain duplicated values.

These duplicates add redundancy and can distort your calculations. Duplicate product serial numbers in a dataset will give you a higher product count than the actual number.

Duplicate email IDs or mobile numbers might make your communication look more like spam. We handle these duplicate records by keeping just one occurrence of each unique observation in our data (see the sketch below).
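
A minimal pandas sketch of removing duplicates; the column names and values are hypothetical:

    import pandas as pd

    # Hypothetical dataset with a duplicated serial number.
    df = pd.DataFrame({
        "serial_number": ["A101", "A102", "A102", "A103"],
        "product_type":  ["oven", "mixer", "mixer", "fridge"],
    })

    # Keep only the first occurrence of each fully identical row.
    df = df.drop_duplicates()

    # Or deduplicate on a specific key column, e.g. the product serial number.
    df = df.drop_duplicates(subset=["serial_number"], keep="first")
    print(df)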

Step 2: Remove Irrelevant Data


Consider you are analyzing the after-sales service of a product. You get data that contains various fields like service
request date, unique service request number, product serial number, product type, product purchase date, etc.
While these fields seem to be relevant, the data may also contain other fields like attended by (name of the person who
initiated the service request), location of the service center, customer contact details, etc., which might not serve our
purpose if we were to analyze the expected period for a product to undergo servicing. In such cases, we remove those
fields irrelevant to our scope of work. This is the column-level check we perform initially.
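
A minimal pandas sketch of this column-level check, using hypothetical column names based on the example above:

    import pandas as pd

    # Hypothetical after-sales data; only some columns matter for the
    # servicing-period analysis.
    df = pd.DataFrame({
        "service_request_date":  ["2023-01-05", "2023-02-10"],
        "product_serial_number": ["A101", "A102"],
        "attended_by":           ["Asha", "Ravi"],
        "customer_contact":      ["9876543210", "9123456780"],
    })

    # Drop the fields irrelevant to the scope of work.
    df = df.drop(columns=["attended_by", "customer_contact"], errors="ignore")
    print(df.columns.tolist())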

Step 3: Standardize capitalization

You must ensure that the text in your data is consistent. If your capitalization is inconsistent, it could result in the creation of many false categories.
For example, the column names “Total_Sales” and “total_sales” are treated as different columns (most programming languages are case-sensitive).
To avoid confusion and maintain uniformity among the column names, we should follow a standardized naming convention. The most commonly preferred conventions are snake case and cobra case.
Snake case writes every word in lowercase and joins words with the underscore (_) character; cobra case writes the first letter of each word in uppercase and likewise substitutes each space with an underscore.
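
A minimal pandas sketch of standardizing both column names and text values; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "Total_Sales":  [100, 120],
        "Product Type": ["Oven", "oven"],
    })

    # Normalize column names to snake_case so "Total_Sales" and "total_sales"
    # cannot coexist as separate columns.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Standardize capitalization inside text columns to avoid false categories
    # such as "Oven" vs "oven".
    df["product_type"] = df["product_type"].str.lower()
    print(df)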
Step 4: Convert data type
When working with CSV data in Python, pandas will attempt to guess the types for us; for the most part it succeeds, but occasionally we need to provide a little assistance.
The most common data types found in data are text, numeric, and date types. Text columns can hold any mix of values, including letters, digits, and special characters. A person’s name, type of product, store location, email ID, password, etc., are some examples of text data.
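
A minimal pandas sketch of explicit type conversion; the columns and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "purchase_date": ["2023-01-05", "2023-02-10"],
        "price":         ["1200", "950"],
    })

    # pandas guesses types on read; here both columns arrived as strings,
    # so we convert them explicitly.
    df["purchase_date"] = pd.to_datetime(df["purchase_date"])
    df["price"] = pd.to_numeric(df["price"])
    print(df.dtypes)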

Step 5: Handling Outliers

An outlier is a data point that deviates dramatically from other observations. An outlier may reflect measurement variability, or it may point to an experimental error; the latter is occasionally removed from the data set.
For example, consider pizza prices in a region. After surveying around 500 restaurants, the pizza prices vary between INR 100 and INR 7500. On analysis, we find that only one record in the dataset has a pizza price of INR 7500, while the remaining prices lie between INR 100 and INR 1500. The observation priced at INR 7500 is therefore an outlier, since it deviates significantly from the rest of the population. Outliers are usually identified using a box plot or scatter plot.
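
A minimal sketch of flagging such an observation with the interquartile-range (IQR) rule that underlies box-plot whiskers; the price list is illustrative:

    import pandas as pd

    # Hypothetical pizza prices; one record (7500) deviates sharply from the rest.
    prices = pd.Series([100, 250, 400, 550, 700, 900, 1200, 1500, 7500])

    # Flag observations outside 1.5 * IQR of the quartiles.
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = prices[(prices < lower) | (prices > upper)]
    print(outliers)  # the INR 7500 observation is flagged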

Step 6: Fix errors

Errors in your data can lead you to miss out on key findings. Avoid this by fixing the errors your data might contain, for example:
• Removing the country code from the mobile field so that all values are exactly 10 digits.
• Removing any unit mentioned in columns like weight, height, etc., to make them numeric fields.
• Identifying any incorrectly formatted data, such as a malformed email address, and then either fixing or removing it.
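
A minimal pandas sketch of these three fixes; the columns, values, and patterns are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "mobile": ["+919876543210", "9123456780"],
        "weight": ["12 kg", "8 kg"],
        "email":  ["user@example.com", "not-an-email"],
    })

    # Keep only the last 10 digits of the mobile number (drops any country code).
    df["mobile"] = df["mobile"].str.replace(r"\D", "", regex=True).str[-10:]

    # Strip the unit so weight becomes a numeric field.
    df["weight"] = df["weight"].str.replace(" kg", "", regex=False).astype(float)

    # Keep only rows whose email matches a simple address pattern.
    valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
    df = df[valid_email]
    print(df)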

Step 7: Language Translation


Datasets for machine translation are frequently combined from several sources, which can result in
linguistic discrepancies.

Step 8: Handle missing values

Handling missing values is one of the most common tasks during cleaning and munging in data science. Real-life data often contains missing values that must be fixed before the data can be used for analysis. We can handle missing values by:
• either removing the records that have missing values, or
• filling the missing values using a statistical technique or domain understanding of the data.
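
A minimal pandas sketch of both options; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "price": [1200.0, None, 950.0, 1100.0],
        "city":  ["Pune", "Delhi", None, "Mumbai"],
    })

    # Option 1: drop records that have any missing value.
    dropped = df.dropna()

    # Option 2: fill missing values with a statistical technique
    # (median for numbers, a constant label for text).
    filled = df.fillna({"price": df["price"].median(), "city": "unknown"})
    print(dropped, filled, sep="\n\n")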
Exploratory data analysis
• Explore and visualize the data to gain insights.
• Identify patterns, trends, and relationships
within the data.
• Use statistical and graphical methods to
understand the distribution of variables.
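
A minimal pandas sketch of these exploration steps, assuming matplotlib is installed for the histogram; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "price": [100, 250, 400, 550, 700, 900, 1200, 1500],
        "city":  ["Pune", "Pune", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Delhi"],
    })

    # Summary statistics describe the distribution of each numeric variable.
    print(df.describe())

    # Group-level aggregation reveals patterns and relationships within the data.
    print(df.groupby("city")["price"].mean())

    # A histogram (requires matplotlib) visualizes the price distribution.
    df["price"].hist(bins=5)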
Modeling
• Choose appropriate machine learning or
statistical models based on the problem.
• Split the data into training and validation sets.
• Train the model on the training data.
• Tune hyperparameters to improve model
performance.
• Validate the model using the validation set (a minimal sketch follows this list).
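
A minimal scikit-learn sketch of this split/train/validate workflow, assuming scikit-learn is available; the features and labels are synthetic:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic feature matrix and labels (e.g. spam = 1, inbox = 0).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Split the data into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the model on the training data; C is a hyperparameter we could tune.
    model = LogisticRegression(C=1.0)
    model.fit(X_train, y_train)

    # Validate the model using the validation set.
    print("validation accuracy:", model.score(X_val, y_val))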
Modeling
There are a few tasks we can perform in modelling.
We can train models to perform classification, for example differentiating the emails you receive into “Inbox” and “Spam” using logistic regression.
We can also forecast values using linear regression.
We can also use modelling to group data and understand the logic behind those clusters.
For example, we group our e-commerce customers to understand their behaviour on our website. This requires us to identify groups of data points with clustering algorithms like k-means or hierarchical clustering (sketched below).
In short, we use regression to forecast future values, classification to identify categories, and clustering to group values.
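
A minimal scikit-learn sketch of grouping customers with k-means; the customer features are hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [number of visits, average order value].
    customers = np.array([
        [2, 300], [3, 280], [25, 1500], [22, 1700],
        [10, 800], [12, 750], [1, 150], [24, 1600],
    ])

    # Group customers into three clusters to study behaviour per segment.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)
    print(labels)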
Deployment
• Implement the model into a production
environment.
• Integrate the model with other systems if
necessary.
• Monitor the model's performance in real-world
scenarios.
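
A minimal sketch of persisting a trained model with joblib so a serving application can reload it, assuming joblib and scikit-learn are available; the file name and data are hypothetical:

    import joblib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Train a small model (stand-in for the model built in the modeling step).
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    y = np.array([1, 0, 1, 0])
    model = LogisticRegression().fit(X, y)

    # Persist the trained model so the production environment can load it.
    joblib.dump(model, "spam_classifier.joblib")

    # Inside the serving application, reload the model and answer prediction
    # requests; its outputs would then be monitored in real-world scenarios.
    loaded_model = joblib.load("spam_classifier.joblib")
    print(loaded_model.predict([[0.5, 0.9]]))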
