Unit 1-Part3-Compressed

Sources of data in data science

4. Logs and Server Data:
• Analyzing server logs, application logs, and other system-generated data to gain insights into system performance, user behavior, and other relevant metrics.
Sources of data in data science

5. Sensor Data:
• Data generated by sensors, IoT devices, and other monitoring tools.
• This can include data from smart devices, industrial sensors, environmental sensors, etc.

6. Text and Documents:
• Analyzing unstructured text data from sources such as books, articles, emails, and social media.
• Natural Language Processing (NLP) techniques are commonly applied to extract insights from textual data.
Sources of data in data science

7. Open Data Repositories:
• Publicly available datasets from sources like Kaggle, the UCI Machine Learning Repository, and other data repositories that provide datasets for research and analysis.

8. Government and Institutional Data:
• Data published by government agencies, research institutions, and other organizations.
• This can include demographic data, economic indicators, health statistics, and more.
Sources of data in data science

9. Image and Video Data:
• Analyzing image and video data for computer vision applications. Image datasets like ImageNet and video datasets like YouTube-8M are examples.

10. Social Media Data:
• Analyzing data from social media platforms to understand trends, user behavior, and sentiments.
• APIs from platforms like Twitter, Facebook, and Instagram are commonly used.
Sources of data in data science
11. Machine-Generated Data:
• Data generated by machines and devices, such as logs, telemetry data, and performance metrics.

12. Surveys and Questionnaires:
• Data collected through surveys, questionnaires, and feedback forms. This can include both structured and unstructured responses.

13. Historical Data:
• Time-series data collected over time, which is often used for forecasting and trend analysis.
Steps Used in Data Science

• Data collection
• Data cleaning
• Exploratory data analysis
• Modeling
• Deployment
Steps Used in Data Science
Data collection
After formulating the problem statement, the main task is to collect data that can support our analysis and manipulation.
• Sometimes data is collected by conducting a survey, and at other times it is gathered by web scraping.
• Gather relevant data from various sources, which may include databases, APIs, files, or external datasets (a minimal sketch follows below).
• Ensure the data collected is sufficient and appropriate for addressing the defined problem.
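
A minimal sketch of gathering data from a file and a REST API, assuming pandas and requests are available; the file name, URL, and fields are hypothetical:

    import pandas as pd
    import requests

    # Load a local CSV file (path is hypothetical).
    sales = pd.read_csv("after_sales_requests.csv")

    # Pull additional records from a REST API (URL is hypothetical and assumed
    # to return a JSON list of records).
    response = requests.get("https://api.example.com/service-requests", timeout=30)
    response.raise_for_status()
    api_records = pd.DataFrame(response.json())

    # Combine both sources into a single DataFrame for the later cleaning steps.
    raw_data = pd.concat([sales, api_records], ignore_index=True)
    print(raw_data.shape)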
Data cleaning
Step 1: Remove Duplicates

When you are working with large datasets, combining multiple data sources, or have not implemented any quality checks before adding an entry, your data will likely contain duplicated values.

These duplicates add redundancy and can distort your calculations. Duplicate product serial numbers in a dataset will give you a higher product count than the actual number.

Duplicate email IDs or mobile numbers might make your communication look more like spam. We handle these duplicate records by keeping just one occurrence of each unique observation in our data (see the sketch below).
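
A minimal pandas sketch of removing duplicates; the column names and values are hypothetical:

    import pandas as pd

    # Hypothetical dataset with a duplicated serial number.
    df = pd.DataFrame({
        "serial_number": ["A101", "A102", "A102", "A103"],
        "product_type":  ["oven", "mixer", "mixer", "fridge"],
    })

    # Keep only the first occurrence of each fully identical row.
    df = df.drop_duplicates()

    # Or deduplicate on a specific key column, e.g. the product serial number.
    df = df.drop_duplicates(subset=["serial_number"], keep="first")
    print(df)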

Step 2: Remove Irrelevant Data


Consider you are analyzing the after-sales service of a product. You get data that contains various fields like service
request date, unique service request number, product serial number, product type, product purchase date, etc.
While these fields seem to be relevant, the data may also contain other fields like attended by (name of the person who
initiated the service request), location of the service center, customer contact details, etc., which might not serve our
purpose if we were to analyze the expected period for a product to undergo servicing. In such cases, we remove those
fields irrelevant to our scope of work. This is the column-level check we perform initially.
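
A minimal pandas sketch of this column-level check, using hypothetical column names based on the example above:

    import pandas as pd

    # Hypothetical after-sales data; only some columns matter for the
    # servicing-period analysis.
    df = pd.DataFrame({
        "service_request_date":  ["2023-01-05", "2023-02-10"],
        "product_serial_number": ["A101", "A102"],
        "attended_by":           ["Asha", "Ravi"],
        "customer_contact":      ["9876543210", "9123456780"],
    })

    # Drop the fields irrelevant to the scope of work.
    df = df.drop(columns=["attended_by", "customer_contact"], errors="ignore")
    print(df.columns.tolist())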

Step 3: Standardize capitalization

You must ensure that the text in your data is consistent. If your capitalization is inconsistent, it could result in the creation of many false categories.
For example, the column names “Total_Sales” and “total_sales” are treated as different columns (most programming languages are case-sensitive).
To avoid confusion and maintain uniformity among the column names, we should follow a standardized naming convention. The most commonly preferred conventions are snake case and cobra case.
Snake case writes every word in lowercase and joins words with the underscore (_) character; cobra case writes the first letter of each word in uppercase and likewise substitutes each space with an underscore.
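
A minimal pandas sketch of standardizing both column names and text values; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "Total_Sales":  [100, 120],
        "Product Type": ["Oven", "oven"],
    })

    # Normalize column names to snake_case so "Total_Sales" and "total_sales"
    # cannot coexist as separate columns.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Standardize capitalization inside text columns to avoid false categories
    # such as "Oven" vs "oven".
    df["product_type"] = df["product_type"].str.lower()
    print(df)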
Step 4: Convert data type
When working with CSV data in Python, pandas will attempt to guess the types for us; for the most part it succeeds, but occasionally we need to provide a little assistance.
The most common data types found in data are text, numeric, and date types. Text columns can hold any mix of values, including letters, digits, and special characters. A person’s name, type of product, store location, email ID, password, etc., are some examples of text data.
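
A minimal pandas sketch of explicit type conversion; the columns and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "purchase_date": ["2023-01-05", "2023-02-10"],
        "price":         ["1200", "950"],
    })

    # pandas guesses types on read; here both columns arrived as strings,
    # so we convert them explicitly.
    df["purchase_date"] = pd.to_datetime(df["purchase_date"])
    df["price"] = pd.to_numeric(df["price"])
    print(df.dtypes)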

Step 5: Handling Outliers

An outlier is a data point that deviates dramatically from other observations. An outlier may reflect measurement variability, or it may point to an experimental error; the latter is occasionally removed from the data set.
For example, consider pizza prices in a region. After surveying around 500 restaurants, the pizza prices vary between INR 100 and INR 7500. On analysis, we find that only one record in the dataset has a pizza price of INR 7500, while the remaining prices lie between INR 100 and INR 1500. The observation priced at INR 7500 is therefore an outlier, since it deviates significantly from the rest of the population. Outliers are usually identified using a box plot or scatter plot.
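
A minimal sketch of flagging such an observation with the interquartile-range (IQR) rule that underlies box-plot whiskers; the price list is illustrative:

    import pandas as pd

    # Hypothetical pizza prices; one record (7500) deviates sharply from the rest.
    prices = pd.Series([100, 250, 400, 550, 700, 900, 1200, 1500, 7500])

    # Flag observations outside 1.5 * IQR of the quartiles.
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = prices[(prices < lower) | (prices > upper)]
    print(outliers)  # the INR 7500 observation is flagged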

Step 6: Fix errors

Errors in your data can lead you to miss out on key findings. Avoid this by fixing the errors your data might contain, for example:
• Removing the country code from the mobile field so that all values are exactly 10 digits.
• Removing any unit mentioned in columns like weight, height, etc., to make them numeric fields.
• Identifying any incorrectly formatted data, such as a malformed email address, and then either fixing or removing it.
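
A minimal pandas sketch of these three fixes; the columns, values, and patterns are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "mobile": ["+919876543210", "9123456780"],
        "weight": ["12 kg", "8 kg"],
        "email":  ["user@example.com", "not-an-email"],
    })

    # Keep only the last 10 digits of the mobile number (drops any country code).
    df["mobile"] = df["mobile"].str.replace(r"\D", "", regex=True).str[-10:]

    # Strip the unit so weight becomes a numeric field.
    df["weight"] = df["weight"].str.replace(" kg", "", regex=False).astype(float)

    # Keep only rows whose email matches a simple address pattern.
    valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
    df = df[valid_email]
    print(df)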

Step 7: Language Translation


Datasets for machine translation are frequently combined from several sources, which can result in
linguistic discrepancies.

Step 8: Handle missing values

Handling missing values is one of the most common tasks during cleaning and munging in data science. Real-life data often contains missing values that must be fixed before the data can be used for analysis. We can handle missing values by:
• either removing the records that have missing values, or
• filling the missing values using a statistical technique or domain understanding of the data.
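
A minimal pandas sketch of both options; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "price": [1200.0, None, 950.0, 1100.0],
        "city":  ["Pune", "Delhi", None, "Mumbai"],
    })

    # Option 1: drop records that have any missing value.
    dropped = df.dropna()

    # Option 2: fill missing values with a statistical technique
    # (median for numbers, a constant label for text).
    filled = df.fillna({"price": df["price"].median(), "city": "unknown"})
    print(dropped, filled, sep="\n\n")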
Exploratory data analysis
• Explore and visualize the data to gain insights.
• Identify patterns, trends, and relationships
within the data.
• Use statistical and graphical methods to
understand the distribution of variables.
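
A minimal pandas sketch of these exploration steps, assuming matplotlib is installed for the histogram; the data is hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "price": [100, 250, 400, 550, 700, 900, 1200, 1500],
        "city":  ["Pune", "Pune", "Delhi", "Delhi", "Mumbai", "Mumbai", "Pune", "Delhi"],
    })

    # Summary statistics describe the distribution of each numeric variable.
    print(df.describe())

    # Group-level aggregation reveals patterns and relationships within the data.
    print(df.groupby("city")["price"].mean())

    # A histogram (requires matplotlib) visualizes the price distribution.
    df["price"].hist(bins=5)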
Modeling
• Choose appropriate machine learning or
statistical models based on the problem.
• Split the data into training and validation sets.
• Train the model on the training data.
• Tune hyperparameters to improve model
performance.
• Validate the model using the validation set (a minimal sketch follows this list).
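
A minimal scikit-learn sketch of this split/train/validate workflow, assuming scikit-learn is available; the features and labels are synthetic:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic feature matrix and labels (e.g. spam = 1, inbox = 0).
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Split the data into training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train the model on the training data; C is a hyperparameter we could tune.
    model = LogisticRegression(C=1.0)
    model.fit(X_train, y_train)

    # Validate the model using the validation set.
    print("validation accuracy:", model.score(X_val, y_val))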
Modeling
There are a few tasks we can perform in modelling.
We can train models to perform classification, for example differentiating the emails you receive into “Inbox” and “Spam” using logistic regression.
We can also forecast values using linear regression.
We can also use modelling to group data and understand the logic behind those clusters.
For example, we group our e-commerce customers to understand their behaviour on our website. This requires us to identify groups of data points with clustering algorithms like k-means or hierarchical clustering (sketched below).
In short, we use regression to forecast future values, classification to identify categories, and clustering to group values.
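
A minimal scikit-learn sketch of grouping customers with k-means; the customer features are hypothetical:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [number of visits, average order value].
    customers = np.array([
        [2, 300], [3, 280], [25, 1500], [22, 1700],
        [10, 800], [12, 750], [1, 150], [24, 1600],
    ])

    # Group customers into three clusters to study behaviour per segment.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(customers)
    print(labels)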
Deployment
• Implement the model into a production
environment.
• Integrate the model with other systems if
necessary.
• Monitor the model's performance in real-world
scenarios.
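
A minimal sketch of persisting a trained model with joblib so a serving application can reload it, assuming joblib and scikit-learn are available; the file name and data are hypothetical:

    import joblib
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Train a small model (stand-in for the model built in the modeling step).
    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    y = np.array([1, 0, 1, 0])
    model = LogisticRegression().fit(X, y)

    # Persist the trained model so the production environment can load it.
    joblib.dump(model, "spam_classifier.joblib")

    # Inside the serving application, reload the model and answer prediction
    # requests; its outputs would then be monitored in real-world scenarios.
    loaded_model = joblib.load("spam_classifier.joblib")
    print(loaded_model.predict([[0.5, 0.9]]))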
