Module 2 - Data Preprocessing and Visualization
https://www.geeksforgeeks.org
In this post, we will discuss the different sources of data used in the data mining process. Data from these multiple sources is integrated into a common store known as a Data Warehouse.
Flat Files
Relational Databases
Data Warehouse
Transactional Databases
Multimedia Databases
Spatial Databases
Time Series Databases
World Wide Web(WWW)
1. Flat Files
Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there are no relations between the tables.
Flat files are described by a data dictionary. Eg: a CSV file (see the sketch below).
Application: Used in data warehousing to store data, in carrying data to and from servers, etc.
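A minimal sketch of reading a flat file with pandas; the file name data.csv is a placeholder and pandas is assumed to be installed.

import pandas as pd

# Load a flat file (CSV) into a DataFrame for mining or analysis
df = pd.read_csv("data.csv")
print(df.head())      # inspect the first few records
print(df.dtypes)      # column types inferred from the flat file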
2. Relational Databases
A relational database is a collection of data organized in tables with rows and columns.
The physical schema of a relational database defines the structure of its tables.
The logical schema of a relational database defines the relationships among its tables.
The standard API for relational databases is SQL (see the sketch below).
Application: Data Mining, ROLAP model, etc.
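As a rough illustration of the SQL interface, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for the example.

import sqlite3

conn = sqlite3.connect("example.db")   # open (or create) a relational database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO customers (name) VALUES (?)", ("Alice",))
conn.commit()
for row in cur.execute("SELECT id, name FROM customers"):   # standard SQL query
    print(row)
conn.close()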
3. Data Warehouse
A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
There are three types of data warehouse: Enterprise Data Warehouse, Data Mart and Virtual Warehouse.
Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
A transactional database is a collection of data organized by time stamps, dates, etc. to represent transactions in databases.
This type of database can roll back or undo an operation when a transaction is not completed or committed (see the sketch below).
It is a highly flexible system where users can modify information without changing any sensitive information.
It follows the ACID properties of a DBMS.
Application: Banking, Distributed systems, Object databases, etc.
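To illustrate the roll-back behaviour, here is a minimal sketch using sqlite3; the accounts table and the transfer amount are assumptions made for the example.

import sqlite3

conn = sqlite3.connect("bank.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, balance REAL)")
cur.executemany("INSERT INTO accounts (balance) VALUES (?)", [(100.0,), (50.0,)])
conn.commit()

try:
    # transfer 30 from account 1 to account 2 as one transaction
    cur.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    cur.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()        # both updates become permanent together
except sqlite3.Error:
    conn.rollback()      # undo every step if the transaction fails
finally:
    conn.close()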
5. Multimedia Databases
Multimedia databases consist of audio, video, image and text media.
They can be stored on object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.
6. Spatial Database
They store geographical information.
Data is stored in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time-series databases contain data such as stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
The WWW (World Wide Web) is a collection of documents and resources such as audio, video and text, identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
Did you think data is only for big companies and corporations to analyze and
obtain business insights? No, data is also fun! There is nothing more interesting
than analyzing a data set to find the correlations between the data and obtain
unique insights. It’s almost like a mystery game where the data is a puzzle you
have to solve! And it is even more exciting when you have to find the best data
set for a Data Science project you want to make. After all, if the data is not good, there is no chance of your project being any good either.
Luckily, there are many online data sources where you can get free data sets to
use in your project. In this article, we have mentioned some of these data
sources that you can download and use for free. So whether you want to make
a Data Visualization, Data Cleaning, Machine Learning or any other type of
project, there is a data set for you to use!
(Note that each source has its own website and its own regulations for acquiring data.)
1. Google Cloud Public Datasets
Google is not just a search engine, it's much more! There are many public data sets that you can access on Google Cloud and analyze to obtain new insights. There are more than 100 datasets and all of them are
hosted by BigQuery and Cloud Storage. You can also use Google’s Machine
Learning capabilities to analyze the data sets such as BigQuery ML, Vision AI,
Cloud AutoML, etc. You can also use Google Data Studio to create data
visualizations and interactive dashboards so that you can obtain better insights
and find patterns in the data. Google Cloud Public Datasets has data from
various data providers such as GitHub, United States Census Bureau, NASA,
Bitcoin, US Department of Transportation, etc. You can access these data sets
for free and get free query access of about 1 TB of data per month in BigQuery.
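As a rough sketch of querying one of these public datasets from Python, assuming the google-cloud-bigquery client library is installed and GCP credentials are configured (the usa_names table is one of the public datasets):

from google.cloud import bigquery

client = bigquery.Client()   # uses your default project and credentials
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():   # counts toward the free monthly query quota
    print(row.name, row.total)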
2. AWS Open Data Registry
Amazon Web Services has a large number of data sets on its open data
registry. You can download these data sets and use them on your own system or
you can analyze the data on the Amazon Elastic Compute Cloud (Amazon
EC2). Amazon also has various tools that you can use such as Apache Spark,
Apache Hive, etc. This AWS open data registry is part of the AWS Public Dataset Program, which aims to democratize access to data so that it is freely available to everybody, and to foster new data analysis techniques and tools that minimize the cost of working with data. You can access the data sets
for free but you need a free AWS account before doing anything else.
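A minimal sketch of browsing an open data bucket anonymously with boto3; the bucket name noaa-ghcn-pds is used only as an illustrative example, so check the registry for the exact names.

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned (anonymous) access works for public Open Data buckets
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="noaa-ghcn-pds", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])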
3. Data.gov
The United States of America is a pioneer and world leader in technology. Most
of the top tech companies today originated in Silicon Valley, and it
stands to reason that the US government is also very involved in Data Science.
Data.gov is the main repository of the US government’s open data sets which
you can use for research, developing data visualizations, creating web and
mobile applications, etc. This is an attempt by the government to be more
transparent and so you can access the data sets directly without registering on
the site. However, some data sets might require you to agree to licensing
agreements and other technicalities before you can download them. There are
a wide variety of datasets on Data.gov relating to different fields such as climate,
energy, agriculture, ecosystems, oceans, etc, so be sure to check them all out!
4. Kaggle
There are around 23,000 public datasets on Kaggle that you can download for
free. In fact, many of these datasets have been downloaded millions of times
already. You can use the search box to search for public datasets on whatever
topic you want ranging from health to science to popular cartoons! You can
also create new public datasets on Kaggle and those may earn you medals
and also lead you towards advanced Kaggle titles like Expert, Master, and
Grandmaster. You can also download competition data sets from Kaggle while
participating in these competitions. The competition data sets are much more detailed, curated, and well cleaned than the public data sets available on Kaggle, which you might have to sort through. But all in all, if you are
interested in Data Science, then Kaggle is the place for you!
5. UCI Machine Learning Repository
The UCI Machine Learning Repository is a great place to look for interesting data
sets as it is one of the first and oldest data sources available on the internet (It
was created in 1987!). These data sets are great for machine learning and you
can easily download the data sets from the repository without any registration.
All of the data sets on the UCI Machine Learning Repository are contributed by
different users and so they happen to be a little small with different levels of
data cleanliness. But most of the data sets are well maintained and you can
easily use them for machine learning algorithms.
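For example, here is a minimal sketch that loads the classic Iris data set straight from the repository with pandas (the URL points at the well-known iris.data file, which has no header row):

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv(url, header=None, names=columns)
print(iris.head())
print(iris["class"].value_counts())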
6. National Centers for Environmental Information
If you want to access data about the weather and environmental conditions, then the National Centers for Environmental Information (NCEI) is the best bet! It was earlier known as the National Climatic Data Center, but it has since been merged with other National Oceanic and Atmospheric Administration (NOAA) data centers to create the NCEI. The
NCEI has many datasets related to the climatic and weather conditions across
the United States. In fact, it is the largest repository of environmental data in the
world. It includes oceanic data, meteorological data, climatic conditions,
geophysical data, atmospheric information, etc. If you want to know about the
Earth, this data archive is the best place to go. Check out some of the datasets on the NCEI website.
7. Global Health Observatory
If you are in the medical field and interested in health data or you are just
creating a project on global health systems and diseases, then the Global
Health Observatory is the best place to get loads of health data. The World
Health Organization has made all their data public on the Global Health
Observatory so that good quality health information is freely available
worldwide in case it is needed to detect and recover from a health emergency
anywhere in the world. The health data is divided according to various
characteristics such as communicable and non-communicable diseases,
mental health, mortality rates, medicines and vaccines, tobacco control,
women and health, health risks, immunization, etc. Currently, they have a huge
focus on COVID-19 data so that this pandemic can be stopped as soon as
possible.
8. Earthdata
If you want data related to the Earth and Space, Earthdata is the perfect place
for that. It was created by NASA, after all! Earthdata is a part of the Earth Science
Data Systems Program created by NASA that provides data sets based on the
Earth’s atmosphere, oceans, solar flares, cryosphere, geomagnetism, tectonics,
etc. Earthdata is specifically a part of the Earth Observing System Data and
Information System (EOSDIS) that collects and processes the data from different
NASA aircraft, satellites, and field data obtained from the ground. While
Earthdata provides many of these data sets, they also have data tools for
searching, handling, ordering, mapping, and visualizing the data.
https://www.geeksforgeeks.org
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
A. Missing Data:
This situation arises when some data is missing. It can be handled in various ways.
A1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
A2. Fill the missing values:
Missing values can be filled manually, with the attribute mean, or with the most probable value.
B. Noisy Data:
B1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.
B2. Regression:
Here the data can be made smooth by fitting it to a regression function, which may be linear or multiple.
B3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
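A minimal sketch of these cleaning steps with pandas, using a made-up price column: dropping or filling missing values, then smoothing by bin means.

import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [4.0, 8.0, np.nan, 15.0, 21.0, 21.0, 24.0, 25.0, np.nan]})

# A. Missing data: ignore the tuples, or fill the missing values
dropped = df.dropna()                                   # A1: ignore incomplete tuples
df["price"] = df["price"].fillna(df["price"].mean())    # A2: fill with the attribute mean

# B1. Binning: split the sorted values into equal-sized segments and
# replace each value by the mean of its segment (smoothing by bin means)
df = df.sort_values("price").reset_index(drop=True)
df["bin"] = pd.qcut(df["price"], q=3, labels=False, duplicates="drop")
df["price_smoothed"] = df.groupby("bin")["price"].transform("mean")
print(df)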
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
a. Normalization:
It is done in order to scale the data values into a specified range, such as 0.0 to 1.0.
b. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
c. Discretization:
Raw values of a numeric attribute are replaced by interval or conceptual labels (a minimal sketch of normalization and discretization follows).
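A minimal sketch of normalization and discretization with pandas; the age column and the interval boundaries are illustrative.

import pandas as pd

df = pd.DataFrame({"age": [18, 25, 32, 47, 51, 64]})

# a. Normalization: min-max scaling into the range 0.0 - 1.0
df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# c. Discretization: replace raw values by interval labels
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])
print(df)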
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To address this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
c. Numerosity Reduction:
This enables storing a model of the data instead of the whole data; for example, regression models.
d. Dimensionality Reduction:
This reduces the number of attributes under consideration, for example through encoding mechanisms or techniques such as Principal Component Analysis (PCA); a minimal sketch follows.
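A minimal sketch of dimensionality reduction with scikit-learn's PCA on random illustrative data, reducing four attributes to two principal components:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)          # 100 records described by 4 attributes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)    # the same records described by 2 components
print(X_reduced.shape)              # (100, 2)
print(pca.explained_variance_ratio_)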
Data inconsistencies can occur due to human errors (the information was stored
in a wrong field). Duplicate values should be removed through deduplication to avoid giving that data object an advantage (bias).
Data Integration: Data with different representations are put together and
conflicts within the data are resolved.
2. Data Reduction: When the volume of data is huge, databases can become
slower, costly to access, and challenging to properly store. The data reduction step aims to present a reduced representation of the data in a data warehouse.
There are various methods to reduce data. For example, once a subset of
relevant attributes is chosen for its significance, anything below a given level is
discarded. Encoding mechanisms can be used to reduce the size of data as
well. If all original data can be recovered after compression, the operation is
labeled as lossless.
If some data is lost, then it’s called a lossy reduction. Aggregation can also be
used, for example, to condense countless transactions into a single weekly or
monthly value, significantly reducing the number of data objects.
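A minimal sketch of reduction by aggregation with pandas, condensing daily transactions (made up here) into monthly totals:

import pandas as pd

tx = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=90, freq="D"),
    "amount": range(90),
})
monthly = tx.resample("M", on="date")["amount"].sum()   # 90 daily rows reduced to 3 monthly values
print(monthly)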
The techniques mentioned here are too vast to discuss fully in this post. You can learn more about them on the internet. I have added YouTube links to both, in case you want to watch those videos and learn.
What is Feature Subset Selection?
Embedded approaches
Feature selection occurs naturally as part of the data mining algorithm itself; for example, a decision tree classifier chooses which attributes to split on as it builds the model.
Filter approaches
Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task. For example, we might
select sets of attributes whose pairwise correlation is as low as possible.
Wrapper approaches
These methods use the target data mining algorithm as a black box to
find the best subset of attributes, in a way similar to that of the ideal algorithm
described above, but typically without enumerating all possible subsets.
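A minimal sketch of a filter approach using scikit-learn's SelectKBest, which scores attributes with a univariate test independently of the final mining algorithm (the Iris data set is used only as an illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 highest-scoring attributes
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)           # (150, 2)
print(selector.get_support())     # boolean mask of the selected attributes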
Binarization
Binarization maps a continuous or categorical attribute into one or more binary (0/1) attributes.
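A minimal sketch of binarizing a categorical attribute with pandas; the color column is illustrative.

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
binary = pd.get_dummies(df["color"], prefix="color")   # one 0/1 column per category
print(binary)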
Types of Sampling
1. Simple Random Sampling:
There is an equal probability of selecting any particular item; sampling can be done without replacement (an item is removed from the population once selected) or with replacement.
2. Stratified Sampling:
Split the data into several partitions, then draw random samples from each partition (a minimal sketch of both schemes follows).
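A minimal sketch of both sampling schemes with pandas; the label column and sample sizes are illustrative.

import pandas as pd

df = pd.DataFrame({"value": range(100), "label": ["A"] * 70 + ["B"] * 30})

# Simple random sampling (without replacement)
simple = df.sample(n=10, random_state=0)

# Stratified sampling: draw the same fraction from each partition (label)
stratified = df.groupby("label", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

print(simple["label"].value_counts())
print(stratified["label"].value_counts())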