Module 2 - Data Preprocessing and Visualization


Types of Sources of Data in Data Mining (Last Updated: 20-08-2019)

https://www.geeksforgeeks.org

In this post, we will discuss the different sources of data that are used in the
data mining process. Data from multiple sources is integrated into a common
store known as a Data Warehouse.

Let’s discuss what types of data can be mined:

 Flat Files
 Relational Databases
 Data Warehouse
 Transactional Databases
 Multimedia Databases
 Spatial Databases
 Time Series Databases
 World Wide Web (WWW)

1. Flat Files
 Flat files are defined as data files in text or binary form with a structure
that can be easily extracted by data mining algorithms.
 Data stored in flat files has no relationships or paths among itself; for
example, if a relational database is stored as flat files, there will be no
relations between the tables.
 Flat files are described by a data dictionary. Eg: a CSV file (a small sketch
follows below).
 Application: used in data warehousing to store data, used to carry data to
and from servers, etc.
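A tiny sketch, assuming Python, of reading a flat file in CSV form with the built-in
csv module; the column names and values are invented for illustration.

import csv, io

# A tiny flat file in CSV form; in practice this would be a file such as "sales.csv".
flat_file = io.StringIO("item,qty\npen,3\nbook,2\n")

reader = csv.DictReader(flat_file)   # the header row acts as a simple data dictionary
rows = list(reader)
print(rows)   # [{'item': 'pen', 'qty': '3'}, {'item': 'book', 'qty': '2'}]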

2. Relational Databases
 A Relational database is defined as the collection of data organized in
tables with rows and columns.
 The physical schema of a relational database defines the structure of its
tables.
 The logical schema of a relational database defines the relationships
among its tables.
 The standard API (query language) of relational databases is SQL (a
minimal example follows below).
 Application: Data Mining, ROLAP model, etc.
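A minimal sketch, assuming Python's built-in sqlite3 module, of the relational
model and its SQL interface; the table and column names are made up for the example.

import sqlite3

# An in-memory relational database: data is organized in tables of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                 [("Asha", "Pune"), ("Ravi", "Delhi")])

# SQL is the standard way to query a relational database.
for row in conn.execute("SELECT name, city FROM customers WHERE city = 'Pune'"):
    print(row)   # ('Asha', 'Pune')
conn.close()
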
3. Data Warehouse
 A data warehouse is defined as a collection of data integrated from
multiple sources that supports queries and decision making.
 There are three types of data warehouse: Enterprise Data Warehouse, Data
Mart and Virtual Warehouse.
 Two approaches can be used to update data in a data warehouse: the
Query-driven approach and the Update-driven approach.
 Application: Business decision making, Data mining, etc.

4. Transactional Databases
 A transactional database is a collection of data organized by time stamps,
dates, etc. to represent transactions in databases.
 This type of database has the capability to roll back or undo its operations
when a transaction is not completed or committed.
 It is a highly flexible system where users can modify information without
changing any sensitive information.
 Follows the ACID properties of a DBMS.
 Application: Banking, Distributed systems, Object databases, etc.

5. Multimedia Databases
 Multimedia databases consist of audio, video, images and text media.
 They can be stored in object-oriented databases.
 They are used to store complex information in pre-specified formats.
 Application: Digital libraries, video-on demand, news-on demand, musical
database, etc.

6. Spatial Database
 Stores geographical information.
 Stores data in the form of coordinates, topology, lines, polygons, etc.
 Application: Maps, Global positioning, etc.

7. Time-series Databases
 Time series databases contain data such as stock exchange data and
user-logged activities.
 They handle arrays of numbers indexed by time, date, etc.
 They require real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.

8. WWW
 WWW refers to the World Wide Web, a collection of documents and
resources such as audio, video, text, etc. that are identified by Uniform
Resource Locators (URLs), linked by HTML pages, and accessed through
web browsers via the Internet.
 It is the most heterogeneous repository as it collects data from multiple
sources.
 It is dynamic in nature, as the volume of data is continuously increasing and
changing.
 Application: Online shopping, Job search, Research, studying, etc.

Top 8 Free Dataset Sources to Use for Data Science Projects

Did you think data is only for big companies and corporations to analyze and
obtain business insights? No, data is also fun! There is nothing more interesting
than analyzing a data set to find correlations in the data and obtain unique
insights. It’s almost like a mystery game where the data is a puzzle you have to
solve! And it is even more exciting when you have to find the best data set for a
Data Science project you want to build. After all, if the data is not good, there is
no chance of your project being any good either.

Luckily, there are many online data sources where you can get free data sets to
use in your project. In this article, we have mentioned some of these data
sources that you can download and use for free. So whether you want to make
a Data Visualization, Data Cleaning, Machine Learning or any other type of
project, there is a data set for you to use!

(Note that each source has its own website and its own regulations regarding
the acquisition of data.)

1. Google Cloud Public Datasets

Google is not just a search engine, it’s much more! There are many public data
sets that you can access on the Google cloud and analyze to obtain new
insights from this data. There are more than 100 datasets and all of them are
hosted by BigQuery and Cloud Storage. You can also use Google’s Machine
Learning capabilities to analyze the data sets such as BigQuery ML, Vision AI,
Cloud AutoML, etc. You can also use Google Data Studio to create data
visualizations and interactive dashboards so that you can obtain better insights
and find patterns in the data. Google Cloud Public Datasets has data from
various data providers such as GitHub, United States Census Bureau, NASA,
Bitcoin, US Department of Transportation, etc. You can access these data sets
for free and get about 1 TB of free query processing per month in BigQuery.

2. Amazon Web Services Open Data Registry

Amazon Web Services has a large number of data sets in its open data
registry. You can download these data sets and use them on your own system, or
you can analyze the data on the Amazon Elastic Compute Cloud (Amazon
EC2). Amazon also provides various tools that you can use, such as Apache Spark,
Apache Hive, etc. This AWS open data registry is part of the AWS Public
Dataset Program, which aims to democratize access to data so that it is freely
available to everybody, and to encourage new data analysis techniques and
tools that minimize the cost of working with data. You can access the data sets
for free, but you need a free AWS account before doing anything else.

3. Data.gov

The United States of America is a pioneer and world leader in technology. Most
of the top tech companies today originated in Silicon Valley, and it
stands to reason that the US government is also very involved in Data Science.
Data.gov is the main repository of the US government’s open data sets which
you can use for research, developing data visualizations, creating web and
mobile applications, etc. This is an attempt by the government to be more
transparent and so you can access the data sets directly without registering on
the site. However, some data sets might require you to agree to licensing
agreements and other technicalities before you can download them. There is
a wide variety of datasets on Data.gov relating to different fields such as climate,
energy, agriculture, ecosystems, oceans, etc., so be sure to check them all out!

4. Kaggle

There are around 23,000 public datasets on Kaggle that you can download for
free. In fact, many of these datasets have been downloaded millions of times
already. You can use the search box to search for public datasets on whatever
topic you want ranging from health to science to popular cartoons! You can
also create new public datasets on Kaggle and those may earn you medals
and also lead you towards advanced Kaggle titles like Expert, Master, and
Grandmaster. You can also download competition data sets from Kaggle while
participating in these competitions. The competition data sets are much
more detailed, curated, and well cleaned than the public data sets available
on Kaggle, which you might have to sort through. But all in all, if you are
interested in Data Science, then Kaggle is the place for you!

5. UCI Machine Learning Repository

The UCI Machine Learning Repository is a great place to look for interesting data
sets as it is one of the first and oldest data sources available on the internet (It
was created in 1987!). These data sets are great for machine learning and you
can easily download the data sets from the repository without any registration.
All of the data sets on the UCI Machine Learning Repository are contributed by
different users, so they tend to be fairly small, with varying levels of
data cleanliness. But most of the data sets are well maintained and you can
easily use them for machine learning algorithms.

6. National Center for Environmental Information

If you want to access data about the weather and environmental conditions,
then the National Centers for Environmental Information are your best bet! It was
earlier known as the National Climatic Data Center, but it has since been merged
with the other National Oceanic and Atmospheric Administration (NOAA) data
centers to create the National Centers for Environmental Information (NCEI). The
NCEI has many datasets related to the climatic and weather conditions across
the United States. In fact, it is the largest repository of environmental data in the
world. It includes oceanic data, meteorological data, climatic conditions,
geophysical data, atmospheric information, etc. If you want to know about the
Earth, this data archive is the best place to go. Check out some of the datasets
here.

7. Global Health Observatory

If you are in the medical field and interested in health data or you are just
creating a project on global health systems and diseases, then the Global
Health Observatory is the best place to get loads of health data. The World
Health Organization has made all their data public on the Global Health
Observatory so that good quality health information is freely available
worldwide in case it is needed to detect and recover from a health emergency
anywhere in the world. The health data is divided according to various
characteristics such as communicable and non-communicable diseases,
mental health, mortality rates, medicines and vaccines, tobacco control,
women and health, health risks, immunization, etc. Currently, they have a huge
focus on COVID-19 data so that this pandemic can be stopped as soon as
possible.

8. Earthdata

If you want data related to the Earth and Space, Earthdata is the perfect place
for that. It was created by NASA, after all! Earthdata is a part of the Earth Science
Data Systems Program created by NASA that provides data sets based on the
Earth’s atmosphere, oceans, solar flares, cryosphere, geomagnetism, tectonics,
etc. Earthdata is specifically a part of the Earth Observing System Data and
Information System (EOSDIS), which collects and processes data from various
NASA aircraft and satellites, along with field data obtained from the ground. While
Earthdata provides many of these data sets, it also offers data tools for
searching, handling, ordering, mapping, and visualizing the data.

Data Preprocessing in Data Mining (Last Updated: 09-09-2019)

https://www.geeksforgeeks.org

Preprocessing in Data Mining:

Data preprocessing is a data mining technique that is used to transform raw
data into a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:

The data can have many irrelevant and missing parts. To handle this, data
cleaning is done. It involves handling missing data, noisy data, etc.

(a). Missing Data:

This situation arises when some values are missing in the data. It can be handled
in various ways.

Some of them are:

A1. Ignore the tuples:

This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.

A2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing
values manually, with the attribute mean, or with the most probable value.
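A minimal sketch, assuming Python with pandas, of the options above; the column
names and values are invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Pune", "Delhi", None, "Pune"]})

dropped = df.dropna()                                 # ignore (drop) tuples with missing values
filled = df.fillna({"age": df["age"].mean(),          # fill with the attribute mean
                    "city": df["city"].mode()[0]})    # fill with the most probable (most frequent) value
print(filled)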

(b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can
be generated due to faulty data collection, data entry errors, etc. It can be
handled in the following ways:

B1. Binning Method:

This method works on sorted data in order to smooth it. The whole
data is divided into segments of equal size, and then various methods are
performed to complete the task. Each segment is handled separately.
One can replace all the data in a segment by its mean, or the boundary
values of the segment can be used instead.
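A rough sketch, assuming Python with NumPy, of equal-size binning with smoothing
by bin means and by bin boundaries; the sample values are invented.

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)   # segments of (roughly) equal size

# Smoothing by bin means: every value in a segment is replaced by the segment mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the nearest segment boundary.
by_boundaries = [np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins]

print(np.concatenate(by_means))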

B2. Regression:

Here data can be made smooth by fitting it to a regression function.
The regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
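A small sketch, assuming Python with NumPy, of smoothing by fitting a linear
regression; the noisy series is synthetic.

import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + np.random.normal(scale=1.5, size=x.size)   # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)   # fit a simple linear regression
y_smooth = slope * x + intercept             # replace the noisy values with fitted values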

B3. Clustering:

This approach groups similar data into clusters. Outliers may go
undetected, or they will fall outside the clusters.

2. Data Transformation:

This step is taken in order to transform the data into forms appropriate for the
mining process. It involves the following ways:

a. Normalization:

It is done in order to scale the data values into a specified range, such as -1.0
to 1.0 or 0.0 to 1.0.

b. Attribute Selection:

In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.

c. Discretization:

This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels.
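A small sketch, assuming Python with pandas, of replacing raw numeric values with
interval (conceptual) levels; the age bins and labels are illustrative.

import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])
age_levels = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                    labels=["child", "young", "middle-aged", "senior"])
print(age_levels.tolist())   # ['child', 'child', 'young', 'middle-aged', 'senior', 'senior']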

d. Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
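A tiny sketch, assuming Python with pandas, of climbing a concept hierarchy by
mapping the lower-level attribute “city” to the higher-level attribute “country”;
the mapping itself is made up.

import pandas as pd

# Hypothetical city -> country concept hierarchy
city_to_country = {"Mumbai": "India", "Pune": "India", "Tokyo": "Japan"}

df = pd.DataFrame({"city": ["Mumbai", "Tokyo", "Pune"]})
df["country"] = df["city"].map(city_to_country)   # generalize to the higher level
print(df)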

3. Data Reduction:

Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder when working with such volumes. To get around this, we use
data reduction techniques, which aim to increase storage efficiency and reduce
data storage and analysis costs.

The various steps to data reduction are:

a. Data Cube Aggregation:

An aggregation operation is applied to the data to construct the data cube.
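A rough sketch, assuming Python with pandas, of cube-style aggregation that rolls
detailed records up into one cell per (year, branch); the sales figures are invented.

import pandas as pd

sales = pd.DataFrame({
    "year":   [2018, 2018, 2019, 2019],
    "branch": ["A", "B", "A", "B"],
    "amount": [400, 350, 500, 420],
})

# Aggregate the detailed records into the cells of a small data cube
cube = sales.pivot_table(index="year", columns="branch", values="amount", aggfunc="sum")
print(cube)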

b. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be
discarded. For performing attribute selection, one can use the level of
significance and the p-value of each attribute: an attribute whose p-value is
greater than the significance level can be discarded.
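A minimal sketch, assuming Python with scikit-learn, of selecting attributes by
statistical significance; the synthetic data and the 0.05 significance level are
assumptions.

import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                     # four candidate attributes
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # constructed so only the first two matter

_, p_values = f_classif(X, y)
keep = p_values < 0.05            # discard attributes whose p-value exceeds the significance level
X_reduced = X[:, keep]
print(p_values.round(4), keep)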

c. Numerosity Reduction:

This enables us to store a model of the data instead of the whole data, for
example regression models.
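A small sketch, assuming Python with NumPy, of parametric numerosity reduction:
only the coefficients of a fitted regression model are stored instead of every data
point (assuming a roughly linear relationship).

import numpy as np

x = np.linspace(0, 10, 1000)
y = 3.0 * x + 2.0 + np.random.normal(scale=0.5, size=x.size)

coeffs = np.polyfit(x, y, deg=1)    # store just two numbers instead of 1000 points
y_approx = np.polyval(coeffs, x)    # reconstruct approximate values when needed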

d. Dimensionality Reduction:

This reduces the size of the data using encoding mechanisms. It can be
lossy or lossless: if the original data can be retrieved after reconstruction from
the compressed data, the reduction is called lossless; otherwise it is called
lossy. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).

Data goes through a series of steps during preprocessing:


(https://www.techopedia.com/definition/14650/data-preprocessing)

1. Data Cleaning: Data is cleansed through processes such as filling in missing
values or deleting rows with missing data, smoothing the noisy data, or resolving
the inconsistencies in the data.

Smoothing noisy data is particularly important for ML datasets, since machines
cannot make use of data they cannot interpret. Data can be cleaned by
dividing it into equal-size segments that are thus smoothed (binning), by fitting it
to a linear or multiple regression function (regression), or by grouping it into
clusters of similar data (clustering).

Data inconsistencies can occur due to human errors (the information was stored
in the wrong field). Duplicated values should be removed through deduplication
to avoid giving that data object an advantage (bias).

Data Integration: Data with different representations are put together and
conflicts within the data are resolved.

Data Transformation: Data is normalized and generalized. Normalization is a
process that ensures that no data is redundant, that it is all stored in a single
place, and that all the dependencies are logical.

What is Variable Transformation?

An attribute transform is a function that maps the entire set of values of a
given attribute to a new set of replacement values, such that each old value
can be identified with one of the new values.

Simple functions: power(x, k), log(x), power(e, x), |x|

Normalization: It refers to various techniques to adjust for differences among
attributes in terms of frequency of occurrence, mean, variance, and range.

Standardization: In statistics it refers to subtracting off the mean and dividing by
the standard deviation.
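A small sketch, assuming Python with NumPy, of a simple log transform followed by
standardization (subtract the mean, divide by the standard deviation); scikit-learn's
StandardScaler would give the same result.

import numpy as np

values = np.array([10.0, 100.0, 1000.0, 10000.0])

log_values = np.log(values)                                           # simple variable transform
standardized = (log_values - log_values.mean()) / log_values.std()    # z-score standardization
print(round(standardized.mean(), 6), round(standardized.std(), 6))    # ~0.0 and 1.0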

2. Data Reduction: When the volume of data is huge, databases can become
slower, costly to access, and challenging to store properly. The data reduction
step aims to present a reduced representation of the data in a data warehouse.

There are various methods to reduce data. For example, once a subset of
relevant attributes is chosen for its significance, anything below a given level is
discarded. Encoding mechanisms can be used to reduce the size of data as
well. If all original data can be recovered after compression, the operation is
labeled as lossless.

If some data is lost, then it’s called a lossy reduction. Aggregation can also be
used, for example, to condense countless transactions into a single weekly or
monthly value, significantly reducing the number of data objects.

What is Dimensionality Reduction?

The term dimensionality reduction is often reserved for those techniques
that reduce the dimensionality of a data set by creating new attributes that are
a combination of the old attributes.

Purpose:

 Avoid the curse of dimensionality. To learn more about this, visit my earlier
article explaining it in detail.
 Reduce amount of time and memory required by data mining algorithms.
 Allow data to be more easily visualised.
 May help to eliminate irrelevant features or reduce noise.

Techniques:

 Principal Components Analysis (PCA)
 Singular Value Decomposition (SVD)

The techniques mentioned here are too vast to discuss fully in this post. You can
learn more about them on the internet. I have added YouTube links to both, in case
you want to watch those videos and learn.
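A minimal sketch, assuming Python with NumPy and scikit-learn, of reducing
synthetic 10-dimensional data to two new attributes with PCA, and of the closely
related SVD of the centered data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # synthetic 10-dimensional data

pca = PCA(n_components=2)           # keep 2 new attributes (principal components)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)              # (200, 2)

# Singular Value Decomposition of the centered data yields the same directions
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
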
What is Feature Subset Selection?

It is another way to reduce the dimensionality of the data, by only using a subset
of the available features. While it might seem that such an approach would lose
information, this is not the case if redundant and irrelevant features are present.

There are three standard approaches to feature selection: embedded, filter,
and wrapper.

Embedded approaches

Feature selection occurs naturally as part of the data mining algorithm.
Specifically, during the operation of the data mining algorithm, the algorithm
itself decides which attributes to use and which to ignore.

Filter approaches

Features are selected before the data mining algorithm is run, using some
approach that is independent of the data mining task. For example, we might
select sets of attributes whose pairwise correlation is as low as possible.
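A rough sketch, assuming Python with pandas, of a filter-style selection that drops
one attribute from each highly correlated pair so that the remaining attributes have
low pairwise correlation; the 0.9 threshold and synthetic columns are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 0.95 * df["a"] + rng.normal(scale=0.1, size=100)   # nearly redundant with "a"
df["c"] = rng.normal(size=100)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
selected = df.drop(columns=to_drop)   # the surviving features have low pairwise correlation
print(selected.columns.tolist())      # e.g. ['a', 'c']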

Wrapper approaches

These methods use the target data mining algorithm as a black box to
find the best subset of attributes, in a way similar to that of the ideal algorithm
described above, but typically without enumerating all possible subsets.

Flow chart of a feature subset selection process


3. Data Discretization: Data could also be discretized to replace raw values with
interval levels. This step reduces the number of values of a
continuous attribute by dividing the range of the attribute into intervals.

 Discretization is the process of converting a continuous attribute into an
ordinal attribute.
 A potentially infinite number of values are mapped into a small number of
categories.
 Discretization is commonly used in classification.
 Many classification algorithms work best if both the independent and
dependent variables have only a few values.

Binarization

 Binarization maps a continuous or categorical attribute into one or more
binary variables.
 Typically used for association analysis.
 Often one converts a continuous attribute to a categorical attribute and
then converts the categorical attribute to a set of binary attributes.
 Association analysis needs asymmetric binary attributes.
 Examples: eye colour, and height measured as {low, medium, high} (a small
sketch follows below).

Conversion of a categorical attribute to three binary attributes


Conversion of a categorical attribute to five asymmetric binary attributes
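A small sketch, assuming Python with pandas, of converting a categorical attribute
into a set of binary attributes via one-hot encoding; the eye-colour values are
illustrative.

import pandas as pd

eye_colour = pd.Series(["brown", "blue", "green", "brown"], name="eye_colour")

# One binary (0/1) attribute per category value
binary = pd.get_dummies(eye_colour, prefix="eye").astype(int)
print(binary)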

4. Data Sampling: Sometimes, due to time, storage or memory constraints, a
dataset is too big or too complex to be worked with. Sampling techniques can
be used to select and work with just a subset of the dataset, provided that it has
approximately the same properties as the original one.

Sampling is a commonly used approach for selecting a subset of the data
objects to be analysed.

→ The key aspect of sampling is to use a sample that is representative. A sample
is representative if it has approximately the same property (of interest) as the
original set of data. If the mean (average) of the data objects is the property of
interest, then a sample is representative if it has a mean that is close to that of
the original data.

Types of Sampling
1. Simple Random Sampling:

 There is an equal probability of selecting any particular item.
 Sampling without replacement: as each item is selected, it is removed
from the population.
 Sampling with replacement: objects are not removed from the
population as they are selected for the sample. In sampling with
replacement, the same object can be picked more than once (a code
sketch follows below).

Stratified sampling: Split the data into several partitions, then draw random
samples from each partition.
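A minimal sketch, assuming Python with pandas, of simple random sampling (with
and without replacement) and stratified sampling; the column names, sample sizes
and fractions are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(size=1000),
                   "group": rng.choice(["A", "B", "C"], size=1000)})

without_replacement = df.sample(n=100, replace=False, random_state=0)
with_replacement = df.sample(n=100, replace=True, random_state=0)   # the same row may appear twice

# Stratified sampling: draw a random sample from each partition (group)
stratified = df.groupby("group").sample(frac=0.1, random_state=0)
print(stratified["group"].value_counts())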

Progressive Sampling: The proper sample size can be difficult to determine, so
adaptive or progressive sampling schemes are sometimes used. These
approaches start with a small sample and then increase the sample size until a
sample of sufficient size has been obtained.
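A rough sketch, assuming Python with NumPy, of a progressive sampling loop that
doubles the sample size until the sample mean stabilizes; the stopping tolerance,
growth factor and synthetic population are all assumptions.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=100000)

size, prev_mean = 100, None
while size <= population.size:
    sample = rng.choice(population, size=size, replace=False)
    mean = sample.mean()
    if prev_mean is not None and abs(mean - prev_mean) < 0.05:
        break   # the estimate has stabilized: this sample size is sufficient
    prev_mean, size = mean, size * 2

print(size, round(mean, 3))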
