
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 1, Jan-Feb 2015

RESEARCH ARTICLE                                                      OPEN ACCESS

A Study on Algorithmic Approaches and Mining Methodologies in Data Mining

S. Padmapriya
Assistant Professor
Department of Computer Science
Srimad Andavan Arts and Science College
Tamil Nadu, India

ABSTRACT
Data mining finds valuable information hidden in the large volumes of data kept in databases. It is the analysis of data and the use of software techniques to find hidden patterns and regularities in sets of data. Knowledge discovery from large data sets is difficult, and the growing demand for finding patterns in huge data is met by data mining algorithms and techniques. Researchers have presented many approaches and algorithms for determining patterns. This paper surveys various data mining algorithms and mining methods for discovering valuable patterns from hidden information.
Keywords:- Data mining, Knowledge Discovery.

I. INTRODUCTION

Data mining is an emerging trend. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if meaningful information or knowledge cannot be extracted from it [4]. Data mining, otherwise known as knowledge discovery, attempts to answer this need. In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. Nowadays, advances in hardware technology have led to an increase in the capability to store and record personal data.

II. BASIC TERMINOLOGY

Data:
Data is information, typically the results of measurement (numerical) or counting (categorical). Variables serve as placeholders for data. There are two types of variables, numerical and categorical. A numerical or continuous variable is one that can accept any value within a finite or infinite interval. There are two types of numerical data, interval and ratio. Data on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided, because there is no true zero; for example, we cannot say that one day is twice as hot as another day. On the other hand, data on a ratio scale has a true zero and can be added, subtracted, multiplied or divided (e.g., weight). A categorical or discrete variable is one that can accept two or more values (categories). There are two types of categorical data, nominal and ordinal. Nominal data does not have an intrinsic ordering of its categories; for example, "gender" with two categories, male and female. In contrast, ordinal data does have an intrinsic ordering of its categories; for example, "level of energy" with three ordered categories (low, medium and high).

Figure 1: Data Mining Process
Data Preparation:
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modelling. It is a solid practice to start with an initial dataset, to get familiar with the data, to discover first insights into it, and to gain a good understanding of any possible data quality issues. Data preparation is often a time-consuming process that is heavily prone to errors. The old saying "garbage in, garbage out" is particularly applicable to data mining projects where the gathered data contains many invalid, out-of-range and missing values. Analyzing data that has not been carefully screened for such problems can produce highly misleading results. The success of a data mining project therefore depends heavily on the quality of the prepared data.

Dataset:
A dataset is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data. Datasets are classified into two types: test data and training data.

III. MINING METHODOLOGY

a. Classification
Classification problems aim to identify the characteristics that indicate the group to which each case belongs. This pattern can be used both to understand the existing data and to predict how new instances will behave [1]. Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern. These existing cases may come from a historical database, such as people who have already undergone a particular medical treatment or moved to a new long-distance service. They may come from an experiment in which a sample of the entire database is tested in the real world and the results are used to create a classifier.


b. Regression
Regression uses existing values to forecast what other values will be. In the simplest case, regression uses standard statistical techniques such as linear regression [1]. Unfortunately, many real-world problems are not simply linear projections of previous values. For instance, sales volumes, stock prices, and product failure rates are all very difficult to predict because they may depend on complex interactions of multiple predictor variables. Therefore, more complex techniques (e.g., logistic regression, decision trees, or neural nets) may be necessary to forecast future values. The same model types can often be used for both regression and classification. For example, the CART (Classification And Regression Trees) decision tree algorithm can be used to build both classification trees (to classify categorical response variables) and regression trees (to forecast continuous response variables). Neural nets too can create both classification and regression models.
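To make this concrete, here is a minimal sketch (not from the paper; it assumes scikit-learn's CART-based tree estimators and an invented toy dataset) that builds one tree of each kind:

```python
# Illustrative sketch: CART-style trees for classification and regression
# (assumes scikit-learn; the data is a tiny made-up example).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy predictor matrix: [age, income]
X = [[25, 30000], [40, 60000], [35, 45000], [50, 80000]]
y_class = ["no", "yes", "no", "yes"]         # categorical response
y_value = [1200.0, 3100.0, 2000.0, 4500.0]   # continuous response

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)  # classification tree
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)   # regression tree

print(clf.predict([[45, 70000]]))  # class label for a new case
print(reg.predict([[45, 70000]]))  # forecast of a continuous value
```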

c. Logistic regression
Logistic regression is a generalization of linear regression. It is used primarily for predicting binary variables (with values such as yes/no or 0/1) and occasionally multi-class variables [1]. Because the response variable is discrete, it cannot be modeled directly by linear regression. Therefore, rather than predict whether the event itself (the response variable) will occur, we build the model to predict the logarithm of the odds of its occurrence. This logarithm is called the log odds or the logit transformation. The odds of an event are:

odds = P(event occurring) / P(event not occurring)
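As a small worked illustration (the probability value is invented), the sketch below computes the odds and the logit for an event, and inverts the logit back to a probability with the logistic function:

```python
import math

p = 0.8                  # assumed probability that the event occurs
odds = p / (1 - p)       # odds = P(event) / P(not event) = 4.0
logit = math.log(odds)   # log odds: the logit transformation
print(odds, logit)

# Inverting the logit recovers the probability (the logistic function):
p_back = 1 / (1 + math.exp(-logit))
print(p_back)            # 0.8
```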

d. Neural networks
Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets may be used in classification problems (where the output is a categorical variable) or for regression (where the output variable is continuous). A neural network starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables.

Algorithm
There are different types of neural networks, but they are generally classified into feed-forward and feed-back networks.

A feed-forward network is a non-recurrent network which contains inputs, outputs, and hidden layers; the signals can only travel in one direction. Input data is passed onto a layer of processing elements where it performs calculations. Each processing element makes its computation based upon a weighted sum of its inputs. The newly calculated values then become the input values that feed the next layer. This process continues until the data has gone through all the layers and the output is determined. A threshold transfer function is sometimes used to quantify the output of a neuron in the output layer. Feed-forward networks include Perceptron (linear and non-linear) and Radial Basis Function networks. Feed-forward networks are often used in data mining.

A feed-back network has feed-back paths, meaning that signals can travel in both directions using loops. All possible connections between neurons are allowed. Since loops are present in this type of network, it becomes a non-linear dynamic system which changes continuously until it reaches a state of equilibrium. Feed-back networks are often used in associative memories and optimization problems, where the network looks for the best arrangement of interconnected factors.
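A minimal sketch of one feed-forward pass, assuming a logistic sigmoid as the transfer function and invented weights; each layer's activations are the transfer function applied to a weighted sum of the previous layer's values, as described above:

```python
import numpy as np

def sigmoid(x):
    # A common smooth transfer function; a hard threshold could be used instead.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 0.1, 0.9])        # input layer: one value per predictor
W1 = np.array([[0.2, -0.4, 0.7],     # weights from 3 inputs to 2 hidden nodes
               [0.6,  0.1, -0.3]])
W2 = np.array([[0.5, -0.8]])         # weights from hidden layer to 1 output node

hidden = sigmoid(W1 @ x)             # weighted sums -> hidden activations
output = sigmoid(W2 @ hidden)        # hidden activations feed the output layer
print(output)                        # signals travel in one direction only
```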

e. Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. A Support Vector Machine can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because the output is a real number, it becomes very difficult to predict the information at hand, which has infinite possibilities. In the case of regression, a margin of tolerance (epsilon) is therefore set, in approximation to the margin the SVM would already have requested from the problem; beyond this, the algorithm itself is also more complicated and must be taken into consideration [3]. However, the main idea is always the same: to minimize error by individualizing the hyperplane which maximizes the margin, keeping in mind that part of the error is tolerated.

Figure 2: Support Vector Method
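The hedged sketch below (assuming scikit-learn's SVR and an invented one-dimensional dataset) shows where the margin of tolerance enters: training points predicted within epsilon of their true value contribute no error:

```python
from sklearn.svm import SVR

X = [[1.0], [2.0], [3.0], [4.0], [5.0]]   # toy one-dimensional predictors
y = [1.2, 1.9, 3.2, 3.9, 5.1]             # toy continuous responses

# epsilon is the tolerated deviation; errors inside the tube are ignored.
model = SVR(kernel="linear", epsilon=0.2).fit(X, y)
print(model.predict([[2.5]]))
```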

IV. ALGORITHMIC APPROACHES

a. Genetic algorithms
Genetic algorithms are not used to find patterns per se, but rather to guide the learning process of data mining algorithms such as neural nets. Essentially, genetic algorithms act as a method for performing a guided search for good models in the solution space. They are called genetic algorithms because they loosely follow the pattern of biological evolution, in which the members of one generation (of models) compete to pass on their characteristics to the next generation (of models), until the best (model) is found. The information to be passed on is contained in chromosomes, which contain the parameters for building the model.
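A minimal sketch of this evolutionary loop, with an invented toy fitness function (counting one-bits in a bit-string "chromosome"); in a real application the chromosome would instead encode the parameters for building a model, as the text describes:

```python
import random

def fitness(chrom):
    # Toy objective: prefer chromosomes with more 1-bits.
    return sum(chrom)

def evolve(pop_size=20, length=10, generations=30):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # selection: the best models compete
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)   # crossover: mix parent chromosomes
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:           # mutation: small random change
                i = random.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)                # the best "model" found

print(evolve())
```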

b. Artificial Neural Networks
An ANN is comprised of a network of artificial neurons (also known as "nodes"). These nodes are connected to each other, and the strength of each connection is assigned a value: inhibition (maximum being -1.0) or excitation (maximum being +1.0). If the value of the connection is high, then it indicates a strong connection. Within each node's design, a transfer function is built in. There are three types of neurons in an ANN: input nodes, hidden nodes, and output nodes. The input nodes take in information in a form which can be numerically expressed. The information is presented as activation values, where each node is given a number; the higher the number, the greater the activation. This information is then passed throughout the network. Based on the connection strengths (weights), inhibition or excitation, and transfer functions, the activation value is passed from node to node. Each node sums the activation values it receives and then modifies the value based on its transfer function. The activation flows through the network, through hidden layers, until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world.
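The computation at a single node can be sketched as follows (activations and weights are invented, and tanh merely stands in for the built-in transfer function; negative weights act as inhibition, positive weights as excitation):

```python
import math

activations = [0.9, 0.3, 0.6]   # activation values arriving at this node
weights = [0.8, -1.0, 0.4]      # excitatory (+) and inhibitory (-) strengths

total = sum(a * w for a, w in zip(activations, weights))  # sum incoming activation
output = math.tanh(total)       # modify the sum with the transfer function
print(output)                   # activation passed on toward the output layer
```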

c. The k-means Algorithm
The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. Here is how the algorithm works:

1. The algorithm arbitrarily selects k points as the initial cluster centers (means).
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster center.
3. Each cluster center is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated, or that the changes do not make a material difference in the definition of the clusters [2].
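A compact NumPy sketch of steps 1 through 4 (the data and k are invented; ties and empty clusters are ignored for brevity):

```python
import numpy as np

def kmeans(points, k, iters=100):
    rng = np.random.default_rng(0)
    # Step 1: arbitrarily select k points as the initial centers (means).
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the average of its points.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until no center moves (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

pts = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(pts, k=2))
```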

Choosing the Number of Clusters
One of the main disadvantages of k-means is the fact that you must specify the number of clusters as an input to the algorithm. As designed, the algorithm is not capable of determining the appropriate number of clusters and depends upon the user to identify this in advance. For example, if you had a group of people that were easily clustered based upon gender, calling the k-means algorithm with k=3 would force the people into three clusters when k=2 would provide a more natural fit. Similarly, if a group of individuals were easily clustered based upon home state and you called the k-means algorithm with k=20, the results might be too generalized to be effective.

d. Nearest Neighbour Algorithms
In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression [1]. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors [2]. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms. For both classification and regression, it can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones; for example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data [2].
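A small sketch of distance-weighted k-NN classification (the data is invented; each neighbor votes with weight 1/d as in the example above, with a tiny constant guarding against division by zero):

```python
import numpy as np
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3):
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]                   # the k closest training examples
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (dists[i] + 1e-9)  # nearer neighbors count more
    return max(votes, key=votes.get)                  # weighted majority vote

X = np.array([[1.0, 1.0], [1.5, 1.2], [4.0, 4.2], [4.2, 3.9]])
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, np.array([1.2, 1.1])))        # -> "a"
```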

e. K-nearest neighbor and memory-based reasoning (MBR)
When trying to solve new problems, people often look at solutions to similar problems that they have previously solved. K-nearest neighbor (k-NN) is a classification technique that uses a version of this same method. It decides in which class to place a new case by examining some number (the k in k-nearest neighbor) of the most similar cases, or neighbors. It counts the number of cases for each class, and assigns the new case to the same class to which most of its neighbors belong.

f. Bayesian Algorithms
Bayesian approaches are a fundamentally important data mining technique. Given the probability distribution, the Bayes classifier can provably achieve the optimal result. The Bayesian method is based on probability theory: Bayes' Rule is applied to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. One limitation that Bayesian approaches cannot overcome is the need to estimate probabilities from the training dataset. It is noticeable that in some situations, such as when the decision is clearly based on certain criteria or when the dataset has a high degree of randomness, Bayesian approaches will not be a good choice. Bayes' theorem plays a critical role in probabilistic learning and classification. A generative model that approximates how data is produced uses the prior probability of each category given no information about an item; categorization then produces a posterior probability distribution over the possible categories given a description of the item:

P(C, D) = P(C|D) P(D) = P(D|C) P(C)

P(C|D) = P(D|C) P(C) / P(D)
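A worked numeric sketch of Bayes' Rule with invented numbers, where C is a category and D is a description of an item:

```python
# Invented example: C = "spam", D = "message contains the word 'offer'".
p_c = 0.3          # prior: P(C)
p_d_given_c = 0.6  # likelihood: P(D|C)
p_d = 0.25         # evidence: P(D)

p_c_given_d = p_d_given_c * p_c / p_d   # Bayes' Rule: posterior P(C|D)
print(p_c_given_d)                      # 0.72
```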

V. MODEL EVALUATION

Model evaluation is an integral part of the model development process. It helps to find the best model that represents our data and indicates how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data mining because it can easily generate overoptimistic and overfitted models. There are two methods of evaluating models in data mining, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.

Hold-Out:
In this method, the (usually large) dataset is randomly divided into three subsets. The training set is a subset of the dataset used to build predictive models. The validation set is a subset of the dataset used to assess the performance of the model built in the training phase; it provides a test platform for fine-tuning the model's parameters and selecting the best-performing model. Not all modelling algorithms need a validation set. The test set, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
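A hedged sketch of the three-way hold-out split (the proportions are invented; only NumPy is assumed):

```python
import numpy as np

data = np.arange(100)             # stand-in for a dataset of 100 rows
rng = np.random.default_rng(42)
idx = rng.permutation(len(data))  # random division of the dataset

train = data[idx[:60]]            # 60% to build predictive models
valid = data[idx[60:80]]          # 20% to tune parameters / select a model
test = data[idx[80:]]             # 20% unseen, to estimate future performance
print(len(train), len(valid), len(test))
```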

Cross-Validation
When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of the model performance. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called "leave-one-out".
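A minimal k-fold sketch (NumPy assumed; model building is left as a placeholder since the paper does not fix a particular model):

```python
import numpy as np

def k_fold_indices(n_samples, k):
    idx = np.random.default_rng(0).permutation(n_samples)
    folds = np.array_split(idx, k)    # k subsets of (nearly) equal size
    for i in range(k):
        test = folds[i]               # one subset held out as the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

for train, test in k_fold_indices(n_samples=10, k=5):
    # build the model on `train`, evaluate on `test` (placeholder)
    print(len(train), len(test))

# With k equal to the sample size (k=10 here), this becomes leave-one-out.
```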

Time series
Time series forecasting predicts unknown future values based on a time-varying series of predictors. Like regression, it uses known results to guide its predictions. Models must take into account the distinctive properties of time, especially the hierarchy of periods (including such varied definitions as the five- or seven-day work week, the thirteen-month year, etc.), seasonality, and calendar effects such as holidays, date arithmetic, and special considerations such as how much of the past is relevant [1].

VI. CONCLUSION

Due to the increasing demand for valuable data and accuracy, new methods and techniques need to be identified to improve quality. This paper described different methodologies associated with the algorithms used to handle huge data, and gave an overview of various techniques and algorithms used on big data sets.

REFERENCES

[1] http://www.twocrows.com/intro-dm.pdf
[2] http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[3] http://research.cs.queensu.ca/home/xiao/dm.html
[4] Mohammed Younus, Dr. Ahmad A. Alhamed, Khazi Mohammed Farooq, Fahmida Begum, "Data Mining Modeling Techniques and Algorithm Approaches in Privacy Data," IJARCSSE, Volume 4, 2014.
[5] Rachna Somkunwar, "A Study on Various Data Mining Approaches of Association Rules," IJARCSSE, Volume 2, 2012.
