UNIT I Introduction to Data Science
Introduction
Data structures are unique ways of storing data that are optimized
for certain situations. Data structures like a priority queue allow us to
model how a CPU processes requests, or how to efficiently model a set of
cities and interconnecting flights.
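As a concrete illustration of a priority queue used to model CPU request handling, here is a minimal sketch using Python's standard `heapq` module. The tasks and priority values are invented for illustration; lower numbers mean higher priority.

```python
import heapq

# A priority queue modeling how a CPU might pick the next request:
# each entry is (priority, task), and heappop always returns the
# entry with the smallest priority value first.
requests = []
heapq.heappush(requests, (2, "render page"))
heapq.heappush(requests, (1, "handle interrupt"))
heapq.heappush(requests, (3, "background cleanup"))

order = []
while requests:
    priority, task = heapq.heappop(requests)
    order.append(task)

print(order)  # highest-priority task comes out first
```

Because the heap keeps the smallest priority at the front, "handle interrupt" is processed before the lower-priority tasks regardless of insertion order.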
Characteristics of an Algorithm
Not all procedures can be called an algorithm. An algorithm should have
the following characteristics –
1. Primitive data structures: These are the basic data types that a
language provides directly, such as:
Integer
Float
Character
Boolean
Double
Void
2. Non-primitive data structures: These are complex data
structures that are built using primitive data types. Non-primitive
data structures can be further categorized into the following
types:
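The distinction above can be sketched with Python's built-in types standing in for the two categories; the variable names and values are illustrative assumptions, not part of the original notes.

```python
# Primitive values: single data items of a basic type.
age = 30            # integer
price = 9.99        # float
grade = 'A'         # character (a one-character string in Python)
passed = True       # boolean

# Non-primitive structures: built by composing primitive values.
scores = [88, 92, 75]                 # list (array-like)
student = {"name": "Ada", "age": 30}  # record-like mapping
```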
Programming Paradigms
There are many known programming languages, but all of them must
follow some strategy when they are implemented, and this
methodology/strategy is called a paradigm. Apart from the variety of
programming languages, there are many paradigms to fulfill each
demand. They are discussed below.
Imperative Programming Paradigm: It is one of the oldest
programming paradigms. It features close relation to machine
architecture. It is based on Von Neumann architecture. It works by
changing the program state through assignment statements. It
performs step by step task by changing state. The main focus is on how
to achieve the goal. The paradigm consists of several statements, and
after all of them have executed, the result is stored.
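A tiny imperative-style sketch makes this concrete: the program advances step by step, changing its state through assignment statements, and the result is what the state holds after the last step.

```python
# Imperative style: state (total) is mutated step by step
# through assignment until the final result is stored in it.
total = 0
for i in range(1, 6):
    total = total + i  # the program state changes on each step
print(total)  # sum of 1..5
```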
Advantages
1. Very simple to implement
2. It provides familiar constructs such as loops and variables.
Disadvantages
• Query: The query might not be well formed or precisely stated. The
data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of
the original operational database. The data have been cleansed and
modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset
of the database. Instead it is the output of some analysis of the
contents of the database.
The current state of the art of data mining is similar to that of
database query processing in the late 1960s and early 1970s. Over the
next decade there undoubtedly will be great strides in extending the
state of the art.
Although data mining is currently in its infancy, over the last decade we
have seen a proliferation of mining algorithms, applications, and
algorithmic approaches. Example 1.1 illustrates one such application.
EXAMPLE 1.1
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not
authorize, and (4) do not authorize but contact police. The data mining
functions here are twofold. First the historical data must be examined to
determine how the data fit into the four classes. Then the problem is to
apply this model to each new purchase. Although the second part indeed
may be stated as a simple database query, the first part cannot be.
Classification
Classification maps data into predefined groups or classes. Because
the classes are determined before the data are examined, it is often
referred to as supervised learning.
EXAMPLE 1.2
An airport security screening station is used to determine if passengers
are potential terrorists or criminals. To do this, the face of each
passenger is scanned and its basic pattern (distance between eyes, size
and shape of mouth, shape of head, etc.) is identified. This pattern is
compared to entries in a database to see if it matches any patterns that
are associated with known offenders.
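The first data mining step in Example 1.1 — deciding how purchases map into the four classes — can be sketched as a classifier. Here the "learned model" is faked as a simple threshold rule over purchase amount; the thresholds are illustrative assumptions, not part of the original example.

```python
# A toy stand-in for a learned classification model: map a purchase
# amount to one of the four classes from Example 1.1. Real systems
# would learn these boundaries from historical data.
def classify_purchase(amount):
    if amount < 100:
        return "authorize"
    elif amount < 1000:
        return "ask for further identification"
    elif amount < 10000:
        return "do not authorize"
    else:
        return "do not authorize, contact police"

print(classify_purchase(50))      # small purchase: authorize
print(classify_purchase(25000))   # extreme purchase: contact police
```

Applying the model to each new purchase (the second step in the example) is then just a function call, which is why that part can be stated as a simple query while learning the model cannot.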
Regression
Regression is used to map a data item to a real-valued prediction
variable. In actuality, regression involves learning the function that
does this mapping. Regression assumes that the target data fit some
known type of function (e.g., linear, logistic) and then determines the
best function of this type that models the given data. Example 1.3 is
a simple example of regression.
EXAMPLE 1.3
A college professor wishes to reach a certain level of savings before
her retirement. Periodically, she predicts what her retirement savings
will be based on their current value and several past values. She uses
a simple linear regression formula to predict this value by fitting
past behavior to a linear function and then using this function to
predict the values at points in the future. Based on these values, she
then alters her investment portfolio.
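The fitting step in Example 1.3 can be sketched with ordinary least squares for a line. The yearly savings figures below are made up for illustration; only the fitting formula itself is standard.

```python
# Fit y = m*x + b by least squares, then extrapolate to a future year.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

years = [0, 1, 2, 3]
savings = [10000, 12000, 14000, 16000]  # toy data, perfectly linear
m, b = fit_line(years, savings)
print(m * 5 + b)  # predicted savings at year 5
```

With this toy data the fitted slope is 2000 per year, so the prediction at year 5 is 20000; with real, noisy data the line is only the best linear approximation.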
Time Series Analysis
With time series analysis, the value of an attribute is examined as it
varies over time, with values usually obtained at evenly spaced points
(daily, weekly, and so on). Example 1.4 illustrates this task.
EXAMPLE 1.4
Mr. Smith is trying to determine whether to purchase stock from
Companies X, Y, or Z. For a period of one month he charts the daily stock
price for each company. Mr. Smith has generated a time series plot. Using
this and similar information available from his stockbroker, Mr. Smith
decides to purchase stock from X because it is less volatile while overall
showing a slightly larger relative amount of growth than either of the
other stocks.
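Mr. Smith's comparison can be sketched numerically: volatility as the standard deviation of each price series and overall growth as the relative change from first to last price. The company names match the example, but the daily prices are invented for illustration.

```python
import statistics

# Made-up daily closing prices for each company over a few days.
prices = {
    "X": [10, 10.2, 10.1, 10.4, 10.6],   # steady, modest growth
    "Y": [10, 12.0, 9.0, 13.0, 10.5],    # volatile
    "Z": [10, 9.8, 10.1, 9.9, 10.2],     # flat
}
for company, series in prices.items():
    volatility = statistics.stdev(series)
    growth = (series[-1] - series[0]) / series[0]
    print(company, round(volatility, 3), round(growth, 3))
```

On this toy data X shows both lower volatility than Y and stronger growth than Z, mirroring the reasoning in the example.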
Prediction
Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a
type of classification. (Note: This is a data mining task that is different
from the prediction model, although the prediction task is a type of
prediction model.) The difference is that prediction is predicting a
future state rather than a current state. Here we are referring to a
type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flooding,
speech recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well. Example 1.5
illustrates the process.
EXAMPLE 1.5
Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data relevant
to flood prediction: water level, rain amount, time, humidity, and so on.
Then the water level at a potential flooding point in the river can be
predicted based on the data collected by the sensors upriver from this
point. The prediction must be made with respect to the time the data
were collected.
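One very rough way to sketch the idea in Example 1.5 is to predict the downstream water level as a weighted combination of upstream sensor readings taken earlier. The readings and weights below are invented assumptions; a real system would learn such a relationship from historical flood data.

```python
# Predict a downstream water level from earlier upstream readings
# using an assumed (illustrative) weighted combination.
def predict_downstream(upstream_levels, weights):
    return sum(level * w for level, w in zip(upstream_levels, weights))

readings = [2.0, 2.5, 3.0]   # meters, from three upstream monitors
weights = [0.2, 0.3, 0.5]    # assumed influence of each monitor
print(predict_downstream(readings, weights))
```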
Clustering
Clustering is similar to classification except that the groups are not
predefined, but rather defined by the data alone. Clustering is
alternatively referred to as unsupervised learning or segmentation. It
can be thought of as partitioning or segmenting the data into groups that
might or might not be disjoint. The clustering is usually accomplished
by determining the similarity among the data on predefined attributes.
The most similar data are grouped into clusters. Example 1.6 provides a
simple clustering example. Since the clusters are not predefined, a
domain expert is often required to interpret the meaning of the created
clusters.
EXAMPLE 1.6
A certain national department store chain creates special catalogs
targeted to various demographic groups based on attributes such as
income, location, and physical characteristics of potential customers
(age, height, weight, etc.). To determine the target mailings of the
various catalogs and to assist in the creation of new, more specific
catalogs, the company performs a clustering of potential customers
based on the determined attribute values. The results of the clustering
exercise are then used by management to create special catalogs and
distribute them to the correct target population based on the cluster
for that catalog.
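The grouping in Example 1.6 can be sketched with a tiny k-means-style loop over a single attribute. The incomes, the number of clusters, and the starting centers are all illustrative assumptions.

```python
# One-dimensional k-means sketch: repeatedly assign each value to the
# nearest center, then move each center to the mean of its cluster.
incomes = [25, 27, 30, 80, 85, 90]  # in thousands, made up

def kmeans_1d(values, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

print(kmeans_1d(incomes, [20, 100]))  # two cluster centers
```

Since the clusters are not predefined, someone (a domain expert in the example) still has to decide what each resulting group means, e.g. "lower-income" versus "higher-income" customers.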
A special type of clustering is called segmentation. With segmentation a
database is partitioned into disjoint groupings of similar tuples called
segments. Segmentation is often viewed as being identical to clustering.
In other circles segmentation is viewed as a specific type of clustering
applied to a database itself. In this text we use the two terms,
clustering and segmentation, interchangeably.
Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions
of the data. Alternatively, summary type information (such as the mean
of some numeric attribute) can be derived from the data. The
summarization succinctly characterizes the contents of the database.
Example 1.7 illustrates this process.
EXAMPLE 1.7
One of the many criteria used to compare universities by the U.S. News
& World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the
student body.
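The summarization in Example 1.7 reduces to deriving a single representative value, the mean, from the underlying data. The scores below are invented for illustration.

```python
# Summarization: characterize a set of scores by their mean.
sat_scores = [1200, 1350, 1100, 1450, 1300]  # made-up scores
mean_score = sum(sat_scores) / len(sat_scores)
print(mean_score)
```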
Association Rules
Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data.
The best example of this type of application is to determine association
rules. An association rule is a model that identifies specific types of data
associations. These associations are often used in the retail sales
community to identify items that are frequently purchased together.
Associations are also used in many other applications such as
predicting the failure of telecommunication switches.
EXAMPLE 1.8
A grocery store retailer is trying to decide whether to put bread on sale.
To help determine the impact of this decision, the retailer generates
association rules that show what other products are frequently
purchased with bread. He finds that 60% of the time that bread is sold,
pretzels are also sold, and that 70% of the time jelly is also sold. Based on
these facts, he tries to capitalize on the association between bread,
pretzels, and jelly by placing some pretzels and jelly at the end of the
aisle where the bread is placed.
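The percentages in Example 1.8 are confidence values for the rules bread → pretzels and bread → jelly: among transactions containing bread, how often does the other item appear? The transactions below are made up to illustrate the computation.

```python
# Confidence of bread -> pretzels and bread -> jelly over toy transactions.
transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "jelly", "milk"},
    {"milk", "eggs"},
]
bread_txns = [t for t in transactions if "bread" in t]
conf_pretzels = sum("pretzels" in t for t in bread_txns) / len(bread_txns)
conf_jelly = sum("jelly" in t for t in bread_txns) / len(bread_txns)
print(conf_pretzels, conf_jelly)
```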
Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are based on a time sequence of actions.
These patterns are similar to associations in that data (or events) are
found to be related, but the relationship is based on time. Unlike a
market basket analysis, which requires the items to be purchased at the
same time, in sequence discovery the items are purchased over time in
some order.
EXAMPLE 1.9
The Webmaster at the XYZ Corp. periodically analyzes the Web log data
to determine how users of the XYZ's Web pages access them. He is
interested in determining what sequences of pages are frequently
accessed. He determines that 70 percent of the users of page A follow
one of the following patterns of behavior: (A, B, C), (A, D, B, C), or
(A, E, B, C). He then determines to add a link directly from page A to page C.
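The Webmaster's analysis in Example 1.9 amounts to counting, among sessions that start at page A, how many follow a path that reaches C. The click sequences below are invented for illustration.

```python
# Fraction of sessions starting at page A that eventually reach page C.
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "B"],          # abandons before reaching C
]
starts_at_a = [s for s in sessions if s and s[0] == "A"]
reach_c = sum("C" in s for s in starts_at_a)
print(reach_c / len(starts_at_a))
```

A high fraction like this is what would justify adding a direct A-to-C link; unlike market basket analysis, the ordering of the page visits matters here.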
_____________________________________________________________________
Data Visualization
How the data mining results are presented to the users is
extremely important because the usefulness of the results is dependent
on it. Various visualization and GUI strategies are used at this last step.
Transformation techniques are used to make the data easier to mine and
more useful, and to provide more meaningful results. The actual
distribution of the data may be modified as well.
Some attribute values may be combined to provide new values, thus
reducing the complexity of the data. The use of visualization techniques
allows users to summarize, extract, and grasp more complex results than
more mathematical or text type descriptions of the results. Visualization
techniques include:
(Note: Students must submit and check the assignment within 8 days of receiving
notes on their respective WhatsApp groups.)