
UNIT I

Introduction to Data Science

Essentials of Algorithms and Data Structures

Introduction

Algorithms and data structures are foundational concepts in data science
and data mining. The focus here is on how to solve problems using the
techniques drawn from these two broad areas. An algorithm is a reliable
way of solving a type of problem that works in a predictable fashion with
the data.

A sorting algorithm, for example, can be used to sort ten values or a
billion values without any modification. Other algorithms allow us to
efficiently search a set of data or to find the lowest-cost way to connect
a series of points in a graph. Algorithms are like blueprints that we use
to solve problems in our programs.

Data structures are ways of storing data that are optimized for
particular situations. A priority queue, for example, lets us model how a
CPU schedules requests, or efficiently represent a set of cities and the
flights connecting them.

Choosing a good data structure can make a program orders of magnitude
faster than a poor choice would. Data structures are the power tools of
programming that let us drastically speed up our programs.

An algorithm is a step-by-step procedure that defines a set of
instructions to be executed in a certain order to produce the desired
output. Algorithms are generally created independently of any underlying
language; that is, an algorithm can be implemented in more than one
programming language. An algorithm is therefore a set of commands that a
computer follows to perform calculations or other problem-solving
operations.

Formally, an algorithm is a finite set of instructions carried out in a
specific order to perform a particular task. It is not the entire program
or code; it is the logic of a solution, often represented informally as a
flowchart or as pseudocode. From the data structure point of view, the
following are some important categories of algorithms (a minimal search
example is sketched after this list):

 Search: Algorithm to search for an item in a data structure.
 Sort: Algorithm to sort items in a certain order.
 Insert: Algorithm to insert an item into a data structure.
 Update: Algorithm to update an existing item in a data structure.
 Delete: Algorithm to delete an existing item from a data structure.
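As an illustration of the Search category above, here is a minimal
sketch of binary search in Python (the language, function name, and
sample list are illustrative choices, not part of the original notes). It
assumes the input list is already sorted, so each comparison halves the
remaining search range, and the same code works for ten values or a
billion.

def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2          # middle of the current range
        if items[mid] == target:
            return mid                   # found the item
        elif items[mid] < target:
            low = mid + 1                # discard the lower half
        else:
            high = mid - 1               # discard the upper half
    return -1                            # target is not in the list

values = [3, 8, 15, 21, 42, 57, 91]
print(binary_search(values, 42))   # prints 4
print(binary_search(values, 10))   # prints -1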

Characteristics of an Algorithm
Not all procedures can be called algorithms. An algorithm should have
the following characteristics –

• Unambiguous: The algorithm should be clear and unambiguous. Each of
its steps (or phases), and their inputs and outputs, should be clear and
must lead to only one interpretation.
• Input: An algorithm should have 0 or more well-defined inputs.
• Output: An algorithm should have 1 or more well-defined outputs, which
should match the desired output.
• Finiteness: An algorithm must terminate after a finite number of
steps.
• Feasibility: It should be feasible with the available resources.
• Independent: An algorithm should have step-by-step directions that
are independent of any programming code.

Data Structure: Data structures are an essential concept in computer
science and programming. Here are some reasons why they are important:

1. Efficient data processing: Data structures provide a way to organize
and store data that allows for efficient retrieval, manipulation, and
storage. For example, using a hash table to store data can provide
constant-time access on average.
2. Memory management: Proper use of data structures can help to reduce
memory usage and optimize the use of resources. For example, dynamic
arrays can use memory more efficiently than static arrays.
3. Code reusability: Data structures can be used as building blocks in
various algorithms and programs, making it easier to reuse code.
4. Abstraction: Data structures provide a level of abstraction that
allows programmers to focus on the logical structure of the data and the
operations that can be performed on it, rather than on the details of how
the data is stored and manipulated.
5. Algorithm design: Many algorithms rely on specific data structures to
operate efficiently. Understanding data structures is crucial for
designing and implementing efficient algorithms.

Data structures can be classified into two main types: primitive data
structures and non-primitive data structures.

1. Primitive data structures: These are the most basic data structures
and are usually built into programming languages. Examples include:

 Integer
 Float
 Character
 Boolean
 Double
 Void
2. Non-primitive data structures: These are complex data
structures that are built using primitive data types. Non-primitive
data structures can be further categorized into the following
types:

 Arrays: A collection of elements of the same data type, stored in
contiguous memory locations.
 Linked lists: A collection of elements that are connected by
links or pointers.
 Stacks: A collection of elements that follow the Last-In-
First-Out (LIFO) principle.
 Queues: A collection of elements that follow the First-In-
First-Out (FIFO) principle.
 Trees: A hierarchical data structure consisting of nodes
connected by edges.
 Graphs: A non-linear data structure consisting of nodes and
edges.
 Hash tables: A data structure that stores data in an
associative manner using a hash function.
 Heaps: A specialized tree-based data structure that
satisfies the heap property.
 Tries: A tree-like data structure used to store associative
arrays where the keys are strings.
 Sets: A collection of unique elements.
 Maps: An abstract data type that stores key-value pairs.
Understanding the different types of data structures and their
characteristics is important for efficient algorithm design and
implementation.

1. Linear Data Structure
2. Non-Linear Data Structure

Linear Data Structure: A linear data structure is a type of data
structure in which data elements are arranged in a sequential order, and
each element has a unique predecessor and successor, except for the
first and last elements. Linear data structures are one-dimensional and
can be traversed sequentially from the first to the last element.
Elements are arranged along a single (linear) dimension.
Examples: lists, stacks, queues, etc.
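As a small illustration (a minimal Python sketch, not part of the
original notes), a list can serve as a LIFO stack and collections.deque
as a FIFO queue, two of the linear structures named above.

from collections import deque

stack = []                  # LIFO: the last element pushed is the first popped
stack.append("first")
stack.append("second")
print(stack.pop())          # prints: second

queue = deque()             # FIFO: the first element added is the first removed
queue.append("first")
queue.append("second")
print(queue.popleft())      # prints: first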
Non-Linear Data Structure:
A Non-linear data structure is a type of data structure in which data
elements are not arranged in a sequential order, and each element may
have one or more predecessors and successors. Non-linear data
structures can represent complex relationships between data elements,
such as hierarchies, networks, and graphs. Elements are arranged in
one-many, many-one and many-many dimensions.
Examples: tree, graph, table, etc.
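As a small illustration of a non-linear structure (a minimal Python
sketch with invented city names), a graph can be stored as an adjacency
list in which each node may have several neighbours rather than a single
successor.

# adjacency list: each city maps to the cities it has direct flights to
flights = {
    "Pune": ["Mumbai", "Delhi"],
    "Mumbai": ["Pune", "Delhi", "Chennai"],
    "Delhi": ["Pune", "Mumbai"],
    "Chennai": ["Mumbai"],
}

for city, neighbours in flights.items():
    print(city, "->", ", ".join(neighbours))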

Programming Paradigms

A paradigm can be described as a method of solving a problem or
performing a task. A programming paradigm is an approach to solving
problems using a programming language; in other words, it is a method of
solving a problem using the tools and techniques available to us while
following a particular approach.

There are many programming languages, but all of them follow some
strategy when they are implemented, and this methodology or strategy is
the paradigm. Across the variety of programming languages there are many
paradigms to meet different demands. They are discussed below.
Imperative Programming Paradigm: This is one of the oldest programming
paradigms. It is closely related to machine architecture and is based on
the Von Neumann architecture. It works by changing the program state
through assignment statements, performing the task step by step by
changing state. The main focus is on how to achieve the goal. The
paradigm consists of several statements, and after all of them have
executed, the result is stored.
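As a small illustration (a minimal Python sketch, not part of the
original notes), the following computes a sum imperatively: the program
state, held in the variable total, is changed step by step through
assignment statements.

total = 0                     # program state
for value in [4, 8, 15, 16]:
    total = total + value     # state changes via assignment, one step at a time
print(total)                  # prints 43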

Advantages
1. Very simple to implement
2. It provides loops, variables, etc.
Disadvantages
1. Complex problems are harder to express and maintain
2. Often less efficient and less productive for large programs
3. Offers little direct support for parallel programming
Imperative programming is divided into three broad categories:
procedural, object-oriented, and parallel processing. These paradigms are
as follows:

Procedural Programming Paradigm: This paradigm emphasizes procedures in
terms of the underlying machine model. There is little difference between
the procedural and imperative approaches. Its ability to reuse code
through procedures was a major benefit at the time it was in widespread
use.
Object-Oriented Programming Paradigm: The program is written as a
collection of classes and objects that communicate with one another. The
smallest basic entity is the object, and all computation is performed on
objects. The emphasis is on data rather than on procedure. It can model
most kinds of real-life problems encountered today; a short Python sketch
follows the list of advantages below.
Advantages:
 Data security
 Inheritance
 Code reusability
 Flexibility, with abstraction also present
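A minimal sketch in Python (the class and method names are invented for
illustration): computation is expressed as operations on an object, and
the object's data stays behind its interface.

class BankAccount:
    """A tiny class: data (a balance) plus the operations allowed on it."""
    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance          # kept "private" by convention

    def deposit(self, amount):
        self._balance += amount          # the object updates its own state

    def balance(self):
        return self._balance

account = BankAccount("Asha", 100)
account.deposit(50)
print(account.owner, account.balance())  # prints: Asha 150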

Parallel Processing Programming Paradigm: Parallel processing is the
processing of program instructions by dividing them among multiple
processors. A parallel processing system possesses many processors, with
the objective of running a program in less time by dividing the work
among them. The approach is similar in spirit to divide and conquer.
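A minimal sketch using Python's standard multiprocessing module (the
worker function and data are invented for illustration): the inputs are
divided among a pool of worker processes.

from multiprocessing import Pool

def square(n):
    # each item is processed independently, so the work can run in parallel
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four worker processes
        results = pool.map(square, range(10))  # divide the inputs among them
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]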
Declarative Programming Paradigm: It is divided into logic, functional,
and database programming. In computer science, declarative programming is
a style of building programs that expresses the logic of a computation
without describing its control flow. It often treats programs as theories
of some logic, and it may simplify writing parallel programs.

The focus is on what needs to be done rather than on how it should be
done; the code declares the result we want rather than the steps by which
it is to be produced.

This is the essential difference between the imperative (how to do it)
and declarative (what to do) programming paradigms. Going deeper, we look
at logic, functional, and database programming.

Logic Programming Paradigm: This can be described as an abstract model of
computation. It is used to solve logical problems such as puzzles,
series, and so on. In logic programming we have a knowledge base that is
known in advance; the knowledge base and a question are given to the
machine, and it produces the result.

In most programming languages such a concept of a knowledge base is not
available, but in artificial intelligence and machine learning some
models, such as the perceptron model, use a similar mechanism.
In logic programming the main emphasis is on the knowledge base and the
problem. The execution of the program is very much like the proof of a
mathematical statement, e.g., in Prolog.

Functional Programming Paradigm: The functional paradigm has its roots in
mathematics and is language independent. Its key principle is the
execution of a series of mathematical functions. The central model for
abstraction is the function, which stands for some specific computation
rather than the data structure. Data are loosely coupled to functions,
functions hide their implementation, and a function can be replaced with
its value without changing the meaning of the program. Languages such as
Perl and JavaScript support this paradigm.
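A minimal sketch in Python (illustrative only): pure functions are
composed with filter and map, and no variable is reassigned along the
way; the code declares the transformation rather than the loop that
performs it.

prices = [120, 45, 300, 75]

def with_tax(amount, rate=0.18):
    # a pure function: the result depends only on its inputs
    return amount * (1 + rate)

# declare the result as a composition of functions
expensive_with_tax = list(map(with_tax, filter(lambda p: p > 100, prices)))
print(expensive_with_tax)   # [141.6, 354.0]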
Database/Data-Driven Programming Approach: This programming methodology
is based on data and its movement. Program statements are driven by the
data rather than by a hard-coded series of steps. A database program is
the heart of a business information system and provides file creation,
data entry, update, query, and reporting functions.
Several programming languages have been developed mostly for database
applications, SQL being a prime example. It is applied to streams of
structured data for filtering, transforming, and aggregating (such as
computing statistics), or for calling other programs, so it has wide
application.
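A minimal sketch using Python's built-in sqlite3 module (the table and
column names are invented for illustration): the SQL query declares the
grouping and aggregation, and the result is driven entirely by whatever
data is in the table.

import sqlite3

conn = sqlite3.connect(":memory:")    # throwaway in-memory database
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("bread", 40.0), ("jelly", 25.0), ("bread", 60.0)])

# declare what we want: total sales per product, largest first
query = ("SELECT product, SUM(amount) FROM sales "
         "GROUP BY product ORDER BY SUM(amount) DESC")
for product, total in conn.execute(query):
    print(product, total)
# bread 100.0
# jelly 25.0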
Imperative and object oriented: Object-oriented imperative languages
developed by combining the need for classes with the need for safe
functional programming. In an object-oriented language, a function is
assigned to a class; an assigned function is then referred to as a
method, member function, or operation. Object-oriented programming is the
execution of operations on objects. In general, purely imperative code
spells out step-by-step changes to program state, whereas object-oriented
code groups that state and the operations on it into objects.

Data mining models & techniques:

The amount of data kept in computer files and databases is growing at a
phenomenal rate. At the same time, the users of these data are expecting
more sophisticated information from them. A marketing manager is no
longer satisfied with a simple listing of marketing contacts, but wants
detailed information about customers' past purchases as well as
predictions for future purchases. Simple structured query language (SQL)
queries are not adequate to support these increased demands for
information. Data mining steps in to solve these needs. Data mining is
often defined as finding hidden information in a database. Alternatively,
it has been called exploratory data analysis, data-driven discovery, and
deductive learning. Traditional database queries (Figure 1.1) access a
database using a well-defined query stated in a language such as SQL. The
output of that query consists of the data from the database that
satisfies the query. The output is usually a subset of the database, but
it may also be an extracted view or may contain aggregations. Data mining
access of a database differs from this traditional access in several
ways:

• Query: The query might not be well formed or precisely stated. The
data miner might not even be exactly sure of what he wants to see.
• Data: The data accessed is usually a different version from that of
the original operational database. The data have been cleansed and
modified to better support the mining process.
• Output: The output of the data mining query probably is not a subset
of the database. Instead it is the output of some analysis of the
contents of the database.
The current state of the art of data mining is similar to that of
database query processing in the late 1960s and early 1970s. Over the
next decade there undoubtedly will be great strides in this area.

Although data mining is currently in its infancy, over the last decade we
have seen a proliferation of mining algorithms, applications, and
algorithmic approaches. Example 1.1 illustrates one such application.

EXAMPLE 1.1
Credit card companies must determine whether to authorize credit card
purchases. Suppose that based on past historical information about
purchases, each purchase is placed into one of four classes: (1) authorize,
(2) ask for further identification before authorization, (3) do not
authorize, and (4) do not authorize but contact police. The data mining
functions here are twofold. First the historical data must be examined to
determine how the data fit into the four classes. Then the problem is to
apply this model to each new purchase. Although the second part indeed
may be stated as a simple database query, the first part cannot be.

Data mining involves many different algorithms to accomplish different
tasks. All of these algorithms attempt to fit a model to the data. The
algorithms examine the data and determine a model that is closest to the
characteristics of the data being examined.
Data mining algorithms can be characterized as consisting of three parts:
• Model: The purpose of the algorithm is to fit a model to the data.
• Preference: Some criterion must be used to prefer one model over
another.
• Search: All algorithms require some technique to search the data.
In Example 1.1 the data are modeled as divided into four classes.
The search requires examining past data about credit card purchases and
their outcome to determine what criteria should be used to define the
class structure. The preference will be given to criteria that seem to fit
the data best. For example, we probably would want to authorize
a credit card purchase for a small amount of money with a credit card
belonging to a long-standing customer. Conversely, we would not want to
authorize the use of a credit card to purchase anything if the card has
been reported as stolen. The search process requires that the criteria
needed to fit the data to the classes be properly defined.
As seen in Figure 1.2, the model that is created can be either
predictive or descriptive in nature. In this figure, we show under each
model type some of the most common data mining tasks that use that
type of model.

A predictive model makes a prediction about values of data using known
results found from different data. Predictive modeling may be based on
the use of other historical data. For example, a credit card use might be
refused not because of the user's own credit history, but because the
current purchase is similar to earlier purchases that were subsequently
found to have been made with stolen cards. Example 1.1 uses predictive
modeling to predict credit risk. Predictive model data mining tasks
include classification, regression, time series analysis, and prediction.
(Prediction may also be used to indicate a specific type of data mining
function, as discussed below.) A descriptive model identifies patterns or
relationships in data. Unlike the predictive model, a descriptive model
serves as a way to explore the properties of the data examined, not to
predict new properties. Clustering, summarization, association rules, and
sequence discovery are usually viewed as descriptive in nature.

BASIC DATA MINING TASKS


In the following paragraphs we briefly explore some of the data
mining functions. We follow the basic outline of tasks shown in Figure 1.2.
This list is not intended to be exhaustive, but rather illustrative. Of
course, these individual tasks may be combined to obtain more
sophisticated data mining applications.

Classification

Classification maps data into predefined groups or classes. It is often
referred to as supervised learning because the classes are determined
before examining the data. Two examples of classification applications
are determining whether to make a bank loan and identifying credit risks.
Classification algorithms require that the classes be defined based on
data attribute values. Pattern recognition is a type of classification
where an input pattern is classified into one of several classes based on
its similarity to these predefined classes. Example 1.1 illustrates a
general classification problem. Example 1.2 shows a simple example of
pattern recognition.

EXAMPLE 1.2
An airport security screening station is used to determine if passengers
are potential terrorists or criminals. To do this, the face of each
passenger is scanned and its basic pattern (distance between eyes, size
and shape of mouth, shape of head, etc.) is identified. This pattern is
compared to entries in a database to see if it matches any patterns that
are associated with known offenders.
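To make the classification idea concrete, here is a minimal sketch in
Python (the features, labels, and the simple nearest-neighbour rule are
illustrative choices, not a method prescribed in the notes): historical
purchases with known classes are used to assign a class to a new
purchase, as in Example 1.1.

# each historical purchase: (amount, years the customer has held the card) -> class
history = [
    ((20.0, 10.0), "authorize"),
    ((950.0, 0.5), "ask for identification"),
    ((3000.0, 0.1), "do not authorize"),
    ((45.0, 7.0), "authorize"),
]

def classify(purchase):
    """Assign the class of the most similar historical purchase (1-nearest neighbour)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, label = min(history, key=lambda item: distance(item[0], purchase))
    return label

print(classify((30.0, 8.0)))     # authorize
print(classify((2800.0, 0.2)))   # do not authorize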

Regression
Regression is used to map a data item to a real-valued prediction
variable. In fact, regression involves learning the function that does
this mapping. Regression assumes that the target data fit into some known
type of function (e.g., linear, logistic, etc.) and then determines the
best function of this type that models the given data. Example 1.3 is a
simple example of regression.

EXAMPLE 1.3
A college professor wishes to reach a certain level of savings before
his/her retirement. Periodically, he/she predicts what his/her retirement
savings will be based on their current value and several past values.
He/she uses a simple linear regression formula to predict this value by
fitting past behavior to a linear function and then using this function
to predict the values at points in the future. Based on these values,
he/she then alters his/her investment portfolio.
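A minimal sketch of this approach in Python (the yearly savings figures
and the helper function are invented for illustration): past savings are
fitted to a straight line by ordinary least squares, and the line is then
used to predict a future value.

# yearly retirement savings observed so far (year index, amount in thousands)
years = [0, 1, 2, 3, 4]
savings = [50.0, 58.0, 65.0, 74.0, 81.0]

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(years, savings)
print(round(slope * 10 + intercept, 1))   # predicted savings at year 10 (128.0)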

Time Series Analysis

With time series analysis, the value of an attribute is examined as it
varies over time. The values usually are obtained as evenly spaced time
points (daily, weekly, hourly, etc.). Three basic functions are performed
in time series analysis. First, distance measures are used to determine
the similarity between different time series. Second, the structure of
the line is examined to determine (and perhaps classify) its behavior.
Third, the historical time series plot is used to predict future values.
A time series example is given in Example 1.4.

EXAMPLE 1.4
Mr. Smith is trying to determine whether to purchase stock from Companies
X, Y, or Z. For a period of one month he charts the daily stock price for
each company and generates a time series plot for each. Using this and
similar information available from his stockbroker, Mr. Smith decides to
purchase stock from X because it is less volatile while overall showing a
slightly larger relative amount of growth than either of the other
stocks.
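A minimal sketch in Python of the kind of comparison Mr. Smith might make
(the daily prices are invented): the series whose day-to-day changes vary
least is the least volatile, and the difference between the last and
first price gives the overall growth.

from statistics import pstdev

prices = {
    "X": [100, 101, 102, 104, 105, 107],
    "Y": [100, 96, 105, 98, 108, 101],
}

def volatility(series):
    """Standard deviation of the day-to-day price changes."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return pstdev(changes)

for company, series in prices.items():
    print(company, round(volatility(series), 2), series[-1] - series[0])
# X 0.49 7   -- less volatile, more growth
# Y 7.68 1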

Prediction
Many real-world data mining applications can be seen as predicting future
data states based on past and current data. Prediction can be viewed as a
type of classification. (Note: This is a data mining task that is different
from the prediction model, although the prediction task is a type of
prediction model.) The difference is that prediction is predicting a
future state rather than a current state. Here we are referring to a
type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flooding,
speech recognition, machine learning, and pattern recognition.
Although future values may be predicted using time series analysis or
regression techniques, other approaches may be used as well. Example 1.5
illustrates the process.

EXAMPLE 1.5
Predicting flooding is a difficult problem. One approach uses monitors
placed at various points in the river. These monitors collect data relevant
to flood prediction: water level, rain amount, time, humidity, and so on.
Then the water level at a potential flooding point in the river can be
predicted based on the data collected by the sensors upriver from this
point. The prediction must be made with respect to the time the data
were collected.

Clustering
Clustering is similar to classification except that the groups are not
predefined, but rather defined by the data alone. Clustering is
alternatively referred to as unsupervised learning or segmentation. It
can be thought of as partitioning or segmenting the data into groups that
might or might not be disjointed. The clustering is usually accomplished
by determining the similarity among the data on predefined attributes.
The most similar data are grouped into clusters. Example 1.6 provides a
simple clustering example. Since the clusters are not predefined, a
domain expert is often required to interpret the meaning of the created
clusters.

EXAMPLE 1.6
A certain national department store chain creates special catalogs
targeted to various demographic groups based on attributes such as
income, location, and physical characteristics of potential customers
(age, height, weight, etc.). To determine the target mailings of the
various catalogs and to assist in the creation of new, more specific
catalogs, the company performs a clustering of potential customers
based on the determined attribute values. The results of the clustering
exercise are then used by management to create special catalogs and
distribute them to the correct target population based on the cluster
for that catalog.
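A minimal sketch of such a clustering in Python, assuming the
scikit-learn library is installed (the attribute values and the choice of
the k-means algorithm are illustrative, not prescribed by the notes):
customers described by income and age are partitioned into groups defined
only by the data.

from sklearn.cluster import KMeans

# (annual income in thousands, age) for a few potential customers
customers = [[30, 22], [32, 25], [35, 24],
             [90, 50], [95, 48], [88, 53]]

# ask for two clusters; labels say which cluster each customer fell into
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print(labels)   # e.g. [0 0 0 1 1 1] -- two groups found from the data alone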
A special type of clustering is called segmentation. With segmentation a
database is partitioned into disjointed groupings of similar tuples called
segments. Segmentation is often viewed as being identical to clustering.
In other circles segmentation is viewed as a specific type of clustering
applied to a database itself. In this text we use the two terms,
clustering and segmentation, interchangeably.

Summarization
Summarization maps data into subsets with associated simple
descriptions. Summarization is also called characterization or
generalization. It extracts or derives representative information about
the database. This may be accomplished by actually retrieving portions
of the data. Alternatively, summary type information (such as the mean
of some numeric attribute) can be derived from the data. The
summarization succinctly characterizes the contents of the database.
Example 1.7 illustrates this process.

EXAMPLE 1.7
One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the
student body.
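A minimal sketch in Python (the scores are invented): a single summary
statistic per group characterizes the records without listing them all.

from statistics import mean

sat_scores = {
    "University A": [1280, 1350, 1310, 1400],
    "University B": [1150, 1190, 1230, 1170],
}

# one representative number per university summarizes all of its records
for university, scores in sat_scores.items():
    print(university, round(mean(scores)))
# University A 1335
# University B 1185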

Association Rules
Link analysis, alternatively referred to as affinity analysis or association,
refers to the data mining task of uncovering relationships among data.
The best example of this type of application is to determine association
rules. An association rule is a model that identifies specific types of data
associations. These associations are often used in the retail sales
community to identify items that are frequently purchased together.
Associations are also used in many other applications such as
predicting the failure of telecommunication switches.

EXAMPLE 1.8
A grocery store retailer is trying to decide whether to put bread on sale.
To help determine the impact of this decision, the retailer generates
association rules that show what other products are frequently
purchased with bread. He finds that 60% of the times that bread is sold
so are pretzels and that 70% of the time jelly is also sold. Based on
these facts, he tries to capitalize on the association between bread,
pretzels, and jelly by placing some pretzels and jelly at the end of the
aisle where the bread is placed.
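A minimal sketch in Python of how such percentages could be computed from
raw transactions (the baskets are invented): the confidence of the rule
bread -> pretzels is the fraction of bread-containing baskets that also
contain pretzels.

transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"milk", "jelly"},
    {"bread", "milk", "jelly"},
]

bread_baskets = [t for t in transactions if "bread" in t]
support_bread = len(bread_baskets) / len(transactions)
conf_pretzels = sum("pretzels" in t for t in bread_baskets) / len(bread_baskets)
conf_jelly = sum("jelly" in t for t in bread_baskets) / len(bread_baskets)

print(support_bread)   # 0.8  -- bread appears in 4 of the 5 baskets
print(conf_pretzels)   # 0.5  -- half the bread baskets also contain pretzels
print(conf_jelly)      # 0.75 -- three quarters also contain jelly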

Users of association rules must be cautioned that these are not causal
relationships. They do not represent any relationship inherent in the
actual data (as is true with functional dependencies) or in the real
world. There probably is no relationship between bread and pretzels that
causes them to be purchased together. And there is no guarantee that this
association will apply in the future. However, association rules can be
used to assist retail store management in effective advertising,
marketing, and inventory control.

Sequence Discovery
Sequential analysis or sequence discovery is used to determine sequential
patterns in data. These patterns are based on a time sequence of actions.
These patterns are similar to associations in that data (or events) are
found to be related, but the relationship is based on time. Unlike a
market basket analysis, which requires the items to be purchased at the
same time, in sequence discovery the items are purchased over time in
some order.

EXAMPLE 1.9
The Webmaster at the XYZ Corp. periodically analyzes the Web log data
to determine how users of the XYZ's Web pages access them. He is
interested in determining what sequences of pages are frequently
accessed. He determines that 70 percent of the users of page A follow
one of the following patterns of behavior: (A, B, C), (A, D, B, C), or
(A, E, B, C). He then decides to add a link directly from page A to page
C.
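A minimal sketch in Python (the session data are invented): among
sessions that start at page A, count how often each ordered pattern of
page accesses occurs.

from collections import Counter

sessions = [
    ("A", "B", "C"),
    ("A", "D", "B", "C"),
    ("A", "B", "C"),
    ("A", "E", "B", "C"),
    ("A", "F"),
]

starting_at_a = [s for s in sessions if s[0] == "A"]
for pattern, count in Counter(starting_at_a).most_common():
    print(pattern, round(100 * count / len(starting_at_a)), "%")
# ('A', 'B', 'C') 40 %  ... and so on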

Data Visualization
How the data mining results are presented to the users is extremely
important because the usefulness of the results depends on it. Various
visualization and GUI strategies are used at this last step.
Transformation techniques are used to make the data easier to mine and
more useful, and to provide more meaningful results; the actual
distribution of the data may be transformed, and some attribute values
may be combined to provide new values, thus reducing the complexity of
the data. The use of visualization techniques allows users to summarize,
extract, and grasp more complex results than more mathematical or textual
descriptions would allow. Visualization techniques include the following
(a small graphical example follows the list):

• Graphical: Traditional graph structures, including bar charts, pie
charts, histograms, and line graphs, may be used.
• Geometric: Geometric techniques include the box plot and scatter
diagram.
• Icon-based: Using figures, colors, or other icons can improve the
presentation of the results.
• Pixel-based: With these techniques each data value is shown as a
uniquely colored pixel.
• Hierarchical: These techniques hierarchically divide the display area
(screen) into regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.
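A small graphical example in Python, assuming the matplotlib library is
installed (the category counts are invented): a traditional bar chart
summarizes group sizes at a glance.

import matplotlib.pyplot as plt

cluster_sizes = {"Cluster 1": 420, "Cluster 2": 180, "Cluster 3": 95}

plt.bar(list(cluster_sizes.keys()), list(cluster_sizes.values()))  # bar chart
plt.ylabel("Number of customers")
plt.title("Customers per cluster")
plt.show()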

Software Engineering trends & techniques

Software engineering is the systematic application of engineering
approaches to the development of software. A software engineer is a
person who applies the principles of software engineering to design,
develop, maintain, test, and evaluate computer software. The term
programmer is sometimes used as a synonym, but it may lack the
connotations of engineering education or skills.

Engineering techniques are used to inform the software development
process, which involves the definition, implementation, assessment,
measurement, management, change, and improvement of the software life
cycle process itself. It heavily uses software configuration management,
which is about systematically controlling changes to the configuration
and maintaining the integrity and traceability of the configuration and
code throughout the system life cycle. Modern processes use software
versioning. In software engineering, a software development process is
the process of dividing software development work into smaller, parallel
or sequential steps or sub-processes to improve design and product
management. It is also known as a software development life cycle (SDLC).
The methodology may include the pre-definition of specific deliverables
and artefacts that are created and completed by a project team to develop
or maintain an application.
Most modern development processes can be loosely described as agile.
Other methodologies include waterfall, prototyping, iterative and
incremental development, spiral development, rapid application
development, and extreme programming.
A life-cycle "model" is sometimes considered a more general term for a
category of methodologies, and a software development "process" a more
specific term that refers to a specific process chosen by a specific
organization. For example, there are many specific software development
processes that fit the spiral life-cycle model. The field is often
considered a subset of the systems development life cycle.
The basic principles of software engineering are applied in the
development of algorithms, applications, and tools in data mining.

Important Questions: (Assignment: 01)

Q.1: What is an algorithm? Explain in detail.

Q.2: Explain data structure in detail.

Q.3: What is Programming paradigm? Explain in detail.

Q.4: Explain data mining task and techniques with examples.

Q.5: Write a note on Software Engineering trends.

(Note: Students must submit the assignment for checking within 8 days of
receiving the notes on the respective WhatsApp groups.)
