Lecture 6-Data Mining and Warehousing


Data Mining Functionalities

Data mining functionalities specify the kinds of patterns to be found in data mining tasks. These tasks can be
classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general
properties of the data in the database, whereas predictive mining tasks perform inference on the current data in
order to make predictions.

These functionalities are classified as follows:


– Characterization and discrimination
– Association analysis
– Classification and prediction
– Cluster analysis
– Outlier analysis
– Evolution analysis
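
To make the descriptive/predictive distinction concrete, the following is a minimal Python sketch rather than a definitive implementation. The toy customer records, the function names, and the choice of a 1-nearest-neighbour rule for prediction are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy records: (monthly_spend, visits, churned)
customers = [
    (120.0, 8, False),
    (95.0, 6, False),
    (30.0, 1, True),
    (45.0, 2, True),
]

def characterize(records):
    """Descriptive task: summarize the general properties of each class."""
    summary = {}
    for label in (False, True):
        group = [r for r in records if r[2] == label]
        summary[label] = {
            "avg_spend": mean(r[0] for r in group),
            "avg_visits": mean(r[1] for r in group),
        }
    return summary

def predict_churn(records, spend, visits):
    """Predictive task: infer the label of a new record from the current data.

    Uses a 1-nearest-neighbour rule, chosen here only for brevity.
    """
    nearest = min(
        records,
        key=lambda r: (r[0] - spend) ** 2 + (r[1] - visits) ** 2,
    )
    return nearest[2]

summary = characterize(customers)
prediction = predict_churn(customers, spend=40.0, visits=2)
```

Here `summary` describes the data already stored, while `prediction` is an inference about a record that is not in the data.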

Classification of Data Mining Systems


Data mining is an interdisciplinary field, the confluence of a set of disciplines including database systems,
statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach
used, techniques from other disciplines may be applied. Data mining research is therefore expected to generate a
large variety of data mining systems, which can be classified as follows.

Prepared by Kaje David David.kaje@kemu.ac.ke


Murithi

Classification According to the Kinds of Database Mined


Database systems themselves can be classified according to different criteria, such as data models, each of which
may require its own data mining techniques. If classifying according to the special types of data handled, we may
have a spatial, time-series, text, or World Wide Web mining system.

Classification According to the Kinds of Knowledge Mined


Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining
functionalities such as characterization, discrimination, association, classification, clustering, outlier analysis,
and evolution analysis. They can also be distinguished by the granularity or level of abstraction of the knowledge
mined.

Classification According to the Kinds of Techniques Utilized


Data mining techniques can be categorized according to the degree of user interaction involved or methods of data
analysis employed. A sophisticated data mining system will often adopt multiple data mining techniques or work out
an effective integrated technique that combines the merits of a few individual approaches.

Major Issues in Data Mining


Major issues in data mining concern mining methodology, user interaction, performance, and diverse data types. These
are described as follows.

Mining Methodologies and User Interaction Issues


These issues concern the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use
of domain knowledge, and knowledge visualization.

Mining Different Kinds of Knowledge in Databases

Since different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of
data analysis and knowledge discovery tasks, including data characterization, discrimination, association,
classification, clustering, trend and deviation analysis, and similarity analysis. These tasks may use the same
database in different ways and require the development of numerous data mining techniques.

Incorporation of Background Knowledge


Background knowledge, or information regarding the domain under study, may be used to guide the discovery
process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction.
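
One common form of background knowledge is a concept hierarchy that maps low-level values to higher-level concepts. The sketch below assumes a hypothetical city-to-country hierarchy and made-up sales figures; it only illustrates how such knowledge lets the same data be expressed at a higher level of abstraction.

```python
from collections import Counter

# Hypothetical concept hierarchy supplied as background knowledge.
CITY_TO_COUNTRY = {
    "Nairobi": "Kenya",
    "Meru": "Kenya",
    "Kampala": "Uganda",
}

# Hypothetical low-level records: (city, units_sold)
sales = [("Nairobi", 3), ("Meru", 2), ("Kampala", 4), ("Nairobi", 1)]

def roll_up(records, hierarchy):
    """Re-express city-level records at the country level of abstraction."""
    totals = Counter()
    for city, units in records:
        totals[hierarchy[city]] += units
    return dict(totals)

by_country = roll_up(sales, CITY_TO_COUNTRY)
```

The same records can be reported per city or per country; the hierarchy decides which level the discovered pattern is stated at.
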

Presentation and Visualization of Data Mining Results

Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms
so that it can be easily understood and directly used by humans. This is especially crucial if the data
mining system is to be interactive.

Handling Noisy or Incomplete Data


The data stored in a database may contain noise, exceptional cases, or incomplete data objects. When mining data
regularities, these objects may confuse the process, causing the constructed knowledge model to overfit the data. As
a result, the accuracy of the discovered patterns can be poor.

Performance Issues
The performance issues in data mining include efficiency, scalability, and parallelization of data mining algorithms.

Efficiency and Scalability of Data Mining Algorithms


To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient
and scalable. Many of the issues discussed above under mining methodology and user interaction must also take
efficiency and scalability into account.

Parallel, Distributed, and Incremental Mining Algorithms

The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining
methods are factors motivating the development of parallel and distributed data mining algorithms. Such algorithms
divide the data into partitions, which are processed in parallel. The results from the partitions are then merged.
Incremental algorithms, in turn, incorporate database updates without mining the entire data again from scratch:
they modify knowledge incrementally to amend and strengthen what was previously discovered.
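
The partition, process, and merge pattern, followed by an incremental update, can be sketched as below. This is a minimal illustration under stated assumptions: the mining task is simple item counting, the function names are hypothetical, and a thread pool stands in for genuinely parallel or distributed workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts

def mine_in_partitions(transactions, n_partitions=2):
    """Divide the data into partitions, process them concurrently, merge results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_results = list(pool.map(count_items, partitions))
    merged = Counter()
    for counts in partial_results:
        merged.update(counts)  # adds the per-partition counts together
    return merged

def update_incrementally(merged, new_transactions):
    """Fold new data into the existing result without re-mining the old data."""
    merged.update(count_items(new_transactions))
    return merged

transactions = [["bread", "milk"], ["bread"], ["milk", "eggs"], ["bread", "eggs"]]
counts = mine_in_partitions(transactions)
counts = update_incrementally(counts, [["milk"]])
```

Counting merges cleanly because partial counts simply add up; many real mining tasks need a more careful merge step.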

Issues Relating to the Diversity of Database Types


The main issues related to the diversity of database types are the handling of relational and complex types of data
and the mining of information from heterogeneous databases.

Handling of Relational and Complex Types of Data


Because relational databases are widely used, the development of efficient and effective data mining systems for such
data is important. However, other databases may contain complex data objects, hypertext and multimedia data, spatial
data, temporal data, or transaction data. It is unrealistic to expect one system to mine all kinds of data, given the
diversity of data types and the different goals of data mining.

Mining Information from Heterogeneous Databases and Global Information Systems

Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed,
and heterogeneous databases. The discovery of knowledge from different sources of structured, semistructured, or
unstructured data with diverse data semantics poses great challenges to data mining. Web mining, which uncovers
interesting knowledge about Web content and Web usage, has become a very challenging and highly dynamic field in
data mining.

Data Preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their
typically huge size, often several gigabytes or more. Real-world data tends to be dirty, incomplete, and inconsistent.
Data preprocessing is introduced to improve the quality of the data and, in turn, the accuracy and efficiency of the
subsequent data mining process. It is an important step in the knowledge discovery process, since quality decisions
must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed
can lead to huge payoffs for decision making.
There are a number of data preprocessing techniques. They are:
– Data Cleaning
– Data Integration
– Data Transformation
– Data Reduction
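
The four techniques listed above can be sketched on a toy data set as follows. Everything here is an assumption for illustration: the two hypothetical sources, the field names, and the specific choices of mean imputation for cleaning, min-max normalization for transformation, and attribute selection for reduction.

```python
from statistics import mean

# Two hypothetical sources with an overlapping record (id 2) and a missing age.
source_a = [
    {"id": 1, "age": 25, "income": 30000.0},
    {"id": 2, "age": None, "income": 50000.0},
]
source_b = [
    {"id": 3, "age": 35, "income": 70000.0},
    {"id": 2, "age": None, "income": 50000.0},
]

def integrate(*sources):
    """Data integration: combine sources, keeping one record per id."""
    seen, combined = set(), []
    for source in sources:
        for row in source:
            if row["id"] not in seen:
                seen.add(row["id"])
                combined.append(dict(row))
    return combined

def clean(rows, field):
    """Data cleaning: fill missing values with the mean of the known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    for r in rows:
        if r[field] is None:
            r[field] = fill
    return rows

def transform(rows, field):
    """Data transformation: min-max normalize a field to the range [0, 1].

    Assumes the field takes at least two distinct values.
    """
    lo = min(r[field] for r in rows)
    hi = max(r[field] for r in rows)
    for r in rows:
        r[field] = (r[field] - lo) / (hi - lo)
    return rows

def reduce_attributes(rows, keep):
    """Data reduction: keep only the attributes needed for analysis."""
    return [{k: r[k] for k in keep} for r in rows]

rows = integrate(source_a, source_b)
rows = clean(rows, "age")
rows = transform(rows, "income")
result = reduce_attributes(rows, ["age", "income"])
```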

Data Mining Query Language


The importance of designing a good data mining query language can also be seen by observing the history of relational
database systems. Relational database systems have dominated the database market for decades. The standardization
of relational query languages, which occurred at the early stages of relational database development, is widely
credited for the success of the relational database field. Hence, having a good query language may help standardize
the development of platforms for data mining systems.

Data Warehousing
A Data Warehouse (DW) is a database that stores information oriented toward satisfying decision-making requests. It
is a database with particular features concerning the data it contains and how that data is used. A very frequent
problem in enterprises is the inability to access corporate, complete, and integrated information that can satisfy
decision-making requests.

Goals of Data Warehousing


Data warehousing technology comprises a set of new concepts and tools that
support knowledge workers, such as executives, managers, and analysts, with
information for decision making. The fundamental reason for building
a data warehouse is to improve the quality of information in the organization.
The key issue is the provision of access to a company-wide view of data
wherever it resides. Data coming from internal and external sources, existing
in a variety of forms from traditional structured data to unstructured data like
text files or multimedia, is cleaned and integrated into a single repository. A
data warehouse is the consistent store of this data, which is made available to
end users in a way they can understand and use in a business context.

Characteristics of Data in Data Warehouse


Data in the Data Warehouse is integrated from various, heterogeneous operational systems like database systems,
flat files, etc. Before the integration, structural and semantic differences have to be reconciled, i.e., data have to be
“homogenized” according to a uniform data model.
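
As a rough sketch of what homogenizing can involve, the example below maps records from two hypothetical operational sources onto one uniform record type. The source layouts (a database row with `cust_name` and `amount_cents`, and a semicolon-separated flat-file line) and the `SaleRecord` type are assumptions invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SaleRecord:
    """Uniform data model used inside the warehouse (hypothetical)."""
    customer: str
    amount: float  # expressed in whole currency units

def from_database_row(row):
    """Source 1: a database row with its own field names and amounts in cents."""
    return SaleRecord(
        customer=row["cust_name"].strip().title(),  # structural: rename field
        amount=row["amount_cents"] / 100,           # semantic: convert units
    )

def from_flat_file_line(line):
    """Source 2: a flat-file line of the form 'customer;amount'."""
    name, amount = line.split(";")
    return SaleRecord(customer=name.strip().title(), amount=float(amount))

warehouse = [
    from_database_row({"cust_name": "ALICE ", "amount_cents": 1250}),
    from_flat_file_line("bob; 7.5"),
]
```

After loading, both records share the same field names, name formatting, and unit of measure, regardless of where they came from.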

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization’s situation. Three
common data warehouse architectures which are discussed in this section are:

1. Basic Data Warehouse Architecture
2. Data Warehouse Architecture with a Staging Area
3. Data Warehouse Architecture with a Staging Area and Data Marts

Data Warehouse Architecture with Staging Area and Data Marts


[Figure: data warehouse architecture with a staging area and data marts]

Data Mart
A data mart is a complete logical subset of the overall data warehouse. Data marts should be consistent in their
data representation in order to assure data warehouse robustness. A data mart is a set of tables that focus on a
single task. This may be for a department, such as a production or maintenance department, or for a single task such
as handling customer products.
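
One simple way to picture a data mart as a logical subset of the warehouse is a view restricted to one department. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table name, columns, and rows are hypothetical, and real data marts are often separate physical stores rather than views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_facts (department TEXT, item TEXT, quantity INTEGER);
INSERT INTO warehouse_facts VALUES
    ('production', 'widget', 10),
    ('maintenance', 'wrench', 2),
    ('production', 'gear', 5);

-- The data mart: only the rows that matter to the production department.
CREATE VIEW production_mart AS
    SELECT item, quantity
    FROM warehouse_facts
    WHERE department = 'production';
""")

mart_rows = conn.execute(
    "SELECT item, quantity FROM production_mart ORDER BY item"
).fetchall()
```

Because the view reads from the warehouse table, its data representation stays consistent with the warehouse by construction.
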

Classification of Data Warehouse Design

Data warehouse design can be broadly classified into two categories:
(1) logical design and (2) physical design.

Logical Design
Logical design is more conceptual and abstract than physical design. In logical design, the emphasis is on
the logical relationships among the objects. One technique that can be used to model an organization’s logical
information requirements is entity-relationship modeling. Entity-relationship modeling
involves identifying the things of importance (entities), the properties of these things (attributes), and how they
are related to one another (relationships).
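
The three ingredients of entity-relationship modeling can be illustrated with a small Python sketch. The `Customer`, `Product`, and `Order` names are hypothetical, and dataclasses are used only as a convenient notation, not as a modeling tool in their own right.

```python
from dataclasses import dataclass, field

@dataclass
class Customer:
    """Entity: a thing of importance."""
    customer_id: int  # attribute
    name: str         # attribute

@dataclass
class Product:
    """Entity."""
    product_id: int
    title: str

@dataclass
class Order:
    """Relationship: records which customer ordered which products."""
    order_id: int
    customer: Customer
    products: list = field(default_factory=list)

alice = Customer(1, "Alice")
order = Order(100, alice, [Product(10, "Lamp"), Product(11, "Desk")])
```

At this stage nothing is said about storage, indexing, or performance; those decisions belong to physical design.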

Physical Design
During the physical design process the data gathered during the logical design phase is converted into a description
of the physical database structure. Physical design decisions are mainly driven by query performance and database
maintenance aspects.

Physical Design Structures


Some of the physical design structures are: (a) tablespaces, (b) tables and partitioned tables, (c) views,
(d) integrity constraints, and (e) dimensions.
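
A few of these structures can be tried out with Python's built-in `sqlite3` module, as in the hedged sketch below. The schema is hypothetical. SQLite has no tablespaces, partitioned tables, or dimension objects, so only tables, integrity constraints, and a view are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- Tables with integrity constraints (primary key, NOT NULL, foreign key, CHECK).
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    quantity   INTEGER NOT NULL CHECK (quantity > 0)
);

-- A view that presents aggregated sales per product.
CREATE VIEW sales_by_product AS
    SELECT p.name AS name, SUM(s.quantity) AS total
    FROM sale s JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name;
""")

conn.execute("INSERT INTO product VALUES (1, 'widget')")
conn.execute("INSERT INTO sale VALUES (1, 1, 3)")
conn.execute("INSERT INTO sale VALUES (2, 1, 4)")

# The CHECK constraint rejects a non-positive quantity.
try:
    conn.execute("INSERT INTO sale VALUES (3, 1, 0)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

totals = conn.execute("SELECT name, total FROM sales_by_product").fetchall()
```

The constraints protect data quality at load time, while the view supports query convenience; both reflect the maintenance and query-performance concerns that drive physical design.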
