
Unit-2 Finalized


Uploaded by

Tandav
Data Warehousing and Data Mining Reference Note

Unit-2: Introduction to Data Mining

What is Data Mining?
Data mining is the process of discovering interesting patterns and knowledge from huge amounts of data. Data mining is one of the essential steps in the process of KDD (Knowledge Discovery in Databases).

Why Data Mining? (Motivation)
* Data mining helps to turn huge amounts of data into useful information and knowledge that can serve different applications.
* Data mining helps in:
  a. Automatic discovery of patterns
  b. Prediction of likely outcomes
  c. Creation of actionable information
* Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

Types of Data That Can Be Mined
Different kinds of data can be mined. Some examples are given below:
* Flat Files: Flat files are in binary or text form and have a structure that can be easily extracted by data mining algorithms. The data stored in a flat file has no relationship or path to other data. Flat files are described by a data dictionary. E.g. a CSV file. They are often used in data warehousing to store data, to carry data to and from servers, etc.
* Relational Databases: A relational database is a data collection organized into tables with rows and columns. The physical schema of a relational database is the schema that defines the structure of the tables. The logical schema is the schema that defines the relationships between tables.
* Data Warehouses: A data warehouse is a collection of data integrated from multiple (often heterogeneous) sources that supports queries and decision making. Data warehouses are of three types: enterprise data warehouses, data marts, and virtual warehouses. They are widely used in everyday business decision-making.
* Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items.
This type of database can roll back (undo) its operations when a transaction is not completed or committed. Object databases, ATMs, banking, and distributed systems are well-known applications of transactional databases.
* Multimedia Databases: Multimedia databases include video, images, audio and text media. They can be stored in object-oriented databases. E-book databases, video website databases, news website databases, etc. are well-known applications of multimedia databases.
* Spatial Databases: Spatial databases store geographical information such as maps and global or regional positioning. They store data in the form of coordinates, topology, lines, polygons, etc.

Collegenote. Prepared By: Jayanta Poudel

Data Mining Architecture
The major components of a data mining system architecture are as follows:

[Fig: Architecture of a typical data mining system]

* Database, Data Warehouse or Other Information Repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
* Database or Data Warehouse Server: It fetches the data that is relevant to the user's data mining request.
* Knowledge Base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of the resulting patterns. It is typically stored as a set of rules.
* Data Mining Engine: It performs data mining tasks such as characterization, association, classification, prediction, cluster analysis, etc.
* Pattern Evaluation Module: It is responsible for finding interesting patterns in the data using a threshold value. It interacts with the data mining engine to focus the search on interesting patterns.
* Graphical User Interface: This module handles communication between the user and the data mining system. It lets users specify a data mining query or task and browse database or data warehouse schemas.

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive.
* Descriptive mining tasks characterize the general properties of the data in the database.
* Predictive mining tasks perform inference on the current data in order to make predictions.

Data mining functionalities, or the kinds of patterns that can be mined, are as follows:

1. Class/Concept Description: Data can be associated with classes or concepts that can be described in summarized, concise and yet precise terms. Such descriptions of a class or concept are called class/concept descriptions. These descriptions can be derived via:
   - Data Characterization: Characterization is a summarization of the general characteristics or features of a target class of data, which produces what is called a characteristic rule.
   - Data Discrimination: Discrimination is a comparison of the general features of the target class data objects with the general features of objects from one or a set of contrasting classes.
2. Association analysis on frequent patterns: Frequent patterns are patterns that occur frequently in data. Association analysis aims to discover associations between items that frequently occur together. E.g.
   buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
   where X is a variable representing a customer. Support = 1% means that 1% of all transactions contain both a computer and software. Confidence = 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.
3.
Classification and Prediction: Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived from the analysis of a set of training data and is used to predict the class label of objects whose class label is unknown. Prediction is used to predict missing or unavailable numeric data values rather than class labels. Regression analysis is the statistical methodology most often used for numeric prediction, although other methods exist as well.
4. Cluster Analysis / Clustering: Clustering analyzes data objects without consulting class labels. It can be used to generate class labels for a group of data that had none at the beginning. Objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. That is, clusters are formed so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.
5. Outlier Analysis: Outliers are objects that do not comply with the general behavior or model of the data. Most data mining methods discard outliers as noise or exceptions. However, in some applications these rare events are more interesting than the regular ones. The analysis of outlier data is referred to as outlier analysis. E.g. fraud detection.
6. Evolution Analysis: Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
This may include characterization, discrimination, association and correlation analysis, classification, prediction or clustering of time-related data. Distinct features of such analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Knowledge Discovery in Database (KDD)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data.

[Fig: KDD process]

The steps involved in the knowledge discovery process:
1. Data Cleaning: Data cleaning is the process of removing unnecessary and inconsistent data from the databases. Its main purpose is to improve the quality of the data by filling in missing values and bringing the data into a consistent format.
2. Data Integration: In this step, data from various sources such as databases, data warehouses and transactional data are combined.
3. Data Selection: The data required for the data mining process may come from multiple, heterogeneous data sources such as databases, files, etc. Data selection is the process of fetching the data relevant to the analysis from these sources.
4. Data Transformation: In the transformation stage, data extracted from multiple data sources are converted into a format appropriate for the data mining process. Data reduction or summarization is used to decrease the number of possible values of the data without affecting its integrity.
5. Data Mining: This is the most essential step of the KDD process, in which intelligent methods are applied to extract hidden patterns from the data stored in databases.
6. Pattern Evaluation: This step identifies the truly interesting patterns representing knowledge on the basis of some interestingness measures. Support and confidence are two widely used interestingness measures.
These patterns are helpful for decision support systems.
7. Knowledge Presentation: In this step, visualization and knowledge representation techniques are used to present the mined knowledge to users. Visualizations can be in the form of graphs, charts or tables.

Classification of Data Mining System
A data mining system can be classified according to the following criteria:
1. Classification according to the kind of databases mined
   Database systems can be classified according to different criteria such as data models, types of data, etc., and a data mining system can be classified accordingly. For example, if we classify databases according to the data model, we may have a relational, transactional, object-relational, or data warehouse mining system.
2. Classification according to the kind of knowledge mined
   This means that data mining systems are classified on the basis of functionalities such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis.
3. Classification according to the kinds of techniques utilized
   These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.
4. Classification according to the applications adapted
   These applications include finance, telecommunications, DNA, stock markets, and e-mail.

Issues in Data Mining
In data mining, the algorithms used are complex and the data is not available from a single source, so these factors also create some issues.
The major issues are:

[Fig: Major issues in data mining, grouped into mining methodology and user interaction issues, performance issues, and diverse data types issues]

Mining Methodology and User Interaction Issues
a) Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
b) Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive, so that users can focus the search for patterns by providing and refining data mining requests based on the returned results.
c) Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
d) Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
e) Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
f) Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.
g) Pattern evaluation: The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little value.

Performance Issues
a) Efficiency and scalability of data mining algorithms: In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
b) Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.

Diverse Data Types Issues
a) Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
b) Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Data Object and Attribute Types

Data Objects
Data sets are made up of data objects. A data object represents an entity; in a sales database, the objects may be customers, store items, and sales.
Data objects are typically described by attributes. If the data objects are stored in a database, they are data tuples.

Attribute
An attribute is a data field representing a characteristic or feature of a data object. Attributes describing a customer object can include, for example, customer ID, name, and address. On the basis of the set of possible values, attributes can be divided into the following types:

[Fig: Types of attributes. Recoverable labels: Qualitative, Quantitative, Nominal, Ordinal, Discrete, Continuous, Symmetric]

1) Nominal Attributes: Nominal means "relating to names." The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. The values do not have any meaningful order. E.g.
   - Hair_color: possible values are {black, brown, red, grey, white}
   - Marital_status: possible values are {Married, Single, Divorced, Widowed}
2) Binary Attributes: A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent and 1 means that it is present. E.g. given the attribute smoker describing a patient object, 1 indicates that the patient smokes, while 0 indicates that the patient does not.
   - A binary attribute is symmetric if both of its states are equally valuable. E.g. the attribute gender having the states male and female.
   - A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the positive (1) and negative (0) outcomes of a medical test for HIV.
3) Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. E.g. Height: possible values are {Tall, Medium, Short}.
The values have a meaningful sequence (ordered by height); however, we cannot tell from the values how much bigger, say, a medium is than a short. Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on).
4) Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
   - Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. E.g. calendar date (2002 and 2010 are 8 years apart).
   - Ratio-Scaled Attributes: If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode. E.g. frequency of words in a document.
5) Discrete versus Continuous Attributes: A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. The attributes hair_color, smoker and medical_test each have a finite number of values, and so are discrete. A continuous attribute has an infinite number of possible values. Continuous attributes are typically represented as floating-point variables. E.g. the attribute Height with values such as 5.4, ..., 6.5, etc.

Statistical Description of Data
Basic statistical descriptions of data can be used to identify properties of the data and to highlight which data values should be treated as noise or outliers. Basic statistical descriptions include Measure of Central Tendency and Measure of Dispersion.
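As a rough illustration of the first group of descriptions, the sketch below computes the mean, median, mode(s) and midrange with Python's standard-library `statistics` module. The twelve values are the ones used in this note's worked example on central tendency; the function name `central_tendency` is only an illustrative choice and does not come from the note.

```python
import statistics


def central_tendency(values):
    """Return the mean, median, modes and midrange of a list of numbers."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        # multimode lists every value tied for the highest frequency.
        # If all values are distinct it lists all of them, which the
        # note describes as having no mode.
        "modes": statistics.multimode(ordered),
        # Midrange: average of the smallest and largest values.
        "midrange": (ordered[0] + ordered[-1]) / 2,
    }


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = central_tendency(data)
print(summary)
```

Here the mean (58) lies above the median (54) because the single large value 110 pulls it upward, which is one way such descriptions can flag a possible outlier.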
Measure of Central Tendency
A measure of central tendency locates the middle or center of a data distribution. Measures of central tendency include the mean, median, mode, and midrange.

* Mean: The mean is the most common and effective numeric measure of the "center" of a set of data. Let $x_1, x_2, \dots, x_N$ be the set of N observed values for X. The mean of this set of values is
  $$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
  If each $x_i$ is associated with a weight $w_i$ for $i = 1, \dots, N$, then the weighted mean is
  $$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$
* Median: For skewed data, a better measure of the center is the median, which is the middle value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower half. Suppose that a given data set of N values for an attribute X is sorted in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then the median is not unique; it is the two middlemost values and any value in between. If X is a numeric attribute in this case, by convention, the median is taken as the average of the two middlemost values.
* Mode: The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for both qualitative and quantitative attributes. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal numeric data, we have the following empirical relation:
  $$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
* Midrange: The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set.

Example: Let the values be 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
* Mean: $\bar{x} = \frac{30+36+47+50+52+52+56+60+63+70+70+110}{12} = \frac{696}{12} = 58$
* Median $= \frac{52+56}{2} = 54$
* Mode: The given data are bimodal. The two modes are 52 and 70.
* Midrange $= \frac{30+110}{2} = 70$

Measure of Dispersion
Measures of dispersion indicate how much the observed data is spread out around a measure of central tendency. The measures include range, quantiles, quartiles, percentiles, and the interquartile range. Variance and standard deviation also indicate the spread of a data distribution.

* Range: The range of the set is the difference between the largest (max()) and smallest (min()) values.
  Example: 1, 3, 5, 6, 7 gives Range = 7 − 1 = 6
* Quantiles: Suppose that the data for attribute X are sorted in increasing numeric order. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
  - The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median.
* Quartiles: The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles.
* Percentiles: The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets.

  [Fig: Quartiles of a data distribution: 25th percentile, median, 75th percentile]

* Interquartile Range: The distance between the first (25th percentile) and third (75th percentile) quartiles is called the interquartile range (IQR).
  $$IQR = Q_3 - Q_1$$
* Variance: The variance of N observations $x_1, x_2, \dots, x_N$ for a numeric attribute X is
  $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
  where $\bar{x}$ is the mean value of the observations.
* Standard Deviation: The standard deviation, $\sigma$, of the observations is the square root of the variance, $\sigma^2$.
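The dispersion measures defined above can be sketched in Python with the standard-library `statistics` module. This is an illustration under stated assumptions rather than a prescribed method. The function name `dispersion` is hypothetical. The variance divides by N (population variance), matching the formula above. Quartile values depend on the interpolation convention, and `statistics.quantiles` is used with its default "exclusive" method, which the note does not specify.

```python
import math
import statistics


def dispersion(values):
    """Range, quartiles, IQR, population variance and standard deviation."""
    ordered = sorted(values)
    # n=4 yields the three cut points Q1, Q2 (the median) and Q3.
    # Assumption: default "exclusive" interpolation; other conventions
    # can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    variance = statistics.pvariance(ordered)  # divides by N, not N - 1
    return {
        "range": ordered[-1] - ordered[0],
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }


small = dispersion([1, 3, 5, 6, 7])
print(small["range"])  # 6, as in the range example

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
spread = dispersion(data)
print(spread)
```

For the twelve-value data set, Q2 equals the median (54) computed in the central tendency example, which is a quick consistency check on the quartile computation.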
A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.

Example: Marks: 8, 10, 15, 20
* Mean of marks $= \frac{8+10+15+20}{4} = 13.25$
* Variance $\sigma^2 = \frac{(8-13.25)^2 + (10-13.25)^2 + (15-13.25)^2 + (20-13.25)^2}{4} = \frac{86.75}{4} \approx 21.69$
* Standard deviation $\sigma = \sqrt{21.69} \approx 4.66$

Applications of Data Mining
Data mining can be applied in almost every field. Some of the major applications of data mining are briefly discussed below.

1. Market Analysis and Management
   Listed below are the various fields of the market where data mining is used:
   * Customer Profiling: Data mining helps determine what kind of people buy what kind of products.
   * Identifying Customer Requirements: Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may attract new customers.
   * Cross Market Analysis: Data mining performs association/correlation analysis between product sales.
   * Target Marketing: Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
   * Determining Customer Purchasing Pattern: Data mining helps in determining customer purchasing patterns.
   * Providing Summary Information: Data mining provides various multidimensional summary reports.
2. Corporate Analysis and Risk Management
   Data mining is used in the following fields of the corporate sector:
   * Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
   * Resource Planning: It involves summarizing and comparing the resources and spending.
   * Competition: It involves monitoring competitors and market directions.
3. Fraud Detection
   Data mining is also used in the fields of credit card services and telecommunication to detect frauds.
   For fraudulent telephone calls, it helps to find the destination of the call, the duration of the call, the time of the day or week, etc. It also analyzes the patterns that deviate from expected norms.
4. Intrusion Detection
   Data mining can help improve intrusion detection by adding a level of focus to anomaly detection. It helps an analyst to distinguish an unusual activity from common everyday network activity.
5. Web Search Engines
   Web search engines are essentially very large data mining applications. Various data mining techniques are used in all aspects of search engines, including crawling, indexing, and searching.
6. Social Web and Networks
   There are a growing number of highly popular user-centric applications such as blogs, wikis and Web communities that generate a lot of structured and semi-structured information. In these applications, data mining can be used to explain and predict the evolution of social networks, to personalize search for social interaction, to predict user behavior, etc.
7. Space Science
   Data mining can be used to automate the analysis of image data collected from sky surveys with better accuracy.

Please let me know if I missed anything or if anything is incorrect: poudeljayanta99@gmail.com
