Weka is an open source collection of machine learning algorithms and data mining tools written in Java. It contains tools for data pre-processing, classification, regression, clustering, association rule mining and visualization. Weka supports common data formats like ARFF and CSV and allows data to be imported from files, URLs or databases. It provides several graphical user interfaces for exploratory analysis, experimentation and workflow modeling. The goal of the German credit data set is to classify loan applicants as good or bad credit risks based on 20 attributes describing their financial status and history.
Introduction to WEKA
Weka Main Features
An open source collection of data mining and machine learning algorithms, including:
pre-processing of data
classification
clustering
association rule extraction
Created by researchers at the University of Waikato in New Zealand
Java based (also open source)
49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection.
3 algorithms for finding association rules
3 graphical user interfaces
The Explorer (exploratory data analysis)
The Experimenter (experimental environment)
The KnowledgeFlow (new process model inspired interface)
2. Weka: Download and Installation
Download Weka (the stable version) from
http://www.cs.waikato.ac.nz/ml/weka/
Choose a self-extracting executable (including Java VM).
After the download is completed, run the self-extracting file to install Weka, and use the default set-ups.
Start Weka
From the Windows desktop, click Start, choose All Programs, then choose Weka 3.6 to start Weka.
The first interface that appears is the Weka GUI Chooser.
WEKA Application Interfaces
Weka Functions and Tools
Pre-processing Filters
Attribute selection
Classification/Regression
Clustering
Association discovery
Visualization
Data can be imported from a file in various formats:
ARFF (Attribute-Relation File Format) has two sections:
the Header section defines the relation name and the attribute names and types
the Data section lists the data records
CSV: Comma Separated Values (text file)
C4.5: a format used by the decision tree induction algorithm C4.5; it requires two separate files
Names file: defines the names of the attributes
Data file: lists the records (samples)
Binary
Data can also be read from a URL or from an SQL database (using JDBC)
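To make the relationship between the CSV and ARFF formats concrete, the following Python sketch converts CSV text into ARFF text. It is an illustration only and not Weka's own loader: the function name csv_to_arff, the type-guessing rule (a column is numeric if every non-empty cell parses as a number, otherwise nominal) and the sample column names are all assumptions made for this example. Labels containing spaces or commas would need quoting, which the sketch does not handle.

```python
import csv
import io


def _is_number(text):
    """Return True if text can be read as a floating point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first row = column names) into ARFF text.

    Hypothetical helper for illustration; empty cells become '?'.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]

    # Header section: relation name, then one declaration per attribute.
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        # Empty cells count as missing and are ignored when guessing the type.
        values = [r[col] for r in records if r[col] != ""]
        if values and all(_is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            labels = sorted(set(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(labels)))

    # Data section: one comma-separated record per line.
    lines += ["", "@data"]
    for r in records:
        lines.append(",".join(v if v != "" else "?" for v in r))
    return "\n".join(lines) + "\n"


# Made-up sample records, not taken from any real data set.
csv_text = "duration,purpose,risk\n6,car,good\n48,,bad\n12,education,good\n"
arff_text = csv_to_arff(csv_text, relation="credit_sample")
print(arff_text)
```

The printed text declares duration as numeric, purpose and risk as nominal attributes, and writes the empty purpose cell of the second record as ?.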
ARFF File Format
An ARFF file requires declarations of @RELATION, @ATTRIBUTE and @DATA
@RELATION declaration associates a name with the dataset
@RELATION <relation-name>
@ATTRIBUTE declaration specifies the name and type of an attribute
@ATTRIBUTE <attribute-name> <datatype>
Datatype can be numeric, nominal, string or date
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Setosa,Versicolor,Virginica}
@DATA declaration is a single line denoting the start of the data segment
Missing values are represented by ?
%
% data related to student
%
@relation 'XYZ'
@attribute name numeric
@attribute qualification numeric
@attribute designation numeric
@attribute addr numeric
@data
1,2,3,4
4,5,?,7
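The ARFF structure described above (comment lines starting with %, the @RELATION, @ATTRIBUTE and @DATA declarations, and ? for missing values) can be sketched as a small reader in Python. This is a minimal illustration, not Weka's parser: the function name parse_arff is hypothetical, and the sketch assumes unquoted attribute names, comma-separated data with no quoted values, and only numeric, nominal or string-like types.

```python
NUMERIC_TYPES = ("numeric", "real", "integer")


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    attributes is a list of (name, type) pairs, where type is a lowercase
    type name or, for nominal attributes, a list of labels.
    Missing values ('?') are returned as None.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments

        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)          # missing value
                elif kind in NUMERIC_TYPES:
                    row.append(float(value))
                else:
                    row.append(value)         # nominal or string value
            rows.append(row)
            continue

        parts = line.split(None, 1)
        keyword = parts[0].lower()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "@relation":
            relation = rest.strip("'\"")
        elif keyword == "@attribute":
            name, kind = rest.split(None, 1)
            kind = kind.strip()
            if kind.startswith("{"):
                # Nominal attribute: keep the list of allowed labels.
                kind = [label.strip() for label in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after this line is a data record

    return relation, attributes, rows


SAMPLE = """\
% data related to student
@relation 'XYZ'
@attribute name numeric
@attribute qualification numeric
@attribute designation numeric
@attribute addr numeric
@data
1,2,3,4
4,5,?,7
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(rows)
```

For the student example, the second record comes back with None in its third position, which is how the sketch represents the ? entry.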
Predicting whether a loan will be repaid (credit scoring) is an important task for any bank. High accuracy benefits both the banks and the loan applicants. In the German credit data set, a loan applicant is judged on 20 attributes (7 numerical, 13 categorical). The goal is to classify the applicant into one of two categories, good or bad.
Attribute description for the German credit data set
Attribute 1: (qualitative) Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical) Duration in month
Attribute 3: (qualitative) Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)