Derebe Tekeste
A Thesis submitted to
School of Electrical and Computer Engineering
Addis Ababa Institute of Technology
In Partial Fulfillment of the Requirements for the Degree of Master of Telecommunication Engineering
(TIS)
I, the undersigned, declare that this thesis and the work presented in it are my own
and have been generated by me as the result of my own original research. I have truly
acknowledged and referenced every material used in this thesis work.
DEREBE TEKESTE
Signature
Name
ADDIS ABABA UNIVERSITY
Addis Ababa Institute of Technology
School of Electrical and Computer Engineering
Thesis on
A Comparative Analysis of Machine
Learning Algorithms for Subscription Fraud
Detection: The Case of ethio telecom
By: DEREBE TEKESTE
Signed by :
This thesis is dedicated to the memory of my beloved father Tekeste Birhanu, for
making me be who I am. I still miss him every day.
ABSTRACT
As a result, the J48 algorithm with the Cross Validation (CV) option is found to be the
best classifier algorithm, scoring 99.3% accuracy, followed by the highest scores of
ANN (CV) and SVM (ST) with 97.51% and 96.0% respectively. This result is attributed
to J48's ability to learn disjunctive expressions, in addition to its reduced-error
pruning. Pruning decreases the complexity of the final classifier and thereby
improves predictive accuracy by reducing overfitting.
KEYWORDS
ACKNOWLEDGMENTS
First and foremost, I would like to thank my Almighty God " YAHWEH " for all
the things that have happened in my life and for giving me my beautiful wife
Yodit Gudeta and beloved kids Fraol, Mesgana, and cute lady Maya.
Secondly, the success of this thesis is credited to the extensive support and
assistance of my advisor, Ephrem Teshale (PhD). I would like to express my
heartfelt gratitude and sincere appreciation to him for his guidance, valuable
advice, constructive comments, encouragement, and kindness to me throughout this
study. Thank you!
Finally, I would like to thank all of you who supported me in completing this
thesis work, even if I did not mention your names here.
CONTENTS
1 introduction 1
1.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 General Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Specific Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Scope and limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 machine learning 18
3.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 data preparation 31
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Understanding CDR data . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Data Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Attribute Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.1 Data Cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Data Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.4.3 Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4.4 Validation Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4.5 Algorithm Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Performance Measurement parameters . . . . . . . . . . . . . . . . . . . 41
4.5.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5.3 F-Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5.4 Root Mean Squared Error (RMSE) . . . . . . . . . . . . . . . . . . . . . 44
4.5.5 Receiver Operating Characteristic Curve - ROC . . . . . . . . . . . . . 44
bibliography 55
a appendix 59
a.1 CDR Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
a.2 File Uploader script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
LIST OF FIGURES
LIST OF TABLES
ACRONYMS
AI Artificial Intelligence
CV Cross Validation
ML Machine Learning
ST Supplied Test
1
INTRODUCTION
1.1 statement of the problem
One of the common types of fraud is subscription fraud, in which the usage type is
in contradiction with the subscription type [5].

Table 1.0.1:
year    top five total fraud losses    subscription fraud losses alone
2017    29                             6.95

Observing Table 1.0.1, it is possible to appreciate the dimension of the "financial
hole" generated by fraud in the telecommunications industry and the impact of
fraud on the revenue of telecom operators; subscription fraud in particular has a
big impact on telecommunication companies' revenue.
Telecom fraud is a major source of revenue loss for Telecom Service Providers
(TSPs) and their customers. A recent (2017) CFCA survey estimated global fraud
losses at $29 billion [6]. Experts agree that most telecom providers are losing 3
to 10 percent of their income to fraud. The intention of a subscriber plays a
central role in the definition of fraud [9]. Because subscribers use services in
contradiction with their subscription agreements, revenue losses have occurred to
telecom operators [5].
2013 50
2015 33
2017 89
Ethio telecom provides different types of services to its customers. The two main
service types are voice and data. Depending on their interest, customers can use
prepaid as well as postpaid payment methods for their service requests: prepaid
means pay as you go, while postpaid extends credit for a month or more. For any
service request, ethio telecom registers its customers on the Customer Relationship
Management (CRM) system for billing purposes. However, it has no way of knowing
whether these subscribers are fraudulent or not at the time of the service request
or application. Subscription fraud can be committed with false identification
numbers, stolen credentials, and stolen accounts, which help fraudsters obtain
customers' full information for creating many new accounts for fraudulent purposes.
So, once subscribers are registered and obtain accounts to access the network, they
have the chance to start their fraudulent activity.
Ethio telecom currently deploys and uses a traditional type of Fraud Management
System (FMS) to detect fraudulent activities. Due to the inflexible rules of this
traditional detection system, fraudsters get the chance to adapt to the existing
FMS detection rules and policies. They thereby increase the company's revenue
losses, damage subscribers' trust relationship with the company, and erode the
company's brand. This traditional management system cannot handle the new
behavioural changes in fraudsters' activity.
By considering the above problems, this research analyzes and answers the
following research questions.
Research questions
1.2 objectives
1.3 scope and limitation
There exist more than 200 fraud types in the telecom industry [13]. However, this
specific research focused only on the subscription fraud type. The raw data used
are limited to CDR data, which were collected from ethio telecom. Due to storage
limitations, two months of prepaid mobile subscriber data were used.
• Deliver practical suggestions that may help anti-fraud managers and em-
ployees to get a better understanding of subscription fraud.
1.5 related work
In order to gain a detailed understanding of this research topic, fraud detection
and prevention mechanisms, specifically subscription fraud detection, related
literature such as journals, articles, and magazines, along with Internet sources,
was reviewed.
In the past few years, subscription fraud has been the trendiest and fastest-growing
type of fraud [14]. In a similar spirit, Abidogun [15] characterizes subscription
fraud as being the most significant and prevalent worldwide telecommunications
fraud type.
Estévez et al. [9] describe the identification of fraudsters at the time of applying
for a service request, on a subscription basis, for fixed telecommunications. Two
strategies have been proposed for detecting subscription fraud: examining account
applications and tracking customer behavior. They use a classification module and a
prediction module. The classification module (fuzzy rules) classifies subscribers
according to their previous historical behavior into four different categories:
subscription fraudulent, otherwise fraudulent, insolvent, and normal, whereas the
prediction module (ANN) allows them to identify potentially fraudulent customers at
the time of subscription.
In their experimental test, Estévez et al. [9] used a database containing 10,000
real subscribers of a major telecom company in Chile, in which 2.2% subscription
fraud was detected. A multilayer perceptron neural network was implemented for
prediction; it identified 56.2% of the true fraudsters while screening only 3.5% of
all the subscribers in the test set. Their approach was tested and significantly
prevents subscription fraud in telecommunications by analyzing the application
information at the time of customer application.
Kabari et al. [16] present the design and implementation of a subscription fraud
detection system using Artificial Neural Networks; NeuroSolutions for Excel was
used to implement the Artificial Neural Network. The study was grounded in
customers' Internet data usage. The system was trained and tested, achieving an
85.7% success rate. The designed system was found to be user-friendly and effective.
On the other hand, Kabari et al. [2] identify the different subscription services
provided by the telecommunications sector, identify the different ways
telecommunications fraud is perpetrated, and propose the use of Naïve Bayesian
network technology to detect subscription fraud in the telecommunications sector.
The system addresses the challenges encountered by rule-based fraud detection
systems. The paper is grounded in customers' Internet data usage.
Farvaresh and Sepehri [5] describe that one of the common types of fraud is
subscription fraud, in which the usage type is in contradiction with the
subscription type. The study aimed at identifying customers' subscription fraud by
employing data mining techniques and adopting a knowledge discovery process based
on leased-line telephone services. A hybrid approach consisting of preprocessing,
clustering, and classification phases was applied, and appropriate tools were
employed for each phase. Specifically, for the clustering phase Self-Organizing Map
and k-Means were used, and in the classification phase decision tree (C4.5), ANN,
and SVM as single classifiers and bagging, boosting, stacking, and majority and
consensus voting ensembles were examined. The results showed that SVM among single
classifiers and boosted trees among all classifiers had the best performance in
terms of various metrics. The result is significant both theoretically and
practically.
Saravanan et al. [14] describe the identification of true high-usage customers as
distinct from illegitimate customers based on their calling patterns. The paper
implements a probability-based method for fraud detection in the telecommunication
sector, using Naïve-Bayesian classification to calculate the probability and an
adapted version of KL-divergence to identify fraudulent customers on the basis of
subscription. Each user's data corresponds to one record in the database. This
paper overcomes the problem of identifying fraudulent customers in the
telecommunication sector by classifying the true fraudulent customers alone. In
other words, the paper's results indicate that normal high-usage customers have
similar behavioral patterns, whereas fraudsters' behavior changes: they use the
service heavily for some time and then disappear from the system for a while.
All the related papers presented have their own approaches to detecting
subscription fraud, depending on the nature of the telecom service provider's
customer usage behavior, so different methods and techniques of subscription fraud
detection can be seen. What the researcher clearly sees is that algorithm type,
feature selection, and dataset size limitations have the most significant impact on
classification accuracy. However, these papers do not consider aggregated
information about subscriber usage behavior. In this study, by considering
subscription fraud usage behaviors, the researcher uses the following features:
number of incoming calls, total number of calls, number of distinct calls, ratio of
distinct calls to total calls, and ratio of international calls to total calls.
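As an illustration, the aggregate features listed above can be derived from raw CDR records roughly as follows (a minimal sketch in plain Python; the field names `caller`, `callee`, `direction`, and `is_international` are hypothetical stand-ins for the actual CDR attributes):

```python
from collections import defaultdict

def aggregate_features(cdr_records):
    """Aggregate per-subscriber usage features from raw CDR rows.

    Each record is a dict with hypothetical keys:
    caller, callee, direction ('in'/'out'), is_international (bool).
    """
    stats = defaultdict(lambda: {"incoming": 0, "total": 0,
                                 "international": 0, "peers": set()})
    for rec in cdr_records:
        s = stats[rec["caller"]]
        s["total"] += 1
        if rec["direction"] == "in":
            s["incoming"] += 1
        if rec["is_international"]:
            s["international"] += 1
        s["peers"].add(rec["callee"])

    features = {}
    for sub, s in stats.items():
        distinct = len(s["peers"])
        features[sub] = {
            "num_incoming": s["incoming"],
            "total_calls": s["total"],
            "distinct_calls": distinct,
            "distinct_to_total": distinct / s["total"],
            "intl_to_total": s["international"] / s["total"],
        }
    return features
```

In practice such aggregation would run over the full two-month CDR extract, yielding one feature row per subscriber for the classifiers.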
1.6 methodology
In order to achieve the objectives of this thesis and answer the research
questions, the researcher followed these methods:
3. Since the data were labeled, supervised ML techniques were chosen. The WEKA
ML tool was preferred for this specific research. Weka is a collection of machine
learning algorithms for data mining tasks. The algorithms can either be applied
directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules,
and visualization [17].
4. The developed model was tested and evaluated using performance measurement
parameters (confusion matrix, accuracy, F-measure, RMSE, and the Receiver
Operating Characteristic curve).
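For orientation, the train-and-evaluate step can also be sketched outside WEKA. The scikit-learn snippet below mimics the J48-with-cross-validation setup on placeholder data; `DecisionTreeClassifier` with the entropy criterion is only an approximation of J48/C4.5, and the synthetic dataset stands in for the real CDR features:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Placeholder dataset standing in for the labeled CDR feature table.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Entropy-based decision tree, a rough analogue of WEKA's J48 (C4.5).
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)

# 10-fold cross-validation, matching WEKA's CV test option.
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean CV accuracy: {scores.mean():.3f}")
```

The supplied-test (ST) option corresponds to fitting on one dataset and scoring on a separately held-out file instead of cross-validating.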
1.7 thesis organization
This thesis contains six chapters. The objectives, research methodology, and
problem formulation are discussed in Chapter 1. Chapter 2 discusses
telecommunication services and fraud types, including subscription fraud theory.
The third chapter discusses the machine learning techniques and algorithms applied
in this thesis. Chapter 4 concerns where and how the necessary data were collected
and prepared for experimentation, in addition to explaining the performance
measurement parameters. The fifth chapter presents the classifiers' performance
evaluation and comparison. Finally, the researcher's conclusions and directions for
further research on subscription fraud detection are presented in Chapter 6.
2
TELECOMMUNICATION SERVICES AND FRAUDS
This chapter discusses telecommunication services and fraud types. The first
section describes telecom mobile voice services, whereas the second section
describes telecommunication fraud types related to subscription behaviours.
2.2 telecommunication fraud
The telecommunication industry has expanded dramatically in the last few years
with the development of affordable mobile phone technology [18]. With the
increasing number of mobile phone subscribers, global mobile phone fraud is also
set to rise. Telecommunication fraud is defined as the unauthorized use, tampering,
or manipulation of a mobile phone or service [19]. The problem with
telecommunication fraud is the huge loss of revenue, and it can affect the
credibility and performance of telecommunication companies. Telecommunication
fraud is particularly attractive to fraudsters because calling from a mobile
terminal is not bound to a physical location and it is easy to get a subscription.
This provides a means of illegal high-profit business for fraudsters, requiring
minimal investment and a relatively low risk of getting caught due to the
possibility of subscribing with a fabricated identity.
2.3 common types of telecommunication fraud
Telecommunication fraud has been categorized into six types: subscription fraud,
PABX fraud, handset theft, premium-rate fraud, free phone call fraud, and roaming
fraud. Hilas and Mastorocostas [21] categorize fraud into four types: technical
fraud, contractual fraud, hacking fraud, and procedural fraud. A third group of
scholars, Kang and Yang [3], categorize fraud types into two: subscription and
superimposed fraud.
In a recent (2017) report, the top five frauds were listed as Subscription Fraud
(Identity), PBX Hacking, IP PBX Hacking, Subscription Fraud (Application), and
Internal Fraud/Employee Theft, as shown in Figure 2.3.1.
In the coming subsections, the researcher discusses some fraud types that share
subscription fraud properties: superimposed fraud, SIM swapping, SIM cloning,
SIM-box, and roaming fraud.
Subscription fraud is a contractual fraud [16]. Subscription fraud is the most
common type since, with a stolen or manufactured identity, there is no need for a
fraudster to defeat a digital network's encryption or authentication systems.
Fraudsters' preferred methods are low-tech: using the network under the threshold
level of the FMS, which leaves less chance of detection.
According to Koi-Acroti et al. [8], subscription fraud is nowadays the trendiest
and fastest-growing type of fraud. In a similar spirit, subscription fraud has been
characterized as probably the most significant and prevalent worldwide
telecommunications fraud type. In subscription fraud, a fraudster obtains a
subscription (possibly with false identification) and starts a fraudulent activity
with no intention of paying the bill.
Subscription fraud involves the fraudulent individual obtaining, without
authorization, the customer information required for signing up for a
telecommunication service. The usage of the service creates a payment obligation
for the real or normal customer [2].
Fraudsters obtain an account with no intention of paying the bill. In such cases,
abnormal usage occurs throughout the active period of the account. The account is
usually used for call selling or intensive self-usage. Cases of bad debt, where
customers who do not necessarily have fraudulent intentions never pay a single
bill, also fall into this category. These cases, while not always considered
"fraud", are also interesting and should be identified.
In the scenario shown in Figure 2.3.2, the normal user subscribes to a service
from the TSP, signs with the service provider, accesses the TSP infrastructure,
and pays depending on usage. However, the fraudster steals the normal user's
identity, clones the SIM, or uses a stolen phone, and then accesses the TSP's
infrastructure as much as they want without the intention to pay, while the bill
is charged to the normal customer.
Superimposed fraud is the most common fraud scenario in private networks. This
is the case of an employee, the fraudster, who uses another employee’s authoriza-
tion code to access outgoing trunks and costly services [24]. Unlike subscription
SIM cloning has the same goal as SIM swapping, but cloning does not require
calling the mobile carrier; rather, it relies on technical sophistication. The
cloning attack uses smart-card copying software to carry out the actual
duplication of the SIM card, thereby enabling access to the victim's international
mobile subscriber identity (IMSI) and master encryption key. Since this
information is stored on the SIM card, physical access to it is a requirement.
That means taking the SIM card
out of the mobile device and placing it into a card reader that can be attached to
a computer where the duplication software is installed.
After the initial stealthy SIM replication takes place, the attacker inserts that SIM
into a device they control. Next, the victim has to be contacted. The trick may
begin with a seemingly innocuous text message to the victim asking them to
restart their phone within a given period of time. Then, once the phone is powered
off, the attacker starts their own phone before the victim restarts and, in doing
so, initiates a successful clone followed by an account takeover. Once the victim
restarts their phone, the attack is complete, and the attacker will have successfully
taken over the victim's SIM and phone number. The legal phone user then gets
billed for the cloned phone's calls. Cloning mobile phones is achieved by cloning
the SIM card contained within, not necessarily any of the phone's internal
data [25].
2.3.5 SIM-BOX
A SIM box fraud is a setup in which fraudsters install SIM boxes with multiple
low-cost prepaid SIM cards; most of the time, all these SIM cards are subscribed
with forged credentials. The fraudster can then terminate international calls
through local phone numbers in the respective country to make it appear as if the
call is a local call. This allows the box operator to fraudulently undercut the
international rates charged by Mobile Network Operators (MNOs) and evade the tax
charged by the government. This act denies telecom operators and the government
the benefit of international phone calls. Besides the loss of revenue, SIM box
operators cause degradation of call quality, which prevents operators from meeting
service level agreements. The fraudster pays the network for a national call but
charges the wholesale operator for every minute terminated; the network operator
loses the interconnection fee [26].
2.3.6 Roaming
3
MACHINE LEARNING
3.2 semi-supervised learning
Supervised machine learning algorithms need labelled datasets for learning and
classifying the dataset. In supervised algorithms, the goal is to predict an event
or estimate the values of a continuous numeric attribute. There are input fields
or attributes and an output or target field. Input fields are also called
predictors because they are used by the model to identify a prediction function
for the output field. The predictors need labelled, previously identified data
with which to train the algorithm. After learning is accomplished
3.4 supervised learning
the outcome tells the researcher how well the algorithm predicts the class of new
input instances.
3.4.1 Regression
The objective of this study is to compare and select the best classifier
algorithms based on their classification performance measures. Following this
objective, the researcher selects three supervised ML algorithms: the J48 decision
tree, ANN, and SVM.
Decision trees are a non-parametric supervised learning method used for
classification and regression [31]. A decision tree follows a top-down approach to
split data recursively into smaller, mutually exclusive subsets and consists of a
root node, branches, and leaf nodes. Each internal node denotes a test on an
attribute, each branch denotes the outcome of a test, and each leaf node holds a
class label. The topmost node in the tree is the root node. The learning process
involves finding the best variables for the partition at each split. To evaluate
an instance of data, a decision tree starts at the root of the tree and moves down
towards the leaves until a prediction can be made.
According to Patil et al. [32], the J48 classifier is a simple C4.5 decision tree
for classification. J48 examines the normalized information gain that results from
splitting the data on each candidate attribute of the given dataset. To make the
decision, the attribute with the utmost standardized information gain is used
(equation 3.2). The splitting stops if, in a subset, all the instances belong to
the same class. J48 then constructs a decision node using the expected values of
the class.
Step 1: The leaf is labelled with the same class if the instances belong to the same
class.
Step 2: For every attribute, the potential information is calculated and the
information gain from the test on the attribute is computed using equations 3.1
and 3.2.
To determine the root attributes and the decision tree size, entropy and
information gain are the statistical measures used to construct the tree of the
algorithm. The attribute with the highest information gain value will be the root
or a decision node of the tree, while nodes which have an entropy of zero are
considered to be
leaf nodes; nodes with entropy greater than zero will be split further until the
entropy is zero.
• Overcomes over-fitting
• Pre-pruning or post-pruning
• Both discrete and continuous attributes are handled by this algorithm. C4.5
decides a threshold value for handling continuous attributes; this value divides
the data into those instances whose attribute value is below the threshold and
those whose value is greater than or equal to it.
Counting Gain
This process uses the "Entropy", which is a measure of the disorder of the data.
The entropy of the dataset is calculated by:

H(X) = - sum_y p(y) log2 p(y)   (3.1)

and the information gain of an attribute A by:

Gain(X, A) = H(X) - sum_v ( |X_v| / |X| ) H(X_v)   (3.2)

where:
y ranges over the attribute (class) values
H(X) is the entropy of the dataset
X_v is the subset of X in which attribute A takes the value v

Entropy controls how a Decision Tree decides to split the data. It actually
affects how a Decision Tree draws its boundaries.
Based on equations 3.1 and 3.2, the size of the tree and its root and leaf nodes
are determined.
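To make equations 3.1 and 3.2 concrete, the following minimal Python sketch computes the entropy of a labelled dataset and the information gain of a categorical attribute (a toy illustration with made-up data, not the thesis's actual WEKA implementation):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_y p(y) * log2 p(y)  -- equation 3.1."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Gain(X, A) = H(X) - sum_v |X_v|/|X| * H(X_v)  -- equation 3.2."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(sub) / total * entropy(sub)
                    for sub in subsets.values())
    return entropy(labels) - remainder

# Toy data: one binary attribute that perfectly separates the classes.
rows = [("intl",), ("intl",), ("local",), ("local",)]
labels = ["fraud", "fraud", "normal", "normal"]
print(entropy(labels))                    # 1.0 for a 50/50 class split
print(information_gain(rows, labels, 0))  # 1.0: attribute is fully informative
```

An attribute whose information gain equals the dataset entropy produces pure (zero-entropy) subsets, which is exactly the leaf-node condition described above.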
Pruning
Classification is one of the most active research and application areas of neural
networks. An ANN is a computational model that is inspired by the way biological
neural networks in the human brain process information. Artificial neural networks
are composed of nodes called neurons or processing elements, which are connected
together to form a network of nodes. Neural networks are physically cellular
systems which can acquire, store, and utilize experiential knowledge.
ANN architecture generally consists of an input layer, an output layer, and hidden
layer(s). Each connection between these nodes has a weight, which contributes to
determining the value produced by each processing element based on the input
values of that element. ANNs can learn through an iterative process by adjusting
their synaptic weights and bias levels. They can be used to model a complex
relationship between inputs and outputs and can find patterns in data.
There are two major classes of ANN: the Feed Forward Neural Network (FFNN) and the
Recurrent Neural Network (RNN). The FFNN is the first and simplest type of
artificial neural network. It contains multiple neurons arranged in layers. In a
FFNN, the information moves in only one direction, forward: from the input nodes,
through the hidden nodes (if any), to the output nodes. There are no cycles or
loops in the network. The single-layer perceptron and the Multi-Layer Perceptron
(MLP) are the two
examples of FFNN. RNNs, on the other hand, are dynamical networks with recurring
paths of synaptic connections that serve to handle time-dependent problems.
ANNs are used for supervised and unsupervised techniques based on the learning
paradigm. In the case of supervised learning, an ANN uses associative learning,
which requires labelled training input data; an unsupervised neural network, in
contrast, only requires input patterns, from which it develops its own
representation of the input stimuli.
Perceptron
The simplest neural network unit is called the "perceptron" [33]. A perceptron has
just two layers, the input layer and the output layer, and is often called a
single-layer network on account of having one layer of links between input and
output. Input nodes are fully connected to a node or multiple nodes in the next
layer. A node in the next layer takes a weighted sum of all its inputs.
The outputs of all neurons in the hidden layer are calculated by the summation
function Y (see Equation (3.3)):

Y = f( sum_{n=1}^{N} w_n x_n + w_0 ) = f( w_1 x_1 + w_2 x_2 + ... + w_N x_N + w_0 )   (3.3)

where:
Y is the generated output
f is the (sigmoid) activation function
x_n are the inputs (attributes)
w_n are the weights and w_0 is the bias
These products are simply summed and fed through the transfer function of
equation (3.4):

f(x) = 1 / (1 + e^-x)   (3.4)

where:
f(x) is the generated output
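Equations 3.3 and 3.4 can be combined into a few lines of Python; this toy sketch (with arbitrarily chosen weights, not values from the thesis) computes a perceptron's sigmoid output from its inputs:

```python
import math

def sigmoid(x):
    """Transfer function of equation 3.4: f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_output(inputs, weights, bias):
    """Summation of equation 3.3: Y = f( sum_n w_n * x_n + w_0 )."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Arbitrary example: two inputs, two weights, and a bias of zero.
y = perceptron_output([1.0, 2.0], [0.5, -0.25], 0.0)
print(round(y, 4))  # sigmoid(0.0) = 0.5
```

The sigmoid squashes the weighted sum into (0, 1), which is what lets the output be read as a class score.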
MLP contains one or more hidden layers (apart from one input and one output
layer), as shown in Figure 3.4.2. While a single-layer perceptron can only learn
linear functions, a multi-layer perceptron can learn both linear and non-linear
functions. A good point of MLPs is their applicability to any field of supervised
pattern recognition tasks [34]. MLP can solve problems which are not linearly
separable. MLP is often applied to supervised learning problems. It is a FFNN
that generates a set of outputs from a set of inputs, connecting multiple layers
in a directed graph that goes in one direction only. MLPs are widely used for
pattern recognition, classification, prediction, approximation, optimization,
control, time series modelling, and data mining.
Classification with ANNs has desirable characteristics, of which high accuracy,
noise tolerance, independence from prior assumptions, ease of maintenance,
overcoming the drawbacks of other statistical methods, the ability to be
implemented in parallel hardware, minimized human intervention (highly automated),
and suitability for implementation in non-conservative domains are the major ones.
However, ANNs have limitations such as poor transparency, trial-and-error design,
data hungriness (they require large amounts of data), over-fitting, the lack of an
explicit set of rules for selecting a suitable neural network, dependency on the
quality and amount of data, the lack of classical statistical properties
(confidence intervals and hypothesis testing), and techniques that are still
evolving (not robust).
The support vector machine is a supervised classifier which has proved highly
effective in solving a wide range of pattern recognition problems. SVMs are
suitable for binary and multi-class classification tasks and are a promising
non-linear, non-parametric classification technique [35]. SVM is a
state-of-the-art classification and regression algorithm whose optimization
procedure maximizes predictive accuracy while automatically avoiding over-fitting
of the training data. However, SVMs suffer from the important shortcoming of high
time and memory training complexity [36]. The SVM classifier uses only a small
subset of the total training set for classification, thus reducing the
computational complexity. The reduction is achieved by the use of the kernel
trick, and overfitting of the data is avoided by classifying with a maximum
margin [37].
Initially, the hyperplane randomly classifies the classes by making a line or
hyperplane. Input vectors that just touch the boundary of the margin (H1 and H2,
as shown in Figure 3.4.3) are called support vectors. In this approach, the
training stage is used to delimit the region (boundary) where data are classified
as normal. In the testing stage, instances are compared to that region: if they
fall in the delimited region they are classified as normal; if not, as fraudulent.
The main objective is to minimize the number of points within the margin as much
as possible.
In a compact form:

z(x) = w^T x + b

where w is the weight vector, x the input vector, and b the bias.
The margin is the maximum perpendicular distance between the nearest data point
and the hyperplane. SVM algorithms find the function (hyperplane) that returns the
largest minimum distance to the data points. This distance is called the margin,
and the data closest to the margins are termed support vectors. In Figure 3.4.3,
the points lying on the margin lines are support vectors, and the distance between
these margin lines is the width of the margin. Because the solution depends only
on the support vectors, the remaining data are not important in developing the
model.
The points on the planes H1 and H2 are the tips of the support vectors. The plane H0 is the median in between, where w · x_i + b = 0.
The distance between the median H0 and the hyperplane H1 is:

\[ d_{+} = \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|} \tag{3.8} \]

and on the other side, the distance between the median and the hyperplane H2 is:

\[ d_{-} = \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|} \tag{3.9} \]

\[ \text{margin} = d_{+} + d_{-} \tag{3.10} \]

In other words, from equations 3.8 and 3.9, the total margin distance between H1 and H2 is:

\[ \text{margin} = d_{+} + d_{-} = \frac{(1 - b) + (1 + b)}{\|w\|} = \frac{2}{\|w\|} \tag{3.11} \]
Hence, the SVM maximizes the margin while minimizing some measure of loss on the training data. The optimization problem for the calculation of w and b can thus be expressed as:

\[ \min_{w} \; \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} E_i \tag{3.12} \]

where E_i are the classification errors (slack variables).
In equation (3.12), C is a tuning parameter that weights in-sample classification errors and thus controls the generalisation ability of the SVM.
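To make the objective in equation (3.12) concrete, the following is a minimal sub-gradient descent sketch of the soft-margin objective in plain Python. It is a generic illustration (the function names and toy data are invented for this example), not the SMO implementation used by Weka:

```python
def train_linear_svm(points, labels, C=1.0, lr=0.01, epochs=500):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum of hinge errors E_i,
    i.e. the soft-margin objective of equation (3.12). Labels are +1 / -1."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad_w = list(w)      # gradient of the regularisation term
            grad_b = 0.0
            if margin < 1:        # point inside the margin: slack E_i > 0
                for i in range(dim):
                    grad_w[i] -= C * y * x[i]
                grad_b -= C * y
            for i in range(dim):
                w[i] -= lr * grad_w[i]
            b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Sign of the decision function z(x) = w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy linearly separable data: two classes around (0, 0.5) and (3, 3.5).
pts = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.0, 4.0)]
ys = [-1, -1, 1, 1]
w, b = train_linear_svm(pts, ys)
```

A larger C penalizes in-sample errors more heavily and narrows the margin; a smaller C tolerates more slack, which matches the role of C described for equation (3.12).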
4
DATA PREPARATION
The upcoming sections detail how the data were collected, how the collected data were understood, how relevant attributes were selected, and the preprocessing tasks performed, such as data cleaning, data integration and aggregation.
The main objective of this study is subscription fraud detection grounded on subscribers' usage call patterns (CDR). A two-month period of CDR data, from May 25 to July 25, 2019, was collected.
For collection and processing purposes, a Windows Server 2012 R2 Standard (v6.3.9600) machine with 8 GB RAM, a quad-core processor and 2.5 TB of mounted storage was used, and an Oracle 11g database was deployed on the same server.

4.2 understanding cdr data

The CDR data were collected and stored on the server every day in text format, as shown in Figure 4.1.1, and have 33 attributes. Each day an average of 29 million subscribers access the network and about 175 million calls are recorded, which requires about 30 GB of storage for a single day's activity.
Due to space limitations, seven tables were created: one source table for uploading the original dumped CDR data as-is, and six tables (three for fraudulent and three for legitimate subscribers) created according to the type of service, an SMS table, a data table and a voice table, each with selected attributes. The daily collected text-format data were uploaded to the source table with batch files (see Appendix A.2), and from the source table the selected attributes were inserted into the three service tables with other batch files (see Appendix A.3). The source table has the same attributes as the collected text-format CDR data. After inserting these data into their related tables, the original dumped text-format CDR files were deleted daily and the source table was truncated.
This thesis work is grounded on ethio telecom prepaid mobile CDR data, so the CDR data were collected into the database. Before preprocessing and further tasks, the researcher worked together with domain experts to understand and evaluate the data, with their respective attribute values, against the problem domain; this is a key task. It also included verifying the usefulness of the data and its completeness, redundancy, missing values, and the reasonableness of attribute values with respect to the ML goals.
4.3 data selection
Not all attributes from the collected CDR data can be used for this study. According to Kamel [39], before starting to select attributes we need to identify the parameters that could represent and lead to the objectives of the study. The expected output of the classification algorithms fully depends on the quality of the selected input data.
Two months of prepaid mobile CDR data, from May 25 to July 25, 2019, were collected. After collecting the CDR data and understanding its value, the researcher selected the relevant attributes that could identify the behaviour of subscription fraud (see Section 2.3.1.1). Data irrelevant to this research objective were removed from the collected CDR data so that the machine learning algorithms learn only relevant information. In this specific study, subscription fraud behaviour was used as an input to the data selection task, and domain expert advice played a key role. Reducing unrelated attributes improves training time and algorithm performance, and reduces the complexity of the algorithm's task.
For this specific study, nine out of 33 attributes, shown in Table 4.3.1, were selected. The selected attributes identify the behaviour of subscriber usage. The remaining attributes, irrelevant to the objective of this study, were eliminated.
Some attributes are redundant or hold the same values, such as charge_fee and Call_Fee, calling_number and service_number, and third_party and called_number.
(Table 4.3.1 lists the selected attributes with the reason for selecting each.)
This study mainly depends on prepaid mobile subscribers' usage patterns, and the features are differentiated based on a combination of the following three categories, in addition to subscription fraud behaviour:
• Usage category: number of calls made (call frequency), call charge and dura-
tion of calls made.
4.3.2 Sampling
The goal of sampling is to learn, via samples drawn from the statistical data, an estimate of the characteristics of subscription fraud. ML algorithms adaptively improve their performance as the number of samples available for learning increases [41]; however, using the entire two months of collected CDR data is impractical.
Only a small number of fraudulent subscribers were identified, so some proportion between normal and fraudulent numbers had to be established. Depending on the problem domain, the proportion between the classes can differ; research related to this thesis, Farvaresh and Sepehri [5] and Estévez et al. [9], uses different sample sizes based on the respective problem domains. After discussion with domain experts, we decided to build the dataset from 25% fraudulent and 75% legitimate users.
A total of about 29 million subscribers' data were stored in the database, and out of these, 15,000 legitimate subscribers were selected using a simple random sampling technique. In this technique, each instance in the database has an equal chance of being selected. The entire sampling process is done in a single step, with each subject selected independently of the other instances in the database. In addition to the legitimate subscribers, 5,000 high-usage (subscription) fraudulent subscribers, identified with the help of domain experts, were used for further experiments and analysis. In total, 20,000 subscribers were used as the sample for this study. Table 4.3.2 shows the sampled data with their numbers of records.
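The sampling procedure above can be sketched as follows. The subscriber IDs and pool size are hypothetical stand-ins (scaled down from 29 million to keep the example small), and `random.sample` implements exactly the single-step, equal-chance selection described:

```python
import random

def simple_random_sample(population, k, seed=None):
    """Single-step simple random sampling: every instance has an equal
    chance of selection, each drawn independently of the others."""
    return random.Random(seed).sample(population, k)

# Scaled-down illustration: 29,000 stand-in subscriber IDs instead of 29 million.
pool = [f"SUB{i:06d}" for i in range(29_000)]
legit = simple_random_sample(pool, 1_500, seed=1)  # stands in for the 15,000 legitimate
fraud = [f"FRD{i:06d}" for i in range(500)]        # stands in for the 5,000 expert-flagged

# 75% legitimate / 25% fraudulent, the proportion chosen for the dataset.
dataset = [(s, "legitimate") for s in legit] + [(s, "fraud") for s in fraud]
```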
4.4 data preprocessing
In this subsection, the preprocessing stages of data cleaning, data integration and data aggregation are described. For this study a number of Oracle database scripts were used, in addition to Weka's preprocessing features.
Data cleaning is a method for fixing missing values, outliers and possibly inconsistent data; missing data are common [42]. Data cleansing is a process in which we go through all the data within the database. The presence of missing, irrelevant and noisy values in a dataset can affect the performance and accuracy of a classifier trained on that dataset.
The collected CDR data contained null values, incomplete values, missing values and duplicate records. As this study is grounded on prepaid mobile, the researcher selected only prepaid calling numbers from the collected CDR data and discarded records of other types. The collected CDR has 33 columns; duplicated and null-valued attributes were removed, and records with missing values, outliers or incomplete fields were discarded. In this study, we found less than 1% missing values (77,332 out of a total of 9,463,043 records). According to Acuna and Rodriguez [42], rates of less than 1% missing data are generally considered trivial; nevertheless, these records were removed.
On the other hand, an outlier is an unusual variable value that may adversely lead to model misspecification, biased parameter estimation and incorrect results. Outliers can appear at the maximum or minimum of a variable and distort the distribution of the data [43]; it is therefore important to identify them prior to modelling and analysis. Outlier detection algorithms evaluate instances based on distance, density, projections or distributions, but the most common data distribution measure for identifying outliers is the Interquartile Range (IQR). The IQR describes the middle 50% of the values when the data are ordered from lowest to highest. To find it, first find the medians (middle values) of the lower and upper halves of the data; these values are quartile 1 (Q1) and quartile 3 (Q3), respectively.
The interquartile range is the difference between the upper quartile value (Q3) and the lower quartile value (Q1):

\[ \text{IQR} = Q3 - Q1 \tag{4.1} \]

The result of Equation 4.1 tells how spread out the "middle" values are; it can also be used to tell when some of the other values are "too far" from the central values. These "too far away" points are called "outliers" because they lie outside the expected range.
In this study, the researcher applied the IQR technique on the Weka tool platform to identify and remove outliers, for reasonable and acceptable classification results. A total of 64,202 records were detected and removed (below Q1 − 3·IQR or above Q3 + 3·IQR).
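The IQR rule described above can be sketched in a few lines of Python. The helper names are invented for this illustration; Weka's InterquartileRange filter performs the equivalent computation:

```python
def quartiles(values):
    """Q1 and Q3 via the median-of-halves method described above."""
    s = sorted(values)
    n = len(s)

    def median(seq):
        mid = len(seq) // 2
        return seq[mid] if len(seq) % 2 else (seq[mid - 1] + seq[mid]) / 2

    return median(s[:n // 2]), median(s[(n + 1) // 2:])

def iqr_filter(values, k=3):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k = 3 as in this study."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]
```

For example, `iqr_filter([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])` drops the extreme value 100 while keeping the rest.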
Data integration is the first step toward transforming data into meaningful and valuable information about the subscribers. In this study, data were integrated from multiple sources to give a single view over all the sources. The researcher already had six different tables (voice, data and SMS) for fraudulent and legitimate users, and attributes from these tables needed to be integrated to subscriber level. The collected CDR data were inserted into these three tables on a daily basis for two months. The integration process was done for both fraudulent and legitimate subscribers.
The aggregated outputs of SMS, voice call and Internet data were integrated to form a single instance per subscriber. Moreover, a class label field identifying the subscriber type was added for training purposes. The aggregated attributes are described in Table 4.4.1.
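The per-subscriber aggregation can be sketched as below. The record layout (subscriber, service, duration, fee) is a simplified hypothetical stand-in for the real CDR columns; the actual work was done with Oracle scripts:

```python
from collections import defaultdict

def aggregate(rows):
    """Collapse per-call CDR rows into one instance per subscriber and
    service: [call frequency, total duration, total charge]."""
    agg = defaultdict(lambda: defaultdict(lambda: [0, 0, 0.0]))
    for sub, service, duration, fee in rows:
        entry = agg[sub][service]
        entry[0] += 1          # number of calls (frequency)
        entry[1] += duration   # total duration
        entry[2] += fee        # total charge
    return {sub: dict(per) for sub, per in agg.items()}

# Hypothetical simplified CDR rows.
cdr = [
    ("SUB1", "voice", 60, 0.5), ("SUB1", "voice", 120, 1.0),
    ("SUB1", "sms", 0, 0.1),    ("SUB2", "data", 0, 2.0),
]
features = aggregate(cdr)
```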
(Table 4.4.1 describes each aggregated attribute.)
After aggregating the attributes, the last task is preparing the data in a file format suitable for the ML tool, which accepts CSV and ARFF file formats.
Iteration   Training folds                 Test fold
1           2, 3, 4, 5, 6, 7, 8, 9, 10     1
2           1, 3, 4, 5, 6, 7, 8, 9, 10     2
3           1, 2, 4, 5, 6, 7, 8, 9, 10     3
4           1, 2, 3, 5, 6, 7, 8, 9, 10     4
5           1, 2, 3, 4, 6, 7, 8, 9, 10     5
6           1, 2, 3, 4, 5, 7, 8, 9, 10     6
7           1, 2, 3, 4, 5, 6, 8, 9, 10     7
8           1, 2, 3, 4, 5, 6, 7, 9, 10     8
9           1, 2, 3, 4, 5, 6, 7, 8, 10     9
10          1, 2, 3, 4, 5, 6, 7, 8, 9      10
Cross-validation is the most widely used testing method. Table 4.4.2 shows how the classifier is trained on each set of folds and tested against the remaining fold: in the second iteration, for example, the classifier is trained on all folds except fold 2, tested on fold 2, and this rotation is repeated 10 times. The accuracy estimate is the ratio of the total number of correct classifications over all iterations to the total number of instances in the dataset.
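The fold rotation of Table 4.4.2 can be sketched as follows (a generic Python illustration with invented helper names, not Weka's internal code):

```python
def kfold_indices(n, k=10):
    """Partition indices 0..n-1 into k folds; each fold serves as the
    test set once while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def cv_accuracy(fit, data, labels, k=10):
    """Accuracy estimate: correct classifications summed over all k
    iterations, divided by the total number of instances."""
    correct = 0
    for train, test in kfold_indices(len(data), k):
        model = fit([data[i] for i in train], [labels[i] for i in train])
        correct += sum(model(data[i]) == labels[i] for i in test)
    return correct / len(data)
```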
Use training set: The classifier is evaluated on how well it predicts the class of the
instances it was trained on.
Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. In this experimental process, the researcher created two datasets, for training and testing, from the total of 349,164 instances using a resampling method. The first dataset has 261,873 instances for training and the second 87,291 instances for testing, as shown in Figure 4.4.1. With a supplied test set, no instance is ever used for both training and testing.
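Such a disjoint train/test split can be sketched with a small helper (hypothetical; Weka's resampling filters do the equivalent):

```python
import random

def resample_split(instances, train_frac=0.75, seed=0):
    """Shuffle once and cut, so no instance appears in both sets --
    the property of the supplied test set option described above."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# 349,164 instances split 75% / 25%, as in Figure 4.4.1.
train, test = resample_split(range(349_164))
```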
A total of six experiments were performed with both the 10-fold cross-validation and supplied test set options. For the cross-validation case, the three algorithms used the same dataset of 349,164 instances, whereas for the supplied test option they used 261,873 instances for training and 87,291 for testing. For training purposes, a label attribute was added to the dataset before the experiments started. In this study the algorithms were run with their default parameters; in many cases the default settings yield adequate results, although other options were considered in order to compare results and models and achieve the research objectives. The results of each experiment were evaluated with the algorithm performance measurement parameters discussed in Section 4.5.

4.5 performance measurement parameters
After training and testing the algorithms, the next task is evaluating their outcomes based on performance measurement parameters. The following subsections describe the parameters used for comparing and analyzing the algorithms.
(The confusion matrix tabulates the actual class against the predicted class.)
True positives and true negatives are the observations that are correctly predicted; a good classifier minimizes false positive and false negative values.
True Positives - TP - These are the correctly predicted positive values which
means that the value of actual class is Fraudulent and the value of predicted class
is also Fraudulent.
True Negatives - TN - These are the correctly predicted negative values which
means that the value of actual class is Legitimate and value of predicted class is
also Legitimate.
False positives and false negatives occur when the actual class contradicts the predicted class.
False Positive - FP: when the actual class is Negative and the predicted class is Positive.
False Negative - FN: when the actual class is Positive but the predicted class is Negative.
Once these four parameters are understood, we can calculate Accuracy, Precision, Recall and F-Measure.
4.5.2 Accuracy
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4.2} \]
4.5.3 F-Measure
\[ F\text{-Measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.3} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{4.4} \]
Recall can be interpreted as the proportion of positive test samples that were actually classified as positive. A classifier that simply outputs positive for every sample, regardless of whether it is really positive, would achieve a recall of 1.0 but a lower precision. The fewer false negatives a classifier gives, the higher its recall.
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4.5} \]
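Equations (4.2) to (4.5) can be computed directly from the four confusion-matrix counts; a small sketch (the function name is invented for this illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall and F-Measure per equations (4.2)-(4.5)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# The degenerate "always positive" classifier described above:
# no false negatives, so recall is 1.0, but precision is low.
acc, p, r, f = classification_metrics(tp=50, fp=50, fn=0, tn=0)
```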
Root Mean Squared Error (RMSE) is a frequently used measure of the differences between values predicted by a model and the values observed. RMSE is a quadratic scoring rule that measures the average magnitude of the error: it is the square root of the average of squared differences between prediction and actual observation. Lower values of RMSE indicate a better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit when the main purpose of the model is prediction.
\[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (d_i - f_i)^{2}} \tag{4.6} \]
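Equation (4.6) in code, with d_i the desired (observed) values and f_i the model outputs:

```python
import math

def rmse(desired, predicted):
    """Square root of the mean squared difference between observed
    values d_i and predictions f_i (equation 4.6); lower is better."""
    n = len(desired)
    return math.sqrt(sum((d - f) ** 2 for d, f in zip(desired, predicted)) / n)
```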
The ROC curve is a graph of the false positive rate (FPR) versus the true positive rate (TPR). The area under it measures the ability of the classifier to correctly classify the test data, and it shows the performance of models across all possible thresholds. The wider the area covered by a model's curve, the better the classifier.
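An ROC curve and its area can be sketched by sweeping the decision threshold over the classifier's scores (a simplified illustration that ignores tied scores; the helper names are invented):

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold one score
    at a time, from the highest-scored instance to the lowest."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve; 1.0 is a perfect ranking."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every fraudulent instance above every legitimate one gets an AUC of 1.0; random scoring hovers around 0.5.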
5
RESULT AND DISCUSSION
The performance metrics used for comparison are accuracy, precision, recall, F-measure, RMSE and ROC curves. Table 5.1.1 shows the classification performance results of the algorithms under each validation technique. The highest classification accuracies, from the cross-validation and supplied test options, are those of the J48 algorithm, with 99.3% and 99.2% respectively; the two validation techniques give comparable results for J48. By contrast, SVM scores the lowest accuracy of all the experiments, 94.71%, with the cross-validation option, and is 1.29% higher with the supplied test option. ANN is the second-highest classifier, with 97.51% for cross-validation and 96.57% for the supplied test.
5.1 results and comparison
The results show that the J48 algorithm scores the highest, with the same value of 0.992 for all three performance measures, compared with the SVM and ANN algorithms. ANN's precision of 0.967 is the second-highest, ahead of SVM's 0.961. SVM's F-measure of 0.959 is the lowest value under the supplied test option; in other words, SVM is the weakest of the three algorithms on these three performance metrics. A higher precision value means fewer false positives are detected; in fraud detection, fewer false positives are preferable because this minimizes the risk of blocking legitimate subscribers.
The ROC curve is another way to compare the performance of the proposed algorithms. It is a standard technique for summarizing classifier performance over a range of trade-offs between false positive and true positive rates. As described in the performance measure Table 5.1.1, the lowest ROC values of each algorithm were recorded with the supplied test validation: 0.892, 0.979 and 0.908 for SVM, J48 and ANN respectively, with SVM's the lowest among them. Each algorithm's highest ROC curve is plotted in Figure 5.1.4 for further comparison, to select the best classifier based on the Area Under the Curve (AUC).
Figure 5.1.4 shows that J48, followed by ANN, is the best classifier due to the wider coverage area of its curve. The J48 curve hugs the vertical true positive rate axis, approaching the point (0, 1), which indicates that the algorithm is accurate: a high proportion of actual positives are correctly identified as positive. By contrast, SVM is the weakest classifier of the three based on its coverage area.
The main objective of this study was to analyse which algorithms perform better for detecting subscription fraud based on their classification performance. As depicted in the comparisons above, for both validation options, the highest accuracy results of each algorithm are tabulated in Table 5.1.3. The final findings of the study thus help answer the research questions posed at the beginning of the study.
Research question
• Call fee, which describes how much money is spent over a given period
• Data usage, which describes how much data the subscriber uses
6
CONCLUSION AND FUTURE WORK
6.1 conclusion
The focus of this study was to train and analyse which algorithms perform better for detecting subscription fraud based on their classification performance. To achieve the goal of this research, CDR data were collected from ethio telecom, and preprocessing tasks were applied to clean unnecessary and missing data values and to remove outliers. Attribute selection is a key task for fraud detection; to select attributes, subscription fraud behaviours were identified with the help of domain expert advice, in addition to a review of related papers. After eliminating irrelevant attributes, nine out of 33 were selected. To capture full information about each subscriber, attributes were aggregated at subscriber level to discriminate legitimate subscribers from fraudulent ones based on the behaviour of subscription fraud.
A total of six experiments were done using the two validation techniques, ten-fold cross-validation and a separate test set, on the three ML algorithms, namely J48, ANN and SVM. In the separate test case, instances in the training dataset were never included in the test dataset, whereas in the CV case the validation technique sees every instance, since each instance is used once for testing and otherwise for training.
The performance of all algorithms under both validation techniques was evaluated based on various metrics, including accuracy, precision, recall, F-measure, RMSE, ROC and time (building and evaluation). These metric results provide a quantitative understanding of the algorithms' suitability for subscription fraud detection.
As a result, the J48 algorithm (99.3%) is the best classifier and fits the prediction task of subscription fraud detection, based on the performance evaluation metrics. This result arises from J48's capability to learn disjunctive expressions, in addition to its reduced-error pruning: pruning decreases the complexity of the final classifier and therefore improves predictive accuracy by reducing overfitting. The next-highest scores are those of ANN (CV) and SVM (supplied test), with 97.51% and 96.0% respectively. Across the experimental techniques, SVM shows a small improvement of 1.29% with the supplied test option, while ANN improves by 0.94% with cross-validation; for J48 both validation techniques give comparable results. Overall, the performance measurement results show that the cross-validation technique is preferred compared with the results recorded from the supplied test.
This research can play an important role in controlling and preventing the current fraud threat in ethio telecom, specifically subscription fraud. Moreover, it gives a telecom operator a way to distinguish fraudulent subscribers from legitimate ones, benefiting the company by decreasing revenue losses caused by fraudsters, increasing profitability, strengthening trust relationships with customers and building the company brand.
6.2 future work
This study was carried out on prepaid mobile data with a two-month sample; increasing the dataset size may improve the performance and accuracy of the technique.
With a similar methodology and techniques, research can be performed with additional, as-yet-unused attributes.
With these same techniques and algorithms, research can be conducted for other fraud types.
Finally, although this study was carried out on prepaid mobile, the techniques proposed here could be extended to subscription fraud in fixed and post-paid mobile communications.
BIBLIOGRAPHY
[3] S. Wu, N. Kang, and L. Yang, “Fraudulent behavior forecast in telecom in-
dustry based on data mining technology,” Communications of the IIMA, vol. 7,
no. 4, p. 1, 2007.
the-digital-era/.
[7] oseland. (2013). Communications fraud control association, 2013 global fraud loss survey. N. (CFCA), Ed., [Online]. Available: http://www.cfca.org/press.php.
[10] C. F. C. Association et al., “2017 global fraud loss survey,” Press Release, June,
2017.
[11] fanabc. (Oct. 1, 2019). Telecom fraud. fanabc, Ed., [Online]. Available: https://www.fanabc.com/english/2018/10/ethio-telecom-mulls-over-preventing-telecom-fraud.
[12] A.-A. Ababa. (Mar. 6, 2017). Ethiopia-telecom fraud. A.-A. Ababa, Ed., [On-
line]. Available: http://apanews.net/en/news/ethiopia-loses-over-52m-
to-telecom-fraud-official.
[17] S. S. Aksenova, “Machine learning with Weka: Weka Explorer tutorial for Weka version 3.4.3,” sabanciuniv.edu, 2004.
[22] G. M., “Telecoms fraud,” Computer Fraud & Security, Jul. 15, 2010.
[23] WeDo. (Dec. 18, 2019). How to handle subscription fraud. WeDo, Ed., [Online]. Available: https://web.wedotechnologies.com/hubfs/10_RAID.Cloud/Subscription%20Fraud/Datasheets/RAID-CLOUD-Subscription-Fraud-datasheet.pdf.
[25] D. Bales. (Oct. 10, 2019). Clone or swap? sim card vulnerabilities to reckon
with. L. Kessem, Ed., [Online]. Available: https://securityintelligence.
com/posts/clone-or-swap-sim-card-vulnerabilities-to-reckon-with/.
[26] K. Hagos, Sim-box fraud detection using data mining techniques: The case of ethio telecom, D. Ephrem, Ed., Nov. 3, 2018.
[30] G. Kaur and A. Chhabra, “Improved j48 classification algorithm for the pre-
diction of diabetes,” International Journal of Computer Applications, vol. 98,
no. 22, 2014.
[32] T. R. Patil, S. Sherekar, et al., “Performance analysis of naive bayes and j48
classification algorithm for data classification,” International journal of com-
puter science and applications, vol. 6, no. 2, pp. 256–261, 2013.
[34] H. Sug, “The effect of training set size for the performance of neural net-
works of classification,” WSEAS Transactions on Computers, vol. 9, no. 11,
pp. 1297–1306, 2010.
[35] L. Auria and R. A. Moro, “Support vector machines (svm) as a technique for
solvency analysis,” 2008.
[36] J. Nalepa and M. Kawulok, “Selecting training sets for support vector ma-
chines: A review,” Artificial Intelligence Review, vol. 52, no. 2, pp. 857–900,
2019.
[37] S. Subudhi and S. Panigrahi, “Use of fuzzy clustering and support vector
machine for detecting fraud in mobile telecommunication networks.,” IJSN,
vol. 11, no. 1/2, pp. 3–11, 2016.
[38] R. Berwick, “An idiot’s guide to support vector machines (SVMs),” 2003. Retrieved October 21, 2011.
[42] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect
on classifier accuracy,” in Classification, clustering, and data mining applications,
Springer, 2004, pp. 639–647.
[43] I. Ben-Gal, “Outlier detection,” in Data mining and knowledge discovery hand-
book, Springer, 2005, pp. 131–146.
A
APPENDIX
(Appendix table: attribute numbers, names and descriptions.)
A.2 file uploader script
No   Attribute   Description
24   RATE_ID1    Rate ID
(The batch script was partially lost in extraction. Its recoverable logic: it iterates over the dumped CDR text files in order of write time, renames and loads each file in turn, writes an error log entry with the name of any file that fails to load (:loadingerror), deletes files that were successfully loaded (:delete), and moves .bad files to a specific directory (:rnm).)
A.3 file loader
A.4 oracle scripts
Table Joining