Automatic Lithology Well Logging
Automatic Lithology Well Logging
Automatic Lithology Well Logging
Petroleum Engineering
Submission date: June 2016
Supervisor: Sigve Hovda, IPT
Trondheim, 2016-06-09
i
Abstract
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
1 Introduction 1
2 Background Theory 4
2.1 Well Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Gamma Ray Logging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Neutron Log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Motivation of hypothesis testing . . . . . . . . . . . . . . . . . . . . . 12
2.4 Kernel density estimation (KDE) . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 Motivation of univariate density estimation . . . . . . . . . . . . . . 15
2.4.2 Properties of kernel density estimator . . . . . . . . . . . . . . . . . . 16
2.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
ii
CONTENTS
5.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Preliminary study of bivariate analysis . . . . . . . . . . . . . . . . . . . . . 70
5.4 Concluding remark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6 Conclusions 73
A Acronyms 75
B Geological Description 76
Bibliography 92
iii
List of Figures
5.1 Density probability plots with KDE and kernel bandwidth results of Well
15/5-7 A for present lithology in borehole size: (a)36", (b)26", (c)17 12 ",
and (d)8 12 " . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
iv
LIST OF FIGURES
5.2 Density probability plots with KDE and kernel bandwidth results of Well
15/6-11 S for present lithology in borehole size: (a)36", (b)26", (c)17 12 ",
(d)12 41 ", and (e)8 21 " . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Density probability plots with KDE and kernel bandwidth results of Well
15/6-9 S for present lithology in borehole size: (a)24", (b)17 21 ", and (c)8 12 " 47
5.4 Comparison of histogram and KDE in estimating probability density in
(a)Well 15/5-7 A , section 26", (b)Well 15/6-9 S , section 17 12 ", and (c) sec-
tion 24" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Probability density of Well 15/5-7 A for hole size 17 12 ". The gray area is
the overlapping distribution between shale and sandstone lithology. . . . 49
5.6 Comparison of probability density from group of 8 12 " in Well 15/6-9 S
which are (a) not-merged and (b) merged . . . . . . . . . . . . . . . . . . . . 50
5.7 Preview of the interface for validation process built by using MATLAB . . . 54
5.8 Training and testing data distributions from different experiments for
group of 17 12 ". The testing data was from Well 15/6-9 S . . . . . . . . . . . . 64
5.9 Plot of probability density for case in experiment 1: 17 12 " in Well 15/6-9 S . 66
5.10 Distribution of model 1, model 2, and testing data of experiment 1 group
26" in Well 15/5-7 A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.11 Distribution of model 1, model 2, and testing data of experiment 2 for
group 8 12 ". The model was from Well 15/6-11 S and the testing data was
from Well 15/6-9 S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.12 Bivariate analysis plots of group 8 12 " from Well 16/1-14 . . . . . . . . . . . . 71
5.13 Bivariate analysis plots of group 17 21 " from Well 16/2-7 . . . . . . . . . . . . 71
D.1 Probability density results of each hole section in Well 15/5-7 A using
kernel estimator grouped into shale and non-shale lithology . . . . . . . . 84
D.2 Probability density results of each hole section in Well 15/6-11 S using
kernel estimator grouped into shale and non-shale lithology . . . . . . . . 85
D.3 Probability density results of each hole section in Well 15/6-9 S using ker-
nel estimator grouped into shale and non-shale lithology . . . . . . . . . . 86
E.1 Scatter plot and probability density plots of GR and neutron for Well
16/1-14 in hole section: (a) 12 14 " and (b) 8 21 " . . . . . . . . . . . . . . . . . 88
E.2 Scatter plot and probability density plots of GR and neutron for Well
16/2-7 in hole section: (a) 17 12 ", (b) 12 41 ", and (c) 8 21 " . . . . . . . . . . . . 90
E.3 Scatter plot and probability density plots of GR and neutron for Well
16/2-13 A in hole section: (a) 12 14 " and (b) 8 21 " . . . . . . . . . . . . . . . . 91
v
List of Tables
vi
LIST OF TABLES
5.7 Summary of data and results for validation of Well 15/5-7 A and Well
15/6-11 S , by using Well 15/6-9 S as training data . . . . . . . . . . . . . . . 58
5.8 Summary of data and results for validation of wells in Block 16 by using
Well 15/5-7 A as training data . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.9 Summary of data and results for validation of wells in Block 16 by using
Well 15/6-11 S as training data . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.10 Summary of data and results for validation of wells in Block 16 by using
Well 15/6-9 S as training data . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.11 Table of minimum, maximum, and averaged value of misclassification
error in each model and experiment . . . . . . . . . . . . . . . . . . . . . . . 62
5.12 The values of the intersection points of the models and the testing data
from experiment 1, group of 17 12 " in Well 15/6-9 S . . . . . . . . . . . . . . . 67
vii
Chapter 1
Introduction
Subsurface lithology within drilling operation is interpreted with the usage of dif-
ferent techniques, involving seismic records, mud logging, and well logging. In the
petroleum industry, well logs are the main source of information concerning the sub-
surface formations. And, the variation in borehole geophysical data usually is used
to relate any change in lithology and geological properties. In addition to mud log-
ging, real-time drilling measurement nowadays is improving through the development
of measurement-while-drilling. Moreover, logging within drilling phase is even possi-
ble through logging-while-drilling (LWD). Continuous transmission of the information
from MWD and LWD has shown the benefits in helping the drilling decisions and real-
time formation evaluation (Bonner et al., 1992).
Despite of well logging, mud logging also provides information for lithology inter-
pretation. Within mud logging, measurement of the progress of the drilling operation
and the contents of the formation are recorded. The drilling parameters which are
recorded include weight on bit (WOB), hook load, and mud properties. Meanwhile,
the cuttings from the drilled formation inside borehole is circulated to the surface. By
visualizing the cutting sample, the lithology can be approximated. Generally, cuttings
samples require an amount of time to be circulated to the surface which lead to a de-
layed interpretation. Restoring the information from cuttings also can be difficult due
to the requirement of correlating the cutting origin, fluid loss in the borehole, flushed
rock fragment. These factors cause cutting visualization alone does not provide an ac-
curate lithology interpretation and requires a combination with information from well
logs and mud logs.
A wide span of methods in lithology interpretation has been proposed through
combinations of various measurements both in the qualitative and quantitative eval-
uation. Some of the qualitative evaluations from mud logging include ROP interpreta-
tion (such as identification of drilling break and drill-off trend), drilling force, and bit
evaluation, and specific energy deflection (Provost (1987), Ziaja and Roegiers (1998),
Laosripaiboon et al. (2015) ). Meanwhile, qualitative evaluations from logging mea-
surement include visualizations of multiple logs , photoelectric (P e ) factor interpre-
tation, and gamma-ray evaluation for shale identification (Gardner and Dumanoir
(1980), Serra et al. (1985), Dewan (1986) ). Qualitative methods itself are inadequate
for lithology interpretation, especially in formation with complex lithologies which re-
quire a large set of logging information.
Within time, lithology interpretation has expanded and starts to consider the usage
of quantitative methods, such as crossplot, statistical analysis, and neural network.
1
Crossplot is one of the basic method of quantitative evaluation which is performed by
plotting data points of two or more than two different log data. The geophysical data
which are widely used for crossplotting are density, neutron, sonic, and P e . The types
and the applications of crossplot method has been studied by Burke et al. (1969) and
Clavier and Rust (1976). However, these methods still require manual analyses and can
not be applied for automatic interpretation.
There are a plethora of statistical classification methods, such as discrimination
analysis, linear regression, kernel estimation, etc. The selection between these meth-
ods is greatly dependent on the character of the data, so that the method can pro-
vide an interpretation that is important for model building. Probabilistic classification
is a common classification method which can predict the belonging of a member by
returning the probability. An early research by Delfiner et al. (1987) has shown how
statistical analysis can be applied for lithology determination and prediction. This re-
search proposed a procedure which combines modern wireline measurement in order
to produce automatic lithologic description. Within the procedure, lithology classifi-
cation by discriminant analysis and probability calculation by Bayesian rule was in-
troduced. The application of this research was shown in a case study by Busch et al.
(1987). This research showed that the statistical discriminant analysis is possible to
predict lithology formation. However, the proposed method in these researches is lim-
ited for geophysical data with normal (Gaussian) distribution thus, this method is not
flexible to be applied in non-parametric distribution.
A statistical method called kernel density estimator provides an estimation of the
probability density function for a non-parametric distribution and examines the mul-
timodality of data. A research by Silverman (1986) showed that kernel density estima-
tion is the excellent tool for estimating the univariate, bivariate, or trivariate data. The
application of kernel density estimation in borehole geophysical data was performed
by Mwenifumbo (1993). Within this research, the kernel density method was applied
for analysis of univariate and bivariate data to identify lithology and sulfide mineral-
ization. However, the assessment of the statistical significance was not performed.
Until now, there has been no study of lithology prediction based on borehole geo-
physical data which applied statistical analysis. Within this study, a method of lithology
classification and prediction from borehole geophysical data is developed by applying
kernel density method as the statistical analysis. The objective of this study is that the
developed method can be applied to give interpretation and prediction in a real-time
operation. Specifically, the main focus of this study is the application of univariate
kernel density which is not extensively used. The models from gamma-ray data con-
structed by kernel estimator are assessed by the used of the confusion matrix to un-
derstand the model accuracy in classifying shale and non-shale lithology. The results
are presented in term of misclassification rate. In addition to that, a brief insight of
bivariate analysis to improve lithology classification is also provided.
The models are tested with two different classification rules to assess the effect from
adding prior probability value which value is taken from the geological description.
The models are also tested in different experiments with testing datasets which are
taken from different well locations. This study shows that gamma-ray is a good variable
for lithology classification. But, the accuracy is dependent on the source of data which
is tested towards the models. Moreover, the method proposed in this study can be
applicable for lithology prediction, even though the application is still limited because
it processes the data from the current depth of logging tool, not beyond the logging
2
CHAPTER 1. INTRODUCTION
3
Chapter 2
Background Theory
This purpose of this chapter is to discuss the theories which relevant to this study. The
first theory discusses the well logging in petroleum industry, including the descrip-
tion of gamma ray and neutron logging which are relevant to the geophysical data
which is used in this study. The next theories discussed are relevant to the method-
ology adapted in this study, they are exploratory data analysis, hypothesis testing, and
kernel density estimation.
4
CHAPTER 2. BACKGROUND THEORY
5
2.1. WELL LOGGING
Figure 2.2: Gamma ray reading for various lithology (Glover, 2001)
6
CHAPTER 2. BACKGROUND THEORY
Method Equation
Figure 2.3: Comparison of shale volume by using different methods (Glover, 2001)
However, determining lithology shaliness only by using GR index can cause misin-
terpretations, such as in cases of uranium-rich formations, sandstone containing mica,
and nonradioactive clays. These misinterpretations can be prevented by the use of
spectral gamma ray which measures not only the total radioactivity, but also the con-
centration of potassium (K), thorium (Th), and uranium (U).
GR tools are sensitive to a number of factors (Bateman, 2012):
1. Eccentricity of gamma ray tool
2. Hole size
3. Mud weight
4. Casing weight and size
5. Cement thickness
Modern logs usually have automatic corrections applied to GR readings. However,
7
2.1. WELL LOGGING
Figure 2.4: The graph of neutron life after neutron is emitted by the neutron tool
The energy loss due to elastic scattering is maximum when a neutron collides nu-
cleus with the same mass (i.e. hydrogen). Therefore, the count rate to slow down the
neutron and the distance traveled by neutron depend on the amount of hydrogen. Be-
cause hydrogen is mostly found in pores (composed in water or hydrocarbon), the neu-
tron log is related to porosity function. The count rate in high porosity rocks is slower
than in low porosity rocks.
Mainly, there are three different types of neutron tools available, they are:
1. The gamma ray/neutron tool (GNT)
2. The sidewall neutron porosity tool (SNP)
3. The compensated neutron log (CNL)
The detailed explanation of each tool can be found in a textbook written by Bateman
(2012). The data measured by neutron tools are shown in the unit of porosity or hy-
drogen index. The neutron tools are calibrated in limestone filled with fresh water.
Therefore, the results often presented in equivalent limestone porosity units.
The neutron logs are used for porosity calculation which assumes that the contri-
bution of elements other than hydrogen is negligible. The second use of neutron logs is
lithology determination. The source of a hydrogen atom is not only from fluids occupy-
8
CHAPTER 2. BACKGROUND THEORY
ing the pore space but it can be originated from bound water molecules in shales, crys-
tallized water in evaporites, or hydrated minerals in igneous and metamorphic rocks
(Glover, 2001).
The apparent porosity of shale is varying, but it is usually higher than the apparent
porosity identified in carbonate and sandstone rocks. This high porosity reading by
neutron tool is caused by the effect of hydrogen contained in the bound water in shale.
However, shale identification by using neutron log requires extra concern due to the
effect of hydrocarbon gas which may present and disturbs the log.
Figure 2.5: Typical apparent porosity from neutron log for varies lithologies
9
2.2. EXPLORATORY DATA ANALYSIS
provides the most appropriate way to explore, summarize data, and also create a visu-
alization of the data. By implementing EDA, it is expected to gain some confidences of
the data. Most of the EDA use graphical techniques to get data visualization, such as
histograms, box plots, and steam-leaf plots.
2.2.1 Histogram
Histogram is one of graphical techniques which serves the data by using bars to show
data distribution. During the construction, the data values are divided into series of
intervals which illustrated in bars. The number of intervals are often referred as bin,
which each interval contains a certain range of values. Each bar usually has a consis-
tent and equal ranges with others, and its interval is not overlapping with others and
adjacent.
Histograms are often constructed as frequency or density histogram. The descrip-
tion of each type of histogram is explained below:
Class frequency histogram measures the number of occurrences which values are
falling within a given class interval.
Relative frequency histogram calculates the frequency of each interval divided by the
total number of measurements. This histogram has a total height of bars equal
to 1 or 100%.
Density histogram calculates the relative frequency of each interval divided by bin
width. In other words, density histogram shows relative frequency as area of each
bar. In density histogram, the total area of bars is normalized to 1.
Histogram is very well suited to illustrate a large set of data and continuous data.
With the advantage of its simplicity of construction, histogram is a graphical technique
which is commonly used. However, describing data using histogram may encounter
some difficulties and requires some understandings. A histogram is very sensitive to
bin width because of its effect to the graph smoothness. It is evidenced that larger bin
width (fewer bins) reduces noises and makes the graph oversmoothed, while smaller
bin width (more bins) makes the graph undersmoothed.
Figure 2.6: Histograms of a data set with 15, 35, and 100 bins (Scott, 2004)
10
CHAPTER 2. BACKGROUND THEORY
There are many theories have been developed in determining the optimum bin
width. However, the application of these methods depends on the data distribution
and the goal of analysis. Some of the methods proposed calculation of bin numbers, k,
and some proposed calculation of bin width, h. The relationship of bin numbers and
bin width is:
x max − x mi n
k= (2.2)
h
where x is random variable. The most common methods on calculating optimum bin
width are explained below.
Sturges method. This method suggests to calculate bin width with formula (Scott,
1992),
k = 1 + l og 2 n (2.3)
where n is the number of data points. Sturges derived the formula above based
on binomial distribution with normal distribution data. This method is popu-
lar due to its simplicity in the calculation. However, this method may give poor
result if the data is not normal.
Scott method. This method is derived by minimizing the integrated means squared
error of the density estimate for normally distributed data (Scott, 1992). The cal-
culation requires standard deviation 1 , σ.
3.491σ
h= (2.4)
n 1/3
2.2.2 Boxplot
Boxplot is a graphical technique to examine the shape of data distribution by using
parameters called quartiles. In addition, boxplot is also used to study the variability of
values in a set of data (Ott and Longnecker, 2010).
Quartiles are three points dividing a dataset which is arranged from the lowest to
the highest value equally into 4 groups. The first quartile (Q 1 ), which is called lower
quartile, has a value between the smallest value and median of a dataset. The second
quartile (Q 2 ) is called as the median of a dataset. The median value is taken from a
qP
2
1 i (x−x̄)
Standard deviation measures the variation of data set ,σ = n−1 (Ott and Longnecker, 2010)
11
2.3. HYPOTHESIS TESTING
data point located in the middle of a dataset which is arranged from the lowest to the
highest value. In other words, a median is the center of data distribution. Last, third
quartile (Q 3 ), which is called upper quartile, has a value between the median and the
highest value of a dataset. The difference between upper and lower quartiles value is
defined as an interquartile range.
Figure 2.7: An example of boxplot with whiskers of 1.5 IQR of upper and lower
quartiles
The boxplot is constructed by creating a box with two sides which values are equal
to the upper and lower quartiles. Within the box, a line is drawn to indicate median
(see figure 2.7). The boxplot is often constructed with whiskers to indicate data vari-
ability. There are several ways to plot the whiskers depend on the information to be
represented. In most cases, the whiskers represent the minimum and maximum val-
ues of the data. The whisker can also set at a value equal to 1.5 IQR of the upper and
lower quartiles, and this boxplot is often referred as Tukey boxplot (Frigge et al., 1989).
12
CHAPTER 2. BACKGROUND THEORY
3. The observations in each group are independent and there is no relationship be-
tween the groups.
The stated H0 of the test is that the data set comes from same distribution. Be-
fore getting into Kruskal-Wallis test, firstly we would explain the underlying theory of
rank-sum test for two independent variables from Mann-Whitney U test (Mann and
Whitney, 1947).
Rank-sum test is a method ranking the raw data from the lowest (rank #1) to the highest
value (rank #N), with tied ranks included. Tied ranks are assigned if there are two or
more than two tied values in the raw data, thus the ranks are adjusted and equalized.
Define a population with 2 group of samples, group 1 and group 2, where n 1 is the
size of observations in group 1 and n 2 is the size of observations in group 2. The sum
rank of each group is defined with R 1 and R 2 respectively for group 1 and group 2. For
any combination of n 1 and n 2 , the maximum possible value of sum rank, R max , in each
group can be calculated as follow
n 1 (n 1 + 1)
R max1 = n 1 n 2 + (2.6a)
2
n 2 (n 2 + 1)
R max2 = n 1 n 2 + (2.6b)
2
Mann-Whitney proposed a parameter U which is equal to the difference between
maximum possible value of rank sum, R max , and the actual rank sum observed, R. The
equation U for both groups follows
U1 = R max1 − R 1 (2.7a)
U2 = R max2 − R 2 (2.7b)
U1 +U2 = n 1 n 2 (2.8a)
U1 = n 1 n 2 −U2 (2.8b)
U2 = n 1 n 2 −U1 (2.8c)
U = min(U1 ,U2 ) (2.8d)
n1 n2
µ= (2.9a)
ss
n 1 n 2 (n 1 + n 2 + 1)
σ= (2.9b)
12
From these identities, standardized variable, z-values, can be calculated following
|U − µ| − 0.5
z= (2.10)
σ
13
2.3. HYPOTHESIS TESTING
The value of - 0.5 is a correction for continuity to accommodate the sampling distri-
butions of U which are discrete. Then, the probability value (p-value) of z is generated
from normal distribution. If p-value < level of significance, then the null hypothesis is
rejected.
Kruskal-Wallis test
The concept of Kruskal-Wallis test is quite similar with Mann-Whitney which also
adapts the rank-sum test. Defined a population with k independent group of sam-
ples, where n = (n 1 , n 2 , . . . , n k ) represents the number of observation in each group,
R = (R 1 , R 2 , . . . , R k ) represents the sum rank of the kth group, and M = (M 1 , M 2 , . . . , M k )
represents the mean of rank of the kth group. In addition, R T is the sum of R of all k
group and M T is the sum of M of all k group, following:
k
X
RT = Rj (2.11a)
j =1
k
X
MT = Mj (2.11b)
j =1
(R j )2 (R T2 )
" #
X k
SS bg (R) = − (2.13)
j =1 nj N
The p-value can be approximated by inputting the calculated H into chi-square distri-
bution because H value has a close approximation to the chi-square distribution for
d f = k − 1, or following
H ∼ χ2(k−1) (2.16)
where χ2 is chi-squared distribution. If the returned p-value is less than level of signifi-
cance (typically set at 5%), then we rejected the statement of the null hypothesis. Small
values of p-value remove the doubt of the validity of H0 .
14
CHAPTER 2. BACKGROUND THEORY
d F (x + h) − F (x)
f (x) ≡ F (x) ≡ lim , (2.18)
dx h→0 h
where F (x) represents the cumulative distribution function of X .
Assume that density f consists random samples with size n which samples are in-
dependent and identically distributed, represented as {x 1 , . . . , x n }. By dividing Equa-
tion 2.18 into a set of K bin numbers with width h and replace F (x) with the empirical
cumulative distribution function,
#{x i ≤ x}
F̂ (x) = , (2.19)
n
This equation leads histogram to be a density function estimator with each bin
value equal to
d F (x + h) − F (x − h)
f (x) ≡ F (x) ≡ lim , (2.22)
dx h→0 2h
15
2.4. KERNEL DENSITY ESTIMATION (KDE)
Different from histogram, the smooth estimator approaches the density function
by estimating the derivative at each point x separately. By replacing F (x) with empiri-
cal cumulative distribution,
{#x i ∈ (x − h, x + h)}
fˆ(x) = (2.23)
2nh
The equation above also can be written as
1 X n ³x −x ´
i
fˆ(x) = K , (2.24)
nh i =1 h
where
(
1
2
, if − 1 < u ≤ 1,
K (u) =
0, otherwise.
The Equation 2.24 is the form of kernel density estimator, with uniform kernel func-
tion K . This estimator fˆ(x) counts the percentage of the observations in each data
point over the local neighbourhood which is close to the examined data point x. By
merging all of the smooth kernel function at each data point, we will have a smooth
density estimation for one population.
The comparison of simple and smooth density estimator in estimating density
function can be seen in Fig. 2.8. The example given by Simonoff (1996) shows the com-
parison of probability density function from kernel estimator and histogram. Kernel
estimator gives a estimation which is smoother compared to discreteness of the his-
togram.
(a) (b)
Figure 2.8: (a) Simple density estimation, and (b) smooth density estimation (with
underlying Gaussian kernel function) for certificated of deposit (CR) rates
16
CHAPTER 2. BACKGROUND THEORY
tion with more peaks and bumps, meanwhile a very large bandwidth will give over-
smoothed graph. Bandwidth also have strong relation to bias and variance which cre-
ates a dilemma in selecting optimal bandwidth. A small bandwidth will reduce the bias
of fˆ(x), but it will trigger larger variance of fˆ(x), and vice versa. The criterion on choos-
ing optimal bandwidth mostly quantified through the measurement of mean squared
error (MSE).
¤2
MSE fˆ(x) = E f fˆ(x) − f (x)
£ ¤ £
R(k) h σK R( f )
4 4 00
AMISE(h) = + (2.26)
nh 4
where R(K ) = K (u)2 du, K satisfies condition u 2 K (u)du = σ2K > 0, and f 00 is the
R R
5
AMISE0 = [σK R(K )]4/5 R( f 00 )1/5 n −4/3 (2.28)
4
The formula of h 0 consists an unknown density function f which value is out of
the control of data analyst, thus this formula can not be applied directly. However,
the term [σK R(K )[4/5 can be minimized by choosing the optimal kernel function K ,
and this kernel function is often called as Epanechnikov kernel. Several other kernel
functions which commonly user are shown in Table 2.2 and Fig. 2.9.
By comparing the inefficiency of each kernel function to Epanechnikov kernel, it is
obvious that the selection of kernel functions is insensitive with MSE (Simonoff, 1996).
Therefore, kernel function should be selected based on other consideration, such as
properties of fˆ.
17
2.4. KERNEL DENSITY ESTIMATION (KDE)
Kernel Form
1
Uniform 2
If the reference of density function f is based on the Gaussian function, then the Gaus-
sian density can be substituted into equation 2.27, and resulting
This Gaussian reference density can also be used and converted into other types
of kernel function. The optimal bandwidth of other kernel function, h 0,K ∗ satisfies the
condition,
h 0,K ∗ = c K ∗ h 0,G , (2.30)
where
18
CHAPTER 2. BACKGROUND THEORY
" p #1/5
2 πR(K ∗)
cK ∗ = (2.31)
σ4K ∗
and h 0,G is the optimal bandwidth of Gaussian kernel. This method is often referred
as the Silverman Rule-of-Thumb of selecting optimal bandwidth. Depending on the
true density, this method can give the optimal bandwidth if the true density is normal.
However, if the true density is close to normal, the bandwidth will be close to optimal
(Hansen, 2009).
2.5 Classification
In a population which consists several independent groups, very often we wish to look
for the characteristics or features to separate the multivariate samples into the known
groups. Based on the features, classification rule could be developed in order to iden-
tify and allocate an object from new observations into one of the groups.
Consider a population consists two sub-populations, denoted as π1 and π2 . The
probability density of each population is denoted as f 1 (x) and f 2 (x), with random vari-
able of X = X 1 , . . . , X p . Denote that Ω is the collection of all possible outcomes x, R 1
¡ ¢
is the possible outcomes x which are classified as population π1 , and R 2 = Ω−R 1 is the
possible outcomes of x which are classified as population π2 .
The classification probabilities can be presented in the following table:
Classified as:
π1 π2
π1 P (1|1) P (2|1)
True population:
π2 P (1|2) P (2|2)
In some cases, prior probability and costs of misclassification are taken account
into classification rules. Prior probability is the probability of one population from
prior observation and denoted as p 1 for prior probability of population π1 and p 2 for
prior probability of population π2 . The total of prior probability is equal to 1, p 1 + p 2 =
1. Costs of misclassification are defined as the prices to pay if an object is misclassified
19
2.5. CLASSIFICATION
Figure 2.10: Description of misclassification regions P (1|2) and P (2|1). The purple
shaded area indicates region of P (1|2) and the light green shaded area indicates region
of P (2|1)
and denoted as c 1 for the cost of classifying an object of π1 as π2 and c 2 for the cost of
classifying an object of π2 as π1 .
The classification rules are evaluated in terms of the expected cost of misclassifica-
tion (ECM)
EC M = c(2|1)P (2|1)p 1 + c(1|2)P (1|2)p 2 (2.34)
The optimal classification rule is calculated by minimizing the ECM, resulting
f 1 (x) c(1|2) p 2 o
n µ ¶µ ¶
R 1 = x ∈ Ω; ≥
f 2 (x) c(2|1) p 1
(2.35)
f 1 (x) c(1|2) p 2 o
n µ ¶µ ¶
R 2 = x ∈ Ω; <
f 2 (x) c(2|1) p 1
Special classification rules prevail for conditions such as:
1. Equal (or unknown) prior probabilities: p 1 = p 2 . The classification rule now de-
pends on probability density ratio and cost ratio.
f 1 (x) c(1|2) f 1 (x) c(1|2)
R1 : ≥ , R2 : < (2.36)
f 2 (x) c(2|1) f 2 (x) c(2|1)
20
Chapter 3
This chapter contains the overview of the methodology applied and the management
of data which would be used in this study. Within data management section, the data
source, the process of collecting data, and the data limitation are also discussed.
3.1 Methodology
This study was performed based on a set of systematic methods, which is referred as
methodology. The methodology applied was summarized in a flowchart, described in
Figure 3.1. The results from each process are presented and discussed further in this
report.
21
3.2. DATA MANAGEMENT
Start
Collecting data
Data exploration
Hypothesis testing
Stop
quality of geophysical data from various wells were checked and recorded in a spread-
sheet. Considering the large number of wells available and time limitation, the evalu-
ation within this process was limited to 16 geological blocks.
From the evaluation, we chose 3 neighboring wells from Block 15, Well 15/5-7
A,Well 15/6-11 S, and Well 15/6-9 S. Apart from the availability and the good quality
of logging data, these wells also had similar geological features. The evaluation also
discovered that Block 15 has a large resource of wells which would be beneficial for
validation process later in this study.
Once the filtering process was completed and the wells were chosen, we started ex-
tracting the data which related to the study, they were geophysical data, geological de-
scription and well schematic. The well schematic data provided information regarding
casing size, hole size, and shoe depth. While the geological description provided infor-
mation regarding formation distribution, characteristic, basal stereotype, depositional
environment, and dominant lithology.
According to the geological description, Block 15 consisted ± 35 formations in total
which were divided into 6 big formation groups. There were 4 major lithologies found
in these formations, they were sandstone, shale, chalk, and carbonate lithology. How-
ever, the available geological description only provided one generalized description
of formation for all of the selected wells and also lacked of detailed information (i.e.
lithology mixture, minerals). The geological description of the formations is shown in
22
CHAPTER 3. METHODOLOGY AND DATA
Figure 3.2: The spreadsheet for recording the availability of logging data in some of
the wells
Appendix B.
After investigated the data attributes, we discovered that the geophysical data were
only recorded based on the measured depth (MD), without the availability of true ver-
tical depth (TVD)- and time-based geophysical data. The other data which were not
available from the software were the information of the run wireline tools and well tra-
jectory data. Such these data constraints would lead to limitations in the study analyses
and results.
To begin with, we visualized all the extracted data into one figure of interface which
constructed by using MATLAB, shown in Fig. 3.3. Eventually, we set some assumptions
in this study:
1. The lithology information from the geological description represented the real
lithology condition.
2. All the geophysical data were recorded from well-calibrated logging tools, thus
the geophysical data were valid and represented real borehole condition.
23
3.2. DATA MANAGEMENT
Figure 3.3: The interface for data visualization of Well 15/5-7 A, Well 15/6-11 S, and
Well 15/6-9 S. In each well, the geophysical data were plotted in log traces, alongside
casing data and geological description.
24
Chapter 4
This chapter contains the process of data exploration of GR data of Well 15/5-7 A, Well
15/6-11 S, and Well 15/6-9 Sand the results. This process was performed to check the
quality of GR data and discover any variables grouping the GR data. From this process,
we expected that the GR data would be valid to be used for further analysis. After the
data exploration was completed, hypothesis testings were carried out in order to test
and support the discovery from data exploration.
25
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
Table 4.1: Statistic description of GR data in Well 15/5-7 A grouped according to the
present lithology
26
CHAPTER 4. DATA EXPLORATION ON GR DATA
(a) Histograms
(b) Boxplots
Figure 4.1: Graphical description of GR data in Well 15/5-7 A which grouped based on
the lithology type
27
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
(a) Histograms
(b) Boxplots
Figure 4.2: Graphical description of GR data in Well 15/6-11 S which grouped based
on the lithology type
28
CHAPTER 4. DATA EXPLORATION ON GR DATA
(a) Histograms
(b) Boxplots
Figure 4.3: Graphical description of GR data in Well 15/6-9 S which grouped based on
the lithology type
29
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
By looking into the GR distribution from histograms and boxplots, low GR values
were observed in chalk and carbonate lithology while the GR values of shale and sand-
stone were varying. It was also discovered that lithology is a variable that divide the GR
data into groups and each group appeared to be independent with each other. This
presumption tested and proved later in the hypothesis test Chapter 4.2.1.
High variance of GR value in sandstone and shale lithology was detected from the
standard deviation (σ) value, while carbonate and chalk had less variance. The widest
span of GR values was detected in shale lithology with many outliers and long whiskers
indicated from the boxplots. According to the discoveries, the GR data of shale and
sandstone were not convincing due to indication of bimodal distribution and high
variance of the data. We sensed that there were sub-groups might exist within each
lithology group. Hence, another investigation was performed and the detailed expla-
nation is provided in the next section.
30
CHAPTER 4. DATA EXPLORATION ON GR DATA
(a) (b)
Figure 4.4: Shifted GR value from logging visualization: (a) 26" (blue area) and 17 12 "
(red area) in Well 15/5-7 A and b) 17 21 " (blue area) and 12 14 " (red area) in Well
15/6-11 S
31
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
Table 4.4: Statistic description of Well 15/5-7 A grouped according to the present
lithology and subgroup of hole size
Well 15/5-7 A
26" - - - - - -
Chalk 17 21 " - - - - - -
8 12 " 57.39 55.18 22.67 21.20 12.16 184.99
32
CHAPTER 4. DATA EXPLORATION ON GR DATA
Table 4.5: Statistic description of Well 15/6-11 S grouped according to the present
lithology and subgroup of hole size
Well 15/6-11 S
26" - - - - - -
17 12 " 110.16 110.82 16.02 22.78 77.37 162.63
Sandstone
12 14 " 50.50 46.54 9.67 13.79 35.75 84.67
8 12 " 50.52 42.33 21.77 39.23 14.63 109.02
26" - - - - - -
17 1/2" - - - - - -
Chalk
12 14 " 35.66 35.62 10.42 14.01 12.99 72.67
8 12 " - - - - - -
26" - - - - - -
17 12 " - - - - - -
Carbonate
12 14 " 50.39 50.21 5.67 9.82 39.96 60.69
8 12 " - - - - - -
33
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
Table 4.6: Statistic description of Well 15/6-9 S grouped according to the present
lithology and subgroup of hole size
Well 15/6-9 S
24" - - - - - -
Sandstone 17 21 " 107.84 108.49 11.73 16.54 85.31 149.65
8 12 " 68.71 70.03 17.47 32.09 36.41 109.81
24" - - - - - -
Chalk 17 21 " - - - - - -
8 12 " 45.06 45.66 10.05 13.25 23.09 82.65
24" - - - - - -
Carbonate 17 21 " - - - - - -
8 12 " 43.68 40.48 14.98 21.14 22.16 124.79
34
CHAPTER 4. DATA EXPLORATION ON GR DATA
(a) Histograms
(b) Boxplots
35
4.1. DATA EXPLORATION RESULTS AND DISCUSSIONS
(a) Histograms
(b) Boxplots
36
CHAPTER 4. DATA EXPLORATION ON GR DATA
(a) Histograms
(b) Boxplots
37
4.2. HYPOTHESIS TESTING RESULTS AND DISCUSSIONS
Summing up the investigation, hole size grouping in each lithology group had im-
proved the GR data distribution significantly and it was proved that our presumption
was correct. The results also indicated that each groups are independent with others.
To prove these discoveries and the presumptions, we performed another hypothesis
testing in Chapter 4.2.2.
Grouping the GR data based on the hole size can also remove other error factors,
such as mud weight and casing size, because in the application usually the value of
mud weight and casing size are constant during drilling one hole section. We contem-
plated on correcting the data based on the borehole environment, but the correction
was not possible because we were constrained with the availability of GR tool descrip-
tion from the data source.
H1 : lithology type affected GR data value and GR data behavior in each lithology was
independent with the others.
The Kruskal-Wallis test was executed by using sample data from GR data and
grouping variable of the lithology type. The level of significance for the test was set to
5%. The summary of p-value and mean rank is presented in Table 4.7 and the detailed
result of ANOVA table is provided in Appendix C.
Table 4.7: P-value of the first hypothesis test for indicating any lithology group in GR
data. The detailed lithology group investigated in each well is provided in Table 4.8
Well P-Value
The lithology effect on GR data has been evaluated by using Kruskal-Wallis H test.
According to p-value given in each well, the null hypothesis was rejected due to p-
value<0.05. As stated in the theory, the Kruskal-Wallis test is an omnibus test, thus
we could not indicate which specific groups are statistically and significantly different
with the others. However, observation of the mean rank value in each group can give a
picture how each group differs from the others.
If a group has a mean rank value which relatively closed with the value from other
groups, then these groups are considered identical. In Well 15/5-7 A, the mean rank
38
CHAPTER 4. DATA EXPLORATION ON GR DATA
Table 4.8: Mean ranks of each lithology group in Well 15/5-7 A, Well 15/6-11 S, and
Well 15/6-9 S from the first hypothesis test
values of shale, sandstone, and chalk were significantly different. Meanwhile, sand-
stone and shale lithology in Well 15/6-11 S had quite similar mean rank (4804 and
4459, respectively). However at this state, we could not conclude that shale and sand-
stone lithology were identical because it was proved that there were subgroups of hole
size within shale and sandstone lithology (see Chapter 4.1.2).
The level of significance was set to 5%. The results of p-value are summarized in Ta-
ble 4.9 and the ANOVA table is provided in Appendix C.
Referring to the results of p-value in Table 4.9, all of the groups had p-value<0.05,
thus the null hypothesis was rejected. this also proved that there were at least 2 sub-
groups existed in each lithology group. In order to acknowledge any similarity between
the groups, we observed the mean rank value.
Observing groups in Well 15/5-7 A, we discovered that the shale lithology in 17 12 "
and 8 12 " group had similar mean rank values, respectively 2313 and 2556. However,
a contrast of the mean rank values between these groups was identified in sandstone
lithology. This observation showed that shale from group 17 12 " and 8 21 " were identical
but not for the sandstone lithology.
Another case from Well 15/6-11 S, both sandstone and shale had similar mean rank
values which identified in hole 12 14 " and 8 12 " group. Thus, the GR data from these
groups were identical for both sandstone and shale lithology and might be originated
from the same population. However, we took a careful approach and did not merge the
GR data of these groups in order to avoid errors.
In previous hypothesis testing, the observation of mean rank values of lithology
groups were limited because indication of hole size sub-group. Now, after grouping GR
data according to the hole size, we observed that lithologies in a group of hole size had
a diverse mean rank value. A significant variation of the mean rank of each lithology
39
4.2. HYPOTHESIS TESTING RESULTS AND DISCUSSIONS
was observed in the group of hole size 8 12 " in Well 15/6-11 S with mean rank of shale =
1545, sandstone = 455, chalk = 795, carbonate = 75. This observation proved that each
lithology group was independent and not identical with others.
Based on the observations of the results of hypotheses testing, the subsequent anal-
ysis on GR data would follow the discoveries from hypotheses test. Henceforth, the
analysis would be performed and presented separately according to the group of lithol-
ogy type and hole size.
Table 4.9: P-value result of the second hypothesis test. Empty (-) p-value results
indicated that there are no hole size group found in the particular lithology group.
40
CHAPTER 4. DATA EXPLORATION ON GR DATA
Table 4.10: Mean rank for each category. Empty (-) mean rank results indicated that
there are no mean rank for selected hole size and lithology group.
Lithology
Well Hole Size
Shale Sandstone Chalk Carbonate
26" 784 - - -
17 12 " 2435 1886 - -
Well 15/6-11 S
12 14 " 1545 455 795 47
8 12 " 1570 417 - -
26" 757 - - -
Well 15/6-11 S 17 12 " 2572 1043 - -
8 12 " 1883 152 570 145
41
4.3. CONCLUDING REMARKS
• Two hypothesis tests were run and the p-value from the results had shown that
GR data was dependent to the lithology type and the borehole size.
42
Chapter 5
This chapter introduces the application of KDE on GR data to get the estimation of
probability density. Later on, a validation process was performed to assess the effec-
tiveness of GR variable on classifying lithology by using the results acquired from KDE
application. The results and discussions from validation process are provided in a sep-
arated section in this chapter.
1 X n ³x −x ´
i
fˆ(x) = K , (2.24 revisited)
nh i =1 h
The kernel function applied for this analysis was Epanechnikov function by con-
sidering that this function minimizes the AMISE value (see Chapter 2.4.2). However,
it was stated that the choice of kernel function does not affect the result significantly
(Silverman, 1981).
The estimation was carried out by using MATLAB R2015A built-in function,
ksdensity, which returns the estimation of the probability density and the sample
vector (Deveaux, 1999). The optimal bandwidth of kernel smoothing, h, was also cal-
culated and selected by using this function automatically. The default of the optimal
bandwidth from this function is based on a distribution of normal densities. The val-
ues of the optimal bandwidth for each data group are provided in the result section
together with the results of probability density plots.
43
5.1. KDE ANALYSIS ON GR DATA
5.1.1 Results
The results of probability density plot for each well are provided based on hole size.
The lithology groups which presented in each hole size group were plotted together in
one axis with different line colors (indicated in the legend).
Well 15/5-7 A
(a)
(b)
(c)
44
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
(d)
Figure 5.1: Density probability plots with KDE and kernel bandwidth results of Well
15/5-7 A for present lithology in borehole size: (a)36", (b)26", (c)17 21 ", and (d)8 12 "
Well 15/6-11 S
(a)
(b)
45
5.1. KDE ANALYSIS ON GR DATA
(c)
(d)
(e)
Figure 5.2: Density probability plots with KDE and kernel bandwidth results of Well
15/6-11 S for present lithology in borehole size: (a)36", (b)26", (c)17 12 ", (d)12 41 ", and
(e)8 21 "
46
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
Well 15/6-9 S
(a)
(b)
(c)
Figure 5.3: Density probability plots with KDE and kernel bandwidth results of Well
15/6-9 S for present lithology in borehole size: (a)24", (b)17 12 ", and (c)8 21 "
5.1.2 Discussions
There are two discussion points presented within this discussion section. The first dis-
cussion point discusses the results from histogram and kernel estimator. While, the
second discussion point discussed about the GR data distribution for each lithology.
The plots of probability density estimated by kernel estimator in the result section
showed smoothed probability density lines along GR values. In addition, we also dis-
covered that the kernel estimator was able to match the non-parametric distribution
47
5.1. KDE ANALYSIS ON GR DATA
(a)
(b)
(c)
48
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
Observing the probability density of groups which contain two or more than two
lithologies, there were overlapped distributions of lithologies which forming region(s).
This region indicated that there were multiple lithologies exist for any GR value located
within it. Moreover, the size of the overlapped regions affects the misclassification rate.
The bigger the size of this region, more misclassification will likely occur.
Figure 5.5: Probability density of Well 15/5-7 A for hole size 17 21 ". The gray area is the
overlapping distribution between shale and sandstone lithology.
Group of 17 12 " from Well 15/5-7 A was dominated with high GR value of shale
and low GR value of sandstone. The separation between these lithologies was clear
if we observed the peak of probability density of each lithology. However, there was an
overlapped region formed because of the left-skewed distribution from shale lithology.
Similar cases also found in other groups which contain shale and sandstone lithology.
In another case, several overlapped regions were formed within groups which con-
tain additional chalk and/or carbonate lithology. This typical case was identified in
49
5.1. KDE ANALYSIS ON GR DATA
group of 8 12 " from Well 15/5-7 A and Well 15/6-9 S and 12 14 " from Well 15/6-11 S.
These multiple overlapped regions were formed due to low GR value of chalk, carbon-
ate, and sandstone lithology and the separation of these lithologies was difficult to be
identified. This discovery was consistent with the discussion from GR data exploration
in Chapter 4.
Based on this, chalk, carbonate, and sandstone lithology were grouped into one
group, which was non-shale lithology group, because these lithologies had identical
value of GR. This decision was also made to reduce the complexity of classification
of these lithologies. Henceforth, the lithology classification was simplified into shale
and non-shale lithology group. This type of classification was in accordance with the
usage of GR data which is shale identification. The results of probability density plots
categorized into shale and non-shale lithology are shown in Appendix D.
(b) Probability density with merging sandstone, chalk, and carbonate into non-shale lithology
group
Figure 5.6: Comparison of probability density from group of 8 21 " in Well 15/6-9 S
which are (a) not-merged and (b) merged
50
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
high radioactivity, there are some cases when high radioactivity is failed to be indi-
cated as shale, such as in the shaly sandstone formation, uranium bearing formation,
high radioactivity sandstone formation. These failed cases often cause misinterpreta-
tion which leads to lithology misclassification. This condition explained why the over-
lapped regions formed between shale and non-shale lithologies. There are methods
that have been proposed to improve the lithology classification in cases mentioned
above, such as the usage of P e measurement or quantitative analysis to indicate min-
erals in shaly sandstone (Poupon and Gaymard, 1970), uranium measurement from
spectral GR logging to indicate uranium-rich formation, and potassium reading to in-
dicate sandstone with mica formation (Ellis and Singer, 2010). However, the investiga-
tion on these cases was beyond the scope of this study.
FP + FN
Misclassification rate = , (5.1)
Total number of observation
where total number of observation = TN + FP + FN + TP.
From a brief explanation above, there are three main properties to be set within the
validation process: the models (testing dataset), the testing dataset, and the classifica-
tion rule.
The classification rule
Referring the theory of classification in Chapter 2.5, prior probability and cost of mis-
classification can be taken account in classification rule. In this study, we neglected
the misclassification cost, thus c(1|2) = c(2|1). However, we would investigate the
51
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
Predicted
π1 π2
belong to π1
π2 False Negative (FN): True Positive (TP):
Number of observations Number of observations correctly
incorrectly classified as π1 that classified as π2 that belong to π2
belong to π2
effect of prior probability and compared the result with another result which neglected
the effect of prior probability. Therefore, there were two classification rules applied in
this study, they were rules from Eq. 2.38 and Eq. 2.37.
1. Model 1, model without prior probability. This model was constructed by as-
suming that prior probabilities between two lithology groups were equal (p(1) =
p(2) = 0.5). This model would agree with the classification rule from Eq. 2.38.
2. Model 2, model with prior probability. This model was constructed by taking
account the effect of prior probabilities. The prior probability values were deter-
mined by calculating the number of observations of shale and non-shale lithol-
ogy from the geological description of testing dataset. The number of observa-
tions then standardized into 1 in order to fulfill the condition p(1)+p(2) = 1. This
model complied the classification rule from Eq. 2.37.
1. Experiment 1, a validation within groups in the same well. The source of testing
data for this experiment was generated from the same source of training data.
52
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
The ratio of training and testing dataset was set to ± 70 % /30 %, respectively. The
section which selected for training and testing data was not randomly selected,
but the selection was based on the consideration if we apply this experiment
in the field. In the practice, we would have a training dataset from formations
that have been drilled and with known lithology. Meanwhile, the testing dataset
was taken from the formations that recently been drilled with unknown lithology.
According to this, the testing dataset would always be taken from a section in a
greater depth than the depth of the section for testing data. Hence, the GR data
from the upper depth region was set as training data and the lower depth region
was set as testing data.
2. Experiment 2, a validation with neighboring wells within Block 15. The testing
data would be picked from neighboring wells in Block 15 and tested according to
the hole size. One additional well from Block 15, Well 15/6-12, was used in the
validation process for group of 12 14 " in Well 15/6-11 S.
3. Experiment 3, a validation with wells from different block, which is Block 16.
The wells used in this experiment were Well 16/1-14, Well 16/2-7, and Well 16/2-
13 A. Similar to experiment 2, the testing data would be tested according to the
hole size.
53
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
54
Figure 5.7: Preview of the interface for validation process built by using MATLAB
5.2.1 Results
Experiment 1
1. Well 15/5-7 A
Table 5.2: Summary of data and results for validation within each hole section in Well 15/5-7 A
17 12 " 1039 - 2180 2283 2180 - 2656.5 954 0.144 0.856 13 12.58
8 12 " 2657 - 3800 2287 3800 - 4119 639 0.541 0.459 11.27 9.86
55
2. Well 15/6-11 S
Table 5.3: Summary of data and results for validation within each hole section in Well 15/6-11 S
17 12 " 690 - 1730 2081 1730 - 2181 903 0.158 0.842 12.62 15.84
12 14 " 2181.5 - 3320 2278 3320 - 3816.5 994 0.447 0.553 13.88 12.07
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
3. Well 15/6-9 S
Table 5.4: Summary of data and results for validation within each hole section in Well 15/6-9 S
17 12 " 753 - 2180 2855 2180 - 2785.5 1212 0.368 0.632 43.48 38.28
8 12 " 2786 - 3590 1609 3590 - 3942 705 0.684 0.352 23.69 25.39
Experiment 2
Table 5.5: Summary of data and results for validation of Well 15/6-11 S and Well 15/6-9 S , by using Well 15/5-7 A as training data
26" 194 - 1038.5 1690 Well 15/6-11 S 187 - 689.5 1006 1 0 2.19 0
Table 5.6: Summary of data and results for validation of Well 15/5-7 A and Well 15/6-9 S , by using Well 15/6-11 S as training data
26" 187 - 689.5 1006 Well 15/5-7 A 194 - 1038.5 1690 0.763 0.237 23.67 23.67
12 14 " 2181.5 - 3816.5 3271 Well 15/6-12 2754 - 3628.5 1750 0.01 0.99 2.17 1.14
57
Table 5.7: Summary of data and results for validation of Well 15/5-7 A and Well 15/6-11 S , by using Well 15/6-9 S as training data
Table 5.8: Summary of data and results for validation of wells in Block 16 by using Well 15/5-7 A as training data
Table 5.9: Summary of data and results for validation of wells in Block 16 by using Well 15/6-11 S as training data
Table 5.10: Summary of data and results for validation of wells in Block 16 by using Well 15/6-9 S as training data
5.2.2 Discussions
There are 3 main points discussed in this section, they are the results of misclassifi-
cation rate, the effect of prior probability, and the explanation of how the validation
method for lithology prediction can be applied in the field. Each point is discussed in
the separated section.
The increasing rates of misclassification in experiment 2 and 3 were due to the dif-
ference of GR data distribution between the model and the testing data. Several exam-
ples of experiment 1, 2, and 3 were taken from model of 17 12 " in Well 15/6-9 S and the
distribution between the model and testing dataset are presented in Fig. 5.8. A contrast
between the distribution of each dataset was improved from experiment 1 to experi-
ment 3. If we observed the peak of the probability density of each lithology, it could
be seen that the GR value from testing data in experiment 2 were shifted away from
the GR value of the model and created a gap of GR value. A greater gap of testing data
distribution was indicated in experiment 3.
Summing up the observations above, we believed that the increased of misclassifi-
cation rate was related to the uncertainty factors on GR data. And we were certain that
the further away the location of wells used as the testing data source, the uncertainty
was increased. We identified some of the possible source of uncertainty, they were
geological factors, borehole environment, and GR tool. The uncertainty from geologi-
cal factors includes the mineral content, lithology mixture, and depositional environ-
ments of the formations. However, this geological factor was believed to have a small
contribution to the classification in this study because the wells tested in experiment
2 and 3 located in the same sub-basin, Viking Graben, and GR values of one formation
are less likely to change in vertical direction.
62
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
(a) Experiment 1, with testing data from Well 15/6-9 S(2180 - 2785.5 m)
63
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
Figure 5.8: Training and testing data distributions from different experiments for
group of 17 21 ". The testing data was from Well 15/6-9 S
Borehole environment and GR tools were considered as the most contributing fac-
tors to this misclassification. We dismissed the effect of hole size because the models
were tested according to the hole size. However, factors such as drilling fluid, the tool
position, hole cavity, cement type, and the tool size could contribute as uncertainty
factors, notably with the lack of correction from the source and data limitation.
All of these uncertainties rise through the imperfect knowledge. However, the effect
from these uncertainties could be minimized if we integrate data from other source.
One of the example is the usage of data from geological description in addition to GR
data for lithology classification. This example was applied in this study and the ap-
plication was presented by adding prior probability into the classification rule. The
detailed discussion of the results from this application is shown in the next section.
Within this discussion point, we would discuss the result of validation from model 1
and 2, the way prior probability working in reducing misclassification rate, and condi-
tions that lead model 2 fails to reduce the misclassification rate. Referring to Table 5.11,
model 2 from experiment 2 and 3 produced less misclassification rate, with difference
around 3% and 1% respectively, compared to model 1. Meanwhile, model 2 from ex-
periment 1 produced error 0.5% higher compared to model 1.
Based on the number of cases, 4 out of 7 total cases (4/7) in experiment 1 showed
that model 2 yielded less misclassification rate compared to model 1. A similar event
also occurs in several cases from experiment 2 and 3, with the number of cases 7/13
and 11/18 respectively. This observation discovered that more than half of the cases in
each experiment proved that model 2 produced less misclassification rate.
64
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
The unique values of prior probability in model 2 indicate the true distribution of
lithologies of the training data. This was unmistakable because the prior probability
values were generated based on the number of observation of each lithology from the
geological description of training data. This was different from model 1 which assumed
that the lithologies have an equal distribution.
So, in what way that the true distribution affects the classification? By recalling and
modifying the equation from Eq. 2.37, we will get following equation
f 1 (x) p(1)
R1 : ≥1
f 2 (x) p(2)
(5.2)
f 1 (x) p(1)
R2 : <1
f 2 (x) p(2)
According to the equation above, the true distribution from the prior probability will
affect the value of probability density of the models for each lithology group. And as the
further effect, the data distribution of the model for each lithology would also change.
In order to get a better understanding of this explanation, we took an example of ex-
periment 1 from group of 17 21 " in Well 15/6-9 S.
In model 1, the values of prior probability of shale and not shale lithology were
equal, p(shale) = p(not-shale) = 0.5. By multiplying these values to the probability
density, the model would produce probability density plot shown in Fig. 5.9a. Refer-
ring to the plot and the classification rule, all GR values which fulfill the condition of
f 1 (x) p(1)
f 2 (x) p(2) ≥ 1 fall within the green area which indicates shale region, R(shale). Mean-
f (x) p(1)
while, all GR values which fulfill the condition of f 21 (x) p(2) < 1 fall within the blue area
which indicates not-shale region, R(not-shale).
In model 2, the values of prior probabilities were calculated and the results were
p(shale)=0.368 and p(not-shale)=0.632. By multiplying these values into the proba-
bility density of each lithology, we generated model 2 which shown in Fig. 5.9b. Be-
cause the value of p(shale) is smaller compared to the value of p(not-shale), the area
of R(shale) is also smaller than the area of R(not-shale). In addition, the distribution of
probability density of model 2, f(shale) and f(not-shale), resembled the distribution of
training data (Fig. 5.9c) much closer compared to model 1.
By observing the plots from both model 1 and 2, the red dashed line which is
formed at the intersection of f(shale) and f(not-shale) separated R(shale) and R(not-
shale) very clearly. According to this, we discovered that the x-value of this intersection
point acted as a discriminator between these two lithology groups. Meanwhile, the
y-value of this intersection point influenced the size of overlapped region.
In order to understand in what extent these values can reduce the misclassification
rate, we summarized the x- and y- values of the intersection points from the example
in Table 5.12. The table shows that model 2 approached the x-values of testing data
much closer than model 1. Meanwhile, the less y-value was given from model 2. From
this observation, we believed that the more identical the x-values between the model
and the training data, less misclassification rate would be yielded, especially by know-
ing that the x-value behave as discriminator between the lithology groups. Meanwhile,
the smaller the y-value, the smaller the overlapped region of the model thus the clas-
sification from the model would be more precise. Even though the difference of these
values between model 1 and 2 was rather small, the misclassification error reduced by
5%.
65
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
Figure 5.9: Plot of probability density for case in experiment 1: 17 21 " in Well 15/6-9 S
66
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
Table 5.12: The values of the intersection points of the models and the testing data
from experiment 1, group of 17 21 " in Well 15/6-9 S
Intersection points
Data
x-values (API) y-values
The classification by using model 2 did not always produce less misclassification
rate but instead, model 1 produced a better result. We sensed that the main reason of
this event was the difference of the shape of data distribution between the model and
the testing. This reason indeed is related to the reason causing high misclassification
rate from the previous discussion point. However, the impact of this reason for this
case was somewhat different. We observed that in some cases adding prior probability
increased the contrast of x-value between the model and the testing data which led
to misclassification rate increment. Two examples showing this event are shown in
Fig. 5.10 and Fig. 5.11.
The discussions of model validation above give insights of how GR data can be pro-
cessed for lithology classification. Within this discussion point, we also presented the
best way to apply the method in the practice.
The experiments results discovered that the classification by using training and
testing data which originated from the same source was the most optimum in min-
imizing the misclassification rate. According to this, we suggested that the training
data for this application will be generated from the drilled section which lithology type
of this section is acknowledged and confirmed by other interpretation methods.
However, this may come with limitation during drilling formation at the beginning
of a section because no data from drilled formation is available. Therefore, we recom-
mended the usage of data from the neighboring well as the training data during drilling
the top of a section. And in order to reduce the errors from uncertainty factors, it is re-
quired to select the neighboring well which fairly identic to the current drilled well.
Once the information from the drilled formation is considered adequate, the training
data can be switch and use data from the drilled formation.
Meanwhile, the testing data will be taken from the section that just been drilled.
The usage of prior probability was also recommended to improve the results and the
value of prior probability can be calculated from the prediction of geological descrip-
tion of the current well. Once the data is classified toward the training data, the clas-
sification results can be verified with the results from other interpretation method to
check any misclassification from the proposed method. In our vision, we sensed that
the application of this classification method in the field could improve the lithology
interpretation and optimize the drilling operation.
67
5.2. VALIDATION OF PROBABILITY DENSITY ON GR DATA BY USING KDE
(a) Model 1
(b) Model 2
Figure 5.10: The f(shale) of the models is more skewed to the left compared to the
f(shale) of the testing data. Multiplying the probability density with prior probabilities
in model 2 shifted the x-value of intersection point to right side (52.73 GR API), further
away from the x-value in testing data (29.78 GR API)
68
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
(a) Model 1
(b) Model 2
Figure 5.11: The f(not-shale) in training data has bimodal distribution and f(shale) has
right-skewed distribution. Meanwhile, f(not-shale) of testing data has right-skewed
distribution and f(shale) is less skewed than f(shale) in training data. Adding prioir
probability shifted the intersection point to 92.3 GR API, further away from the
intersection in the testing data (68.3 GR API)
69
5.3. PRELIMINARY STUDY OF BIVARIATE ANALYSIS
ures show that the overlapping distribution of lithologies indicated from one variable
could be distinguished by another variable. In other words, the GR and neutron vari-
ables complemented each other. This is because GR and neutron tool have different
principal of measurement, thus each tool has different sensitivity in the particular rock
types. The results also showed better separation of lithologies and we expected that the
misclassification rate could be reduced.
70
CHAPTER 5. UNIVARIATE KDE ANALYSIS ON GR DATA
Figure 5.12: The overlapped distribution of shale and sandstone lithology from GR
could be separated by neutron data. Shale is dominated with high neutron value,
while sandstone has small neutron value.
Figure 5.13: The overlapped region of chalk and carbonate lithologies from neutron
could be separated by using GR. Chalk has smaller GR value than to carbonate.
71
5.4. CONCLUDING REMARK
• The validation was applied to assess the GR data in lithology classification. From
the investigation, the quality of classification depends on the distribution of
training and testing data.
• The prior probability values which extracted from geological description were
effective in reducing the misclassification rate.
• The classification method from this study can be applied for lithology interpre-
tation in the practice. The application was considered useful to reduce time of
lithology interpretation during drilling operation.
• The preliminary study of bivariate analysis proved that by using two variables
of GR and neutron, the lithology classification was improved. The most signifi-
cant result was indicated from the separation of overlapped distribution of one
variable.
72
Chapter 6
Conclusions
In this study, a quantitative analysis of GR data for lithology classification and predic-
tion has been performed. Prior to the main analysis of this study, data exploration and
kernel application on GR data were carried out. The results from these processes were
exploited for the main analysis which was validation as a part of an assessment of vari-
able GR for lithology classification.
The data exploration was executed by visualization of statistical graphs: histogram
and boxplot, and analysis of statistical descriptive. At the beginning of the process, GR
data were grouped based on the lithology type. As the results were investigated, the
discovery showed that GR data contain sub-groups which originated from variable of
hole size. This was indicated from the high variance of GR within the lithology groups
and the contrast of GR values between different hole sizes during visualization GR in
log trace. We sensed that the sub-groups were emerged due to uncorrected GR data.
In order to support the discoveries, we tested the hypotheses from data exploration
by using Kruskal-Wallis H test. The results of p-value and mean rank value from these
tests agreed with the discoveries from data exploration. The results also proved that
GR data depended on lithology and hole size variables. The correction on GR data was
not possible due to inadequate informations from the data source, thus the GR data for
further analysis remained to be grouped based on the hole size.
The kernel density estimator was applied in this study to approximate the proba-
bility density of GR data which further would be employed for lithology classification.
The kernel estimator was chosen because of the capability of kernel estimator to return
a continuous probability density and approach the non-parametric distribution of GR
data. The type of kernel function applied in this study is Epanechnikov function. The
results of probability density from kernel application were preferred over histogram
due to the discreteness of histogram. The results of the probability plots also showed
that the lithology group within GR data was more convenient to be grouped into shale
and not-shale lithology. This was due to sandstone, chalk, and carbonate lithologies
had similar GR value range which was smaller than the shale lithology. In addition to
that, error from misclassifying of these 3 lithologies could be prevented.
The lithology classification and validation were performed by using 2 different
models in 3 different experiments. The results showed that the misclassification rate
was reduced significantly in experiment 1 which model and testing data were taken
from the same data source. Meanwhile, model 2 which added prior probability value
within the classification rule had less misclassification rate compared to model 1 which
neglected the prior probability value.
73
Summing up the observations of the main analysis, the main cause of lithology
misclassification was due to the difference of data distribution between the model and
the testing data. The most significant difference in data distribution was indicated in
experiments which testing data was taken from different wells and/or different block.
We sensed that the difference in data distribution was caused by uncertainty factors.
However, the effect of uncertainty factors could be reduced or even dismissed if the
data quality could be improved.
Considering that data quality was beyond the scope of this study, we suggested to
perform the experiment by using a corrected data logging. We also suggested to carry
out a similar study which lithology was classified by using the relative value of GR log,
thus the necessity of correction could be neglected and GR can be analyzed without
grouping the data based on the hole size.
The application of the method proposed in this study has been explained. The ap-
plication would give opportunities to predict and classify the lithology from GR read-
ing. Since the prediction was still limited and only processed the reading from the tool
location inside the borehole, a further work to improve the prediction beyond the tool
was recommended. In addition, the prediction could be improved by using the pos-
terior probability calculation from Bayes formula. The prediction will be more precise
because this formula calculates the conditional probability which are useful for deci-
sion making (Ott and Longnecker, 2010).
The results from bivariate analysis proved that the lithology classification could be
improved by using 2 variables: GR and neutron logging. We recommended to continue
the study of bivariate analysis through implementation of bivariate kernel estimation
and validation. It was also possible to combine data from mud logging and well logging
within the bivariate analysis and study the the most optimum variables combination.
In addition to that, discriminate analysis was also suggested so that the classification
could be enhanced.
74
Appendix A
Acronyms
GR Gamma Ray
75
Appendix B
Geological Description
76
APPENDIX B. GEOLOGICAL DESCRIPTION
77
Rødby The upper boundary of Rødby formation and
Shetland Group has characteristic of decreasing
gamma ray reading toward Rødby formation. The
dominated lithology in this formation is chalk. The
Rødby formation has depositional environment of
an open marine with reddish sediments and
oxygenated environment.
78
APPENDIX B. GEOLOGICAL DESCRIPTION
79
Appendix C
The ANOVA table showed the summary of hypothesis testing. The calculated variables
are described below.
Source indicates the source of the variation in data. There are 3 sources provided,
Groups, Error and Total. Groups indicates the data variation due to the factor of
interest, or variation in the populations being compared. Error means the vari-
ation within each groups being compared, while Total means the total variation
in population data.
df means the degrees of freedom in the source. The formulas to calculate the df shown
as
, where k is the number of groups in the source and N is the number of measure-
ments in the source.
SS means the sum of squares in the source. SSGr oups calculates the sum of squares
between treatment groups, which formula is shown in Equation 2.12. SS Er r or
indicates the sum of squares within groups which following
N X
X k ¡ ¢2
SS Er r or = yi j − M j (C.2)
i =1 j =1
80
APPENDIX C. ANOVA TABLE RESULTS OF HYPOTHESES TEST
Table C.4: ANOVA table of hypothesis testing #2 for shale lithology in Well 15/5-7 A
Table C.5: ANOVA table of hypothesis testing #2 for sandstone lithology in Well 15/5-7
A
Table C.6: ANOVA table of hypothesis testing #2 for shale lithology in Well 15/6-11 S
81
Table C.7: ANOVA table of hypothesis testing #2 for sandstone lithology in Well
15/6-11 S
Table C.8: ANOVA table of hypothesis testing #2 for shale lithology in Well 15/6-9 S
Table C.9: ANOVA table of hypothesis testing #2 for sandstone lithology in Well 15/6-9
S
82
Appendix D
The results of probability density of GR estimated by kernel estimator for two cate-
gories shale and non-shale lithology are provided in this section. The results were gen-
erated from GR reading in each hole section of Well 15/5-7 A, Well 15/6-11 S, and Well
15/6-9 S.
83
D.2. WELL 15/6-11 S
Figure D.1: Probability density results of each hole section in Well 15/5-7 A using
kernel estimator grouped into shale and non-shale lithology
84
APPENDIX D. RESULTS OF GR PROBABILITY DENSITY OF TWO CATEGORIES:
SHALE AND NON-SHALE
Figure D.2: Probability density results of each hole section in Well 15/6-11 S using
kernel estimator grouped into shale and non-shale lithology
85
D.3. WELL 15/6-9 S
Figure D.3: Probability density results of each hole section in Well 15/6-9 S using
kernel estimator grouped into shale and non-shale lithology
86
Appendix E
87
E.1. WELL 16/1-14
(a)
(b)
Figure E.1: Scatter plot and probability density plots of GR and neutron for Well
16/1-14 in hole section: (a) 12 14 " and (b) 8 12 "
88
APPENDIX E. RESULTS OF BIVARIATE ANALYSIS OF GR AND NEUTRON
(a)
(b)
89
E.2. WELL 16/2-7
(c)
Figure E.2: Scatter plot and probability density plots of GR and neutron for Well
16/2-7 in hole section: (a) 17 12 ", (b) 12 14 ", and (c) 8 12 "
90
APPENDIX E. RESULTS OF BIVARIATE ANALYSIS OF GR AND NEUTRON
(a)
(b)
Figure E.3: Scatter plot and probability density plots of GR and neutron for Well
16/2-13 A in hole section: (a) 12 14 " and (b) 8 12 "
91
Bibliography
Bourgoyne, A. T., Millheim, K. K., and Chenevert, M. E. (1985). Applied Drilling Engi-
neering. Society of Petroleum Engineers, Richardson.
Burke, J., Campbell, Jr., R., and Schmidt, A. (1969). The Litho Porosity Cross Plot: A
New Concept For Determining Porosity And Lithology From Logging Methods. In
SPWLA-1969-Y, SPWLA. Society of Petrophysicists and Well-Log Analysts.
Busch, J., Fortney, W., and Berry, L. (1987). Determination of Lithology From Well Logs
by Statistical Analysis. SPE-14301-PA.
Clavier, C., Hoyle, W., and Meunier, D. (1971). Quantitative Interpretation of Thermal
Neutron Decay Time Logs: Part I. Fundamentals and Techniques. SPE-2658-A-PA.
Clavier, C. and Rust, D. (1976). Mid Plot:a New Lithology Technique. SPWLA-1976-
vXVIIn6a2.
Delfiner, P., Peyret, O., and Serra, O. (1987). Automatic Determination of Lithology
From Well Logs. SPE-13290-PA.
Ellis, D. V. and Singer, J. M. (2010). Well logging for earth scientists. Springer, Dordrecht.
Frigge, M., Hoaglin, D. C., and Iglewicz, B. (1989). Some Implementations of the Box-
plot. The American Statistician, 43(1):50.
92
BIBLIOGRAPHY
Laosripaiboon, L., Saiwan, C., and Prurapark, R. (2015). Reservoir Characteristics In-
terpretation by Using Down-Hole Specific Energy With Down-Hole Torque and Drag.
In OTC-25890-MS, OTC. Offshore Technology Conference.
Mann, H. B. and Whitney, D. R. (1947). On a Test of Whether one of Two Random Vari-
ables is Stochastically Larger than the Other. The Annals of Mathematical Statistics,
18(1):50–60.
Ott, L. and Longnecker, M. (2010). An introduction to statistical methods and data anal-
ysis. Brooks/Cole Cengage Learning, Australia ; United States, 6th ed edition.
Poupon, A. and Gaymard, R. (1970). The Evaluation Of Clay Content From Logs. In
SPWLA-1970-G, SPWLA. Society of Petrophysicists and Well-Log Analysts.
Privitera, G. J. (2015). Statistics for the behavioral sciences. SAGE, Los Angeles, second
edition edition.
Schlumberger Wireline & Testing (1998). Log interpretation charts. Schlumberger Wire-
line & Testing, Sugar Land, Texas.
Scott, D. W., editor (1992). Multivariate Density Estimation. Wiley Series in Probability
and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, USA.
Serra, O., Delfiner, P., and Levert, J. C. (1985). Lithology Determination From Well-Logs:
Case Studies. In SPWLA-1985-WW, SPWLA. Society of Petrophysicists and Well-Log
Analysts.
Silverman, B. (1981). Density estimation for univariate and bivariate data. In Interpret-
ing multivariate data, pages 37–53. Wiley: Chichester, v edition.
Silverman, B. W. (1986). Density estimation for statistics and data analysis, volume 26.
CRC press.
93
BIBLIOGRAPHY
Stieber, S. (1970). Pulsed Neutron Capture Log Evaluation - Louisiana Gulf Coast. In
SPE-2961-MS, SPE. Society of Petroleum Engineers.
Wilson, R. (1955). Mud Analysis Logging And Its Use In Formation Evaluation. In SPE-
587-G, SPE. Society of Petroleum Engineers.
Ziaja, M. and Roegiers, J.-C. (1998). Lithology Diagnosis Based on the Measurements of
Drilling Forces and Moments at the Bit. In SPE-47799-MS, SPE. Society of Petroleum
Engineers.
94