1. Introduction
Images obtained with coherent illumination systems, such as Synthetic Aperture Radar (SAR), are contaminated by speckle. This noise-like interference phenomenon corrupts the image in a non-Gaussian and non-additive manner, making its processing and visual interpretation difficult.
Against this backdrop, statistical procedures are essential tools for processing SAR data. A suitable model for this kind of image is fundamental for obtaining features that support a good analysis. In this sense, the family of distributions introduced in [1] has been extensively used to model SAR data because of its analytical simplicity and its ability to describe targets with a wide variety of roughness levels.
The application of machine and deep learning techniques to the classification, segmentation, and detection of objects in SAR images has become increasingly popular in recent years. Palacio et al. [2] used machine learning techniques in combination with filters to perform classification in PolSAR images. Baek and Jung [3] compared three different machine learning techniques for classifying single- and dual-pol SAR images, showing that the deep neural network presented the best performance.
Different authors used methods based on transfer learning techniques to classify SAR images. These methods aim to solve the problem of having limited labeled area information to train deep convolutional neural networks (CNN). Kang and He [
4] applied this technique using a CNN trained on the CIFAR-10 dataset to extract a mid-level representation. They showed that this technique is adequate for handling the limited amount of labeled SAR data by comparing their results against a CNN trained without it and against a Support Vector Machine (SVM) combined with a Gabor filter or with gray-level co-occurrence matrices. Lu and Li [
5] implemented this methodology using several popular pre-trained models and proposed a new method of data augmentation. They also made a comparison with some related works and showed that their proposed method outperformed the others. Huang et al. [
6] proposed to transfer the knowledge obtained from a large number of unlabeled SAR images by incorporating a reconstruction path with stacked convolutional autoencoders into the network architecture. Their proposal was competitive on the MSTAR dataset when all training samples were used, and had the best performance when the training dataset was small.
Transfer learning was also implemented by Rostami et al. [
7]. They proposed to transfer the knowledge from the electro-optical domain to SAR by learning a shared embedding space, and they showed that their approach is effective when applied to a ship classification problem. Huang et al. [
8] proposed another deep transfer learning method to solve the land cover classification problem with highly unbalanced classes, geographic diversity, and noisy labels. They showed that the proposed model, which uses cross-entropy, can be generalized and applied to other SAR domains.
Several approaches have been developed in order to obtain expressive and tractable features from SAR data. In particular, entropy measures have been widely used for this purpose. Parameter estimation [
9], classification [
10], procedures for constructing confidence intervals and contrast measures [
11,
12], edge detection [
13], and noise reduction filters [
14] are among their applications.
Several authors have tackled the problem of segmenting and classifying SAR images using information-theoretic measures. Nobre et al. [
15] used Rényi’s entropy for monopolarized SAR image segmentation. Ferreira and Nascimento [
16] derived a closed-form expression for the Shannon entropy based on the
law for intensity data and proposed a new entropy-based segmentation method. Carvalho et al. [
10] employed stochastic distances for unsupervised classification of Polarimetric Synthetic Aperture Radar (PolSAR) images. Shannon entropy has been applied to the analysis of SAR imagery in several approaches, from inference [
11] to classification [
16]. Therefore, its estimation deserves attention.
The parametric expression of the Shannon entropy for a system characterized by a continuous random variable is the following well-known expression:
$$H_Z = -\int_{\mathbb{R}} f(z)\,\ln f(z)\, dz, \qquad (1)$$
where f is the probability density function that characterizes the distribution of the real-valued random variable Z. Several procedures can be applied to obtain an estimate of $H_Z$ given a random sample $Z_1, Z_2, \ldots, Z_n$.
The most direct family of estimators of $H_Z$ consists of obtaining estimators of the parameter that indexes the distribution of Z, and then using them in (1). This approach yields the families of maximum likelihood, moments, and robust estimators, to name a few. This is the “parametric approach”.
“Non-parametric” approaches do not use estimates of the model parameter as a proxy. Instead, they rely on the equivalent expression for the Shannon entropy given by
$$H_Z = \int_0^1 \ln\left\{\frac{d}{dp}\, F^{-1}(p)\right\} dp, \qquad (2)$$
where F is the cumulative distribution function, which also characterizes the distribution of the random variable [
17]. Such alternative approaches compute estimates of
F in Equation (
2) from the observed sample. Vasicek [
17] replaced the distribution function
F by the empirical distribution function
and used a difference operator in place of the differential operator. van Es [
18] studied an entropy estimator based on differences between order statistics. Correa [
19] proposed a new entropy estimator determined from local linear regression. Al-Omari [
20] and Noughabi and Noughabi [
21] presented modified versions of the estimator introduced by Ebrahimi et al. [
22].
It is important to mention that these estimators have been studied in different contexts. Maurizi [
23] studied the works by Vasicek [
17] and van Es [
18] to estimate the entropy
when the random variable has support
. Noughabi and Park [
24] considered them to propose goodness of fit tests for the Laplace distribution. Suk-Bok et al. [
25] assessed the proposal by [
17] to estimate
for the double exponential distribution in the framework of multiple type-II censored sampling. More recently, Al-Labadi et al. [
26] considered these estimators to propose a new Bayesian non-parametric estimation of the entropy. Additionally, Lopes and Machado [
27] included Ref. [
22] in their review of entropy estimators.
In this paper, we study the performance of parametric and non-parametric estimators of the entropy in the context of supervised and unsupervised classification. In the parametric case, we use the relationship between the $\mathcal{G}_I^0$ and Fisher–Snedecor distributions to obtain an expression for the entropy. In the non-parametric case, we assess these estimators in terms of bias, mean squared error, computational time, and accuracy.
2. Materials and Methods
2.1. The Model
The multiplicative model defines the return Z in a monopolarized SAR image as the product of two independent random variables: one corresponding to the backscatter X, and the other to the speckle noise Y. In this manner, Z = XY represents the return in each pixel of the image.
The $\mathcal{G}_I^0$ distribution is an attractive model for Z because of its flexibility to adequately model areas with all types of roughness [
28,
29]. For intensity SAR data, this family arises from modeling the speckle noise Y as a Gamma-distributed random variable with unit mean and shape parameter L, the number of looks. We also assume that the backscatter
X obeys a reciprocal gamma law with parameters $\alpha$ and $\gamma$. Thus, the density function for intensity data is given by
$$f_Z(z) = \frac{L^{L}\,\Gamma(L-\alpha)}{\gamma^{\alpha}\,\Gamma(L)\,\Gamma(-\alpha)}\;\frac{z^{L-1}}{(\gamma + L z)^{L-\alpha}}, \qquad z > 0, \qquad (3)$$
where $\alpha < 0$, $\gamma > 0$, and $L \geq 1$. The r-order moment is
$$E(Z^{r}) = \left(\frac{\gamma}{L}\right)^{r} \frac{\Gamma(-\alpha - r)\,\Gamma(L + r)}{\Gamma(-\alpha)\,\Gamma(L)},$$
provided $-\alpha > r$, and infinite otherwise.
Mejail et al. [
28] proved a relationship between the
distribution and the Fisher–Snedecor
F law, which states that the cumulative distribution function
for the return
Z is
for every
, where
is the cumulative distribution function of a Fisher–Snedecor random variable with
and
degrees of freedom. This connection is helpful for obtaining a closed formula for the entropy.
2.2. Shannon Entropy
Shannon’s contribution to the creation of what is known as information theory is well known. Shannon [
30] proposed a new way of measuring the transmission of information through a channel, thinking of information as a statistical concept. The entropy of the
distribution can be obtained using (
4). Denote
as the entropy under the Fisher–Snedecor model; then the
entropy for intensity data
is
Using (
5), the expression of
is
where
and
B are the digamma and beta functions, respectively.
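To make the step from (4) to (5) explicit, recall that differential entropy is equivariant under positive scaling: $H_{aW} = H_W + \ln a$ for any $a > 0$. A minimal sketch, assuming (as in Mejail et al. [28]) degrees of freedom $2L$ and $-2\alpha$ and scale factor $\gamma/(-\alpha)$, values stated here as an assumption since they are not reproduced above, is
$$Z \stackrel{d}{=} \frac{\gamma}{-\alpha}\, W, \quad W \sim F(2L,\,-2\alpha) \quad \Longrightarrow \quad H_Z(\alpha,\gamma,L) = H_{F(2L,-2\alpha)} + \ln\frac{\gamma}{-\alpha}.$$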
Figure 1 shows the theoretical entropy
as a function of
and
with
. It can be shown that for each fixed
value,
is an injective function. The same behavior repeats if we consider
as a constant.
2.3. Shannon Entropy Estimators
Several authors have proposed entropy estimators using (
2). Most of them are based on order statistics of the sample. Al-Omari [
20] presented an overview of these estimators and also proposed a new one. From a parametric point of view, it is natural to consider the maximum likelihood estimator (ML) of the entropy (
).
In what follows, we describe the entropy estimators studied in this paper.
2.3.1. Maximum Likelihood Entropy Estimator
Let
be an independent random sample of size
n from the
distribution. Assume that
L is known. The maximum likelihood estimator of
for
L is known and denoted as
, which consists of the values in the parametric space
, which maximize the reduced log-likelihood function:
Solving (
7) requires numerical maximization routines that, under certain circumstances, do not converge [
31]. We use the L-BFGS-B version of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [
32] that allows box constraints. This algorithm belongs to the family of quasi-Newton methods, which require only the gradient and not the Hessian matrix. The optimal asymptotic properties of the ML estimator are well known.
The ML entropy estimator [
33] is
This estimator inherits all of the good properties of ML estimators (consistency and asymptotic normality), but also their pitfalls: sensitivity to the initial value, lack of convergence due to flatness of (
7), and lack of robustness. Convergence problems, which are more prevalent with small samples and with data from textureless areas, were identified by Frery et al. [
31] and mitigated with a line-search algorithm. Refs. [
9,
34,
35] studied robust alternatives to (
7).
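As an illustration of the procedure just described, the following sketch maximizes the reduced log-likelihood with L-BFGS-B and plugs the estimates into the entropy. It assumes the $\mathcal{G}_I^0$ density of Section 2.1 and the Fisher–Snedecor connection with degrees of freedom $2L$ and $-2\alpha$ and scale $\gamma/(-\alpha)$; the starting point, the box constraints, and the function names are illustrative choices rather than the ones used in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import f as fisher_snedecor

def neg_log_likelihood(theta, z, L):
    """Negative log-likelihood of the G_I^0 law (alpha < 0, gamma > 0), L known."""
    alpha, gamma = theta
    log_f = (L * np.log(L) + gammaln(L - alpha) - alpha * np.log(gamma)
             - gammaln(L) - gammaln(-alpha)
             + (L - 1.0) * np.log(z) - (L - alpha) * np.log(gamma + L * z))
    return -np.sum(log_f)

def ml_entropy(z, L, start=(-3.0, 2.0)):
    """Plug-in (ML) entropy estimate; box constraints keep alpha < 0 and gamma > 0."""
    res = minimize(neg_log_likelihood, start, args=(np.asarray(z, dtype=float), L),
                   method="L-BFGS-B", bounds=[(-50.0, -0.1), (1e-6, None)])
    alpha_hat, gamma_hat = res.x
    # Entropy via the (assumed) Fisher-Snedecor connection: F(2L, -2*alpha),
    # scaled by gamma / (-alpha); scipy adds log(scale) to the base entropy.
    return fisher_snedecor(2 * L, -2 * alpha_hat,
                           scale=gamma_hat / (-alpha_hat)).entropy()
```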
2.3.2. Non-Parametric Entropy Estimators
Assume that $Z_1, Z_2, \ldots, Z_n$ is a random sample from the law characterized by the distribution function F, whose order statistics are $Z_{(1)} \leq Z_{(2)} \leq \cdots \leq Z_{(n)}$. Vasicek [17] proposed the following entropy estimator:
$$\widehat{H}_{\mathrm{V}} = \frac{1}{n} \sum_{i=1}^{n} \ln\left\{\frac{n}{2m}\left(Z_{(i+m)} - Z_{(i-m)}\right)\right\}, \qquad (9)$$
with m a positive integer smaller than n/2, $Z_{(i+m)} - Z_{(i-m)}$ the spacing of order m, or m-spacing, $Z_{(i-m)} = Z_{(1)}$ if $i - m < 1$, and $Z_{(i+m)} = Z_{(n)}$ if $i + m > n$. The author proved that this estimator is weakly consistent for $H_Z$ when $n, m \to \infty$ and $m/n \to 0$.
The only possible numerical problem with this estimator and its variants is having zero as the argument of the logarithm, a situation that can be easily checked and solved. Their computational complexity reduces to adding differences of order statistics. These estimators are robust by nature, since they do not depend on any particular model. Unlike the approaches discussed in Refs. [
9,
34,
35], achieving such robustness does not impose a heavy computational burden.
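A minimal sketch of Vasicek's estimator (9) with the boundary convention stated above and an explicit guard for the zero-spacing issue just mentioned; the function name and interface are illustrative.

```python
import numpy as np

def vasicek_entropy(z, m):
    """Vasicek's m-spacing entropy estimator, a sketch of Eq. (9)."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    i = np.arange(n)
    upper = z[np.minimum(i + m, n - 1)]   # Z_(i+m), clamped to Z_(n)
    lower = z[np.maximum(i - m, 0)]       # Z_(i-m), clamped to Z_(1)
    spacings = upper - lower
    if np.any(spacings <= 0):             # ties would produce log(0); check explicitly
        raise ValueError("zero m-spacing: tied observations in the sample")
    return np.mean(np.log(n * spacings / (2.0 * m)))
```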
Several authors introduced modifications to Vasicek’s estimator. In this work, we consider the following entropy estimator variants, surveyed by Al-Omari [
20].
Correa [
19]:
where
.
Noughabi and Arghami [
36]:
where
and
if
and
for
.
Al-Omari [
37]:
where
in which
for
, and
for
.
Al-Omari alternative proposal [
20]:
where
in which
for
, and
for
.
Ebrahimi et al. [
22]:
where
van Es [
18] showed that, under general conditions, (
10) converges almost surely to
when
,
, and
. The author also proved the estimator’s asymptotic normality when
and
. Correa [
19], through a simulation study, showed that his estimator has a smaller mean squared error than Vasicek’s proposal (
9).
Al-Omari’s estimators, cf. (
13) and (
14), converge in probability to
when
. Ebrahimi et al. [
22] presented an estimator adjusting Vasicek’s [
17] weight. Under the same conditions as Al-Omari [
37], the authors proved that
when
. The same applies to the Noughabi–Arghami estimator.
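To make the weight-adjustment idea concrete, the sketch below implements the Ebrahimi et al. [22] variant. Since expression (15) is not reproduced above, the weights used here are taken from the original reference and should be read as an assumption.

```python
import numpy as np

def ebrahimi_entropy(z, m):
    """Weighted m-spacing estimator with the boundary weights of Ebrahimi et al. [22]."""
    z = np.sort(np.asarray(z, dtype=float))
    n = z.size
    i = np.arange(1, n + 1)                       # 1-based index of the order statistic
    c = np.where(i <= m, 1.0 + (i - 1.0) / m,
                 np.where(i <= n - m, 2.0, 1.0 + (n - i) / m))
    upper = z[np.minimum(i + m, n) - 1]           # Z_(i+m), clamped to Z_(n)
    lower = z[np.maximum(i - m, 1) - 1]           # Z_(i-m), clamped to Z_(1)
    return np.mean(np.log(n * (upper - lower) / (c * m)))
```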
2.4. Estimator Tuning
The choice of the spacing parameter
m in this type of estimator is an important problem that remains open. Wieczorkowski and Grzegorzewski [
38] proposed the following heuristic formula:
Our goal is to find a value of
m that performs well in a range of parameters
and sample sizes
n when estimating the entropy under the
model. In order to achieve this goal, we assess the performance of (
16) with a Monte Carlo study for each one of the entropy estimators presented in
Section 2.3.2 under the
model. We considered a parameter space comprising:
Sample sizes $n \in \{9, 25, 49, 81, 121\}$, which correspond to square windows of sides 3, 5, 7, 9, and 11;
Texture values
to depict areas with different levels of roughness and
(the
case was studied by Cassetti et al. [
39]).
Since is a scale parameter, we based the forthcoming analysis on the condition , which links texture and brightness by . With the aim to simplify the notation, we consider with where , , and . Thus, .
For each fixed
n and
j we drew 1000 independent samples
from
. We used
and calculated all estimators
from
Section 2.3.2. Therefore, we obtained a vector of estimates
from which we computed the sample mean
, the sample bias
, where
is the true entropy from (
6), and the sample mean squared error
. Then, we analyzed the performance of these estimators in terms of bias and MSE.
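The Monte Carlo loop can be sketched as follows. Sampling follows the multiplicative model directly (Gamma speckle with unit mean and shape L, reciprocal gamma backscatter); the function names, the seed, and the way the true entropy is passed in are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(12345)

def sample_gi0(n, alpha, gamma, L):
    """Draw n intensity returns Z = X * Y from the multiplicative model:
    Y ~ Gamma(shape=L, mean 1) speckle, X ~ reciprocal gamma(alpha, gamma) backscatter."""
    y = rng.gamma(shape=L, scale=1.0 / L, size=n)
    x = 1.0 / rng.gamma(shape=-alpha, scale=1.0 / gamma, size=n)
    return x * y

def bias_and_mse(estimator, true_entropy, n, alpha, gamma, L, m, replications=1000):
    """Sample bias and mean squared error of an m-spacing entropy estimator."""
    estimates = np.array([estimator(sample_gi0(n, alpha, gamma, L), m)
                          for _ in range(replications)])
    return estimates.mean() - true_entropy, np.mean((estimates - true_entropy) ** 2)
```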
In order to improve the spacing (
16), we implemented another strategy to choose, for each sample size
n, the best value
m to be used for all textures
. In the following, we considered
as was indicated in (
9). We repeated the same methodology as before for each
m and for each
j, obtaining
. This vector is represented in the
jth column of
Table 1. We then calculated, in each row of the table, the average of the absolute value of bias (shown in the last column of
Table 1). The best
m value is
.
Table 1 shows the scheme of the methodology employed, for fixed
n and an entropy estimator. Each table entry,
, represents the bias for
and
.
Section 3.1 presents the results of this approach. The spacing values we obtained are different from the heuristic formula (
16), and they lead to better estimates in terms of bias and mean squared error.
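Reusing the bias_and_mse sketch above, the selection criterion can be written as follows; the grid of candidate spacings and the argument names are illustrative.

```python
import numpy as np

def best_spacing(estimator, n, alphas, gammas, L, true_entropies, m_grid,
                 replications=1000):
    """Return the spacing m that minimizes the row-average of |bias| (cf. Table 1)."""
    mean_abs_bias = []
    for m in m_grid:
        biases = [bias_and_mse(estimator, h, n, a, g, L, m, replications)[0]
                  for a, g, h in zip(alphas, gammas, true_entropies)]
        mean_abs_bias.append(np.mean(np.abs(biases)))
    return m_grid[int(np.argmin(mean_abs_bias))]
```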
2.5. Classification
To study the performance of the selected entropy estimators in terms of SAR image classification, we divided the analysis into simulated and actual images. We used unsupervised and supervised techniques to choose the three estimators that led to the best values of classification quality. For the former, we applied a
k-means algorithm, which groups data into
k classes by setting
k centroids and minimizing the variance within each group. This non-hierarchical clustering technique has been applied in many studies in SAR image processing, cf. the works by Niharika et al. [
40] and by Liu et al. [
41].
For the latter approach we implemented a support vector machine (SVM) algorithm, which is a supervised machine learning technique [
42] whose objective is to define, given a set of features, the best possible separation between classes by finding a hyperplane that maximizes the margin of separation between these classes. It is common to accept some misclassification to obtain a better overall performance; this is achieved through the penalizing parameter
c.
When data cannot be separated by a hyperplane, they are transformed to a higher-dimensional feature space through a suitable non-linear transformation associated with a “kernel function”. Given two feature vectors u and v, the linear and radial kernels are defined by $K(u, v) = u^{\top} v$ and $K(u, v) = \exp(-\|u - v\|^{2}/(2\sigma^{2}))$, respectively, for $\sigma > 0$.
We randomly selected 1000 pixels in each of the four regions, far enough from the boundaries, to find the best kernel and hyper-parameters. This reference sample was divided into two sets: training and validation (80% of the sample), and testing (20%). We considered linear and radial kernels, with the penalizing parameters and . With the training–validation set, we performed a five-fold cross-validation and computed the mean and the standard deviation of the F1-scores. Recall that $F1 = 2\,\mathrm{PPV}\cdot\mathrm{TPR}/(\mathrm{PPV} + \mathrm{TPR})$, where TPR is the True Positive Rate and PPV is the Positive Predictive Value.
This approach has been applied in different areas, such as sea oil spill monitoring [
43], pattern recognition [
44], and classification of polarimetric SAR data [
2], among other applications.
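A possible implementation of this model-selection step with scikit-learn is sketched below. The feature matrix, the labels, and the candidate hyper-parameter values are placeholders, since the grids actually explored are not reproduced here.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# `features` holds one row per selected pixel (e.g., the estimated entropy),
# `labels` the corresponding region; both are placeholders.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},                      # illustrative costs
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```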
We used different measures of quality depending on the type of classification. In the unsupervised case, we used the Calinski–Harabasz (CH) [
45] and Davies–Bouldin [
46] (DB) indexes, while we present the Kappa coefficient for the supervised classification. We also show the accuracy of both algorithms. All of these measures should be interpreted as “bigger is better”, except for the DB index, for which “lower is better”.
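All of the quality measures mentioned above are available in scikit-learn. The snippet below assumes that entropy_features (one row per pixel), reference_labels, and predicted_labels have already been computed; it only fixes the conventions.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (accuracy_score, calinski_harabasz_score,
                             cohen_kappa_score, davies_bouldin_score)

# Unsupervised: cluster the entropy feature(s) and score the resulting partition.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(entropy_features)
ch = calinski_harabasz_score(entropy_features, km.labels_)   # bigger is better
db = davies_bouldin_score(entropy_features, km.labels_)      # lower is better

# Supervised: compare the predicted map against the reference labels.
kappa = cohen_kappa_score(reference_labels, predicted_labels)
accuracy = accuracy_score(reference_labels, predicted_labels)
```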
3. Results and Discussion
3.1. Choice of the Spacing Parameter m for Non-Parametric Estimators
Figure 2 presents the bias and the MSE for the Wieczorkowski and Grzegorzewski [
38] criterion,
case, and for all of the estimators analyzed, except for the Al-Omari (
14) and Ebrahimi (
15) estimators. These two estimators presented large bias and, thus, were discarded for further analysis.
It can be seen that no single estimator performs best for all values, but and present low bias and low MSE in all of the cases studied except for . The other estimators behave poorly in terms of bias because of their slower convergence to zero in all of the cases studied.
Table 2 shows the best
m chosen according to the methodology used for
and
, for samples coming from
distribution.
Notice that, with few exceptions, the optimal spacing m is smaller than the value given by the empirical formula .
3.2. Performance of the Nonparametric Estimators for the Selected m Value
In order to study the behavior of our proposal for the selection of the
m value we performed a Monte Carlo simulation as described in
Section 2.4.
Figure 3 shows the results obtained for the estimators studied for the
m value chosen in terms of bias and MSE, for
case. We also plotted the
estimator. It can be observed that there is an improvement in entropy estimation in terms of bias and MSE with our methodology, compared to the (
16) heuristic formula for all of the estimators studied. All of them show a faster convergence of the bias to zero and are competitive with the performance of the
estimator in terms of bias and MSE, for sample sizes larger than 81.
As mentioned, the optimized spacing leads, in most cases, to the use of more samples than the (
16) criterion. This suggests that the latter is an optimistic view of the information content of each sample, at least when dealing with
deviates. In other words, these observations are less informative for the estimation of the entropy. Because of this, a smaller spacing, i.e., larger samples, is required to achieve good estimation quality.
In the following, we present empirical results from classifying a simulated SAR image.
3.3. Simulated Image
We generated two
images with observations coming from
distributions with
,
, and four different classes:
.
Figure 4a shows the image obtained with
, where the brightest area corresponds to
, i.e., extremely textured observations. As the brightness decreases, the texture changes from heterogeneous (
and
) to a homogeneous zone corresponding to the darkest area (
). As the performance measures were similar in both
cases, i.e., single and multi-look, we only show results from the latter.
We computed a map of estimated entropies (
) with each estimator by sweeping the image with sliding windows of sizes
, for
. These are the sample sizes studied in
Section 2.4. Then, we used
as a feature to classify by both the unsupervised and supervised techniques.
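A direct way to build such a feature map is to re-estimate the entropy inside every sliding window, as in the following sketch; border pixels without a complete window are left undefined, and any of the estimators above can be passed as an argument.

```python
import numpy as np

def entropy_map(image, k, estimator, m):
    """Estimate the entropy in every k x k window of a 2-D intensity image."""
    rows, cols = image.shape
    h_map = np.full((rows, cols), np.nan)     # borders without a full window stay NaN
    r = k // 2
    for i in range(r, rows - r):
        for j in range(r, cols - r):
            window = image[i - r:i + r + 1, j - r:j + r + 1].ravel()
            h_map[i, j] = estimator(window, m)
    return h_map
```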
Figure 4b shows the result of classifying by the
k-means algorithm the
map of values obtained with
.
Figure 4c shows the accuracy as a function of the sample size. It can be observed that a
window presents the best accuracy. It can also be seen that the
estimator has the worst performance, whereas
,
and
show the best performance. These results are corroborated by the values shown in
Table 3, in which the best performances are shown in bold font.
Table 4 presents the CH and DB values for the best sample size (
,
). According to CH,
,
, and
have the best performance, whereas DB selected
,
, and
as the best.
We also provide, for the sake of comparison, quality measures obtained by .
Table 5 shows the selected kernels and hyper-parameters that maximize the F1 mean and minimize the F1 variance. The best models were trained using the whole reference sample and applied to classify the complete image. The accuracy and
coefficient were computed, and the results are shown in
Figure 5. The best accuracy values are shown in
Table 6, as well as the models that achieved them. It can be seen that the optimal value for
was obtained for a sliding window of size
. Sizes
and
presented similar (best) values. In this sense, with the purpose of providing a unified criterion, we chose the size of the sliding window as
to perform the analysis.
Table 7 and
Table 8 show the confusion matrices when the models are applied to the simulated images,
. It can be observed that if
,
,
, and
overcame
for extremely high, high, and middle textured areas, respectively. For
,
performed better than the other models except for regions with a very high level of texture in which
and
produced better results.
3.4. Actual Images
We assessed our proposal with two SAR images. First, we considered an image of the surroundings of Munich in Germany of the size , which was acquired in L-band, HV polarization, and complex single look format. Second, we used a subsample of pixels of a full PolSAR image of California’s San Francisco bay area, taken by the NASA/JPL AIRSAR L-band instrument in intensity format.
We applied the SVM algorithm to both actual images, replicating the procedure described in the study of simulated data, using the entropy estimator as a feature for classification of the three polarizations.
The Equivalent Number of Looks (ENL) using uncorrelated data is defined as
, the reciprocal of the sample coefficient of variation
, where
is the sample standard deviation and
is the sample mean [
47]. In order to find the ENL in each polarization band of the image of San Francisco, we manually selected samples from homogeneous areas in each band and calculated ENL as an average weighted by the sample size per band. Finally, the ENL is the average of the estimations in each polarization. We obtained
,
, and
as the ENL values in the HH, HV, and VV bands, respectively. Thus, we considered the ENL as equal to
for the whole image. We then used the same spacings,
m, for
and
.
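A sketch of the ENL computation, assuming the usual intensity-data definition (the squared ratio of the sample mean to the sample standard deviation over a homogeneous area) and the sample-size weighting described above; the function names are illustrative.

```python
import numpy as np

def enl(homogeneous_samples):
    """ENL of one homogeneous area: (sample mean / sample standard deviation) ** 2."""
    s = np.asarray(homogeneous_samples, dtype=float)
    return (s.mean() / s.std(ddof=1)) ** 2

def weighted_enl(areas):
    """Average the per-area ENL estimates, weighted by the number of pixels."""
    sizes = np.array([len(a) for a in areas], dtype=float)
    values = np.array([enl(a) for a in areas])
    return float(np.dot(values, sizes) / sizes.sum())
```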
Figure 6 and
Figure 7 show the training samples selected to perform the supervised classification in both images. In the first case, we worked with three types of regions: urban (red), forest (dark green), and pasture (light green). In the second case, we selected five areas: water (blue), urban zone (red), vegetation (green), pasture (yellow), and beach (orange).
We studied linear and radial kernels; the latter produced better results, except for and when applied to the image of Munich. The combinations of hyper-parameters are the following:
for in Munich;
and for in Munich;
and for in Munich;
for in Munich;
and for in Munich;
and for in Munich;
and for in San Francisco;
and for in San Francisco;
and for in San Francisco;
and for in San Francisco;
and for in San Francisco;
and for in San Francisco.
We subsequently included the CV as a feature in the classification process. In this case, the best performance was achieved for the linear kernel with a cost of 10 applied to the image of San Francisco, except for and showing a best performance if a radial kernel is used with and , respectively, and with a radial kernel using and , respectively. On the other hand, the radial kernel produced the best results for the image of Munich using the following hyper-parameters:
and for ;
and for ;
and for ;
and for ;
and for ;
and for .
Table 9 and
Table 10 present the test accuracy and the Kappa index. We also show the validation accuracy, computed using five-fold cross-validation; these values are similar to the test accuracy, showing no evidence of overfitting. In addition, we show that including the CV as a feature in the classification problem improved the results.
If we only consider the entropy, showed the best performance in both single and multilook cases. However, if we add CV as a characteristic, then appears as the best classifier followed by and for the single-look case, and followed by and for the multilook case.
Figure 8 and
Figure 9 exhibit the classification of the whole images when our proposal is applied. It can be observed that in the case of the image of San Francisco the classifiers distinguished the beach and, with the addition of the CV, some roads surrounded by trees were better classified.
The processing time is an important consideration when proposing a new estimator.
Table 11 shows the processing time, measured in minutes, needed to compute a map of estimated entropies by moving through the image with sliding windows of size
for each one of the estimators applied to the Munich and San Francisco images. It can be seen that
had the shortest processing time, followed by
and
.
We conclude this section by comparing the results of classification based on entropy estimates with those obtained with a classical approach.
Table 12 compares the results obtained using our best models against the technique that applies the improved Lee filter [
48] and then classifies using SVM.
Figure 10 shows the classification of the whole images applying the alternative method. It can be observed that our proposal offers advantages that prior methods cannot provide.
4. Conclusions
We assessed the performance of six non-parametric entropy estimators, together with the ML estimator, in terms of bias, MSE, and image classification performance for the single and multilook cases.
On the one hand, these non-parametric estimators are very simple to implement, since they do not assume any model and do not need optimization algorithms. On the other hand, they depend on a spacing parameter
m. Although the literature recommends a heuristic value, we proposed a criterion for choosing the value of
m that yields the smallest bias in the entropy estimation for all of the texture values and sample sizes analyzed. This criterion performs better than the one proposed by Wieczorkowski and Grzegorzewski [
38].
With these values for m, we applied unsupervised (k-means) and supervised (SVM) classification algorithms to both simulated and actual data, and compared their performance with the entropy estimator. We showed evidence that presents the best performance in terms of accuracy and kappa index for both single and multilook cases, when it is applied to actual images. However, when we added the coefficient of variation as a feature used by the classifier, both measures improved and the best estimators changed. and performed the best for the single and multilook cases, respectively, showing an improvement of 1% for the former and of 3% for the latter. However, these two estimators require longer processing times than the others.
We completed the analysis by comparing our proposal with another technique that combines the improved Lee filter with an SVM classifier, showing that the entropy-based approach presents better accuracy indexes.
Hence, we strongly recommend considering these non-parametric estimators because of the simplicity of their implementation and their good performance.