Rare Event Classification with Weighted Logistic Regression for Identifying Repeating Fast Radio Bursts
Antonio Herrera-Martin,1,2 Radu V. Craiu,2 Gwendolyn M. Eadie,1,2 David C. Stenning,3 Derek Bingham,3 Bryan M. Gaensler,4,5,1 Ziggy Pleunis,5,6,7 Paul Scholz,8,5 Ryan Mckinven,9,10 Bikash Kharel,11 and Kiyoshi W. Masui12,13
1 David A. Dunlap Department of Astronomy & Astrophysics, University of Toronto, 50 St. George Street, Toronto, ON M5S 3H4, Canada
2 Department of Statistical Science, University of Toronto, Ontario Power Building, 700 University Avenue, 9th Floor, Toronto, ON, Canada
13 Department of Physics, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
ABSTRACT
An important task in the study of fast radio bursts (FRBs) remains the automatic classification of
repeating and non-repeating sources based on their morphological properties. We propose a statistical
model that considers a modified logistic regression to classify FRB sources. The classical logistic regression model is modified to accommodate the small proportion of repeaters in the data, a feature that is
likely due to the sampling procedure and duration and is not a characteristic of the population of FRB
sources. The weighted logistic regression hinges on the choice of a tuning parameter that represents
the true proportion τ of repeating FRB sources in the entire population. The proposed method has
a sound statistical foundation, direct interpretability, and operates with only 5 parameters, enabling
quicker retraining with added data. Using the CHIME/FRB Collaboration sample of repeating and
non-repeating FRBs and numerical experiments, we achieve a classification accuracy for repeaters of
nearly 75% or higher when τ is set in the range of 50 to 60%. This implies a tentative high proportion
of repeaters, which is surprising, but is also in agreement with recent estimates of τ that are obtained
using other methods.
1. INTRODUCTION
Fast radio bursts (FRBs) represent an enigmatic phenomenon in astrophysics. FRBs are dispersed, isolated,
millisecond-long radio pulses that are similar in appearance to single pulses from Galactic pulsars. The arrival times of these radio pulses show a frequency-dependent delay (quantified by the dispersion measure, or DM) due to the electromagnetic wave's path through free electrons in the Universe. FRBs have the defining characteristic of a DM that
exceeds the maximum DM expected from our Galaxy, suggesting that they are very luminous and of extragalactic
origin (Lorimer et al. 2007; Yao et al. 2017).
The first FRB was identified in archival Parkes multibeam pulsar survey data by Lorimer et al. (2007), and was
suggestive of extragalactic origin. Since then, there has been rapid progress in the observation of these enigmatic
events and their use as probes of the intergalactic medium (Chatterjee 2021; CHIME/FRB Collaboration et al. 2021;
Bailes 2022). The discovery of FRBs has opened up new research avenues in astrophysics as they have the potential
to help us better understand the distribution of matter in the universe or the nature of dark matter (Lin & Sang 2021;
Zhao et al. 2023).
The biggest mystery about FRBs is their origin or progenitor object(s). The FRB enigma is made more mysterious
by the fact that some FRBs are observed to burst repeatedly (repeaters), while others have only been observed to burst
once (non-repeaters) (Spitler et al. 2016). Models that explain the progenitors of FRBs are thus typically assigned
to one of two broad categories. The first category considers non-cataclysmic explanations, while the second category
assumes FRBs are the result of a catastrophic event that destroys the astrophysical source. In the early years of FRB
discoveries, the lack of repeating FRBs supported catastrophic models. The assumption of two distinct subpopulations
of FRBs (repeaters and non-repeaters) is now supported by arguments made by Pleunis et al. (2021), which are based
on the morphological differences between the repeating and non-repeating FRBs. While the number of FRB repeater
sources continues to grow, the total published sample is currently only 53.
The number of known FRB repeaters is small compared to the total FRB population (roughly 2.6% are repeaters,
CHIME/FRB Collaboration et al. 2023a), but observational selection effects strongly influence this number. The
CHIME telescope, the world’s leading detector of FRBs, relies on the rotation of the Earth to observe the whole
northern sky over the course of each day. Due to its location in Penticton, BC at a latitude of 49.32 degrees (Amiri
et al. 2018), CHIME observes some parts of the sky continuously, and observes other parts as little as 5 minutes per day
(Amiri et al. 2018). This means that some areas of the sky are monitored as little as 0.3% of the time each day. Thus,
many more FRB repeaters could exist, but have not been detected due to censoring. Recent work by James (2023)
and Yamasaki et al. (2023), which take into account the observational selection effects of CHIME, have suggested that
the true fraction of repeaters is closer to 50%. Surprisingly, we have reached a similar conclusion through an entirely
independent approach that we present here.
In this paper, we propose a principled and interpretable statistical model to predict whether new FRB bursts are
repeaters or non-repeaters. Our method uses the morphological characteristics of FRBs (e.g., bandwidth and peak
frequency) as inputs, and is based on weighted logistic regression for imbalanced data sets, the details of which
are described in Section 3. In short, our classification algorithm can be used to identify potential FRB repeater
candidates. Specifically, given a new FRB source for which only one burst has been recorded, the algorithm will
provide the probability that this FRB source is actually a repeater. Having this kind of predictive tool could be useful
for follow-up observations. For example, if an FRB source is given a high probability of being a repeater, but is in
an area of the sky observed only 5 minutes per day by CHIME, then one could allocate time at other telescopes to
observe it more regularly, or one could search through archival data from other telescopes to find previous bursts.
While developing our prediction method and algorithm, using techniques entirely independent from James (2023) and
Yamasaki et al. (2023), we arrive at a similar, albeit very tentative, conclusion about the true percentage of repeating
FRBs.
Creating classification and prediction algorithms for FRBs is notably difficult because of the unique structure of the
data. There are at least two major challenges to overcome:
• Challenge 1: The observed number of repeaters is significantly outweighed by the observed number of non-
repeaters. This imbalance in the data leads to biased inference when it is not taken into account. Some studies
do not account for the imbalance (e.g., Chen et al. 2021), while others have addressed the imbalance through
resampling techniques (Yang et al. 2023). However, the resampling approach does not account for the intrinsic
reduction of variability of features in the repeater subpopulation.
• Challenge 2: For training data, the labels for FRB repeaters are almost entirely certain, but the labels for FRB
non-repeaters are not. Any non-repeater FRB may actually be a repeater that we have not yet seen repeat.
This corresponds to a mislabelling problem in the training data for any classification algorithm — i.e., there may
be repeaters that are wrongly labeled as non-repeaters. Statistical approaches to mitigate this usually rely on
modeling the probability of an error in labeling (Nagelkerke & Fidler 2015; Hung et al. 2018) which, in the FRB
classification case, is impossible.
Previous methods to predict and/or classify repeating FRBs have used black-box machine-learning approaches (e.g.,
Chen et al. 2021; Luo et al. 2022; Yang et al. 2023; Zhu-Ge et al. 2022a, among others). However, these previous
methods inadequately handle some of the challenges posed by the particular characteristics of FRB data. For instance,
Chen et al. (2021), Luo et al. (2022) and Zhu-Ge et al. (2022b) consider sub-bursts and repeater bursts as independent
data points and they do not differentiate between them when creating training and test data. Assuming independence
makes it possible to split bursts from the same source and put them into the training and the test data. However,
this approach will artificially i) increase the similarity between the training and test sets, since FRBs from the same
source will have some dependency and, consequently, ii) enhance the model’s classification performance because it is
easier to identify a repeater source in the test data once one or more of its bursts have been used in the training data.
Consequently, the use of sub-bursts from the same source in test and training data exaggerates the accuracy of the
model’s predictions.
The method that we propose accounts for the imbalance between the number of repeaters and non-repeaters by
weighting differently the information contained in each observation. This approach relies on a tuning parameter that
represents the true proportion of repeater FRBs in the universe. While a precise value is elusive, our analysis suggests
that the model is robust to values of this tuning parameter between 50% and 60%. Our model is also able to identify
which of the non-repeaters to-date are most likely to repeat. This information can be used for strategic and efficient
monitoring of the sky.
Our paper is organized as follows. First, we introduce our data selection procedure, including which FRB features are
used in our analysis (Section 2). Next, we present the proposed method of weighted logistic regression for imbalanced
data sets, describe training, validation, and test data sets, and introduce the tuning parameter τ (Section 3). We then
present our results (Section 4), and conclusions and directions for future work (Section 5).
2. DATA
We construct a statistical prediction model that will identify repeaters based on their morphological features. Thus,
we need a training set and a validation set with known labels (i.e., which FRBs are non-repeaters and which are
repeaters, albeit with the caveats described in the previous section) to determine the accuracy of our methodology.
Once our statistical model is trained and validated, then we can apply it to a separate test set of data.
For each FRB source, we use the following morphological features:
• the boxcar width of the burst,
• the peak frequency,
• the intrinsic width of the FRB associated with the event in seconds, as modeled with the fitburst pipeline (i.e., without dispersion smearing and scatter-broadening; Fonseca et al. 2024),
• the number of sub-bursts, and
• the emitted bandwidth.
The emitted bandwidth is obtained by taking the difference between the highest and lowest frequencies at which the FRB signal reaches 10% of its peak intensity (the full width at tenth maximum). Previous work has only looked at the emitted bandwidth and not the peak frequency, but we consider the latter as a distinct feature. Detailed descriptions of the
complete set of parameters are presented by CHIME/FRB Collaboration et al. (2021). In the case of FRBs with
sub-bursts, we define the corresponding FRB-specific features as the mean value across sub-bursts for each feature.
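To illustrate this per-source aggregation step concretely, the short Python sketch below averages sub-burst measurements for each source; the input table and its column names are hypothetical placeholders rather than the actual CHIME/FRB catalog schema.

```python
import pandas as pd

# Hypothetical per-(sub-)burst table; column names are illustrative only,
# not the actual CHIME/FRB Catalog 1 schema.
bursts = pd.DataFrame({
    "source":          ["FRB20190110C", "FRB20190110C", "FRB20190107B"],
    "boxcar_width":    [0.010, 0.012, 0.003],    # seconds
    "peak_frequency":  [500.0, 520.0, 700.0],    # MHz
    "intrinsic_width": [0.004, 0.005, 0.001],    # seconds, from fitburst
    "n_subbursts":     [2, 2, 1],
    "bandwidth":       [150.0, 140.0, 300.0],    # MHz
})

# One row per FRB source: each feature is the mean across its sub-bursts.
features = bursts.groupby("source").mean()
print(features)
```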
3. METHODS
The proposed model is based on logistic regression (McCullagh & Nelder 1989; Peng et al. 2002), which is reviewed
in Section 3.1. In Section 3.2, we introduce from the statistics literature methods developed to account for imbalanced
data sets in logistic regression, which not only helps us address the low proportion of repeaters in the FRB sample,
but also allows us to introduce a tuning parameter for the fraction of repeaters in the entire population.
3.1. Logistic regression
The logistic regression model is built on
η = Xβ, (1)
where η is called the linear predictor, β is a vector of model parameters, and X is the covariate or design matrix. That is, each row i in X is an FRB observation, with the first column representing the intercept in the regression model and the subsequent columns representing the covariates or features. The number of observations, or rows, in X is n, and the number of columns is m + 1.
Let {(yi, xi) : 1 ≤ i ≤ n} denote the sample of size n containing the observed FRBs, in which
yi = 1 if sample i is from a repeater FRB source, and yi = 0 if sample i is from an apparently non-repeating FRB source, (2)
and xi ∈ R^((m+1)×1) is the column vector of observation i that contains the m features. While the description is valid for
all m < n, for the FRB model we selected five features (m = 5): the boxcar width, peak frequency, intrinsic width
from fitburst, number of sub-bursts, and emitted bandwidth, as described in Section 2.2.
The logistic regression model assumes that each yi follows a Bernoulli distribution,
yi ∼ Bernoulli(πi(β)), (3)
where
πi(β) = e^(ηi) / (1 + e^(ηi)), (4)
and
ηi = xi^T β = β0 + x1i β1 + x2i β2 + x3i β3 + x4i β4 + x5i β5, (5)
where xji is the j-th feature or covariate of observation i.
The inference for β is based on the likelihood function
L(β|Data) = ∏_{i=1}^{n} πi(β)^(yi) (1 − πi(β))^(1−yi), (6)
where Data = {(yi , xi ) : 1 ≤ i ≤ n}. Estimates of β can be obtained through the maximum likelihood estimator,
β̂ = arg maxβ L(β|training data), and cross-validation can also be used to further test and validate the classifier model.
With β̂ in hand, the logistic regression model (or classifier) can then be used to classify any new FRB with observed features x∗ = (x∗1, . . . , x∗5) as a repeater or non-repeater. That is, one uses β̂ and x∗ to obtain ηi∗ and πi∗(β̂) using equations (5) and (4); πi∗(β̂) is the probability P(i-th FRB is a repeater). Although the model returns a probability
value, in practice a threshold is used to enable metrics for the classification performance. The most common threshold is 0.5, and we use the criterion πi∗(β̂) > 0.5 to predict that the i-th FRB is a repeater.
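As a minimal sketch of this classification step (not the authors' actual pipeline), the following Python example fits an unweighted logistic regression with scikit-learn and applies the 0.5 threshold to a new observation; the feature matrix and labels are randomly generated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: X has one row per FRB source and m = 5 feature columns
# (boxcar width, peak frequency, intrinsic width, number of sub-bursts,
# emitted bandwidth); y is 1 for repeaters and 0 for apparent non-repeaters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) < 0.1).astype(int)

# penalty=None gives the plain maximum likelihood fit (scikit-learn >= 1.2);
# the intercept beta_0 is added automatically.
clf = LogisticRegression(penalty=None, max_iter=1000)
clf.fit(X, y)

# Classify a newly observed FRB with features x* via pi*(beta_hat) > 0.5.
x_new = rng.normal(size=(1, 5))
pi_new = clf.predict_proba(x_new)[0, 1]
label = "repeater" if pi_new > 0.5 else "non-repeater"
print(f"P(repeater) = {pi_new:.3f} -> {label}")
```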
One drawback of logistic classifiers is that they do not perform well with imbalanced data sets, i.e., situations in
which the number of cases (Yi = 1) is vastly different from the number of controls (Yi = 0) (He & Garcia 2009; Luque
et al. 2019; Kim & Hwang 2022). In the next section, we describe potential solutions to alleviate this challenge.
3.2. Weighted logistic regression for imbalanced data sets
To account for the imbalance between repeaters and non-repeaters, each observation i is assigned a weight
wi = δr yi + δnr (1 − yi), (7)
where
δnr = (1 − τ)/(1 − ȳ), δr = τ/ȳ, (8)
where r and nr stand for repeater and non-repeater respectively, and ȳ is the number of repeaters in the data divided by the total number of observed sources. When yi = 1 (a repeating FRB), wi = δr = τ/ȳ. In this way, δr could be interpreted as the correction factor for the proportion of repeaters in the sample. Note that each weight wi depends
on the ratio τ . The latter is not known exactly, and so we first treat τ as a tuning parameter, trying various values of
τ between 0 and 1. Ultimately, though, we rely on a combination of the evidence provided by Yamasaki et al. (2023), who state that τ ∈ (0.5, 1], and the results of our initial exploration of τ to settle on the value τ = 0.55. This will be
discussed further in Section 4.1.
The resulting (weighted) logistic regression log-likelihood is then
log L(β|Data) = − Σ_{i=1}^{n} wi log(1 + e^((1−2yi)ηi)). (9)
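To make the weighting concrete, here is a minimal sketch, assuming scikit-learn's sample_weight mechanism (which multiplies each observation's contribution to the log-likelihood, as in equation 9); the data and the value of τ below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rare_event_weights(y, tau):
    """w_i = delta_r * y_i + delta_nr * (1 - y_i), with
    delta_r = tau / ybar and delta_nr = (1 - tau) / (1 - ybar)."""
    ybar = y.mean()                       # observed fraction of repeaters
    delta_r = tau / ybar
    delta_nr = (1.0 - tau) / (1.0 - ybar)
    return np.where(y == 1, delta_r, delta_nr)

# Placeholder training data: 500 sources, 5 features, ~5% observed repeaters.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (rng.random(500) < 0.05).astype(int)

tau = 0.55                                # assumed population repeater fraction
w = rare_event_weights(y, tau)

# sample_weight scales each term of the log-likelihood, mirroring eq. (9).
clf = LogisticRegression(penalty=None, max_iter=1000)
clf.fit(X, y, sample_weight=w)
print(clf.intercept_, clf.coef_)
```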
Figure 1. Venn diagram describing our training/validation set and “test” set. The training/validation set is comprised of the
white, yellow, and grey regions, minus the region delimited by the purple outline. The latter is the “test” set of 14 FRBs and
the performance of our model on these is shown in Fig. 4.
The statistical model that we consider for classification depends on the parameter τ , which is not directly estimable
from the data, or from theory.
In this paper, we treat τ as an unknown independent quantity and create a sequence of its possible values. For
each value of τ, we train the proposed logistic regression classifier on 500 random splits of the data into training and cross-validation sets, as described above. Together, the individual classifiers trained on the random splits form an
ensemble of classifiers. For every value of τ , a different maximum likelihood estimator is obtained and a different
classification for a newly observed FRB source is generated.
As mentioned previously, the logistic regression model gives a probability πi of belonging to one of the classes (either repeater or non-repeater), but this probability cannot be compared directly with the binary response yi when measuring the performance of the model. It is therefore necessary to apply a decision threshold — we use a standard value of 50%. The choice of
this threshold is independent of τ or the imbalance of the data, and it only serves as a decision boundary to obtain
a binary response. The standard value assumes that there is no preference for repeaters or non-repeaters, and allows
for better analysis of the performance of the logistic regression model.
When summarizing the performance of the model for each τ , we use the traditional metrics of accuracy, precision,
recall, and F1 -Score. These metrics are derived using empirical frequencies computed from the proposed model’s
performance on multiple splits of the training and cross-validation data. These metrics are defined as follows:
• Accuracy: the proportion of correct predictions among all predictions, i.e., the proportion of correctly classified
FRBs as repeaters and non-repeaters, among all classifications:
Accuracy = (TP + TN) / (TP + TN + FP + FN), (10)
where TP, TN, FP, and FN are, respectively, the number of true positives, true negatives, false positives, and false negatives.
• Precision: proportion of positive predictions that are actually correct, i.e., the proportion of FRBs classified as
repeaters that are real repeaters:
Precision = TP / (TP + FP) (11)
• Recall: proportion of true positives that have been correctly predicted, i.e., the proportion of real repeaters that
are correctly identified as repeaters:
Recall = TP / (TP + FN) (12)
Figure 2. Example of a logistic classification model using the bandwidth of each FRB as a feature and the complete catalog
as training data. The purple and orange dots represent the catalog 1 samples, and the y-axis is the probability of belonging
to one of the binary classes, where 1.0 is 100% or the repeater class, and 0.0 is 0% or the non-repeater class. As the data have a binary response, the samples are assigned to 0% or 100% depending on the assigned label of repeater or non-repeater.
Left: The classical logistic model ignores the imbalance in the data, and produces the fit shown as a solid blue line. Based
on this model fit, and assuming a threshold probability of 50% to classify any future data as repeaters, only a few samples
would be correctly classified. Right: The weighted logistic regression uses the rare events approach and an assumed value of τ
(Section 3.2), resulting in the fit shown as solid lines (τ = 0.1 in orange, τ = 0.5 in green, and τ = 0.9 in pink).
• F1 -Score: a measure that combines Precision and Recall. The F1 -Score is the harmonic mean of the two, and
can be interpreted as an optimal number of true positives without introducing too many false negatives or false
positives. It can also be written as:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall). (13)
The F1 -Score is particularly useful in cases where the data set is imbalanced.
After training an ensemble of classifiers for every τ, we analyse the above performance metrics (Sec. 4.2) and
combine this information with literature estimates to settle on a single value of τ .
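The tuning loop described above can be sketched schematically as follows (again assuming scikit-learn; the split fraction and other settings are placeholders rather than the paper's exact configuration). For each τ on a grid, the weighted classifier is refit on repeated random training/validation splits and the median of the metrics in equations (10)-(13) is recorded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def evaluate_tau(X, y, tau, n_splits=500, threshold=0.5, seed=0):
    """Median [accuracy, precision, recall, F1] over repeated random splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.3, stratify=y,
            random_state=int(rng.integers(1_000_000)))
        # Rare-event weights for this split (equations 7 and 8).
        w = np.where(y_tr == 1, tau / y_tr.mean(),
                     (1 - tau) / (1 - y_tr.mean()))
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr, sample_weight=w)
        pred = (clf.predict_proba(X_va)[:, 1] > threshold).astype(int)
        scores.append([accuracy_score(y_va, pred),
                       precision_score(y_va, pred, zero_division=0),
                       recall_score(y_va, pred, zero_division=0),
                       f1_score(y_va, pred, zero_division=0)])
    return np.median(scores, axis=0)

# Example grid over tau (X and y as in the previous sketch):
# for tau in np.arange(0.1, 1.0, 0.1):
#     print(tau, evaluate_tau(X, y, tau, n_splits=50))
```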
4. RESULTS
4.1. Example demonstrating weighted logistic regression for imbalanced data sets
In Figure 2, we compare classifications produced by traditional logistic regression that does not account for imbalance
in the data (left) to that produced by weighted logistic regression (right). In this example, and only for ease of
illustration, we use the bandwidth as a predictor and Catalog 1 as the training data. In the right-hand side of
Figure 2, we also show how the logistic curve changes as τ takes the values 0.1, 0.5 and 0.9. For τ = 0.1 the fitted
logistic curve (orange line) is very similar to the fit obtained without any correction. When τ = 0.5 the logistic fit
changes shape, and predicts higher probability Pr(Y = 1) (i.e., predicts a repeater) for FRBs with lower bandwidths.
When τ = 0.9 we can see that most of the FRBs will be predicted to be repeaters, which is not surprising since the
information provided by τ is that 90% of all FRB sources in the Universe are repeaters. Clearly, the choice of τ has a
strong effect on the model. When judging the accuracy of the model, one must consider the rate of mislabeling a non-repeater, which relates to 1−Precision, and the rate of mislabeling a repeater, which relates to 1−Recall. In general, we expect both types of errors to be non-zero and impossible to optimize simultaneously (see Craiu & Sun 2008, and references
therein). One must therefore consider the effect of choosing a value of τ on both types of errors simultaneously.
Figure 3. Performance metrics for the ensemble of classifiers, as a function of τ . Values on the vertical axis represent the
proportion for each metric (i.e., equations 10-13). Accuracy (the proportion of correctly classified FRBs as repeaters and non-
repeaters, among all classifications) is shown in the upper left, and is maximized for 0.4 ≤ τ ≤ 0.6. Precision (the proportion of
FRBs correctly classified as repeaters out of all repeater predictions) is shown in the upper right, and decreases with increasing
τ. Recall (the proportion of correctly identified repeaters out of all real repeaters) is shown in the lower left, and increases with increasing τ. F1-Score (the harmonic mean of Precision and Recall) is shown in the lower right, and appears to be maximized
and stable in the approximate range 0.4 ≤ τ ≤ 0.6, similar to the accuracy. Precision and Recall appear to be inversely related.
In Figure 3, we show, as a function of τ, the performance metrics listed in Section 3 and their estimated confidence intervals for our ensemble of classifiers. The bold purple line represents the median, the blue-shaded regions represent
the 68% confidence intervals, and the orange lines are the 95% confidence intervals. The most important feature of
Figure 3 is that the range 0.4 ≤ τ ≤ 0.6 gives the most reliable model.
Recall from equation 13 that the F1 -Score provides an average of Precision and Recall and thus can be considered
an overall measure of performance. Similarly, Accuracy (equation 10) measures how often the model correctly predicts
the output, regardless of the true value. As illustrated in the upper left and lower right panels of Fig. 3, both Accuracy
and F1 -Score show an optimal range of approximately 0.4 ≤ τ ≤ 0.6.
For values lower than τ = 0.4, the Accuracy, Recall, and F1 -Score return a lower percentage of correctly classified
samples. Although the Precision (upper right panel, Figure 3) is high for low τ , this is merely a consequence of the
majority of the data being non-repeaters; the Precision (Equation 11) only considers the performance as measured
on positives (i.e., repeaters). Since the majority of the data consists of negatives (i.e., non-repeaters), the method is
more likely to misclassify a negative. We also note that Precision is insensitive to the number of repeaters correctly
classified. For example, we could achieve 100% Precision by having just one repeater correctly classified and zero
misclassified non-repeaters.
Recall (lower left panel of Figure 3) shows the ability of our model to correctly identify repeaters. Ideally we want
to achieve 100% Recall; based on Equation (12), achieving 100% means that every repeater is correctly identified as a
repeater. Our model indicates that a high value of τ is needed for such a result. Such a high value of τ would imply
a belief that almost all FRBs in the universe are repeaters.
For values higher than τ = 0.6, the Accuracy and F1-Score decline. Although it is possible to correctly classify all repeaters with τ = 0.8, using τ = 0.6 still correctly classifies all repeaters in the catalog for some of the test splits. While τ > 0.6 is possible, we are not comfortable with the implication of such a large proportion of repeaters, as we do not have external evidence that would justify the accompanying decrease in Accuracy and Precision.
Based on the overall performance metrics in Figure 3, it appears that our ensemble of classifiers performs best in the
range 0.4 ≤ τ ≤ 0.6. Both James (2023) and Yamasaki et al. (2023) propose that the fraction of repeaters should be
at least 50%, by modeling the dispersion measure (DM) distribution with a low-repetition population or by correcting the source count
evolution, respectively. Combining this lower limit of τ with our range, we are left with a proportion of repeaters
somewhere between 50% and 60%. The analysis proposed in this paper is quite robust to the choice of τ in this range.
For brevity, we present the results from a single value of τ = 0.55 for the remaining analysis. The results corresponding
to τ = 0.5 and 0.6, the lower and upper limits of the range, are very similar.
4.3. Testing our model with the gold and silver “test” data
We use the test data (i.e., the 14 FRB sources from the Gold and Silver samples described in Section 3.3) to assess
the performance of the ensemble of classifiers for τ = 0.55. The results are presented in Figure 4. The y-axis shows
the names provided by the Transient Name Server (TNS) for each FRB in the test set, and the x-axis shows the
classification probability. If the classification probability for an FRB is higher than 0.5, then we classify the FRB as a
repeater. Since each FRB is classified multiple times through the ensemble of classifiers, the results are summarized by
violin plots of the predicted probabilities. The vertical black lines are the medians of all the prediction probabilities. We
choose to present the median instead of the mean because the distribution of predicted probabilities is not symmetric
and the mean may be misleading. The distributions are colored gold and silver, to indicate which FRBs belong to the
Gold and Silver Samples, respectively.
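For readers who wish to produce a similar summary, a minimal matplotlib sketch is shown below; the source names and probability distributions are invented for illustration and are not the paper's results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ensemble output: one array of predicted probabilities per source.
rng = np.random.default_rng(2)
sources = ["FRB A", "FRB B", "FRB C"]          # placeholder names, not TNS names
probs = [rng.beta(8, 3, 500), rng.beta(2, 8, 500), rng.beta(5, 5, 500)]

fig, ax = plt.subplots()
ax.violinplot(probs, positions=range(len(sources)), vert=False,
              showmedians=True)                # show medians, not means
ax.axvline(0.5, color="gray")                  # 50% decision threshold
ax.set_yticks(range(len(sources)))
ax.set_yticklabels(sources)
ax.set_xlabel("Classification probability")
plt.show()
```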
Of the 14 samples in our test set, six (four Silver FRBs and two Gold FRBs) are unambiguously identified as non-
repeaters because the entire distribution is below the threshold. Two FRBs remain ambiguous (FRB20180909A and
FRB20190127B, both Silver FRBs) due to the distribution being broad and uninformative. We found that this result
is consistent across three values for τ ∈ {0.5, 0.55, 0.6}. Notably, five of the six samples identified as repeaters are from
the Gold sample.
The remaining FRBs in Figure 4 are predicted to be repeaters, with only the tail of the distributions crossing
the threshold to the non-repeater region. Note that one could find several ways to make use of the distribution of
classification probabilities. For instance, in the case of FRB20190609C, all the ensemble models correctly predict it to
be a repeater and the median of the classification probabilities value is significantly higher than 0.5, which strongly
recommends it for future monitoring. In the case of FRB20190127B (sixth from the top in Figure 4) the set of ensemble
models that predicts it as a repeater is barely outnumbered by the complement set. An observer who is interested
in finding new repeater signals might consider the ambiguity in the evidence and choose to continue monitoring this
FRB.
At first glance, Figure 4 shows that six out of 14 FRB sources in the test set were correctly labelled as repeaters. A
quick and naive interpretation of this result is that our model does no better than 50% chance at predicting repeaters.
However, six out of 14 repeaters is merely a result of the 50% threshold for classification. Had a different threshold
been chosen, then more or fewer FRBs would be classified as repeaters. Moreover, the classification is not random
between the Gold and Silver samples. Almost all of the Gold sample are identified as repeaters, and most of the Silver
sample are not, which implies that information from the covariates is informing our model. Furthermore, the training
data are imbalanced to non-repeaters, so if the classifiers give an FRB source in the test set a high probability of being
a repeater, then that FRB must be quite different from the non-repeater set. The Precision in Figure 3 gives another measure of this idea — at our selected τ, approximately 20% of the sources classified as repeaters are actually labeled non-repeaters.
In Figures 5, 6, and 7, we strengthen our case that our classifier is doing better than chance by showing the dynamic
spectra, also called “waterfall plots”, for each FRB in the test set. We present two plots for each FRB: the intensity data on the left and, on the right, the intensity data with an overlay of the fitburst model for visibility purposes. FRBs
classified as repeaters are shown in Figure 5, non-repeaters in Figure 6, and ambiguous FRBs in Figure 7. In these
waterfall plots, we see the typical behaviour of FRB burst morphology for repeaters and non-repeaters; those with
broad widths are classified as repeaters and those with a single short burst are classified as non-repeaters.
Figure 4. Violin plots of the classification probability for the ensemble of classifiers at τ = 0.55 for the 14 unidentified repeater
sources in catalog 1. The non-repeater to repeater threshold of 50% is shown as a gray vertical line; everything to the right
of the gray line is classified as a repeater. The gold and silver colors correspond to the gold and silver samples, respectively.
Each of the violin plots indicates where the majority of the distribution lies, and the dark lines inside the boxes are the medians
of the distributions. The median values are enough to identify 6 out of 14 as repeaters across the different values of τ . The
FRB20180909A and FRB20190127B classification distributions are stretched so thinly across the threshold that they become
ambiguous to classify.
Each of the repeating sources of FRBs within the Gold and Silver samples has a contamination rate of chance coincidence (Rcc; CHIME/FRB Collaboration et al. 2023a). The latter is interpreted as a measure of uncertainty. A
higher uncertainty is represented by a higher Rcc value, and is linked with the probability of being observed by chance
in the same region as another source. The Gold sample includes sources with Rcc < 0.5, while the Silver sample includes
sources within the range 0.5 ≤ Rcc < 5. This additional information provides a different perspective on our results.
From the Gold sample, our method is able to correctly identify 5 out of 7 sources as repeaters, yielding an accuracy on
confirmed repeaters higher than 70%, which is more in line with the results generated via cross-validation. The two
FRBs from the Gold sample that were miss-classified, FRB20180910A and FRB20190201A, can be considered atypical
cases or outliers in their morphological features. While FRB20180910A has the largest contamination rate in the Gold sample, with Rcc = 0.43, FRB20190201A is similar to FRB20200120E, which shows unusually large bandwidths and narrow widths (Bhardwaj et al. 2021).
Figure 5. Dynamic spectra or waterfall plots for the six FRB bursts from the test set, from CHIME/FRB Collaboration et al. (2021), which are classified as repeaters by the ensemble classifier (i.e., FRB20190110C, FRB20190113A, FRB20190226B,
FRB20190308B, FRB20190430C, and FRB20190609C from Figure 4). We present two plots for each sample; at the left is the
intensity data and at the right is an overlay of the fitburst model to enhance the visual shape of the FRB. These bursts show
evidence of being a repeater: wider widths, downward drifting, or narrow emitting bandwidths. This is in contrast with bursts
from Figures 6 and 7, which do not show repeater characteristics.
FRBs for follow up. Alternatively, repeater candidates identified through our method may improve the efficiency of
follow-up searches for repeat bursts in archival data.
The method proposed here is based on the widely used statistical model for studying the dependence of binary
response variables on independent variables, known as logistic regression. Our logistic regression model is modified to
account for the marked imbalance in the data (i.e., many fewer repeaters than non-repeaters). The adjustment of the
method relies on a tuning parameter that represents the proportion of repeaters, τ , in the whole population of FRBs.
Given the cross-disciplinary nature of the methodology, we have prepared takeaways according to the main interest
of the audience:
Astronomy Takeaways —
• One of the first takeaways is efficiency. Compared with a deep-learning approach or other machine learning
classification techniques, logistic regression requires a smaller volume of data.
• The performance of the model is promising for finding potential repeaters. For the Gold sample, the model
correctly identifies 70% of the test set FRBs, not previously used for training, as repeaters.
• While logistic regression is well-established in other fields, its application in astronomy, particularly for rare
events, is relatively novel. We have introduced this approach to the astronomical literature and hope it can
be successfully applied to other astronomy datasets that have binary responses. Moreover, we emphasize the introduction of a parameter that could help interpret the true population fraction in any study containing an imbalanced data set. We encourage the application of our method to other datasets where a similar pattern is present.
Figure 6. Dynamic spectra or waterfall plots for the six FRBs from the test set classified as non-repeaters by our method (i.e., FRB20180910A, FRB20190107B, FRB20190201A, FRB20190210C, FRB20190303D, and FRB20190328C from Figure 4). These FRBs’ morphologies appear similar to non-repeaters.
Figure 7. Dynamic spectra or waterfall plots of the two FRBs in the test set (FRB20180909A and FRB20190127B) that are ambiguously classified using our method.
Statistical Takeaways —
• The FRB data present interesting challenges for the statistician. The uncertain labeling of one-off FRBs is
an open problem. One cannot be sure that a one-off source will never repeat in the future. This uncertainty
in labeling is an unusual issue in statistical applications of logistic regression. This implies that the model
we propose should not be interpreted as a classifier. Rather, an astronomer interested in repeating FRBs will
want to use our model to select and prioritize monitoring of one-off sources that are more likely to repeat. In
other applications, where both cases and controls are unambiguously labeled, one can use a model like ours for
classification.
• The imbalance of the sample can be accounted for, with caveats. The number of one-off FRBs vastly outnumbers
the number of repeating FRBs in this application. This issue of imbalanced data is not unusual in statistical
applications of logistic regression, and we have referenced works that tackle this problem. However, what is
unusual in the application of FRBs is that we do not know the true, underlying proportion of repeating FRBs in
the population. We have addressed this issue by allowing the proportion of repeaters in the population of FRBs,
τ , to take on various values, and by assessing performance metrics to settle on a range of probable values. We
found values of τ in the range 0.5 to 0.6 to perform well, and noted that the results do not change significantly
within this range. Thus, we presented the analysis when τ = 0.55. Our numerical experiments suggest that the
model is most accurate when τ is around 60%. However, we hesitate to conclude that this qualifies as strong
statistical evidence in favor of this being the true proportion, since the tuning parameter is not estimable from
the data.
• The dependence between observations produced by a repeating source still needs to be taken into account.
A one-off source generates a single burst while a repeater generates one to several sub-bursts. In this paper,
the information from multiple bursts is summarized by the mean. This strategy does not incorporate all the
information contained in multiple bursts. Moreover, some repeating FRBs may have burst only twice, while
others may have burst tens or even hundreds of times. Thus, some repeaters have more data available, but this
is not considered in the training. In other words, the amount of information provided by each repeater is not
the same, a fact that the current model does not take into account. Moreover, sub-bursts cannot be treated as
independent observations since they have a single origin. We are currently exploring the use of a mixed effects
weighted logistic regression model that will automatically integrate the information from all bursts that have the
same source. The results will be communicated in a future paper.
6. ACKNOWLEDGMENTS
This work was supported by a Collaborative Research Team grant to G.M.E. from the Canadian Statistical Sciences
Institute (CANSSI), which is supported by Natural Sciences and Engineering Research Council of Canada (NSERC).
We thank Amanda Cook for providing specifics about the CHIME/FRB collaboration, helpful comments, and minor
editing of the paper. The Dunlap Institute is funded through an endowment established by the David Dunlap family
and the University of Toronto. B.M.G. acknowledges the support of the NSERC through grant RGPIN-2022-03163,
and of the Canada Research Chairs program. Z.P. was a Dunlap Fellow and is supported by an NWO Veni fellowship
(VI.Veni.222.295). G.M.E also acknowledges the support of NSERC through Discovery Grant RGPIN-2020-04554.
K.W.M. holds the Adam J. Burgasser Chair in Astrophysics.
7. AUTHOR CONTRIBUTIONS
AHM performed all of the analysis and wrote all of the code for this project. AHM also made all figures and wrote
the first and subsequent drafts of the paper. RVC, GME, and DCS co-supervised AHM and helped write and edit
significant portions of the paper.
REFERENCES
Amiri, M., Bandura, K., Berger, P., et al. 2018, The Astrophysical Journal, 863, 48
Amiri, M., Bandura, K., Boskovic, A., et al. 2022, The Astrophysical Journal Supplement Series, 261, 29
Bailes, M. 2022, Science, 378, eabj3043, doi: 10.1126/science.abj3043
Bhardwaj, M., Gaensler, B. M., Kaspi, V. M., et al. 2021, The Astrophysical Journal Letters, 910, L18, doi: 10.3847/2041-8213/abeaa6
Chatterjee, S. 2021, Astronomy & Geophysics, 62, 1.29, doi: 10.1093/astrogeo/atab043
Chen, B. H., Hashimoto, T., Goto, T., et al. 2021, Monthly Notices of the Royal Astronomical Society, 509, 1227, doi: 10.1093/mnras/stab2994
CHIME/FRB Collaboration, Amiri, M., Andersen, B. C., et al. 2021, The Astrophysical Journal Supplement Series, 257, 59, doi: 10.3847/1538-4365/ac33ab
CHIME/FRB Collaboration, Andersen, B. C., Bandura, K., et al. 2023a, The Astrophysical Journal, 947, 83, doi: 10.3847/1538-4357/acc6c1
CHIME/FRB Collaboration, Amiri, M., Andersen, B. C., et al. 2023b, The Astrophysical Journal Supplement Series, 264, 53, doi: 10.3847/1538-4365/acb54c
Connor, L., & van Leeuwen, J. 2018, The Astronomical Journal, 156, 256, doi: 10.3847/1538-3881/aae649
Cordes, J. M., & Lazio, T. J. W. 2002, arXiv e-prints, doi: 10.48550/arXiv.astro-ph/0207156
Cormack, R. M. 1971, Journal of the Royal Statistical Society: Series A (General), 134, 321, doi: 10.2307/2344237
Craiu, R. V., & Sun, L. 2008, Statistica Sinica, 861
Cui, X.-H., Zhang, C.-M., Wang, S.-Q., et al. 2020, Monthly Notices of the Royal Astronomical Society, 500, 3275, doi: 10.1093/mnras/staa3351
Fonseca, E., Pleunis, Z., Breitman, D., et al. 2024, ApJS, 271, 49, doi: 10.3847/1538-4365/ad27d6
Guo, H.-Y., & Wei, H. 2022, Journal of Cosmology and Astroparticle Physics, 2022, 010, doi: 10.1088/1475-7516/2022/07/010
He, H., & Garcia, E. A. 2009, IEEE Transactions on Knowledge and Data Engineering, 21, 1263, doi: 10.1109/TKDE.2008.239
Holland, P. W., & Welsch, R. E. 1977, Communications in Statistics - Theory and Methods, 6, 813, doi: 10.1080/03610927708827533
Hung, H., Jou, Z.-Y., & Huang, S.-Y. 2018, Biometrics, 74, 145
James, C. W. 2023, Modelling repetition in zDM: a single population of repeating fast radio bursts can explain CHIME data. https://arxiv.org/abs/2306.17403
Kim, M., & Hwang, K.-B. 2022, PLoS ONE, doi: 10.1371/journal.pone.0271260
King, G., & Zeng, L. 2001a, International Organization, 55, 693–715, doi: 10.1162/00208180152507597
—. 2001b, Political Analysis, 9, 137–163, doi: 10.1093/oxfordjournals.pan.a004868
Kotsiantis, S. B., Zaharakis, I. D., & Pintelas, P. E. 2006, Artificial Intelligence Review, 26, 159–190, doi: 10.1007/s10462-007-9052-3
Lin, H.-N., & Sang, Y. 2021, Chinese Physics C, 45, 125101, doi: 10.1088/1674-1137/ac2660
Lingam, M., & Loeb, A. 2017, ApJL, 837, L23, doi: 10.3847/2041-8213/aa633e
Lorimer, D. R., Bailes, M., McLaughlin, M. A., Narkevic, D. J., & Crawford, F. 2007, Science, 318, 777, doi: 10.1126/science.1147532
Luo, J.-W., Zhu-Ge, J.-M., & Zhang, B. 2022, Monthly Notices of the Royal Astronomical Society, 518, 1629, doi: 10.1093/mnras/stac3206
Luque, A., Carrasco, A., Martín, A., & de las Heras, A. 2019, Pattern Recognition, 91, 216, doi: 10.1016/j.patcog.2019.02.023
Maalouf, M., & Siddiqi, M. 2014, Knowledge-Based Systems, 59, 142, doi: 10.1016/j.knosys.2014.01.012
Maalouf, M., & Trafalis, T. B. 2011, Computational Statistics & Data Analysis, 55, 168, doi: 10.1016/j.csda.2010.06.014
Manski, C. F., & Lerman, S. R. 1977, Econometrica, 45, 1977. http://www.jstor.org/stable/1914121
McCullagh, P., & Nelder, J. 1989, Generalized linear models (Chapman and Hall)
McCulloch, C. E. 1997, Journal of the American Statistical Association, 92, 162. http://www.jstor.org/stable/2291460
Nagelkerke, N., & Fidler, V. 2015, PLoS ONE, 10, e0140718
Peng, C.-Y. J., Lee, K. L., & Ingersoll, G. M. 2002, The Journal of Educational Research, 96, 3. http://www.jstor.org/stable/27542407
Petroff, E., Hessels, J. W. T., & Lorimer, D. R. 2022, A&A Rv, 30, 2, doi: 10.1007/s00159-022-00139-w
Pleunis, Z., Good, D. C., Kaspi, V. M., et al. 2021, The Astrophysical Journal, 923, 1, doi: 10.3847/1538-4357/ac33ac
Sheather, S. J. 2009, Logistic Regression (New York, NY: Springer New York), 263–303, doi: 10.1007/978-0-387-09608-7_8
Spitler, L. G., Scholz, P., Hessels, J. W. T., et al. 2016, Nature, 531, 202, doi: 10.1038/nature17168
Tomz, M., King, G., & Zeng, L. 2003, Journal of Statistical Software, 8, 1–27, doi: 10.18637/jss.v008.i02
van den Goorbergh, R., van Smeden, M., Timmerman, D., & Van Calster, B. 2022, Journal of the American Medical Informatics Association, 29, 1525, doi: 10.1093/jamia/ocac093
Wagstaff, K. L., Tang, B., Thompson, D. R., et al. 2016, Publications of the Astronomical Society of the Pacific, 128, 084503, doi: 10.1088/1538-3873/128/966/084503
Yamasaki, S., Goto, T., Ling, C.-T., & Hashimoto, T. 2023, Monthly Notices of the Royal Astronomical Society, 527, 11158, doi: 10.1093/mnras/stad3844
Yang, X., Zhang, S.-B., Wang, J.-S., & Wu, X.-F. 2023, Monthly Notices of the Royal Astronomical Society, 522, 4342, doi: 10.1093/mnras/stad1304
Yao, J. M., Manchester, R. N., & Wang, N. 2017, ApJ, 835, 29, doi: 10.3847/1538-4357/835/1/29
Zhao, Z.-W., Wang, L.-F., Zhang, J.-G., Zhang, J.-F., & Zhang, X. 2023, JCAP, 2023, 022, doi: 10.1088/1475-7516/2023/04/022
Zhu-Ge, J.-M., Luo, J.-W., & Zhang, B. 2022a, Monthly Notices of the Royal Astronomical Society, 519, 1823, doi: 10.1093/mnras/stac3599
—. 2022b, Monthly Notices of the Royal Astronomical Society, 519, 1823, doi: 10.1093/mnras/stac3599