Statistical Analysis and Data Visualization of Indonesia and Malaysia Sars Cov-2 Metadata
3 authors:
Bens Pardamean
Binus University
All content following this page was uploaded by Bens Pardamean on 16 June 2021.
E-mail: digdo.sudigyo@binus.edu
1. Introduction
Indonesia showed the highest SARS CoV-2 incidence in ASEAN. By 31 December 2020, the country had
accumulated 743,198 confirmed COVID-19 cases and 22,138 deaths. The 31-45 age group accounted for
the largest share of positive cases (31.24%). By gender, positive cases were almost equally split
between men (50.02%) and women (49.98%) [1–3]. However, despite the high number of COVID-19 cases,
Indonesia sequenced only 140 SARS CoV-2 genomes, both partial and complete, that were uploaded to
the GISAID database during 2020 [4,5]. Several of the metadata fields displayed on GISAID can be
used to understand the characteristics of COVID-19 in Indonesia.
The built-in visualization of the GISAID database summarizes the metadata of the SARS CoV-2
sequences only in a limited way and does not support customized, detailed comparisons between
countries. Mutation variation, clade distribution, age, and gender demographics are potential factors
to be studied. During 2020, several research papers on these factors were published, allowing for a
more in-depth investigation of the virus's characteristics and diversity [6–12]. However, the specific
dynamics of SARS CoV-2 in the Indonesian population have very rarely been published.
Meanwhile, Malaysia has published a paper on the distribution of SARS CoV-2 clades based on metadata
from the virus database [13]. Malaysia's population has an ethnic distribution highly similar to that
of Indonesia. The two countries' contiguous territory and related ethnic distribution make a
comparison of SARS CoV-2 characteristics between them an intriguing research topic. Based on
metadata information, statistical analysis can determine the prevalence of SARS CoV-2 infection in each
region, allowing us to identify the origin of the virus's transmission and its characteristics. Based on
this information, policymakers may decide whether to enact regional quarantine, lockdown, or
international flight restrictions to reduce the virus's transmission [14–16]. In addition, drug and
vaccine development can be carried out more quickly by following the dynamics of the specific variants
and characteristics of the virus in the Indonesian population [17–19]. Python-based statistical analysis
allows visualization and statistical tests to be performed together with a high level of customization.
Several variables from the GISAID metadata can be described statistically and visualized by comparing
between countries.
In the present study, we compared SARS CoV-2 between the populations of Indonesia and Malaysia
using Python to determine whether the virus characteristics recorded in the GISAID metadata differ
statistically.
2. Literature Review
3. Research Methodology
SARS CoV-2 sequence data containing virus details (clades and mutations) and sample information (age
and sex of the patient) were collected from GISAID for two populations, Indonesia and Malaysia.
Metadata for 117 Indonesian and 250 Malaysian sequences collected during 2020 were obtained.
Descriptive statistics were computed in Python to analyze the differences in COVID-19 case
characteristics between these two populations, specifically virus clade, virus mutation, and age
distribution stratified by gender. A series of bar plots and boxplots was created using Matplotlib [36]
to visualize the comparison between Indonesian and Malaysian cases. Bar plots were used to visualize
the comparison of virus mutation and clade frequencies, while boxplots were generated to depict the
differences in age distribution between populations for each gender. Subsequently, chi-square tests and
t-tests were performed using the SciPy library [37–39] to quantify the significance of differences
between COVID-19 cases in Indonesia and Malaysia on all aforementioned variables. Overall, this
research method is summarized in Figure 1.
Figure 1. Method workflow for comparing the characteristics of Indonesian and Malaysian SARS CoV-2
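The chi-square step of this workflow could be sketched as follows. This is a minimal illustration, not the authors' script: the (country, clade) records are hypothetical placeholders rather than the study's GISAID metadata, and the variable names are chosen here for readability. It builds a country-by-clade contingency table and passes it to `scipy.stats.chi2_contingency`.

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical (country, clade) records standing in for GISAID metadata rows.
# These counts are illustrative placeholders, NOT the study's data.
records = (
    [("Indonesia", "GH")] * 12 + [("Indonesia", "L")] * 8 + [("Indonesia", "G")] * 5
    + [("Malaysia", "GH")] * 6 + [("Malaysia", "L")] * 7 + [("Malaysia", "G")] * 15
)

countries = sorted({country for country, _ in records})
clades = sorted({clade for _, clade in records})
counts = Counter(records)

# Contingency table: one row per country, one column per clade.
table = np.array([[counts[(c, k)] for k in clades] for c in countries])

# Pearson chi-square test of independence between country and clade.
chi2, p_value, dof, expected = chi2_contingency(table)

print("countries:", countries)
print("clades:", clades)
print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}")
```

A small p-value would suggest that the clade proportions differ between the two countries. Note that for 2x2 tables `chi2_contingency` applies a continuity correction by default; whether the original analysis used one is not stated in the text.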
The mutation frequency of SARS CoV-2 in both Indonesia and Malaysia was dominated by the spike
D614G and NSP12 P323L mutations. A significant difference in these two mutations was found
between Indonesia and Malaysia, with a p-value of 0.007. The D614G spike mutation, carried by the G
clade and its descendants (GH and GR), increases the infectivity of SARS CoV-2 [12,41,44]. This
mutation dominates the global incidence of COVID-19 by increasing the binding ability of the viral S
protein to ACE2 in humans [7,25–29]. Our results indicate that the high number of D614G spike
mutations in the Indonesian population is carried by the GH clade, while in the Malaysian population,
this mutation is driven by the high frequency of the G clade. In addition, the NSP12 P323L mutation,
together with the D614G spike mutation, predominates in global COVID-19 cases and is carried by
SARS CoV-2 clades G, GR, and GH [19,23,45,46]. The NSP12 P323L mutation alters the intramolecular
interactions and protein stability in the virus, allowing an increase in both structural and functional
variation of SARS CoV-2 [47]. Although these two mutations also dominate COVID-19 in Asian
populations, Asian SARS CoV-2 shows a low association between spike D614G and NSP12 P323L based
on the haplotype percentage from the linkage disequilibrium (LD) calculation in the study of Pandit
et al. [41].
Figure 3. Comparison of the top two SARS CoV-2 mutations between Indonesia and Malaysia
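A grouped bar plot of this kind could be drawn with Matplotlib roughly as follows. This is a sketch under stated assumptions: the per-mutation counts are hypothetical placeholders (only the sample sizes 117 and 250 come from the text), and plotting percentages rather than raw counts is a choice made here so that the unequal sample sizes remain comparable.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical mutation counts per country; placeholders, NOT the study's data.
mutations = ["Spike D614G", "NSP12 P323L"]
counts = {"Indonesia": [40, 30], "Malaysia": [90, 60]}
n_sequences = {"Indonesia": 117, "Malaysia": 250}  # sample sizes from the text

x = np.arange(len(mutations))
width = 0.35  # width of each bar within a group

fig, ax = plt.subplots(figsize=(6, 4))
for i, (country, values) in enumerate(counts.items()):
    # Share of that country's sequences carrying each mutation, in percent.
    share = 100 * np.array(values) / n_sequences[country]
    # Shift each country's bars left or right of the group centre.
    ax.bar(x + (i - 0.5) * width, share, width, label=country)

ax.set_xticks(x)
ax.set_xticklabels(mutations)
ax.set_ylabel("Sequences carrying mutation (%)")
ax.legend()
fig.tight_layout()
```

Calling `fig.savefig("mutation_comparison.png")` afterwards would write the figure to disk; the file name is arbitrary.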
The majority of COVID-19 cases in both countries were between 20 and 60 years old, as shown in
Figure 4. However, the average age of cases in Indonesia was higher than in Malaysia. The high number
of COVID-19 cases at a young age in Malaysia was suspected to result from infections at religious
gatherings at the start of the pandemic and from asymptomatic patients [13,48,49]. Meanwhile, in the
Indonesian population, COVID-19 cases occurred frequently in men aged > 40 years. One possible
explanation is a higher percentage of active male smokers in the Indonesian population than in
Malaysia. In active smokers, ACE2 expression in lung cells is increased. Overexpression of ACE2
makes it easier for the SARS CoV-2 spike protein to bind to the host, resulting in infection [50,51].
However, the prevalence of smoking among COVID-19 patients is still low. Therefore, several factors
may contribute to the age difference between men in the populations of Indonesia and Malaysia. These
age distributions reflect the same pattern seen in the early days of the global pandemic, as reported in
several studies [33,35].
Figure 4. Age distribution of COVID-19 cases in Indonesia and Malaysia stratified by gender
In the gender-stratified comparison, the difference in age distribution between the two countries
among the male group was found to be significant, with a p-value of 0.017. On the other hand, the
difference in age distribution in the female group was not significant, with a p-value of 0.206.
Additionally, the comparisons of age distribution between genders within the Indonesian and Malaysian
populations were also statistically insignificant, with p-values of 0.947 and 0.470, respectively.
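The four comparisons reported above (between countries within each gender, and between genders within each country) could be run with `scipy.stats.ttest_ind` as sketched below. The ages are synthetic draws from a seeded random generator, not the study's data, and the use of Welch's unequal-variance variant (`equal_var=False`) is an assumption, since the text does not say which form of the t-test was applied.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)

# Synthetic ages, NOT the study's data: (country, gender) -> array of ages.
ages = {
    ("Indonesia", "Male"): rng.normal(45, 15, 60).clip(1, 90),
    ("Indonesia", "Female"): rng.normal(44, 15, 50).clip(1, 90),
    ("Malaysia", "Male"): rng.normal(38, 15, 120).clip(1, 90),
    ("Malaysia", "Female"): rng.normal(40, 15, 100).clip(1, 90),
}

results = {}

# Between-country comparison within each gender.
for gender in ("Male", "Female"):
    _, p = ttest_ind(
        ages[("Indonesia", gender)],
        ages[("Malaysia", gender)],
        equal_var=False,  # Welch's variant: an assumption, not from the paper
    )
    results[f"{gender}: Indonesia vs Malaysia"] = p

# Between-gender comparison within each country.
for country in ("Indonesia", "Malaysia"):
    _, p = ttest_ind(
        ages[(country, "Male")],
        ages[(country, "Female")],
        equal_var=False,
    )
    results[f"{country}: Male vs Female"] = p

for label, p in results.items():
    print(f"{label}: p = {p:.3f}")
```

With real metadata, the `ages` dictionary would be filled from the patient age and sex columns after dropping records with missing values.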
5. Conclusion
The comparison of COVID-19 case metadata between Indonesia and Malaysia, collected during 2020,
shows some differences in terms of clade, mutation, and gender-stratified age distribution.
Indonesian SARS CoV-2 isolates were dominated by clades GH and L, while clades G and O had the
highest frequency in Malaysian cases. A significant difference in SARS CoV-2 clade proportions
between the Indonesian and Malaysian populations was found only in the two dominant Malaysian clades.
From the mutation point of view, a significant difference was found in the comparison of the spike
D614G and NSP12 P323L mutations between the two countries. COVID-19 patients in Indonesia and
Malaysia were predominantly between 20 and 60 years old. The difference in age distribution was
found to be significant only in the male groups from the two countries. Despite the limited data in
this study, our analysis still provides relevant findings to be validated with more data from other
countries.
References
[1] Indonesian Task Force C-19 R A 2020 Data Sebaran, Gugus Tugas Percepatan Penanganan
COVID-19
[2] Dong E, Du H and Gardner L 2020 An interactive web-based dashboard to track COVID-19 in
real time Lancet Infect. Dis. 20 533–4
[3] Caraka R E, Lee Y, Kurniawan R, Herliansyah R, Toharudin T and Pardamean B 2020 Impact
of COVID-19 large scale restriction on environment and economy in Indonesia Glob. J.
Environ. Sci. Manag. 6 65–84
[4] Elbe S and Buckland-Merrett G 2017 Data, disease and diplomacy: GISAID’s innovative
contribution to global health Glob. Challenges 1 33–46
[5] Shu Y and McCauley J 2017 GISAID: Global initiative on sharing all influenza data--from
vision to reality Eurosurveillance 22 30494
[6] Brufsky A 2020 Distinct viral clades of SARS-CoV-2: implications for modeling of viral spread
J. Med. Virol. 92 1386–90
[7] Mercatelli D and Giorgi F M 2020 Geographic and Genomic Distribution of SARS-CoV-2
Mutations Front. Microbiol. 11 1800
[8] Rambaut A, Holmes E C, O’Toole Á, Hill V, McCrone J T, Ruis C, du Plessis L and Pybus O G
2020 A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic
epidemiology Nat. Microbiol. 5 1403–7
[9] Krueger D K, Kelly S M, Lewicki D N, Ruffolo R and Gallagher T M 2001 Variations in
disparate regions of the murine coronavirus spike protein impact the initiation of membrane
fusion J. Virol. 75 2792–802
[10] Geoghegan J L and Holmes E C 2018 The phylogenomics of evolving virus virulence Nat. Rev.
Genet. 19 756–69
[11] Eaaswarkhanth M, Al Madhoun A and Al-Mulla F 2020 Could the D614G substitution in the
SARS-CoV-2 spike (S) protein be associated with higher COVID-19 mortality? Int. J. Infect.
Dis. 96 459–60
[12] Sengupta A, Hassan S S and Choudhury P P 2021 Clade GR and clade GH isolates of SARS-
CoV-2 in Asia show highest amount of SNPs Infect. Genet. Evol. 89 104724
[13] Sim B L H, Chidambaram S K, Wong X C, Pathmanathan M D, Peariasamy K M, Hor C P,
Chua H J and Goh P P 2020 Clinical characteristics and risk factors for severe COVID-19
infections in Malaysia: a nationwide observational study Lancet Reg. Heal. Pacific 4 100055
[14] Kaban P A, Kurniawan R, Caraka R E, Pardamean B, Yuniarto B and others 2019 Biclustering
method to capture the spatial pattern and to identify the causes of social vulnerability in
Indonesia: a new recommendation for disaster mitigation policy Procedia Comput. Sci. 157
31–7
[15] Sumaryana A, Toharudin T, Eko Caraka R, Pontoh R S, Chen R C and Pardamean B 2020 Short
Communication: COVID-19 Pandemic and Attitude of Citizens in Bandung City Indonesia
(Case Study in Cibiru Subdistrict) Int. J. Criminol. Sociol. 9 241–6
[16] Caraka R E, Chen R C, Lee Y, Gio P U, Budiarto A and Pardamean B 2021 Latent Regression
and Ordination Risk of Infectious Disease and Climate Procedia Comput. Sci. 179 25–32
[17] Conti P, Ronconi G, Caraffa A L, Gallenga C E, Ross R, Frydas I and Kritas S K 2020 Induction
of pro-inflammatory cytokines (IL-1 and IL-6) and lung inflammation by Coronavirus-19
(COVI-19 or SARS-CoV-2): anti-inflammatory strategies J Biol Regul Homeost Agents 34 1
[18] Hoffmann M, Kleine-Weber H, Schroeder S, Krüger N, Herrler T, Erichsen S, Schiergens T S,
Herrler G, Wu N-H, Nitsche A and others 2020 SARS-CoV-2 cell entry depends on ACE2
and TMPRSS2 and is blocked by a clinically proven protease inhibitor Cell
[19] Mutlu O, Ugurel O M, Sariyer E, Ata O, Inci T G, Ugurel E, Kocer S and Turgut-Balik D 2020
Targeting SARS-CoV-2 Nsp12/Nsp8 interaction interface with approved and investigational
drugs: an in silico structure-based approach J. Biomol. Struct. Dyn. 1
[20] Bartolini B, Rueca M, Gruber C E M, Messina F, Giombini E, Ippolito G, Capobianchi M R and
Di Caro A 2020 The newly introduced SARS-CoV-2 variant A222V is rapidly spreading in
Lazio region, Italy medRxiv
[21] Tang X, Wu C, Li X, Song Y, Yao X, Wu X, Duan Y, Zhang H, Wang Y, Qian Z and others
2020 On the origin and continuing evolution of SARS-CoV-2 Natl. Sci. Rev. 7 1012–23
[22] Kumar N P, Saini P and Kumar A 2020 Distribution of the Genetic clade “G” of SARS-CoV-
2--an insight into COVID-19 virulence and spread in India
[23] Lorenzo-Redondo R, Nam H H, Roberts S C, Simons L M, Jennings L J, Qi C, Achenbach C J,
Hauser A R, Ison M G, Hultquist J F and others 2020 A clade of SARS-CoV-2 viruses
associated with lower viral loads in patient upper airways EBioMedicine 62 103112
[24] Ko K, Nagashima S, E B, Ouoba S, Akita T, Sugiyama A, Ohisa M, Sakaguchi T, Tahara H,
Ohge H and others 2021 Molecular characterization and the mutation pattern of SARS-CoV-
2 during first and second wave outbreaks in Hiroshima, Japan PLoS One 16 e0246383
[25] Plante J A, Liu Y, Liu J, Xia H, Johnson B A, Lokugamage K G, Zhang X, Muruato A E, Zou J,
Fontes-Garfias C R, Mirchandani D, Scharton D, Bilello J P, Ku Z, An Z, Kalveram B,
Freiberg A N, Menachery V D, Xie X, Plante K S, Weaver S C and Shi P-Y Spike mutation
D614G alters SARS-CoV-2 fitness and neutralization susceptibility bioRxiv 19
[26] Mohammad A, Alshawaf E, Marafie S K, Abu-Farha M, Abubaker J and Al-Mulla F 2020
Higher binding affinity of Furin to SARS-CoV-2 spike (S) protein D614G could be
associated with higher SARS-CoV-2 infectivity Int. J. Infect. Dis.
[27] Jackson C B, Zhang L, Farzan M and Choe H 2020 Functional importance of the D614G
mutation in the SARS-CoV-2 spike protein Biochem. Biophys. Res. Commun.
[28] Yurkovetskiy L, Wang X, Pascal K E, Tomkins-Tinch C, Nyalile T P, Wang Y, Baum A, Diehl
W E, Dauphin A, Carbone C, Veinotte K, Egri S B, Schaffner S F, Lemieux J E, Munro J B,
Rafique A, Barve A, Sabeti P C, Kyratsous C A, Dudkina N V., Shen K and Luban J 2020
Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant Cell
183 739-751.e8
[29] Franco-Muñoz C, Álvarez-Díaz D A, Laiton-Donato K, Wiesner M, Escandón P, Usme-Ciro J
A, Franco-Sierra N D, Flórez-Sánchez A C, Gómez-Rangel S, Rodríguez-Calderon L D,
Barbosa-Ramirez J, Ospitia-Baez E, Walteros D M, Ospina-Martinez M L and Mercado-
Reyes M 2020 Substitutions in Spike and Nucleocapsid proteins of SARS-CoV-2 circulating
in South America Infect. Genet. Evol. 85 104557
[30] Mi J, Zhong W, Huang C, Zhang W, Tan L and Ding L 2020 Gender, age and comorbidities as
the main prognostic factors in patients with COVID-19 pneumonia Am. J. Transl. Res. 12
6537
[31] Agrawal H, Das N, Nathani S, Saha S, Saini S, Kakar S S and Roy P 2020 An assessment on
impact of COVID-19 infection in a gender specific manner Stem cell Rev. reports 1–19
[32] Vahidy F S, Pan A P, Ahnstedt H, Munshi Y, Choi H A, Tiruneh Y, Nasir K, Kash B A,
Andrieni J D and McCullough L D 2021 Sex differences in susceptibility, severity, and
outcomes of coronavirus disease 2019: Cross-sectional analysis from a diverse US
metropolitan area PLoS One 16 e0245556
[33] Monod M, Blenkinsop A, Xi X, Hebert D, Bershan S, Tietze S, Baguelin M, Bradley V C, Chen
Y, Coupland H and others 2021 Age groups that sustain resurging COVID-19 epidemics in
the United States Science 371 eabe8372
[34] Gyasi R M 2020 SARS-CoV-2 outbreaks, age and gender: Getting under the skin Int. J. Health
Plann. Manage. 35 1632–4
[35] Davies N G, Klepac P, Liu Y, Prem K, Jit M and Eggo R M 2020 Age-dependent effects in the
transmission and control of COVID-19 epidemics Nat. Med. 26 1205–11
[36] Hunter J D 2007 Matplotlib: A 2D graphics environment Comput. Sci. Eng. 9 90–5
[37] Virtanen P, Gommers R, Oliphant T E, Haberland M, Reddy T, Cournapeau D, Burovski E,
Peterson P, Weckesser W, Bright J and others 2020 SciPy 1.0: fundamental algorithms for
scientific computing in Python Nat. Methods 17 261–72
[38] Pardamean B 2017 Dasar Bioinformatika dengan R
[39] Pardamean B, Budiarto A and Caraka R 2018 Bioinformatika dengan R Tingkat Lanjut
[40] Caraka R E, Lee Y, Chen R C, Toharudin T, Gio P U, Kurniawan R and Pardamean B 2020
Cluster Around Latent Variable for Vulnerability Towards Natural Hazards, Non-Natural
Hazards, Social Hazards in West Papua IEEE Access 9 1972–86
[41] Pandit B, Bhattacharjee S and Bhattacharjee B 2021 Association of clade-G SARS-CoV-2
viruses and age with increased mortality rates across 57 countries and India Infect. Genet.
Evol. 90 104734
[42] Srivastava S, Banu S, Singh P, Sowpati D T and Mishra R K 2021 SARS-CoV-2 genomics: An
Indian perspective on sequencing viral variants J. Biosci. 46 1–14
[43] Yap P S X, Tan T S, Chan Y F, Tee K K, Kamarulzaman A and Teh C S J 2020 An Overview of
the Genetic Variations of the SARS-CoV-2 Genomes Isolated in Southeast Asian Countries
J. Microbiol. Biotechnol. 30 962–6
[44] Korber B, Fischer W M, Gnanakaran S, Yoon H, Theiler J and others 2020 Tracking Changes in
SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus Cell
182 812-827.e19
[45] Hartley P D, Tillett R L, AuCoin D P, Sevinsky J R, Xu Y, Gorzalski A, Pandori M, Buttery E,
Hansen H, Picker M A, Rossetto C C and Verma S C Genomic surveillance of Nevada
patients revealed prevalence of unique SARS-CoV-2 variants bearing mutations in the RdRp
gene medRxiv
[46] Kannan S R, Spratt A N, Quinn T P, Heng X, Lorson C L, Sönnerborg A, Byrareddy S N and
Singh K 2020 Infectivity of SARS-CoV-2: there Is Something More than D614G? J.
Neuroimmune Pharmacol. 1
[47] Chand G B, Banerjee A and Azad G K 2020 Identification of novel mutations in RNA-
dependent RNA polymerases of SARS-CoV-2 and their implications on its protein structure.
PeerJ 8 e9492
[48] He X, Lau E H Y, Wu P, Deng X, Wang J, Hao X, Lau Y C, Wong J Y, Guan Y, Tan X and
others 2020 Temporal dynamics in viral shedding and transmissibility of COVID-19 Nat.
Med. 26 672–5
[49] Quadri S A 2020 COVID-19 and religious congregations: Implications for spread of novel
pathogens Int. J. Infect. Dis. 96 219–21
[50] Polverino F 2020 Cigarette smoking and COVID-19: A complex interaction Am. J. Respir. Crit.
Care Med. 202 471–2
[51] Smith J C, Sausville E L, Girish V, Yuan M Lou, Vasudevan A, John K M and Sheltzer J M
2020 Cigarette smoke exposure and inflammatory signaling increase the expression of the
SARS-CoV-2 receptor ACE2 in the respiratory tract Dev. Cell 53 514–29