Knowledge-Guided Data-Centric AI in Healthcare:


Progress, Shortcomings, and Future Directions
Edward Y. Chang
Stanford University
echang@cs.stanford.edu
Fellow of ACM & IEEE

Abstract
The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of
examples of a particular concept or meaning. In the field of medicine, having a diverse set of training data on a particular disease
can lead to the development of a model that is able to accurately predict the disease. However, despite the potential benefits, there
have not been significant advances in image-based diagnosis due to a lack of high-quality annotated data. This article highlights
the importance of using a data-centric approach to improve the quality of data representations, particularly in cases where the
available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data:
data augmentation, transfer learning, federated learning, and GANs (generative adversarial networks). We also propose the use of
knowledge-guided GANs to incorporate domain knowledge in the training data generation process. With the recent progress in
large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can be used to improve the
effectiveness of knowledge-guided generative methods.

I. INTRODUCTION
Applying artificial intelligence (AI) techniques to improve various aspects of healthcare, such as disease diagnosis and
treatment recommendations, has made significant progress over the past half-century, but there are still several challenges to be
overcome. This article presents the current state of progress in this field, identifies key challenges, and discusses promising
directions for future research.
The term “AI” regained popularity after AlphaGo’s success in 2015, but in the scientific domain, AI is simply a set of
machine learning or statistical learning algorithms. According to a survey paper by W. B. Schwartz [64], pattern recognition
techniques (such as Boolean algebra and naive Bayesian models) were first used in healthcare in the 1960s to learn the weights
of a set of decision factors for classification and disease prediction. A notable example was Stanford's Dendral system [77],
which helped chemists identify unknown organic molecules. In the 1970s and 1980s, the machine learning community focused
on using AI to solve clinical problems. Pathophysiological knowledge and reasoning were incorporated into the development
of rule-based expert systems. The main approach was to match a patient’s symptoms with stored profiles of findings for
each disease. MYCIN [20] is a representative system for identifying bacteria causing severe infections and recommending
antibiotics. Another system, INTERNIST-I [21], used a multiple hypotheses strategy to consider multiple disease candidates
without prematurely converging on a single prediction. The rule-based approach to AI in healthcare was not successful due to a
lack of comprehensive data and insufficient accuracy. These limitations prevented the deployment of any rule-based systems in
the healthcare industry during this time period. Additionally, there was resistance from healthcare professionals to the adoption
of these systems due to both procedural and legal concerns.
From the 1990s to the 2010s, the focus of AI in healthcare shifted to developing tools and algorithms to improve performance.
These efforts were driven by two key insights: first, that AI systems in healthcare must be able to handle imperfect or missing
data, and second, that small or insufficient training data is often the norm in healthcare and must be directly addressed. To
address these challenges, researchers developed and improved upon techniques such as fuzzy set theory, Bayesian networks,
clustering, and support vector machines (SVMs). Fuzzy set theory, for example, can handle data uncertainty and missing data,
while Bayesian networks can consider correlations between parameters to reduce dependencies. SVMs, which use the kernel
trick, can significantly reduce the amount of required training data by classifying data instances in the similarity space. Both
SVMs and data clustering techniques, such as K-means and manifold learning, can reduce computation dimensionality and
improve prediction accuracy with less training data.
Since 2010, the data-centric approach, which involves using large volumes of data to learn data representations, has become
more widely accepted in the field of artificial intelligence [12, 37]. This approach involves using large amounts of data to learn
data representations, and is exemplified by algorithms such as deep learning [31, 32] and transformers [71]. This approach
differs from the model-centric approach, which relies on human-designed computer models or algorithms to extract features from
images or documents. Unlike the model-centric approach, the data-centric approach involves learning features/representations
from data, and more data can improve the quality of these representations. Additionally, the internal parameters of data-centric
algorithms like multi-layer perceptrons and convolutional neural networks (CNNs) [39] can be changed or learned based on
the structure found in large datasets. The features of an image in a data-centric pipeline like a CNN can be affected by other
images, while the features of an image in a model-centric pipeline like SIFT [43] are invariant. The greater the quantity and
variety of data available, the better the representations that can be learned by a data-centric pipeline. When a learning algorithm
has been trained on a sufficient number of instances of an object under various conditions, such as different postures and partial
occlusion, the features learned from this data will be more comprehensive.
Despite advancements in algorithms and data quality, AI is still not widely deployed in hospitals. The reasons can be
understood from three documents: the "indictment" made by A. R. Feinstein [28] in 1977, Google's negative deployment
experience published in April 2020 [20], and the comments made by Andrew Ng at Stanford Healthcare's
AI Future workshop in April 2021 [38]. Their comments can be summarized into three factors: 1) a lack of deep interdisciplinary
collaboration between computer scientists and clinicians, 2) a lack of robustness and interpretability in AI models, and 3)
insufficient training data for the data-centric approach to learn good representations. This article focuses on addressing the
training data issue. The current data augmentation techniques and transfer learning methods are not able to systematically
enhance diversity to cover all variants of a medical condition. This leads to AI models being unable to effectively handle
out-of-distribution instances. One well-known problem is that AI models trained in one hospital often do not perform well
when used in a different hospital with different hardware and software configurations and clinical practices.
In the remainder of this article we first summarize our prior studies in training-data generation [10, 18, 66, 72] and aggregation
in Section II. In Section III, we propose a knowledge-guided generative model that can generate valid patterns not seen in the
training data. Finally, we enumerate plausible future research directions in Section IV.

II. TRAINING DATA GENERATION AND AGGREGATION


The representation learning approach, which focuses on using data to train models, requires large and diverse training sets
that can account for all possible variations of a target concept, such as an illness, an object, or a document. However, collecting such
data can be challenging, especially in healthcare. For example, annotating chest X-ray images for FDA certification requires
three certified specialists, who are often in high demand and well paid. The task of annotating a set of 180,000 chest X-ray
images, which includes the 108,938-image NIH lung cancer dataset, took more than three years and was not completed by the end
of 2020. Moreover, it is not possible for specialists to guarantee that these images represent all possible variations of chest
diseases. To overcome these challenges, AI researchers have explored various methods to increase the volume and diversity of
training data, such as data augmentation, federated learning, transfer learning, and data synthesis using methods such
as GANs. In the rest of this section, we will examine each of these methods as well as their advantages and disadvantages
through analyses and our previous empirical studies [17, 18, 66].

A. Data Augmentation
It was observed by B. Li and E. Chang in 2002 in an image similarity study [40] that an image that has gone through
transformations such as scaling, cropping, down-sampling, color enhancement, lighting changes, etc. may appear to be perceptually
similar to the original image. However, a traditional similarity function such as the Minkowski family function (e.g., L1 and L2
functions) would quantify these transformed images to be dissimilar to the original. There are three approaches to address this
problem. The first approach is to devise a distance function to correctly quantify the distance between perceptually similar
images/objects. Our proposed Dynamic Partial Function (DPF) [40] is one effective method based on psychological principles.
The second approach is to employ a model that is insensitive to some image transformations. For instance, the graph neural
network (GNN) model [76] connects key features of an object into a graph, which is invariant to transformations such as
rotation and translation. However, GNN cannot model all transformations. One could consider employing GNN (instead of
using CNN) as the training algorithm to work with augmented data. Data augmentation is the third method, which generates all
possible variants of a semantic concept (such as a disease or an object) and adds them to the training data.
AlexNet is a pioneering work that uses data augmentation extensively. According to [37], AlexNet employs two distinct forms
of data augmentation, both of which can produce the transformed images from the original images with very little computation
[37, 41]. The first scheme of data augmentation includes a random cropping function and horizontal reflection function. Data
augmentation can be applied to both the training and testing stages. For the training stage, AlexNet randomly extracted smaller
image patches (224 × 224) and their horizontal reflections from the original images (256 × 256). The AlexNet model was trained
on these extracted patches instead of the original images in the ImageNet dataset. The second scheme of data augmentation
alters the intensities of the RGB channels in training images by using principal component analysis (PCA). This scheme is
used to capture an important property of natural images: the invariance of object identity to changes in the intensity and color
of the illumination.
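As a concrete illustration (not AlexNet's original code), the two schemes can be sketched with common Python libraries; the PCA color jitter assumes the RGB eigenvectors and eigenvalues have been precomputed from the training set, as described in [37]:

import numpy as np
import torchvision.transforms as T

# Scheme 1: random 224x224 crops and horizontal reflections of 256x256 inputs.
geometric_aug = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.5),
])

# Scheme 2: PCA-based color jitter. eigvecs (3x3) and eigvals (3,) are assumed
# to be precomputed from the RGB covariance of the training images.
def pca_color_jitter(img, eigvecs, eigvals, sigma=0.1):
    # img: H x W x 3 float array with values in [0, 1].
    alphas = np.random.normal(0.0, sigma, size=3)   # one draw per image
    shift = eigvecs @ (alphas * eigvals)            # RGB offset, a 3-vector
    return np.clip(img + shift, 0.0, 1.0)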
Data augmentation enjoys popularity because of its simplicity. However, its major shortcomings are that the generated
new training instances may not appear in the real world and that some unseen patterns of an object cannot be generated via
transformations. Suppose data augmentation increases the number of training instances by a factor of K; the increased computational
cost can be O(K^2).
B. Federated Learning
Recently, federated learning has attracted interest from the healthcare domain [57] because of its advertised benefits of multi-source
data integration, privacy protection, and regulatory compliance.
Federated learning has shown limited success in rudimentary computer vision operations such as organ localization and lesion
segmentation [55]. Unfortunately, from both the architecture and practical perspectives, federated learning faces tremendous
challenges for image-based disease diagnosis. We have proposed a better architecture [23], which is summarized at the end of
this subsection.

Fig. 1: Soteria Components and Its Three-Layer Blockchains: Main, Side, and Digital Agreement.

Federated learning cannot overcome the following seven real-world multi-hospital integration hurdles:
• Hardware diversity. Heterogeneous hardware devices (e.g., MRI machines) of different brands and models are used by
different hospitals.
• Hardware parameters. Even with exactly the same device, hardware parameter settings may be different, which may result
in different image size, resolution, and quality.
• Different clinical standard operating procedures (SOPs).
• Different DL models. One hospital may have a legacy system using GNN and another using CNN, etc.
• Different DL architectures. Since AlexNet, dozens of DL architectures have been developed, such as VGG, GoogLeNet,
ResNet, and Inception. It is virtually impossible to ask all participants to use the same architecture.
• Different hyper-parameter values. Even with the same architecture, the local optimal hyper-parameter values may be
different between sites.
• Different data format, quality, and annotation practice.
Overcoming all seven of these issues would require draconian measures that force all participating hospitals (domestic and
abroad) to use the same equipment, same hardware parameters, same photo-taking procedures, same image resolutions, same
deep learning architectures, same hyper-parameters, same image labeling conventions, etc., and to output the parameters of the
same latent layers for aggregation.
For text data, wherever the data are collected, a word is a word. But image analysis is very sensitive to the aforementioned
variations. Indeed, Andrew Ng remarked at Stanford Healthcare's AI Future workshop in April 2021 [38] that a medical image
model trained at Stanford would simply fail to perform at a hospital down the street due to such variations.
How about privacy preservation? Privacy regulations such as GDPR [1], CCPA [4], and HIPAA [3] require "provable" privacy
preservation. Federated learning is a closed system, and its privacy practice cannot be made distributed, transparent, or publicly
provable. Our proposed Soteria architecture, shown in Figure 1, uses a three-layer blockchain-based ledger to support publicly
auditable privacy regulation compliance. Moreover, by placing digital contracts1 on the side chains, a data owner knows from
the upper-layer blockchains the status of his/her consent, the access time to the data, and the payment for each access. For
details, please consult the Soteria paper [23].
1 A digital contract is an agreement signed between the data owner and the data consumer on the data-access consent with terms and conditions. Terms and
conditions can include access epoch and payment. A digital contract is converted to a piece of SQL-like code to fetch data.
Fig. 2: The Flowchart of the Algorithm, Using OM Images as an Example.

C. Transfer Representation Learning


The idea of transfer learning stems from the fact that human beings can recognize a new object with just a small number of
examples. This few-shot learning capability may come from our learned experience in the past, which has already well-tuned the
parameters of our brain. It is also possible that a pre-trained model in our cerebellum was given to us through heredity and
prior learning [13, 15]. In machine learning, the common practice of transfer representation learning is to pre-train a CNN on a
very large dataset (called the source domain) and then to use the pre-trained CNN either as an initialization or a fixed feature
extractor for the task of interest (called the target domain) [2].
We use disease diagnosis as the target domain to illustrate the problems of and solutions to the challenges of small data
training. Specifically, we use otitis media (OM) and melanoma2 as two example diseases. The training data available to us are
1) 1,195 OM images collected by seven otolaryngologists at Cathay General Hospital3, Taiwan [65] and 2) 200 melanoma
images from PH2 dataset [45]. The source domain from which representations are transferred to our two target diseases is
ImageNet [19].
OM is any inflammation or infection of the middle ear, and treatment consumes significant medical resources each year
[52]. Several symptoms such as redness, bulging, and tympanic membrane perforation may suggest an OM condition. Color,
geometric, and texture descriptors may help in recognizing these symptoms. However, specifying these kinds of features involves
a hand-crafted process and therefore requires domain expertise. Oftentimes, human heuristics obtained from domain experts
may not be able to capture the most discriminative characteristics, and hence the extracted features cannot achieve high detection
accuracy. Similarly, melanoma, a deadly skin cancer, is diagnosed based on the widely-used dermoscopic “ABCD” rule [69],
where A means asymmetry, B means border, C color, and D different structures. The precise identification of such visual cues
relies on experienced dermatologists to articulate. Unfortunately, there are many congruent patterns shared by melanoma and
nevus, with skin, hair, and wrinkles often preventing noise-free feature extraction.
Our transfer representation learning experiments consist of the following five steps:
1) Unsupervised codebook construction: We learned a codebook from ImageNet images, and this codebook construction is
“unsupervised” with respect to OM and melanoma.
2) Encode OM and melanoma images using the codebook: Each image was encoded into a weighted combination of the
pivots in the codebook. The weighting vector is the feature vector of the input image.
3) Supervised learning: Using the transfer-learned feature vectors, we then employed supervised learning to learn two classifiers
from the 1,195 labeled OM instances or 200 labeled melanoma instances.
4) Feature fusion: We also combined some heuristic features of OM (published in [65]) and ABCD features of melanoma
with features learned via transfer learning.
5) Fine tuning: We further fine-tuned the weights of the CNN using labeled data to improve classification accuracy.
As we will show in the remainder of this section, step four does not yield benefit, whereas the other steps are effective
in improving diagnosis accuracy. In other words, these two disease examples demonstrate that features modeled by domain
experts or physicians (the model-centric approach) are ineffective. The data-centric approach of big data representation learning
combined with small data adaptation is convincingly promising.

C.1 Method Specifications


We started with unsupervised codebook construction. On the large ImageNet dataset, we learned the representation of these
images using AlexNet [37]. AlexNet contains eight neural network layers. The first five are convolutional and the remaining
three are fully-connected. Different hidden layers represent different levels of abstraction concepts. We utilized AlexNet in
Caffe [35] as our foundation to build our encoder to capture generic visual features.
For each image input, we obtained a feature vector using the codebook. The information of the image moves from the input
layer to the output layer through the inner hidden layers. Each layer is a weighted combination of the previous layer and stands
for a feature representation of the input image. Since the computation is hierarchical, higher layers intuitively represent higher
2 In our award-winning XPRIZE Tricorder [16, 58] device (code name DeepQ), we effectively diagnose twelve conditions, and OM and melanoma are two
of them.
3 The dataset was used under a strict IRB process. The dataset was deleted by April 2015 after our experiments had completed.
Fig. 3: Four classification flows (OM photos are from [74]).

concepts. For images, the neurons from lower levels describe rudimentary perceptual elements like edges and corners, whereas
the neurons from higher layers represent aspects of objects such as their parts and categories. To capture high-level abstractions,
we extracted transfer-learned features of OM and melanoma images from the fifth, sixth and seventh layers, denoted as pool5
(P5), fc6 and fc7 in Fig. 2 respectively.
Once we had transfer-learned feature vectors of the 1,195 collected OM images and 200 melanoma images, we performed
supervised learning by training a support vector machine (SVM) classifier [11]. We chose the SVM as our model since it is an
effective classifier widely used in prior works. Using the same SVM algorithm lets us perform comparisons with the other
schemes solely based on feature representation. As usual, we scaled features to the same range and found parameters through
cross validation. For fair comparisons with previous OM works, we selected the radial basis function (RBF) kernel.
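A minimal sketch of this feature-extraction-plus-SVM pipeline, assuming a pretrained AlexNet from torchvision (rather than the original Caffe model) and scikit-learn for the SVM; the layer slicing and preprocessing details are illustrative:

import torch
import torchvision
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_transfer_features(images, labels, cv=10):
    # images: N x 3 x 224 x 224 tensor, ImageNet-normalized; labels: N class labels.
    model = torchvision.models.alexnet(weights="IMAGENET1K_V1").eval()
    with torch.no_grad():
        x = model.features(images)            # convolutional stack (pool5)
        x = model.avgpool(x).flatten(1)
        x = model.classifier[:3](x)           # up to and including fc6 + ReLU
    feats = StandardScaler().fit_transform(x.numpy())   # scale to the same range
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")       # RBF kernel, as above
    return cross_val_score(clf, feats, labels, cv=cv).mean()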
To further improve classification accuracy, we experimented with two feature fusion schemes, which combine OM features
hand-crafted by human heuristics (or model-centric) in [65] and our melanoma heuristic features with features learned from
our codebook. In the first scheme, we combined transfer-learned and hand-crafted features to form fusion feature vectors. We
then deployed the supervised learning on the fused feature vectors to train an SVM classifier. In the second scheme, we used
the two-layer classifier fusion structure proposed in [65]. In brief, in the first layer we trained different classifiers based on
different feature sets separately. We then combined the outputs from the first layer to train the classifier in the second layer.
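The two fusion schemes can be sketched as follows; this is an approximation of the two-layer structure of [65], and the out-of-fold stacking step is an assumption rather than the exact published procedure:

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def feature_level_fusion(X_transfer, X_handcrafted, y):
    # Scheme 1: concatenate the two feature sets and train a single SVM.
    X_fused = np.hstack([X_transfer, X_handcrafted])
    return SVC(kernel="rbf").fit(X_fused, y)

def classifier_level_fusion(X_transfer, X_handcrafted, y, cv=10):
    # Scheme 2: two first-layer SVMs, one per feature set; their out-of-fold
    # decision values train a second-layer SVM.
    d1 = cross_val_predict(SVC(kernel="rbf"), X_transfer, y, cv=cv,
                           method="decision_function")
    d2 = cross_val_predict(SVC(kernel="rbf"), X_handcrafted, y, cv=cv,
                           method="decision_function")
    meta = np.column_stack([d1, d2])
    return SVC(kernel="rbf").fit(meta, y)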
Fig. 3 summarizes our transfer representation learning approaches using OM images as an example. The top of the figure
depicts two feature-learning schemes: the transfer-learned scheme on the left-hand side and the hand-crafted scheme on the
right. The solid lines depict how OM or melanoma features are extracted via the transfer-learned codebook, whereas the dashed
lines represent the flow of hand-crafted feature extraction. The bottom half of the figure describes two fusion schemes. Whereas
the dashed lines illustrate the feature fusion by concatenating two feature sets, the dotted lines show the second fusion scheme
at the classifier level. At the bottom of the figure, the four classification flows yield their respective OM-prediction decisions. In
order from left to right in the figure are ’transfer-learned features only’, ’feature-level fusion’, ’classifier-level fusion’, and
’hand-crafted features only’.

C.2 Empirical Study and Discussion


Two sets of experiments were conducted in our prior work [17] to validate our idea. In this subsection, we first report
OM classification performance by using our proposed transfer representation learning approach, followed by our melanoma
classification performance. Then, we elaborate on the correlations between images of ImageNet classes and images of disease
classes by using a visualization tool to explain why transfer representation learning works.
For fine-tuning experiments, we performed a 10-fold cross-validation for OM and a 5-fold cross-validation for melanoma to
train and test our models, so the test data are separated from the training dataset. We applied data augmentation, including
random flip, mirroring, and translation, to all the images.
TABLE I: OM classification experimental results. (The best shown in bold.)

   Method                     Accuracy (std)   Sensitivity   Specificity   F_1
1  Heuristic w/ seg           80.11% (18.8)    83.33%        75.66%        0.822
2  Heuristic w/o seg          76.19% (17.8)    79.38%        71.74%        0.790
3  Transfer w/ seg (pool5)    87.86% (3.62)    89.72%        86.26%        0.890
4  Transfer w/o seg (pool5)   88.37% (3.41)    89.16%        87.08%        0.894
5  Transfer w/ seg (fc6)      87.58% (3.45)    89.33%        85.04%        0.887
6  Transfer w/o seg (fc6)     88.50% (3.45)    89.63%        86.90%        0.895
7  Transfer w/ seg (fc7)      85.60% (3.45)    87.50%        82.70%        0.869
8  Transfer w/o seg (fc7)     86.90% (3.45)    88.50%        84.90%        0.879
9  Feature fusion             89.22% (1.94)    90.08%        87.81%        0.900
10 Classifier fusion          89.87% (4.43)    89.54%        90.20%        0.898
11 Fine-tune                  90.96% (0.65)    91.32%        90.20%        0.917

For the setting of training hyperparameters and network architectures, we used mini-batch gradient descent with a batch size
of 64 examples, learning rate of 0.001, momentum of 0.9 and weight decay of 0.0005. To fine-tune the AlexNet model, we
replaced the fc6, fc7 and fc8 layers with three new layers initialized by using a Gaussian distribution with a mean of 0 and a std
of 0.01. During the training process, the learning rates of those new layers were ten times greater than that of the other layers.
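A minimal PyTorch sketch of this fine-tuning setup (the original experiments used Caffe, so the layer indexing and the two-class output here are illustrative):

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

# Replace fc6, fc7, and fc8 with freshly initialized layers, N(0, 0.01).
new_fc = [nn.Linear(9216, 4096), nn.Linear(4096, 4096), nn.Linear(4096, 2)]
for layer in new_fc:
    nn.init.normal_(layer.weight, mean=0.0, std=0.01)
    nn.init.zeros_(layer.bias)
model.classifier = nn.Sequential(
    nn.Dropout(), new_fc[0], nn.ReLU(inplace=True),
    nn.Dropout(), new_fc[1], nn.ReLU(inplace=True),
    new_fc[2])

# New layers get a 10x larger learning rate than the pretrained conv layers.
base_lr = 0.001
optimizer = torch.optim.SGD(
    [{"params": model.features.parameters()},
     {"params": model.classifier.parameters(), "lr": 10 * base_lr}],
    lr=base_lr, momentum=0.9, weight_decay=0.0005)
criterion = nn.CrossEntropyLoss()
# Train with mini-batches of 64 examples without freezing any parameters.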
Results of Transfer Representation Learning for OM
Our 1,195 OM image dataset encompasses almost all OM diagnostic categories: normal; AOM: hyperemic stage, suppurative
stage, ear drum perforation, subacute resolution stage, bullous myringitis, barotrauma; OME: with effusion, resolution stage
(retracted); COM: simple perforation, active infection. Table I compares OM classification results for different feature
representations. All experiments were conducted using 10-fold SVM classification. The measures of results reflect the
discrimination capability of the features.
The first two rows in Table I show the results of human-heuristic methods (hand-crafted), followed by our proposed
transfer-learned approach. The eardrum segmentation, denoted as ‘seg’, identifies the eardrum by removing OM-irrelevant
information such as ear canal and earwax from the OM images [65]. The best accuracy achieved by using human-heuristic
methods is around 80%. With segmentation (the first row), the accuracy improves by 3% over that without segmentation (the
second row).
Rows three to eight show results of applying transfer representation learning. All results outperform the results shown in
rows one and two, suggesting that the features learned via transfer learning are superior to the hand-crafted ones.
Interestingly, segmentation does not help improve accuracy for learning representation via transfer learning. This indicates
that the transfer-learned feature set is not only more discriminative but also more robust. Among three transfer-learning layer
choices (layer five (pool5), layer six (fc6) and layer seven (fc7)), fc6 yields slightly better prediction accuracy for OM. We
believe that fc6 provides features that are more general or fundamental to transfer to a novel domain than pool5 and fc7 do.
We also directly used the 1,195 OM images to train a new AlexNet model. The resulting classification accuracy was only
71.8%, much lower than applying transfer representation learning. This result confirms our hypothesis that even though CNN is
a good model, with merely 1,195 OM images (without the ImageNet images to facilitate feature learning), it cannot learn
discriminative features.
Two fusion methods, combining both hand-crafted and transfer learning features, achieved a slightly higher OM-prediction
F1-score (0.9 over 0.895) than using transfer-learned features only. This statistically insignificant improvement suggests that
hand-crafted features do not provide much help.
Finally, we used OM data to fine-tune the AlexNet model, which achieves the best accuracy (see row 11 of Table I). For
fine-tuning, we replaced the original fc6, fc7 and fc8 layers with new ones and used OM data to train the whole network
without freezing any parameters. In this way, the learned features can be refined and are thus more aligned to the targeted task.
This result attests that the ability to adapt representations to data is a critical characteristic that makes deep learning superior to
the other learning algorithms.
Results of Transfer Representation Learning for Melanoma
We performed experiments on the PH2 dataset whose dermoscopic images were obtained at the Dermatology Service of
Hospital Pedro Hispano (Matosinhos, Portugal) under the same conditions through the Tuebinger Mole Analyzer system using
a magnification of 20x. The assessment of each label was performed by an expert dermatologist.
TABLE II: Melanoma classification experimental results. (The best shown in bold.)

   Method                     Accuracy (std)    Sens     Spec     F_1
1  ABCD rule w/ auto seg      84.38% (13.02)    85.63%   83.13%   0.8512
2  ABCD w/ manual seg         89.06% (9.87)     90.63%   87.50%   0.9052
3  Transfer w/o seg (pool5)   89.06% (10.23)    92.50%   85.63%   0.9082
4  Transfer w/o seg (fc6)     85.31% (11.43)    83.13%   87.50%   0.8686
5  Transfer w/o seg (fc7)     79.83% (14.27)    84.38%   74.38%   0.8379
6  Feature fusion             90.00% (9.68)     92.50%   87.50%   0.9157
7  Fine-tune                  92.81% (4.69)     95.00%   90.63%   0.9300

Table II compares melanoma classification results for different feature representations. In Table II, all the experiments except
for the last were conducted by using 5-fold SVM classification. The last experiment involved fine-tuning, which was
implemented and evaluated by using Caffe. We also performed data augmentation to balance the PH2 dataset (160 normal
images and 40 melanoma images).
Unlike OM, we found the low-level features to be more effective in classifying melanoma. Among three transfer-learning
layer choices, pool5 yields a more robust prediction accuracy than the other layers do for melanoma. The deeper the layer is,
the worse the accuracy becomes. We believe that pool5 provides low-level features that are suitable for delineating texture
patterns that depict characteristics of melanoma.
Rows three and seven show that the accuracy of transferred features is as good as that of the ABCD rule method with expert
segmentation. These results reflect that deep transferred features are robust to noise such as hair or artifacts.
We used melanoma data to fine-tune the AlexNet model and obtained the best accuracy of 92.81%, since all network parameters
are refined to fit the target task by employing back propagation. We also compared our result with the cutting-edge method,
which reported 98% sensitivity and 90% specificity on PH2 [7]. Their method requires preprocessing such as manual lesion
segmentation to obtain “clean” data. In contrast, we utilized raw images without conducting any heuristic-based preprocessing.
Thus, deep transfer learning can identify features in an unsupervised way to achieve as good classification accuracy as those
features identified by domain experts.
Qualitative Evaluation - Visualization
In order to investigate what kinds of features are transferred or borrowed from the ImageNet dataset, we utilized a visualization
tool to perform qualitative evaluation. Specifically, we first used an attribute selection method, SVMAttributeEval [27] with Ranker
search, to identify the most important features for recognizing OM and melanoma. Second, we mapped these important features
back to their respective codebook and used the visualization tool from Yosinski et al. [79] to find the top ImageNet classes
causing the high value of these features. By observing the common visual appearances shared by the images of the disease
classes and the retrieved top ImageNet classes, we were able to infer the transferred features.
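The tools above are Weka's SVMAttributeEval with Ranker search and the deep-visualization toolbox of [79]; a rough Python approximation of the feature-ranking step, using linear-SVM weight magnitudes as a proxy, might look like:

import numpy as np
from sklearn.svm import LinearSVC

def rank_features(X, y, top_k=20):
    # Rank feature dimensions by the magnitude of linear-SVM weights, a rough
    # proxy for SVM-based attribute ranking (not the exact Weka procedure).
    svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    importance = np.abs(svm.coef_).ravel()
    # Each returned index maps back to a codebook unit, which can then be
    # visualized to find the ImageNet classes that activate it most strongly.
    return np.argsort(importance)[::-1][:top_k]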
Fig. 4 demonstrates the qualitative analyses of four different cases: the Normal eardrum, acute Otitis Media (AOM), Chronic
Otitis Media (COM) and Otitis Media with Effusion (OME), which we will now proceed to explain in turn. First, the normal
eardrum, nematode and ticks are all similarly almost gray with a certain degree of transparency, features that are hard to capture
with only hand-crafted methods. Second, AOM, purple-red cloth and red wine have red colors as an obvious common attribute.
Third, COM and seashells are both commonly identified by a calcified eardrum. Fourth, OME, oranges, and coffee all seem to
share similar colors. Here, transfer learning works to detect OM in an analogous fashion to how explicit similes are used in
language to clarify meaning. The purpose of a simile is to provide information about one object by comparing it to something
with which one is more familiar. For instance, if a doctor says that OM displays redness and certain textures, a patient may not
be able to comprehend the doctor’s description exactly. However, if the doctor explains that OM presents with an appearance
similar to that of a seashell, red wine, orange, or coffee colors, the patient is conceivably able to envision the appearance of
OM at a much more precise level. At level fc6, transfer representation learning works like finding similes that can help explain
OM using the representations learned in the source domain (ImageNet).
The classification of melanoma contrasts sharply with the classification of OM. We can exploit distinct visual features to
classify different OM classes. However, melanoma and benign nevi share very similar textures, as melanoma evolves from
benign nevi. Moreover, melanoma often has atypical textures and presents in various colors.
In the case of detecting melanoma versus benign nevus, effective representations of the diseases from higher-level visual
characteristics cannot be found from the source domain. Instead, the most effective representations are only transferable at a
lower-level of the CNN. We believe that if the source domain can add substantial images of texture-rich objects, the effect of
explicit similes may be utilized at a higher level of the CNN. For detailed analysis, readers can consult our work published in
2019 [17].
Fig. 4: The visualization of helpful features from different classes corresponding to different OM symptoms (from left to right:
Normal eardrum, AOM, COM, OME).

C.3 Observations on Transfer Learning


The transfer-learned features achieve an accuracy of 90.96% (91.32% in sensitivity and 90.20% in specificity) for OM and 92.81%
(95.0% in sensitivity and 90.63% in specificity) for melanoma, an improvement in disease-detection accuracy over
the feature extraction instructed by domain experts. Moreover, our algorithms do not require manual data cleaning beforehand,
and the preliminary diagnosis of OM and melanoma can be derived without aid from doctors. Therefore, automatic disease
diagnosis systems, which hold the potential to help populations lacking in access to medical resources, are developmentally
possible.
In summary, our experiments on transfer learning provided three important insights on representation learning.
1) Low-level representations can be shared. Low-level perceptual features such as edges, corners, colors, and textures can be
borrowed from some source domains where training data are abundant. After all, low-level representations are similar
despite different high-level semantics.
2) Middle-level representations can be correlated. Analogous to explicit similes used in language, an object in the target
domain can be “represented” or “explained” by some source domain features. In our OM visualization, we observed that a
positive OM may display appearances similar to that of a seashell, red wine, oranges, or coffee colors — features learned
and transferred from the ImageNet source domain.
3) Representations can adapt to a target domain. Even though, in the small data training situations, the amount of data is
insufficient to learn effective representations by itself, given representations learned from some big-data source domains,
the small data of the target domain can be used to align (e.g., re-weight) the representations learned from the source
domains to adapt to the target domain. However, quantifying and predicting the transferability from a source domain to a
target domain remains an open research problem.

D. Generative Adversarial Networks (GANs)


Generative adversarial networks (GANs) [26] are a special type of neural network model where two networks are trained
simultaneously. Figure 5 depicts that the generator (denoted as G) focuses on producing fake images and the discriminator
(denoted as D) centers on discriminating fake from real. The goal is for the generator to produce fake images that can fool the
discriminator to believe they are real. If an attempt fails, GANs use backpropagation to adjust network parameters. GANs have
been used for transforming images to different styles [34], or changing facial expression of a person [42]. GANs have also
been used to generate more training data.
Fig. 5: The Vanilla GAN by [26]; figure credit: Hunter Heidenreich [29].

D.1 Method Specifications


Since the introduction of the initial GAN model [26], there have been several variants depending on how the input, output,
and error functions are modeled. GANs can be primarily divided into four representative categories based on the input and
output, and their applications are as follows:
• Conditional GAN (CGAN) [46]: CGAN adds to GAN an additional input, y, on which the models can be conditioned.
Input y can be of any type, e.g., class labels. Conditioning can be achieved by feeding y to both the generator G(z|y) and
the discriminator D(x|y), where x is a training instance and z is random noise in latent space. The benefit of conditioning
on class labels is that it allows the generator to generate images of a particular class. (Application: text to image.)
• Pixel-to-Pixel GAN (Pix2Pix) [34]: Pix2Pix GAN is similar to CGAN. However, conditions are placed upon an image
instead of a label. The effect of such conditioning is that it allows the generator to map images of one style to another,
e.g. mapping a photo to the painting style of an artist or mapping a sketch to a colored image. (Application: image to
image translation, supervised.)
• Progressive-Growing GAN (PGGAN) [36]: PGGAN grows both the generator and discriminator progressively; starting
from low resolution, it adds new layers that model increasingly fine details as training progresses. PGGAN can generate
high-resolution images through progressive refinement. (Application: high-resolution image generation.)
• Cycle GAN [81]: Pix2Pix GAN requires paired training data to train. Cycle GAN is an unsupervised approach for learning
to translate an image from a source domain X to a target domain Y without paired training examples. The goal is to learn two
mappings, from X to Y (i.e., G) and from Y to X (i.e., F), such that the distribution G(X) is indistinguishable from the
distribution Y and the distribution F(Y) is indistinguishable from the distribution X. Cycle GAN introduces
a cycle-consistency loss to push F(G(X)) toward X and G(F(Y)) toward Y. (Application: image to image translation,
unsupervised.)
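As a minimal sketch of the conditioning idea shared by CGAN and its descendants (not any of the cited implementations), both networks receive the label through a learned embedding:

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    # G(z | y): concatenate noise with a label embedding, as in CGAN [46].
    def __init__(self, z_dim=100, n_classes=10, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class ConditionalDiscriminator(nn.Module):
    # D(x | y): the same label embedding is concatenated with the (flattened) image.
    def __init__(self, n_classes=10, img_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))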
Though GANs have demonstrated interesting results, there are both micro and macro research issues that need to be addressed.
The micro issues are related to the formulation of the model’s loss function to achieve good generalization. But this
generalization goal has been cast into doubt by the empirical study of [6], which concludes that training of GANs may not
result in good generalization properties. The GAN loss formulation is regarded as a saddle point optimization problem and
training of the GAN is often accomplished by gradient-based methods [26]. G and D are trained alternatively so that they
evolve together. However, there is no guarantee of balance between the training of G and D with the KL divergence. As a
consequence, one network may inevitably be more powerful than the other, which in most cases, is D. When D becomes too
strong in comparison to G, the generated samples become too easy to differentiate from real ones. Another well-known issue is
that the two distributions are, with high probability, located on disjoint lower-dimensional manifolds without overlap. The work of
WGAN [5] addresses this issue by introducing the Wasserstein distance. However, WGAN still suffers from unstable training,
slow convergence after weight clipping (when the clipping window is too large), and vanishing gradients (when the clipping
window is too small).
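A minimal sketch of the Wasserstein critic loss with weight clipping as used in WGAN [5]; the clipping constant c is the knob whose size causes the issues noted above:

import torch

def wgan_critic_loss(critic, real, fake):
    # The critic maximizes E[D(real)] - E[D(fake)]; we minimize the negative.
    return -(critic(real).mean() - critic(fake).mean())

def clip_critic_weights(critic, c=0.01):
    # Weight clipping crudely enforces a Lipschitz constraint; too large a
    # clipping window slows convergence, too small causes vanishing gradients.
    for p in critic.parameters():
        p.data.clamp_(-c, c)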
The macro issue of GANs is: can GANs help generate large-volume and diversified training data to improve validation and
testing accuracy? As stated in the introductory section, deep learning depends on the scale of training data to succeed, but most
applications do not have ample training data.
Specifically in medical imaging, GANs have been mainly used in five areas: image reconstruction, synthesis, segmentation,
registration, and classification, with hundreds of papers published since 2016 [78]. A recent report [61] summarizes the state
of applied AI in the field of radiology and conveys that promising results have been demonstrated, but the key challenge of
data curation in collection, annotation, and management remains. The work of [22] uses GANs to generate additional samples
Fig. 6: Pre-Trained on Generated Images.

for liver lesion classification and claims that both the sensitivity and specificity are improved. However, the total number of
labeled images is merely 182, which is too small a dataset to draw any convincing conclusions. The work of [63] applies a similar
idea to thoracic disease classification and achieves better performance. The work uses human experts to remove noisy data,
but fails to report how many noisy instances were removed and how much of the accuracy improvement was attributed to
human intervention. The paper also claims that additional data contributes in making training data of all classes balanced to
mitigate the imbalanced training data issue. Had the work demonstrated that generating additional data using GANs helps
despite imbalanced distribution, the improved result would have been more convincing.
Combining 3D model simulation with GANs seems to be another plausible alternative to reaching the same goal of increasing
training instances. The work of [68] presents a framework that can generate a large amount of labeled data by combining a 3D
model with GANs. Another work [67] combines a 3D simulator (with labels) with unsupervised learning to learn a GAN model
that can improve the realism of the simulated labeled data. However, this combining scheme does not work for some tasks. For
example, we experimented with these methods on our AR platform Aristo [80] and did not obtain any accuracy improvements in its
gesture recognition task. Moreover, most medical conditions have lacked exact 3D models so far, which makes the combining
scheme difficult to apply.

D.2 Empirical Study and Discussion


This section presents our experiments performed in our prior work [17] in generating training data using GANs to improve
the accuracy of supervised learning. Section II-C shows that adding images unrelated to OM can improve classification accuracy
due to representation transfer in the lower layers of the model and representation analogy in the middle layers of the model.
This leads us to the following questions: Can GANs produce useful labeled data to improve classification accuracy? If so, which
CNN layers can GANs strengthen to achieve the goal and how do GANs achieve this classification accuracy improvement? Our
experiments were designed to answer these questions.
Experiment Setup
We used the NIH Chest X-ray 14 [73] dataset to conduct our experiments. This dataset consists of 112,120 labeled chest
X-ray images from over 30,000 unique patients, corresponding to 14 common thoracic disease types, including atelectasis,
cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural
thickening, and hernia. The dataset is divided into training, validation, and testing sets, containing 78,468, 11,219, and 22,433
images, respectively4. Our experiments were designed to examine and compare four training methods:
1) Random initialization: Model parameters were randomly initialized.
2) Pre-trained by using ImageNet: Similar to what we did with transfer learning in Section II-C, the network was pre-trained
by using ImageNet.
3) Pre-trained with additional data generated by unsupervised-GAN: The method is shown in Figure 6. First, the GAN
generated the same number of fake images as we had real images. Second, the CNN classifier was trained to differentiate
between real and fake images. Third, the weights were used to initialize the subsequent classification task.
4) Trained with additional data generated by supervised-GAN: By adding the generated images, the size of the dataset was
expanded to 2x and 5x (the size of original dataset is x). In order to show whether GAN can produce labeled data to
directly improve classification accuracy instead of indirectly, we changed the configuration of GAN in Method 3 so that it
could generate labeled images.
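A minimal sketch of Method 3, assuming DenseNet121 from torchvision (the classifier used in our experiments); the GAN generation and training loops are omitted:

import torch.nn as nn
import torchvision

# Step 1 (not shown): a GAN generates as many fake chest X-rays as there
# are real ones in the training split.

# Step 2: pre-train DenseNet121 to tell real from GAN-generated images.
pretrain_net = torchvision.models.densenet121(weights=None)
pretrain_net.classifier = nn.Linear(pretrain_net.classifier.in_features, 2)
# ... train pretrain_net on the real-vs-fake labels ...

# Step 3: initialize the 14-class thoracic-disease classifier with the
# backbone weights learned in step 2, replacing only the final layer.
clf_net = torchvision.models.densenet121(weights=None)
clf_net.features.load_state_dict(pretrain_net.features.state_dict())
clf_net.classifier = nn.Linear(clf_net.classifier.in_features, 14)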
4 We followed the dataset splits in https://github.com/zoogzog/chexnet/tree/master/dataset
TABLE III: Results of the Four Training Methods (mean AUROC; standard deviation in parentheses), by Scale of Dataset.

Method          5%              10%             20%             50%             100%
Method 1        0.708 (0.020)   0.757 (0.003)   0.780 (0.004)   0.807 (0.002)   0.829 (0.000)
Method 2        0.756 (0.006)   0.790 (0.002)   0.807 (0.005)   0.832 (0.001)   0.843 (0.000)
Method 3        0.726 (0.002)   0.765 (0.004)   0.789 (0.001)   0.817 (0.002)   0.828 (0.000)
Method 4 (2x)   0.713 (0.003)   0.724 (0.004)   0.768 (0.004)   0.809 (0.001)   0.824 (0.000)
Method 4 (5x)   0.693 (0.005)   0.727 (0.002)   0.774 (0.005)   0.798 (0.005)   0.813 (0.000)

To establish a yardstick for these four methods, we first measured the “golden” results that supervised learning can attain using
100% training and validation data. We then dialed back the size of the training and validation data to be 50%, 20%, 10%, and
then 5%. We used each of the four methods to either increase training data or pre-train the network. We used PGGAN5 as our
GAN model to generate images with 1024 × 1024 pixel resolution. For our CNN classifier, we employed DenseNet121 [33],
and used AUROC6 as our evaluation metric. Intuitively, our conjectures before seeing the results were as follows:
• Method 1 will perform the worst, since it does not receive any help to improve model parameters.
• Method 4 will perform the best, since it produces more training instances for each target class.
• Method 3 will outperform 2 as the training data generated, though unlabeled, is more relevant to the target disease images
than ImageNet is.
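For reference, the evaluation metric (footnote 6) is the mean of the 14 per-class AUROCs; a minimal scikit-learn sketch:

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auroc(y_true, y_score, n_classes=14):
    # y_true, y_score: N x 14 arrays, one column per thoracic disease type.
    per_class = [roc_auc_score(y_true[:, c], y_score[:, c])
                 for c in range(n_classes)]
    return float(np.mean(per_class))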
Experiment Results
Table III presents our experimental results. We report the AUROC of detecting 14 thoracic disease types using each of the
four different training methods. These results are inconsistent with our conjectures:
• Method 2, which is equivalent to transfer learning, performs the best. No methods using GANs were able to outperform
this method.
• Method 4 performs the worst. In Method 4, additional GAN-generated labeled images were used to perform training. We
believe that the labeled images generated using GANs were too noisy. Therefore, when the number of generated images is increased
(5x vs. 2x), the prediction accuracy does not always increase and is sometimes even worse. This suggests that GANs do not
produce helpful training instances and may in fact be counter-productive.
• Method 3 does not outperform method 2, even though ImageNet data used by method 2 is entirely irrelevant to images of
thoracic conditions. We believe that the additional images generated by GANs used for initializing network parameters are
less useful because of their low volume and variety (diversity). After all, adding more low-quality similar images to an
unlabeled pool cannot help the model learn novel features. Note that a recent keynote of I. Goodfellow [25] points out that
GANs can successfully generate more unlabeled data (not labeled data) to improve MNIST classification accuracy. Table III
reflects the same conclusion that method 3 outperforms method 1, which uses randomly-initialized weights. However,
using GANs to generate unlabeled data may not be more productive than using ImageNet to pre-train the network.
Figure 7 shows samples of real and GAN-generated images. The first column presents real images, the second column GAN-generated
unsupervised, and the third GAN-generated supervised. The GAN-generated images may successfully fool our colleagues with
no medical knowledge. However, as reported in [63], the GAN-generated labeled chest X-ray images must be screened by a
team of radiologists to remove erroneous data (with respect to diagnosis knowledge). Without domain knowledge, incorrectly
labeled images may be introduced by GANs into the training pool, which would degrade classification accuracy.
In summary, the study of [44] shows that pre-training with datasets that are multiple orders of magnitude larger than ImageNet
can achieve higher performance than pre-training with only ImageNet on several image classification and object detection tasks.
This result further attests that volume and variety of data, even if unlabeled, helps improve accuracy. GANs may indeed achieve
volume, but certainly cannot achieve variety.
To explain why using ImageNet can achieve better pre-training performance than that achieved when using GAN-generated
images, we perform layer visualizations using the technique introduced in [53]. Figure 8 plots the output layer of the first
dense-block of DenseNet. Row one shows five filters of untrained randomly initialized weights. Row three shows five filters with
more distinct features learned from the ImageNet pre-trained model. The unsupervised-GAN method (row two) produces filters of
5 We used a publicly available implementation of PGGAN via https://github.com/tkarras/progressive_growing_of_gans. This implementation has an auxiliary
classifier [51] and hence can generate images conditionally (for Method 4) or unconditionally (for Method 3).
6 We used a publicly available implementation of ChexNet [59] from https://github.com/zoogzog/chexnet, which contains a DenseNet121 classifier, and used
its evaluation metric. The metric is derived by first summing up the AUROCs of each of the 14 classes and then dividing the summation by 14.
Fig. 7: Real vs. GAN-Generated Images (columns, left to right: real, GAN-generated unsupervised, GAN-generated supervised).

similar quality to those of row one. Qualitatively, the unsupervised-GAN method learns features akin to those of the random-initialization
method and does not yield more promising classification accuracy.

III. FUSING KNOWLEDGE WITH GANS


The desired outcome of GANs after training is that the generated samples x_g approximate the real data distribution p_r(x).
However, if the real data distribution is under-represented by the training data, the generated samples cannot explore beyond the
training data. For instance, if the otitis media (OM) training data shown in Section II-C consists of only one type of OM, say
AOM, GANs cannot generate the other two types of OM, COM and OME. As another example, if a set of training data consists
of a large number of red roses, and the aim of GANs is to generate entire categories of different colored roses, there would be
no knowledge or hint for G or D to respectively achieve and tolerate diversity in color. In other words, the discriminator D
would reject any roses that are not red, and G would not be encouraged to expand beyond generating red roses. By
nature, GANs treat exploration beyond the paradigm of the seen or known as erroneous.
If we would like GANs to generate diversified samples to improve supervised learning, the new approach must address two
issues:
• Guiding the generator to explore diversity productively.
• Allowing the discriminator to tolerate diversity reasonably.
The adverbs productively and reasonably convey exploration (beyond exploitation) with guidance (via rules) and guardrails (via
rewards, positive and negative). In the case of playing games, rules and rewards are clear. In the case of generating roses
beyond red colors or generating types of flowers beyond roses, guidance and guardrails are difficult to articulate. Supposing
computer vision techniques can precisely segment petals of roses in an image, what colors can the generator G use to replace
red petals? For example, black roses do not exist, so this color would be deemed unreasonable and unproductive for generating
realistic rose images. Exploration beyond training distribution should be permitted, but at the same time guided by knowledge.
How can knowledge be incorporated into training GANs? We enumerate two schemes.
1) Incorporating a human in the loop: Placing a human in the loop, instead of letting function D alone make the decision, can ensure
that D is properly adjusted based on human input. The work of [63] discussed in Section II-D implements a GAN to generate
labeled chest X-ray images and then asks a team of radiologists to remove mislabeled images. We believe that merely
removing “bad” images without productively generating new images with novel disease patterns may provide only limited
help.
Fig. 8: CNN layer visualization of the first dense block of DenseNet121. The top row shows random weights, the second row is
pre-trained by the unsupervised-GAN method, and the third row is pre-trained on ImageNet.

2) Encoding knowledge into GANs: We can convey to GANs the information to be modeled via knowledge
layers/structures and/or via a knowledge graph/dictionary built using natural language processing [10, 17]. We elaborate on this
scheme in the remainder of this section.

A. Knowledge Acquisition Sources and Mechanisms


Considering the structure of information may improve the effectiveness of GANs. For instance, differentiating two types of
strokes, ischemic and hemorrhagic, in order to provide proper treatment is critical for patient recovery. Ischemic stroke, which
accounts for 87 percent of all stroke cases, occurs as a result of an obstruction within a blood vessel supplying blood to the
brain. Hemorrhagic stroke occurs when a weakened blood vessel ruptures inside or on the surface of the brain. Two types of
weakened blood vessels usually cause hemorrhagic stroke: aneurysms and arteriovenous malformations (AVMs).
Without the above knowledge, GANs could generate data that flips the appearance of ischemic versus hemorrhagic strokes,
which would blur the critical ability to differentiate between the two. Additionally, without knowledge of brain anatomy, GANs
could generate obstructions and ruptures in clearly erroneous brain locations where no blood vessels are present. With the
knowledge that the symptoms largely occur within and on blood vessels, multi-layer GANs may be able to impose anatomical
constraints through layering information.
Let us use the rose example to explain two sources/mechanisms for knowledge acquisition. Note that for each specific
domain, e.g., nature and medicine, the knowledge sources can be vastly different, but we hope that the acquisition mechanisms
could be similar.
Wikipedia
The possible colors of roses can be obtained from the following Wikipedia text via natural language processing (NLP)
parsing: “Rose flowers have always been available in a number of colours and shades; they are also available in a number of
colour mixes in one flower. Breeders have been able to widen this range through all the options available with the range of
pigments in the species. This gives us yellow, orange, pink, red, white and many combinations of these colours. However, they
lack the blue pigment that would give a true purple or blue colour and until the 21st century all true blue flowers were created
using some form of dye. Now, however, genetic modification is introducing the blue pigment.”
Once possible colors and their combinations have been extracted using NLP, we can enhance the idea of text-adaptive GANs
[49] to generate roses of these colors.
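A toy sketch of this extraction step, using simple keyword matching over the passage above (a real system would need parsing or a large language model, for example to handle the negation in "they lack the blue pigment"):

import re

KNOWN_COLORS = {"yellow", "orange", "pink", "red", "white", "purple", "blue"}

def extract_colors(text):
    # Toy keyword matcher; it misses negation, which is exactly where
    # dependency parsing or LLM prompting would help.
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(set(tokens) & KNOWN_COLORS)

passage = ("This gives us yellow, orange, pink, red, white and many "
           "combinations of these colours. However, they lack the blue "
           "pigment that would give a true purple or blue colour.")
print(extract_colors(passage))
# ['blue', 'orange', 'pink', 'purple', 'red', 'white', 'yellow']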
Large Pre-trained Models
Recent launches of ChatGPT [54] (on November 30, 2022) and DALL-E [60] by OpenAI demonstrate a pipeline in which one
can generate images by prompting a large pre-trained language model. For instance, GPT-3 [9] has 175 billion parameters

Fig. 9: The schematic diagram of KG-GAN for unseen flower category generation. There are two generators G1 and G2, a
discriminator D, and an embedding regression network E as the constraint function f. We share all the weights between G1
and G2. By doing so, our method can be treated as training a single generator with a category-dependent loss, where seen
and unseen categories correspond to optimizing two losses (L_SNGAN and L_se) and a single loss (L_se), respectively, and L_se
is the semantic embedding loss.

and WuDao [75, 70] has 1.75 trillion, 10 times that of GPT-3. Though, to date, the iterative prompting and dialogue methods
for acquiring information are still primitive, users can already use DALL-E together with prompting ChatGPT to produce
impressive results. At the end of this section, we present some examples in Figure 13 and discuss our recent work in modeling
consciousness [13, 15], which aims to make knowledge acquisition more effective and personalizable.

B. Method Specifications
This section presents our proposed KG-GAN that incorporates domain knowledge into the GAN framework. We consider a
set of training data under-represented at the category level, i.e., all training samples belong to the set of seen categories, denoted
as Y1 (e.g., the red category of roses), while another set of unseen categories, denoted as Y2 (e.g., any other color category), has
no training samples. Our goal is to learn categorical image generation for both Y1 and Y2 . To generate new data in Y1 , KG-GAN
applies an existing GAN-based method to train a category-conditioned generator G1 by minimizing the GAN loss L_GAN with respect to G1.
To generate unseen categories Y2 , KG-GAN trains another generator G2 from the domain knowledge, which is expressed by a
constraint function f that explicitly measures whether an image has the desired characteristics of a particular category.
KG-GAN consists of two parts: (1) constructing the domain knowledge for the task at hand, and (2) training two generators
G1 and G2 that condition on available and unavailable categories, respectively. KG-GAN shares the parameters between
G1 and G2 to couple them together and to transfer knowledge learned from G1 to G2 . Based on the constraint function
f, KG-GAN adds a knowledge loss, denoted as L_K, to train G2. The general objective function of KG-GAN is written as
$$\min_{G_1, G_2} \; \mathcal{L}_{GAN}(G_1) + \lambda \, \mathcal{L}_K(G_2).$$
Given a flower dataset in which some categories are unseen, our aim is to use KG-GAN to generate unseen categories in addition to the seen categories. Figure 9 shows an overview of KG-GAN for unseen flower category generation. Our generators take a random noise $z$ and a category variable $y$ as inputs and generate an output image $x'$. In particular, $G_1: (z, y_1) \mapsto x'_1$ and $G_2: (z, y_2) \mapsto x'_2$, where $y_1$ and $y_2$ belong to the sets of seen and unseen categories, respectively.
We leverage the domain knowledge that each category is characterized by a semantic embedding representation, which
describes the semantic relationships among categories. In other words, we assume that each category is associated with a
semantic embedding vector v. For example, we can acquire such a feature representation from the textual descriptions of each
category. (Figure 10 shows example textual descriptions for four flowers.) We use semantic embedding in two places: one is for
modifying the GAN architecture, and the other is for defining the constraint function. (Using the Oxford flowers dataset, we
show how semantic embedding is done in Section III-C.)
KG-GAN is developed upon SN-GAN [47, 48]. SN-GAN uses a projection-based discriminator D and adopts spectral
normalization for discriminator regularization. The objective functions for training G1 and D use a hinge version of adversarial
loss. The category variable y1 in SN-GAN is a one-hot vector indicating the target category. KG-GAN replaces the one-hot vector with the semantic embedding vector v1. By doing so, we directly encode the similarity relationships between categories
into the GAN training.
The loss functions of the modified SN-GAN are defined as
$$\mathcal{L}^{SNGAN}_{G}(G_1) = -\,\mathbb{E}_{z, v_1}\big[D(G_1(z, v_1), v_1)\big], \quad \text{and}$$
$$\mathcal{L}^{SNGAN}_{D}(D) = \mathbb{E}_{x, v_1}\big[\max(0,\, 1 - D(x, v_1))\big] + \mathbb{E}_{z, v_1}\big[\max(0,\, 1 + D(G_1(z, v_1), v_1))\big]. \tag{1}$$
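The hinge objectives in Eq. (1) map directly onto a few lines of framework code. The following is a minimal PyTorch sketch, assuming a projection discriminator D(x, v) and a conditional generator G(z, v) with the interfaces described above; names and shapes are illustrative.

```python
import torch.nn.functional as F

def d_hinge_loss(D, G, real_x, v1, z):
    """Discriminator hinge loss of the modified SN-GAN, Eq. (1)."""
    fake_x = G(z, v1).detach()                      # no gradient into G here
    loss_real = F.relu(1.0 - D(real_x, v1)).mean()  # max(0, 1 - D(x, v1))
    loss_fake = F.relu(1.0 + D(fake_x, v1)).mean()  # max(0, 1 + D(G(z, v1), v1))
    return loss_real + loss_fake

def g_hinge_loss(D, G, v1, z):
    """Generator hinge loss of the modified SN-GAN, Eq. (1)."""
    return -D(G(z, v1), v1).mean()
```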

Bearded Iris:
• This flower has thick, very pointed petals in bright hues of yellow and indigo.
• A bird shaped flower with purple and orange pointy flowers stemming from it's ovule.

Orange Dahlia:
• This flower has a large white petal and has a small yellow colored circle in the middle.
• The white flower has petals that are soft, smooth and fused together and has bunch of white stamens in the center.

Snapdragon:
• The petals on this flower are red with a red stamen.
• The flower has a few broad red petals that connect at the base, and a long pistil with tiny yellow stamen on the end.

Stemless Gentian:
• This flower has five large wide pink petals with vertical grooves and round tips.
• This flower has five pink petals which are vertically striated and slightly heart-shaped.
Fig. 10: Oxford flowers dataset. Example images and their textual descriptions.

Fig. 11: Unseen flower category generation. Qualitative comparison between real images and the generated images from
KG-GAN. Left: Real images. Middle: Successful examples of KG-GAN. Right: Unsuccessful examples of KG-GAN. The top
two and the bottom two rows are Orange Dahlia and Stemless Gentian, respectively.

Semantic Embedding Loss. We define the constraint function f as predicting the semantic embedding vector of the
underlying category of an image. To achieve that, we implement f by training an embedding regression network E from the
training data. Once trained, we fix its parameters and add it to the training of G1 and G2. In particular, we propose a semantic embedding loss, L_se, to play the role of the knowledge loss in KG-GAN. This loss requires the predicted embedding of fake images to be close to the semantic embedding of the target categories. L_se is written as
$$\mathcal{L}_{se}(G_i) = \mathbb{E}_{z, v_i}\left[\,\|E(G_i(z, v_i)) - v_i\|^2\,\right], \quad \text{where } i \in \{1, 2\}. \tag{2}$$
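The constraint function f thus reduces to a regression network plus a squared-distance term. Below is a minimal PyTorch sketch of Eq. (2); the regressor architecture is an assumption (the article only specifies that E maps an image to a 300-dimensional embedding, is trained beforehand, and is then frozen).

```python
import torch
import torch.nn as nn

class EmbeddingRegressor(nn.Module):
    """Maps a 64x64 RGB image to a 300-d semantic embedding (assumed architecture)."""
    def __init__(self, embed_dim: int = 300):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(256, embed_dim)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def semantic_embedding_loss(E, G, z, v):
    """Eq. (2): squared distance between E(G(z, v)) and the target embedding v.

    E is pre-trained and frozen (E.requires_grad_(False)); gradients still
    flow through E into the generator G.
    """
    fake = G(z, v)
    return ((E(fake) - v) ** 2).sum(dim=1).mean()
```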

Total Loss. The total loss is a weighted combination of LSNGAN and Lse . The loss functions for training D and for training
G1 and G2 are respectively defined as
$$\mathcal{L}_D = \mathcal{L}^{SNGAN}_{D}(D), \quad \text{and} \qquad \mathcal{L}_G = \mathcal{L}^{SNGAN}_{G}(G_1) + \lambda_{se}\big(\mathcal{L}_{se}(G_1) + \mathcal{L}_{se}(G_2)\big). \tag{3}$$
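Because G1 and G2 share all weights, a single conditional generator suffices in code; it is simply conditioned on seen or unseen embeddings. The sketch below assembles the generator side of Eq. (3) under that assumption, with λse exposed as a parameter; the interfaces of D, G, and E follow the sketches above and are assumptions, not a verbatim reproduction of the authors' code.

```python
def generator_total_loss(D, G, E, z, v_seen, v_unseen, lambda_se: float = 0.1):
    """Eq. (3): L_G = L^SNGAN_G(G1) + lambda_se * (L_se(G1) + L_se(G2))."""
    fake_seen = G(z, v_seen)      # G1 role: conditioned on a seen-category embedding
    fake_unseen = G(z, v_unseen)  # G2 role: conditioned on an unseen-category embedding

    adv = -D(fake_seen, v_seen).mean()                            # hinge generator term
    se_seen = ((E(fake_seen) - v_seen) ** 2).sum(1).mean()        # L_se(G1)
    se_unseen = ((E(fake_unseen) - v_unseen) ** 2).sum(1).mean()  # L_se(G2)
    return adv + lambda_se * (se_seen + se_unseen)
```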

C. Empirical Study and Discussion


We use the Oxford flowers dataset [50], which contains 8,189 flower images from 102 categories (e.g., bluebell, daffodil,
iris, and tulip). Each image is annotated with 10 textual descriptions. Figure 10 shows two representative descriptions for
four flowers. Following [62], we randomly split the images into 82 seen and 20 unseen categories. To extract the semantic
embedding vector of each category, we first extract sentence features from each textual description using the fastText library [8],
which takes a sentence as input and outputs a 300-dimensional real-valued vector in range [0, 1]. Then we average over the
features within each category to obtain the per-category feature vector as the semantic embedding. We resize the images to 64 × 64 in our experiments. For the SN-GAN part of the model, we use its default hyper-parameters and training configurations; in particular, we train for 200k iterations. For the knowledge part, we set λse = 0.1.
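As a sketch of the embedding step, the per-category vector can be computed by averaging fastText sentence vectors over all descriptions of a category. The code below assumes the fasttext Python package and a pre-trained 300-dimensional model file; the file name and variable names are illustrative.

```python
import numpy as np
import fasttext  # the fastText library [8]

def category_embedding(model, descriptions):
    """Average fastText sentence vectors over all textual descriptions of one
    category (10 descriptions per image, ~80 images, i.e., ~800 sentences)."""
    vectors = [model.get_sentence_vector(text) for text in descriptions]
    return np.mean(np.stack(vectors), axis=0)  # 300-d per-category embedding

# Hypothetical usage:
#   model = fasttext.load_model("cc.en.300.bin")   # assumed pre-trained model file
#   v_category = category_embedding(model, descriptions_for_one_category)
```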
Comparing Methods. We compare with SN-GAN trained on the full Oxford flowers dataset, which potentially represents a performance upper bound of our method. In addition, we evaluate two ablations of KG-GAN: (1) One-hot KG-GAN: y is a one-hot vector that represents the target category. (2) KG-GAN w/o Lse: our method without Lse.
Results. To evaluate the quality of the generated images, we compute the FID scores [30] in a per-category manner as
in [47]. Then, we average the FID scores over the sets of seen and unseen categories, respectively. Table IV shows
the seen and the unseen FID scores. We can see from the table that in terms of the category condition, semantic embedding
gives better FID scores than one-hot representation. Our full method achieves the best FID scores. In Figure 11, we show
example results of two representative unseen categories.
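For reference, the FID between two sets of Inception features compares their means and covariances; a compact sketch is shown below. Feature extraction with an Inception network is assumed to happen elsewhere, and in practice an established FID implementation would be used, especially since per-category sets here contain only a few dozen images.

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet Inception Distance [30] between two feature sets of shape [N, d]."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(cov_r + cov_f - 2.0 * covmean))

# Per-category protocol: compute fid() for each category, then average the scores
# over the seen set and over the unseen set to obtain the two columns in Table IV.
```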

TABLE IV: Per-category FID scores of SN-GAN and KG-GANs.

Method            Training data   Condition    Lse    Seen FID   Unseen FID
SN-GAN            Y1 ∪ Y2         One-hot             0.6922     0.6201
One-hot KG-GAN    Y1              One-hot      ✓      0.7077     0.6286
KG-GAN w/o Lse    Y1              Embedding           0.1412     0.1408
KG-GAN            Y1              Embedding    ✓      0.1385     0.1386

D. Observations on KG-GANs
From Table IV we make two observations. First, KG-GAN (conditioned on semantic embedding) performs better than
One-hot KG-GAN. This is because One-hot KG-GAN learns domain knowledge only from the knowledge constraint while
KG-GAN additionally learns the similarity relationships between categories through the semantic embedding as the condition
variable. Second, when KG-GAN conditions on semantic embedding, KG-GAN without Lse still works. This is because
KG-GAN learns how to interpolate among seen categories to generate unseen categories. For example, if an unseen category is
close to a seen category in the semantic embedding space, then their images will be similar.
As we can see from Figure 11, our model faithfully generates flowers with the right color, but does not perform as well in
shapes and structures. The reasons are twofold. First, colors can be more consistently articulated for a flower. Even if some descriptors annotate a flower as red while others annotate it as pink, we can obtain a relatively consistent color depiction over, say, ten descriptions. Shapes and structures do not enjoy as confined a vocabulary set as colors do. In addition, the flowers in the same category may have various shapes and structures due to aging and camera angles. Since each image has 10 textual descriptions and each category has an average of 80 images, the semantic embedding vector of each category is obtained by averaging about 800 fastText feature vectors. This averaging operation preserves the color information quite
well while blurring the other aspects.

E. Knowledge Acquisition
A better semantic embedding representation, one that encodes richer textual information about a flower category, can be obtained by prompting a large pre-trained model. The Oxford dataset, on which we conducted our experiments, is a tiny knowledge base compared to GPT-3. Using GPT-3 as our knowledge base, we first prompted it for knowledge about roses, and then used the acquired knowledge to prompt DALL-E to generate images. Figure 12 shows the prompt used to query ChatGPT about the colors and textures of roses. Once the color and texture information was obtained, we issued two separate prompts to DALL-E to produce “roses with red, orange, and white colors” and “roses of orange, white, and pink colors with velvety petals in ruffled appearance”. The first row of Figure 13 shows three images of roses with the specified colors. The second row of the figure shows three rose images with the specified textures. The ChatGPT and DALL-E pipeline can reliably generate a variety of realistic rose images based on the knowledge acquired from the pre-trained model. Compared with the flowers generated from the much smaller knowledge base learned from the Oxford dataset (Figure 11), acquiring specifications from a much larger pre-trained model via ChatGPT clearly yields much higher quality rose images.
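A hedged sketch of such a two-step pipeline is shown below, using the OpenAI Python package as it existed around early 2023; model names, endpoint names, and response fields are assumptions that may differ in later API versions, and this is not the exact pipeline used to produce Figures 12 and 13.

```python
import openai  # assumes the OpenAI Python package, circa early 2023

openai.api_key = "YOUR_API_KEY"  # placeholder

# Step 1: acquire knowledge (colors and petal textures) from the chat model.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "List common colors and petal textures of roses."}],
)
knowledge = chat["choices"][0]["message"]["content"]

# Step 2: use the acquired knowledge to prompt the image model.
images = openai.Image.create(
    prompt=f"Photorealistic roses with these attributes: {knowledge}",
    n=3,
    size="256x256",
)
print([item["url"] for item in images["data"]])
```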

Fig. 12: An Example “Rose” Query to ChatGPT.

Fig. 13: Rose Images Generated by Prompting ChatGPT and then DALL-E. The photos in the first row were generated by the prompt “generate roses with red, orange, and white colors”. The photos in the second row were generated by the prompt “generate some roses of orange, white, and pink colors with velvety petals in ruffled appearance”.

IV. C ONCLUDING R EMARKS


Deep learning has achieved great success thanks to three key factors: the large volume of training data, the ability of advanced
models to learn representations from this data, and the scale of computation. In this article, we emphasized the importance of a
data-centric approach to learning effective data representations, especially in cases where the available training data is limited.
It is widely accepted in the research community that the more data available, both in terms of volume and variety, the better
the performance of representation learning and classification. However, collecting high-quality annotated data in the healthcare
domain can be difficult. To address this issue, we discussed four methods for generating and aggregating training data: data
augmentation, transfer learning, federated learning, and GANs (generative adversarial networks). While these approaches can be
effective, we may still face the challenge of lacking diversity or coverage to represent the full population distribution. To address
this problem, we presented knowledge-guided GANs (KG-GANs) as a potential solution. While the idea of KG-GANs was
well received by reviewers at top AI conferences, one critique was that knowledge integration must be tailored to the specific
task at hand. We believe that recent large pre-trained language models have the potential to serve as a general, task-agnostic
knowledge base to support knowledge acquisition. In this article, we demonstrated that the ChatGPT-DALL-E pipeline can
first acquire precise descriptions of a concept and then use this new “insight” to generate realistic images beyond the original,
restrictive distribution represented by a small training dataset.
It is currently believed within the artificial intelligence (AI) community that a pre-trained model, trained on all available
documents in the world, can serve as a highly accurate and reliable source of knowledge for various tasks. This pre-trained
model can then be fine-tuned with a small amount of additional data for a specific task, or prompted step-by-step (e.g., [24, 14])
to achieve state-of-the-art performance on that task. These techniques allow for the efficient and effective use of large, pre-trained
language models in a variety of applications.
There are several promising directions for future research in the field of artificial intelligence, based on the observations and
discussions presented throughout this article. These include:
• Improving the interpretability of deep learning models, so that their decision-making processes are more transparent and
easier to understand. This is an important consideration for fields such as healthcare, where the consequences of incorrect
predictions can be severe.
• Combining domain knowledge with existing training data to generate more diverse and representative training data, in
order to better cover a wide range of semantic concepts. This could be particularly useful in fields where annotated data is
scarce or difficult to collect.
• Utilizing large pre-trained models as a knowledge base for robust knowledge acquisition is an important direction for
future research. These models, which have been trained on vast amounts of data, can serve as a valuable resource for
acquiring knowledge that is generalizable across a wide range of tasks. By leveraging and improving these models, we can
increase the reliability and efficiency of knowledge acquisition, which can then be used to guide the generation of more
diverse and representative training data for deep learning models. This, in turn, can improve the performance of these
models in various applications.
• Developing effective techniques for prompting and guiding the acquisition of knowledge, such as using a chain-of-thought
or dialogue approach. These methods can help to ensure that the knowledge acquired is precise and relevant to the task at
hand.
In my opinion, each of these research directions has the potential to significantly advance the field of artificial intelligence
and improve the usefulness and reliability of deep learning models in healthcare and a variety of other applications. By focusing
on improving the interpretability of deep learning models, incorporating domain knowledge into training data, leveraging large
pre-trained models as a knowledge base, and developing effective knowledge prompting techniques, we can make significant
progress in enhancing the performance and trustworthiness of these models in healthcare and other fields.

ACKNOWLEDGMENT
This article presents the work performed by the DeepQ team between 2014 and 2017 for the Tricorder XPRIZE award
[16, 56], as well as our research on consciousness modeling at Stanford University since 2020. The relevant papers published
by our team [10, 12, 17, 18, 66] have been cited throughout the article. We would like to acknowledge the following colleagues
for their contributions, listed in alphabetical order: Che-Han Chang, Fu-Chieh Chang, Chun-Nan Chou, Chuen-Kai Shie, and
Kai-Fu Tang.

R EFERENCES
[1] General Data Protection Regulation (GDPR). Retrieved February 4, 2020, from https://gdpr-info.eu, 2016.
[2] CS231n convolutional neural network for visual recognition: transfer learning. http://cs231n.github.io/transfer-learning/,
2017.
[3] Health Information Privacy Act, HIPAA. Retrieved February 4, 2020, from https://www.hhs.gov/hipaa/for-professionals/
index.html, 2017.
[4] AB-713 California Consumer Privacy Act (CCPA). Retrieved February 5, 2020, from https://leginfo.legislature.ca.gov,
January 2020. California Legislature 2019-2020 Regular Session.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[6] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial
nets (gans). In International Conference on Machine Learning, pages 224–232, 2017.
[7] Catarina Barata, M Emre Celebi, and Jorge S Marques. Melanoma detection algorithm based on feature fusion. In
Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, pages
2653–2656. IEEE, 2015.
[8] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information.
Transactions of the Association for Computational Linguistics, 2017.
[9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL
https://arxiv.org/abs/2005.14165.
[10] Che-Han Chang, Chun-Hsien Yu, Szu-Ying Chen, and Edward Y. Chang. KG-GAN: Knowledge-guided generative
adversarial networks, 2019.
[11] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent
Systems and Technology (TIST), 2(3):27, 2011.
[12] Edward Y Chang. Perceptual feature extraction (chapter 2). In Foundations of large-scale multimedia information
management and retrieval: Mathematics of perception, chapter 2, pages 13–35. Springer, 2011.
[13] Edward Y. Chang. Towards artificial general intelligence via consciousness modeling (invited talk). In IEEE Infrastructure
Conference, September 2022. URL https://drive.google.com/file/d/1NPuKPB4gSeJeT1fmfY5eus_Rw3abwd5m/view?usp=
sharing.
[14] Edward Y. Chang. Prompting large language models with the socratic method. IEEE 13th Annual Computing and
Communication Workshop and Conference (CCWC), March 2023. URL https://arxiv.org/abs/2303.08769.
[15] Edward Y. Chang. Cocomo: Computational consciousness modeling for generative and ethical ai. arXiv preprint
arXiv:2304.02438, 2023.
[16] Edward Y. Chang, Meng-Hsi Wu, Kai-Fu Tang, Hao-Cheng Kao, and Chun-Nan Chou. Artificial intelligence
in xprize deepq tricorder. In Proceedings of the 2nd International Workshop on Multimedia for Personal Health and
Health Care, MMHealth ’17, page 11–18, New York, NY, USA, 2017. Association for Computing Machinery. ISBN
9781450355049. doi: 10.1145/3132635.3132637. URL https://doi.org/10.1145/3132635.3132637.
[17] Fu-Chieh Chang, Jocelyn J. Chang, Chun-Nan Chou, and Edward Y. Chang. Toward fusing domain knowledge with
generative adversarial networks to improve supervised learning for medical diagnoses. In 2019 IEEE Conference on
Multimedia Information Processing and Retrieval (MIPR), pages 77–84, 2019. doi: 10.1109/MIPR.2019.00022.
[18] Chun-Nan Chou, Chuen-Kai Shie, Fu-Chieh Chang, Jocelyn Chang, and Edward Y. Chang. Representation learning on
large and small data, chapter 1 of Big Data Analytics for Large-Scale Multimedia Search. pages 3–30, 07 2017.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[20] Will Douglas. Google’s medical AI was super accurate in a lab. Real life was a different story. MIT Technology Review,
April 2020.
[21] Edward H. Shortliffe. Computer-based medical consultations: MYCIN. Elsevier, New York, 1976.
[22] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. Gan-based synthetic
medical image augmentation for increased cnn performance in liver lesion classification. arXiv preprint arXiv:1803.01229,
2018.
[23] Wei-Kang Fu, Yi-Shan Lin, Giovanni Campagna, Chun-Ting Liu, De-Yi Tsai, Chung-Huan Mei, Edward Y. Chang,
Shih-Wei Liao, and Monica S. Lam. Soteria: A provably compliant user right manager using a novel two-layer blockchain
technology. In 2020 IEEE Infrastructure Conference, pages 1–10, 2020. doi: 10.1109/IEEECONF47748.2020.9377624.
[24] Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The Gradient, 2021.
[25] Ian Goodfellow. Adversarial machine learning (keynote). In AAAI Conference on Artificial Intelligence, 2019.
[26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680,
2014.
[27] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using
support vector machines. Machine learning, 46(1-3):389–422, 2002.
[28] D. J. Hand. Artificial Intelligence and Psychiatry. The Scientific Basis of Psychiatry. Cambridge
University Press, 1985. ISBN 9780521258715. URL https://books.google.com/books?id=8PQ8AAAAIAAJ.
[29] Hunter Heidenreich. What is a generative adversarial network? http://hunterheidenreich.com/blog/what-is-a-gan/.
[30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two
time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[31] Geoffrey E Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10):428–434, 2007.
[32] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation,
18(7):1527–1554, 2006.
[33] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708. IEEE, 2017.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial
networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5967–5976. IEEE, 2017.
[35] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and
Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International
Conference on Multimedia, pages 675–678. ACM, 2014.


[36] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability,
and variation. In International Conference on Learning Representations, 2018.
[37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.
In Advances in neural information processing systems, pages 1097–1105, 2012.
[38] Curtis Langlotz. Healthcare’s AI Future: A Conversation with Fei-Fei Li and Andrew Ng. Stanford HAI Workshop, April
2021. URL https://youtu.be/Gbnep6RJinQ.
[39] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied
to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989. doi: 10.1162/neco.1989.1.4.541.
[40] Baitao Li, E. Chang, and Ching-Tung Wu. Dpf - a perceptual distance function for image retrieval. In Proceedings.
International Conference on Image Processing, volume 2, pages II–II, 2002. doi: 10.1109/ICIP.2002.1040021.
[41] Beitao Li, Edward Chang, and Yi Wu. Discovery of a perceptual distance function for measuring image similarity.
Multimedia systems, 8(6):512–522, 2003.
[42] Y. Lin, P. Wu, C. Chang, E. Chang, and S. Liao. Relgan: Multi-domain image-to-image translation via relative attributes.
In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 5913–5921, Los Alamitos, CA, USA,
nov 2019. IEEE Computer Society. doi: 10.1109/ICCV.2019.00601. URL https://doi.ieeecomputersociety.org/10.1109/
ICCV.2019.00601.
[43] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vision, 60(2):91–110, 2004.
ISSN 0920-5691. doi: http://dx.doi.org/10.1023/B:VISI.0000029664.99615.94.
[44] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and
Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.
[45] Teresa Mendonça, Pedro M Ferreira, Jorge S Marques, André RS Marcal, and Jorge Rozeira. Ph 2-a dermoscopic image
database for research and benchmarking. In Engineering in Medicine and Biology Society (EMBC), 2013 35th Annual
International Conference of the IEEE, pages 5437–5440. IEEE, 2013.
[46] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[47] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In ICLR, 2018.
[48] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial
networks. In ICLR, 2018.
[49] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: manipulating images
with natural language. In Advances in Neural Information Processing Systems, pages 42–51, 2018.
[50] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICCVGI,
2008.
[51] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In
International Conference on Machine Learning, pages 2642–2651, 2017.
[52] American Academy of Pediatrics Subcommittee on Management of Acute Otitis Media et al. Diagnosis and management
of acute otitis media. Pediatrics, 113(5):1451, 2004.
[53] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.
[54] Openai. Chatgpt: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/, 2022.
[55] Vishwa S Parekh, Shuhao Lai, Vladimir Braverman, Jeff Leal, Steven Rowe, Jay J Pillai, and Michael A Jacobs.
Cross-domain federated learning in medical imaging, December 2021.
[56] Yu-Shao Peng, Kai-Fu Tang, Hsuan-Tien Lin, and Edward Chang. REFUEL: Exploring sparse features in deep reinforcement
learning for fast disease diagnosis. In Advances in Neural Information Processing Systems, pages 7333–7342, 2018.
[57] Prayitno, C.-R. Shyu, K.T. Putra, H.-C. Chen, Y.-Y. Tsai, K.S.M.T. Hossain, W. Jiang, and Z.-Y Shae. A systematic
review of federated learning in the healthcare area: From the perspective of data properties and applications. Applied
Sciences, (11), November 2021. doi: 10.3390/app112311191.
[58] Qualcomm. Xprize Tricorder Winning Teams. http://tricorder.xprize.org/teams, 2017.
[59] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis
Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning.
arXiv preprint arXiv:1711.05225, 2017.
[60] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation
with clip latents, 2022. URL https://arxiv.org/abs/2204.06125.
[61] Erik Ranschaert. Artificial intelligence in radiology: hype or hope? Journal of the Belgian Society of Radiology, 102(S1):
20, 2018.
[62] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual
descriptions. In CVPR, 2016.
[63] Hojjat Salehinejad, Shahrokh Valaee, Tim Dowdell, Errol Colak, and Joseph Barfett. Generalization of deep neural networks
for chest pathology classification in x-rays using generative adversarial networks. In IEEE International Conference on
Acoustics, Speech and Signal Processing, pages 990–994. IEEE, 2018.
[64] William B. Schwartz, Ramesh S. Patil, and Peter Szolovits. Artificial intelligence in medicine, where do we stand? New
England Journal of Medicine, 316(11):685–88, March 1987.
[65] Chuen-Kai Shie, Hao-Ting Chang, Fu-Cheng Fan, Chung-Jung Chen, Te-Yung Fang, and Pa-Chun Wang. A hybrid
feature-based segmentation and classification system for the computer aided self-diagnosis of otitis media. In Engineering
in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pages 4655–4658.
IEEE, 2014.
[66] Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou, Meng-Hsi Wu, and Edward Y. Chang. Transfer representation
learning for medical image analysis. IEEE EMBC, pages 711–714, 2015.
[67] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from
simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2107–2116. IEEE, 2017.
[68] Leon Sixt, Benjamin Wild, and Tim Landgraf. Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI,
5:66, 2018.
[69] W Stolz, A Riemann, AB Cognetta, L Pillet, W Abmayr, D Holzel, P Bilek, F Nachbar, and M Landthaler. Abcd rule of
dermatoscopy: a new practical method for early recognition of malignant melanoma. In European Journal of Dermatology,
pages 521–527, 1994.
[70] Jie Tang. Wudao — pre-train the world. https://keg.cs.tsinghua.edu.cn/jietang/publications/wudao-3.0-en.pdf, May 2022.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[72] Stefanos Vrochidis, Benoit Huet, Edward Y. Chang, and Ioannis Kompatsiaris. Big data analytics for large-scale multimedia
search. Wiley, June 2019. ISBN 978-1119376972.
[73] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax
diseases. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3462–3471. IEEE, 2017.
[74] Wikipedia. Otitis media. https://en.wikipedia.org/wiki/Otitis_media, 2017.
[75] Wikipedia. Wu dao, a multimodal artificial intelligence pre-trained model. https://en.wikipedia.org/wiki/Wu_Dao, 2021.
[76] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey
on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021. doi:
10.1109/TNNLS.2020.2978386.
[77] Vincent C. Yen and Robert Boissoneau. Artificial intelligence and expert systems: Implications for health care delivery.
New England Journal of Medicine, 66(5):16–19, 1988. doi: 10.1080/00185868.1988.10543623.
[78] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. arXiv preprint
arXiv:1809.07294, 2018.
[79] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep
visualization. arXiv preprint arXiv:1506.06579, 2015.
[80] Zhongyang Zheng, Bo Wang, Yakun Wang, Shuang Yang, Zhongqian Dong, Tianyang Yi, Cyrus Choi, Emily J Chang, and
Edward Y Chang. Aristo: An augmented reality platform for immersion and interactivity. In ACM Multimedia Conference,
pages 690–698. ACM, 2017.
[81] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In IEEE International Conference on Computer Vision, pages 2242–2251. IEEE, 2017.
