Knowledge Guided Data Centric AI in Healthcare
Abstract
The success of deep learning is largely due to the availability of large amounts of training data that cover a wide range of
examples of a particular concept or meaning. In the field of medicine, having a diverse set of training data on a particular disease
can lead to the development of a model that is able to accurately predict the disease. However, despite the potential benefits, there
have not been significant advances in image-based diagnosis due to a lack of high-quality annotated data. This article highlights
the importance of using a data-centric approach to improve the quality of data representations, particularly in cases where the
available data is limited. To address this "small-data" issue, we discuss four methods for generating and aggregating training data:
data augmentation, transfer learning, federated learning, and GANs (generative adversarial networks). We also propose the use of
knowledge-guided GANs to incorporate domain knowledge in the training data generation process. With the recent progress in
large pre-trained language models, we believe it is possible to acquire high-quality knowledge that can be used to improve the
effectiveness of knowledge-guided generative methods.
I. I NTRODUCTION
Applying artificial intelligence (AI) techniques to improve various aspects of healthcare, such as disease diagnosis and
treatment recommendations, has made significant progress over the past half-century, but there are still several challenges to be
overcome. This article presents the current state of progress in this field, identifies key challenges, and discusses promising
directions for future research.
The term “AI” regained popularity after AlphaGo’s success in 2015, but in the scientific domain, AI is simply a set of
machine learning or statistical learning algorithms. According to a survey paper by W. B. Schwartz [64], pattern recognition
techniques (such as Boolean algebra and naive Bayesian models) were first used in healthcare in the 1960s to learn the weights
of a set of decision factors for classification and disease prediction. A notable example was Stanford’s Dendral system [77],
which helped chemists identify unknown organic molecules. In the 1970s and 1980s, the machine learning community focused
on using AI to solve clinical problems. Pathophysiological knowledge and reasoning were incorporated into the development
of rule-based expert systems. The main approach was to match a patient’s symptoms with stored profiles of findings for
each disease. MYCIN [20] is a representative system for identifying bacteria causing severe infections and recommending
antibiotics. Another system, INTERNIST-I [21], used a multiple hypotheses strategy to consider multiple disease candidates
without prematurely converging on a single prediction. The rule-based approach to AI in healthcare was not successful due to a
lack of comprehensive data and insufficient accuracy. These limitations prevented the deployment of any rule-based systems in
the healthcare industry during this time period. Additionally, there was resistance from healthcare professionals to the adoption
of these systems due to both procedural and legal concerns.
From the 1990s to the 2010s, the focus of AI in healthcare shifted to developing tools and algorithms to improve performance.
These efforts were driven by two key insights: first, that AI systems in healthcare must be able to handle imperfect or missing
data, and second, that small or insufficient training data is often the norm in healthcare and must be directly addressed. To
address these challenges, researchers developed and improved upon techniques such as fuzzy set theory, Bayesian networks,
clustering, and support vector machines (SVMs). Fuzzy set theory, for example, can handle data uncertainty and missing data,
while Bayesian networks can consider correlations between parameters to reduce dependencies. SVMs, which use the kernel
trick, can significantly reduce the amount of required training data by classifying data instances in the similarity space. Both
SVMs and data clustering techniques, such as K-means and manifold learning, can reduce the dimensionality of the computation and
improve prediction accuracy with less training data.
Since 2010, the data-centric approach, which learns data representations from large volumes of data, has become
more widely accepted in the field of artificial intelligence [12, 37]. It is exemplified by algorithms such as deep learning [31, 32] and transformers [71]. This approach
differs from the model-centric approach, which relies on human-designed computer models or algorithms to extract features from
images or documents. Unlike the model-centric approach, the data-centric approach learns features/representations
from data, and more data can improve the quality of these representations. Additionally, the internal parameters of data-centric
algorithms like multi-layer perceptrons and convolutional neural networks (CNNs) [39] can be changed or learned based on
the structure found in large datasets. The features of an image in a data-centric pipeline like a CNN can be affected by other
images, while the features of an image in a model-centric pipeline like SIFT [43] are invariant. The greater the quantity and
variety of data available, the better the representations that can be learned by a data-centric pipeline. When a learning algorithm
has been trained on a sufficient number of instances of an object under various conditions, such as different postures and partial
occlusion, the features learned from this data will be more comprehensive.
Despite advancements in algorithms and data quality, AI is still not widely deployed in hospitals. The reasons can be
understood from three documents: the “indictment” made by A. R. Feinstein [28] in 1977, Google’s
negative deployment experience published in April 2020 [20], and the comments made by Andrew Ng at Stanford Healthcare’s
AI Future workshop in April 2021 [38]. Their comments can be summarized into three factors: 1) a lack of deep interdisciplinary
collaboration between computer scientists and clinicians, 2) a lack of robustness and interpretability in AI models, and 3)
insufficient training data for the data-centric approach to learn good representations. This article focuses on addressing the
training data issue. The current data augmentation techniques and transfer learning methods are not able to systematically
enhance diversity to cover all variants of a medical condition. This leads to AI models being unable to effectively handle
out-of-distribution instances. One well known problem is that AI models trained in one hospital often do not perform well
when used in a different hospital with different hardware and software configurations and clinical practices.
In the remainder of this article we first summarize our prior studies in training-data generation [10, 18, 66, 72] and aggregation
in Section II. In Section III, we propose a knowledge-guided generative model that can generate valid patterns not seen in the
training data. Finally, we enumerate plausible future research directions in Section IV.
A. Data Augmentation
It was observed by B. Li and E. Chang in 2002 in an image similarity study [40] that an image that has gone through
transformations such as scaling, cropping, down-sampling, color enhancement, lighting changes, etc. may appear to be perceptually
similar to the original image. However, a traditional similarity function such as the Minkowski family function (e.g., L1 and L2
functions) would quantify these transformed images to be dissimilar to the original. There are three approaches to address this
problem. The first approach is to devise a distance function to correctly quantify the distance between perceptually similar
images/objects. Our proposed Dynamic Partial Function (DPF) [40] is one effective method based on psychological principles.
The second approach is to employ a model that is insensitive to some image transformations. For instance, the graph neural
network (GNN) model [76] connects key features of an object into a graph, which is invariant to transformations such as
rotation and translation. However, GNN cannot model all transformations. One could consider employing GNN (instead of
using CNN) as the training algorithm to work with augmented data. Data augmentation is the third method, which generates all
possible variants of a semantic concept (such as a disease or an object) and adds them to the training data.
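To make the first approach concrete, the following is a minimal Python sketch of the idea behind DPF as we understand it from [40]: only the m smallest per-feature differences contribute to the distance, so large differences introduced by benign transformations do not dominate the way they do in an L_r norm. The exact formulation and parameter choices in [40] may differ; this is an illustration, not the authors' implementation.

    import numpy as np

    def dpf_distance(x, y, m, r=2):
        # Dynamic-Partial-Function-style distance (sketch): keep only the m
        # smallest per-feature differences, then combine them like an L_r norm.
        deltas = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        smallest = np.sort(deltas)[:m]          # the m most similar feature pairs
        return float(np.power(np.sum(smallest ** r), 1.0 / r))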
AlexNet is a pioneering work that uses data augmentation extensively. According to [37], AlexNet employs two distinct forms
of data augmentation, both of which can produce the transformed images from the original images with very little computation
[37, 41]. The first scheme of data augmentation includes a random cropping function and horizontal reflection function. Data
augmentation can be applied to both the training and testing stages. For the training stage, AlexNet randomly extracted smaller
image patches (224 × 224) and their horizontal reflections from the original images (256 × 256). The AlexNet model was trained
on these extracted patches instead of the original images in the ImageNet dataset. The second scheme of data augmentation
alters the intensities of the RGB channels in training images by using principal component analysis (PCA). This scheme is
used to capture an important property of natural images: the invariance of object identity to changes in the intensity and color
of the illumination.
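A minimal NumPy sketch of these two schemes follows; the crop size, flip probability, and jitter scale mirror the description above, while the image arrays (H x W x 3) and the pre-computed RGB eigenvectors/eigenvalues are assumed inputs rather than part of the original recipe.

    import numpy as np

    def random_crop_and_flip(img, crop=224):
        # Scheme 1 (sketch): random 224x224 patch plus random horizontal reflection.
        h, w, _ = img.shape
        top = np.random.randint(0, h - crop + 1)
        left = np.random.randint(0, w - crop + 1)
        patch = img[top:top + crop, left:left + crop]
        if np.random.rand() < 0.5:
            patch = patch[:, ::-1]              # horizontal reflection
        return patch

    def pca_color_jitter(img, eigvecs, eigvals, sigma=0.1):
        # Scheme 2 (sketch): perturb RGB intensities along the principal components
        # of the RGB pixel values computed over the training set.
        alphas = np.random.normal(0.0, sigma, size=3)
        shift = eigvecs @ (alphas * eigvals)    # one 3-vector added to every pixel
        return np.clip(img + shift, 0, 255)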
Data augmentation enjoys popularity because of its simplicity. However, its major shortcomings are that the generated
new training instances may not appear in the real world and that some unseen patterns of an object cannot be generated via
transformations. Suppose data augmentation increases the number of training instances by a factor of K. The increased computational cost can be O(K^2).
B. Federated Learning
Recently, federated learning has attracted interest from the healthcare domain [57] because of its advertised benefits of multi-source
data integration, privacy protection, and regulation compliance.
Federated learning has shown limited success in rudimentary computer vision operations such as organ localization and lesion
segmentation [55]. Unfortunately, from both the architectural and practical perspectives, federated learning faces tremendous
challenges for image-based disease diagnosis. We have proposed a better architecture [23], which is summarized at the end of
this subsection.
Fig. 1: Soteria Components and its Three-Layer Blockchains: Main, Side, and Digital Agreement.
Federated learning cannot overcome the following seven real-world multi-hospital integration hurdles:
• Hardware diversity. Heterogeneous hardware devices (e.g., MRI machines) of different brands and models are used by
different hospitals.
• Hardware parameters. Even with exactly the same device, hardware parameter settings may be different, which may result
in different image size, resolution, and quality.
• Different clinical SOPs.
• Different DL models. One hospital may have a legacy system using a GNN while another uses a CNN, etc.
• Different DL architectures. Since AlexNet, dozens of DL architectures have been developed, such as VGG, GoogLeNet,
ResNet, and Inception. It is virtually impossible to ask all participants to use the same architecture.
• Different hyper-parameter values. Even with the same architecture, the local optimal hyper-parameter values may be
different between sites.
• Different data format, quality, and annotation practice.
Overcoming all seven of these issues is a fantasy: it would require draconian measures to force all participating hospitals (domestic and
abroad) to use the same equipment, the same hardware parameters, the same image-acquisition procedures, the same image resolutions, the same
deep learning architectures, the same hyper-parameters, the same image-labeling conventions, etc., and to output the parameters of the
same latent layers for aggregation.
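A FedAvg-style aggregation sketch (hypothetical PyTorch state_dicts) makes the point concrete: averaging only works when every site's model has identical keys and tensor shapes, which presumes exactly the uniformity listed above.

    import copy

    def fedavg(local_state_dicts, weights):
        # Weighted average of per-site model parameters (sketch). Every
        # state_dict must have identical keys and tensor shapes -- i.e., the
        # same architecture and layer sizes at every participating hospital.
        aggregated = copy.deepcopy(local_state_dicts[0])
        for key in aggregated:
            aggregated[key] = sum(w * sd[key] for w, sd in zip(weights, local_state_dicts))
        return aggregated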
For text data, wherever the data are collected, a word is a word. But image analysis is very sensitive to the aforementioned
variations. Indeed, Andrew Ng remarked at Stanford Healthcare’s AI Future workshop in April 2021 [38] that a medical image
model trained at Stanford would simply fail to perform at a hospital down the street due to such variations.
How about privacy preservation? Privacy regulations such as GDPR [1], CCPA [4], and HIPAA [3] require “provable” privacy
preservation. Federated learning is a closed system, and its privacy practice cannot be made distributed, transparent, or publicly
provable. Our proposed Soteria architecture, shown in Figure 1, uses a three-layer blockchain-based ledger to support publicly
auditable privacy-regulation compliance. Moreover, by placing digital contracts1 on the side chains, a data owner knows from
the upper-layer blockchains the status of his/her consent, the access time to the data, the payment for each access, etc. For
details, please consult the Soteria paper [23].
1 A digital contract is an agreement signed between the data owner and the data consumer on the data-access consent, with terms and conditions. Terms and
conditions can include the access epoch and payment. A digital contract is converted into a piece of SQL-like code to fetch data.
concepts. For images, the neurons from lower levels describe rudimentary perceptual elements like edges and corners, whereas
the neurons from higher layers represent aspects of objects such as their parts and categories. To capture high-level abstractions,
we extracted transfer-learned features of OM and melanoma images from the fifth, sixth and seventh layers, denoted as pool5
(P5), fc6 and fc7 in Fig. 2 respectively.
Once we had transfer-learned feature vectors of the 1,195 collected OM images and 200 melanoma images, we performed
supervised learning by training a support vector machine (SVM) classifier [11]. We chose the SVM as our model since it is an
effective classifier widely used in prior work. Using the same SVM algorithm lets us perform comparisons with the other
schemes based solely on feature representation. As usual, we scaled features to the same range and found parameters through
cross-validation. For fair comparisons with previous OM works, we selected the radial basis function (RBF) kernel.
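A minimal sketch of this pipeline, using torchvision and scikit-learn as stand-ins for the original Caffe implementation, is given below; the OM feature matrix X and label vector y are assumed to be prepared by the caller, and the exact preprocessing differs from the original work.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # ImageNet-pretrained AlexNet; keep the classifier up to fc6 (after ReLU).
    # (Older torchvision versions use pretrained=True instead of weights=.)
    alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
    fc6 = torch.nn.Sequential(*list(alexnet.classifier.children())[:3])

    preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                            T.Normalize(mean=[0.485, 0.456, 0.406],
                                        std=[0.229, 0.224, 0.225])])

    @torch.no_grad()
    def fc6_features(pil_images):
        # 4096-d transfer-learned feature vectors, one row per image.
        batch = torch.stack([preprocess(im) for im in pil_images])
        pool5 = torch.flatten(alexnet.avgpool(alexnet.features(batch)), 1)
        return fc6(pool5).numpy()

    # X: n x 4096 features, y: OM labels (both assumed given); 10-fold RBF-SVM.
    # scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=10)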
To further improve classification accuracy, we experimented with two feature fusion schemes, which combine OM features
hand-crafted by human heuristics (or model-centric) in [65] and our melanoma heuristic features with features learned from
our codebook. In the first scheme, we combined transfer-learned and hand-crafted features to form fusion feature vectors. We
then deployed the supervised learning on the fused feature vectors to train an SVM classifier. In the second scheme, we used
the two-layer classifier fusion structure proposed in [65]. In brief, in the first layer we trained different classifiers based on
different feature sets separately. We then combined the outputs from the first layer to train the classifier in the second layer.
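The two fusion schemes can be sketched as follows (binary classification for simplicity; X_transfer, X_handcrafted, and y are assumed inputs, and in practice the first-layer scores in the second scheme should come from held-out folds to avoid leakage). This is an illustration of the idea, not the exact implementation in [65].

    import numpy as np
    from sklearn.svm import SVC

    # Scheme 1 (feature-level fusion): concatenate the two feature sets and
    # train one SVM on the fused vectors.
    X_fused = np.hstack([X_transfer, X_handcrafted])
    svm_fused = SVC(kernel="rbf").fit(X_fused, y)

    # Scheme 2 (classifier-level fusion): train one SVM per feature set, then
    # feed their decision scores to a second-layer SVM.
    svm_a = SVC(kernel="rbf").fit(X_transfer, y)
    svm_b = SVC(kernel="rbf").fit(X_handcrafted, y)
    second_layer_inputs = np.column_stack([svm_a.decision_function(X_transfer),
                                           svm_b.decision_function(X_handcrafted)])
    svm_meta = SVC(kernel="rbf").fit(second_layer_inputs, y)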
Fig. 3 summarizes our transfer representation learning approaches using OM images as an example. The top of the figure
depicts two feature-learning schemes: the transfer-learned scheme on the left-hand side and the hand-crafted scheme on the
right. The solid lines depict how OM or melanoma features are extracted via the transfer-learned codebook, whereas the dashed
lines represent the flow of hand-crafted feature extraction. The bottom half of the figure describes two fusion schemes. Whereas
the dashed lines illustrate the feature fusion by concatenating two feature sets, the dotted lines show the second fusion scheme
at the classifier level. At the bottom of the figure, the four classification flows yield their respective OM-prediction decisions. In
order from left to right in the figure are ’transfer-learned features only’, ’feature-level fusion’, ’classifier-level fusion’, and
’hand-crafted features only’.
For the training hyperparameters and network architecture, we used mini-batch gradient descent with a batch size
of 64 examples, a learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0005. To fine-tune the AlexNet model, we
replaced the fc6, fc7, and fc8 layers with three new layers initialized using a Gaussian distribution with a mean of 0 and a standard deviation
of 0.01. During the training process, the learning rates of these new layers were ten times greater than those of the other layers.
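The recipe above, transcribed from the original Caffe setting into a PyTorch sketch for illustration (num_classes is an assumed variable; the original layer names and solver details differ):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    model = models.alexnet(weights="IMAGENET1K_V1")
    new_layers = []
    for idx, out_dim in [(1, 4096), (4, 4096), (6, num_classes)]:   # fc6, fc7, fc8
        layer = nn.Linear(model.classifier[idx].in_features, out_dim)
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)           # Gaussian init
        nn.init.zeros_(layer.bias)
        model.classifier[idx] = layer
        new_layers.append(layer)

    new_ids = {id(p) for l in new_layers for p in l.parameters()}
    optimizer = torch.optim.SGD(
        [{"params": [p for p in model.parameters() if id(p) not in new_ids], "lr": 0.001},
         {"params": [p for l in new_layers for p in l.parameters()], "lr": 0.01}],  # 10x
        momentum=0.9, weight_decay=0.0005)      # mini-batches of 64 in the data loader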
Results of Transfer Representation Learning for OM
Our 1,195-image OM dataset encompasses almost all OM diagnostic categories: normal; AOM: hyperemic stage, suppurative
stage, ear drum perforation, subacute resolution stage, bullous myringitis, barotrauma; OME: with effusion, resolution stage
(retracted); COM: simple perforation, active infection. Table I compares OM classification results for different feature
representations. All experiments were conducted using 10-fold SVM classification. The reported measures reflect the
discrimination capability of the features.
The first two rows in Table I show the results of human-heuristic methods (hand-crafted), followed by our proposed
transfer-learned approach. The eardrum segmentation, denoted as ‘seg’, identifies the eardrum by removing OM-irrelevant
information such as ear canal and earwax from the OM images [65]. The best accuracy achieved by using human-heuristic
methods is around 80%. With segmentation (the first row), the accuracy improves 3% over that without segmentation (the
second row).
Rows three to eight show the results of applying transfer representation learning. All of them outperform the results shown in
rows one and two, suggesting that the features learned through transfer learning are superior to the hand-crafted ones.
Interestingly, segmentation does not help improve accuracy for learning representation via transfer learning. This indicates
that the transfer-learned feature set is not only more discriminative but also more robust. Among three transfer-learning layer
choices (layer five (pool5), layer six (fc6) and layer seven (fc7)), fc6 yields slightly better prediction accuracy for OM. We
believe that fc6 provides features that are more general or fundamental to transfer to a novel domain than pool5 and fc7 do.
We also directly used the 1,195 OM images to train a new AlexNet model. The resulting classification accuracy was only
71.8%, much lower than that of transfer representation learning. This result confirms our hypothesis that even though the CNN is
a good model, with merely 1,195 OM images (without the ImageNet images to facilitate feature learning), it cannot learn
discriminative features.
The two fusion methods, which combine hand-crafted and transfer-learned features, achieved a slightly higher OM-prediction
F1-score (0.900 vs. 0.895) than using transfer-learned features only. This statistically insignificant improvement suggests that
hand-crafted features do not provide much help.
Finally, we used OM data to fine-tune the AlexNet model, which achieves the best accuracy (see line 11 in red). For
fine-tuning, we replaced the original fc6, fc7, and fc8 layers with new ones and used OM data to train the whole network
without freezing any parameters. In this way, the learned features can be refined and are thus better aligned with the target task.
This result attests that the ability to adapt representations to data is a critical characteristic that makes deep learning superior to
the other learning algorithms.
Results of Transfer Representation Learning for Melanoma
We performed experiments on the PH2 dataset whose dermoscopic images were obtained at the Dermatology Service of
Hospital Pedro Hispano (Matosinhos, Portugal) under the same conditions through the Tuebinger Mole Analyzer system using
a magnification of 20x. The assessment of each label was performed by an expert dermatologist.
TABLE II: Melanoma classification experimental results. (The best shown in bold.)
Table II compares melanoma classification results for different feature representations. All the experiments except
for the last two were conducted using 5-fold SVM classification. The last experiment involved fine-tuning, which was
implemented and evaluated using Caffe. We also performed data augmentation to balance the PH2 dataset (160 normal
images and 40 melanoma images).
Unlike OM, we found the low-level features to be more effective in classifying melanoma. Among three transfer-learning
layer choices, pool5 yields a more robust prediction accuracy than the other layers do for melanoma. The deeper the layer is,
the worse the accuracy becomes. We believe that pool5 provides low-level features that are suitable for delineating texture
patterns that depict characteristics of melanoma.
Rows three and seven show that the accuracy of transferred features is as good as that of the ABCD rule method with expert
segmentation. These results reflect that deep transferred features are robust to noise such as hair or artifacts.
We used melanoma data to fine-tune the AlexNet model and obtained the best accuracy of 92.81%, since all network parameters
are refined to fit the target task through back-propagation. We also compared our result with the cutting-edge method,
which reported 98% sensitivity and 90% specificity on PH2 [7]. Their method requires preprocessing such as manual lesion
segmentation to obtain “clean” data. In contrast, we utilized raw images without any heuristic-based preprocessing.
Thus, deep transfer learning can identify features in an unsupervised way and achieve classification accuracy as good as that of
features identified by domain experts.
Qualitative Evaluation - Visualization
In order to investigate what kinds of features are transferred or borrowed from the ImageNet dataset, we utilized a visualization
tool to perform a qualitative evaluation. First, we used an attribute selection method, SVMAttributeEval [27] with Ranker
search, to identify the most important features for recognizing OM and melanoma. Second, we mapped these important features
back to their respective codebook and used the visualization tool from Yosinski et al. [79] to find the top ImageNet classes
causing the high value of these features. By observing the common visual appearances shared by the images of the disease
classes and the retrieved top ImageNet classes, we were able to infer the transferred features.
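An approximation of this attribute-ranking step in scikit-learn, using recursive feature elimination with a linear SVM (in the spirit of SVM-RFE [27], on which SVMAttributeEval is based; the Weka tool itself is not reproduced here, and X and y are the assumed transfer-learned features and disease labels):

    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    ranker = RFE(SVC(kernel="linear"), n_features_to_select=1, step=0.1)
    ranker.fit(X, y)                              # X: transfer-learned features, y: labels
    top_dims = ranker.ranking_.argsort()[:20]     # indices of the most important dimensions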
Fig. 4 demonstrates the qualitative analyses of four different cases: the Normal eardrum, acute Otitis Media (AOM), Chronic
Otitis Media (COM) and Otitis Media with Effusion (OME), which we will now proceed to explain in turn. First, the normal
eardrum, nematode and ticks are all similarly almost gray with a certain degree of transparency, features that are hard to capture
with only hand-crafted methods. Second, AOM, purple-red cloth and red wine have red colors as an obvious common attribute.
Third, COM is commonly identified by a calcified eardrum, whose appearance resembles a seashell. Fourth, OME, oranges, and coffee all seem to
share similar colors. Here, transfer learning works to detect OM in an analogous fashion to how explicit similes are used in
language to clarify meaning. The purpose of a simile is to provide information about one object by comparing it to something
with which one is more familiar. For instance, if a doctor says that OM displays redness and certain textures, a patient may not
be able to comprehend the doctor’s description exactly. However, if the doctor explains that OM presents with an appearance
similar to that of a seashell, red wine, orange, or coffee colors, the patient is conceivably able to envision the appearance of
OM at a much more precise level. At level fc6, transfer representation learning works like finding similes that can help explain
OM using the representations learned in the source domain (ImageNet).
The classification of melanoma contrasts sharply with the classification of OM. We can exploit distinct visual features to
classify different OM classes. However, melanoma and benign nevi share very similar textures, as melanoma evolves from
benign nevi. Moreover, melanoma often has atypical textures and presents in various colors.
In the case of detecting melanoma versus benign nevus, effective representations of the diseases from higher-level visual
characteristics cannot be found from the source domain. Instead, the most effective representations are only transferable at a
lower-level of the CNN. We believe that if the source domain can add substantial images of texture-rich objects, the effect of
explicit similes may be utilized at a higher level of the CNN. For detailed analysis, readers can consult our work published in
2019 [17].
Fig. 4: The visualization of helpful features from different classes corresponding to different OM symptoms (from left to right:
Normal eardrum, AOM, COM, OME).
Fig. 5: The Vanilla GAN by [26]; figure credit: Hunter Heidenreich [29].
The work of [22] uses GAN-generated synthetic images for liver lesion classification and claims that both the sensitivity and specificity are improved. However, the total number of
labeled images is merely 182, which is too small a dataset to draw any convincing conclusions. The work [63] applies a similar
idea to thoracic disease classification and achieves better performance. The work uses human experts to remove noisy data,
but fails to report how many noisy instances were removed and how much of the accuracy improvement was attributed to
human intervention. The paper also claims that the additional data helps balance the training data across classes, mitigating
the imbalanced-training-data issue. Had the work demonstrated that generating additional data using GANs helps
despite the imbalanced distribution, the improved result would have been more convincing.
Combining 3D model simulation with GANs seems to be another plausible alternative to reaching the same goal of increasing
training instances. The work of [68] presents a framework that can generate a large amount of labeled data by combining a 3D
model with GANs. Another work [67] combines a 3D simulator (with labels) with unsupervised learning to learn a GAN model
that can improve the realism of the simulated labeled data. However, this combining scheme does not work for some tasks. For
example, experiments with these methods on our AR platform Aristo [80] did not yield any accuracy improvement in its
gesture recognition task. Moreover, most medical conditions still lack exact 3D models, which makes the combining
scheme difficult to apply.
Scale of Dataset
To establish a yardstick for these four methods, we first measured the “golden” results that supervised learning can attain using
100% training and validation data. We then dialed back the size of the training and validation data to be 50%, 20%, 10%, and
then 5%. We used each of the four methods to either increase training data or pre-train the network. We used PGGAN5 as our
GAN model to generate images with 1024 × 1024 pixel resolution. For our CNN classifier, we employed DenseNet121 [33],
and used AUROC6 as our evaluation metric. Intuitively, our conjectures before seeing the results were as follows:
• Method 1 will perform the worst, since it does not receive any help to improve model parameters.
• Method 4 will perform the best, since it produces more training instances for each target class.
• Method 3 will outperform Method 2, as the generated training data, though unlabeled, is more relevant to the target disease images
than ImageNet is.
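For reference, the evaluation metric (the per-class AUROC averaged over the 14 thoracic labels, as detailed in footnote 6) can be computed as in the sketch below; y_true and y_score are assumed to be (n_samples, 14) arrays of ground-truth labels and predicted probabilities.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_auroc(y_true, y_score):
        # Average the AUROC of each of the 14 classes (multi-label setting).
        per_class = [roc_auc_score(y_true[:, c], y_score[:, c])
                     for c in range(y_true.shape[1])]
        return float(np.mean(per_class))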
Experiment Results
Table III presents our experimental results. We report the AUROC of detecting 14 thoracic disease types using each of the
four different training methods. These results are inconsistent with our conjectures:
• Method 2, which is equivalent to transfer learning, performs the best. No methods using GANs were able to outperform
this method.
• Method 4 performs the worst. In Method 4, additional GAN-generated labeled images were used for training. We
believe that the labeled images generated using GANs were too noisy. Therefore, when the number of generated images is increased
(5x vs. 2x), the prediction accuracy does not always increase and sometimes even worsens. This suggests that GANs do not
produce helpful training instances and may in fact be counter-productive.
• Method 3 does not outperform method 2, even though ImageNet data used by method 2 is entirely irrelevant to images of
thoracic conditions. We believe that the additional images generated by GANs used for initializing network parameters are
less useful because of their low volume and variety (diversity). After all, adding more low-quality similar images to an
unlabeled pool cannot help the model learn novel features. Note that a recent keynote of I. Goodfellow [25] points out that
GANs can successfully generate more unlabeled data (not labeled data) to improve MNIST classification accuracy. Table III
reflects the same conclusion that method 3 outperforms method 1, which uses randomly-initialized weights. However,
using GANs to generate unlabeled data may not be more productive than using ImageNet to pre-train the network.
Figure 7 shows samples of real and GAN-generated images. The first column presents real images, the second column unsupervised
GAN-generated images, and the third supervised GAN-generated images. The GAN-generated images may successfully fool our colleagues with
no medical knowledge. However, as reported in [63], the GAN-generated labeled chest X-ray images must be screened by a
team of radiologists to remove erroneous data (with respect to diagnosis knowledge). Without domain knowledge, incorrectly
labeled images may be introduced by GANs into the training pool, which would degrade classification accuracy.
In summary, the study of [44] shows that pre-training with datasets that are multiple orders of magnitude larger than ImageNet
can achieve higher performance than pre-training with only ImageNet on several image classification and object detection tasks.
This result further attests that volume and variety of data, even if unlabeled, helps improve accuracy. GANs may indeed achieve
volume, but certainly cannot achieve variety.
To explain why using ImageNet can achieve better pre-training performance than that achieved when using GAN-generated
images, we perform layer visualizations using the technique introduced in [53]. Figure 8 plots the output layer of the first
dense-block of DenseNet. Row one shows five filters of untrained randomly initialized weights. Row three shows five filters with
more distinct features learned from the ImageNet pre-trained model. The unsupervised-GAN method (row two) produces filters of
similar quality to those in row one. Qualitatively, the unsupervised-GAN method learns features akin to those of the random-initialization
method and does not yield more promising classification accuracy.
5 We used a publicly available implementation of PGGAN (https://github.com/tkarras/progressive_growing_of_gans). This implementation has an auxiliary
classifier [51] and hence can generate images conditionally (for Method 4) or unconditionally (for Method 3).
6 We used a publicly available implementation of ChexNet [59] (https://github.com/zoogzog/chexnet), which contains a DenseNet121 classifier, and used
its evaluation metric. The metric is derived by first summing the AUROCs of the 14 classes and then dividing the sum by 14.
Fig. 8: CNN layer visualization of the first dense-block of DenseNet121. The top row shows randomly initialized weights, the second row is
pre-trained by the unsupervised-GAN method, and the third row is pre-trained on ImageNet.
2) Encoding knowledge into GANs: We can convey to GANs the information to be modeled via knowledge
layers/structures and/or via a knowledge graph/dictionary using natural language processing [10, 17]. We elaborate on this
scheme in the remainder of this section.
Fig. 9: The schematic diagram of KG-GAN for unseen flower category generation. There are two generators G1 and G2, a
discriminator D, and an embedding regression network E serving as the constraint function f. We share all the weights between G1
and G2. By doing so, our method can be treated as training a single generator with a category-dependent loss, in which seen
categories optimize two losses (LSNGAN and Lse) and unseen categories optimize a single loss (Lse), where Lse
is the semantic embedding loss.
and WuDao [75, 70] has 1.75 trillion, ten times as many as GPT-3. Though the iterative prompting and dialogue methods
for acquiring information are still primitive to date, users can already prompt ChatGPT and then DALL-E to produce
impressive results. At the end of this section, we present some examples in Figure 13 and discuss our recent work on modeling
consciousness [13, 15], which aims to make knowledge acquisition more effective and personalizable.
B. Method Specifications
This section presents our proposed KG-GAN that incorporates domain knowledge into the GAN framework. We consider a
set of training data under-represented at the category level, i.e., all training samples belong to the set of seen categories, denoted
as Y1 (e.g., red category of roses), while another set of unseen categories, denoted as Y2 (e.g., any other color categories), has
no training samples. Our goal is to learn categorical image generation for both Y1 and Y2 . To generate new data in Y1 , KG-GAN
applies an existing GAN-based method to train a category-conditioned generator G1 by minimizing GAN loss LGAN over G1 .
To generate unseen categories Y2 , KG-GAN trains another generator G2 from the domain knowledge, which is expressed by a
constraint function f that explicitly measures whether an image has the desired characteristics of a particular category.
KG-GAN consists of two parts: (1) constructing the domain knowledge for the task at hand, and (2) training two generators
G1 and G2 that condition on available and unavailable categories, respectively. KG-GAN shares the parameters between
G1 and G2 to couple them together and to transfer knowledge learned from G1 to G2 . Based on the constraint function
f, KG-GAN adds a knowledge loss, denoted as LK, to train G2. The general objective function of KG-GAN is written as
$\min_{G_1, G_2} L_{GAN}(G_1) + \lambda L_K(G_2)$.
Given a flower dataset in which some categories are unseen, our aim is to use KG-GAN to generate unseen categories in
addition to the seen categories. Figure 9 shows an overview of KG-GAN for unseen flower category generation. Our generators
take a random noise z and a category variable y as inputs and generate an output image $x'$. In particular, $G_1: (z, y_1) \mapsto x'_1$ and
$G_2: (z, y_2) \mapsto x'_2$, where $y_1$ and $y_2$ belong to the sets of seen and unseen categories, respectively.
We leverage the domain knowledge that each category is characterized by a semantic embedding representation, which
describes the semantic relationships among categories. In other words, we assume that each category is associated with a
semantic embedding vector v. For example, we can acquire such feature representation from the textual descriptions of each
category. (Figure 10 shows example textual descriptions for four flowers.) We use semantic embedding in two places: one is for
modifying the GAN architecture, and the other is for defining the constraint function. (Using the Oxford flowers dataset, we
show how semantic embedding is done in Section III-C.)
KG-GAN is developed upon SN-GAN [47, 48]. SN-GAN uses a projection-based discriminator D and adopts spectral
normalization for discriminator regularization. The objective functions for training G1 and D use a hinge version of adversarial
loss. The category variable y1 in SN-GAN is a one-hot vector indicating the target category. KG-GAN replaces the one-hot
vector with the semantic embedding vector v1. By doing so, we directly encode the similarity relationships between categories
into the GAN training.
The loss functions of the modified SN-GAN are defined as
$L^{SNGAN}_G(G_1) = -\mathbb{E}_{z,v_1}[D(G_1(z, v_1), v_1)]$, and
$L^{SNGAN}_D(D) = \mathbb{E}_{x,v_1}[\max(0, 1 - D(x, v_1))] + \mathbb{E}_{z,v_1}[\max(0, 1 + D(G_1(z, v_1), v_1))]$. (1)
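A PyTorch-style sketch of Eq. (1) follows; D and G1 are assumed to be the projection discriminator and the embedding-conditioned generator described above, and the training loop is omitted.

    import torch.nn.functional as F

    def discriminator_hinge_loss(D, G1, x_real, v1, z):
        # Hinge loss for D in Eq. (1): push real scores above +1, fake scores below -1.
        real_scores = D(x_real, v1)
        fake_scores = D(G1(z, v1).detach(), v1)
        return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

    def generator_hinge_loss(D, G1, v1, z):
        # Hinge loss for G1 in Eq. (1): maximize the discriminator score of fakes.
        return -D(G1(z, v1), v1).mean()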
• Bearded Iris: “This flower has thick, very pointed petals in bright hues of yellow and indigo.” “A bird shaped flower with purple and orange pointy flowers stemming from it’s ovule.”
• Orange Dahlia: “This flower has a large white petal and has a small yellow colored circle in the middle.” “The white flower has petals that are soft, smooth and fused together and has bunch of white stamens in the center.”
• Snapdragon: “The petals on this flower are red with a red stamen.” “The flower has a few broad red petals that connect at the base, and a long pistil with tiny yellow stamen on the end.”
• Stemless Gentian: “This flower has five large wide pink petals with vertical grooves and round tips.” “This flower has five pink petals which are vertically striated and slightly heart-shaped.”
Fig. 10: Oxford flowers dataset. Example images and their textual descriptions.
Fig. 11: Unseen flower category generation. Qualitative comparison between real images and the generated images from
KG-GAN. Left: Real images. Middle: Successful examples of KG-GAN. Right: Unsuccessful examples of KG-GAN. The top
two and the bottom two rows are Orange Dahlia and Stemless Gentian, respectively.
Semantic Embedding Loss. We define the constraint function f as predicting the semantic embedding vector of the
underlying category of an image. To achieve this, we implement f by training an embedding regression network E on the
training data. Once E is trained, we fix its parameters and add it to the training of G1 and G2. In particular, we propose a semantic
embedding loss Lse to serve as the knowledge loss in KG-GAN. This loss requires the predicted embedding of fake images to
be close to the semantic embedding of the target categories. Lse is written as
$L_{se}(G_i) = \mathbb{E}_{z,v_i}\left[ \lVert E(G_i(z, v_i)) - v_i \rVert^2 \right]$, where $i \in \{1, 2\}$. (2)
Total Loss. The total loss is a weighted combination of LSNGAN and Lse. The loss functions for training D and for training
G1 and G2 are respectively defined as
$L_D = L^{SNGAN}_D(D)$, and
$L_G = L^{SNGAN}_G(G_1) + \lambda_{se}\,(L_{se}(G_1) + L_{se}(G_2))$. (3)
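A sketch of Eqs. (2) and (3) in the same style; E is the frozen embedding regression network, and G1 and G2 share weights as described above. All arguments are assumed PyTorch tensors and modules.

    def semantic_embedding_loss(E, G, z, v):
        # L_se of Eq. (2): the predicted embedding of a fake image should match
        # the semantic embedding v of its target category.
        return ((E(G(z, v)) - v) ** 2).sum(dim=1).mean()

    def generator_total_loss(D, E, G1, G2, z, v1, v2, lambda_se):
        # L_G of Eq. (3): adversarial loss on seen categories plus semantic
        # embedding losses on both seen (v1) and unseen (v2) categories.
        adversarial = -D(G1(z, v1), v1).mean()
        return adversarial + lambda_se * (semantic_embedding_loss(E, G1, z, v1) +
                                          semantic_embedding_loss(E, G2, z, v2))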
D. Observations on KG-GANs
From Table IV we make two observations. First, KG-GAN (conditioned on semantic embedding) performs better than
One-hot KG-GAN. This is because One-hot KG-GAN learns domain knowledge only from the knowledge constraint while
KG-GAN additionally learns the similarity relationships between categories through the semantic embedding as the condition
variable. Second, when KG-GAN conditions on the semantic embedding, KG-GAN without Lse still works. This is because
KG-GAN learns how to interpolate among seen categories to generate unseen categories. For example, if an unseen category is
close to a seen category in the semantic embedding space, then their images will be similar.
As we can see from Figure 11, our model faithfully generates flowers with the right color, but does not perform as well in
shapes and structures. The reasons are twofold. First, colors can be more consistently articulated for a flower. Even if some
annotators describe a flower as red while others describe it as pink, we can obtain a relatively consistent color depiction
over, say, ten descriptions. Shapes and structures do not enjoy as confined a vocabulary as colors do. In addition, the flowers
in the same category may have various shapes and structures due to aging and camera angles. Since each image has 10 textual
descriptions and each category has an average of 80 images, the semantic embedding vector of each category is obtained
by averaging about 800 fastText feature vectors. This averaging operation preserves the color information quite
well while blurring the other aspects.
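The category-level embedding described above can be sketched as follows; the pre-trained fastText model file and the descriptions_by_category dictionary are assumptions for illustration, not part of the original pipeline.

    import numpy as np
    import fasttext

    ft = fasttext.load_model("cc.en.300.bin")        # assumed pre-trained fastText model

    def category_embedding(descriptions):
        # Average the sentence vectors of all textual descriptions collected for
        # one category (roughly 10 descriptions x ~80 images, i.e., ~800 vectors).
        vectors = [ft.get_sentence_vector(d) for d in descriptions]
        return np.mean(vectors, axis=0)

    embeddings = {cat: category_embedding(texts)
                  for cat, texts in descriptions_by_category.items()}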
E. Knowledge Acquisition
A better semantic embedding representation that encodes richer textual information about a flower category can be obtained
by prompting a large pre-trained model. The Oxford dataset, on which we conducted our experiments, is a tiny knowledge
base compared to GPT-3. Using GPT-3 as our knowledge base, we first prompted it for knowledge about roses and then used
the acquired knowledge to prompt DALL-E to generate images. Figure 12 shows the prompt used to query ChatGPT about the colors
and textures of roses. Once the color and texture information was obtained, we issued two separate prompts to DALL-E
to produce “roses with red, orange, and white colors” and “roses of orange, white, and pink colors with velvety petals in
ruffled appearance”. The first row of Figure 13 shows three images of roses with the specified colors. The second row of
the figure shows three rose images with the specified textures. The ChatGPT-DALL-E pipeline can reliably
generate a variety of realistic rose images based on the knowledge acquired from the pre-trained model. Compared with
the flowers generated from the much smaller knowledge base learned from the Oxford dataset (Figure 11), acquiring
specifications from a much larger pre-trained model via ChatGPT clearly yields much higher-quality rose images.
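A minimal sketch of this prompting pipeline, written against the openai Python package interface available around 2023 (ChatCompletion and Image endpoints; later package versions expose a different client API). The API key and prompt strings are placeholders, not the exact prompts used in Figures 12 and 13.

    import openai

    openai.api_key = "YOUR_API_KEY"                  # placeholder

    # Step 1: acquire knowledge about the concept from the language model.
    knowledge = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "List the typical colors and petal textures of roses."}],
    )["choices"][0]["message"]["content"]

    # Step 2: turn the acquired specifications into an image-generation prompt.
    images = openai.Image.create(
        prompt="roses of orange, white, and pink colors with velvety petals "
               "in ruffled appearance",
        n=3, size="1024x1024",
    )
    urls = [item["url"] for item in images["data"]]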
Fig. 13: Rose Images Generated by Prompting ChatGPT and then DALL-E. The images in the first row were generated by the
prompt “generate roses with red, orange, and white colors”. Those in the second row were generated by the prompt “generate some
roses of orange, white, and pink colors with velvety petals in ruffled appearance”.
task at hand. We believe that recent large pre-trained language models have the potential to serve as a general, task-agnostic
knowledge base to support knowledge acquisition. In this article, we demonstrated that the ChatGPT-DALLE pipeline can
first acquire precise descriptions of a concept and then use this new “insight” to generate realistic images beyond the original,
restrictive distribution represented by a small training dataset.
It is currently believed within the artificial intelligence (AI) community that a pre-trained model, trained on all available
documents in the world, can serve as a highly accurate and reliable source of knowledge for various tasks. This pre-trained
model can then be fine-tuned with a small amount of additional data for a specific task, or prompted step-by-step (e.g., [24, 14])
to achieve state-of-the-art performance on that task. These techniques allow for the efficient and effective use of large, pre-trained
language models in a variety of applications.
There are several promising directions for future research in the field of artificial intelligence, based on the observations and
discussions presented throughout this article. These include:
• Improving the interpretability of deep learning models, so that their decision-making processes are more transparent and
easier to understand. This is an important consideration for fields such as healthcare, where the consequences of incorrect
predictions can be severe.
• Combining domain knowledge with existing training data to generate more diverse and representative training data, in
order to better cover a wide range of semantic concepts. This could be particularly useful in fields where annotated data is
scarce or difficult to collect.
• Utilizing large pre-trained models as a knowledge base for robust knowledge acquisition is an important direction for
future research. These models, which have been trained on vast amounts of data, can serve as a valuable resource for
acquiring knowledge that is generalizable across a wide range of tasks. By leveraging and improving these models, we can
increase the reliability and efficiency of knowledge acquisition, which can then be used to guide the generation of more
diverse and representative training data for deep learning models. This, in turn, can improve the performance of these
models in various applications.
• Developing effective techniques for prompting and guiding the acquisition of knowledge, such as using a chain-of-thought
or dialogue approach. These methods can help to ensure that the knowledge acquired is precise and relevant to the task at
hand.
In my opinion, each of these research directions has the potential to significantly advance the field of artificial intelligence
and improve the usefulness and reliability of deep learning models in healthcare and a variety of other applications. By focusing
on improving the interpretability of deep learning models, incorporating domain knowledge into training data, leveraging large
pre-trained models as a knowledge base, and developing effective knowledge prompting techniques, we can make significant
progress in enhancing the performance and trustworthiness of these models in healthcare and other fields.
ACKNOWLEDGMENT
This article presents the work performed by the DeepQ team between 2014 and 2017 for the Tricorder XPRIZE award
[16, 56], as well as our research on consciousness modeling at Stanford University since 2020. The relevant papers published
by our team [10, 12, 17, 18, 66] have been cited throughout the article. We would like to acknowledge the following colleagues
for their contributions, listed in alphabetical order: Che-Han Chang, Fu-Chieh Chang, Chun-Nan Chou, Chuen-Kai Shie, and
Kai-Fu Tang.
R EFERENCES
[1] General Data Protection Regulation (GDPR). Retrieved February 4, 2020, from https://gdpr-info.eu, 2016.
[2] CS231n convolutional neural network for visual recognition: transfer learning. http://cs231n.github.io/transfer-learning/,
2017.
[3] Health Information Privacy Act, HIPAA. Retrieved February 4, 2020, from https://www.hhs.gov/hipaa/for-professionals/
index.html, 2017.
[4] AB-713 California Consumer Privacy Act (CCPA). Retrieved February 5, 2020, from https://leginfo.legislature.ca.gov,
January 2020. California Legislature 2019-2020 Regular Session.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
[6] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial
nets (gans). In International Conference on Machine Learning, pages 224–232, 2017.
[7] Catarina Barata, M Emre Celebi, and Jorge S Marques. Melanoma detection algorithm based on feature fusion. In
Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, pages
2653–2656. IEEE, 2015.
[8] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information.
Transactions of the Association for Computational Linguistics, 2017.
[9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL
https://arxiv.org/abs/2005.14165.
[10] Che-Han Chang, Chun-Hsien Yu, Szu-Ying Chen, and Edward Y. Chang. KG-GAN: Knowledge-guided generative
adversarial networks, 2019.
[11] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM Transactions on Intelligent
Systems and Technology (TIST), 2(3):27, 2011.
[12] Edward Y Chang. Perceptual feature extraction (chapter 2). In Foundations of large-scale multimedia information
management and retrieval: Mathematics of perception, chapter 2, pages 13–35. Springer, 2011.
[13] Edward Y. Chang. Towards artificial general intelligence via consciousness modeling (invited talk). In IEEE Infrastructure
Conference, September 2022. URL https://drive.google.com/file/d/1NPuKPB4gSeJeT1fmfY5eus_Rw3abwd5m/view?usp=
sharing.
[14] Edward Y. Chang. Prompting large language models with the socratic method. IEEE 13th Annual Computing and
Communication Workshop and Conference (CCWC), March 2023. URL https://arxiv.org/abs/2303.08769.
[15] Edward Y. Chang. Cocomo: Computational consciousness modeling for generative and ethical ai. arXiv preprint
arXiv:2304.02438, 2023.
[16] Edward Y. Chang, Meng-Hsi Wu, Kai-Fu Tang, Hao-Cheng Kao, and Chun-Nan Chou. Artificial intelligence
in xprize deepq tricorder. In Proceedings of the 2nd International Workshop on Multimedia for Personal Health and
Health Care, MMHealth ’17, page 11–18, New York, NY, USA, 2017. Association for Computing Machinery. ISBN
9781450355049. doi: 10.1145/3132635.3132637. URL https://doi.org/10.1145/3132635.3132637.
[17] Fu-Chieh Chang, Jocelyn J. Chang, Chun-Nan Chou, and Edward Y. Chang. Toward fusing domain knowledge with
generative adversarial networks to improve supervised learning for medical diagnoses. In 2019 IEEE Conference on
Multimedia Information Processing and Retrieval (MIPR), pages 77–84, 2019. doi: 10.1109/MIPR.2019.00022.
[18] Chun-Nan Chou, Chuen-Kai Shie, Fu-Chieh Chang, Jocelyn Chang, and Edward Y. Chang. Representation learning on
large and small data, chapter 1 of Big Data Analytics for Large-Scale Multimedia Search. pages 3–30, 07 2017.
[19] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image
database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[20] Will Douglas. Google’s medical AI was super accurate in a lab. Real life was a different story. MIT Technology Review,
April 2020.
[21] Shortliffe EH. Computer-based medical consultations: MYCIN. Elsevier, New York, 1976.
[22] Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan. Gan-based synthetic
medical image augmentation for increased cnn performance in liver lesion classification. arXiv preprint arXiv:1803.01229,
2018.
[23] Wei-Kang Fu, Yi-Shan Lin, Giovanni Campagna, Chun-Ting Liu, De-Yi Tsai, Chung-Huan Mei, Edward Y. Chang,
Shih-Wei Liao, and Monica S. Lam. Soteria: A provably compliant user right manager using a novel two-layer blockchain
technology. In 2020 IEEE Infrastructure Conference, pages 1–10, 2020. doi: 10.1109/IEEECONF47748.2020.9377624.
[24] Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The Gradient, 2021.
[25] Ian Goodfellow. Adversarial machine learning (keynote). In AAAI Conference on Artificial Intelligence, 2019.
[26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680,
2014.
[27] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using
support vector machines. Machine learning, 46(1-3):389–422, 2002.
[28] D.J. Hand and P.D.S.D.J. Hand. Artificial Intelligence and Psychiatry. The Scientific Basis of Psychiatry. Cambridge
University Press, 1985. ISBN 9780521258715. URL https://books.google.com/books?id=8PQ8AAAAIAAJ.
[29] Hunter Heidenreich. What is a generative adversarial network? http://hunterheidenreich.com/blog/what-is-a-gan/.
[30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two
time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[31] Geoffrey E Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10):428–434, 2007.
[32] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation,
18(7):1527–1554, 2006.
[33] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708. IEEE, 2017.
[34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial
networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5967–5976. IEEE, 2017.
[35] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and
Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International
[64] William B. Schwartz, Ramesh S. Patil, and Peter Szolovits. Artificial intelligence in medicine, where do we stand? New
England Journal of Medicine, 316(11):685–88, March 1987.
[65] Chuen-Kai Shie, Hao-Ting Chang, Fu-Cheng Fan, Chung-Jung Chen, Te-Yung Fang, and Pa-Chun Wang. A hybrid
feature-based segmentation and classification system for the computer aided self-diagnosis of otitis media. In Engineering
in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pages 4655–4658.
IEEE, 2014.
[66] Chuen-Kai Shie, Chung-Hisang Chuang, Chun-Nan Chou, Meng-Hsi Wu, and Edward Y. Chang. Transfer representation
learning for medical image analysis. IEEE EMBC, pages 711–714, 2015.
[67] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from
simulated and unsupervised images through adversarial training. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2107–2116. IEEE, 2017.
[68] Leon Sixt, Benjamin Wild, and Tim Landgraf. Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI,
5:66, 2018.
[69] W Stolz, A Riemann, AB Cognetta, L Pillet, W Abmayr, D Holzel, P Bilek, F Nachbar, and M Landthaler. Abcd rule of
dermatoscopy: a new practical method for early recognition of malignant melanoma. In European Journal of Dermatology,
pages 521–527, 1994.
[70] Jie Tang. Wudao — pre-train the world. https://keg.cs.tsinghua.edu.cn/jietang/publications/wudao-3.0-en.pdf, May 2022.
[71] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[72] Stefanos Vrochidis, Benoit Huet, Edward Y. Chang, and Ioannis Kompatsiaris. Big data analytics for large-scale multimedia
search. Wiley, June 2019, ISBN: 978-1119376972. 2019.
[73] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8:
Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax
diseases. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3462–3471. IEEE, 2017.
[74] Wikipedia. Otitis media. https://en.wikipedia.org/wiki/Otitis_media, 2017.
[75] Wikipedia. Wu dao, a multimodal artificial intelligence pre-trained model. https://en.wikipedia.org/wiki/Wu_Dao, 2021.
[76] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey
on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2021. doi:
10.1109/TNNLS.2020.2978386.
[77] Vincent C. Yen and Robert Boissoneau. Artificial intelligence and expert systems: Implications for health care delivery.
New England Journal of Medicine, 66(5):16–19, 1988. doi: 10.1080/00185868.1988.10543623.
[78] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. arXiv preprint
arXiv:1809.07294, 2018.
[79] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep
visualization. arXiv preprint arXiv:1506.06579, 2015.
[80] Zhongyang Zheng, Bo Wang, Yakun Wang, Shuang Yang, Zhongqian Dong, Tianyang Yi, Cyrus Choi, Emily J Chang, and
Edward Y Chang. Aristo: An augmented reality platform for immersion and interactivity. In ACM Multimedia Conference,
pages 690–698. ACM, 2017.
[81] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent
adversarial networks. In IEEE International Conference on Computer Vision, pages 2242–2251. IEEE, 2017.