Problems with CTGAN / GAN-based synthesizers #461
-
Hi Team,

We are currently testing GAN-based synthesizers such as PATECTGAN and DPCTGAN (from snsynth) on several datasets for industrial usage. There seems to be a convergence problem with a while loop in CTGAN: when we use higher epsilon values (on the order of hundreds) with the GAN synthesizers, we get back neither a result nor an error. We traced the problem to the category_eps_percent parameter in both synthesizers. Earlier versions of PATECTGAN/DPCTGAN worked fine for us.

We had also observed that, for a given dataset, increasing epsilon increased accuracy and decreased privacy. With the new models, that pattern seems to have disappeared for some reason. Could you advise whether these problems can be addressed?

Regards,
Replies: 2 comments 4 replies
-
Hi @dsrishti, can you provide some more details that would help us reproduce the behavior you are seeing? Is the behavior occurring in version 0.2.5 but not in earlier versions? There was a bug fix in 0.2.5 to handle a privacy leak in very low epsilon settings, but it doesn't seem like that would affect your use case. To revert, you can try pinning an older release.

Can you clarify what the call is and what the behavior is? Is it simply that you initialize CTGAN with a high epsilon, and the call that trains the model hangs?

You are also correct that, in general, lower epsilon should mean less utility and higher privacy. If that pattern isn't holding for you, there may be a bug. Is this a separate issue from the issue with large epsilon? In other words, is it the case that large epsilon has unexpected behavior in one way, such as hanging infinitely, while smaller epsilon values (below what range?) work fine, but accuracy doesn't seem to be impacted by epsilon?
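To make the expected epsilon/utility relationship concrete, here is a minimal, self-contained sketch (not snsynth code) using the Laplace mechanism, where the noise scale is `sensitivity / epsilon`. Higher epsilon means smaller noise, so noisy counts track the true counts more closely (more utility, less privacy). All names here are illustrative.

```python
import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return a noisy count; the noise scale shrinks as epsilon grows.

    Laplace(0, b) noise with b = sensitivity / epsilon is sampled via
    the inverse-CDF method using a uniform draw in (-0.5, 0.5).
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average |noisy - true| over many draws; should fall as epsilon rises."""
    rng = random.Random(seed)
    total = sum(abs(laplace_mechanism(true_count, epsilon, rng=rng) - true_count)
                for _ in range(trials))
    return total / trials
```

For example, `mean_abs_error(100, 0.1)` should be roughly 10, while `mean_abs_error(100, 100.0)` should be roughly 0.01, which is the pattern ("higher epsilon, higher accuracy") that appears to have broken in your runs.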
-
Hi Dileep,

I think I have isolated the issue to a privacy fix we made to the category sampling. Previously, we were using the exact counts of each category to sample, and we switched to using the noisy counts. There is an issue in the binary search used to get the noise scale, where it sometimes gets stuck in an infinite loop. This has been fixed in the main opendp branch, but I haven't had a chance to test that branch against the GANs.

We have also noticed that the GAN-based synthesizers are performing less consistently now, with more variance between training runs. It's not clear whether this is related to that privacy fix or to some other regression. We want to diagnose and fix that issue at the same time as we are cleaning up the data input/output (#467) and calling pattern (#465). We believe that bringing back continuous value support for the GANs will address some of the problem cases, as will potentially being smarter about how we use epsilon to compute the noisy counts for the category sampler.
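For anyone hitting the hang: the failure mode described above, where a binary search for a noise scale never terminates, typically happens when the bracket never tightens (e.g. the target is outside the bracket, or floating-point midpoints stop making progress). A hedged sketch of the defensive pattern, with an iteration cap and a tolerance instead of an open-ended while loop, is below. This is illustrative, not the actual snsynth implementation; `meets_budget` is a hypothetical monotone predicate.

```python
def find_noise_scale(meets_budget, lo=1e-6, hi=1e6, tol=1e-9, max_iter=200):
    """Find the smallest scale in [lo, hi] where meets_budget(scale) holds.

    meets_budget is assumed monotone: False below some threshold, True above.
    Raising on an invalid bracket and capping iterations guarantees the
    search terminates instead of spinning forever.
    """
    if not meets_budget(hi):
        raise ValueError("no scale in [lo, hi] satisfies the budget")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if meets_budget(mid):
            hi = mid          # mid works: tighten from above
        else:
            lo = mid          # mid too small: tighten from below
        if hi - lo <= tol:
            break             # bracket is tight enough; stop
    return hi
```

The key design choice is that every path through the loop either shrinks the bracket or exits, so a pathological predicate can at worst cost `max_iter` iterations rather than an infinite hang.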