Problems with CTGAN / GAN-based synthesizers #461
-
Hi Team,

We are currently testing GAN-based synthesizers such as PATECTGAN and DPCTGAN (from snsynth) on several datasets for industrial usage. There seems to be a convergence problem with a while loop in CTGAN: when we use higher epsilon values (on the order of hundreds) with the GAN synthesizers, we get back neither a result nor an error. We traced the problem to the category_eps_percent parameter in both synthesizers. Earlier versions of PATECTGAN/DPCTGAN worked fine for us.

We had also observed that, for a given dataset, increasing epsilon increased accuracy and decreased privacy. With the new models, that pattern seems to have disappeared for some reason. Could you advise whether these problems can be addressed?

Regards,
Replies: 2 comments 4 replies
-
Hi @dsrishti, can you provide some more details that would help us reproduce the behavior you are seeing? Is the behavior occurring in version 0.2.5 but not in earlier versions? There was a bug fix in 0.2.5 to handle a privacy leak in very low epsilon settings, but it doesn't seem like that would affect your use case. To revert, you can try pinning an older release.

Can you clarify what the call is and what the behavior is? Is it simply that you initialize CTGAN with a high epsilon, and the call that trains the model hangs?

You are also correct that, in general, lower epsilon should mean less utility and higher privacy. If that pattern isn't holding for you, there may be a bug. Is this a separate issue from the issue with large epsilon? In other words, is it the case that large epsilon has unexpected behavior in one way, such as hanging infinitely, while smaller epsilon values (below what range?) work fine, but accuracy doesn't seem to be impacted by epsilon?
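To make the expected epsilon/utility relationship concrete, here is a minimal, self-contained sketch (not snsynth code) using the Laplace mechanism, where the noise scale is `sensitivity / epsilon`. Higher epsilon means smaller noise, so noisy counts track the true counts more closely (more utility, less privacy). All names here are illustrative.

```python
import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return a noisy count; the noise scale shrinks as epsilon grows.

    Laplace(0, b) noise with b = sensitivity / epsilon is sampled via
    the inverse-CDF method using a uniform draw in (-0.5, 0.5).
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    """Average |noisy - true| over many draws; should fall as epsilon rises."""
    rng = random.Random(seed)
    total = sum(abs(laplace_mechanism(true_count, epsilon, rng=rng) - true_count)
                for _ in range(trials))
    return total / trials
```

For example, `mean_abs_error(100, 0.1)` should be roughly 10, while `mean_abs_error(100, 100.0)` should be roughly 0.01, which is the pattern ("higher epsilon, higher accuracy") that appears to have broken in your runs.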
-
Hi Dileep,

I think I have isolated the issue to a privacy fix we made to the category sampling. Previously, we were using the exact counts of each category to sample, and we switched to using the noisy counts. There is an issue in the binary search used to get the noise scale, where it sometimes gets stuck in an infinite loop. This has been fixed in the main opendp branch, but I haven't had a chance to test that branch against the GANs.

We have also noticed that the GAN-based synthesizers are performing less consistently now, with more variance between training runs. It's not clear whether this is related to that privacy fix or to some other regression. We want to diagnose and fix that issue at the same time as we are cleaning up the data input/output (#467) and calling pattern (#465). We believe that bringing back continuous value support for the GANs will address some of the problem cases, as will potentially being smarter about how we use epsilon to compute the noisy counts for the category sampler.
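For anyone hitting the hang: the failure mode described above, where a binary search for a noise scale never terminates, typically happens when the bracket never tightens (e.g. the target is outside the bracket, or floating-point midpoints stop making progress). A hedged sketch of the defensive pattern, with an iteration cap and a tolerance instead of an open-ended while loop, is below. This is illustrative, not the actual snsynth implementation; `meets_budget` is a hypothetical monotone predicate.

```python
def find_noise_scale(meets_budget, lo=1e-6, hi=1e6, tol=1e-9, max_iter=200):
    """Find the smallest scale in [lo, hi] where meets_budget(scale) holds.

    meets_budget is assumed monotone: False below some threshold, True above.
    Raising on an invalid bracket and capping iterations guarantees the
    search terminates instead of spinning forever.
    """
    if not meets_budget(hi):
        raise ValueError("no scale in [lo, hi] satisfies the budget")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if meets_budget(mid):
            hi = mid          # mid works: tighten from above
        else:
            lo = mid          # mid too small: tighten from below
        if hi - lo <= tol:
            break             # bracket is tight enough; stop
    return hi
```

The key design choice is that every path through the loop either shrinks the bracket or exits, so a pathological predicate can at worst cost `max_iter` iterations rather than an infinite hang.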