Data Privacy With Apache Spark
In this talk, we’ll compare different data privacy techniques & protection of personally identifiable information and their effects on statistical usefulness, re-identification risks, data schema, format preservation, read & write performance.
We’ll cover different offense and defense techniques. You’ll learn what k-anonymity and quasi-identifier are. Think of discovering the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking with elementary code examples, in case no third-party products cannot be used. We’ll see what approaches might be adopted to minimize the risks of data exfiltration.
Some of the abovementioned techniques are barely an inconvenience to implement, but difficult to support in the long run. We’ll show on which occasions Databricks Delta can help to make your datasets privacy-ready.