Open Source: Is It the Holy Grail or a Can of Worms?

Photo by Virginia Johnson on Unsplash

Do you ever wonder whether you should pull a third-party library into your code? Sometimes it’s worth it, but most of the time it isn’t. Here’s a quick way to tell: if the library does something you don’t comprehend, or something you could do yourself with little effort, don’t use it. The only exception to this rule is when the library does something that would be very difficult or time-consuming to do yourself. In that case, it might be worth using the library even if you don’t fully understand it.

Read more about Open Source: Is It the Holy Grail or a Can of Worms?

How Golang Generics Empower Concise APIs

Tired Gopher (by Quasilyte) extracting a table into memory

You’ve likely read dozens of stories about Go generics applied to ordinary slices and maps, but haven’t yet thought about a more fun way to use this feature. Let’s implement a counterpart of pandas.read_html, which maps HTML tables into slices of structs! If it’s achievable even in Rust, why shouldn’t it be in Go?! This essay will show you a thrilling mix of reflection and generics for building concise external APIs for your libraries.
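To give a flavor of the technique, here is a minimal sketch, assuming a hypothetical UnmarshalTable function, a `header` struct tag, and pre-extracted rows of string cells (none of these are the library’s actual API): generics let the caller get back a typed slice, while reflection fills in the struct fields behind the scenes.

```go
package tablesketch

import (
	"fmt"
	"reflect"
)

// UnmarshalTable maps rows of string cells onto a slice of T,
// matching each column header to the struct field carrying the
// corresponding `header` tag.
func UnmarshalTable[T any](headers []string, rows [][]string) ([]T, error) {
	t := reflect.TypeOf((*T)(nil)).Elem()
	if t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("T must be a struct, got %s", t.Kind())
	}

	// Map column index -> struct field index via the `header` tag.
	colToField := map[int]int{}
	for i := 0; i < t.NumField(); i++ {
		tag := t.Field(i).Tag.Get("header")
		for j, h := range headers {
			if h == tag {
				colToField[j] = i
			}
		}
	}

	out := make([]T, 0, len(rows))
	for _, row := range rows {
		v := reflect.New(t).Elem()
		for j, cell := range row {
			if fi, ok := colToField[j]; ok && v.Field(fi).Kind() == reflect.String {
				v.Field(fi).SetString(cell) // sketch: string fields only
			}
		}
		out = append(out, v.Interface().(T))
	}
	return out, nil
}
```

A caller could then write something like `releases, err := UnmarshalTable[Release](headers, cells)` and get a typed `[]Release` back without touching reflection themselves, which is exactly the kind of concise external API the essay is after.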

Read more about How Golang Generics Empower Concise APIs

Reverse-Engineering a Search Language

Data Mesh With Terraform and Databricks

This talk covers a couple of architectural and organizational approaches to achieving a Distributed Data Mesh, which is essentially a combination of mindset, fully automated infrastructure, continuous integration for data pipelines, dedicated departmental or team collaboration environments, and security enforcement.

Growing a Terraform Provider to Millions of Downloads

This will be the story of building and growing the Databricks Terraform Provider over the course of two years, along with the tactics, techniques, and procedures that allowed it to reach millions of installations. This talk will be useful for every Terraform Provider maintainer, as well as for those who are planning to write one.

Data Quality With or Without Apache Spark and Its Ecosystem

A few solutions exist in the open-source community, either as libraries or as complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when continuous imports happen. Organizations may consider picking one of the available options: Apache Griffin, Deequ, DDQ, and Great Expectations. In this presentation, we’ll compare these open-source products across different dimensions, like maturity, documentation, and extensibility, and features like data profiling and anomaly detection.

Read more about Data Quality with or without Apache Spark and its ecosystem

Using Terraform to Enable Distributed Data Mesh

In this session, we’ll learn about the Databricks (Labs) Terraform integration and how it can automate literally every aspect required for a production-grade platform: data security, permissions, continuous deployment, and so on. We’ll learn how Scribd offers its internal customers flexibility without acting as a gatekeeper. Just about anything they might need in Databricks is a pull request away.

Data Privacy With Apache Spark

In this talk, we’ll compare different techniques for data privacy and the protection of personally identifiable information, and their effects on statistical usefulness, re-identification risk, data schema, format preservation, and read & write performance. We’ll cover both offense and defense techniques. You’ll learn what k-anonymity and quasi-identifiers are, and discover the world of suppression, perturbation, obfuscation, encryption, tokenization, and watermarking with elementary code examples, in case no third-party products can be used. We’ll see what approaches might be adopted to minimize the risks of data exfiltration.
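As a taste of those elementary examples, here is a minimal sketch of suppression and generalization (written in Go purely for illustration, with an assumed record layout and generalization rules; the talk itself works with Spark). It suppresses a direct identifier, coarsens the quasi-identifiers, and then measures the k for which the result is k-anonymous.

```go
package privacysketch

import "fmt"

type Record struct {
	Name string // direct identifier: suppressed below
	Zip  string // quasi-identifier: generalized to a prefix
	Age  int    // quasi-identifier: generalized to a 10-year bucket
}

// Generalize suppresses the direct identifier and coarsens the
// quasi-identifiers so individual records become harder to re-identify.
func Generalize(in []Record) []Record {
	out := make([]Record, 0, len(in))
	for _, r := range in {
		zip := r.Zip
		if len(zip) > 3 {
			zip = zip[:3] + "**" // generalization: keep only the ZIP prefix
		}
		out = append(out, Record{
			Name: "*",               // suppression
			Zip:  zip,
			Age:  (r.Age / 10) * 10, // generalization into 10-year buckets
		})
	}
	return out
}

// KOf reports the k for which the dataset is k-anonymous: the size of
// the smallest group of records sharing the same quasi-identifier values.
func KOf(records []Record) int {
	groups := map[string]int{}
	for _, r := range records {
		groups[fmt.Sprintf("%s|%d", r.Zip, r.Age)]++
	}
	k := 0
	for _, n := range groups {
		if k == 0 || n < k {
			k = n
		}
	}
	return k
}
```

The trade-off the talk explores is visible even in this toy: the coarser the generalization, the larger k becomes and the lower the re-identification risk, but the less statistically useful the data remains.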

Read more about Data Privacy with Apache Spark

Data Lineage in Context of Interactive Analysis

This presentation focuses on tracking data lineage for interactive data exploration in notebooks. A set of techniques demonstrates how to audit the data’s journey from code entered in a notebook, down through execution planning, DataFrames, RDDs, and Hadoop’s file formats, and back to the visualizations displayed to the data analyst. A custom tool built with the Java Instrumentation API allowed us to add extra security to certain parts of the Spark driver’s JVM runtime environment.

Read more about Data Lineage in Context of Interactive Analysis