Strategies for Data Quality With Apache Spark

Data Quality Landscape

In fact, many data teams are guilty of overlooking critical questions like “Are we actually monitoring the data?” after deploying multiple pipelines to production. They might celebrate the success of the first pipeline and feel confident about deploying more. Still, they need to consider the health and robustness of their ETL pipelines for long-term production use. This lack of foresight can lead to significant problems down the line and undermine trust in the data sets those pipelines produce. In the previous post, we scratched the surface of how one can check data quality with Apache Spark. But the real complexity lies in the greater data quality landscape, which involves people and processes, not just the Spark clusters.

Read more about Strategies for Data Quality with Apache Spark

Introduction to Data Quality With Apache Spark

High-Quality Spark

What really happens in the data engineering world is that the data team deploys the first pipeline to production, and everyone is happy. They deploy the second, third, fifth, and tenth pipelines to production. But then they start thinking: Hmm, are we actually monitoring the data? Is our ETL pipeline healthy and robust enough for production use, so that other teams can trust the data sets it produces? “Data quality requires a certain level of sophistication within the enterprise to even understand that it’s a problem.” This quote comes from Colleen Graham in the 2006 article “Performance Management Driving BI Spending”, but it still applies today.

Read more about Introduction to Data Quality with Apache Spark

Fingerprinting Process Trees on Linux With Rust

This is how you could imagine fingerprinting process trees

Name fingerprinting is a cybersecurity forensics technique used to identify and track processes running on a computer system by using the process name or other identifiable information. This information could include the process’s file name, file path, command line arguments, and other identifying indicators of compromise.

Read more about Fingerprinting Process Trees on Linux with Rust

Exposing Azure Storage on Domain Apex With Let's Encrypt SSL

Simplified Azure CDN Let’s Encrypt flow with Terraform.

Hello, reader! In this article, I will explain how to expose an Azure Storage Account through an apex (root) domain using a free Let’s Encrypt SSL certificate, almost entirely via Terraform.

Read more about Exposing Azure Storage on Domain Apex with Let's Encrypt SSL

What Happened if Unit-Tests Unlock Self-Healing in Go?

Gopher with a wrench fixing a test.

Driving unit test coverage is essential but very dull. We need to make it as fun as possible. And for the “shippable” OSS products, it’s vital. It differs from the SaaS world, where you roll out an emergency release for all users. Once a user downloads something and runs it in their environment — it’s done. You cannot effortlessly swap the binary artifact. And if it’s broken — it’s your fault. The best way to prevent this is decent unit-testing coverage. This time we’ll cover something boring and automatable — API calls to a predefined service.

Read more about What Happened If Unit-Tests Unlock Self-Healing in Go?

GitHub Dependabot in Action

Experience with Dependabot on a repository. Screenshot processed in GIMP.

I’ve used this awesome tool on 20 open-source projects over the last two years. Here’s my opinion.

Read more about GitHub Dependabot in Action

OSS Year 2022 in Review: Projects Launched

It’s probably clear when I took the actual vacation.

Okay, it’s this time of the year, and everyone is checking their GitHub stats. I’ll join the pack on my OSS summary for the year 2022. Here’s a short recap with my own thoughts about the four projects I’ve been driving.

Read more about OSS Year 2022 in Review: Projects Launched

Tech Book Reviews: Go

Paper stack

There’s nothing like a paper book falling onto your face to remind you that you’re falling asleep and need to turn off the bedside light. I’ve read some books about Go and would like to share my opinions of them. The list is in reverse chronological order of my reading them. Some are good, some are great, and some are just there. Let’s Go.

Read more about Tech Book Reviews: Go

Open Source: Is It the Holy Grail or a Can of Worms?

Photo by Virginia Johnson on Unsplash

Do you ever wonder if you should include a third-party library in your code or not? Sometimes it’s worth it, but mostly it’s not. Here’s a quick way to tell: If the library is doing something you don’t comprehend, or if it’s doing something you could do yourself with little effort, then don’t use it. The only exception to this rule is if the library is doing something that would be very difficult or time-consuming to do yourself. In that case, it might be worth using the library even if you don’t fully understand it.

Read more about Open Source: Is It the Holy Grail or a Can of Worms?

How Golang Generics Empower Concise APIs

Tired Gopher (of Quasilyte) is extracting table into memory

You’ve likely heard and read dozens of stories about generics in Go applied to ordinary slices and maps but haven’t yet thought about a fun way to apply this feature. Let’s implement a counterpart of pandas.read_html, which maps HTML tables into slices of structs! If it’s achievable even with Rust, why shouldn’t it be with Go?! This essay will show you a thrilling mix of reflection and generics to reach concise external APIs for your libraries.

Read more about How Golang Generics Empower Concise APIs