Photo by Markus Winkler on Unsplash

This month was quite active with news and releases in the data space, with two big conferences going on (Amazon’s re:Invent and NeurIPS), as well as the official release of Airflow 2.0 and an introduction to the principles and architecture of the Data Mesh…

SQL, Databases, and ETL

SQLite received a new release (3.34.0) providing improved support for recursive queries and an enhanced query planner.
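For context, recursive queries in SQLite are written as recursive common table expressions. Below is a minimal sketch run through Python’s sqlite3 module; it illustrates recursive CTEs in general rather than the 3.34.0 enhancements specifically:

```python
import sqlite3

# A minimal recursive common table expression (CTE): generate the numbers 1..5.
# This illustrates recursive queries in general, not the 3.34.0 changes specifically.
conn = sqlite3.connect(":memory:")
query = """
WITH RECURSIVE counter(n) AS (
  SELECT 1                      -- anchor member
  UNION ALL
  SELECT n + 1 FROM counter     -- recursive member
  WHERE n < 5
)
SELECT n FROM counter;
"""
print(conn.execute(query).fetchall())  # [(1,), (2,), (3,), (4,), (5,)]
```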

Amazon open-sourced Babelfish to provide a SQL Server/T-SQL compatibility layer for Postgres. Postgres also received a Docker image in which IVM (incremental view maintenance) is implemented.

CockroachDB explained why they are compatible with Postgres, as well as how…

ON DATA ENGINEERING

Photo by Robin Pierre on Unsplash

The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data flows brings a paradigm shift and an added layer of complexity compared to traditional integration and processing methods (i.e., batch).

There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storing, and serving of that data. It brings about specific operational needs and a change in the way data engineers work. These should be taken into account when considering embarking on a real-time journey.

Use cases for leveraging Real-time Data

Photo by Viacheslav Bublyk on Unsplash

There are different estimates for salaries available online. Glassdoor provides a general overview of salaries for Data Scientists, StackOverflow provides a specific calculator, and several recruitment agencies provide salary estimates for different positions.

For instance, Harnham produces an annual salary guide for many data professions across the UK, several European countries, and the US, an exercise also undertaken by Orange Quarter and Wyatt Partners. Storm2 provides an estimate of tech salaries at fintech companies, while BigCloud focuses specifically on a report of Data Science salaries across Europe. …

Photo by Markus Winkler on Unsplash

With everyone trying to get everything out of the door before the holiday season, November was a busy month for the data world: Airflow 2.0 moved to beta status, Google released new tooling to help with machine learning in the NLP space and with managing ML model bias, and Apple released some benchmarks of their new M1 chip’s performance on ML workloads.

SQL and ETL

SQL got some attention this month; Google upgraded their managed Postgres offering to the latest version, Postgres 13, and Databricks released SQL Analytics, providing a familiar SQL interface for querying Delta Lake…

ON DATA ENGINEERING

Photo by abi ismail on Unsplash

Postgres is a very versatile database with a high degree of extensibility. It can be extended through extensions, UDFs, UDAFs, and UDTs, providing quite a few features not currently available within the native implementation. Not all extensibility options are supported in PaaS (platform as a service) implementations; AWS, for instance, doesn’t support PL/Python as part of its Relational Database Service (RDS).
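As an illustration of that extensibility, here is a minimal sketch of creating a SQL-language UDF from Python with psycopg2. The connection string, function name, and VAT rate are purely illustrative assumptions, and a reachable Postgres instance is assumed:

```python
import psycopg2

# A minimal sketch of extending Postgres with a SQL-language UDF.
# The connection string, function name, and VAT rate are illustrative;
# a reachable Postgres instance and the psycopg2 package are assumed.
conn = psycopg2.connect("dbname=mydb user=myuser host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE OR REPLACE FUNCTION add_vat(price numeric)
        RETURNS numeric AS $$
          SELECT price * 1.21;   -- apply an assumed 21% VAT rate
        $$ LANGUAGE sql IMMUTABLE;
    """)
    cur.execute("SELECT add_vat(100);")
    print(cur.fetchone())  # the VAT-inclusive price, e.g. Decimal('121.00')
conn.close()
```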

Some companies, such as Uber, have explained why they have been migrating their operational data stores (ODS) off of Postgres, but for data engineers, a database used as a data warehouse calls for different functionality than one…

ON DATA ENGINEERING

Photo by Campaign Creators on Unsplash

SQL is one of the key tools used by data engineers to model business logic, extract key performance metrics, and create reusable data structures. There are, however, different types of SQL to consider for data engineers: Basic, Advanced Modelling, Efficient, Big Data, and Programmatic. The path to learning SQL involves progressively learning these different types.

Basic SQL

What is “Basic SQL”?

Learning “Basic SQL” is all about learning the key operations used to manipulate data in SQL, such as aggregations, grain, and joins.
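For illustration, here is a minimal sketch of those basic operations (a join followed by an aggregation) run against an in-memory SQLite database from Python; the table and column names are made up for the example:

```python
import sqlite3

# A minimal "Basic SQL" sketch: join two tables, then aggregate at the country grain.
# The schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, country TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'NL'), (2, 'UK');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

query = """
SELECT
  c.country,
  COUNT(DISTINCT o.customer_id) AS customers,       -- aggregation
  SUM(o.amount)                 AS total_amount
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id   -- join at the customer grain
GROUP BY c.country;
"""
for row in conn.execute(query):
    print(row)
```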

Where to learn it

Basic SQL can be learned from websites such as W3Schools or, for a more practical approach to learning, from websites…

On Data Engineering

Five tips to help you navigate your early career as a data engineer

Photo by Sam Dan Truong on Unsplash

Data Engineering is an interdisciplinary profession requiring a mix of technical and business knowledge to have the most impact. When starting a career in data engineering, it is not always clear what is necessary to be successful. Some people believe there is a need to learn particular technologies (e.g., Big Data); others believe a high degree of software engineering expertise is key; others believe the focus should be on the business.

There are five main tips I would give to data engineers starting their career:

  1. Learn fast
  2. Don’t succumb to the technology hype
  3. Data Engineering is not all about coding or technology…

On Data Engineering

Understanding how to leverage one of the most useful window functions in SQL’s toolbox

ROW_NUMBER is one of the most valuable and versatile functions in SQL. It can be leveraged for different use cases, from ranking items and identifying data quality gaps to minimization, preference queries, and sessionization.

Photo by R Mo on Unsplash

The ROW_NUMBER function isn’t, however, a traditional function. It is a window function. Window functions are an advanced kind of function, with specific properties. This article aims to go over how window functions, and more specifically the ROW_NUMBER function, work, and to cover some of the use cases for the ROW_NUMBER function.
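As a preview of those use cases, here is a minimal sketch of using ROW_NUMBER for deduplication (keeping only the latest record per user), run against an in-memory SQLite database from Python. The schema is illustrative, and SQLite 3.25+ is assumed for window function support:

```python
import sqlite3

# A minimal sketch of ROW_NUMBER for deduplication: keep only the latest
# record per user. The schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_updates (user_id TEXT, email TEXT, updated_at TEXT);
INSERT INTO user_updates VALUES
  ('u1', 'old@example.com',  '2021-01-01'),
  ('u1', 'new@example.com',  '2021-02-01'),
  ('u2', 'only@example.com', '2021-01-15');
""")

query = """
SELECT user_id, email, updated_at
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id        -- restart the numbering for each user
      ORDER BY updated_at DESC    -- the most recent update gets row number 1
    ) AS rn
  FROM user_updates
) AS ranked
WHERE rn = 1;
"""
for row in conn.execute(query):
    print(row)
```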

Anatomy of a Window function

To understand how a window function works, it is…

Data Science Interviews

Photo by Billy Huynh on Unsplash

KMeans is one of the most common and important clustering algorithms for a data scientist to know. It is, however, often the case that even experienced data scientists do not have a good grasp of this algorithm. This makes KMeans an excellent interview topic for probing a candidate’s understanding of one of the most foundational machine learning algorithms.

There are a lot of questions that can be touched on when discussing the topic:

  1. Description of the Algorithm
  2. Big O Complexity & Optimization
  3. Application of the algorithm
  4. Comparison with other clustering algorithms
  5. Advantages / Disadvantages of using K-Means
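As a reference point for the discussion, here is a minimal sketch of fitting KMeans with scikit-learn on synthetic data; the library choice and parameters are assumptions for illustration, and the questions above are library-agnostic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a small synthetic dataset with 3 well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit KMeans with k=3; n_init controls how many random initializations are tried.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(model.cluster_centers_)   # coordinates of the 3 learned centroids
print(model.inertia_)           # within-cluster sum of squared distances
```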

Description of Algorithm

Describing the…

ON DATA ENGINEERING

How to handle the difficulty in leveraging and tracking time

Photo by Thomas Bormans on Unsplash

There is a lot of focus on engagement analysis to track customer time spent on different pieces of content. Time spent, however, is usually a metric that proves to have quite a lot of data quality issues.

The importance of Time Spent

Time spent is an essential metric in engagement studies. It provides a general measure of engagement with the content on your website or the popularity of your app. It allows us to blend engagement across the different types of material, such as pictures, text content, or even videos.
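As an illustration of how time spent can be derived from raw events, here is a minimal sketch using a LAG window function and a session timeout, run against an in-memory SQLite database from Python. The 30-minute threshold and the schema are assumptions for the example:

```python
import sqlite3

# A minimal sketch of deriving time spent from raw event timestamps:
# gaps to the previous event are summed, and gaps above an assumed
# 30-minute threshold are treated as idle time between sessions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, event_ts TEXT);
INSERT INTO events VALUES
  ('u1', '2021-01-01 10:00:00'),
  ('u1', '2021-01-01 10:02:30'),
  ('u1', '2021-01-01 11:30:00'),
  ('u2', '2021-01-01 09:00:00');
""")

query = """
WITH deltas AS (
  SELECT
    user_id,
    (julianday(event_ts)
       - julianday(LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts))
    ) * 86400 AS gap_seconds         -- gap to the previous event, in seconds
  FROM events
)
SELECT
  user_id,
  SUM(CASE WHEN gap_seconds <= 1800 THEN gap_seconds ELSE 0 END) AS time_spent_seconds
FROM deltas
GROUP BY user_id;
"""
for row in conn.execute(query):
    print(row)  # time spent per user, in seconds
```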

It helps to provide a metric that informs on the potential share of time that you…

Julien Kervizic

Living at the interstice of business, data and technology | Solution Architect & Head of Data | Heineken, Facebook and Amazon | linkedin: https://bit.ly/2XbDffo
