ON DATA ENGINEERING

Financial methods, applications, and modeling for Data Engineers

Photo by Markus Spiske on Unsplash

Because their role is meant to support the decision-making process, Data Engineers need to understand certain financial concepts and know how best to leverage them in their data models.

Some concepts are particularly important for Data Engineers’ activities, namely amortization and allocation. Other controlling concepts, such as entitlement values, mix shift, and variance decomposition, can also be helpful for understanding how some of their data consumers might be leveraging the data.

Amortization & Depreciation

What is amortization?

Amortization represents the process of gradually writing off the initial cost of an asset over its useful life — when we distinguish between amortization and depreciation…
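The gradual write-off described above can be illustrated with a minimal sketch. It assumes a straight-line method (equal amounts per period), which is only one of several possible schedules; the function name `straight_line_schedule` and its parameters are hypothetical and not taken from the article.

```python
from decimal import Decimal, ROUND_HALF_UP


def straight_line_schedule(initial_cost, useful_life_periods, residual_value=Decimal("0")):
    """Return the per-period write-off amounts for a straight-line schedule.

    Assumption: the cost is spread evenly over the useful life, rounded to cents.
    """
    if useful_life_periods <= 0:
        raise ValueError("useful_life_periods must be positive")
    cents = Decimal("0.01")
    # Amount to write off in total: initial cost minus any residual value.
    depreciable = Decimal(initial_cost) - Decimal(residual_value)
    per_period = (depreciable / useful_life_periods).quantize(cents, rounding=ROUND_HALF_UP)
    schedule = [per_period] * (useful_life_periods - 1)
    # Put any rounding difference in the final period so the total matches exactly.
    schedule.append(depreciable - sum(schedule))
    return schedule


schedule = straight_line_schedule(Decimal("1000.00"), 3)
```

Using `Decimal` rather than floats keeps the per-period amounts exact to the cent, and the final period absorbs the rounding remainder so that the schedule always sums back to the amount being written off.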


ON DATA ENGINEERING

Data Engineering application of data classes

Photo by Nam Hoang on Unsplash

Data classes are a relatively new addition to Python, first released in Python 3.7. They provide an abstraction layer that leverages type annotations to define container objects for data. Compared to a normal Python class, data classes do away with some of the boilerplate needed for instantiation, and there are a number of areas where data classes can add value to data engineering.

Understanding Data Classes

The dataclasses module introduces a lightweight way to define objects, generating methods such as __init__, __repr__, and __eq__ for the different fields defined within them.

from dataclasses import dataclass

@dataclass
class CustomerDataClass:

As shown above, it relies on a decorator pattern…
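To make the decorator pattern concrete, here is a minimal sketch of a complete data class. The class name `Customer` and all of its fields are hypothetical, chosen only for illustration; they are not the fields of the article's own example.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Customer:
    # Hypothetical fields, chosen only for illustration.
    customer_id: int
    name: str
    country: str = "NL"
    # Mutable defaults must go through default_factory.
    tags: list = field(default_factory=list)


# The decorator generates __init__, __repr__, and __eq__ from the annotations.
customer = Customer(customer_id=1, name="Ada")
same_customer = Customer(customer_id=1, name="Ada")

# asdict converts the instance into a plain dict, e.g. for serialization.
record = asdict(customer)
```

Two instances with the same field values compare equal, and `asdict` gives a plain dictionary that can be handed to a JSON encoder or a database writer.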


ON DATA ENGINEERING

Reflections on one year of using DBT for modeling a data warehouse

Photo by Artur Shamsutdinov on Unsplash

DBT is a tool that aims to facilitate the work of analysts and data engineers in transforming and modeling data within a data warehouse. It provides a command-line interface as well as a documentation server and an RPC server.

After more than a year working with DBT, I thought it would be good to reflect on what it offers, what it is currently lacking, and what features might be desirable to have incorporated in the tool.

Jinja capabilities

Jinja is a Python templating engine, used in data tools such as Airflow and Superset, and in infrastructure-as-code tools such as Ansible.

DBT leverages Jinja, at the…


Photo by Michael Dziedzic on Unsplash

Data Modeling seems to have become a lost art amongst data engineers. What was once the primary part of a data engineer’s job seems to have been relegated to a secondary rank.

Shaping the data by developing an understanding of the underlying data and the business process going along with it doesn’t seem nearly as important these days as the ability to move data around.

In a large number of organizations, the role of a data engineer has been transformed from a data shaper to a data mover.

Data Engineering’s role shift

Data Engineering has been a fast-evolving role, at the same…


Photo by Markus Winkler on Unsplash

This month was quite active with news and releases in the Data space, with two big conferences going on, Amazon’s re:Invent and NeurIPS, as well as the official release of Airflow 2.0 and an introduction to the principles and architecture of the Data Mesh…

SQL, Databases, and ETL

SQLite received a new release (3.34.0) providing increased support for recursive queries and an improved query planner.

Amazon open-sourced Babelfish to provide a SQL Server/T-SQL compatibility layer for Postgres. Postgres also received a Docker image in which IVM (incremental view maintenance) is implemented.

CockroachDB explained why they are compatible with Postgres, as well as how…


ON DATA ENGINEERING

Photo by Robin Pierre on Unsplash

The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data flows brings a paradigm shift and an added layer of complexity compared to traditional integration and processing methods (i.e., batch).

There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storing, and serving of that data. It brings about specific operational needs and a change in the way data engineers work. These should be taken into account when considering embarking on a real-time journey.

Use cases for leveraging Real-time Data


Photo by Viacheslav Bublyk on Unsplash

There are different estimates for salaries available online. Glassdoor provides a general overview of salaries for Data Scientists, Stack Overflow provides a specific calculator, and several recruitment agencies provide salary estimates for different positions.

For instance, Harnham produces an annual salary guide for many data professions across the UK, several European countries, and the US. Orange Quarter and Wyatt Partners carry out a similar exercise. Storm2 provides an estimate of tech salaries for fintech companies, while BigCloud focuses specifically on a report of Data Science salaries across Europe. …


Photo by Markus Winkler on Unsplash

With everyone trying to get everything out of the door before the Holiday season, November was a busy month for the data world: Airflow 2.0 moved to beta status, Google released new tooling to help with Machine Learning in the space of NLP and with managing ML model bias, and Apple released some benchmarks of the performance of their new M1 chip for ML workloads.

SQL and ETL

SQL got some attention this month: Google upgraded their managed Postgres instance to the latest version, Postgres 13. Databricks released SQL Analytics, providing a familiar SQL interface for querying Delta Lake…


ON DATA ENGINEERING

Photo by abi ismail on Unsplash

Postgres is a very versatile database with a high degree of extensibility. It can be extended through extensions, user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined types (UDTs). Quite a few features are not currently available within the native implementation. Not all extensibility options are supported in PaaS (platform as a service) implementations; AWS, for instance, doesn’t support PL/Python as part of Amazon Relational Database Service (RDS).

Some companies such as Uber have explained why they have been migrating their operational data stores (ODS) off of Postgres, but for Data Engineers, different functionality for a database used as a data warehouse than one…


ON DATA ENGINEERING

Photo by Campaign Creators on Unsplash

SQL is one of the key tools used by data engineers to model business logic, extract key performance metrics, and create reusable data structures. There are, however, different types of SQL to consider for data engineers: Basic, Advanced Modelling, Efficient, Big Data, and Programmatic. The path to learning SQL involves progressively learning these different types.

Basic SQL

Learning “Basic SQL” is all about learning the key operations and concepts used to manipulate data in SQL, such as aggregations, grain, and joins.
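As a small illustration of these operations, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The `customers` and `orders` tables and their contents are hypothetical, made up purely for the example. The query joins orders to customers and aggregates them, changing the grain from one row per order to one row per country.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'NL'), (2, 'NL'), (3, 'FR');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 10.0), (13, 3, 7.5);
    """
)

# Join (orders to customers) and aggregation (COUNT, SUM) at country grain.
rows = conn.execute(
    """
    SELECT c.country, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
    """
).fetchall()
conn.close()
```

The `GROUP BY` clause is what sets the grain of the result: grouping by `c.country` yields one row per country, whereas grouping by `c.customer_id` would yield one row per customer.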

Basic SQL can be learned from websites such as W3C or, for a more practical approach to learning, from websites…

Living at the interstice of business, data and technology | Solution Architect & Head of Data | Heineken, Facebook and Amazon | linkedin: https://bit.ly/2XbDffo
