This month was quite active with news and releases in the data space, with two big conferences going on (Amazon’s re:Invent and NeurIPS), as well as the official release of Airflow 2.0 and an introduction to the principles and architecture of the Data Mesh…
SQLite received a new release (3.34.0) providing increased support for recursive queries and an improved query planner.
Amazon open-sourced Babelfish to provide a SQL Server/T-SQL compatibility layer for Postgres. Postgres also received a Docker image in which IVM (incremental view maintenance) is implemented.
The shift towards real-time data flow has a major impact on the way applications are designed and on the work of data engineers. Dealing with real-time data flows brings a paradigm shift and an added layer of complexity compared to traditional integration and processing methods (i.e., batch).
There are real benefits to leveraging real-time data, but it requires specialized considerations in setting up the ingestion, processing, storing, and serving of that data. It brings about specific operational needs and a change in the way data engineers work. These should be taken into account when considering embarking on a real-time journey.
There are different estimates for salaries available online. Glassdoor provides a general overview of salaries for data scientists, Stack Overflow provides a specific calculator, and several recruitment agencies provide salary estimates for different positions.
For instance, Harnham produces an annual salary guide for many data professions across the UK, several European countries, and the US. Orange Quarter and Wyatt Partners publish similar guides. Storm2 provides an estimate of tech salaries for fintech companies, while BigCloud focuses specifically on a report of data science salaries across Europe. …
With everyone trying to get everything out of the door before the holiday season, November was a busy month for the data world: Airflow 2.0 moved to beta status, Google released new tooling to help with machine learning in the space of NLP and with managing ML model bias, and Apple released benchmarks of the performance of their new M1 chip for ML workloads.
SQL got some attention this month: Google upgraded its managed Postgres instance to the latest version, Postgres 13, and Databricks released SQL Analytics, providing a familiar SQL interface for querying Delta Lake…
Postgres is a very versatile database with a high degree of extensibility. It can be extended through extensions, UDFs, UDAFs, and UDTs. There are quite a few features not currently available within the native implementation. Not all extensibility options are supported in PaaS (platform as a service) implementations; AWS, for instance, doesn’t support PL/Python as part of Amazon Relational Database Service (RDS).
Some companies such as Uber have explained why they have been migrating their operational data stores (ODS) off of Postgres, but for data engineers, different functionality is needed from a database used as a data warehouse than one…
SQL is one of the key tools used by data engineers to model business logic, extract key performance metrics, and create reusable data structures. There are, however, different types of SQL to consider for data engineers: Basic, Advanced Modelling, Efficient, Big Data, and Programmatic. The path to learning SQL involves progressively learning these different types.
Learning “Basic SQL” is all about learning the key operations used to manipulate data in SQL, such as aggregations, grain, and joins.
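As a rough illustration of those three ideas together, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are made up for the example and are not taken from the article.

```python
import sqlite3

# Hypothetical tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'UK'), (2, 'FR'), (3, 'UK');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5), (13, 3, 10.0);
""")

# The grain of `orders` is one row per order. Joining it to `customers`
# and grouping by country changes the grain of the result to one row per country.
revenue_by_country = conn.execute("""
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(revenue_by_country)  # [('FR', 1, 7.5), ('UK', 3, 35.0)]
```

The join brings in the attribute to group by, the aggregation collapses rows, and the `GROUP BY` column decides the grain of the output.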
Basic SQL can be learned from websites such as W3C or, for a more practical approach to learning, from websites…
Data Engineering is an interdisciplinary profession requiring a mix of technical and business knowledge to have the most impact. When starting a career in data engineering, it is not always clear what is necessary to be successful. Some people believe that there is a need to learn particular technologies (e.g., big data); others believe it is a high degree of software engineering expertise; others believe it is focusing on business.
There are five main tips I would give to data engineers starting their career:
ROW_NUMBER is one of the most valuable and versatile functions in SQL. It can be leveraged for different use cases, such as ranking items, identifying data quality gaps, doing some minimization, handling preference queries, or helping with sessionization.
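To make the ranking use case concrete, here is a small sketch that keeps only the most recent event per user. It runs through Python's built-in `sqlite3` module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The `events` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical event log, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_time TEXT, page TEXT);
    INSERT INTO events VALUES
        (1, '2020-12-01 10:00', 'home'),
        (1, '2020-12-01 10:05', 'pricing'),
        (2, '2020-12-01 09:00', 'home'),
        (2, '2020-12-01 09:30', 'blog');
""")

# ROW_NUMBER restarts at 1 for each user_id (PARTITION BY) and numbers the
# rows from the newest event to the oldest (ORDER BY ... DESC).
# Keeping rn = 1 therefore keeps the latest event of each user.
latest_per_user = conn.execute("""
    SELECT user_id, page
    FROM (
        SELECT user_id, page,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY event_time DESC
               ) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()

print(latest_per_user)  # [(1, 'pricing'), (2, 'blog')]
```

The same pattern, with a different `PARTITION BY` and `ORDER BY`, covers deduplication and "top N per group" style queries.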
The ROW_NUMBER function isn’t, however, a traditional function. It is a window function. Window functions are an advanced kind of function with specific properties. This article aims to go over how window functions, and more specifically the ROW_NUMBER function, work, and to go over some of its use cases.
To understand how a window function works, it is…
KMeans is one of the most common and important clustering algorithms for a data scientist to know. It is, however, often the case that experienced data scientists do not have a good grasp of this algorithm. This makes KMeans an excellent topic for interviews, to gauge a candidate's understanding of one of the most foundational machine learning algorithms.
There are a lot of questions that can be touched on when discussing the topic:
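For reference, the standard formulation (Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a deliberately simple pure-Python version under a few assumptions: points are tuples of floats, centroids are initialised by sampling data points at random, and the `kmeans` helper and its parameters are illustrative names rather than any library's API.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialise centroids with k data points sampled without replacement.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated groups of made-up 2D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Typical interview follow-ups map directly onto this code: how the initialisation affects the result, what to do with empty clusters, and why the loop is guaranteed to stop.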
There is a lot of focus on engagement analysis to track customer time spent on different pieces of content. Time spent is usually a metric that proves to have quite a lot of data quality issues.
Time spent is an essential metric in engagement study. It provides a general measure of engagement with the content on your website or the popularity of your app. It allows us to blend across the different types of material, such as pictures, text content, or even videos.
It helps to provide a metric that informs on the potential share of time that you…