The year 2021 is expected to see a tremendous increase in the use of artificial intelligence, machine learning and other domains of data science. Looking ahead, let’s review the latest developments in data engineering in 2020 and what’s in store for 2021 and thereafter. The recent data engineering trends can be broadly divided into the following 3 categories -
- Data Infrastructure
- Data Architecture
- Data Management
Before we go into the depths of each of these categories, let us take a moment to look at the top 3 predictions for the year 2021:
- Metadata management including data lineage, data quality and data discovery tools will together blend into a mainstream data management platform.
- Data Mesh principles will see significant adoption in driving this unified data management platform.
- Lakehouse systems including Iceberg, Hudi and Delta Lake will be significant in shaping the data engineering architecture.
Data Infrastructure
Managed Data Infrastructure and Serverless Computing - The year 2020 saw cloud platforms continue to adopt open source data infrastructure solutions. This adoption ranges from AWS EMR, Azure HDInsight and Google Cloud Dataproc to the more recent AWS managed Airflow. Even though opinions may differ on cloud platforms packaging open source software, cloud-managed infrastructure certainly has many pros for consumers: it lets them rapidly adopt complex infrastructure and focus their attention on solving business problems.
2021 and beyond -
there will be a significant increase in a very interesting trend in data engineering, i.e. serverless architecture. It will also be exciting to observe how managed data infrastructure and the rise in serverless computing come together.
Cloud Datawarehouse - Since 2010, tightly coupled compute and storage has been the go-to approach to run large scale data processing engines. It was in 2019 that the industry finally acknowledged the cloud data warehouse system, after declaring that data processing is the old way of thinking.
With Snowflake’s successful IPO in 2020, cloud data warehouse systems established themselves as a safe bet for the future. The Amazon S3 update that introduced strong read-after-write consistency is an important step in adopting object storage for the Cloud Datawarehouse system.
2021 and beyond -
the Cloud Datawarehouse system will continue its dominance and see increased adoption. It will also be of interest to see how tightly these cloud warehouse systems integrate with the data management systems.
Cost Optimisation - The Cloud Datawarehouse as well as the managed data infrastructure systems have put pressure on optimising the cost of operating data warehouse systems; Netflix is one example. At the same time, GPU-accelerated workloads provide a strategic business advantage, examples of which are Pinterest and NVIDIA.
2021 and beyond -
the unpredictability of object storage engines, storage costs and handling, and the need for specialised hardware, will be the norm.
Data Architecture
Lakehouse - Support for ACID transactions, data versioning, auditing, indexing, caching and query optimisation is vital for building large scale data systems. 2020 saw the emergence of lakehouse frameworks such as Databricks Delta Lake, Apache Hudi and Apache Iceberg. Lakehouse is a new generation of open platforms that unify Data Warehousing and advanced analytics.
2021 and beyond -
the Lakehouse systems will continue to mature and play a big role in shaping the data engineering architecture. It remains to be seen how Lakehouse will complement or compete with the likes of Redshift and Snowflake.
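To make the data versioning and atomic commit ideas a little more concrete, here is a toy sketch in plain Python. It is not how Delta Lake, Hudi or Iceberg are actually implemented; the class `ToyTableLog`, its methods and the file names are hypothetical, chosen only to illustrate the general idea of an append-only commit log from which any table version can be reconstructed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTableLog:
    """Hypothetical append-only commit log for a table made of data files."""
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        # Each commit is recorded with a single append, so a reader sees
        # either the whole change or none of it.
        self.commits.append((frozenset(add), frozenset(remove)))
        return len(self.commits)  # version numbers start at 1

    def snapshot(self, version=None):
        # Replay commits up to `version` to find the files that are live.
        if version is None:
            version = len(self.commits)
        files = set()
        for added, removed in self.commits[:version]:
            files -= removed
            files |= added
        return files


log = ToyTableLog()
v1 = log.commit(add=["part-0.parquet", "part-1.parquet"])
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

print(sorted(log.snapshot()))    # files in the latest version
print(sorted(log.snapshot(v1)))  # "time travel" back to version 1
```

Because older commits are never rewritten, auditing and reading a past version amount to replaying a prefix of the log.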
Lambda vs. Kappa vs. Lambda-less - Managing real-time and batch computing so that they provide one integrated view of a dataset remains the primary challenge in data processing. Pinterest writes on some of the challenges of the Lambda architecture as well as its migration journey to the Kappa architecture. LinkedIn also took an interesting approach with the Lambda-less model.
2021 and beyond -
there will be no real-time vs. batch; it will be all about the window that we process.
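One way to read this prediction is that the same aggregation logic can serve both cases, with only the window size changing. The sketch below is a minimal illustration in plain Python, assuming events arrive as `(timestamp_seconds, key)` pairs; the function name `tumbling_window_counts` and the sample events are made up for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp to the start of the window it falls into.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


events = [(0, "click"), (30, "click"), (61, "view"), (3599, "click"), (3600, "view")]

# The same function, applied with a small window and a large window.
per_minute = tumbling_window_counts(events, 60)
per_hour = tumbling_window_counts(events, 3600)

print(per_minute)
print(per_hour)
```

With a one-minute window the result looks like a near-real-time metric, and with a one-hour window it looks like a small batch job, although the code path is identical.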
Streaming SQL Engines & OLAP Engines - Real-time computing and insights are crucial for many businesses. Event sourcing is a well-established design pattern, and it raises a question: should we compute business metrics directly on the streams, or feed everything into OLAP databases and query it there?
2021 and beyond -
streaming SQL engines are the way to go for predefined analytics. OLAP engines are good for interactive analytics, where the analytical queries are unknown at the time the datasets are built.
Data Management
Data Quality and Metadata Management - Data quality is critical when developing a data pipeline, and an ML model is only as good as the quality of its data. Both Microsoft and Airbnb write on how their data quality efforts improved their organisations’ decision-making processes. One of the most remarkable trends of 2020 is the emergence of tools and infrastructure in data engineering for managing metadata at scale.
2021 and beyond -
isolated systems like data pipeline, data quality, data discovery and data lineage will merge into one unified data management platform.
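As a rough illustration of what "unified" could mean in practice, the sketch below keeps data quality results and lineage for each dataset in the same catalog entry. Everything here is hypothetical: `DatasetEntry`, `run_checks`, `all_upstream`, the dataset names and the 10% null-rate threshold are assumptions made for the example and do not describe any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog entry holding lineage and quality side by side."""
    name: str
    upstream: list = field(default_factory=list)  # lineage: parent dataset names
    quality: dict = field(default_factory=dict)   # check name -> passed?


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def run_checks(entry, rows, required_columns, max_null_rate=0.1):
    """Record one null-rate check per column; return True if all checks pass."""
    for col in required_columns:
        entry.quality[f"null_rate({col})"] = null_rate(rows, col) <= max_null_rate
    return all(entry.quality.values())


def all_upstream(catalog, name):
    """Walk the lineage transitively and list every ancestor dataset."""
    seen = []
    stack = list(catalog[name].upstream)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(catalog[parent].upstream)
    return seen


catalog = {
    "raw_orders": DatasetEntry("raw_orders"),
    "clean_orders": DatasetEntry("clean_orders", upstream=["raw_orders"]),
    "daily_revenue": DatasetEntry("daily_revenue", upstream=["clean_orders"]),
}
rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": None}]

ok = run_checks(catalog["clean_orders"], rows, ["order_id", "amount"])
print(ok, catalog["clean_orders"].quality)
print(all_upstream(catalog, "daily_revenue"))
```

Keeping both kinds of metadata in one place means a failed check on `clean_orders` can be traced to the datasets that depend on it without joining across separate systems.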
Data Mesh - Data Mesh emerged in 2020 as a de facto set of principles for managing data at scale as an organisation grows. We have seen a number of companies moving beyond a monolithic data lake and adopting a distributed Data Mesh.
2021 and beyond -
an accelerated adoption of data mesh principles will further push the vision of a single integrated data management system.
DBT & Workflow Orchestration - The fundamental pattern behind the success of DBT is that the industry embraces SQL as the best data abstraction mechanism for most of the data engineering workload. The success of DBT is also primarily driven by the corresponding successes of cloud data warehouse systems and the data lake 3.0 systems.
2021 and beyond -
we will also see the likes of Databricks, alongside AWS, launching their own versions of DBT or adopting it. It is also interesting to note that general purpose data orchestration engines like Airflow, Dagster and Prefect already come integrated with DBT.