The year 2021 will see a tremendous increase in the use of artificial intelligence, machine learning and other domains of data science. Looking ahead, let’s review the latest developments in data engineering in 2020 and what’s in store for 2021 and thereafter. The recent data engineering trends fall into three main categories:
- Data Infrastructure
- Data Architecture
- Data Management
Before going into each of these categories in depth, here are the top three predictions for 2021:
- Metadata management tools, including data lineage, data quality and data discovery, will converge into a mainstream data management platform.
- Data Mesh principles will see significant adoption and will drive this unified data management platform.
- Lakehouse systems, including Iceberg, Hudi and Delta Lake, will be significant in shaping data engineering architecture.
Data Infrastructure
Managed Data Infrastructure and Serverless Computing - The year 2020 saw cloud platforms continue to adopt open source data infrastructure solutions. This adoption ranges from AWS’s EMR, Azure HDInsight and Google Cloud Dataproc to the more recent AWS managed Airflow. Opinions may differ on cloud platforms packaging open source software, but cloud-managed infrastructure clearly lets consumers adopt complex infrastructure rapidly and focus their attention on solving business problems.
2021 and beyond -
Cloud Datawarehouse - Since 2010, tightly coupled compute and storage has been the go-to approach for running large-scale data processing engines. It was only in 2019 that the industry finally acknowledged the cloud data warehouse system, after declaring that data processing is the old way of thinking.
With Snowflake’s successful IPO in 2020, cloud data warehouse systems look assured of a future. The Amazon S3 update that introduced strong read-after-write consistency is an important step in adopting object storage for cloud data warehouse systems.
2021 and beyond -
Cost Optimisation - Cloud data warehouses and managed data infrastructure systems have put pressure on optimising the cost of operating data warehouse systems; Netflix is one example. At the same time, GPU-accelerated workloads provide a strategic business advantage, as the examples of Pinterest and NVIDIA show.
2021 and beyond -
Data Architecture
Lakehouse - Support for ACID transactions, data versioning, auditing, indexing, caching and query optimisation is vital for building large-scale data systems. 2020 saw the emergence of lakehouse frameworks such as Databricks Delta Lake, Apache Hudi and Apache Iceberg. The lakehouse is a new generation of open platforms that unify data warehousing and advanced analytics.
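To make the transaction and versioning ideas concrete, here is a minimal Python sketch, assuming a toy in-memory table. The `VersionedTable` class and its methods are hypothetical and are not the API of Delta Lake, Hudi or Iceberg; they only illustrate how an append-only commit log can give all-or-nothing commits and reads of older versions.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy table (hypothetical): each commit appends an immutable snapshot."""
    _log: list = field(default_factory=lambda: [()])  # version 0 is empty

    @property
    def version(self) -> int:
        return len(self._log) - 1

    def commit(self, new_rows, expected_version: int) -> int:
        """Append rows in one step; reject the write if the table moved on."""
        if expected_version != self.version:
            raise RuntimeError("conflict: table changed since it was read")
        self._log.append(self._log[-1] + tuple(new_rows))
        return self.version

    def read(self, version=None):
        """Return the rows as of a given version (latest by default)."""
        v = self.version if version is None else version
        return list(self._log[v])


table = VersionedTable()
v1 = table.commit([{"id": 1}], expected_version=0)
v2 = table.commit([{"id": 2}], expected_version=v1)
```

Reading with `table.read(version=v1)` returns the table as it was after the first commit, while a writer that still holds a stale `expected_version` is rejected instead of silently overwriting newer data.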
2021 and beyond -
Lambda vs. Kappa vs. Lambda-less - Managing real-time and batch computation so that they provide one integrated view of a dataset remains the primary challenge in data processing. Pinterest has written about some of the challenges of the Lambda architecture and about its migration to the Kappa architecture. LinkedIn also took an interesting approach with a Lambda-less model.
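The difference between the two architectures can be sketched in a few lines of Python. This is a simplified illustration with made-up page-view events, not the design used by Pinterest or LinkedIn; `lambda_view`, `kappa_view` and `BATCH_CUTOFF` are hypothetical names.

```python
from collections import Counter

# Hypothetical page-view events. The first three are already covered by the
# periodic batch job; the last two arrived afterwards.
events = ["home", "search", "home", "pin", "home"]
BATCH_CUTOFF = 3


def lambda_view(evts, cutoff):
    """Lambda: a batch view and a speed view, merged at query time."""
    batch_view = Counter(evts[:cutoff])   # batch layer, recomputed periodically
    speed_view = Counter()                # speed layer, updated per event
    for e in evts[cutoff:]:
        speed_view[e] += 1
    return batch_view + speed_view


def kappa_view(evts):
    """Kappa: a single streaming job; reprocessing means replaying the log."""
    view = Counter()
    for e in evts:
        view[e] += 1
    return view
```

Both functions return the same counts for the same events. The point of the sketch is that Lambda keeps two code paths that must stay consistent, while Kappa keeps one and relies on a replayable log.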
2021 and beyond -
Streaming SQL Engines & OLAP Engines - Real-time computing and insights are crucial for many businesses. Event sourcing is a well-established design pattern, and it raises a question: should we compute business metrics on the streams themselves, or feed everything into OLAP databases and query it there?
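The two options can be compared side by side in a small Python sketch. The order events are made up, and the in-memory SQLite database is only a stand-in for an OLAP engine, chosen so that the example runs with the standard library alone.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order events as (country, amount) pairs.
events = [("DE", 10.0), ("US", 25.0), ("DE", 5.0)]

# Option 1: compute the metric on the stream, updating state per event.
revenue_stream = defaultdict(float)
for country, amount in events:
    revenue_stream[country] += amount

# Option 2: land the raw events in a database and compute at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", events)
revenue_olap = dict(
    conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country")
)
```

Both routes give the same revenue per country. The stream version keeps only the running totals, while the query version keeps every raw event and can answer questions that were not planned in advance.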
2021 and beyond -
Data Management
Data Quality and Metadata Management - Data quality is critical when developing a data pipeline, and an ML model is only as good as the quality of its data. Both Microsoft and Airbnb have written about how their data quality efforts improved their organisations’ decision-making processes. One of the most remarkable trends of 2020 is the emergence of data engineering tools and infrastructure for managing metadata at scale.
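As a rough illustration of what a data quality check inside a pipeline can look like, here is a minimal Python sketch. The records, the `check_quality` function and the three rules (no missing age, no repeated `user_id`, age within 0 to 130) are assumptions made for this example and are not taken from the Microsoft or Airbnb write-ups.

```python
# Hypothetical records flowing through a pipeline.
rows = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},
    {"user_id": 2, "age": 29},
    {"user_id": 3, "age": -5},
]


def check_quality(records):
    """Return a dict mapping each check name to the offending row indexes."""
    failures = {"null_age": [], "duplicate_user_id": [], "age_out_of_range": []}
    seen = set()
    for i, r in enumerate(records):
        if r["age"] is None:
            failures["null_age"].append(i)
        elif not 0 <= r["age"] <= 130:
            failures["age_out_of_range"].append(i)
        if r["user_id"] in seen:
            failures["duplicate_user_id"].append(i)
        seen.add(r["user_id"])
    return failures


report = check_quality(rows)
```

A pipeline could stop or raise an alert whenever any list in `report` is non-empty, so that bad rows are caught before they reach a model or a dashboard.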
2021 and beyond -
Data Mesh - In 2020, Data Mesh emerged as the de facto set of principles for scaling data management as an organisation grows. We have seen a number of companies move beyond a monolithic data lake and adopt a distributed Data Mesh.
2021 and beyond -
DBT & Workflow Orchestration - The fundamental pattern behind the success of DBT is that the industry embraces SQL as the best data abstraction mechanism for most data engineering workloads. The success of DBT is also driven largely by the corresponding success of cloud data warehouse systems and data lake 3.0 systems.
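The idea of SQL as the abstraction, with an orchestrator deciding the build order, can be sketched in Python. This is not how DBT is implemented and it does not use DBT's API. The `models` dictionary, the table names and the use of SQLite views are assumptions made for illustration, and `graphlib` requires Python 3.9 or later.

```python
import sqlite3
from graphlib import TopologicalSorter  # standard library, Python 3.9+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "paid", 20.0), (2, "cancelled", 15.0), (3, "paid", 5.0)],
)

# Hypothetical models: name -> (upstream models, SELECT statement).
models = {
    "revenue": ({"paid_orders"}, "SELECT SUM(amount) AS total FROM paid_orders"),
    "paid_orders": (set(), "SELECT * FROM raw_orders WHERE status = 'paid'"),
}

# Work out the build order from the dependencies, upstream models first.
graph = {name: deps for name, (deps, _) in models.items()}
build_order = list(TopologicalSorter(graph).static_order())

# Materialise each model as a view defined by its SELECT statement.
for name in build_order:
    conn.execute(f"CREATE VIEW {name} AS {models[name][1]}")

total = conn.execute("SELECT total FROM revenue").fetchone()[0]
```

Each transformation stays a plain `SELECT`, and the dependency graph, not the order in which the models are written, determines that `paid_orders` is built before `revenue`.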
2021 and beyond -