GoDaddy, the domain registration and management giant, is migrating the majority of its infrastructure and data warehouse to AWS. GoDaddy chose AWS for its deep experience in delivering a highly reliable global infrastructure, as well as its unmatched track record of technology innovation, to support GoDaddy's rapidly expanding business.
GoDaddy is the world's largest web host by market share, with over 62 million registered domains. This global growth has led to extremely large amounts of data being generated across various projects and teams. Multiple teams at GoDaddy wanted to streamline the way they clean data before loading it into Amazon S3. This called for a generic, configuration-based framework that reduces development effort and can be reused across GoDaddy's teams.
Why Amazon Web Services
Running a cleaning process on more than 40 tables holding huge amounts of data requires scalable, cost-effective infrastructure, which is why Virtue Tech recommended that GoDaddy go with AWS: the capacity AWS provides is a perfect fit. AWS offers the broadest and deepest portfolio of purpose-built analytics services, optimized for unique analytics use cases. These services are all designed to be best in class, which means we never have to compromise on performance, scale, or cost when using them. Spark on Amazon EMR runs 3x faster than standard Apache Spark 3.0, and we can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions.
Running Critical Applications on AWS
GoDaddy provisions an Amazon EMR (Amazon Elastic MapReduce) cluster to run a generic, configuration-based PySpark framework, which automates the entire process of loading clean data into AWS Glue. The framework lets the user handle incremental as well as historical data, for both partitioned and non-partitioned tables. It supports txt, csv, json, and parquet input file formats, and it can handle slowly changing dimensions (SCD), depending on the use case. The framework's column mapping feature lets the user map raw column names from the source to new column names in AWS Glue tables, so that all GoDaddy stakeholders can run SQL-like queries and analyze the data with AWS Glue and Amazon Athena. Querying is now simpler, and the queries themselves take much less time to complete because they run on Amazon Athena, so reports that previously took a long time to generate are now ready far sooner.
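To make the configuration-driven idea concrete, here is a minimal sketch of how a per-table configuration with a column mapping might look and be applied. It is not GoDaddy's actual framework: the configuration keys, the table name, and the column names are all hypothetical, and plain Python dictionaries stand in for PySpark DataFrames so the example runs without a cluster.

```python
import json

# Hypothetical configuration for one table. Every key name and value here is
# an assumption made for illustration, not the framework's real schema.
CONFIG_JSON = """
{
  "table_name": "domains_clean",
  "input_format": "csv",
  "load_type": "incremental",
  "partition_columns": ["load_date"],
  "column_mapping": {
    "dom_nm": "domain_name",
    "reg_dt": "registration_date"
  }
}
"""

# Input formats and load types named in the text above.
SUPPORTED_FORMATS = {"txt", "csv", "json", "parquet"}
SUPPORTED_LOAD_TYPES = {"incremental", "historical"}


def load_config(raw):
    """Parse a table configuration and reject unsupported settings."""
    config = json.loads(raw)
    if config["input_format"] not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported input format: {config['input_format']}")
    if config["load_type"] not in SUPPORTED_LOAD_TYPES:
        raise ValueError(f"unsupported load type: {config['load_type']}")
    return config


def apply_column_mapping(rows, mapping):
    """Rename raw source columns to target names; unmapped columns pass through."""
    return [
        {mapping.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


config = load_config(CONFIG_JSON)
raw_rows = [
    {"dom_nm": "example.com", "reg_dt": "2020-01-15", "load_date": "2021-06-01"}
]
clean_rows = apply_column_mapping(raw_rows, config["column_mapping"])
print(clean_rows)
```

In a PySpark job, the same renaming would typically be done by calling `DataFrame.withColumnRenamed` once per mapping entry. The point of the sketch is that onboarding a new table means writing a new configuration rather than new code.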
GoDaddy Clean Layer System Configuration Diagram
GoDaddy’s decision to move to a new AWS-based architecture to streamline the loading of clean data across multiple teams helped the company save both time and money. With the on-premises solution, developing the framework for each team took more than two weeks; after migrating to AWS and streamlining the process, it takes only a day or two, because only a system configuration file needs to be created. Running the framework on Amazon EMR not only made the process 3x faster but also cut total costs by 50-80%.