The Challenge
Advertising matters to every part of a business: it brings in more customers and thereby increases turnover. It can run through various media, such as television, social media, and websites.
As an online business, GoDaddy collected data from these sources and stored it in Apache Hive. From there the data was loaded into GCP BigQuery for transformations, and into Amazon S3 and Amazon Redshift for generating reports.
The many hops in this data pipeline created dependencies on other platforms such as Hadoop and GCP, and the approach could not handle scaling the application up. Virtue Tech therefore proposed decommissioning GCP and Hadoop and using AWS for both transformation and analytics.
Why Amazon Web Services
The GCP setup could not use columnar storage for metric data, which increased storage space and costs compared with using S3 for storage and EMR for processing. Amazon EMR makes it easy to set up, operate, and scale big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. We also saved 50-80% on the cost of the instances by selecting Amazon EC2 Spot for transient workloads and Reserved Instances for long-running workloads. The three R5.4 nodes in the EMR cluster cost 3 x $0.25/hour, or $0.75/hour in total, which comes to less than a dollar for both tasks during a batch run.
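The cost arithmetic above can be checked with a short calculation. This is only a sketch: the node count and hourly price come from the text, while the 20-minute run length, the linear pro-rating of the hourly rate, and the function name are assumptions made for illustration. It ignores storage and any other service charges.

```python
# Figures quoted in the text: three nodes at $0.25 per node-hour.
NODE_COUNT = 3
PRICE_PER_NODE_HOUR_USD = 0.25


def cluster_cost_usd(runtime_minutes: float,
                     nodes: int = NODE_COUNT,
                     price_per_node_hour: float = PRICE_PER_NODE_HOUR_USD) -> float:
    """Instance cost of one run, assuming the hourly rate is pro-rated linearly."""
    return nodes * price_per_node_hour * runtime_minutes / 60.0


# Cost of keeping the whole cluster up for a full hour.
hourly_cost = cluster_cost_usd(60)

# Assumption: a batch run of about 20 minutes.
batch_cost = cluster_cost_usd(20)

print(f"per hour: ${hourly_cost:.2f}, per 20-minute batch: ${batch_cost:.2f}")
```

Under these assumptions, even a full hour of cluster time stays under one dollar, which is consistent with the figure quoted above.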
Running Critical Applications on AWS
Data from multiple sources, such as Google Analytics and Salesforce, is pulled into Amazon Simple Storage Service (S3) in CSV format through APIs. Amazon EMR runs a generic, configuration-based PySpark framework that automates the entire process of transforming, cleaning, and loading the data into the S3 ADS layer. An Amazon Redshift table is created on top of this S3 ADS layer. Tableau connects to this Redshift table through the Redshift connector, which lets analysts analyze the data and generate useful business insights.
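To illustrate what "configuration-based" can mean for the transform-and-clean step, here is a minimal sketch in plain Python. It is not the PySpark framework described above and does not touch S3 or EMR; the column names, the configuration keys, and the `transform` function are all hypothetical. The point is only that the per-source rules live in a configuration object, so the same code can serve different sources.

```python
import csv
import io

# Hypothetical per-source configuration: which columns to keep and rename,
# which types to enforce, and which columns must be non-empty.
CONFIG = {
    "rename": {
        "Campaign Name": "campaign",
        "Cost (USD)": "cost_usd",
        "Clicks": "clicks",
    },
    "types": {"cost_usd": float, "clicks": int},
    "required": ["campaign"],
}


def transform(csv_text, config):
    """Apply the configured rename, validation, and type rules to raw CSV text."""
    cleaned = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        # Keep only the configured columns, renamed and stripped of whitespace.
        row = {new: (raw.get(old) or "").strip()
               for old, new in config["rename"].items()}
        # Drop rows that are missing a required value.
        if any(not row[col] for col in config["required"]):
            continue
        try:
            for col, cast in config["types"].items():
                row[col] = cast(row[col])
        except ValueError:
            continue  # Drop rows whose values cannot be parsed.
        cleaned.append(row)
    return cleaned


# Made-up sample input: one valid row, one without a campaign, one with a bad cost.
SAMPLE = (
    "Campaign Name,Cost (USD),Clicks\n"
    " spring_sale ,12.50,40\n"
    ",3.00,5\n"
    "brand,abc,7\n"
)

cleaned_rows = transform(SAMPLE, CONFIG)
print(cleaned_rows)
```

In a Spark-based version, the same configuration could drive the equivalent DataFrame operations, and supporting a new source would mean writing a new configuration rather than new code.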
The Benefits
Using AWS, GoDaddy now has access to the scale it requires to deliver a reliable and insightful service to its customers. With its earlier setup, comprising GCP and on-premises systems, availability was 98 percent; on its AWS cloud infrastructure, this has risen to 99.965 percent. Running the pipelines on AWS has also saved both time and money. The previous system took more than 50 minutes to run the end-to-end flow, whereas on AWS it takes about 20 minutes. The company is also saving almost 3,000 USD per month, since the previous infrastructure carried the extra cost of Google Cloud Storage and an on-premises Hadoop cluster.
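To put these figures in more concrete terms, the following sketch converts them into yearly downtime, relative runtime reduction, and annual savings. It treats the quoted numbers (98 and 99.965 percent availability, roughly 50 and 20 minutes, about 3,000 USD per month) as exact and assumes a 365-day year, so the results are approximations rather than measured values.

```python
HOURS_PER_YEAR = 365 * 24  # Assumption: a non-leap year.


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of unavailability per year at a given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


downtime_before = annual_downtime_hours(98.0)    # earlier GCP and on-premises setup
downtime_after = annual_downtime_hours(99.965)   # AWS setup

# End-to-end runtime: roughly 50 minutes before, roughly 20 minutes after.
runtime_reduction = (50 - 20) / 50

# Roughly 3,000 USD saved per month, extrapolated to a year.
annual_saving_usd = 3000 * 12

print(f"downtime per year: {downtime_before:.1f} h before, {downtime_after:.1f} h after")
print(f"runtime reduction: {runtime_reduction:.0%}")
print(f"annual saving: about {annual_saving_usd} USD")
```

On these assumptions, yearly downtime drops from roughly 175 hours to roughly 3 hours, and the end-to-end run is about 60 percent shorter.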