Dataroma Case Study

AWS Automation

Category: AWS Automation

Dataroma Case Study

The Challenge

GoDaddy is the world’s largest domain name registrar, with more than 16 million domain name registrations and 13 million customers. It collects marketing-related information for its customers to support them throughout the customer lifecycle.

This marketing data is messy because it combines data from Google AdWords, Yahoo, Bing, and other sources, which made it a major challenge for the GoDaddy team to extract insights from it.

Why Amazon Web Services

GoDaddy stores its data on Amazon Simple Storage Service (Amazon S3) and processes it in parallel with Amazon Elastic MapReduce (Amazon EMR). EMR decouples compute from storage, so each can scale independently while taking advantage of the tiered storage of Amazon S3. With EMR, GoDaddy can provision one, hundreds, or thousands of compute instances or containers to process data at any scale. Auto Scaling, which manages cluster size based on utilization, increases or decreases the number of instances automatically, and GoDaddy pays only for what it uses.

Running Critical Applications on AWS

Salesforce data is extracted through an API into Amazon Simple Storage Service (Amazon S3) in CSV format. Amazon EMR runs a PySpark framework that automates the entire process of transforming, cleaning, and loading the data into the S3 ADS layer. An Amazon Redshift table is created on top of this S3 ADS layer. Tableau connects to this Redshift table through the Redshift connector, allowing analysts to analyze the data and generate useful business insights.
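
The cleanup step described above can be illustrated with a minimal sketch. This is not the framework GoDaddy runs: it uses plain Python on an in-memory CSV string rather than PySpark on S3, and the column names, sample rows, and the helper names normalize_header and clean_csv are all hypothetical. It only shows the kind of normalization such a pipeline might apply before loading: consistent header names, trimmed values, and no empty rows.

```python
import csv
import io


def normalize_header(name: str) -> str:
    """Lower-case a column name and join its alphanumeric parts with '_'."""
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in name.strip().lower())
    # Collapse repeated underscores and trim them from both ends.
    return "_".join(part for part in cleaned.split("_") if part)


def clean_csv(raw_text: str) -> list[dict[str, str]]:
    """Parse CSV text, normalize headers, trim values, and drop empty rows."""
    reader = csv.reader(io.StringIO(raw_text))
    try:
        header = [normalize_header(h) for h in next(reader)]
    except StopIteration:
        return []  # no header line, so nothing to load
    rows = []
    for values in reader:
        trimmed = [v.strip() for v in values]
        if not any(trimmed):
            continue  # skip rows that carry no data at all
        # Pad short rows so every record has the same keys.
        trimmed += [""] * (len(header) - len(trimmed))
        rows.append(dict(zip(header, trimmed)))
    return rows


# Hypothetical sample: messy headers, stray whitespace, and one empty row.
SAMPLE = "Campaign Name, Ad-Source ,Clicks\n Spring Sale ,google, 120 \n,,\nBrand,bing,45\n"
records = clean_csv(SAMPLE)
for record in records:
    print(record)
```

In a Spark job the same rules would typically be expressed as column renames and filters on a DataFrame; the plain-Python form is used here only to keep the example self-contained.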

The Benefits

The primary benefit to GoDaddy of moving to AWS is that enough resources are available to serve customers of all sizes and to onboard those customers in days. To generate deep insights with a more effective reporting process, GoDaddy turned to VirtueTech to help deliver greater value for its client base by shifting time away from manual data wrangling and towards better analysis and insights.

PyDeequ Case Study

GoDaddy, the domain registration and management giant, is migrating the majority of its infrastructure and data warehouse to AWS. GoDaddy chose AWS for its deep experience in delivering a highly reliable global infrastructure and its unmatched track record of technology innovation, both of which support GoDaddy's rapidly expanding business.

The Challenge

GoDaddy, with its web hosting business and the world’s largest domain name registration business, manages over 57 million domains worldwide. Every day it ingests over 20 terabytes of new, uncompressed data, covering everything from website traffic and usage metrics to server management and ecommerce statistics. All of that data is used to configure products and provide client services to its 14.7 million customers, which range from major corporations to small businesses. The data comes from many known and unknown sources, in highly varied formats and with disparate meanings and uses, and data from different sources can be conflicting, inconsistent, or contradictory. At small data volumes, data can be checked by manual search or custom programs, or as part of ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes. These methods do not scale to data volumes this large, which poses a major challenge for existing data quality techniques.

Why Amazon Web Services

Building a data lake in the cloud eliminates the cost and hassle of managing the infrastructure required in an on-premises data center. That is why VirtueTech recommended that GoDaddy go with the AWS Cloud, whose broad portfolio of services offers options both for building a data lake and for maintaining data quality. That includes Amazon Simple Storage Service (Amazon S3) for storing data in any format, securely, and at massive scale. Deequ, which is used internally at Amazon, can be used to define and verify data quality constraints and to stay informed about changes in data distribution.

Running Critical Applications on AWS

GoDaddy evaluates data quality using PyDeequ, an open-source Python wrapper over Deequ (an open-source tool developed and used at Amazon), to make sure that the data ultimately used is relevant and trustworthy. PyDeequ uses Spark to read from sources such as Amazon Simple Storage Service (Amazon S3) and computes data quality metrics such as completeness, maximum, or correlation through an optimized set of aggregation queries. It also verifies the constraints defined on the data and generates a data quality report containing the results of the constraint verification. PyDeequ can also analyze the whole dataset, or only part of it, and suggest validation constraints based on what it finds. If a data quality check fails, the downstream process is stopped and an email is sent to the producer team using Amazon Simple Email Service (Amazon SES). The email includes the report as an attachment, telling the team what failed and why.
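
To make the idea of metrics, constraints, and a quality gate concrete, here is a minimal sketch in plain Python. It does not use PyDeequ or Spark and is not GoDaddy's implementation: the rows, column names, thresholds, and the names Constraint, verify, run_quality_gate, and DataQualityError are hypothetical. Raising an exception stands in for stopping the downstream process; sending the report by email through Amazon SES is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Row = dict[str, Optional[float]]


def completeness(rows: list[Row], column: str) -> float:
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def maximum(rows: list[Row], column: str) -> Optional[float]:
    """Largest non-null value of `column`, or None if there is none."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return max(values) if values else None


@dataclass
class Constraint:
    description: str
    check: Callable[[list[Row]], bool]


class DataQualityError(Exception):
    """Raised to stop downstream processing when a constraint fails."""


def verify(rows: list[Row], constraints: list[Constraint]) -> list[tuple[str, bool]]:
    """Build a simple report: one (description, passed) pair per constraint."""
    return [(c.description, c.check(rows)) for c in constraints]


def run_quality_gate(rows: list[Row], constraints: list[Constraint]) -> list[tuple[str, bool]]:
    """Return the report if everything passes, otherwise raise with the failures."""
    report = verify(rows, constraints)
    failed = [description for description, passed in report if not passed]
    if failed:
        raise DataQualityError("; ".join(failed))
    return report


# Hypothetical sample data: one of three rows is missing its amount.
ROWS: list[Row] = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 25.5},
]
CONSTRAINTS = [
    Constraint("order_id is complete", lambda r: completeness(r, "order_id") == 1.0),
    Constraint("amount completeness >= 0.9", lambda r: completeness(r, "amount") >= 0.9),
]

try:
    run_quality_gate(ROWS, CONSTRAINTS)
    gate_message = "all constraints passed"
except DataQualityError as err:
    gate_message = f"downstream stopped: {err}"
print(gate_message)
```

The design point is that the report is built first and the decision to stop is made from it, so the same report can be attached to a notification whether or not the gate fails.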

GoDaddy PyDeequ System Configuration Diagram

The Benefits

Improved data quality has led to better decision-making across the organization. Users can easily discover high-quality data in optimized formats, and teams report reduced latency for their analytics results. Fixing incomplete or inconsistent data to make it usable takes significant time, which is taken away from other activities and delays acting on the insights the data uncovers. Quality data also helps keep the company’s various departments on the same page so that they can work together more effectively. Because business insights are now much more reliable and efficient, this has also increased the company's profitability.

RTF – Case Study

GoDaddy, the domain registration and management giant, is migrating the majority of its infrastructure and data warehouse to AWS. GoDaddy chose AWS for its deep experience in delivering a highly reliable global infrastructure and its unmatched track record of technology innovation, both of which support GoDaddy's rapidly expanding business.

The Challenge

GoDaddy is committed to continuous innovation and to technology and platform improvements that create a great experience for its customers. GoDaddy's Real Time Finance (RTF) team wanted to streamline its report generation process, which previously relied on JSON files. While the reports gave the company the specific things it needed, they took a long time to generate and were not analyst friendly. The company therefore decided to move to a more cloud-native solution for generating reports.

Why Amazon Web Services

AWS provides a superior global footprint and set of cloud capabilities, which is why GoDaddy selected it to meet its needs today and into the future. AWS also enables GoDaddy to accelerate the delivery of its products and services and to deploy them to customers worldwide in minutes. In addition, AWS will enable GoDaddy to leverage emerging technologies like machine learning, quickly test ideas, and deliver new tools and solutions to its customers with greater frequency.

Running Critical Applications on AWS

GoDaddy provisions an Amazon Elastic MapReduce (Amazon EMR) cluster to run a PySpark framework that automates the entire RTF process. The framework's column mapping feature lets the user map raw column names from the source to new column names in AWS Glue tables. By adding this component to the architecture, the company not only preserved its reporting system but also made report generation more user friendly. The framework loads incremental data into the final Glue table on a daily basis, so the table provides only the latest records from the system, and all GoDaddy stakeholders can run SQL-like queries against it with AWS Glue and Amazon Athena. Querying is now simpler, and because the queries run only on delta data they also complete much faster, so reports that used to take a long time are now generated quickly.
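
The column mapping and daily incremental load described above can be sketched as follows. This is an illustration in plain Python, not the PySpark framework itself: the mapping, the sample records, the in-memory list standing in for the final Glue table, and the function names are all hypothetical. It also assumes that "incremental" means loading only records dated after the last loaded date (a watermark), which the case study does not spell out.

```python
from datetime import date

# Hypothetical mapping from raw source column names to table column names.
COLUMN_MAPPING = {"txn_dt": "transaction_date", "amt": "amount", "cust": "customer_id"}


def map_columns(record: dict[str, object], mapping: dict[str, str]) -> dict[str, object]:
    """Rename keys according to `mapping`; unmapped keys keep their name."""
    return {mapping.get(key, key): value for key, value in record.items()}


def select_delta(records: list[dict[str, object]], date_column: str, last_loaded: date) -> list[dict[str, object]]:
    """Keep only records dated strictly after the last loaded date."""
    return [r for r in records if r[date_column] > last_loaded]


def incremental_load(target: list, raw_records: list, mapping: dict, date_column: str, last_loaded: date) -> date:
    """Append new records to `target` and return the updated watermark."""
    mapped = [map_columns(r, mapping) for r in raw_records]
    delta = select_delta(mapped, date_column, last_loaded)
    target.extend(delta)
    # If nothing new arrived, the watermark stays where it was.
    return max((r[date_column] for r in delta), default=last_loaded)


# Hypothetical raw records; the first one was loaded on a previous day.
RAW = [
    {"txn_dt": date(2020, 3, 1), "amt": 10.0, "cust": "c1"},
    {"txn_dt": date(2020, 3, 2), "amt": 20.0, "cust": "c2"},
    {"txn_dt": date(2020, 3, 3), "amt": 30.0, "cust": "c3"},
]
final_table: list[dict[str, object]] = []
watermark = incremental_load(final_table, RAW, COLUMN_MAPPING, "transaction_date", date(2020, 3, 1))
print(len(final_table), watermark)
```

Running the load a second time with the returned watermark adds nothing, which is what keeps daily runs cheap: each run touches only the records that arrived since the previous one.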

GoDaddy RTF System Configuration Diagram 

The Benefits

VirtueTech advised GoDaddy to switch to a new AWS-based architecture to generate reports for its RTF team. With the on-premises solution, generating reports took a large amount of time; after migrating to AWS, reports not only take less time but also cost less. Analysts also wanted to query their data from a SQL-like tool to make report generation easier, so the solution uses a combination of AWS Glue and Amazon Athena for faster analysis. There are now significantly more opportunities to make data-driven decisions within the organization.

We are a team of highly skilled professionals with 20+ years of experience, in lockstep with the Industry 4.0 journey and evolution.
Email : contact.us@virtuetechinc.com

2020 © copyrights VIRTUETECH | PRIVACY POLICY | DISCLAIMER