Building a Batch Data Pipeline with Athena and MySQL | by 💡Mike Shakhomirov | Oct, 2023


An End-To-End Tutorial for Beginners

💡Mike Shakhomirov
Towards Data Science
Photo by Redd F on Unsplash

In this story I will talk about one of the most popular ways to run data transformation tasks: batch data processing. This data pipeline design pattern is especially useful when we need to process data in chunks, which makes it efficient for ETL jobs that run on a schedule. I will demonstrate how it can be achieved by building a data transformation pipeline using MySQL and Athena. We will use infrastructure as code to deploy it in the cloud.

Imagine that you have just joined a company as a Data Engineer. Their data stack is modern, event-driven, cost-effective, flexible, and scales easily to meet your growing data needs. External data sources and data pipelines in your data platform are managed by the data engineering team using a flexible environment setup with CI/CD GitHub integration.

As a data engineer, you need to create a business intelligence dashboard that displays the geography of company revenue streams, as shown below. Raw payment data is stored in the server database (MySQL). You want to build a batch pipeline that extracts data from that database daily, stores the data files in AWS S3, and processes them with Athena.

Revenue dashboard. Image by author.
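
To make the daily extraction step more concrete, here is a minimal sketch of reading a table in chunks and grouping the output under a date-partitioned object key. It rests on several assumptions that are not from this story: the `payments` table and its columns are hypothetical, the key layout `payments/dt=.../part-NNNN.csv` is just one common convention, an in-memory SQLite database stands in for MySQL so the example runs anywhere, and the result is collected in a dictionary instead of being uploaded to S3.

```python
import csv
import io
import sqlite3
from datetime import date


def extract_in_chunks(conn, query, chunk_size):
    """Yield lists of rows, each holding at most chunk_size rows."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def rows_to_csv(rows):
    """Serialise a list of row tuples to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def run_daily_batch(conn, run_date, chunk_size=2):
    """Return {object_key: csv_text}; a real job would upload each entry to S3."""
    objects = {}
    # Hypothetical table and columns, used for illustration only.
    query = "SELECT id, country, amount FROM payments ORDER BY id"
    for i, rows in enumerate(extract_in_chunks(conn, query, chunk_size)):
        key = f"payments/dt={run_date.isoformat()}/part-{i:04d}.csv"
        objects[key] = rows_to_csv(rows)
    return objects


# SQLite stands in for the MySQL source database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, "GB", 10.0), (2, "US", 25.5), (3, "GB", 7.25)],
)

batch = run_daily_batch(conn, date(2023, 10, 1))
for key in sorted(batch):
    print(key)
```

Reading with `fetchmany` keeps memory use bounded by the chunk size, and putting the run date into the key means each daily run writes to its own prefix, which is a layout that query engines over S3 can typically treat as a partition.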

Batch data pipeline

A data pipeline can be considered as a sequence of data processing steps. Due to logical data flow connections between these stages, each stage generates an output that serves as an input for the following stage.
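
The idea that each stage's output feeds the next one can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the stage names (`extract`, `transform`, `aggregate`) and the sample rows are hypothetical and do not come from the pipeline built in this story.

```python
from functools import reduce


def extract():
    """Stage 1: return raw rows (hard-coded sample data for illustration)."""
    return [
        {"country": "GB", "amount": "10.0"},
        {"country": "US", "amount": "25.5"},
        {"country": "GB", "amount": "7.25"},
    ]


def transform(rows):
    """Stage 2: cast the amount field from text to a number."""
    return [{"country": r["country"], "amount": float(r["amount"])} for r in rows]


def aggregate(rows):
    """Stage 3: sum revenue per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    return totals


def run_pipeline(source, stages):
    """Pass the data through each stage in order; output of one is input to the next."""
    return reduce(lambda data, stage: stage(data), stages, source)


revenue_by_country = run_pipeline(extract(), [transform, aggregate])
print(revenue_by_country)
```

Keeping each stage as a plain function with one input and one output makes it easy to test stages in isolation and to reorder or replace them later.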

There is a data pipeline whenever there is data processing between points A and B.

Data pipelines can differ in their conceptual and logical nature. I previously wrote about it here [1]:




This post originally appeared on TechToday.
