AWS Redshift Spectrum architecture

9/17/2023

RA3 nodes – Amazon Redshift RA3 nodes are backed by a new managed storage model that lets you optimize compute power and storage separately. They bring a few very important features, one of which is data sharing. RA3 nodes also support pause and resume, which lets you suspend on-demand billing while a cluster is not being used.

Data sharing – Amazon Redshift data sharing lets you extend the ease of use, performance, and cost benefits of a single Amazon Redshift cluster to multi-cluster deployments while sharing data between the clusters. Data sharing enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move the data. You can securely share live data with Amazon Redshift clusters in the same or different AWS accounts, and across Regions. You can share data at many levels, including schemas, tables, views, and user-defined functions. You can also share the most up-to-date and consistent information as it’s updated in Amazon Redshift Serverless. Data sharing also provides fine-grained access controls that you can tailor for different users and businesses that all need access to the data. However, data sharing in Amazon Redshift has a few limitations.

In this use case, our customer relies heavily on Amazon Redshift as the data warehouse for their analytics workloads, and they value the flexibility and convenience that Amazon Redshift has brought to their business. They mainly use Amazon Redshift to store and process user behavioral data for BI purposes. The data has grown by hundreds of gigabytes daily in recent months, and employees from various departments continuously run queries against the Amazon Redshift cluster from their BI platform during business hours.

The company runs four major analytics workloads on a single Amazon Redshift cluster, because some data is used by all workloads:

Queries from the BI platform – Various queries run mainly during business hours.

Hourly ETL – This extract, transform, and load (ETL) job runs in the first few minutes of each hour.

Daily ETL – This job runs twice a day during business hours, because the operation team needs to get daily reports before the end of the day. Each job normally takes between 1.5–3 hours. It’s the second-most resource-heavy workload.

Weekly ETL – This job runs in the early morning every Sunday.

However, they have noticed that query performance drops while ETL tasks are running, and that the ETL tasks themselves take a long time. Therefore, the analytics team wants to explore solutions to optimize their Amazon Redshift cluster.

Because CPU utilization spikes appear while the ETL tasks are running, the AWS team’s first thought was to separate the workloads and their relevant data into multiple Amazon Redshift clusters of different sizes. By reducing the total number of nodes, we hoped to reduce the cost of Amazon Redshift.

After a series of conversations, the AWS team found that one of the reasons the customer keeps all workloads on the 12-node Amazon Redshift cluster is to manage the performance of queries from their BI platform, especially while ETL workloads are running, because those workloads have a big impact on the performance of everything else on the cluster. The obstacle is that many tables in the data warehouse must be read and written by multiple workloads, and only the producer of a data share can update the shared data.

The challenge of dividing the Amazon Redshift cluster into multiple clusters is data consistency. Some tables need to be read by ETL workloads and written by BI workloads, and some tables are the opposite. Therefore, if we duplicate data into two Amazon Redshift clusters, or only create a data share from the BI cluster to the reporting cluster, the customer will have to develop a data synchronization process to keep the data consistent between all Amazon Redshift clusters, and this process could be very complicated and unmaintainable.

After more analysis to gain an in-depth understanding of the customer’s workloads, the AWS team found that we could put the tables into four groups, and proposed a multi-cluster, two-way data sharing solution.

The purpose of the solution is to divide the workloads into separate Amazon Redshift clusters so that we can pause and resume the clusters that serve periodic workloads and reduce Amazon Redshift running costs, while every cluster can still access the single copy of the data that its workloads require. The solution should meet the data consistency requirements without building a complicated data synchronization process.
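As a rough illustration of what two-way data sharing between two clusters involves, the following Python sketch only builds the SQL statements that each cluster would run; it does not connect to Amazon Redshift. It assumes just two clusters (one for ETL, one for BI) instead of the customer's actual layout, and every cluster label, namespace value, schema, share, database, and table name in it is a hypothetical placeholder. Each cluster is the producer for the tables it writes and a consumer of the other cluster's share, so there is still only one writable copy of every table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Cluster:
    """A Redshift cluster: a label plus its namespace identifier."""
    name: str
    namespace: str  # hypothetical placeholder, not a real namespace GUID


def producer_sql(share: str, schema: str, tables: List[str],
                 consumer: Cluster) -> List[str]:
    """Statements for the cluster that owns (writes) the given tables."""
    stmts = [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
    ]
    stmts += [f"ALTER DATASHARE {share} ADD TABLE {schema}.{t};" for t in tables]
    stmts.append(
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer.namespace}';"
    )
    return stmts


def consumer_sql(share: str, local_db: str, producer: Cluster) -> List[str]:
    """Statement for the cluster that reads the shared tables."""
    return [
        f"CREATE DATABASE {local_db} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer.namespace}';"
    ]


def two_way_plan(etl: Cluster, bi: Cluster, etl_tables: List[str],
                 bi_tables: List[str]) -> List[Tuple[str, str]]:
    """Return (cluster name, SQL) pairs in an order that can be executed.

    Both producers create and grant their shares first, because a consumer
    can only create a database from a share it has already been granted.
    """
    plan: List[Tuple[str, str]] = []
    # Phase 1: each cluster publishes the tables it writes.
    plan += [(etl.name, s) for s in producer_sql("etl_share", "etl_schema", etl_tables, bi)]
    plan += [(bi.name, s) for s in producer_sql("bi_share", "bi_schema", bi_tables, etl)]
    # Phase 2: each cluster mounts the other cluster's share as a read-only database.
    plan += [(bi.name, s) for s in consumer_sql("etl_share", "etl_shared", etl)]
    plan += [(etl.name, s) for s in consumer_sql("bi_share", "bi_shared", bi)]
    return plan


if __name__ == "__main__":
    etl = Cluster("etl", "etl-namespace-placeholder")
    bi = Cluster("bi", "bi-namespace-placeholder")
    for cluster_name, statement in two_way_plan(etl, bi, ["events"], ["user_segments"]):
        print(f"[{cluster_name}] {statement}")
```

In this sketch the BI cluster would query ETL-owned tables through the `etl_shared` database and the ETL cluster would query BI-owned tables through `bi_shared`, so no synchronization job is needed. How the customer's four table groups map onto shares is not described here, so the two-group split is only an assumption for the example.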
