When to use Amazon RDS or Redshift
AWS offers many services to store, manage, and analyze data in a scalable and reliable way. Two well-known examples are RDS and Redshift. Pick the wrong one for an application, and performance and budgets can suffer.
Amazon Relational Database Service (RDS) manages database servers in the cloud. Amazon Redshift supports data warehouse and data lake approaches, allowing it to access and analyze large amounts of data. Although they have similarities, these two AWS database services solve different problems. They are based on different design patterns, and they scale and access data in different ways.
What are the main similarities?
RDS and Redshift use SQL to access data. They both manage compute and storage infrastructure to deliver scalability, availability, and security features. For software, data is grouped in databases and logical schemas, stored in tables, and organized in rows and columns. Amazon Redshift is based on PostgreSQL, a widely used database engine. Both equip app owners with a programmatic way to store and access data and make it available to end users.
What are the main differences?
RDS and Redshift differ in key areas including underlying design, provisioning and scaling approach, and data access. Cost is also a distinction.
Structure. The most important difference between RDS and Redshift is their data processing design. As a relational database service, RDS follows an online transaction processing (OLTP) design. OLTP programs follow a transactional process, which means that data is protected from concurrent changes and corruption from failed processes. Redshift follows an online analytical processing (OLAP) approach. OLAP cleans and organizes data from data warehouses into structured data cubes to prepare them for queries. This difference means that they solve two different problems.
RDS is intended for use online in real-time transactions that require an immediate response. Redshift is suitable for jobs with longer, heavier data analysis that can be run asynchronously.
Provisioning and scalability. Generally speaking, provisioning an RDS database involves several steps:
- launch a database instance;
- select its instance family, such as T3, M5 or R5;
- select its size, such as large, xlarge, 4xlarge or others;
- choose the storage type, General Purpose SSD or Provisioned IOPS; and
- storage capacity adjustment.
In the case of RDS Aurora, a MySQL and PostgreSQL compatible database in AWS RDS, the customer must launch a database in an RDS cluster. A cluster consists of a single master node and an optional number of read replicas, as well as multi-AZ or regional backup alternatives. RDS stores all source data in a single node. The only way to scale storage is to increase disk capacity in the RDS instance. Adding more read replicas can unload the primary node, but the primary node remains the single point where source data is stored and managed.
To provision a Redshift cluster, the customer selects the following:
- node type, such as RA3, DS2, or DC2;
- the size of the node, such as large, xlarge, 8xlarge, or another size; and
- number of nodes.
In the case of RA3 nodes, Redshift users can specify the amount of storage per node. Otherwise, it is preset based on node size. Application owners can increase cluster storage capacity by adding nodes or updating managed storage settings. Data is distributed evenly across nodes, providing scalability for application owners. Compute and storage are distributed across multiple nodes. Thus, the scalability of a Redshift cluster is much higher than that of an RDS deployment.
Access to data. RDS databases are not designed to access data stored outside of their local storage system and predefined format. Redshift can access data stored locally in the cluster or in external data sources.
For data stored externally, Redshift supports multiple formats, such as ORC, Parquet, JSON, and CSV. Redshift is designed and optimized to store and access much larger datasets than RDS. This can go up to 128TB per node, potentially reaching petabytes of data in a cluster. Compare that with RDS, which hits 100 Gibibytes at 64 TB for most database engines. Redshift is only limited by external data storage limitations. In the case of AWS S3 cloud storage, the limits are virtually non-existent.
Cost. Regarding the cost, comparing the two solutions is not so simple.
RDS pricing depends on database engine, instance size, and storage. The most expensive RDS instance (Microsoft SQL Server Enterprise db.x1.32xlarge) can run close to $45,000 per month. But there are production-ready options that customers can deploy for $1,000 or less, such as an RDS MySQL m5.4xlarge.
The most expensive Redshift node (ra3.16xlarge) can cost around $9,400 per month. This would allow a multi-node cluster to potentially reach $50,000 to $100,000 per month. Cost management strategies, such as resizing nodes and clusters, can reduce the costs of both products.
Although the cost of a Redshift cluster is higher than its equivalent in RDS, Redshift enables significantly more compute and storage capacity than an RDS deployment.
Real-world use
Concretely, a relational database managed by RDS is suitable for use in online queries. Consider it for requests from a web or mobile application, such as user login, product catalog search, user details, etc.
A Redshift cluster is ideal for generating reports from large amounts of data, such as site traffic reports, user activity reports, log analysis, market or business reports, billing for large platforms and analysis of public datasets. The Redshift user can automatically trigger report generation expecting the processing to take some time. Once the report is complete, Redshift’s output can be exported in a format accessible to other application components, such as business intelligence visualization tools, internal reports, and online applications.
Make the right choice
Although Redshift is based on PostgreSQL, many features and design patterns of this engine are not supported by Redshift. The AWS documentation describes them in detail.
Redshift supports both a data warehouse and a data lake approach. Redshift allows local storage in cluster nodes, which makes this model consistent with a data warehouse approach, where data is processed and stored internally according to a predefined structure. However, Redshift also allows for a data lake model, where it accesses data stored externally, such as in S3 or even RDS databases. Redshift interacts with a data catalog, which can be based on the Amazon Athena interactive query service, the AWS Glue serverless data integration service, or the EMR Hive data warehouse and analytics package to access these data sources. external data. Redshift’s federated query feature allows it to connect to RDS-managed databases using Aurora MySQL, MySQL, and PostgreSQL engines.
Both RDS and Redshift help solve complex problems for application owners, especially when it comes to infrastructure provisioning and data storage management. These are different services that solve different problems. When applied to the right use case, both platforms are powerful and should be considered part of the toolset for deploying modern applications in the cloud.