I passed both the AWS Data Analytics Specialty and GCP Data Engineer Professional certifications recently, and have held both AWS/GCP Professional Architect certs as well. Choosing between cloud providers is an important task, with their data capabilities often being a critical factor, so I wanted to jot down my thoughts. Things also move so fast I thought it would be fun to look back at this two years from now, as I expect this to be outdated really soon.
Choice or focus?
AWS tends to provide more comprehensive options, while GCP provides more streamlined ones. Some of this comes from AWS being the first mover, while some of it seems to reflect a deliberate focus by GCP.
Take messaging solutions for example. I have this mental “decision tree” when architecting messaging on AWS:
There are also cost and scalability considerations not covered in the above, e.g. Kinesis Data Streams has 2MB/s of read throughput per shard, SNS FIFO topics do 10MB/s, and standard SNS topics go much higher because they are unordered. All of this is important to know for the exam.
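As a rough illustration of the kind of throughput arithmetic this involves, here is a minimal sketch of shard sizing for Kinesis Data Streams. The 2MB/s read figure is the one quoted above; the 1MB/s and 1,000 records/s write limits per shard are my understanding of the published quotas and should be checked against current AWS documentation. The function name and parameters are hypothetical.

```python
import math

# Per-shard limits (assumptions to verify against the current AWS quotas page).
WRITE_MB_PER_SHARD = 1.0         # MB/s ingested per shard
WRITE_RECORDS_PER_SHARD = 1000   # records/s ingested per shard
READ_MB_PER_SHARD = 2.0          # MB/s read per shard, shared by all consumers


def shards_needed(write_mb_s: float, records_s: float, consumers: int = 1) -> int:
    """Return the smallest shard count that satisfies every per-shard limit."""
    by_write_bytes = write_mb_s / WRITE_MB_PER_SHARD
    by_write_records = records_s / WRITE_RECORDS_PER_SHARD
    # Without enhanced fan-out, each consumer reads the whole stream out of
    # the same per-shard read budget, so read demand grows with consumers.
    by_read_bytes = write_mb_s * consumers / READ_MB_PER_SHARD
    return max(1, math.ceil(max(by_write_bytes, by_write_records, by_read_bytes)))


# 4 MB/s in, 500 records/s, three consumers: the read budget is the bottleneck.
print(shards_needed(write_mb_s=4, records_s=500, consumers=3))
```

The point is less the exact numbers than the shape of the decision: on AWS, the number of consumers and the per-shard limits feed directly into the design.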
On GCP, the way one weighs messaging is very different, with no need to consider minutiae like throughput and whether there are multiple consumers:
In 2019 I was developing a greenfield product that had to be built on GCP, and I found Cloud Pub/Sub fairly limited, sorely missing key features like:
- Ordering (!)
- Dead-letter queues
- Message filtering
- Retry policies
Fast-forward two years and Cloud Pub/Sub now has all of the above. I would now call it a streamlined choice, not a limited one.
A similar difference is seen in the managed database services that are available. My thought process for AWS looks like this:
For GCP, the options are again more limited:
AWS has more choice, but only one truly “AWS-native” database: DynamoDB. Their other managed database offerings are based on (or compatible with) popular open-source software:
- Aurora/RDS (SQL databases like MySQL, Postgres, MSSQL, and Oracle)
- OpenSearch/ElasticSearch (document search)
- Neptune (graph database)
- DocumentDB (MongoDB compatibility/document store)
- Keyspaces (Cassandra compatibility/wide-column)
For Google Cloud, it’s the other way around, with three out of their four database offerings being “GCP-native”: BigTable (wide-column), Cloud Spanner (globally-available SQL) and Firestore (dev-friendly document database).
Unlike for messaging though, the options available in AWS give it a strong lead.
Why database offerings matter
Team Topologies is a default for modern teams, with the overarching principle of “You build it, you run it” and a focus on stream-aligned teams (aka product teams) that deliver faster.
However, the capability of stream-aligned teams matters. If we go back 15 years in time when everything was on-premise, the separation of responsibilities would look like this, with:
- Frontend and backend developers only developing APIs and applications
- System and database administrators in charge of VMs and databases
This division is because of the complexity of administering VMs and databases — a typical developer may know how to use the database to make queries and conceptually understand the importance of read-replication or scaling for performance, but is not able to administer the intricacies of setting up binary logs, configuring secure master-slave communication, monitoring replication drift, etc.
Putting database and VM administration under a “platform team” leads to communication bottlenecks, e.g. every column and config change requiring approval. I had considerable anxiety running an ElasticSearch cluster on Google Cloud VMs with a small startup team, as the relatively junior team had to learn both how to use ElasticSearch as well as how to administer its master nodes and shards, taking away valuable development time to establish runbooks and practice scaling the cluster up and down.
Managed databases save considerable dev-hours otherwise spent on internal database workings, making it possible to abstract details like read-replication, cluster configuration, and instance count. This makes “You build it, you run it” much easier to achieve, and frees platform teams up for higher-value work, like administering a Kubernetes cluster or an SSO platform.
“I don’t like that we get locked-in with managed cloud databases”, a colleague said to me once. To that I would say that avoiding lock-in is rarely a business goal; time to value usually is. The alternatives are:
- Building up a 24/7 operations team with expertise in the databases an organization will use, and self-managing the cluster. This can be preferable if the organization has an overarching use-case that is solved with a few database solutions; however, it is a choice against Sacrificial Architecture, the notion that we should never be too attached to an individual service as that stifles innovation.
- Going with a vendor platform like Confluent, Elastic, or Neo4j — which brings in more expertise, but you are still “locked in” to a non-cloud vendor instead.
For the record, I dislike that “DevOps” has (d)evolved to mean “Ops team, just with more work” in many companies (take note, recruiters!) — it originally meant a culture of marrying process and (cloud) technology to achieve new levels of productivity.
I especially despise people who crow “everyone needs to know the internals!”; it doesn’t reflect the realities of delivery. It is good that teams today are empowered to launch managed database services into production with a reasonable level of reliability, and delve into the internals as needed, all while sitting on the same support plan and security model of the cloud provider.
Why are DynamoDB and Firebase/Firestore popular?
AWS DynamoDB and Google Cloud’s Firebase/Firestore are the only cloud provider-native databases to appear in Stack Overflow’s 2021 survey of the most popular database technologies. (“Firebase” is mentioned in the survey, but technically Firebase is GCP’s BaaS solution aimed at mobile developers, while Firestore is its data layer.) It’s no accident that both are NoSQL and, more importantly, both are serverless, dynamically scaling databases.
Newcomers to NoSQL usually think that having unstructured data is the goal: “A-ha! Now I don’t have to predefine column types!”. But with the advent of JSON columns in popular SQL databases like Postgres and MySQL (with indexing, even), you eventually realize the real problem being solved is scalability. A datastore that needs to always be consistent can’t scale as readily as one where consistency is optional (most NoSQL databases today provide the option of strong consistency).
This seminal talk by Rick Houlihan introduces important principles behind NoSQL in its first half and accelerates into a mind-blowing showcase of how to model a DynamoDB database:
In short, a NoSQL database is actually less flexible than a SQL one, because it relies on partitioning/sharding its data set across storage for performance. This means traditional joins are not possible (or are done in the application layer), and ad hoc queries are difficult. Some NoSQL databases don’t allow scans (queries without indexes) at all.
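To make the partitioning point concrete, here is a toy sketch of a hash-partitioned key-value store. It is not how any particular database is implemented; `PartitionedStore`, `NUM_PARTITIONS`, and the MD5-based hash are all assumptions chosen for illustration. It shows why a lookup by key is cheap (one partition) while an ad hoc query has to visit every partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; managed services pick and change this for you


def partition_for(key: str) -> int:
    """Map a partition key to a partition using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class PartitionedStore:
    """Toy key-value store that counts how many partitions each call touches."""

    def __init__(self) -> None:
        self.partitions = defaultdict(dict)  # partition id -> {key: item}
        self.partitions_touched = 0

    def put(self, key: str, item: dict) -> None:
        self.partitions[partition_for(key)][key] = item

    def get(self, key: str):
        # A lookup by key goes straight to the one partition that owns it.
        self.partitions_touched += 1
        return self.partitions[partition_for(key)].get(key)

    def scan(self, predicate) -> list:
        # A query that is not driven by the key must visit every partition.
        results = []
        for pid in range(NUM_PARTITIONS):
            self.partitions_touched += 1
            results.extend(i for i in self.partitions[pid].values() if predicate(i))
        return results
```

A join is essentially the scan case twice over, which is why it usually ends up in the application layer or is designed away in the data model.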
In a separate talk, Houlihan introduces the PIE theorem as an alternative to the older CAP theorem. The CAP theorem says that you can only choose two of Consistency, Availability, and Partition Tolerance:
The CAP theorem is useful to know, but doesn’t always come up in everyday decisions as it represents internal workings of a database.
The PIE theorem is much more relevant in the choices we make when selecting a datastore, asking us to select two of Pattern flexibility, Infinite scale, and Efficiency.
Pattern flexibility + Efficiency is typically the domain of SQL databases, with some NoSQL databases like Elasticsearch offering complex search capability, but no JOINs. Data warehouses like BigQuery and Redshift couple pattern flexibility with infinite scale instead, but your analytics queries are never going to be in the millisecond range.
The default approach for most is therefore to pair a P+I database with a P+E one, covering the triangle. Or in other words, pairing an OLAP with an OLTP database. This works (and can work for a long time) until either of two things happens that brings the need for an I+E datastore to the fore:
- A small amount of data needs to be accessed very frequently
- Data needs to be accessed with unpredictable frequency, and auto- and dynamic scaling is desirable without worrying about how and when to scale out instances.
The popularity of DynamoDB and Firestore is therefore not surprising.
For Firestore, the absence of pattern flexibility does mean that you will be caught out if you don’t understand it mainly stores documents and collections of documents. And while it scales dynamically, it has strict limits and quotas which may surprise, including the notorious one-write-per-second limit on individual documents (don’t use Firestore as a counter unless you read this):
I don’t like Firestore for this reason, and it tends to be more vague around limitations while DynamoDB is more precise and deterministic about its capacity. Firestore is also explicitly aimed at startups, oddly limiting every GCP account to one Firestore instance, and being integrated into the wider Firebase BaaS suite.
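The usual workaround for the one-write-per-second document limit mentioned above is a sharded (distributed) counter: spread the writes over several documents and sum them on read. The sketch below is an in-memory stand-in only, with a plain list playing the role of the shard documents; `ShardedCounter` is a hypothetical name and no Firestore API is used.

```python
import random


class ShardedCounter:
    """In-memory model of a counter spread over several shard documents."""

    def __init__(self, num_shards: int, rng=None) -> None:
        self.shards = [0] * num_shards       # one integer per shard "document"
        self.rng = rng or random.Random()

    def increment(self, amount: int = 1) -> None:
        # Each increment writes to one randomly chosen shard, so the write
        # load on any single document is roughly 1/num_shards of the total.
        idx = self.rng.randrange(len(self.shards))
        self.shards[idx] += amount

    def total(self) -> int:
        # Reading the counter means reading and summing every shard.
        return sum(self.shards)


counter = ShardedCounter(num_shards=10, rng=random.Random(0))
for _ in range(1000):
    counter.increment()
print(counter.total())
```

The trade-off is that reads get more expensive as shards are added, and the number of shards has to be chosen up front based on the expected write rate.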
DynamoDB meanwhile is widely used internally at AWS, and has welcome cloud-native features like change streams, TTL, global replication etc. It also supports either provisioned capacity with some autoscaling, or totally on-demand capacity (which costs about 6x more).
However, it takes “pattern inflexibility” + “efficiency” to an extreme, with its own unique design patterns that rely on the application to stitch queries together, covered in the video mentioned above by Rick Houlihan.
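As a small taste of what those design patterns look like, here is a toy model of the composite-key idea: related items share a partition key (PK) and are distinguished by a sort key (SK), so one query by PK, optionally narrowed by an SK prefix, returns a "pre-joined" set. This is an in-memory sketch under my own naming (`SingleTable`, the `CUSTOMER#`/`ORDER#` prefixes), not the DynamoDB API.

```python
from collections import defaultdict


class SingleTable:
    """Toy table keyed by partition key (PK) and sort key (SK)."""

    def __init__(self) -> None:
        self._items = defaultdict(dict)  # PK -> {SK: attributes}

    def put(self, pk: str, sk: str, **attrs) -> None:
        self._items[pk][sk] = attrs

    def query(self, pk: str, sk_prefix: str = "") -> list:
        # Only one partition key is ever touched; results come back in
        # sort-key order, optionally filtered by a sort-key prefix.
        rows = self._items.get(pk, {})
        return [(sk, rows[sk]) for sk in sorted(rows) if sk.startswith(sk_prefix)]


table = SingleTable()
table.put("CUSTOMER#42", "PROFILE", name="Ada")
table.put("CUSTOMER#42", "ORDER#2021-03-09", total=75)
table.put("CUSTOMER#42", "ORDER#2021-01-05", total=20)
table.put("CUSTOMER#7", "PROFILE", name="Bob")

orders = table.query("CUSTOMER#42", sk_prefix="ORDER#")      # orders, by date
everything = table.query("CUSTOMER#42")                      # profile + orders
```

Anything that does not fit a key pattern decided up front, such as "all orders over a certain total across customers", needs another index or has to be stitched together by the application.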
I’ve argued that Firestore and DynamoDB are popular because they fulfil the need for “infinite scaling”, so Aurora Serverless deserves a special callout. It is a good lesson on the importance of reading documentation and knowing what “serverless” really means: you don’t administer servers, but don’t take “serverless = infinite scaling” as a given. Aurora Serverless does not dynamically and automatically scale if there are long-running transactions, meaning it may not scale when you need it to the most. There was an ACG course I took which called it “Aurora Infrequent Access”, which is much more apt. It is meant only for non-production workloads and gives up important features like replicas and backtracking.
BigTable and Cloud Spanner
BigTable and Cloud Spanner are the other two cloud-provider native databases in the GCP ecosystem, and were both originally for internal Google use. Unlike Kubernetes however, there hasn’t been as much public traction around these platforms.
BigTable is a scalable wide-column solution that until recently recommended a minimum of 3 nodes for a production load. Even after, a minimal 2-node setup (1 for production, 1 for dev) will cost USD1k a month, which is not trivial.
Spanner takes a complex problem, SQL-compatible consistency across regions with NoSQL-like performance, and solves it with a brilliant approach: promoting external consistency and ensuring all Google servers are in sync with “TrueTime”. External consistency is worth reading about on its own (a good SO thread here).
TrueTime enables applications to generate monotonically increasing timestamps: an application can compute a timestamp T that is guaranteed to be greater than any timestamp T’ if T’ finished being generated before T started being generated. This guarantee holds across all servers and all timestamps.
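One way to picture how a clock with bounded uncertainty can give that guarantee is the "commit wait" idea described in the Spanner paper: take the latest possible current time as the timestamp, then wait until that timestamp is definitely in the past before finishing. The sketch below is a heavy simplification, using a single simulated clock rather than many servers; `FakeTrueTime`, `epsilon`, and `tick` are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class FakeTrueTime:
    """Deterministic stand-in for a clock with bounded uncertainty."""

    def __init__(self, epsilon: float) -> None:
        self.epsilon = epsilon    # assumed worst-case clock error, in seconds
        self._true_time = 0.0     # simulated "real" time

    def now(self) -> TTInterval:
        # Real time is guaranteed to lie somewhere inside this interval.
        return TTInterval(self._true_time - self.epsilon,
                          self._true_time + self.epsilon)

    def advance(self, seconds: float) -> None:
        self._true_time += seconds


def commit(tt: FakeTrueTime, tick: float = 0.001) -> float:
    """Pick a timestamp, then wait until it is definitely in the past."""
    timestamp = tt.now().latest
    while tt.now().earliest <= timestamp:   # commit wait, roughly 2 * epsilon
        tt.advance(tick)                    # stands in for sleeping
    return timestamp


tt = FakeTrueTime(epsilon=0.004)
first = commit(tt)
second = commit(tt)   # starts only after the first has finished
```

Because the first timestamp is already in the past by the time `commit` returns, any timestamp generated afterwards is necessarily larger. The price is latency proportional to the clock uncertainty, which is why keeping that uncertainty small matters so much.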
The SQL-compatible part of Cloud Spanner is not a throwaway by the way; the language is SQL, but some features like auto-/monotonically-incrementing columns are not allowed in Spanner, just like any other NoSQL database (TLDR: Don’t expect to lift and shift an existing SQL application to Spanner without rework).
Until even more recently, Cloud Spanner had probably the highest cost barrier to entry among all cloud databases, with a single multi-region node costing more than USD2k/mo; so much so that many teams probably decided to solve the problem of multi-region consistency some other way. GCP has now reworked Spanner to come in “Processing Units”, with 1,000 PU=1 node and the minimum allocation being 100 PUs, or 1/10th the previous minimum.
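To put the new minimum in numbers, here is a back-of-the-envelope sketch. It assumes pricing scales linearly with Processing Units, which I have not verified against the current price list, and it uses the rough USD2k/mo per multi-region node figure from above. The function name is hypothetical.

```python
PU_PER_NODE = 1000   # 1,000 Processing Units = 1 node
MIN_PU = 100         # smallest allocation


def monthly_cost(processing_units: int, node_price_per_month: float) -> float:
    """Estimate monthly cost, assuming price is linear in Processing Units."""
    if processing_units < MIN_PU:
        raise ValueError(f"minimum allocation is {MIN_PU} PUs")
    return node_price_per_month * processing_units / PU_PER_NODE


# Smallest multi-region instance at roughly USD2k/mo per node.
print(monthly_cost(MIN_PU, node_price_per_month=2000.0))
```

Under those assumptions the entry point drops to about a tenth of the old one, which changes Spanner from a big up-front commitment into something a small team can trial.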
AWS has an edge in databases; but GCP isn’t stagnant
AWS’ approach of making managed or compatible versions of popular, existing databases gives it the edge in database choice, while the GCP strategy of making available its own internal database engines hasn’t taken off as quickly. More organizations absolutely need, say, managed Elasticsearch or infinite, reliable scaling like DynamoDB’s than need a single globally consistent database. My mind was equally blown when watching the video I linked above about DynamoDB design patterns as well as reading about Cloud Spanner’s approach to external consistency, but I have not had the chance to implement Spanner in any production use case.
Don’t count GCP out though. In the next article, I’ll cover why GCP has gained so much ground in data warehouse technology and how their developer experience is better in many ways. And with how rapidly things move, I expect everything here to be outdated really soon!