Comparing Data Solutions on AWS and GCP in 2021 (2022?), part 2
I have been putting off writing this. Mostly because I wrote part 1 shortly after passing the GCP Data Engineer and AWS Data Analytics certifications, where I noted that “I expect this to be outdated really soon”. Sure enough, AWS re:Invent 2021 rolled around and brought a host of new things to the fore.
The most anticipated of these is the public preview of Redshift Serverless (a great writeup here), which takes aim at the biggest advantage of GCP’s BigQuery (and Azure’s equivalent). Reading the current documentation for Redshift still brings up maintenance tasks like vacuuming tables, a concept that feels foreign today (when did you last have to defragment your personal laptop?).
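To make that concrete, here is a minimal sketch of what scripting that housekeeping might look like via the Redshift Data API (the cluster name, database, user, and the sales table are all placeholders); it is exactly the kind of chore a serverless warehouse is supposed to take off your plate:

```typescript
import { ExecuteStatementCommand, RedshiftDataClient } from "@aws-sdk/client-redshift-data";

const client = new RedshiftDataClient({});

// Submit a VACUUM as an asynchronous statement; the Data API returns an Id
// that can be polled later with DescribeStatement.
async function vacuumTable(): Promise<void> {
  const { Id } = await client.send(
    new ExecuteStatementCommand({
      ClusterIdentifier: "analytics-cluster", // placeholder cluster
      Database: "dev",                        // placeholder database
      DbUser: "admin",                        // placeholder user
      Sql: "VACUUM FULL sales;",              // placeholder table
    })
  );
  console.log(`Submitted maintenance statement ${Id}`);
}

vacuumTable().catch(console.error);
```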
The BigQuery Juggernaut
I mentioned in part 1 that dev teams today rarely hire for database administration skillsets. While to the layman a “data engineer” might sound like someone who has exactly that skillset, it is even less true here. The value of DataOps to an organization is in providing timely, quality data, not in administering its infrastructure, which is why BigQuery, which has made waves since launch by defaulting to a serverless, pay-per-query model (it does offer the ability to provision capacity as well), is GCP’s flagship product.
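To illustrate what pay-per-query means in practice: with the BigQuery Node.js client you can dry-run a query to see how many bytes it would scan, which is what on-demand pricing bills on, before committing to it. A minimal sketch (the public dataset below is just an example):

```typescript
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// A dry run validates the query and reports the bytes it would scan,
// without executing it or incurring any cost.
async function estimateScan(query: string): Promise<void> {
  const [job] = await bigquery.createQueryJob({ query, dryRun: true });
  const bytes = Number(job.metadata.statistics.totalBytesProcessed);
  console.log(`Query would scan ~${(bytes / 1024 ** 3).toFixed(2)} GiB`);
}

estimateScan(
  "SELECT name, SUM(number) AS total FROM `bigquery-public-data.usa_names.usa_1910_2013` GROUP BY name"
).catch(console.error);
```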
BigQuery’s market share is large relative to the rest of GCP, and it is a common pattern for companies to run operations on AWS and “the AI and data stuff” on GCP. Encouraging multi-cloud adoption is a cornerstone of GCP’s strategy, and BigQuery is a big part of this. Meanwhile, you will not find any reference to “Google” in AWS other than as a social identity provider.
In data analysis, less is more
The AWS Data Analytics Specialty exam tests you on the many different ways you can perform data analytics (and this is before you even get to the decisions about data processing, in purple below).
After passing it, I moved on to the GCP Data Engineer practice exam and laughed out loud at getting the very first answer wrong: a question about files in JSON format came back with the adamant feedback that “you should not use Cloud Storage for this scenario, it is cumbersome and doesn’t add value”:
It’s very clear from the GCP syllabus that there is one preferred tool you should use to do analytics: BigQuery (with some references to Dataproc, GCP’s Hadoop equivalent).
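In that spirit, the “BigQuery-first” answer to a pile of JSON files is usually to load them straight into a table with schema autodetection rather than parking them somewhere else for processing. A rough sketch with the Node.js client, using placeholder dataset, table, and file names:

```typescript
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();

// Load newline-delimited JSON into a table, letting BigQuery infer the schema.
async function loadJson(): Promise<void> {
  const [job] = await bigquery
    .dataset("analytics") // placeholder dataset
    .table("events")      // placeholder table
    .load("./events.json", {
      sourceFormat: "NEWLINE_DELIMITED_JSON",
      autodetect: true,
    });
  console.log(`Load job ${job.id} completed`);
}

loadJson().catch(console.error);
```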
GCP Data Studio is also a free, adequate BI solution that complements BigQuery very well, which makes AWS QuickSight a puzzling offering (I don’t personally know anyone who uses it). QuickSight is unique among AWS offerings in pricing by the number of authors and readers, as well as by cache storage and anomaly metrics.
In summary, BigQuery’s default pay-as-you-go pricing and Data Studio being free are a wombo-combo that has gained it a lot of traction, which is why I agree with Forrester’s evaluation putting BigQuery’s strategy at the head of the pack.
Serverless solutions will become a default in data analytics
Data warehouses aside, both AWS and GCP now offer serverless data processing options, with each providing a serverless take on the popular Apache Spark (AWS Glue / GCP Serverless Spark).
GCP also provides serverless options for Apache Beam (Dataflow) and Apache Airflow (Cloud Composer, which is technically more “fully managed” than “serverless”), while AWS supports serverless Apache Flink in Kinesis Data Analytics, as well as a SQL flavour of KDA that is really simple to use.
So, AWS or GCP?
I personally prefer AWS in most cases for its suite of database products, with DynamoDB a personal favourite (GCP went the partnership route with Elasticsearch, and I won’t take sides on the underlying drama). For data warehousing, however, BigQuery is the clear winner.
I will say, though, that GCP’s developer experience is the best among all cloud providers, with a (free!) Cloud Shell built into console.cloud.google.com, and clear documentation with multi-language tabbed examples and inline API demos.
AWS’ documentation is a lot poorer. Looking up how to use the batch function in DynamoDB went like this:
- Google it and realize the official documentation refers to the old NodeJS API
- Go to the “latest” documentation (URL: https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB.html) and see a message saying this is actually also for the older version
- Click on the link to the new version
- CTRL+F and find nothing that says “batch”
- Notice a tiny magnifying glass on the page, and use that to search “batch”
- Arrive, finally, on the documentation, which is devoid of any examples and requires more clicking to get to the definitions of BatchGetCommandInput and __HttpHandlerOptions (a sketch of what I was after is below).
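For the record, here is roughly what I was looking for: a minimal sketch of a batch read with the v3 document client, against a placeholder Orders table with a pk/sk key schema:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { BatchGetCommand, DynamoDBDocumentClient } from "@aws-sdk/lib-dynamodb";

// The document client marshals plain JavaScript objects to DynamoDB attribute values.
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch up to 100 items in a single round trip.
async function batchGet(): Promise<void> {
  const { Responses } = await client.send(
    new BatchGetCommand({
      RequestItems: {
        Orders: {
          // Placeholder keys; a real call should also handle UnprocessedKeys.
          Keys: [
            { pk: "user#1", sk: "order#42" },
            { pk: "user#2", sk: "order#7" },
          ],
        },
      },
    })
  );
  console.log(Responses?.Orders);
}

batchGet().catch(console.error);
```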
In summary
Many may disagree with me, but a “cloud-agnostic” approach is harder and harder to justify as the baseline of cloud providers’ data solutions improves; a cloud-provider-native or multi-cloud approach should be the default for most teams.
Only organizations whose data team(s) are a profit center, or where data analysis is a key differentiator, will be able to sustain the talent pool necessary to run data processing on self-managed servers; otherwise, the engineers maintaining their data workloads will likely leave for organizations that do fit those criteria. The same has long been true for databases. As I mentioned in part 1, though, this is something to celebrate, as the productivity possible in data engineering has never been higher.