Cloud Data Warehouse

Lakshman Raja
5 min readJul 9, 2021

--

Cloud Data Warehouses (CDW) were designed from the ground up to handle the high-velocity, large-scale data sets that are the hallmark of today’s data landscape. Cloud Data Warehouses are different from the previous generation
of on-premise technologies:

Inexpensive Storage: You can store all your data for a fraction of the cost. You no longer have to choose what to put into the data warehouse

Elastic and On-Demand Compute: Forget long-running batch jobs forever

Support for SQL and Analytics: Cloud technology can not only support SQL based technologies, but also powerful Spark clusters for analytics

No Management and Upgrade Cycles: Cloud-based service eliminates the need to worry about technology refreshes and upgrade cycles

Let us explore top 5 cloud offerings in this article.

AWS Redshift

The first widely adopted cloud data warehouse for many years. Data warehousing was only available as an on-premise solution. Then in November 2012 Amazon Web Services (AWS) launched Redshift. Although not the first cloud data warehouse, it was the first to gain market share through adoption. Redshift’s SQL dialect is based on PostgreSQL, which is well understood by analysts worldwide, and uses an architecture familiar to many on-premises data warehouses users.

Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It can start with as little as a few gigabytes of data and scale to petabytes. This empowers you to acquire new insights from your business and customer data. The first step to creating a Redshift data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After provision the cluster, upload data set and then perform data analysis queries. Regardless of the size of your data set, Amazon Redshift delivers fast query performance using familiar SQL-based tools and business intelligence applications.

AWS Redshift architecture

Azure Synapse

Azure Synapse Analytics is a newer analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data using either serverless on-demand or provisioned resources. Azure Synapse offers a unified experience to ingest, prepare, manage, and serve data for your business intelligence (BI) and machine learning (ML) needs.

At the heart of Azure Synapse is a cloud-native, distributed SQL processing engine. It’s built on the foundation of SQL Server to drive your most demanding enterprise data warehousing workloads. Similar to other cloud MPP solutions, Azure SQL Data Warehouse (SQL DW) separates storage and compute, billing for each separately. Azure Synapse saves relational tables data with columnar storage and abstracts physical machines by representing compute power in the form of data warehouse units (DWUs). This allows users to easily and seamlessly scale compute resources at will.

Synapse Analytics aims to unify a range of analytics workloads, such as data warehouses, data lakes, and ML, in a singular user interface (UI). The combination of an SQL Engine, Apache Spark with Azure Data Lake Storage (ADLS), and Azure Data Factory gives users the option to control both data warehouse/data lakes and data preparation for ML tasks. Azure Synapse allows for both vertical and horizontal scaling of the data warehouse. Vertically by changing the service tier or placing the database in an elastic pool. Horizontally by adding more data warehouse units.

Azure Synapse architecture

Google BigQuery

BigQuery is a fully managed, serverless data warehouse that automatically scales to match storage and computing power needs. BigQuery is a columnar and ANSI SQL database that can analyze terabytes to petabytes of data at incredible speeds. BigQuery also supports geospatial data analysis using familiar SQL with BigQuery GIS. In addition, easy to build and operationalize ML models on large-scale structured or semi-structured data using simple SQL with BigQuery ML. It supports real-time interactive dashboarding with BigQuery BI Engine.

The BigQuery architecture is composed of several components.

  • Borg is the compute
  • Colossus is the distributed storage
  • Jupiter is the network
  • Dremel is the execution engine

Google doesn’t expect us to manage infrastructure which is why BigQuery hides many of the underlying hardware, database, nodes, and configuration details. Its elasticity automatically works out of the box. And getting started is simply a matter of creating an account with Google Cloud Platform (GCP), loading a table, and running a query. Google takes care of the rest.

GCP BigQuery architecture

Snowflake

Snowflake is a fully managed MPP cloud data warehouse that runs on AWS, GCP, and Azure. Snowflake users can spin up as many as virtual warehouses, which helps to parallelize and isolate the performance of individual queries. Snowflake enables very high concurrency by separating storage and compute to ensure that many warehouses can simultaneously access the same data source. Snowflake’s data warehouse can be accessible through a web browser, the command line, an analytics platform, or via Snowflake’s ODBC, JDBC, or other supported drivers. The platform supports ACID-compliant relational processing and has native support for document store formats such as JSON, Avro, ORC (Optimized Row Columnar), Parquet, and XML.

Snowflake architecture

Teradata Vantage

Teradata Vantage is flagship analytic platform offering, which evolved from our industry-leading Teradata Database. Until references in content are updated to reflect this change, the term Teradata Database is synonymous with Teradata Vantage. Teradata Vantage is available in AWS, Azure and GCP platforms marketplace and their own Teradata cloud offering as well.

Advanced SQL Engine is a core capability of Teradata Vantage, based on our best-in-class Teradata Database. Advanced SQL refers to the ability to run advanced analytic functions beyond that of standard SQL.

Teradata Vantage delivers the best analytic functions and engines, preferred tools and languages, and support of multiple data types, which are supported by an industry-leading database.

  • A flexible, adaptive architecture that minimizes unnecessary movement of data and aligns to needs of users
  • Connects to customer ecosystems and runs in hybrid multi-cloud environments
  • Incorporates additional tools, languages and engines
Teradata Vantage architecture

--

--

Lakshman Raja
Lakshman Raja

Written by Lakshman Raja

Seasoned Data Management Professional

No responses yet