The Full Stack Data Dev

Google Pipe Syntax: Cleaner SQL

Introduction

This blog is intended to highlight the benefits of Google’s Pipe syntax SQL and convince users for adoption. If you don’t use BigQuery (or another platform that supports it, like Apache Spark), then consider this blog as some mental exercise on why pipes (functional programming) is a great addition to SQL. So that the comparison with Standard SQL is mostly fair, I made a deliberate attempt to avoid contrived examples that overcomplicate standard SQL. In fact, most examples are based on production code written by past me and/or my colleagues.

If you’re not familiar with Pipe syntax, you may consider starting from a later section to see some simple code side-by-side with Google standard SQL. Here are some more comprehensive resources if you find the examples here insufficient:

If I had to summarise the benefits of Pipe syntax SQL in a few bullet points, it would be:

Disclaimer

There are some known limitations for Pipe Syntax at time of writing, the big one for me being:

You can’t use a named window in pipe syntax.

Named windows are extremely convenient for analytics processing, and I often resort to using standard SQL in a new CTE specifically for window functions.

Finally, even if you’re convinced Pipe syntax is our lord and saviour, you don’t necessarily have to convince your company to rewrite the entire codebase in Pipe syntax. Adopting a new syntax/language is not a trivial decision for a team and requires upskilling for the team (though IMHO pipe syntax is easier to learn than standard SQL).

Game-changer improvements

Easier to transform arrays of structs

Consider a table with an array of structs as follows.

Rowpatient_idhistory.weighthistory.height
11701.75
711.75
22641.65

 

SQL definition of table
-- 1 row = 1 patient with an array of historical weight/height measurements
WITH existing_logic AS (
 SELECT
   1 AS patient_id,
   [
     STRUCT(70 AS weight, 1.75 AS height), -- weight:kg, height:m
     STRUCT(71 AS weight, 1.75 AS height)
   ] AS history
 UNION ALL
 SELECT
   2 AS patient_id,
   [
     STRUCT(64 AS weight, 1.65 AS height)
   ] AS history
)

SELECT * FROM existing_logic

Suppose we are now tasked to augment the history array by adding (BMI = weight / height^2) for each historical record. E.g. The target state is as follows.

Rowpatient_idhistory.weighthistory.heighthistory.bmi
11701.7522.86
711.7523.18
22641.6523.51

There are many ways to achieve this in standard SQL, but we can broadly categorise them into 2 approaches:

  1. Manipulate the arrays in place using a subquery.
  2. Unnest the arrays in top-level CTEs, perform the transform, and re-aggregate.

Let’s look at some example implementations using standard SQL.

Standard SQL (Approach 1)

WITH existing_logic AS (
  -- Existing transformations in model
),
bmi AS (
  SELECT
    * REPLACE (
      ARRAY((                        -- This is the start of the subquery
        SELECT AS STRUCT
          *,
          weight/POWER(height, 2) AS BMI,
        FROM UNNEST(history)
      )) AS history                  -- This is the end of the subquery
    )
  FROM existing_logic
),
more_logic AS (
  -- Further transformations in model
)

Standard SQL (Approach 2)

WITH existing_logic AS (
  -- Existing transformations in model
),
bmi_unnest AS (
  SELECT
    existing_logic.* REPLACE(h AS history),
    OFFSET
  FROM existing_logic, UNNEST(history) AS h WITH OFFSET
),
bmi_compute AS (
  SELECT
    * REPLACE(
      (SELECT AS STRUCT
        history.*,
        history.weight / POWER(history.height, 2) AS BMI
      ) AS history
    )
  FROM bmi_unnest
),
bmi_reagg AS (
  SELECT
    * EXCEPT(OFFSET, history),
    ARRAY_AGG(history ORDER BY OFFSET) AS history
  FROM bmi_compute
  GROUP BY ALL
),
more_logic AS (
  -- Further transformations in model
)

Despite the first approach appearing shorter, it is typically considered more difficult to read since the reader needs to identify the start and end of subqueries among existing CTEs. This difficulty is greatly exacerbated by any logic that is more involved than a simple example computation and requires CTEs for sequential transformations.

The second approach is easier to read since all the CTEs are top-level and the reader doesn’t need to identify the start and end of nested subqueries. However, this improved readability comes at a cost:

Consider now the Pipe Syntax variant.

Pipe Syntax

WITH existing_logic AS (
  -- Existing transformations in model
),
bmi AS (
  FROM existing_logic
  |> SET
      history = ARRAY((
          FROM UNNEST(history)
          |> EXTEND weight/POWER(height, 2) AS BMI
          |> SELECT AS STRUCT *
      ))
),
more_logic AS (
  -- Further transformations in model
)

Let’s see how it addresses some of the concerns from the standard SQL variants:

More powerful generic tests in dbt

See this later article.

Small conveniences

Easier to identify references to previous CTEs

Consider the following example where cte3 references cte1. In Standard SQL, the declaration cte3 AS can be many lines apart from the reference FROM cte1 if the SELECT list is long. In Pipe syntax, the reference to cte1 is always within a few lines of cte3 AS.

Standard SQL

WITH
cte1 AS (
  SELECT
    ...
  FROM dataset.table
),

cte2 AS (
  SELECT
    ...
  FROM dataset.table2
),

cte3 AS (
  SELECT
    ...
  FROM cte1 -- this can be screens apart from `cte3 AS (`
)

SELECT
  ...
FROM cte3
JOIN cte2 USING(...)

Pipe Syntax

WITH
cte1 AS (
  FROM dataset.table
  |> SELECT
      ...
),

cte2 AS (
  FROM dataset.table2
  |> SELECT
      ...
),

cte3 AS (
  FROM cte1 -- this appears immediatley below `cte3 AS (`
  |> SELECT
      ...
)

FROM cte3
|> JOIN cte2 USING(...)
|> SELECT
    ...

Easier to identify changes in table cardinality (row-count)

Consider the following example where the table undergoes several transformations, each of which affect the final row count. In Standard SQL, the cardinality of the table changes 4 times due to the clauses UNNEST, WHERE, GROUP BY, and HAVING.

Standard SQL

SELECT
  dim_1,
  dim_2,
  COUNTIF(some_criteria) AS metric_1,
  ARRAY_AGG(some_expression) AS metric_2,
FROM dataset.table, UNNEST(array) AS a
WHERE some_filter
GROUP BY dim_1, dim_2
HAVING metric_1 > 0

Pipe Syntax

FROM dataset.table, UNNEST(arr) AS a
|> WHERE some_filter
|> AGGREGATE
    COUNTIF(some_criteria) AS metric_1,
    ARRAY_AGG(some_expression) AS metric_2,
  GROUP BY
    dim_1,
    dim_2,
|> WHERE metric_1 > 0

Some pain points in Standard SQL and how Pipe syntax helps:

Easier multi-level aggregation

In Standard SQL, each level of aggregation needs a new CTE (or god forbid a nested subquery). In Pipe syntax, each additional level of aggregation is just another chain of |> AGGREGATE … GROUP BY … clause.

The appearance of AGGREGATE also gives immediate notice to the reader that the following logic is focused on aggregation. In Standard SQL, one would have to:

  1. Read through a large list of dimensions until an aggregation function appears without an OVER clause, or
  2. Scroll to the end of the CTE and check for a GROUP BY clause.

Standard SQL

WITH
date_OnD AS (
  SELECT
    Date,
    OCanonical,
    DCanonical,
    
    COUNT(1) AS metric1,
    SUM(value) AS metric2,
  FROM dataset.table
  GROUP BY Date, OCanonical, DCanonical
)

SELECT
  Date,
  
  SUM(metric1) AS metric1,
  SUM(metric2) AS metric2,
  
  ARRAY_AGG(STRUCT(
    OCanonical,
    DCanonical,
    metric1,
    metric2
  ) LIMIT 10) AS Examples
FROM date_OnD
GROUP BY Date

Pipe Syntax

FROM dataset.table
|> AGGREGATE
    COUNT(1) AS metric1,
    SUM(value) AS metric2,
  GROUP BY
    Date, OCanonical, DCanonical
|> AGGREGATE
    SUM(metric1) AS metric1,
    SUM(metric2) AS metric2,
    
    ARRAY_AGG(STRUCT(
      OCanonical,
      DCanonical,
      metric1,
      metric2
    ) LIMIT 10) AS Examples
  GROUP BY
    Date

Easier EXCEPT and REPLACE

The equivalent of SELECT * EXCEPT () and SELECT * REPLACE () are now |> DROP and |> SET, respectively. Some niceties:

Standard SQL

SELECT
  * EXCEPT (
    drop_1,
    drop_2
  ) REPLACE(
    UDF1(col_1) AS col_1,
    STRUCT(
      DATE(col_2.Timestamp) AS Date,
      col_2.Uid AS Aqid
    ) AS col_2
  )
FROM dataset.table

Pipe Syntax

FROM dataset.table
|> DROP
    drop_1,
    drop_2,
|> SET
    col_1 = UDF1(col_1),
    col_2 = STRUCT(
        DATE(col_2.Timestamp) AS Date,
        col_2.Uid AS Aqid
      ),

Easier to manage sequential changes in column definitions

Following on the discussion of REPLACE and SET above, there’s a common pattern where data needs to be processed one column at a time. The brevity of the code speaks for itself.

Standard SQL

WITH
raw AS (
  SELECT
    col_1,
    col_2,
    col_3,
  FROM dataset.table
),

new_col1 AS (
  SELECT
    * REPLACE (
      UDF1(col_1) AS col_1
    )
  FROM raw
),

new_col2 AS (
  SELECT
    * REPLACE (
      UDF2(col_1, col_2) AS col_2 -- depends on new col_1
    )
  FROM new_col2
),

new_col3 AS (
  SELECT
    * REPLACE (
      UDF3(col_2, col_3) AS col_3 -- depends on new col_2
    )
  FROM new_col2
)

SELECT * FROM new_col3

Pipe Syntax

FROM dataset.table
|> SET
    col_1 = UDF1(col_1)
|> SET
    col_2 = UDF2(col_1, col_2) -- depends on new col_1
|> SET
    col_3 = UDF3(col_2, col_3) -- depends on new col_2

Cleaner Aggregation

In Standard SQL, dimensions and metrics can be in arbitrary order in the SELECT statement and the column names need to be manually duplicated in the GROUP BY statement if not using GROUP BY 1, 2 ... or GROUP BY ALL.

When developing code, this becomes time consuming when:

In Pipe syntax, the separation of metrics and dimensions is demarcated by the GROUP BY keywords. Renaming of grouping columns becomes trivial.

Standard SQL

WITH
group_by_column_example AS (
  SELECT
    dim_1,
    dim2 AS dim_2,

    COUNT(1) AS metric,
  FROM dataset.table
  GROUP BY
    dim_1,
    dim_2
),

group_by_all_example AS (
  SELECT
    dim1,
    
    COUNT(1) AS metric1,
    metric2, -- Forgot to aggregate this metric
  FROM dataset.table
  GROUP BY ALL
)

Pipe Syntax

WITH
group_by_column_example AS (
  FROM datset.table
  |> AGGREGATE
      COUNT(1) AS metric,
    GROUP BY
      dim1,
      dim2 AS dim_2, -- Column alias + trailing comma. Awesome!
),

group_by_all_example AS (
  FROM dataset.table
  |> AGGREGATE
      COUNT(1) AS metric1,
      metric2, -- BQ Engine will remind you to aggregate this
    GROUP BY
      dim1
)

Closing thoughts

I think Google Pipe syntax greatly improves the ergonomics of reading and writing SQL. Good developer ergonomics = faster development and/or more robust systems. I believe there’s a lot of potential for other SQl dialects to adopt pipe syntax, and I look forward to the day where it becomes ubiquitous.

Data quality testing in dbt