Introduction
This document is intended to highlight the benefits of BigQuery’s Pipe syntax SQL and to encourage its adoption. If you’re not familiar with Pipe syntax, consider starting from a later section to see some simple code side-by-side with Google standard SQL.
No contrived examples, just pure real-world use cases.
Here are some more comprehensive resources if you find the examples here insufficient:
- Official documentation for Pipe syntax:
- Conference paper introducing Pipe syntax: https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/
If I had to summarise the benefits of Pipe syntax SQL in a few bullet points, they would be:
- Improved guarantees on table cardinality
- Stronger integration with templating languages (e.g. Jinja in dbt)
- Improved readability from:
  - Fewer CTEs that are more deliberately named
  - Concise multi-level aggregation
  - Concise sequential transforms
- Improved debugging experience in BigQuery console
Game-changer improvements
Easier to transform arrays of structs
Consider a table with an array of structs as follows.
Sample table
-- 1 row = 1 patient with an array of historical weight/height measurements
WITH existing_logic AS (
  SELECT
    1 AS patient_id,
    [
      STRUCT(70 AS weight, 1.75 AS height),  -- weight: kg, height: m
      STRUCT(71 AS weight, 1.75 AS height)
    ] AS history
  UNION ALL
  SELECT
    2 AS patient_id,
    [
      STRUCT(64 AS weight, 1.65 AS height)
    ] AS history
)

SELECT * FROM existing_logic
Suppose we are now tasked with augmenting the history array by adding BMI (weight / height^2) to each historical record, i.e. the final table should appear as follows.
Target state of table
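For illustration, computed from the sample data above (BMI rounded to one decimal place), the target table would look roughly like this:

patient_id | history
1          | [(weight: 70, height: 1.75, BMI: 22.9), (weight: 71, height: 1.75, BMI: 23.2)]
2          | [(weight: 64, height: 1.65, BMI: 23.5)]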
There are many ways to achieve this in standard SQL, but we can broadly categorise them into 2 approaches:
1. Manipulate the arrays in place using a subquery.
2. Unnest the arrays in top-level CTEs, perform the transform, and re-aggregate.
Let’s look at some example implementations using standard SQL.
Standard SQL (Approach 1)
Standard SQL (Approach 2)
WITH existing_logic AS (
  -- Existing transformations in model
), bmi AS (
  SELECT
    * REPLACE (
      ARRAY((  -- This is the start of the subquery
        SELECT AS STRUCT
          *,
          weight / POWER(height, 2) AS BMI,
        FROM UNNEST(history)
      )) AS history  -- This is the end of the subquery
    )
  FROM existing_logic
), more_logic AS (
  -- Further transformations in model
)
WITH existing_logic AS (
  -- Existing transformations in model
), bmi_unnest AS (
  SELECT
    existing_logic.* REPLACE(h AS history),
    OFFSET
  FROM existing_logic, UNNEST(history) AS h WITH OFFSET
), bmi_compute AS (
  SELECT
    * REPLACE(
      (SELECT AS STRUCT
        history.*,
        history.weight / POWER(history.height, 2) AS BMI
      ) AS history
    )
  FROM bmi_unnest
), bmi_reagg AS (
  SELECT
    * EXCEPT(OFFSET, history),
    ARRAY_AGG(history ORDER BY OFFSET) AS history
  FROM bmi_compute
  GROUP BY ALL  -- Have fun debugging a SELECT * GROUP BY ALL
), more_logic AS (
  -- Further transformations in model
)
Despite appearing shorter, the first approach is typically considered more difficult to read, since the reader needs to identify the start and end of subqueries nested among existing CTEs. This difficulty is greatly exacerbated when the logic is more involved than this simple computation and requires CTEs of its own for sequential transformations.
The second approach is easier to read since all the CTEs are top-level and the reader doesn’t need to identify the start and end of nested subqueries. However, this improved readability comes at a cost:
- Unnesting the array and re-aggregating destroys guarantees on the table cardinality, making the table more fragile to future code changes. Testing the table can help, but it’s difficult to guarantee cardinality this way, assuming it’s even possible to test for it in the first place.
- The re-aggregation must be done either with a SELECT * GROUP BY ALL for brevity (but difficult to debug), or by tediously listing all the grouping columns.
- Namespace pollution in top-level CTEs.
Consider now the Pipe Syntax variant.
Pipe Syntax
WITH existing_logic AS (
  -- Existing transformations in model
), bmi AS (
  FROM existing_logic
  |> SET history = ARRAY((
       FROM UNNEST(history)
       |> EXTEND weight / POWER(height, 2) AS BMI
       |> SELECT AS STRUCT *
     ))
), more_logic AS (
  -- Further transformations in model
)
Let’s see how it addresses some of the concerns from the standard SQL variants:
- Subqueries: Pipe syntax needs only one nested subquery per array. Furthermore, complex transformations can simply be chained before the final |> SELECT AS STRUCT * instead of requiring nested CTEs.
- Table cardinality: The cardinality of the top-level table is untouched.
- Namespace pollution: A single CTE can encapsulate as many or as few transformations as deemed appropriate.
More powerful generic tests in dbt
See section below: https://rome2rio.atlassian.net/wiki/spaces/TIPS/pages/3511648257/BQ+Pipe+Syntax+SQL#metric_conserved
Small conveniences
Easier to identify references to previous CTEs
Consider the following example where cte3 references cte1.
In Standard SQL, the declaration cte3 AS can be many lines apart from the reference FROM cte1 if the SELECT list is long.
In Pipe syntax, the reference to cte1 is always within a few lines of cte3 AS.
Standard SQL
Pipe Syntax
WITH cte1 AS (
  SELECT …
  FROM dataset.table
),

cte2 AS (
  SELECT …
  FROM dataset.table2
),

cte3 AS (
  SELECT …
  FROM cte1  -- this can be screens apart from cte3
)

SELECT …
FROM cte3
JOIN cte2 USING(…)
WITH cte1 AS (
  FROM dataset.table
  |> SELECT …
),

cte2 AS (
  FROM dataset.table2
  |> SELECT …
),

cte3 AS (
  FROM cte1  -- this appears immediately below cte3
  |> SELECT …
)

FROM cte3
|> JOIN cte2 USING(…)
|> SELECT …
Easier to identify changes in table cardinality (row-count)
Consider the following example where the table undergoes several transformations, each of which affect the final row count. In Standard SQL, the cardinality of the table changes 4 times due to the clauses UNNEST, WHERE, GROUP BY, and HAVING.
Standard SQL
Pipe Syntax
SELECT
  dim_1,
  dim_2,
  COUNTIF(some_criteria) AS metric_1,
  ARRAY_AGG(some_expression) AS metric_2,
FROM dataset.table, UNNEST(arr) AS a
WHERE some_filter
GROUP BY dim_1, dim_2
HAVING metric_1 > 0
FROM dataset.table, UNNEST(arr) AS a
|> WHERE some_filter
|> AGGREGATE
     COUNTIF(some_criteria) AS metric_1,
     ARRAY_AGG(some_expression) AS metric_2,
   GROUP BY
     dim_1,
     dim_2,
|> WHERE metric_1 > 0
Some pain points in Standard SQL and how Pipe syntax helps:
- If the SELECT list is very long, the dev has to scroll down quite far from the beginning of the CTE to find out how the cardinality has changed. In Pipe syntax, the dev immediately expects a change in cardinality upon coming across one of UNNEST, |> WHERE and |> AGGREGATE. For UNNEST, the team can choose to adopt the style of using |> CROSS JOIN UNNEST to denote an explicit change in cardinality (see the sketch after this list).
- Additional mental headspace is needed to determine the order in which the table cardinality changes due to each of the 4 clauses (note that QUALIFY also affects cardinality). In Pipe syntax, the order in which cardinality changes is explicit with each chained |>. The changes in cardinality are thus sequential and can be more easily reasoned about.
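As a rough sketch of that style (reusing the hypothetical columns from the example above), each cardinality change becomes its own pipe step:

FROM dataset.table
|> CROSS JOIN UNNEST(arr) AS a   -- explicit fan-out: one row per array element
|> WHERE some_filter             -- explicit row filter
|> AGGREGATE
     COUNTIF(some_criteria) AS metric_1,
   GROUP BY dim_1, dim_2
|> WHERE metric_1 > 0            -- post-aggregation filter (the HAVING equivalent)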
Easier multi-level aggregation
In Standard SQL, each level of aggregation needs a new CTE (or, god forbid, a nested subquery). In Pipe syntax, each additional level of aggregation is just another chained |> AGGREGATE … GROUP BY … step.
The appearance of AGGREGATE also gives immediate notice to the reader that the following logic is focused on aggregation. In Standard SQL, one would have to 1) read through a large list of dimensions until an aggregation function appears without an OVER clause or 2) scroll down to check for a GROUP BY clause.
Standard SQL
Pipe Syntax
WITH date_OnD AS (
  SELECT
    Date,
    OCanonical,
    DCanonical,
    COUNT(1) AS metric1,
    SUM(value) AS metric2,
  FROM dataset.table
  GROUP BY Date, OCanonical, DCanonical
)

SELECT
  Date,
  SUM(metric1) AS metric1,
  SUM(metric2) AS metric2,
  ARRAY_AGG(STRUCT(
    OCanonical,
    DCanonical,
    metric1,
    metric2
  ) LIMIT 10) AS Examples
FROM date_OnD
GROUP BY Date
FROM dataset.table
|> AGGREGATE
     COUNT(1) AS metric1,
     SUM(value) AS metric2,
   GROUP BY Date, OCanonical, DCanonical
|> AGGREGATE
     SUM(metric1) AS metric1,
     SUM(metric2) AS metric2,
     ARRAY_AGG(STRUCT(
       OCanonical,
       DCanonical,
       metric1,
       metric2
     ) LIMIT 10) AS Examples
   GROUP BY Date
Easier EXCEPT and REPLACE
The equivalents of SELECT * EXCEPT () and SELECT * REPLACE () are now |> DROP and |> SET, respectively. Some niceties:
- Trailing commas make it easy to toggle comments on these lines when developing code.
- The syntax for SET makes it very obvious which column is being replaced, as opposed to the REPLACE syntax, which requires the dev to scan for the correct AS keyword (this can be extremely tedious when the column being replaced has struct fields being renamed).
Standard SQL
Pipe Syntax
SELECT
  * EXCEPT (
    drop_1,
    drop_2
  ) REPLACE(
    UDF1(col_1) AS col_1,
    STRUCT(
      DATE(col_2.Timestamp) AS Date,
      col_2.Uid AS Aqid
    ) AS col_2
  )
FROM dataset.table
FROM dataset.table
|> DROP
     drop_1,
     drop_2,  -- trailing comma!
|> SET
     col_1 = UDF1(col_1),
     col_2 = STRUCT(
       DATE(col_2.Timestamp) AS Date,
       col_2.Uid AS Aqid
     ),  -- trailing comma!
Easier to manage sequential changes in column definitions
Following on the discussion of REPLACE and SET above, there’s a common pattern where data needs to be processed one column at a time (e.g. deviceCategory needs to be cleaned before it can be used to determine R2rFrontEnd).
The brevity of the code speaks for itself.
Standard SQL
Pipe Syntax
WITH raw AS (
  SELECT
    col_1,
    col_2,
    col_3,
  FROM dataset.table
),

new_col1 AS (
  SELECT
    * REPLACE (
      UDF1(col_1) AS col_1
    )
  FROM raw
),

new_col2 AS (
  SELECT
    * REPLACE (
      UDF2(col_1, col_2) AS col_2  -- depends on new col_1
    )
  FROM new_col1
),

new_col3 AS (
  SELECT
    * REPLACE (
      UDF3(col_2, col_3) AS col_3  -- depends on new col_2
    )
  FROM new_col2
)

SELECT * FROM new_col3
FROM dataset.table
|> SET col_1 = UDF1(col_1)
|> SET col_2 = UDF2(col_1, col_2)  -- depends on new col_1
|> SET col_3 = UDF3(col_2, col_3)  -- depends on new col_2
Cleaner Aggregation
In Standard SQL, dimensions and metrics can be in arbitrary order in the SELECT statement and the column names need to be manually duplicated in the GROUP BY statement if not using GROUP BY 1, 2 … or GROUP BY ALL.
When developing code, this becomes time-consuming when:
- a column name needs to be renamed in both the SELECT and GROUP BY clauses,
- a column needs to be added/removed from GROUP BY when a dimension becomes a metric or vice versa,
- debugging a column that should have been aggregated but wasn’t due to using GROUP BY ALL.
In Pipe syntax, the separation of metrics and dimensions is demarcated by the GROUP BY keywords. Renaming of grouping columns becomes trivial.
Note 1: Pipe syntax allows a trailing comma for the final column in GROUP BY, which makes it easy to reorder grouping columns.
Note 2: Pipe syntax doesn’t allow GROUP BY ALL. This may sound like an inconvenience since more lines of code might be needed, but there’s no net change in the amount of code: the dimensions simply move from the SELECT list to the GROUP BY list. There’s also the added benefit of the code being more explicit about what’s being grouped.
Standard SQL
Pipe Syntax
WITH group_by_column_example AS (
  SELECT
    dim_1,
    dim2 AS dim_2,
    COUNT(1) AS metric,
  FROM dataset.table
  GROUP BY dim_1, dim_2
),

group_by_all_example AS (
  SELECT
    dim1,
    COUNT(1) AS metric1,
    metric2,  -- Forgot to aggregate this metric
  FROM dataset.table
  GROUP BY ALL
)
WITH group_by_column_example AS (
  FROM dataset.table
  |> AGGREGATE
       COUNT(1) AS metric,
     GROUP BY
       dim_1,
       dim2 AS dim_2,  -- Column alias + trailing comma. Awesome!
),

group_by_all_example AS (
  FROM dataset.table
  |> AGGREGATE
       COUNT(1) AS metric1,
       metric2,  -- BQ Engine will remind you to aggregate this
     GROUP BY dim1
)
Common patterns
UNION ALL and EXCEPT DISTINCT
FROM table1_v1
|> UNION ALL
     (FROM table1_v2),
     (FROM table1_v3),
|> EXCEPT DISTINCT
     (FROM table1_obsolete)
Selecting the schema columns at the end
A common pattern in our current codebase is to have our first CTE extract the columns we need, and then have all the subsequent CTEs add/drop columns as needed while processing business logic, making liberal use of the patterns * EXCEPT and * REPLACE.
One consequence of this is that table schemas end up with columns ordered in non-intuitive ways (i.e. in whatever order the columns were added across CTEs). This adds complexity when tables are obsoleted by new implementations and the old and new datasets need to be unioned with matching schemas (e.g. https://github.com/rome2rio/cloudlake-dbt/blob/e82cf897ab8c341c8f3a64f85dc911e657019666/models/productMetrics/productMetrics-Sessions.sql, https://github.com/rome2rio/cloudlake-dbt/blob/e82cf897ab8c341c8f3a64f85dc911e657019666/models/monitoring/monitoring-UserExits.sql, https://github.com/rome2rio/cloudlake-dbt/blob/e82cf897ab8c341c8f3a64f85dc911e657019666/models/bookings/bookings-LegacyUserBookings.sql)
With Pipe syntax, most business logic can be processed in pipes |> instead of in CTEs. The concern of reordering new columns and dropping old columns to match a schema is then isolated from the rest of the logic and contained in a final |> SELECT pipe.
Standard SQL
Pipe Syntax
WITH new AS (
  SELECT * FROM dataset.table
),

business_logic AS (
  SELECT
    * EXCEPT (new_col) REPLACE (
      CAST(UDF(old_col) AS DATETIME) AS old_col
    ),
  FROM new
),

reorder_columns AS (
  SELECT
    * EXCEPT(old_col),
    old_col,  -- make sure old_col appears last
  FROM business_logic
),

legacy AS (
  SELECT * FROM dataset.table_legacy
)

SELECT * FROM legacy
UNION ALL
SELECT * FROM reorder_columns
{%- set legacy_schema %}
  col_1,
  col_2,
  …
{%- endset %}
WITH new AS (
  FROM dataset.table
  |> DROP new_col
  |> SET old_col = CAST(UDF(old_col) AS DATETIME)
  |> SELECT {{ legacy_schema }}
),

legacy AS (
  FROM dataset.table_legacy
  |> SELECT {{ legacy_schema }}
)

FROM legacy
|> UNION ALL (FROM new)