Introduction
This blog is intended to highlight the benefits of Google’s Pipe syntax SQL and convince users for adoption. If you don’t use BigQuery (or another platform that supports it, like Apache Spark), then consider this blog as some mental exercise on why pipes (functional programming) is a great addition to SQL. So that the comparison with Standard SQL is mostly fair, I made a deliberate attempt to avoid contrived examples that overcomplicate standard SQL. In fact, most examples are based on production code written by past me and/or my colleagues.
If you’re not familiar with Pipe syntax, you may consider starting from a later section to see some simple code side-by-side with Google standard SQL. Here are some more comprehensive resources if you find the examples here insufficient:
- Official documentation for Pipe syntax.
- Google’s conference paper introducing Pipe syntax.
If I had to summarise the benefits of Pipe syntax SQL in a few bullet points, it would be:
- Improved guarantees on table cardinality.
- Improved integration with dbt and templating, which I explore in another article.
- Improved readability from:
- Fewer CTEs, each of which are more deliberately named.
- Concise multi-level aggregation.
- Concise sequential transforms.
- Improved debugging experience in BigQuery console. I don’t discuss this in depth, but the tl;dr is that commenting out pipes is more convenient than commenting out CTEs.
Disclaimer
There are some known limitations for Pipe Syntax at time of writing, the big one for me being:
You can’t use a named window in pipe syntax.
Named windows are extremely convenient for analytics processing, and I often resort to using standard SQL in a new CTE specifically for window functions.
Finally, even if you’re convinced Pipe syntax is our lord and saviour, you don’t necessarily have to convince your company to rewrite the entire codebase in Pipe syntax. Adopting a new syntax/language is not a trivial decision for a team and requires upskilling for the team (though IMHO pipe syntax is easier to learn than standard SQL).
Game-changer improvements
Easier to transform arrays of structs
Consider a table with an array of structs as follows.
| Row | patient_id | history.weight | history.height |
|---|---|---|---|
| 1 | 1 | 70 | 1.75 |
| 71 | 1.75 | ||
| 2 | 2 | 64 | 1.65 |
SQL definition of table
-- 1 row = 1 patient with an array of historical weight/height measurements
WITH existing_logic AS (
SELECT
1 AS patient_id,
[
STRUCT(70 AS weight, 1.75 AS height), -- weight:kg, height:m
STRUCT(71 AS weight, 1.75 AS height)
] AS history
UNION ALL
SELECT
2 AS patient_id,
[
STRUCT(64 AS weight, 1.65 AS height)
] AS history
)
SELECT * FROM existing_logic
Suppose we are now tasked to augment the history array by adding (BMI = weight / height^2) for each historical record. E.g. The target state is as follows.
| Row | patient_id | history.weight | history.height | history.bmi |
|---|---|---|---|---|
| 1 | 1 | 70 | 1.75 | 22.86 |
| 71 | 1.75 | 23.18 | ||
| 2 | 2 | 64 | 1.65 | 23.51 |
There are many ways to achieve this in standard SQL, but we can broadly categorise them into 2 approaches:
- Manipulate the arrays in place using a subquery.
- Unnest the arrays in top-level CTEs, perform the transform, and re-aggregate.
Let’s look at some example implementations using standard SQL.
Standard SQL (Approach 1)
WITH existing_logic AS (
-- Existing transformations in model
),
bmi AS (
SELECT
* REPLACE (
ARRAY(( -- This is the start of the subquery
SELECT AS STRUCT
*,
weight/POWER(height, 2) AS BMI,
FROM UNNEST(history)
)) AS history -- This is the end of the subquery
)
FROM existing_logic
),
more_logic AS (
-- Further transformations in model
)
Standard SQL (Approach 2)
WITH existing_logic AS (
-- Existing transformations in model
),
bmi_unnest AS (
SELECT
existing_logic.* REPLACE(h AS history),
OFFSET
FROM existing_logic, UNNEST(history) AS h WITH OFFSET
),
bmi_compute AS (
SELECT
* REPLACE(
(SELECT AS STRUCT
history.*,
history.weight / POWER(history.height, 2) AS BMI
) AS history
)
FROM bmi_unnest
),
bmi_reagg AS (
SELECT
* EXCEPT(OFFSET, history),
ARRAY_AGG(history ORDER BY OFFSET) AS history
FROM bmi_compute
GROUP BY ALL
),
more_logic AS (
-- Further transformations in model
)
Despite the first approach appearing shorter, it is typically considered more difficult to read since the reader needs to identify the start and end of subqueries among existing CTEs. This difficulty is greatly exacerbated by any logic that is more involved than a simple example computation and requires CTEs for sequential transformations.
The second approach is easier to read since all the CTEs are top-level and the reader doesn’t need to identify the start and end of nested subqueries. However, this improved readability comes at a cost:
- Unnesting the array & re-aggregating destroys guarantees on the table cardinality. This makes the table more fragile to future code changes. Testing the table can help, but it’s difficult to provide any guarantees on table cardinality assuming it’s possible to test for this in the first place. The re-aggregation can be done either by a
SELECT * GROUP BY ALLfor brevity (but difficult to debug), or by tediously listing all the grouping column. - We now have two extra CTEs that don’t really have any business logic, leading to namespace pollution of CTEs.
Consider now the Pipe Syntax variant.
Pipe Syntax
WITH existing_logic AS (
-- Existing transformations in model
),
bmi AS (
FROM existing_logic
|> SET
history = ARRAY((
FROM UNNEST(history)
|> EXTEND weight/POWER(height, 2) AS BMI
|> SELECT AS STRUCT *
))
),
more_logic AS (
-- Further transformations in model
)
Let’s see how it addresses some of the concerns from the standard SQL variants:
- Subqueries: Pipe syntax introduces one nested subquery per array. Furthermore, complex transformations can just be chained before the final
|> SELECT AS STRUCT *instead of using nested CTEs. - Table cardinality: The cardinality of the top-level table is untouched. In fact, any changes in cardinality can be identified via the first word after the pipe, e.g.
WHERE,AGGREGATE,[LEFT|RIGHT|FULL] JOIN. - Namespace pollution: A single CTE can be used to encapsulate as many or as few transformations as deemed appropriate.
More powerful generic tests in dbt
See this later article.
Small conveniences
Easier to identify references to previous CTEs
Consider the following example where cte3 references cte1. In Standard SQL, the declaration cte3 AS can be many lines apart from the reference FROM cte1 if the SELECT list is long. In Pipe syntax, the reference to cte1 is always within a few lines of cte3 AS.
Standard SQL
WITH
cte1 AS (
SELECT
...
FROM dataset.table
),
cte2 AS (
SELECT
...
FROM dataset.table2
),
cte3 AS (
SELECT
...
FROM cte1 -- this can be screens apart from `cte3 AS (`
)
SELECT
...
FROM cte3
JOIN cte2 USING(...)
Pipe Syntax
WITH
cte1 AS (
FROM dataset.table
|> SELECT
...
),
cte2 AS (
FROM dataset.table2
|> SELECT
...
),
cte3 AS (
FROM cte1 -- this appears immediatley below `cte3 AS (`
|> SELECT
...
)
FROM cte3
|> JOIN cte2 USING(...)
|> SELECT
...
Easier to identify changes in table cardinality (row-count)
Consider the following example where the table undergoes several transformations, each of which affect the final row count. In Standard SQL, the cardinality of the table changes 4 times due to the clauses UNNEST, WHERE, GROUP BY, and HAVING.
Standard SQL
SELECT
dim_1,
dim_2,
COUNTIF(some_criteria) AS metric_1,
ARRAY_AGG(some_expression) AS metric_2,
FROM dataset.table, UNNEST(array) AS a
WHERE some_filter
GROUP BY dim_1, dim_2
HAVING metric_1 > 0
Pipe Syntax
FROM dataset.table, UNNEST(arr) AS a
|> WHERE some_filter
|> AGGREGATE
COUNTIF(some_criteria) AS metric_1,
ARRAY_AGG(some_expression) AS metric_2,
GROUP BY
dim_1,
dim_2,
|> WHERE metric_1 > 0
Some pain points in Standard SQL and how Pipe syntax helps:
If the
SELECTlist is very long, the dev has to scroll down quite far from the beginning of the CTE to find out how the cardinality has changed. In Pipe syntax, the dev immediately expects a change in cardinality upon coming across one ofUNNEST,|> WHEREand|> AGGREGATE. Note, one may opt for the style of using|> CROSS JOIN UNNESTto denote an explicit change in cardinality.Additional mental headspace is needed to determine the order in which the table cardinality changes due to each of the 4 clauses. This is exacerbated further if the query has additional clauses like
QUALIFYandPIVOT. In Pipe syntax, the order in which cardinality changes is explicit with each chained|> WHEREor|> AGGREGATE. The changes in cardinality are thus sequential and can be more easily reasoned about.
Easier multi-level aggregation
In Standard SQL, each level of aggregation needs a new CTE (or god forbid a nested subquery). In Pipe syntax, each additional level of aggregation is just another chain of |> AGGREGATE … GROUP BY … clause.
The appearance of AGGREGATE also gives immediate notice to the reader that the following logic is focused on aggregation. In Standard SQL, one would have to:
- Read through a large list of dimensions until an aggregation function appears without an
OVERclause, or - Scroll to the end of the CTE and check for a
GROUP BYclause.
Standard SQL
WITH
date_OnD AS (
SELECT
Date,
OCanonical,
DCanonical,
COUNT(1) AS metric1,
SUM(value) AS metric2,
FROM dataset.table
GROUP BY Date, OCanonical, DCanonical
)
SELECT
Date,
SUM(metric1) AS metric1,
SUM(metric2) AS metric2,
ARRAY_AGG(STRUCT(
OCanonical,
DCanonical,
metric1,
metric2
) LIMIT 10) AS Examples
FROM date_OnD
GROUP BY Date
Pipe Syntax
FROM dataset.table
|> AGGREGATE
COUNT(1) AS metric1,
SUM(value) AS metric2,
GROUP BY
Date, OCanonical, DCanonical
|> AGGREGATE
SUM(metric1) AS metric1,
SUM(metric2) AS metric2,
ARRAY_AGG(STRUCT(
OCanonical,
DCanonical,
metric1,
metric2
) LIMIT 10) AS Examples
GROUP BY
Date
Easier EXCEPT and REPLACE
The equivalent of SELECT * EXCEPT () and SELECT * REPLACE () are now |> DROP and |> SET, respectively. Some niceties:
- Trailing commas makes it easy to toggle comments on these lines when developing code,
- The syntax for
SETmakes it very obvious which column is being replaced. This is in contrast to theREPLACEsyntax, which requires the dev to tediouscly scan for the correspondingASkeyword for each column.
Standard SQL
SELECT
* EXCEPT (
drop_1,
drop_2
) REPLACE(
UDF1(col_1) AS col_1,
STRUCT(
DATE(col_2.Timestamp) AS Date,
col_2.Uid AS Aqid
) AS col_2
)
FROM dataset.table
Pipe Syntax
FROM dataset.table
|> DROP
drop_1,
drop_2,
|> SET
col_1 = UDF1(col_1),
col_2 = STRUCT(
DATE(col_2.Timestamp) AS Date,
col_2.Uid AS Aqid
),
Easier to manage sequential changes in column definitions
Following on the discussion of REPLACE and SET above, there’s a common pattern where data needs to be processed one column at a time. The brevity of the code speaks for itself.
Standard SQL
WITH
raw AS (
SELECT
col_1,
col_2,
col_3,
FROM dataset.table
),
new_col1 AS (
SELECT
* REPLACE (
UDF1(col_1) AS col_1
)
FROM raw
),
new_col2 AS (
SELECT
* REPLACE (
UDF2(col_1, col_2) AS col_2 -- depends on new col_1
)
FROM new_col2
),
new_col3 AS (
SELECT
* REPLACE (
UDF3(col_2, col_3) AS col_3 -- depends on new col_2
)
FROM new_col2
)
SELECT * FROM new_col3
Pipe Syntax
FROM dataset.table
|> SET
col_1 = UDF1(col_1)
|> SET
col_2 = UDF2(col_1, col_2) -- depends on new col_1
|> SET
col_3 = UDF3(col_2, col_3) -- depends on new col_2
Cleaner Aggregation
In Standard SQL, dimensions and metrics can be in arbitrary order in the SELECT statement and the column names need to be manually duplicated in the GROUP BY statement if not using GROUP BY 1, 2 ... or GROUP BY ALL.
When developing code, this becomes time consuming when:
- a column name needs to be renamed in both the
SELECTandGROUPBY clauses, - a column needs to be added/removed from
GROUP BYwhen a dimension becomes a metric or vice-versa, - debugging a column that should have been aggregated but wasn’t due to using
GROUP BY ALL.
In Pipe syntax, the separation of metrics and dimensions is demarcated by the GROUP BY keywords. Renaming of grouping columns becomes trivial.
- Note 1: Pipe syntax allows a trailing comma for the final column in
GROUP BYwhich makes it easy to reorder grouping columns. - Note 2: Pipe syntax doesn’t allow
GROUP BY ALL. This may sound like an inconvenience as more lines of code might be needed, but there’s no actual change since the code is just being moved fromSELECTtoGROUP BY. There’s also the added benefit of the code being more explicit about what’s being grouped.
Standard SQL
WITH
group_by_column_example AS (
SELECT
dim_1,
dim2 AS dim_2,
COUNT(1) AS metric,
FROM dataset.table
GROUP BY
dim_1,
dim_2
),
group_by_all_example AS (
SELECT
dim1,
COUNT(1) AS metric1,
metric2, -- Forgot to aggregate this metric
FROM dataset.table
GROUP BY ALL
)
Pipe Syntax
WITH
group_by_column_example AS (
FROM datset.table
|> AGGREGATE
COUNT(1) AS metric,
GROUP BY
dim1,
dim2 AS dim_2, -- Column alias + trailing comma. Awesome!
),
group_by_all_example AS (
FROM dataset.table
|> AGGREGATE
COUNT(1) AS metric1,
metric2, -- BQ Engine will remind you to aggregate this
GROUP BY
dim1
)
Closing thoughts
I think Google Pipe syntax greatly improves the ergonomics of reading and writing SQL. Good developer ergonomics = faster development and/or more robust systems. I believe there’s a lot of potential for other SQl dialects to adopt pipe syntax, and I look forward to the day where it becomes ubiquitous.