How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations

One of my trend calculation queries in the Opinion Analysis project started causing trouble recently. It would run for 270 seconds and break with an error “Query exceeded resource limits for tier 1. Tier 8 or higher required.” Changing the billing tier to the recommended Tier 8 did help, but the query still took 380 seconds to complete.
After some research I found out that the self-joins I use in part of the query were the culprit. Take a look at the join condition at the bottom of this query (in the CalcStatCombiTopics temp table):

INSERT INTO opinions.stattopic (...)
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
CalcStatSentiments AS (
 SELECT p.SnapshotDateId, t.Tag, ... s.SentimentHash,...
 FROM opinions.document d, p
 INNER JOIN opinions.sentiment s ON s.DocumentHash = d.DocumentHash, UNNEST(s.Tags) AS t
 INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
 INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
 AND wrRepost.CollectionItemId = d.CollectionItemId
 WHERE
 d.PublicationDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
CalcStatTopics AS (
…
),
CalcStatCombiTopics AS (
 SELECT 
 css1.SnapshotDateId, CONCAT(css1.Tag,’ & ‘,css2.Tag) AS Topic, [css1.Tag,css2.Tag] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
 ...
 FROM
 CalcStatSentiments css1, CalcStatSentiments css2
 WHERE
 css1.SentimentHash = css2.SentimentHash AND
 css1.Tag < css2.Tag
 GROUP BY css1.SnapshotDateId, css1.Tag, css2.Tag
),
…

Basically, I have a table “opinions.sentiment” with an identity column SentimentHash, a repeated field Tags, and a bunch of other columns. The Tags column contains an array of textual tags that I extract from text using the Opinion Analysis Cloud Dataflow IndexerPipeline. In the pre-BigQuery days I would have used a separate table to store the tags and link it to the main Sentiment table via the identity column SentimentHash. However, in BigQuery this is much more easily done with repeated fields.

When I calculate trends, I build frequency statistics for combinations of tags (e.g. how many news articles were about both “climate change” and “G-20”). To do so, I flatten the Tags field and create a temporary table CalcStatSentiments that contains individual records for each tag, with fields like SentimentHash as well as the actual tag. I then do a self-join of the CalcStatSentiments table to build what I call “topics” (in the CalcStatCombiTopics temp table).

It turns out that self-joins are bad for (your health) the performance of your queries, as illustrated in this blog post. It suggests replacing the self-joins with Windowing. I considered doing that, but I actually needed to end up with permutations of tags instead of aggregate statistics, for which Windowing would work great, so I came up with another technique.

Announcing Sergei’s Self-Join Elimination Technique ©

Instead of flattening my Tags field only once (in the CalcStatSentiments part of my query), I now flatten it the first time, carry the Tags array in the result set, and then flatten it the second time to emulate the CROSS JOIN operation.

Here is how it looks in the new query:

WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
SentimentTags AS (
  SELECT p.SnapshotDateId, s.SentimentHash, t.Tag, t.GoodAsTopic, s.Tags AS Tags
  FROM p, opinions.sentiment s, UNNEST(s.Tags) AS t
  WHERE
    s.DocumentDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
SentimentTagCombos AS (
  SELECT st.SnapshotDateId, st.SentimentHash, st.Tag AS Tag1, stt.Tag AS Tag2 
  FROM SentimentTags st, UNNEST(st.Tags) stt
  WHERE st.Tag < stt.Tag
),
...

The inequality filter `WHERE st.Tag < stt.Tag` ensures that I do not get duplicates in my tag combos. It works the same way as the inequality filter in the original version of my query.

WHERE
 css1.SentimentHash = css2.SentimentHash AND
 css1.Tag < css2.Tag

Once I calculated my Tag1 & Tag2 combinations, I join my result set via the SentimentHash record ID to my main dataset and conclude all the calculations.

The result: instead of a query that takes 380 seconds to complete in billing tier 8, my modified query runs in 12 seconds in billing tier 1.

Conclusion:

Using self-joins = BAD.
Replacing them with dual UNNEST() = Priceless! (or something like that)

Here are the queries if you want to compare the syntax.

New, performance-optimized query:

INSERT INTO opinions.stattopic (SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes )
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
SentimentTags AS (
  SELECT p.SnapshotDateId, s.SentimentHash, t.Tag, t.GoodAsTopic, s.Tags AS Tags
  FROM p, opinions.sentiment s, UNNEST(s.Tags) AS t
  WHERE
    s.DocumentDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
SentimentTagCombos AS (
  SELECT st.SnapshotDateId, st.SentimentHash, st.Tag AS Tag1, stt.Tag AS Tag2 
  FROM SentimentTags st, UNNEST(st.Tags) stt
  WHERE st.Tag < stt.Tag
),
CalcStatSentiments AS (
  SELECT st.SnapshotDateId, st.Tag, st.GoodAsTopic, d.DocumentHash AS DocumentHash, s.SentimentHash,
    wrOrig.WebResourceHash AS OrigWebResourceHash, wrOrig.Domain AS OrigDomain, wrRepost.WebResourceHash AS RepostWebResourceHash,
    s.DominantValence AS Valence, d.PublicationTime AS PublicationTime
  FROM SentimentTags st
    INNER JOIN opinions.sentiment s ON s.SentimentHash = st.SentimentHash AND s.DocumentDateId = st.SnapshotDateId
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash AND d.PublicationDateId = st.SnapshotDateId
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
),
CalcStatTopics AS (
  SELECT
    c.SnapshotDateId, c.Tag AS Topic, [c.Tag] AS Tags, c.GoodAsTopic, 1 AS TagCount,
    COUNT(distinct OrigDomain) as cntOrigPublishers,
    COUNT(distinct RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when c.Valence=1 then c.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when c.Valence=2 then c.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when c.Valence=3 then c.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when c.Valence=5 then c.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT c.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT c.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT c.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM CalcStatSentiments c
  GROUP BY c.SnapshotDateId, c.Tag, c.GoodAsTopic
),
CalcStatCombiTopics AS (
  SELECT 
    stc.SnapshotDateId, CONCAT(stc.Tag1,' & ',stc.Tag2) AS Topic, [stc.Tag1,stc.Tag2] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
    COUNT(distinct wrOrig.Domain) as cntOrigPublishers,
    COUNT(distinct wrRepost.WebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when s.DominantValence=1 then s.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when s.DominantValence=2 then s.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when s.DominantValence=3 then s.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when s.DominantValence=5 then s.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT s.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT wrOrig.WebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT wrRepost.WebResourceHash) AS RepostWebResourceHashes
  FROM SentimentTagCombos stc
    INNER JOIN opinions.sentiment s ON s.SentimentHash = stc.SentimentHash AND s.DocumentDateId = stc.SnapshotDateId
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash AND d.PublicationDateId = stc.SnapshotDateId
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
  GROUP BY stc.SnapshotDateId, stc.Tag1, stc.Tag2
  -- HAVING cntPublisherDomains > 1
),
CalcStatAllTopics AS (
  SELECT * FROM CalcStatTopics
  WHERE GoodAsTopic = true AND cntRepostWRs > 1
  UNION ALL
  SELECT * FROM CalcStatCombiTopics
  WHERE GoodAsTopic = true AND cntRepostWRs > 1
)
SELECT SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes
FROM CalcStatAllTopics

Original Query:

INSERT INTO opinions.stattopic (SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes )
WITH 
p AS (
 SELECT 20170630 AS SnapshotDateId
),
CalcStatSentiments AS (
  SELECT p.SnapshotDateId, t.Tag, t.GoodAsTopic, d.DocumentHash AS DocumentHash, s.SentimentHash,
    wrOrig.WebResourceHash AS OrigWebResourceHash, wrOrig.Domain AS OrigDomain, wrRepost.WebResourceHash AS RepostWebResourceHash,
    s.DominantValence AS Valence, d.PublicationTime AS PublicationTime
  FROM opinions.document d, p
    INNER JOIN opinions.sentiment s ON s.DocumentHash = d.DocumentHash, UNNEST(s.Tags) AS t
    -- Need to use Sentiment tags, so that sentiments relate to topics
    INNER JOIN opinions.webresource wrOrig ON wrOrig.DocumentHash = d.DocumentHash
    INNER JOIN opinions.webresource wrRepost ON wrRepost.DocumentCollectionId = d.DocumentCollectionId
      AND wrRepost.CollectionItemId = d.CollectionItemId
  WHERE
    d.PublicationDateId = p.SnapshotDateId AND s.SentimentTotalScore > 0
),
CalcStatTopics AS (
  SELECT
    c.SnapshotDateId, c.Tag AS Topic, [c.Tag] AS Tags, c.GoodAsTopic, 1 AS TagCount,
    COUNT(distinct OrigDomain) as cntOrigPublishers,
    COUNT(distinct RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when c.Valence=1 then c.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when c.Valence=2 then c.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when c.Valence=3 then c.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when c.Valence=5 then c.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT c.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT c.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT c.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM CalcStatSentiments c
  GROUP BY c.SnapshotDateId, c.Tag, c.GoodAsTopic
),
CalcStatCombiTopics AS (
  SELECT 
    css1.SnapshotDateId, CONCAT(css1.Tag,' & ',css2.Tag) AS Topic, [css1.Tag,css2.Tag] AS Tags, true AS GoodAsTopic, 2 AS TagCount,
    COUNT(distinct css1.OrigDomain) as cntOrigPublishers,
    COUNT(distinct css1.RepostWebResourceHash) as cntRepostWRs,
    COUNT(distinct (case when css1.Valence=1 then css1.SentimentHash else null end)) as cntPositives,
    COUNT(distinct (case when css1.Valence=2 then css1.SentimentHash else null end)) as cntNegatives,
    COUNT(distinct (case when css1.Valence=3 then css1.SentimentHash else null end)) as cntAmbiguous,
    COUNT(distinct (case when css1.Valence=5 then css1.SentimentHash else null end)) as cntGeneral,
    ARRAY_AGG(DISTINCT css1.SentimentHash) AS SentimentHashes,
    ARRAY_AGG(DISTINCT css1.OrigWebResourceHash) AS OrigWebResourceHashes,
    ARRAY_AGG(DISTINCT css1.RepostWebResourceHash) AS RepostWebResourceHashes
  FROM
    CalcStatSentiments css1, CalcStatSentiments css2
  WHERE
    css1.SentimentHash = css2.SentimentHash AND
    css1.Tag < css2.Tag
  GROUP BY css1.SnapshotDateId, css1.Tag, css2.Tag
  -- HAVING cntPublisherDomains > 1
),
CalcStatAllTopics AS (
  SELECT * FROM CalcStatTopics
  WHERE GoodAsTopic = true AND cntRepostWRs > 1
  UNION ALL
  SELECT * FROM CalcStatCombiTopics
  WHERE GoodAsTopic = true AND cntRepostWRs > 1
)
SELECT SnapshotDateId, Topic, Tags, TagCount, cntOrigPublishers, cntRepostWRs,
  cntPositives, cntNegatives, cntAmbiguous, cntGeneral, SentimentHashes, OrigWebResourceHashes, RepostWebResourceHashes
FROM CalcStatAllTopics

How to speed up your BigQuery query 31x by replacing a self-join with two UNNEST() operations

Leave a Reply Cancel reply