Islands and Gaps

Islands?

This past week I had the pleasure of seeing the ole Islands and Gaps SQL problem out in the wild! It’s pretty common actually, but this time I found myself having to do a little bit of backfilling in addition to the forward filling tasks I have usually been charged with. Let’s do a quick refresher before jumping into the meat of the problem today.

This type of problem can surface in almost any number of ways. The tell is that you are trying to identify if there is a gap in a sequence of some sort, and fill in that gap to create a more whole data set to work with. The “gap” is the real issue here….not so much what column the gap is occuring in.

In this instance I was working with timeseries data. The data was reporting on occupancy levels in a room in 5 minute increments (e.g. 01:00 UTC, 01:05 UTC, …). Unfortunately the data had many of these “gaps”. This time this was actually by design. The particular sensor that was tracking occupancy levels actually did not report data if there were no changes during the 5 minute interval. So, for example, if 5 people were in a room starting at 2:00 UTC, and they just stayed in the room for 10 minutes straight before 1 of them decided to leave the room the data would report 5 people occupied the room at the 2:00 UTC, and then the next row in the data would show there were 4 people in the room at 2:10 UTC…..There would be no row for 2:05 UTC. This is one of those “gaps”! Fortunately for me, knowing that senor only reported changes meant that I could backfill and forward fill any of these gaps based off the gap’s surrounding row’s data.

Filling in the gaps

Now, moving from that previous example to actual data. See below for the raw, unfilled state of our data set.


Note the missing time_partition at the start and end of the hour.

Alright. Let’s get down to it. 👷

First thing I want to do is go ahead and fill in the missing time partitions just so we can get a complete picture of the hour. The first CTE below just generates some sample data for demo purposes. The second, full_time_series, makes use of Big Query’s GENERATE_TIMESTAMP_ARRAY function to……well…like the name states….generate a timestamp array. The distinct in full_time_series is a little superfluous for since we only have a single id in this example, but it is going to be a requirement if we wanted to apply this query to data that had multiple ids.

with test_data AS (
      SELECT 1 id, TIMESTAMP('2023-01-01 00:15:00') time_partition, 7 AS occupancy_max, 3 AS occupancy_min, 5 AS occupancy_first, 7 AS occupancy_lASt, 3 AS occupancy_prev UNION ALL
      SELECT 1 id, TIMESTAMP('2023-01-01 00:20:00') time_partition, 4 AS occupancy_max, 4 AS occupancy_min, 4 AS occupancy_first, 4 AS occupancy_lASt, 7 AS occupancy_prev UNION ALL
      SELECT 1 id, TIMESTAMP('2023-01-01 00:25:00') time_partition, 6 AS occupancy_max, 1 AS occupancy_min, 4 AS occupancy_first, 6 AS occupancy_lASt, 4 AS occupancy_prev UNION ALL
      SELECT 1 id, TIMESTAMP('2023-01-01 00:30:00') time_partition, 6 AS occupancy_max, 2 AS occupancy_min, 6 AS occupancy_first, 2 AS occupancy_lASt, 6 AS occupancy_prev UNION ALL
      SELECT 1 id, TIMESTAMP('2023-01-01 00:35:00') time_partition, 5 AS occupancy_max, 2 AS occupancy_min, 2 AS occupancy_first, 5 AS occupancy_lASt, 2 AS occupancy_prev
), full_time_series AS (
  select
      distinct id,
      time_gen
  FROM test_data t
  FULL OUTER JOIN(SELECT time_gen FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2023-01-01 00:00:00', '2023-01-01 00:59:00', INTERVAL 5 MINUTE)) time_gen)
    ON 1 = 1
)
SELECT * FROM full_time_series;


Voila

Next, let’s join that full hour’s five minute interval timeseries to our gap data.

 _grps AS (
  SELECT
    f.id,
    f.time_gen,
    t.occupancy_max,
    t.occupancy_min,
    t.occupancy_first,
    t.occupancy_last,
    t.occupancy_prev,
    count(t.occupancy_prev) OVER (PARTITION BY f.id ORDER BY f.time_gen) forward_fill_grp,
    count(t.occupancy_prev) OVER (PARTITION BY f.id ORDER BY f.time_gen DESC) backfill_grp
  FROM full_time_series f
  LEFT JOIN test_data t ON f.id = t.id AND f.time_gen = t.time_partition
)
select * FROM _grps ORDER BY id, time_gen


Here we can clearly see the null values for the gap rows we filled in with our select from full_time_series.

Before moving on, we need to examine the two count() window functions. They are going to be very useful in our final query. The first, forward_fill_grp, is going to help us identify where our “real” data ends and our filled-gap data starts. For example, above we can see that row 8 has a forward_fill_grp of 5. That 5 value also applies to the filled in timeseries data we joined to the test_data CTE to help fill’er up. The second window function, backfill_grp does the exact same thing but in reverse!

Using these two window functions I can now backfill all the filled in rows with the first non-NULL occupancy_prev value we have, and forward fill using the last non-NULL occupancy_last value too!

filled AS (
  SELECT
    id,
    time_gen,
    CASE
      WHEN occupancy_max IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
      WHEN occupancy_max IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
      ELSE occupancy_max
    END as occupancy_max,
    CASE
      WHEN occupancy_min IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
      WHEN occupancy_min IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
      ELSE occupancy_min
    END as occupancy_min,
    CASE
      WHEN occupancy_first IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
      WHEN occupancy_first IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
      ELSE occupancy_first
    END as occupancy_first,
    CASE
      WHEN occupancy_last IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
      WHEN occupancy_last IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
      ELSE occupancy_last
    END as occupancy_last,
    CASE
      WHEN occupancy_prev IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
      WHEN occupancy_prev IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
      ELSE occupancy_prev
    END as occupancy_prev
  FROM _grps
)
SELECT * FROM filled ORDER BY id, time_gen


A little color coding for y’all.

I could have also accomplished this same effect by using Big Query’s IF() function, but, in my personal opinion, things start to look a little muddled if we go down that route. But, hey. You do you.

CASE
    WHEN occupancy_max IS NULL AND forward_fill_grp = 0 THEN FIRST_VALUE(occupancy_prev) OVER (PARTITION BY id, backfill_grp ORDER BY time_gen DESC)
    WHEN occupancy_max IS NULL AND forward_fill_grp > 0 THEN FIRST_VALUE(occupancy_last) OVER (PARTITION BY id, forward_fill_grp ORDER BY time_gen ASC)
    ELSE occupancy_max
END as occupancy_max

-- VS

IF(occupancy_max is NULL, IF(forward_fill_grp = 0, first_value(occupancy_prev) over(partition by id, backfill_grp order by time_gen desc), first_value(occupancy_last) over(partition by id, forward_fill_grp order by time_gen)), occupancy_max) as occupancy_max

#TODO data engineering

Islands and Gaps

Islands?

Filling in the gaps

Extra Resources