Solving the gaps and islands problem in AWS Redshift

A gaps and islands problem refers to a situation where a table contains a sequence of rows that should appear at regular intervals (daily temperature measures, weekly sales summaries, etc.) but some entries are missing. Although the issue appears most often with data ordered by a date or a timestamp, it can in general involve a sequence ordered by any other type of column whose values can be treated as consecutive (integers, alphabet letters, and so on). Islands are chunks of sorted consecutive records, and gaps are simply the missing values between the islands. This might sound a little enigmatic, but in reality it is very intuitive. If we have a dataset of measures taken in March 2020, ordered by day, and the measures for the 10th and 24th of March are missing (for whatever reason), the gaps and islands are as follows:

Island: 1st - 9th March

Gap: 10th March

Island: 11th - 23rd March

Gap: 24th March

Island: 25th - 31st March

Real-life causes of such situations include temporary unavailability of the systems acting as data sources, errors at the dataset-creation stage, or simply a raw data structure that is not rich enough to be used directly in an analysis or a dashboard.

The challenge in solving this problem is identifying the islands of data that are separated by the gaps of missing values, and then performing some action on those islands, such as counting, summing, or averaging the values within each island.
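
Before diving into the business case, here is a minimal sketch of the classic island-detection technique, assuming a hypothetical table daily_measures(measure_date date) with at most one row per day. Subtracting each row's ROW_NUMBER from its date yields a value that is constant within an island and shifts after every gap, so grouping by it collapses each island into one row:

WITH numbered AS (
    SELECT
        measure_date
      , ROW_NUMBER() OVER (ORDER BY measure_date) AS rn
    FROM daily_measures
)
SELECT
    MIN(measure_date) AS island_start
  , MAX(measure_date) AS island_end
FROM numbered
GROUP BY measure_date - rn   -- constant within an island, shifts after every gap
ORDER BY island_start;

Applied to the March example above, this would return exactly the three islands listed. The rest of this post tackles a different flavor of the problem, filling gaps in attribute values rather than just locating them.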

Business problem to solve

Let’s imagine there’s an application. Anyone can use it as long as they create a user account. Some functionality is free and some is available only to paid premium accounts. There is also a one-week free trial. Depending on app usage, we divide users into three categories: beginner, contributor, and expert. We gather all possible logs of users’ behaviors and attributes in a database. Every time a new user account is created, the log records the unique ID of this account along with all user attributes. However, when any attribute changes, the log sends only the ID and the new attribute value. Let’s look at sample data and see what the log entries look like.

CREATE TABLE user_sample
(
    log_date        date,
    user_account_id integer,
    event_type      varchar(32),
    account_type    varchar(16),
    user_level      varchar(16)
);

INSERT INTO user_sample VALUES ('2020-02-15', 111, 'user account created', 'free (basic)', 'beginner');
INSERT INTO user_sample VALUES ('2020-02-18', 111, 'user account updated', 'trial', NULL);
INSERT INTO user_sample VALUES ('2020-02-25', 111, 'user account updated', NULL, 'contributor');
INSERT INTO user_sample VALUES ('2020-03-01', 111, 'user account updated', 'premium', NULL);
INSERT INTO user_sample VALUES ('2020-03-02', 111, 'user account updated', NULL, 'expert');
INSERT INTO user_sample VALUES ('2020-03-06', 111, 'user account deleted', NULL, NULL);
INSERT INTO user_sample VALUES ('2020-02-19', 222, 'user account created', 'free (basic)', 'beginner');
INSERT INTO user_sample VALUES ('2020-03-01', 333, 'user account created', 'free (basic)', 'beginner');
INSERT INTO user_sample VALUES ('2020-03-04', 333, 'user account deleted', NULL, NULL);

SELECT *
FROM user_sample
ORDER BY user_account_id, log_date;

Output:

log_date   | user_account_id | event_type           | account_type | user_level
2020-02-15 | 111             | user account created | free (basic) | beginner
2020-02-18 | 111             | user account updated | trial        | NULL
2020-02-25 | 111             | user account updated | NULL         | contributor
2020-03-01 | 111             | user account updated | premium      | NULL
2020-03-02 | 111             | user account updated | NULL         | expert
2020-03-06 | 111             | user account deleted | NULL         | NULL
2020-02-19 | 222             | user account created | free (basic) | beginner
2020-03-01 | 333             | user account created | free (basic) | beginner
2020-03-04 | 333             | user account deleted | NULL         | NULL

Note that different events (see column: event_type) come with different attributes (e.g. user account created events always come with the account_type and user_level attributes, while user account deleted events always have NULL account_type and user_level). This is how events typically look in production environments, where each event type usually lands in a separate table that can later be joined. Here, all the data was inserted into just one table as a starting point for building a user dimension.

In simple terms, in this example, we want to build a user history that shows all possible attributes (user_account_id, account_type, user_level). In data warehousing, this is called a slowly changing dimension (SCD) with a daily grain.

There are numerous other types of SCDs and other ways of tackling irregularly changing data. I recommend the Wikipedia article on SCD as a reference.

Step 1: propagate user attributes using window functions and frames

Let’s run the following query to start with:

SELECT
    log_date
  , user_account_id
  , event_type
  , LAST_VALUE(account_type IGNORE NULLS)
    OVER (PARTITION BY user_account_id ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS account_type
  , LAST_VALUE(user_level IGNORE NULLS)
    OVER (PARTITION BY user_account_id ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS user_level
FROM user_sample
ORDER BY user_account_id, log_date;

Output:

log_dateuser_account_idevent_typeaccount_typeuser_level
2020-02-15111user account createdfree (basic)beginner
2020-02-18111user account updatedtrialbeginner
2020-02-25111user account updatedtrialcontributor
2020-03-01111user account updatedpremiumcontributor
2020-03-02111user account updatedpremiumexpert
2020-03-06111user account deletedpremiumexpert
2020-02-19222user account createdfree (basic)beginner
2020-03-01333user account createdfree (basic)beginner
2020-03-04333user account deletedfree (basic)beginner

The query retrieves data from the user_sample table. It selects three columns as-is (log_date, user_account_id, event_type) and adds two calculated columns.

The calculated columns are generated with the LAST_VALUE window function, the IGNORE NULLS option, and the OVER clause. In this combination, the function retrieves the last non-null value of the specified column over a defined partition and ordering.

The partition is defined as PARTITION BY user_account_id, which means the function only considers rows with the same user_account_id. The ordering is defined as ORDER BY log_date, which means the values are considered in order of log_date.

The ROWS UNBOUNDED PRECEDING frame clause means the function considers all rows in the partition from the first one up to and including the current row. The final result is a table with the selected columns plus account_type and user_level columns containing the last non-null attribute values over the defined partition and ordering.

Finally, the query orders the result set by user_account_id and log_date.
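
To see what IGNORE NULLS buys us, here is a minimal sketch (run against the same user_sample table) comparing it with the default behavior. Since the frame ends at the current row, plain LAST_VALUE simply echoes the current row's value, NULL included:

SELECT
    log_date
  , user_account_id
  , account_type AS raw_value
  , LAST_VALUE(account_type)              -- default: does not skip NULLs,
    OVER (PARTITION BY user_account_id    -- so it echoes the current row
          ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS without_ignore_nulls
  , LAST_VALUE(account_type IGNORE NULLS) -- skips NULLs: last known attribute
    OVER (PARTITION BY user_account_id
          ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS with_ignore_nulls
FROM user_sample
ORDER BY user_account_id, log_date;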

I recommend the AWS Window Function Syntax Documentation and Window Functions in Redshift Doc if you want to understand frame clauses better.

Step 2: implement date ranges for every user state

With the above query, I eliminated the nulls. Now, if I query the table for any record, the result contains all attributes of a user at that particular moment in time. However, some information is still missing from this table. What if I want to check the status of all users of the app as of 2020-03-03? There is no row with such a log_date.

To easily query the user report by a specific date, we need to add time ranges to the table.

log_date acts as a start date (renamed to valid_from), and one more column named valid_to acts as an end date, equal to one day before the next log_date. The most recent row of an active user holds the current status, and it typically gets assigned a surrogate end date of 9999-12-31. The most recent row of a deleted user has equal valid_from and valid_to dates.

Let’s generate such data and write the result to a new table named users_final_report:

CREATE TABLE users_final_report AS (
    WITH users_no_gaps AS (
        SELECT
            log_date
          , user_account_id
          , event_type
          , LAST_VALUE(account_type IGNORE NULLS)
            OVER (PARTITION BY user_account_id ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS account_type
          , LAST_VALUE(user_level IGNORE NULLS)
            OVER (PARTITION BY user_account_id ORDER BY log_date ROWS UNBOUNDED PRECEDING) AS user_level
        FROM user_sample)
        
    SELECT
        log_date AS valid_from
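        -- valid_to: one day before the next event's log_date; NVL assigns the
        -- surrogate end date 9999-12-31 to an active user's most recent row,
        -- while a deletion row starts and ends on the deletion day itself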
      , CASE
            WHEN event_type != 'user account deleted'
            THEN NVL(LEAD(log_date, 1) OVER (PARTITION BY user_account_id ORDER BY log_date) - 1, '9999-12-31')
            ELSE log_date
            END AS valid_to
      , user_account_id
      , event_type
      , account_type
      , user_level
    FROM users_no_gaps
    ORDER BY user_account_id, valid_from)
;

Now that the final result is ready, I can go back to the question I asked before. I want to see the state of all users as of 2020-03-03.

The query is pretty straightforward:

SELECT *
FROM users_final_report
WHERE '2020-03-03' BETWEEN valid_from AND valid_to;

Output:

valid_from | valid_to   | user_account_id | event_type           | account_type | user_level
2020-02-19 | 9999-12-31 | 222             | user account created | free (basic) | beginner
2020-03-01 | 2020-03-03 | 333             | user account created | free (basic) | beginner
2020-03-02 | 2020-03-05 | 111             | user account updated | premium      | expert
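
As a closing note, the surrogate end date also makes it easy to pull the current state of all active users. Deleted accounts drop out automatically, because their final rows close on the deletion day:

SELECT *
FROM users_final_report
WHERE valid_to = '9999-12-31';

With the sample data, this returns only user 222, the one account that was never updated or deleted.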