general aws tutorial
Don't get lost in the vastness of AWS. In this series we'll cut through the noise and learn how to use AWS through real world examples.
Unlike regular aggregate functions which collapse multipe rows into a single output, window functions maintain the individual row structure making it possible to compute running totals, rankings, moving averages, and other analytics without losing information. In this lesson we'll cover the three categories of window functions and look at examples of how they're used.
Window functions operate within a 'window', or a specified subset of rows, which can be defined based on partitioning and ordering criteria. The PARTITION BY clause divides the data into groups, similar to how GROUP BY works in aggregate functions, but without collapsing the data. The ORDER BY clause can be used to sort the rows within each partition, allowing you to calculate values such as ranks or cumulative sums.
The generalized strucutre of a window function is as follows:
SELECT
FUNCTION(column_name) OVER (PARTITION BY partition_column/s ORDER BY order_column/s) AS alias_name
FROM
table_name;
Note: The use of ORDER BY within a window function is optional and depends on the desired outcome. We'll look at this soon.
These functions perform aggregate calculations such as a sums, averages, or counts, over sets of rows while still returning each of the original rows. Lets take a look at an example where we calculate the cumulative sale price of each fruit over time.
SELECT
fruit_id,
order_date,
order_price_aud,
SUM(order_price_aud) OVER (PARTITION BY fruit_id ORDER BY order_date) as fruit_cumul_order_price
FROM
orders
ORDER BY
fruit_id,
order_date;
You might be wondering why we use ORDER BY twice. When using window functions, the ORDER BY clause within the window function is applied only during the calculation of the window function, however the rows in the query result remain in their original order.
What if we wanted to calculate the total sale price for each fruit, the same way we would when using GROUP BY, but without collapsing the data? To do this we simply use a window function without the ORDER BY clause.
SELECT
fruit_id,
order_date,
order_price_aud,
SUM(order_price_aud) OVER (PARTITION BY fruit_id) as fruit_order_price
FROM
orders
ORDER BY
fruit_id,
order_date;
The window function no longer calculates the cumulative values, rather, the same total sum will be repeated for every row corresponding to a specific fruit_id, regardless of the order_date.
While our example uses SUM as the aggregate function, the usual suspects can be used such as COUNT, MEAN, and MEDIAN etc. For a full list, refer to your DBMS's Documentation. A list of aggregate functions in snowflake can be found here.
We can also create a partition based on multiple columns. Lets find the order price running totals, partitioned by the customer and the order month.
SELECT
customer_id,
fruit_id,
MONTH(order_date) as order_month,
order_date,
order_price_aud,
SUM(order_price_aud) OVER (PARTITION BY customer_id, MONTH(order_date) ORDER BY order_date) as customer_monthly_running_total
FROM
orders
ORDER BY
customer_id,
order_date;
Ranking window functions assign a rank or position to each row within a partition of a dataset, based on the order specified. The four most common ranking window functions are:
The outputs of ROW_NUMBER(), RANK(), and DENSE_RANK() are rather similar with one key difference between each. We'll cover them all below.
ROW_NUMBER()
ROW_NUMBER() assigns a unique sequential integer to rows within a partition, starting at 1. Each row within a partition recieves a unique rank with no ties.
SELECT
fruit_id,
order_date,
order_price_aud,
ROW_NUMBER() OVER (PARTITION BY fruit_id ORDER BY order_price_aud) as order_price_row_num_rank
FROM orders
ORDER BY fruit_id, order_price_row_num_rank;
The output will rank each of the orders based on the order price, partitioned by each fruit. In the above example, ROW_NUMBER() is calculated based off order_price_aud. If two rows within the same partition have the same order price, the DBMS will use its default behavior to break ties, which usually involves the order of the rows as they appear in the result set (which itself is often based on the order they are physically stored or retrieved by the query).
Looking at the orders table, we can see a few fruits with duplicate order prices. Lets filter out a few and take a closer look.
SELECT
fruit_id,
order_date,
order_price_aud,
ROW_NUMBER() OVER (PARTITION BY fruit_id ORDER BY order_price_aud) as order_price_row_num_rank
FROM
orders
WHERE fruit_id in (20, 10, 13)
ORDER BY
fruit_id,
order_price_row_num_rank;
Notice how rows within the same partition which share the same order price recieve different rank values from ROW_NUMBER(). Depending on the use case, this may or may not be desirable.
Our table is ordered by date, and so earlier orders sharing the same order price as other rows within the same partition recieve lower ranks. If we wanted to change this, we'd need to include a new column to the ORDER BY clause within the window function. Lets change it so that more recent orders recieve a lower rank value.
SELECT
fruit_id,
order_date,
order_price_aud,
ROW_NUMBER() OVER (PARTITION BY fruit_id ORDER BY order_price_aud, order_date DESC) as order_price_row_num_rank
FROM
orders
ORDER BY
fruit_id,
order_price_row_num_rank;
Later orders with duplicate order prices are now ranked lower compared to before.
RANK()
An alternative option to ROW_NUMBER() is the RANK() function. RANK() assigns rank values to each row within a partition, with gaps in the rank values in the event of a tie. Lets take a look at how it works, filtering for fruit_id = 10 as it provides a good example.
SELECT
fruit_id,
order_date,
order_price_aud,
RANK() OVER (PARTITION BY fruit_id ORDER BY order_price_aud) as order_price_rank
FROM
orders
WHERE fruit_id = 10
order by
fruit_id,
order_price_rank;
Rows sharing the same order price are ranked the same, resulting in two rows with order_price_rank = 4. The next rank value given is then 6.
DENSE_RANK()
DENSE_RANK() works similar to RANK(), however consecutive ranks are not skipped in the event of a tie.
SELECT
fruit_id,
order_date,
order_price_aud,
DENSE_RANK() OVER (PARTITION BY fruit_id ORDER BY order_price_aud) as order_price_dense_rank
FROM
orders
WHERE fruit_id = 10
order by
fruit_id,
order_price_dense_rank;
The query will result in two rows with rank = 4, with the next rank value being 5.
NTILE(n)
The NTILE(n) function is used to divide the result set into n approximately equal-sized groups. Each row in the results assigned a group number, which ranges from 1 to n. We'll take a look at an example below.
SELECT
fruit_id,
order_date,
order_price_aud,
NTILE(4) OVER (PARTITION BY MONTH(order_date) ORDER BY order_date) AS quartered_orders_by_month
FROM
orders;
In this example, we're partitioning by the month and then splitting the orders for each month in to four approximately equal groups ('approximately' because the number of rows in each month isn't a multiple of 4).
Others
PERCENT_RANK(), while not as popular as other ranking functions, is another option. It works similarly to RANK() by leaving gaps in the percentile values in the event of a tie. Lastly, CUME_DIST() can be used to find the percentage of rows less than or equal to the current row. Unlike RANK(), a gap is produced between the tied rows and the rows with values lower than the tied rows, not greater.
Comparing Ranking Functions
Value window functions in SQL are used to assign to rows values from other rows. The most common ones include:
Before we cover the functions, we'll take a look at the concept of 'window frames', which we'll be using in conjunction with some of our value window functions.
Window Frames
Window frames in SQL are a way to define a subset of rows within a partition on which a window function operates. It can be thought of as a sort of 'partition' based on a range of rows, rather than the values of a column. When using window functions such as SUM(), AVG(), ROW_NUMBER(), etc., you can specify a window frame to limit the set of rows that are considered for each calculation. This allows for more granular control over the results.
LAG() & LEAD()
the LAG() function returns the value from a row before the current row within the same partition, and can take three arguments with the format LAG(column_name, offset, default_value). The offset specifies how many rows to look back, with the default being 1. If the offset value doesnt exist, the specified default_value is used, otherwise null is returned.
Perhaps we wanted to compare the order prices between consecutive orders, partitioned by each customer. This can be achieved using the following query:
SELECT
customer_id,
fruit_id,
order_date,
order_price_aud,
order_price_aud - LAG(order_price_aud, 1, NULL) OVER (PARTITION BY customer_id ORDER BY order_date) as cust_order_price_change
FROM orders
ORDER BY
customer_id,
order_date;
LEAD() is simply the reverse of LAG(), where we access data from a row after (as opposed to 'before') the current row within the same partition.
FIRST_VALUE(), LAST_VALUE & NTH_VALUE()
These functions do as you'd expect; return the first, last, and nth row within a partition. Perhaps we wanted to answer the following three questions:
SELECT
customer_id,
orders.fruit_id,
order_date,
order_price_aud,
FIRST_VALUE(fruit_name) OVER (PARTITION BY customer_id ORDER BY order_date) as first_order,
NTH_VALUE(fruit_name, 2) OVER (PARTITION BY customer_id ORDER BY order_date) as second_order,
LAST_VALUE(fruit_name) OVER (PARTITION BY customer_id ORDER BY order_date) as last_order,
FROM orders
JOIN fruits
ON orders.fruit_id = fruits.fruit_id
ORDER BY
customer_id,
order_date;
While our dataset is far too small for any real analysis, you can probably see how a query like this could be useful in the real world in determining which products customers are initially drawn to, as well as which products might be losing customers.
We've covered almost all of the typical window functions you'll encounter in SQL, but there's still one more concept to add. Being able to partition values on a column (or columns) is already quite powerful, but we can take it a step further by including a window frame.
Window frames are a way to define a subset of rows within a partition on which a window function operates. A window frame is similar to the window used in calculations such as moving averages, although there are some key differences. When using window functions such as SUM(), ROW_NUMBER(), LAG(), you can specify a window frame to limit the set of rows that are considered for each calculation.
First, add the below 40 rows of data to your orders table. The extra data isn't essential, but it'll make it much easier to see how our window frames work.
INSERT INTO orders (order_id, customer_id, fruit_id, order_date, order_price_aud)
VALUES
(41, 2, 7, '2024-03-02', 6514.27),
(42, 3, 14, '2024-03-02', 8475.45),
(43, 9, 2, '2024-03-03', 1254.58),
(44, 6, 11, '2024-03-05', 8234.21),
(45, 6, 2, '2024-03-05', 8368.34),
(46, 7, 18, '2024-03-08', 1237.45),
(47, 9, 1, '2024-03-10', 7658.34),
(48, 9, 9, '2024-03-11', 8345.23),
(49, 9, 9, '2024-03-12', 4356.33),
(50, 2, 16, '2024-03-14', 6345.45),
(51, 3, 13, '2024-03-14', 5476.23),
(52, 9, 11, '2024-03-16', 9463.48),
(53, 1, 3, '2024-03-16', 3947.48),
(54, 9, 6, '2024-03-17', 4946.23),
(55, 10, 17, '2024-03-18', 9464.23),
(56, 2, 9, '2024-03-19', 9246.23),
(57, 6, 14, '2024-03-20', 2365.23),
(58, 13, 20, '2024-03-22', 6364.23),
(59, 6, 5, '2024-03-23', 2425.59),
(60, 9, 2, '2024-03-25', 4335.64),
(61, 3, 7, '2024-03-26', 5356.50),
(62, 9, 12, '2024-03-28', 12160.46),
(63, 6, 2, '2024-03-29', 1415.06),
(64, 11, 13, '2024-04-1', 5151.80),
(65, 3, 10, '2024-04-2', 8714.78),
(66, 2, 9, '2024-04-3', 2342.24),
(67, 9, 14, '2024-04-5', 5231.33),
(68, 6, 16, '2024-04-6', 5876.23),
(69, 4, 6, '2024-04-6', 5358.32),
(70, 6, 10, '2024-04-8', 6453.08),
(71, 2, 13, '2024-04-10', 2121.80),
(72, 9, 11, '2024-04-10', 7579.26),
(73, 6, 25, '2024-04-11', 5235.55),
(74, 9, 6, '2024-04-12', 5242.08),
(75, 2, 8, '2024-04-14', 1237.59),
(76, 2, 9, '2024-04-15', 7312.79),
(77, 8, 10, '2024-04-15', 8383.77),
(78, 2, 20, '2024-04-16', 12043.52),
(79, 10, 10, '2024-04-17', 2342.59),
(80, 9, 19, '2024-04-19', 1254.59);
Next, lets see how we can use a window frame to calculate a
SELECT
customer_id,
orders.fruit_id,
order_date,
order_price_aud,
FIRST_VALUE(fruit_name) OVER (PARTITION BY customer_id ORDER BY order_date) as first_order,
NTH_VALUE(fruit_name, 2) OVER (PARTITION BY customer_id ORDER BY order_date) as second_order,
LAST_VALUE(fruit_name) OVER (PARTITION BY customer_id ORDER BY order_date) as last_order,
FROM orders
JOIN fruits
ON orders.fruit_id = fruits.fruit_id
ORDER BY
customer_id,
order_date;
SELECT
fruit_id,
order_date,
order_price_aud,
RANK() OVER (PARTITION BY fruit_id ORDER BY order_price_aud) as order_price_rank
FROM orders
WHERE fruit_id = 10
order by fruit_id, order_price_rank;
In situations where both subqueries and CTEs are possible, subqueries should be used for shorter, single use cases, while CTEs should be used for more complex cases, or cases where the same subquery would need to be written multiple times. This is just a generalization; each situation is unique and you should make your own decisions as to which option seems most appropriate.
As you might have guessed from the name, temp tables (short for temporary tables) are tables which exists only within the session they're created. Once the session ends, Any temp tables created are removed. One use of temp tables is to store intermediate results which can be used for querying later on. If you find yourself wanting to reuse a complex CTE or subquery multiple times, temp tables can be a good solution as they store/materialize the query results, avoiding repeated computation.
Temp tables can be created in practically the same way as a standard table as shown below:
CTAS
In this lesson we learnt about subqueries, CTEs, and temp tables, as well as the use of CTAS in creating both regular tables and temp tables.
In the next lesson we'll finish off the course by learning about window functions, allowing us to perform complex calculations across table rows that are somehow related to the current row.
Don't get lost in the vastness of AWS. In this series we'll cut through the noise and learn how to use AWS through real world examples.
Don't get lost in the vastness of AWS. In this series we'll cut through the noise and learn how to use AWS through real world examples.
Don't get lost in the vastness of AWS. In this series we'll cut through the noise and learn how to use AWS through real world examples.