The Data Hub

Introduction to Sql 01

What exactly is SQL?
Setting up for SQL
1. Creating a Snowflake Free Trial
Preparing our Worksheet Environment
1. stuff
2. more stuff
Essential SQL Clauses
1. SELECT
2. ORDER BY
3. GROUP BY & HAVING
4. DISTINCT
5. CASE
6. LIMIT & OFFSET
7. WHERE
Summary

SQL is an essential skill for data management. Whether you're an aspiring data analyst, scientist, or engineer, understanding SQL is essential. In this course we'll cover the fundamentals of SQL and develop a solid foundation to later build upon. If your're completely new to SQL or just want a refresher on the basics, this is the guide for you.

terminology: querying, schemas, tables, database, warehouse, roles, relational database

What exactly is SQL?

SQL is a standardized language used to manage relational databases and perform various operations on the data within them. SQL allows users to create, read, update, and delete database records. Unlike general-purpose programming languages, SQL is specifically designed for interacting with databases and enables efficient querying and manipulation of large datasets.

You may have already noticed people refering to different variations of SQL such as MySQL, PostgreSQl, or SQL Server, which can often create confusion. Different database management systems (DBMS) have their own SQL dialects with unique features and extensions, however the core syntax and commands remain consistent.

Setting up for SQL

Unlike most SQL tutorials, we'll be learning SQL through Snowflake as opposed to the traditional SQL Server route. At the end of the day, it doesn't really matter which version of SQL you use, however snowflake offers a modernized and user friendly interface with no need for downloads. It also includes a popular dataset used in benchmarks which we'll be using later.

To begin, you'll simply need to create a snowflake free trial account, which can be done here. Simply use your own email and pass 'N/A' as the company name if you're creating a personal account. Select standard edition, your preferred cloud provider (choose AWS if you have no preference), and the closest region. Your main account can have multiple sub accounts with different providers/regions, so don't stress thinking you'll be locked in to one particular provider/region.

Once your account is activated and you're logged in, head over to the worksheets section of the navigation panel on the left. This is where all our SQL worksheets will be located. To keep things neat for the future, create a folder using the + located at the top right and name it Introduction to SQL. Similarly, create a worksheet called introduction 01: Basics (you can name your worksheets using the 3 dots on the tab at the top of your screen, or by using the worksheets section of the sidebar).

Before we begin with SQL, we'll do a tiny crash-course for relational databases, as well as some of snowflake's functionalities. There are 5 key terms to know:

Tables
Schemas
Databases (relational)
Roles
Warehouses

Tables

A table is a database object that stores data in rows and columns. Each column represents a data field such as age or gender, and each row represents a unique record.

Schemas

A schema is a logical grouping of database objects such as tables, views, and stored procedures (we'll get to views and the rest another time). Schemas provide a way to organize and separate objects for different applications or users within a single database. For example, we might have all of our original, unchanged data in one schema (typically called 'raw'), while another schema is used to store 'enhanced' or 'curated' tables used for querying. Raw data is always kept out of sight from end users such as shareholders and analysts, and so schemas allow for an easy way to separate such data.

Databases

A database is a storage location of related data used to capture a specific situation. One example of a database could be a point-of-sale (POS) database containing information regarding transactions, inventory, customers, employees etc. On the other hand, their data for marketing and promotions would likely be stored in an entirely separate database.

Roles

Just like in cloud services, snowflake uses the concept of roles to define security and access permissions. These permissions determine what actions users can perform and what data they can access within the Snowflake environment. It's considered good practice to avoid usage of the ACCOUNTADMIN role unless necessary.

Warehouses

While the term warehouse may lead you to imagine a collection of databases, this isn't the case. A warehouse (in the context of snowflake) is essentially a cluster of virtual machines managed by snowflake. This is what provides the compute power required to execute SQL queries and other data operations. Snowflake gives users the ability to change the size and number of clusters for their warehouses, however this is irrelevent for our situation and we'll be using an X-Small warehouse at all times.

Preparing our Worksheet Environment

To begin, we need to create a database, schema, and table. We also need to select our role and warehouse. We could do this manually, but there's an easier way through the use of built in commands.

Note: use ctrl + enter to run queries.


USE ROLE SYSADMIN;
USE WAREHOUSE COMPUTE_WH; 
CREATE DATABASE TUTORIAL_DB;
CREATE SCHEMA FRUIT;
DROP SCHEMA PUBLIC;

First, we're selecting our role as sysadmin which provides a balance between permissions and security. Then we selected the warehouse COMPUTE_WH, which is a premade X-Small warehouse in every new snowflake account. Lastly, we created our database and schema, and dropped the 'public' schema created in every new database.

Note: You'll likely need to refresh the sidebar to see the database we created. This is a common theme you'll encounter when using snowflake.

Now its time to create a table with some data! Thoroughout this course we'll be playing the role as a newly hired analyst at a highly disorganized fruit wholesaler. The company only began tracking orders two months ago, and everythings stored in Microsoft Word. Its now our job to fix their mess. To create a table we use the following format:


CREATE TABLE table_name (
    integer_column INT,             -- INT: Integer only values
    string_column1 VARCHAR(50),     -- VARCHAR(50): Variable characters (strings) where max length = 50
    string_column2 VARCHAR(20),     -- VARCHAR(20): Variable characters (strings) where max length = 20
    float_column FLOAT              -- FLOAT: Float only values
);

Tip:Simply think of creating a table as creating the blueprints for how we want to store the data.

All we're doing is using the 'CREATE' and 'TABLE' keywords, followed by the table name. Then in brackets we specify the column name + data type pairs. You might notice how we keep including a colon at the end of every query. The colon indicates where our queries end, ensuring anything below the colon isn't executed.

Now we know the format for creating tables we'll use it to create our fruits table:


CREATE TABLE fruits (
    fruit_id INT,                     -- Unique identifier for each fruit (Primary Key)
    fruit_name VARCHAR(30),           -- Name of the fruit
    fruit_color VARCHAR(20),          -- Color of the fruit
    average_weight DECIMAL(10,2),      -- Average weight of the fruit in grams
    average_size VARCHAR(10),         -- Average size (e.g., small, medium, large)
    botanical_family VARCHAR(30),     -- Botanical family (e.g., Rosaceae)
    price_per_kg DECIMAL(10,2),       -- Price per kilogram in $ AUD
    season VARCHAR(30)                -- Typical season when the fruit is in season
);

We have our table, now we need some data. We can insert data into a table using the following format:


INSERT INTO table_name (integer_column, string_column1, string_column2, float_column) 
VALUES
    (int, 'string1', 'string2', float), -- Row 1
    (int, 'string1', 'string2', float), -- Row 2
    (int, 'string1', 'string2', float); -- Row 3

The format is similar to creating a table, but instead of listing the columns we list the rows with their column values separated by commas. We'll do the same again but with some fruit data this time. Notice the first column called 'fruit_id'. Almost all tables will contain an id value of some form, which ensures that even if other data in two rows are identical such as, they can still be distinguished.


INSERT INTO fruits (fruit_id, fruit_name, fruit_color, average_weight, average_size, botanical_family, price_per_kg, season) 
VALUES
    (1, 'Apple', 'Red', 150.64, 'Medium', 'Rosaceae', 3.10, 'Fall'),
    (2, 'Banana', 'Yellow', 120.64, 'Medium', 'Musaceae', 2.91, 'Year-Round'),
    (3, 'Orange', 'Orange', 200.23, 'Medium', 'Rutaceae', 3.55, 'Winter'),
    (4, 'Strawberry', 'Red', 10.75, 'Small', 'Rosaceae', 8.32, 'Spring'),
    (5, 'Mango', 'Yellow', 250.34, 'Large', 'Anacardiaceae', 5.02, 'Summer'),
    (6, 'Blueberry', 'Blue', 1.22, 'Small', 'Ericaceae', 10.28, 'Summer'),
    (7, 'Grapes', 'Purple', 5.24, 'Small', 'Vitaceae', 4.33, 'Fall'),
    (8, 'Pineapple', 'Brown', 900.75, 'Large', 'Bromeliaceae', 6.59, 'Year-Round'),
    (9, 'Cherry', 'Red', 8.19, 'Small', 'Rosaceae', 12.79, 'Summer'),
    (10, 'Kiwi', 'Green', 75.85, 'Small', 'Actinidiaceae', 7.39, 'Winter'),
    (11, 'Peach', 'Yellow', 150.36, 'Medium', 'Rosaceae', 4.58, 'Summer'),
    (12, 'Plum', 'Purple', 70.86, 'Small', 'Rosaceae', 5.19, 'Summer'),
    (13, 'Watermelon', 'Green', 5053.22, 'Large', 'Cucurbitaceae', 1.50, 'Summer'),
    (14, 'Lemon', 'Yellow', 100.26, 'Medium', 'Rutaceae', 3.53, 'Winter'),
    (15, 'Pear', 'Green', 180.13, 'Medium', 'Rosaceae', 4.23, 'Fall'),
    (16, 'Apricot', 'Orange', 35.64, 'Small', 'Rosaceae', 6.83, 'Summer'),
    (17, 'Coconut', 'Brown', 1500.18, 'Large', 'Arecaceae', 2.40, 'Year-Round'),
    (18, 'Fig', 'Purple', 50.84, 'Small', 'Moraceae', 7.06, 'Summer'),
    (19, 'Pomegranate', 'Red', 250.34, 'Medium', 'Lythraceae', 5.59, 'Fall'),
    (20, 'Avocado', 'Green', 200.68, 'Medium', 'Lauraceae', 6.52, 'Year-Round');

Our fruits table should now contain data! In order to check, we'll query our table. A basic query will consist of the followinging structure:


SELECT columns FROM table;

In SQL, you can use * as a quick way to specify 'all columns' without explicitly typing their names.


SELECT * FROM fruits;

If everything went to plan, you should see your table appear with the fruit data.

Essential SQL Clauses

Now its time to practice some of the essential SQL clauses. In this lesson we'll learn the following:

SELECT & FROM
ORDER BY
GROUP BY & HAVING
DISTINCT
AS
CASE
LIMIT & OFFSET
WHERE

SELECT & FROM

Any time you want to query data, SELECT from FROM will be involved. SELECT is used to specify the columns, while FROM is used to specify the table. The below examples involve querying specific columns from our table.


SELECT fruit_name FROM fruits;

SELECT fruit_name, price_per_kg FROM fruits;

SELECT fruit_name, average_weight, average_size FROM fruits;

ORDER BY

The ORDER BY clause is used to sort the result of a query on one or more columns. Perhaps we want to order our fruits from highest to lowest price per kg:


SELECT
    fruit_name,
    price_per_kg
FROM fruits
ORDER BY price_per_kg DESC; -- DESC is used to specify descending order

The above query will return the fruits cherry, blueberry, and strawberry at the top, as they have the highest price per kg. If we wanted to take the size of the fruit into account too, we can simply add it to our ORDER BY clause.


SELECT
    fruit_name,
    price_per_kg,
    average_size
FROM fruits
ORDER BY average_size, price_per_kg DESC;

Taking size into account, avocado and pineapple are now the highest price per kg fruits for the medium and large sizes.

GROUP BY & HAVING

We can use GROUP BY to aggregate data and generate summary statistics. We can also combine it with HAVING to filter our results.


-- Average price per kg for each size category
SELECT average_size, AVG(price_per_kg) AS max_price
FROM fruits
GROUP BY average_size;

-- Number of Fruits in each season
SELECT season, COUNT(*) AS fruit_count
FROM fruits
GROUP BY season;

-- Filtered on seasons with > 3 fruits
SELECT season, COUNT(*) AS fruit_count
FROM fruits
GROUP BY season
HAVING COUNT(*) > 3

Note: We'll learn the WHERE clause soon which may seem the same as HAVING, however these clauses have different use cases. HAVING is always used in conjunction with GROUP BY i.e. on aggregated data, while the WHERE clause cannot be used on aggregated data. Simply remember: if you've aggregated your data with GROUP BY and want to filter it, use HAVING, otherwise use WHERE.

DISTINCT

The DISTINCT clause is used to ensure that for a selected group of columns, each row in the output of a query will be unique. Lets try a few queries and see how their outputs differ:


SELECT DISTINCT fruit_color FROM fruits;

select DISTINCT fruit_color, fruit_name FROM fruits;

select DISTINCT fruit_color, average_size FROM fruits;

The first query returns one row for each fruit color, however when we include the fruit name into the query, we get multiple rows with the same fruit color. Since each fruit name is unique, the output will return every row. For the final query, some of the fruits are the same color and size, and so we only get back 12 of the 20 rows.

In SQL, we use AS to alias columns and tables. In our table, we have the average weight column, but without additional information or looking at the data we don't know what units are being used. Using AS, we can rename the column to contain the units.


SELECT 
	fruit_name,
	average_weight AS avg_weight_grams
FROM fruits;

When used in a select query, the effect is only temporary and the table retains the original column names. We'll learn how to rename the columns permanently soon.

Note: The use of as when aliasing is actually optional

CASE

The CASE statement allows us to create conditional logic in our queries. This is often used to add additional columns with new information. Perhaps we want a column to distinguish between berry and non-berry fruits. To achieve this, we can use the following case statement:


SELECT fruit_name,
    CASE 
        WHEN fruit_name LIKE '%berry' THEN 'Yes'
        ELSE 'No'
    END AS is_berry
FROM fruits;

In SQL, we can use '%string%' to search for any rows in a column containing a specified string. In the case of '%berry', we're searching for any words ending in berry.

Tip: When creating boolean columns, use prefixes such as is_, has_, can_, or should_. This makes it easier to identify and distinguish between columns at a glance.

LIMIT & OFFSET

LIMIT is used to define the number rows to be returned from a query. Using LIMIT can be useful when working with large databases as it improves computation time and costs. LIMIT can also be used in conjunction with OFFSET to specify the starting position of rows. Try the queries below and see how their results differ.


SELECT fruit_id, fruit_name FROM fruits;

SELECT fruit_id, fruit_name FROM fruits LIMIT 10;

SELECT fruit_id, fruit_name FROM fruits LIMIT 10 OFFSET 5;

The output goes from the first 20 rows to the first 10 rows, and finally to rows 6 through 15.

WHERE

The WHERE clause is possibly one of the most versatile clauses and can be used to filter a table in many ways. We can combine where with a variety of operators to create various filtering logic. Don't worry about memorizing each and every query below, however make sure to go through them slowly and familiarize yourself with SQL's syntax. Pay attention to the use of operators such as BETWEEN, IN, and LIKE.

Note: SQL is case sensitive! To remove upper cases we can use LOWER.


-- 1. Equalities
SELECT * FROM fruits WHERE fruit_color = 'Red';

-- 2. Inequalities
SELECT * FROM fruits WHERE fruit_color <> 'Red'; -- SQL inequalities use <> instead of != 

-- 3. Greater Than or Equal To
SELECT * FROM fruits WHERE average_weight >= 100;

-- 4. BETWEEN
SELECT * FROM fruits WHERE average_weight BETWEEN 50 AND 200;

-- 5. String Functions
SELECT * FROM fruits WHERE UPPER(fruit_name) = 'APPLE'; -- UPPER converts all letters to uppercase (LOWER for lowercase)
SELECT * FROM fruits WHERE LENGTH(fruit_name) > 5; -- LENGTH returns the length of a string

-- 6. IS NULL
SELECT * FROM fruits WHERE average_size IS NULL; -- Or 'IS NOT NULL'

-- 7. IN
SELECT * FROM fruits WHERE fruit_color IN ('Red', 'Yellow', 'Green'); -- Or 'NOT IN'

-- 8. AND, OR
SELECT * FROM fruits WHERE fruit_color = 'Red' AND average_weight > 100;
SELECT * FROM fruits WHERE fruit_color = 'Red' OR average_weight > 100;

-- 9. NOT
SELECT * FROM fruits WHERE NOT fruit_color = 'Red';

-- 10. Combining AND, OR, NOT
SELECT * FROM fruits WHERE (fruit_color = 'Red' OR fruit_color = 'Yellow') AND average_weight > 100;
SELECT * FROM fruits WHERE NOT (fruit_color = 'Red' AND average_weight > 100);

-- 5. String Functions
SELECT * FROM fruits WHERE UPPER(fruit_name) = 'APPLE'; -- Upper converts all letters to uppercase (LOWER for lowercase)
SELECT * FROM fruits WHERE LENGTH(fruit_name) > 5; -- LENGTH returns the length of a string

-- 11. LIKE
SELECT * FROM fruits WHERE fruit_name LIKE 'A%'; -- Fruit names starting with 'A'
SELECT * FROM fruits WHERE fruit_name LIKE '%berry'; -- Fruit names ending in 'berry'
SELECT * FROM fruits WHERE LOWER(fruit_name) LIKE '%apple%'; -- Fruit names containing 'apple'

-- 12. Functions
SELECT * FROM fruits WHERE ROUND(average_weight, 0) = 150;

Summary

In this lesson we learnt how to:

Set up a Snowflake account
Create a database, schema and table
Perform various queries on our data