· Alvaro Barber · Tutorials · 3 min read
How to test your Data Pipelines with SQL and dbt
And be confident deploying to Production
How do you perform data validation in your data pipeline? Why is it important to do this?
Are you running these tests manually, or is the data tested automatically with a tool?
These are the questions this post will answer.
We all know how important it is to run tests against the data and the tables we create. Here are some tips on how to do this manually with SQL, or with predefined packages if you are a dbt user.
To get an idea of which tests you can run automatically in dbt without writing them from scratch, check the dbt_expectations package: https://hub.getdbt.com/calogica/dbt_expectations/latest/
The thing is that you need to define these tests in YAML. To see the SQL that some of these packages actually execute under the hood, check the examples below:
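As an illustration, a dbt `schema.yml` combining dbt's built-in `unique` and `not_null` tests with a dbt_expectations compound-uniqueness test might look like this (the model and column names are made up for the example):

```yaml
version: 2

models:
  - name: customers
    tests:
      # Fails if this combination of columns appears more than once.
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list:
            - customer_name
            - customer_last_name
            - customer_address
            - customer_location
    columns:
      - name: email
        tests:
          - unique
          - not_null
```

Running `dbt test` then compiles each of these entries into a SQL query much like the ones shown below.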
- For SQL
- Checking duplicate records. Example: check whether any email appears more than once in the Customers table
SELECT Email, COUNT(*)
FROM Customers
GROUP BY Email
HAVING COUNT(*) > 1;
- Checking NULLs in a column. Example: check whether the Email column contains any NULLs
SELECT *
FROM Customers
WHERE Email IS NULL;
- Checking whether a selected group of columns is unique across all records of the table. In this case: [customer_name, customer_last_name, customer_address, customer_location]
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with validation_errors as (
select
customer_name,customer_last_name,customer_address,customer_location,
count(*) as "n_records"
from customers
where
1=1
and
not (
customer_name is null and
customer_last_name is null and
customer_address is null and
customer_location is null
)
group by
customer_name,customer_last_name,customer_address,customer_location
having count(*) > 1
)
select * from validation_errors
) dbt_internal_test
- Checking that the row count is the same between two tables, for example a production table and its development copy (the same pattern works on filtered subsets or specific columns).
(SELECT COUNT(*) as total_count FROM table_prod)
MINUS
(SELECT COUNT(*) as total_count FROM table_dev)
Note that MINUS is Oracle/Snowflake syntax; in standard SQL (PostgreSQL, SQL Server) use EXCEPT instead. The query returns no rows when the counts match.
- Checking that the data types are correct, or that values have a specific length.
In this example, ensure that the material column contains exactly 11 characters.
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with grouped_expression as (
select
length(
material
) = 11 as expression
from materials_table
),
validation_errors as (
select
*
from
grouped_expression
where
not(expression = true)
)
select *
from validation_errors
) dbt_internal_test
- Checking that the SUM of a column in one table matches the SUM of the corresponding column in another table.
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with a as (
select
sum(customer_total_count) as expression
from
customer
),
b as (
select
sum(customer_total_count) as expression
from
customer_address
),
final as (
select
a.expression,
b.expression as compare_expression,
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0)) as expression_difference,
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression * 1.0, 0) as expression_difference_percent
from
a cross join b
)
-- DEBUG:select * from final
select * from final where
expression_difference > 0.0
) dbt_internal_test
- Check the data after you load it, and check again after you load it a second time (the expected behavior depends on whether the load is full or incremental). Do any duplicates appear? If so, review how the logic is implemented in your SQL view or script.
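The double-load check can be sketched with an in-memory database; the table and column names here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT)")

batch = [("a@x.com",), ("b@x.com",)]

# First load.
conn.executemany("INSERT INTO customers VALUES (?)", batch)
# Second load of the same batch, as a naive full reload would do.
conn.executemany("INSERT INTO customers VALUES (?)", batch)

# Same duplicate check as in the SQL example earlier in the post.
dupes = conn.execute(
    "SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # every email now appears twice
```

An incremental or idempotent load (for example, MERGE on a business key) should leave this query returning zero rows even after the second run.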
- Is the query updating records correctly according to the CDC logic you implemented, whether it relies on hash keys or another SCD mechanism?
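One common way to drive hash-key CDC is to hash the tracked columns and compare the stored hash with the incoming one; a minimal sketch, with hypothetical column values:

```python
import hashlib

def row_hash(*cols):
    # Concatenate the tracked columns with a separator so that
    # ("ab", "c") and ("a", "bc") do not produce the same hash.
    payload = "||".join("" if c is None else str(c) for c in cols)
    return hashlib.md5(payload.encode()).hexdigest()

stored = row_hash("Alice", "Smith", "Main St 1")
incoming = row_hash("Alice", "Smith", "Main St 2")  # address changed

needs_update = stored != incoming
print(needs_update)  # True: the CDC logic should update this record
```

A test for this logic would load a row, change one tracked column, reload, and assert that exactly one record was updated (or, for SCD2, that a new version row was created).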
- Check that the data types of one table match the data types of any second table that reads from it. You want consistent data types across all of your tables.
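In most warehouses you would compare `information_schema.columns` for the two tables; here is the same idea sketched with SQLite's `PRAGMA table_info` (the tables and the type drift are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT, total_count INTEGER)")
# Downstream table where total_count accidentally became REAL.
conn.execute("CREATE TABLE customers_report (email TEXT, total_count REAL)")

def column_types(table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}

src = column_types("customers")
dst = column_types("customers_report")

# Columns whose declared type differs between the two tables.
mismatches = {col: (t, dst.get(col)) for col, t in src.items() if dst.get(col) != t}
print(mismatches)  # {'total_count': ('INTEGER', 'REAL')}
```

An empty result means the downstream table is type-consistent with its source; anything else is worth fixing before it silently truncates or rounds your data.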
- Last but not least: execution time. You want your pipeline to work, but with the most efficiency and the least cost possible. Check these basic optimizations to speed up your code: https://wetrustindata.com/sql_optimizations/