· Alvaro Barber · Tutorials · 3 min read
How to test your Data Pipelines with SQL and dbt
And be confident deploying to Production
How do you perform data validation in your data pipeline? Why is it important to do this?
Are you running these tests manually, or is the data tested automatically with a tool?
These are the questions this post will answer.
We all know how important it is to run tests against the data and the tables we create. Here are some tips on how to do this manually with SQL, or with predefined packages if you are a dbt user.
To get an idea of which tests you can run automatically in dbt without writing them from scratch, check the dbt_expectations package: https://hub.getdbt.com/calogica/dbt_expectations/latest/
The thing is that you need to define these tests in YAML. To see the SQL that some of these packages actually execute under the hood, check the examples below:
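As an illustration, a dbt `schema.yml` combining dbt's built-in `unique` and `not_null` tests with a dbt_expectations compound-uniqueness test might look like this (the model and column names are made up for the example):

```yaml
version: 2

models:
  - name: customers
    tests:
      # Fails if this combination of columns appears more than once.
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list:
            - customer_name
            - customer_last_name
            - customer_address
            - customer_location
    columns:
      - name: email
        tests:
          - unique
          - not_null
```

Running `dbt test` then compiles each of these entries into a SQL query much like the ones shown below.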
- For SQL
- Checking duplicate records. Example: check whether any email appears more than once in the Customers table
SELECT Email, COUNT(*)
FROM Customers
GROUP BY Email
HAVING COUNT(*) > 1;
- Checking NULLs in a column. Example: check whether the Email column contains any NULLs
SELECT *
FROM Customers
WHERE Email IS NULL;
- Checking whether a selected group of columns is unique across all records of the table. In this case: [customer_name, customer_last_name, customer_address, customer_location]
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with validation_errors as (
select
customer_name,customer_last_name,customer_address,customer_location,
count(*) as "n_records"
from customers
where
1=1
and
not (
customer_name is null and
customer_last_name is null and
customer_address is null and
customer_location is null
)
group by
customer_name,customer_last_name,customer_address,customer_location
having count(*) > 1
)
select * from validation_errors
) dbt_internal_test
- Checking that the row count is the same between two tables, for example a production table and its development copy (the same pattern works on filtered subsets or specific columns).
(SELECT COUNT(*) as total_count FROM table_prod)
MINUS
(SELECT COUNT(*) as total_count FROM table_dev)
Note that MINUS is Oracle/Snowflake syntax; in standard SQL (PostgreSQL, SQL Server) use EXCEPT instead. The query returns no rows when the counts match.
- Checking that the data types are correct, or that values have a specific length.
In this example, ensure that the material column contains exactly 11 characters.
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with grouped_expression as (
select
length(
material
) = 11 as expression
from materials_table
),
validation_errors as (
select
*
from
grouped_expression
where
not(expression = true)
)
select *
from validation_errors
) dbt_internal_test
- Checking that the SUM of a column in one table matches the SUM of the corresponding column in another table.
select
count(*) as failures,
count(*) != 0 as should_warn,
count(*) != 0 as should_error
from (
with a as (
select
sum(customer_total_count) as expression
from
customer
),
b as (
select
sum(customer_total_count) as expression
from
customer_address
),
final as (
select
a.expression,
b.expression as compare_expression,
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0)) as expression_difference,
abs(coalesce(a.expression, 0) - coalesce(b.expression, 0))/
nullif(a.expression * 1.0, 0) as expression_difference_percent
from
a cross join b
)
-- DEBUG:select * from final
select * from final where
expression_difference > 0.0
) dbt_internal_test
- Check the data after you load it, and check again after you load it a second time (the expected behavior depends on whether the load is full or incremental). Do any duplicates appear? If so, review how the logic is implemented in your SQL view or script.
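The double-load check can be sketched with an in-memory database; the table and column names here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT)")

batch = [("a@x.com",), ("b@x.com",)]

# First load.
conn.executemany("INSERT INTO customers VALUES (?)", batch)
# Second load of the same batch, as a naive full reload would do.
conn.executemany("INSERT INTO customers VALUES (?)", batch)

# Same duplicate check as in the SQL example earlier in the post.
dupes = conn.execute(
    "SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # every email now appears twice
```

An incremental or idempotent load (for example, MERGE on a business key) should leave this query returning zero rows even after the second run.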
- Is the query updating records correctly according to the CDC logic you implemented, whether it relies on hash keys or another SCD mechanism?
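One common way to drive hash-key CDC is to hash the tracked columns and compare the stored hash with the incoming one; a minimal sketch, with hypothetical column values:

```python
import hashlib

def row_hash(*cols):
    # Concatenate the tracked columns with a separator so that
    # ("ab", "c") and ("a", "bc") do not produce the same hash.
    payload = "||".join("" if c is None else str(c) for c in cols)
    return hashlib.md5(payload.encode()).hexdigest()

stored = row_hash("Alice", "Smith", "Main St 1")
incoming = row_hash("Alice", "Smith", "Main St 2")  # address changed

needs_update = stored != incoming
print(needs_update)  # True: the CDC logic should update this record
```

A test for this logic would load a row, change one tracked column, reload, and assert that exactly one record was updated (or, for SCD2, that a new version row was created).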
- Check that the data types of one table match the data types of any second table that reads from it. You want consistent data types across all of your tables.
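In most warehouses you would compare `information_schema.columns` for the two tables; here is the same idea sketched with SQLite's `PRAGMA table_info` (the tables and the type drift are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT, total_count INTEGER)")
# Downstream table where total_count accidentally became REAL.
conn.execute("CREATE TABLE customers_report (email TEXT, total_count REAL)")

def column_types(table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}

src = column_types("customers")
dst = column_types("customers_report")

# Columns whose declared type differs between the two tables.
mismatches = {col: (t, dst.get(col)) for col, t in src.items() if dst.get(col) != t}
print(mismatches)  # {'total_count': ('INTEGER', 'REAL')}
```

An empty result means the downstream table is type-consistent with its source; anything else is worth fixing before it silently truncates or rounds your data.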
- Last but not least: execution time. You want your pipeline to work, but with the most efficiency and the least cost possible. Check these basic optimizations to speed up your code: https://wetrustindata.com/sql_optimizations/