· Alvaro Barber · Tutorials  · 8 min read

Ace your Data Engineer technical interview

Technical Questions and Quick Practical Tests

Going through interviews can be quite painful.
Take a quick look at this list of questions before your Data Engineer technical interview; it will make your life easier.

This is based on my own experience going through more than 10 interview processes in 2024.
The job descriptions and the skill set required can vary within the Data Engineer position: some interviewers will ask more about PySpark, others more about data modelling and SQL.

First of all, be sure to ask at the HR stage which technologies you should prepare for. Secondly, keep in mind that you will need a broad skill set: you will be asked a bit of everything to see whether you are a complete asset.

Below is the list, which you can work through with your favourite friend ChatGPT. For questions where I found materials that explain things better visually than ChatGPT does, I give some hints or quick links instead.

Good luck in one of the most important days of your life and career!

For practical exercises, go to: https://wetrustindata.com/ace_your_technical_interview_practical/

Data Warehousing

  1. OLAP VS OLTP and benefits from each one
  2. What is Data Normalization and Denormalization and for which use case to apply one or the other?
  3. What is 3NF Normalization and benefits
  4. What are Slowly Changing Dimensions (SCD)? Explain all the types, and give an example of Type 2 or Type 3 that you used in a project.
  5. Differences between ETL and ELT? Where would you apply one or the other?
  6. Difference between a Data Lake and a Data Warehouse?
  7. Data Vault vs Dimensional Modelling https://youtu.be/l5UcUEt1IzM?si=juvZiOP-yFyCg933
  8. Data Vault - How do you create a hashkey and a hashdiff. For what are they used in Data Vault modelling. https://youtu.be/bXTqz8u5dYQ?si=BgfEKU1-Ibs_fM8r
  9. Data Vault - Name the three key components in Data Vault.
  10. Data Modelling - Difference between a fact and a Dimension table.
  11. Difference between incremental and full load in your data pipelines? In the case of incremental tables, what do you use for the delta load?
  12. Difference between columnar and row-based storage? Why is columnar storage in Snowflake more efficient? Give an example.
    HINT: If you query 5 out of 50 columns, the other 45 columns are skipped
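The SCD Type 2 question comes up often, so it is worth having a concrete picture in your head. Here is a minimal, illustrative Python sketch (the field names `valid_from`, `valid_to` and `is_current` are my own choice, not a standard): when a tracked attribute changes, the current row is closed and a new versioned row is appended, so history is preserved.

```python
from datetime import date

def scd2_apply(history, key, new_value, today):
    """Minimal SCD Type 2 sketch (illustrative, not production code):
    close the current row for `key` and append a new versioned row."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["value"] == new_value:
                return history  # no change, nothing to do
            row["is_current"] = False   # close the old version
            row["valid_to"] = today
    history.append({
        "key": key, "value": new_value,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return history

# A customer moves city: the old row is closed, a new current row is added.
history = [{"key": 1, "value": "Madrid",
            "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
scd2_apply(history, key=1, new_value="Berlin", today=date(2024, 6, 1))
```

In a real warehouse this same logic is usually expressed as a MERGE statement or handled by a tool such as dbt snapshots.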

SQL

  • BEGINNER
  1. Difference between inner and outer join
  2. Type of aggregations you know
  3. What is a surrogate key? 
  4. Difference between a primary key and a unique key? How should a primary key be created, and which constraints are mandatory?
  5. Difference between the HAVING and WHERE clauses. Why would you use WHERE first instead of HAVING?
    HINT: WHERE comes earlier in the SQL order of execution
  6. Explain the SQL order of execution. Does a query execute in the same order as it is written? https://wetrustindata.com/sql_order_of_execution/
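You can demonstrate the WHERE vs HAVING question in an interview with a tiny example. This sketch uses Python's built-in sqlite3 as a stand-in for any SQL engine (the table and values are invented for illustration): WHERE filters individual rows before grouping, HAVING filters the aggregated groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('a', 10), ('a', 200), ('b', 50), ('b', 60);
""")

# WHERE filters individual rows BEFORE grouping...
rows_where = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 40
    GROUP BY customer
    ORDER BY customer
""").fetchall()

# ...HAVING filters whole groups AFTER aggregation.
rows_having = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 150
    ORDER BY customer
""").fetchall()
```

The WHERE query drops the small rows before summing; the HAVING query sums everything first and then drops the small groups, which is also why WHERE is generally cheaper when it can do the job.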
  • INTERMEDIATE
  1. How do you optimize your queries? https://wetrustindata.com/sql_optimizations/
  2. What is a CTE and when are you using it?
  3. What are windowing functions? In which situations would you use ROW_NUMBER() or RANK()…
  4. Difference between DENSE_RANK() and RANK()
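The RANK vs DENSE_RANK difference is easiest to remember with a tie. A quick sketch, again using Python's sqlite3 (window functions need SQLite 3.25+, which ships with any recent Python; the data is invented): RANK leaves a gap after ties, DENSE_RANK does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
""").fetchall()

# 'a' and 'b' tie at rank 1; 'c' gets RANK 3 (gap) but DENSE_RANK 2 (no gap).
ranks = {name: (rnk, dense_rnk) for name, rnk, dense_rnk in rows}
```

ROW_NUMBER() would instead break the tie arbitrarily and number the rows 1, 2, 3.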

Python

  1. Difference between a tuple, a dictionary and a list? What are the properties of each, and which one is more space efficient? Does the choice depend on the use case?
  2. How do you share the packages/modules that you created between colleagues?
  3. Difference between a dictionary and a collection
  4. How do you optimize the code? Use of multiprocessing, threads, other libraries?
    What are the best practices so the code is clean and fast?
    HINT: Use the correct data structures(list,tuple,dictionaries), use built-in libraries etc…
  5. Which libraries do you use most? Have you used libraries for Data Science? Name the most important ones.
  6. How do you test your code? Which libraries have you used?
  7. How do you catch errors? Do you use exceptions?
  8. How do you log output when debugging your code: simple prints or a logging library?
  9. What are docstrings? How do you document your code?
  10. What are generators and the concept of yield?
  11. Have you used Virtual environment? venv?
  12. What are decorators?
  13. What are built-in libraries?
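Generators and yield (question 10 above) are a favourite follow-up topic, so here is a minimal sketch. The chunking use case and the function name are my own illustration: a generator produces values lazily, one at a time, instead of building the whole result in memory.

```python
def read_in_chunks(values, chunk_size=2):
    """Generator: yields one chunk at a time instead of building
    the whole list up front (lazy evaluation via `yield`)."""
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk   # execution pauses here until the caller asks again
            chunk = []
    if chunk:
        yield chunk       # emit the final, possibly short, chunk

gen = read_in_chunks([1, 2, 3, 4, 5])
first = next(gen)         # only the first chunk has been computed so far
remaining = list(gen)     # consuming the rest drives the loop to the end
```

This is the same pattern you would use to stream a large file line by line without loading it all into memory.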
  • OBJECT ORIENTED PROGRAMMING
  1. Difference between a function and a method? How do you call them?
  2. How to create a class?
  3. What is a module?
  4. What are Python magic methods?
  5. Explain the difference between deep copy and shallow copy?
  6. Explain the four key principles of OOP. Where have you had to use them?
  7. What is the purpose of __init__?
  8. Why do we use self in instance methods?
  9. Create a module with a class and two functions.
  10. What is an f-string?
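The deep copy vs shallow copy question (number 5 above) is best answered with a two-line demo. A minimal sketch with invented data: a shallow copy creates a new outer object but still shares the nested objects, while a deep copy clones everything recursively.

```python
import copy

original = {"tags": ["etl", "sql"]}

shallow = copy.copy(original)      # new outer dict, but the inner list is shared
deep = copy.deepcopy(original)     # fully independent clone, nested objects included

# Mutating the nested list through the original...
original["tags"].append("python")
# ...is visible through the shallow copy, but not through the deep copy.
```

The same distinction applies to `list(x)` and slicing, which are shallow copies too.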

Pyspark

  1. What is an RDD?

  2. What is Coalesce?

  3. Why do we say PySpark is lazy?

  4. What does doing a group by before a partition do?

  5. Difference between transformations and actions in Pyspark

  6. What is a broadcast join?

  7. What is a Partition in Pyspark? https://youtu.be/hvF7tY2-L3U?si=vsGd7-W4ZYkLn6gr

  8. What is Repartitioning?

  9. What is Shuffling? https://youtu.be/ffHboqNoW_A?si=n8IIuX1zgZAa3zjS

  10. Have you written code that improved performance in PySpark?

  11. Was there a time you had to tune your Spark cluster because jobs were taking too long to run? Why did they take so long?

  12. Difference between Spark and Hadoop?

  13. What is the DRY acronym?

  14. Types of clusters in Databricks

  15. What is the difference between PySpark and the Pandas library in Python? Will Pandas parallelize across the worker nodes if you use it in a Databricks notebook?

  16. What are the driver and worker nodes, and what are their functions?

  17. What happens when your code cannot run because of a driver or executor out-of-memory error?

  18. How do you optimize your code so that these kinds of errors do not happen? https://wetrustindata.com/pyspark_optimizations/
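For the "why is PySpark lazy" question: Spark records transformations (filter, select, map) into a plan and only executes the plan when an action (count, collect, write) is called. A rough analogy in plain Python, using a generator pipeline instead of actual PySpark (the `executed` list is my own instrumentation so you can see when work happens):

```python
# Rough analogy to Spark's lazy evaluation, in plain Python (not PySpark):
# "transformations" build a plan; nothing runs until an "action" consumes it.
executed = []

def numbers():
    for n in range(5):
        executed.append(n)   # side effect so we can observe when work happens
        yield n

# "Transformations": building the pipeline executes nothing yet.
plan = (n * 2 for n in numbers() if n % 2 == 0)
before_action = list(executed)   # still empty at this point

# "Action": consuming the pipeline triggers the whole chain at once.
result = list(plan)
```

The payoff in Spark is the same as here: because the whole chain is known before anything runs, the engine can optimize it (predicate pushdown, pipelining) instead of materializing every intermediate step.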

Git

  1. Have you used Git in your project?
  2. Have you used branching in Git? Why is it important when working with other people?
  3. Have you resolved any merge conflicts, and how?
  4. How do you unstage staged files in Git?
  5. What is git reset, and when did you have to use it?
  • CI/CD
  1. Tell me CI/CD tools that you used.
  2. What is GitLab? Jenkins? Azure DevOps?
  3. What is build and release in the pipelines?

Snowflake

  1. Types of tables in Snowflake (Temporary, Permanent, Transient)? In which cases do you use each? Which one would you use for testing purposes and why?
  2. What is Snowpipe?
  3. What are streams in Snowflake?
  4. How do you load data into Snowflake from an S3 bucket or Azure Blob Storage, and vice versa?
  5. How do you optimize the queries? Have you used the query optimizer area in Snowflake?
  6. What is a clustered index? Have you used them in order to optimize your solutions?
  7. Which file formats are available in Snowflake?
  8. Have you used the JSON format? How did you parse it into a normal structured table?
  9. What tools have you used for security and governance in Snowflake?

Cloud

  1. What are the steps to create a pipeline in ADF? - Check Youtube Video if needed
    https://youtu.be/dFEzT-qfVIk?si=I9C8vc51hntthBPE
  2. Tell me a situation where your pipeline failed and what did you do
  3. Tell me the type of integration runtimes and which ones have you used.
  4. Have you run SSIS jobs in Azure?
  5. Was there a time that you processed a file in ADF from BLOB and the preview button shows no data? What did you do? 
  6. List of data flows that you know.
  7. How are you checking the data quality of data and files? How do you test in ADF?
  8. What is the Lookup Activity and Data Flow doing? 
  9. What is the benefit of using a Pipeline block activity within a Pipeline?
  10. How do you configure a Linked Service?
  11. How do you configure authentication in a Linked Service? How do you store passwords and confidential data?
  12. Azure Key Vault - secrets.
  13. What types of files have you handled in ADF? JSON, CSV, from API?
  14. How do you transform JSON files to CSV with a simple Copy Data Activity? - Check Youtube Video if needed
    https://youtu.be/iJsLU0fduFk?si=9DrN9M50USuwru5I

Soft Skills

  1. Are you aware of the Agile/Scrum ceremonies? What is your role there?
  2. What role would you like to have: a developer role or a management one?
  3. Do you consider yourself a Junior, Mid or Senior Data Engineer? Why?
  4. Tell me your biggest achievement in a development process. How did you help the company with your code/job?
  5. When things go wrong, how do you act?
  6. Tell me how you convince someone of your idea when they hold a completely opposite view.
  7. What do you value most in your working environment?
  8. Do you have client management skills? Tell me about situations where you had to explain a developer solution to someone from the business with no dev skills.
  9. Do you have experience in team leading or giving demos to your colleagues about the skills you mastered?
  10. As a software engineer, what are the main risks that we are facing? 
  11. How do you make your estimations, and how do you know how much time to assign?

Testing

  1. How do you perform data validation in your project (data tests)? Why is it important to do this?
  2. Do you run these tests manually, or is the data tested automatically using some tool?
  • For SQL
  1. Checking Duplicate Records
  2. Checking Nulls in the Columns
  3. Checking if the selected group of columns are unique between all of the records
  4. Checking that the row counts of two tables match, where you need to ensure this constraint.
  5. Check that the data types are the correct ones
  6. Check that the SUM aggregation of one column matches the SUM of the corresponding column in another table.
  7. Check the data once you load it, and check what happens when you load it a second time (this depends on whether the load is full or incremental). Do any duplicates appear?
  8. Is the query updating the records correctly according to the CDC logic that you implemented?
  9. Use the MINUS operator between the table in the PROD environment and the recently developed table in the DEV environment.

HINT: the dbt integration tool has pre-made packages and tests of this kind, so you do not have to write them every single time
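The first two checks in the list above (duplicates and nulls) reduce to two short queries. A sketch using Python's sqlite3 as a stand-in for your warehouse (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT);
    INSERT INTO customers VALUES (1, 'a@x.com'), (2, NULL), (2, 'b@x.com');
""")

# Duplicate check: keys that appear more than once.
dupes = conn.execute("""
    SELECT id, COUNT(*) FROM customers
    GROUP BY id HAVING COUNT(*) > 1
""").fetchall()

# Null check: rows with a missing value in a mandatory column.
null_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL"
).fetchone()[0]
```

A passing test simply asserts that both results are empty/zero; in dbt these exact checks ship as the built-in `unique` and `not_null` tests.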

  • For Python:
  1. Use of Exceptions
  2. Try/except blocks to catch errors
  3. Using unit tests with unittest or pytest library
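To tie the three Python points together, here is a minimal sketch with unittest from the standard library (`safe_divide` is an invented example function): the function raises an exception for bad input, and the test case covers both the happy path and the error path.

```python
import unittest

def safe_divide(a, b):
    """Raise a clear error instead of letting ZeroDivisionError escape."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b

class TestSafeDivide(unittest.TestCase):
    def test_divides(self):
        self.assertEqual(safe_divide(10, 2), 5)

    def test_rejects_zero(self):
        # Verify the exception path, not just the happy path.
        with self.assertRaises(ValueError):
            safe_divide(1, 0)

# Run the tests programmatically; in a real project you would run
# `python -m unittest` or `pytest` from the command line instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSafeDivide)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With pytest, the same tests are plain functions using bare `assert` and `pytest.raises(ValueError)`.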