· Alvaro Barber · Tutorials  · 8 min read

Ace your Data Engineer technical interview

Technical Questions and Quick Practical Tests

Going through interviews can be quite painful.
Take a quick look at this list of questions before your Data Engineer technical interview; it will make your life easier.

This is based on my own experience going through more than 10 interview processes in 2024.
The job descriptions and the skill set required can vary within the Data Engineer position: some interviewers will ask more about PySpark, others more about data modelling and SQL.

First of all, be sure to ask at the HR stage which technologies you should prepare for. Secondly, keep in mind that you will need a broad skill set: you will be asked a bit of everything to see whether you are a complete asset.

Below is the list, which you can work through with your favourite friend ChatGPT. For questions where I found materials that explain things better visually than ChatGPT does, I give some hints or quick links instead.

Good luck in one of the most important days of your life and career!

For practical exercises, go to: https://wetrustindata.com/ace_your_technical_interview_practical/

Data Warehousing

  1. OLAP VS OLTP and benefits from each one
  2. What is Data Normalization and Denormalization and for which use case to apply one or the other?
  3. What is 3NF Normalization and benefits
  4. What are Slowly Changing Dimensions (SCD)? Explain all the types, and give an example of Type 2 or Type 3 that you used in a project.
  5. Differences between ETL and ELT? Where would you apply one or the other?
  6. Difference between a Data Lake and a Data Warehouse?
  7. Data Vault vs Dimensional Modelling https://youtu.be/l5UcUEt1IzM?si=juvZiOP-yFyCg933
  8. Data Vault - How do you create a hashkey and a hashdiff. For what are they used in Data Vault modelling. https://youtu.be/bXTqz8u5dYQ?si=BgfEKU1-Ibs_fM8r
  9. Data Vault - Name the three key components in Data Vault.
  10. Data Modelling - Difference between a fact and a Dimension table.
  11. Difference between incremental and full load in your data pipelines? In the case of incremental tables, what do you use for the delta load?
  12. Difference between columnar and row-based storage? Why is columnar storage in Snowflake more efficient? Give an example.
    HINT: If you query 5 out of 50 columns, the other 45 columns are skipped
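The SCD Type 2 question comes up often, so it is worth having a concrete picture in your head. Here is a minimal, illustrative Python sketch (the field names `valid_from`, `valid_to` and `is_current` are my own choice, not a standard): when a tracked attribute changes, the current row is closed and a new versioned row is appended, so history is preserved.

```python
from datetime import date

def scd2_apply(history, key, new_value, today):
    """Minimal SCD Type 2 sketch (illustrative, not production code):
    close the current row for `key` and append a new versioned row."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            if row["value"] == new_value:
                return history  # no change, nothing to do
            row["is_current"] = False   # close the old version
            row["valid_to"] = today
    history.append({
        "key": key, "value": new_value,
        "valid_from": today, "valid_to": None, "is_current": True,
    })
    return history

# A customer moves city: the old row is closed, a new current row is added.
history = [{"key": 1, "value": "Madrid",
            "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
scd2_apply(history, key=1, new_value="Berlin", today=date(2024, 6, 1))
```

In a real warehouse this same logic is usually expressed as a MERGE statement or handled by a tool such as dbt snapshots.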

SQL

  • BEGINNER
  1. Difference between inner and outer join
  2. Type of aggregations you know
  3. What is a surrogate key? 
  4. Difference between a primary key and a unique key? How should a primary key be created, and which constraints are mandatory?
  5. Difference between the HAVING and WHERE clauses. Why would you use WHERE first instead of HAVING?
    HINT: WHERE comes earlier in the SQL order of execution
  6. Explain the SQL order of execution. Does a query execute in the same order as it is written? https://wetrustindata.com/sql_order_of_execution/
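You can demonstrate the WHERE vs HAVING question in an interview with a tiny example. This sketch uses Python's built-in sqlite3 as a stand-in for any SQL engine (the table and values are invented for illustration): WHERE filters individual rows before grouping, HAVING filters the aggregated groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('a', 10), ('a', 200), ('b', 50), ('b', 60);
""")

# WHERE filters individual rows BEFORE grouping...
rows_where = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 40
    GROUP BY customer
    ORDER BY customer
""").fetchall()

# ...HAVING filters whole groups AFTER aggregation.
rows_having = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 150
    ORDER BY customer
""").fetchall()
```

The WHERE query drops the small rows before summing; the HAVING query sums everything first and then drops the small groups, which is also why WHERE is generally cheaper when it can do the job.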
  • INTERMEDIATE
  1. How do you optimize your queries? https://wetrustindata.com/sql_optimizations/
  2. What is a CTE and when are you using it?
  3. What are windowing functions? In which situations would you use ROW_NUMBER() or RANK()…
  4. Difference between DENSE_RANK() and RANK()
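The RANK vs DENSE_RANK difference is easiest to remember with a tie. A quick sketch, again using Python's sqlite3 (window functions need SQLite 3.25+, which ships with any recent Python; the data is invented): RANK leaves a gap after ties, DENSE_RANK does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('a', 90), ('b', 90), ('c', 80);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
    FROM scores
""").fetchall()

# 'a' and 'b' tie at rank 1; 'c' gets RANK 3 (gap) but DENSE_RANK 2 (no gap).
ranks = {name: (rnk, dense_rnk) for name, rnk, dense_rnk in rows}
```

ROW_NUMBER() would instead break the tie arbitrarily and number the rows 1, 2, 3.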

Python

  1. Difference between a tuple, a dictionary and a list? What are the properties of each, and which one is more space efficient? Does the choice depend on the use case?
  2. How do you share the packages/modules that you created between colleagues?
  3. Difference between a dictionary and a collection
  4. How do you optimize the code? Use of multiprocessing, threads, other libraries?
    What are the best practices so the code is clean and fast?
    HINT: Use the correct data structures(list,tuple,dictionaries), use built-in libraries etc…
  5. Which libraries do you use most? Have you used libraries for Data Science? Name the most important ones.
  6. How do you test your code? Which libraries have you used?
  7. How do you catch errors? Do you use exceptions?
  8. How do you log output when debugging your code: simple prints or a logging library?
  9. What are docstrings? How do you document your code?
  10. What are generators and the concept of yield?
  11. Have you used Virtual environment? venv?
  12. What are decorators?
  13. What are built-in libraries?
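Generators and yield (question 10 above) are a favourite follow-up topic, so here is a minimal sketch. The chunking use case and the function name are my own illustration: a generator produces values lazily, one at a time, instead of building the whole result in memory.

```python
def read_in_chunks(values, chunk_size=2):
    """Generator: yields one chunk at a time instead of building
    the whole list up front (lazy evaluation via `yield`)."""
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk   # execution pauses here until the caller asks again
            chunk = []
    if chunk:
        yield chunk       # emit the final, possibly short, chunk

gen = read_in_chunks([1, 2, 3, 4, 5])
first = next(gen)         # only the first chunk has been computed so far
remaining = list(gen)     # consuming the rest drives the loop to the end
```

This is the same pattern you would use to stream a large file line by line without loading it all into memory.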
  • OBJECT ORIENTED PROGRAMMING
  1. Difference between a function and a method? How do you call them?
  2. How to create a class?
  3. What is a module?
  4. What are Python magic methods?
  5. Explain the difference between deep copy and shallow copy?
  6. Explain the four key principles of OOP. Where have you had to use them?
  7. What is the purpose of __init__?
  8. Why do we use self in instance methods?
  9. Create a module with a class and two functions.
  10. What is an f-string?
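The deep copy vs shallow copy question (number 5 above) is best answered with a two-line demo. A minimal sketch with invented data: a shallow copy creates a new outer object but still shares the nested objects, while a deep copy clones everything recursively.

```python
import copy

original = {"tags": ["etl", "sql"]}

shallow = copy.copy(original)      # new outer dict, but the inner list is shared
deep = copy.deepcopy(original)     # fully independent clone, nested objects included

# Mutating the nested list through the original...
original["tags"].append("python")
# ...is visible through the shallow copy, but not through the deep copy.
```

The same distinction applies to `list(x)` and slicing, which are shallow copies too.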

Pyspark

  1. What is an RDD?

  2. What is Coalesce?

  3. Why do we say PySpark is lazy?

  4. What does doing a group by before a partition do?

  5. Difference between transformations and actions in Pyspark

  6. What is a broadcast join?

  7. What is a Partition in Pyspark? https://youtu.be/hvF7tY2-L3U?si=vsGd7-W4ZYkLn6gr

  8. What is Repartitioning?

  9. What is Shuffling? https://youtu.be/ffHboqNoW_A?si=n8IIuX1zgZAa3zjS

  10. Have you written code that improved performance in PySpark?

  11. Was there a time you had to tune your Spark cluster because jobs were taking too long to run? Why did they take so long?

  12. Difference between Spark and Hadoop?

  13. What is the DRY acronym?

  14. Types of clusters in Databricks

  15. What is the difference between PySpark and the Pandas library in Python? Will Pandas parallelize across the worker nodes if you use it in a Databricks notebook?

  16. What are the driver and worker nodes, and what are their functions?

  17. What happens when your code cannot run because of a driver or executor out-of-memory error?

  18. How do you optimize your code so that these kinds of errors do not happen? https://wetrustindata.com/pyspark_optimizations/
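For the "why is PySpark lazy" question: Spark records transformations (filter, select, map) into a plan and only executes the plan when an action (count, collect, write) is called. A rough analogy in plain Python, using a generator pipeline instead of actual PySpark (the `executed` list is my own instrumentation so you can see when work happens):

```python
# Rough analogy to Spark's lazy evaluation, in plain Python (not PySpark):
# "transformations" build a plan; nothing runs until an "action" consumes it.
executed = []

def numbers():
    for n in range(5):
        executed.append(n)   # side effect so we can observe when work happens
        yield n

# "Transformations": building the pipeline executes nothing yet.
plan = (n * 2 for n in numbers() if n % 2 == 0)
before_action = list(executed)   # still empty at this point

# "Action": consuming the pipeline triggers the whole chain at once.
result = list(plan)
```

The payoff in Spark is the same as here: because the whole chain is known before anything runs, the engine can optimize it (predicate pushdown, pipelining) instead of materializing every intermediate step.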

Git

  1. Have you used Git in your project?
  2. Have you used branching in Git? Why is it important when working with other people?
  3. Have you resolved any merge conflicts, and how?
  4. How do you unstage staged files in Git?
  5. What is git reset, and when did you have to use it?
  • CI/CD
  1. Tell me CI/CD tools that you used.
  2. What is GitLab? Jenkins? Azure DevOps?
  3. What is build and release in the pipelines?

Snowflake

  1. Types of tables in Snowflake (Temporary, Permanent, Transient)? In which cases do you use each? Which one would you use for testing purposes and why?
  2. What is Snowpipe?
  3. What are streams in Snowflake?
  4. How do you load data into Snowflake from an S3 bucket or Azure Blob Storage, and vice versa?
  5. How do you optimize the queries? Have you used the query optimizer area in Snowflake?
  6. What is a clustered index? Have you used them in order to optimize your solutions?
  7. Which file formats are available in Snowflake?
  8. Have you used the JSON format? How did you parse it into a normal structured table?
  9. What tools have you used for security and governance in Snowflake?

Cloud

  1. What are the steps to create a pipeline in ADF? - Check Youtube Video if needed
    https://youtu.be/dFEzT-qfVIk?si=I9C8vc51hntthBPE
  2. Tell me a situation where your pipeline failed and what did you do
  3. Tell me the type of integration runtimes and which ones have you used.
  4. Have you run SSIS jobs in Azure?
  5. Was there a time that you processed a file in ADF from BLOB and the preview button shows no data? What did you do? 
  6. List of data flows that you know.
  7. How are you checking the data quality of data and files? How do you test in ADF?
  8. What is the Lookup Activity and Data Flow doing? 
  9. What is the benefit of using a Pipeline block activity within a Pipeline?
  10. How do you configure a Linked Service?
  11. How do you configure authentication in a Linked Service? How do you store passwords and confidential data?
  12. Azure Key Vault - secrets.
  13. What types of files have you handled in ADF? JSON, CSV, from API?
  14. How do you transform JSON files to CSV with a simple Copy Data Activity? - Check Youtube Video if needed
    https://youtu.be/iJsLU0fduFk?si=9DrN9M50USuwru5I

Soft Skills

  1. Are you aware of the Agile/Scrum ceremonies? What is your role there?
  2. What role would you like to have: a developer role or a management one?
  3. Do you consider yourself a Junior, Mid or Senior Data Engineer? Why?
  4. Tell me your biggest achievement in a development process. How did you help the company with your code/job?
  5. When things go wrong, how do you act?
  6. Tell me how you convince someone of your idea when they hold a completely opposite view.
  7. What do you value most in your working environment?
  8. Do you have client management skills? Tell me about situations where you had to explain a developer solution to someone from the business with no dev skills.
  9. Do you have experience in team leading or giving demos to your colleagues about the skills you mastered?
  10. As a software engineer, what are the main risks that we are facing? 
  11. How do you make your estimations, and how do you know how much time to assign?

Testing

  1. How do you perform data validation in your project (data tests)? Why is it important to do this?
  2. Do you run these tests manually, or is the data tested automatically using some tool?
  • For SQL
  1. Checking Duplicate Records
  2. Checking Nulls in the Columns
  3. Checking if the selected group of columns are unique between all of the records
  4. Checking that the row counts of two tables match, where you need to ensure this constraint.
  5. Check that the data types are the correct ones
  6. Check that the SUM aggregation of one column matches the SUM of the corresponding column in another table.
  7. Check the data once you load it, and check what happens when you load it a second time (this depends on whether the load is full or incremental). Do any duplicates appear?
  8. Is the query updating the records correctly according to the CDC logic that you implemented?
  9. Use the MINUS operator between the table in the PROD environment and the recently developed table in the DEV environment.

HINT: the dbt integration tool has pre-made packages and tests of this kind, so you do not have to write them every single time
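The first two checks in the list above (duplicates and nulls) reduce to two short queries. A sketch using Python's sqlite3 as a stand-in for your warehouse (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT);
    INSERT INTO customers VALUES (1, 'a@x.com'), (2, NULL), (2, 'b@x.com');
""")

# Duplicate check: keys that appear more than once.
dupes = conn.execute("""
    SELECT id, COUNT(*) FROM customers
    GROUP BY id HAVING COUNT(*) > 1
""").fetchall()

# Null check: rows with a missing value in a mandatory column.
null_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL"
).fetchone()[0]
```

A passing test simply asserts that both results are empty/zero; in dbt these exact checks ship as the built-in `unique` and `not_null` tests.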

  • For Python:
  1. Use of Exceptions
  2. Try/except blocks to catch errors
  3. Using unit tests with unittest or pytest library
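To tie the three Python points together, here is a minimal sketch with unittest from the standard library (`safe_divide` is an invented example function): the function raises an exception for bad input, and the test case covers both the happy path and the error path.

```python
import unittest

def safe_divide(a, b):
    """Raise a clear error instead of letting ZeroDivisionError escape."""
    if b == 0:
        raise ValueError("b must be non-zero")
    return a / b

class TestSafeDivide(unittest.TestCase):
    def test_divides(self):
        self.assertEqual(safe_divide(10, 2), 5)

    def test_rejects_zero(self):
        # Verify the exception path, not just the happy path.
        with self.assertRaises(ValueError):
            safe_divide(1, 0)

# Run the tests programmatically; in a real project you would run
# `python -m unittest` or `pytest` from the command line instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSafeDivide)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With pytest, the same tests are plain functions using bare `assert` and `pytest.raises(ValueError)`.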