VOOZH about

URL: https://dzone.com/articles/setting-up-dbt-and-snowpark-for-machine-learning-pipelines

⇱ Setting Up DBT and Snowpark for Machine Learning Pipelines


Related

  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Setting Up DBT and Snowpark for Machine Learning Pipelines

Setting Up DBT and Snowpark for Machine Learning Pipelines

This guide describes how to install and set up DBT and Snowpark for machine-learning model pipelines to streamline data processing and feature engineering tasks.

Likes
Comment
Save
13.5K Views

Join the DZone community and get the full member experience.

Join For Free

AI/ML workflows excel on structured, reliable data pipelines. To streamline these processes, DBT and Snowpark offer complementary capabilities: DBT is for modular SQL transformations, and Snowpark is for programmatic Python-driven feature engineering.

Here are some key benefits of using DBT, Snowpark, and Snowflake together:

  1. Simplifies SQL-based ETL with DBT’s modularity and tests.
  2. Handles complex computations with Snowpark’s Python UDFs.
  3. Leverages Snowflake’s high-performance engine for large-scale data processing.

Here’s a step-by-step guide to installing, configuring, and integrating DBT and Snowpark into your workflows.

Step 1: Install DBT

In Shell, you can use Python’s pip command for installing packages. Assuming Python is already installed and added to your PATH, follow these steps: 

Shell
# Set up a Python virtual environment (recommended):
python3 -m venv dbt_env
source dbt_env/bin/activate

# Install DBT and the Snowflake adapter:
pip install dbt-snowflake

# Verify DBT installation
dbt --version


Step 2: Install Snowpark

Shell
# Install Snowpark for Python
pip install snowflake-snowpark-python

# Install additional libraries for data manipulation
pip install pandas numpy

# Verify Snowpark installation
python -c "from snowflake.snowpark import Session; print('successful Snowpark installation')"


Step 3: Configuring DBT for Snowflake

DBT requires a profiles.yml file to define connection settings for Snowflake.

Locate the DBT Profiles Directory

By default, DBT expects the profiles.yml file in the ~/.dbt/ directory. Create the directory if it doesn’t exist:

Shell
mkdir -p ~/.dbt


Create the profiles.yml File

Define your Snowflake credentials in the following format:

YAML
my_project:
 outputs:
 dev:
 type: snowflake
 account: your_account_identifier
 user: your_username
 password: your_password
 role: your_role
 database: your_database
 warehouse: your_warehouse
 schema: your_schema
  target: dev


Replace placeholders like your_account_identifier with your Snowflake account details.

Test the Connection

Run the following command to validate your configuration:

Shell
dbt debug


If the setup is correct, you’ll see a success message confirming the connection.

Step 4: Setting Up Snowpark

Ensure Snowflake Permissions

Before using Snowpark, ensure your Snowflake user has the following permissions:

  • Access to the warehouse and schema.
  • Ability to create and register UDFs (User-Defined Functions).

Create a Snowpark Session

Set up a Snowpark session using the same credentials from profiles.yml:

Python
from snowflake.snowpark import Session

def create_session():
 connection_params = {
 "account": "your_account_identifier",
 "user": "your_username",
 "password": "your_password",
 "role": "your_role",
 "database": "your_database",
 "warehouse": "your_warehouse",
 "schema": "your_schema",
 }
 return Session.builder.configs(connection_params).create()

session = create_session()
print("Snowpark session created successfully")


Register a Sample UDF

Here’s an example of registering a simple Snowpark UDF for text processing:

Python
def clean_text(input_text):
 return input_text.strip().lower()

session.udf.register(
 func=clean_text,
 name="clean_text_udf",
 input_types=["string"],
 return_type="string",
 is_permanent=True
)
print("UDF registered successfully")


Step 5: Integrating DBT With Snowpark

You have a DBT model named raw_table that contains raw data.

raw_table DBT Model Definition 

SQL
-- models/raw_table.sql

SELECT * 
FROM my_database.my_schema.source_table


Use Snowpark UDFs in DBT Models

Once you’ve registered a UDF in Snowflake using Snowpark, you can call it directly from your DBT models.

SQL
-- models/processed_data.sql

WITH raw_data AS (
 SELECT id, text_column
 FROM {{ ref('raw_table') }}
),
cleaned_data AS (
 SELECT
 id,
 clean_text_udf(text_column) AS cleaned_text
 FROM raw_data
)
SELECT * FROM cleaned_data;


Run DBT Models

Execute your DBT models to apply the transformation:

Shell
dbt run --select processed_data


Step 6: Advanced AI/ML Use Case

For AI/ML workflows, Snowpark can handle tasks like feature engineering directly in Snowflake. Here’s an example of calculating text embeddings:

Create an Embedding UDF

Using Python and a pre-trained model, you can generate text embeddings:

Python
from transformers import pipeline

def generate_embeddings(text):
 model = pipeline("feature-extraction", model="bert-base-uncased")
 return model(text)[0]

session.udf.register(
 func=generate_embeddings,
 name="generate_embeddings_udf",
 input_types=["string"],
 return_type="array",
 is_permanent=True
)


Integrate UDF in DBT

Call the embedding UDF in a DBT model to create features for ML:

SQL
-- models/embedding_data.sql

WITH raw_text AS (
 SELECT id, text_column
 FROM {{ ref('raw_table') }}
),
embedded_text AS (
 SELECT
 id,
 generate_embeddings_udf(text_column) AS embeddings
 FROM raw_text
)
SELECT * FROM embedded_text;


Best Practices

  • Use DBT for reusable transformations: Break down complex SQL logic into reusable models.
  • Optimize Snowpark UDFs: Write lightweight, efficient UDFs to minimize resource usage.
  • Test Your Data: Leverage DBT’s testing framework for data quality.
  • Version Control Everything: Track changes in DBT models and Snowpark scripts for traceability.

Conclusion

By combining DBT’s SQL-based data transformations with Snowpark’s advanced programming capabilities, you can build AI/ML pipelines that are both scalable and efficient. This setup allows teams to collaborate effectively while leveraging Snowflake’s computational power to process large datasets.

Whether you’re cleaning data, engineering features, or preparing datasets for ML models, the DBT-Snowpark integration provides a seamless workflow to unlock your data’s full potential.

AI Data processing Pipeline (software)

Opinions expressed by DZone contributors are their own.

Related

  • What Nobody Tells You About Multimodal Data Pipelines for AI Training
  • Beyond Big Data: Designing Agentic Data Pipelines for AI Workloads
  • AI-Driven DevOps for SaaS: From Reactive to Predictive Pipelines
  • Designing AI-Assisted Integration Pipelines for Enterprise SaaS

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

Let's be friends: