Replicate Databricks Data in Heroku for Use in Salesforce Connect

Stanley Liu
Associate Technical Product Marketer

Replicate Databricks data to a PostgreSQL database on Heroku and connect it to Salesforce using Salesforce Connect.

CData Sync is a standalone application that supports a wide range of replication scenarios, including replicating both sandbox and production instances into your database. By replicating Databricks data to a PostgreSQL database in Heroku, you can access Databricks external objects (via Salesforce Connect) alongside standard Salesforce objects.

About Databricks Data Integration

Accessing and integrating live data from Databricks has never been easier with CData. Customers rely on CData connectivity to:

Access all versions of Databricks from Runtime Versions 9.1 - 13.X to both the Pro and Classic Databricks SQL versions.
Leave Databricks in their preferred environment thanks to compatibility with any hosting solution.
Securely authenticate in a variety of ways, including personal access token, Azure Service Principal, and Azure AD.
Upload data to Databricks using Databricks File System, Azure Blob Storage, and AWS S3 Storage.

While many customers use CData's solutions to migrate data from different systems into their Databricks data lakehouse, several use our live connectivity solutions to federate connectivity between their databases and Databricks. These customers use SQL Server Linked Servers or PolyBase to get live access to Databricks from within their existing RDBMS.

Read more about common Databricks use-cases and how CData's solutions help solve data problems in our blog: What is Databricks Used For? 6 Use Cases.

Getting Started

Requirements

For this replication example, you need the following:

CData Sync (trial or licensed), along with a license (full or trial) for Databricks replication.
A Heroku app with the Heroku Postgres and Heroku Connect add-ons provisioned.
A Salesforce account.

Configure the Replication Destination

Using CData Sync, you can replicate Databricks data to a PostgreSQL database on Heroku. This article assumes you already have a PostgreSQL database on Heroku. To add that database as a replication destination, navigate to the Connections tab.

Click Add Connection.
Select the Destinations tab and locate the PostgreSQL connector.
Click the Configure Connection icon at the end of that row to open the New Connection page. If the Configure Connection icon is not available, click the Download Connector icon to install the PostgreSQL connector. For more information about installing new connectors, see Connections in the Help documentation.
To connect to PostgreSQL, set the following connection properties:
- Connection Name: Enter a connection name of your choice for the PostgreSQL connection.
- Server: Enter the host name or IP of the server that hosts the PostgreSQL database. The default server value is localhost.
- Auth Scheme: Select the authentication scheme. The default auth scheme is Password.
- Port: Enter the port number of the server that hosts the PostgreSQL database. The default port value is 5432.
- User: Enter the user ID provided for authentication with the PostgreSQL database.
- Password: Enter the password provided for authentication with the PostgreSQL database.
- Database: Enter the name of the database. If not specified, the default database is used.

After entering the connection properties, click Create & Test to create, test, and save the connection.

You are now connected to PostgreSQL and can use it as both a source and a destination.
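
Heroku Postgres normally exposes its credentials as a single URL in the app's DATABASE_URL config var (retrievable with heroku config:get DATABASE_URL), so the properties above can be read from that value instead of being copied field by field. The following is a minimal sketch of that split; the function name and the example URL are hypothetical and are not part of CData Sync.

```python
from urllib.parse import urlparse, unquote


def heroku_url_to_properties(database_url: str) -> dict:
    """Split a postgres:// URL into the connection properties listed above."""
    parts = urlparse(database_url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "Server": parts.hostname,
        # Fall back to the PostgreSQL default port when the URL omits one.
        "Port": parts.port or 5432,
        # Credentials in a URL may be percent-encoded, so decode them.
        "User": unquote(parts.username or ""),
        "Password": unquote(parts.password or ""),
        "Database": parts.path.lstrip("/"),
    }


# Hypothetical URL for illustration only; use your app's own DATABASE_URL.
example_url = "postgres://appuser:s3cr%40t@db.example.com:5432/d1abc"
properties = heroku_url_to_properties(example_url)
for name, value in properties.items():
    print(f"{name}: {value}")
```

Heroku may rotate these credentials, so it is worth re-reading the config var if a saved connection stops working.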

NOTE: You can use the Label feature to add a label for a source or a destination.

Configure the Databricks Connection

To add a connection to your Databricks account, navigate to the Connections tab.

Click Add Connection.
Select a source (Databricks).
Configure the connection properties.

To connect to a Databricks cluster, set the properties as described below.

Note: You can find the required values in your Databricks instance by navigating to Clusters, selecting the desired cluster, and opening the JDBC/ODBC tab under Advanced Options.
- Server: Set to the Server Hostname of your Databricks cluster.
- HTTPPath: Set to the HTTP Path of your Databricks cluster.
- Token: Set to your personal access token (this value can be obtained by navigating to the User Settings page of your Databricks instance and selecting the Access Tokens tab).
Click Connect to Databricks to ensure that the connection is configured properly.
Click Save & Test to save the changes.
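
Values copied from the JDBC/ODBC tab often carry stray whitespace or a leading https://, which can cause the connection test to fail. The sketch below tidies the three values before you paste them into the form. It is an illustrative helper only: the function name and sample values are hypothetical, and it does not contact Databricks or CData Sync.

```python
def databricks_properties(server: str, http_path: str, token: str) -> dict:
    """Normalize the Server, HTTPPath, and Token values described above."""
    server = server.strip()
    # The Server field expects a bare host name, so drop any URL scheme.
    for prefix in ("https://", "http://"):
        if server.lower().startswith(prefix):
            server = server[len(prefix):]
    server = server.rstrip("/")

    cleaned = {
        "Server": server,
        "HTTPPath": http_path.strip(),
        "Token": token.strip(),
    }
    missing = [name for name, value in cleaned.items() if not value]
    if missing:
        raise ValueError("missing required value(s): " + ", ".join(missing))
    return cleaned


# Hypothetical sample values for illustration only.
sample = databricks_properties(
    " https://adb-123.example.net/ ",
    "sql/protocolv1/o/0/0000-000000-example",
    " my-personal-access-token ",
)
print(sample["Server"])
```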

Configure Queries for Each Databricks Instance

CData Sync enables you to control replication with a point-and-click interface and with SQL queries. For each replication you wish to configure, navigate to the Jobs tab and click Add Job. Select the Source and Destination for your replication.

Replicate Entire Tables

To replicate an entire table, navigate to the Task tab in the Job, click Add Tasks, choose the table(s) from the list of Databricks tables you wish to replicate into PostgreSQL, and click Add Tasks again.

Customize Your Replication

You can use the Columns and Query tabs of a task to customize your replication. The Columns tab allows you to specify which columns to replicate, rename the columns at the destination, and even perform operations on the source data before replicating. The Query tab allows you to add filters, grouping, and sorting to the replication with the help of SQL queries.

As you make changes using the interface, the SQL query used for the replication changes, going from something simple, like this:

REPLICATE [Customers]

to something customized and more complex, like this:

REPLICATE [Customers] SELECT [City], [CompanyName] FROM [Customers] WHERE [Country] = 'US'
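
If you maintain many similar tasks, it can help to generate these statements rather than type them by hand. The sketch below assembles a statement in the same shape as the examples above, with bracketed identifiers and single-quoted string literals. The helper names are hypothetical, and two details are assumptions rather than documented CData Sync behavior: that SELECT * is accepted when no columns are listed, and that doubling a single quote escapes it inside a literal.

```python
def bracket(identifier: str) -> str:
    """Wrap a table or column name in square brackets."""
    if "[" in identifier or "]" in identifier:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    return f"[{identifier}]"


def quote_literal(value: str) -> str:
    """Single-quote a string literal (assumes '' escapes a quote)."""
    return "'" + value.replace("'", "''") + "'"


def build_replicate(table, columns=None, equals=None) -> str:
    """Build a REPLICATE statement with optional columns and equality filters."""
    statement = f"REPLICATE {bracket(table)}"
    if not columns and not equals:
        return statement  # whole-table replication
    # Assumption: '*' selects every column when none are listed.
    select_list = ", ".join(bracket(c) for c in columns) if columns else "*"
    statement += f" SELECT {select_list} FROM {bracket(table)}"
    if equals:
        conditions = " AND ".join(
            f"{bracket(column)} = {quote_literal(value)}"
            for column, value in equals.items()
        )
        statement += f" WHERE {conditions}"
    return statement


simple = build_replicate("Customers")
custom = build_replicate("Customers", ["City", "CompanyName"], {"Country": "US"})
print(simple)
print(custom)
```

Check the generated text in the Query tab before saving the task, since the interface remains the authority on what the job will run.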

Schedule Your Replication

Select the Overview tab in the Job, and click Configure under Schedule. You can schedule a job to run automatically by configuring it to run at specified intervals, ranging from once every 10 minutes to once every month.

Once you have configured the replication job, click Save Changes. You can configure any number of jobs to manage the replication of your Databricks data to PostgreSQL.

Run the Replication Job

Once all the required configurations are made for the job, select the Databricks table you wish to replicate and click Run. After the replication completes successfully, a notification appears, showing the time taken to run the job and the number of rows replicated.

The Databricks tables are now replicated in the Heroku PostgreSQL database.

Connect to Your Replicated Databricks Data as an External Data Source

Once your Databricks data is replicated to the PostgreSQL database on Heroku, configure the OData interface for Heroku and connect to the database as an external data source via Salesforce Connect.

Configure the OData Service for Heroku

The first part of connecting to Databricks data replicated to a PostgreSQL database on Heroku is configuring the Heroku External Objects for the database.

In your Heroku dashboard, click the Heroku Connect Add-On.
Select External Objects. (If this is your first time using Heroku External Objects, you will be prompted to create the login credentials for the OData service.)
View the OData service URL and credentials (noting the URL and credentials to be used later from Salesforce Connect).
In Data Sources, select which replicated tables to share.

Refer to the Heroku documentation for more detailed instructions.
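
Before moving on to Salesforce, you may want to confirm that the OData service URL and credentials you noted actually respond. The sketch below builds a request for the service's standard OData $metadata document using only the Python standard library. It assumes the service accepts HTTP Basic authentication with the External Objects credentials; the URL, user name, and password shown are placeholders, and the function name is hypothetical.

```python
import base64
import urllib.request


def build_metadata_request(service_url: str, username: str, password: str):
    """Prepare a GET request for the OData $metadata document."""
    url = service_url.rstrip("/") + "/$metadata"
    # HTTP Basic auth: base64 of "username:password" (an assumption here).
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Placeholder values; substitute the URL and credentials from Heroku Connect.
request = build_metadata_request("https://odata.example.com/odata/v4/", "user", "pass")
print(request.full_url)

# To send the request, uncomment the lines below:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.status)
```

A successful response suggests that the same URL and credentials should work when entered in the external data source settings.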

Configure an External Data Source in Salesforce

After the OData service for Heroku is configured, we can connect to the replicated Databricks data as an external data source from Salesforce Connect.

In Salesforce, click Setup.
In the Administration section, click Data -> External Data Sources.
Set the data source parameter properties:
- External Data Source: the name you wish to display in the Salesforce user interface
- Name: a unique identifier for the API
- Type: Salesforce Connect: OData 4.0
- URL: Enter the OData endpoint from Heroku Connect (above)
- Format: JSON
Set Authentication:
- Identity Type: Named Principal
- Authentication Protocol: Password Authentication
- Username: the Heroku Connect username
- Password: the Heroku Connect password
Click Save.

Synchronize Databricks Objects

After you have created the external data source in Salesforce, follow the steps below to create Databricks external objects that reflect any changes in the data source. You will synchronize the definitions for the Databricks external objects with the definitions for Databricks tables.

Click the link for the external data source you created.
Click Validate and Sync.
Select the Databricks tables you want to work with as external objects and click Sync.

Access Databricks Data as Salesforce Objects

At this point, you will be able to connect to and work with your replicated Databricks entities as external objects just as you would with standard Salesforce objects, whether you are simply viewing the data or building related lists of external Databricks data alongside standard Salesforce objects.

Download a 30-day free trial of CData Sync and replicate your Databricks data for use with Salesforce Connect today!

URL: https://www.cdata.com/kb/tech/databricks-sync-heroku-salesforce-connect.rst