In this article, I would like show you how to migrate a database platform from Azure SQL DB to Databricks Delta. This use case would be interesting for many teams working on migration, platform evaluation, PoC, etc. Especially, in foreseen / unforeseen data growth situations in terms of Volume (size), Variety (semi structured / un-structured) and Velocity (scalability) it is required to evaluate and choose best platform for migration. In many modern data warehouse solution architects, Microsoft recommends Azure Data Lake Store Gen 2 with Azure Databricks. This article provides step by step procedure to migrate database from Azure SQL DB to Azure Databricks Delta.
I would like to limit the scope of this article to focus only on migrating SQL Tables to Delta. Migration of Views, Functions, Procedures, Synonyms, user accounts, roles, etc are not covered. An interesting and complex topic of performance comparison between SQL DB and Delta is not covered in this article.
1. Provision Azure SQL DB
As a first setup lets setup an Adventureworks database on Azure SQL DB. Find Powershell scripts to deploy SQL Server and Adventureworks database on Azure I used Microsoft docs as a reference, further improved to a) Included if condition, to avoid errors in case of rerun b) Added a small script to include client ip range into Azure SQL Server firewall and c) A small query on SQL database to list no. of tables available on Azure SQL DB.
I execute the scripts multiple times for testing, On a average, it took 200 seconds to spin up an Azure SQL Server and create Adventureworks database. Once the Azure SQL DB is spin up, you can perform a connection test from SQL Server Management Studio by querying few records.
2. Setup Azure Data Lake Gen2, Key Vault, Service Principle Account and Access to ADLSG2
Refer powershell scripts to
- Create storage account on ADLS Gen 2 (with hierarchical name space enabled) and create a container to hold delta table files
- Create a service principle and grant Storage Blob Data Contributor to storage account
- Create Azure Key Vault and securely store the service principle application id, secret and Azure SQL DB password.
There are various secured ways to connect storage account from Azure Databricks. I really liked and read this article several times, to understand different types of connection that can be made between ADLSG2 and Azure Databricks. Thanks to Gerard Wolfaardt for writing such excellent article. I choose the best way of connecting via a service principle. I struggled little bit to create service principle using powershell Az commands. This post helped to understand right scripts to create service principle.
3. Provision Azure Databricks Workspace and mount ADLSG2 container
3.1 Spin up Azure Databricks workspace
If you don’t have an Azure databricks workspace, then it can be from https://portal.azure.com/#create/Microsoft.Databricks. Only five parameters to configure such as Subscription, resource group, Workspace name, Location and Pricing Tier (For PoC purpose use 14 days Trial period). For more details on pricing tier follow the link here. Further there is a tab to manage networking to select Vnet for deployment. Further a tab for Tag management, finally review and create a workspace.
3.2 Create secret scope on Azure Databricks to connect Azure Key Vault
Creating a secret scope is basically creating a connection from Azure Databricks to Azure Key vault. Follow this link to read more details on secret scope. To quickly perform actions follow below instructions,
- From you Azure Key Vault copy value of two properties DNS Name and Resource ID. A sample image is shown here
- Navigate to your databricks account url https://<location>.azuredatabricks.net/?o=<your organization id>#secrets/createScope
- Choose All users if you are in standard pricing tier and enter a scope name. Paste DNS Name and Resource ID into corresponding text box and save. A sample image is shown here
4. Azure SQL DB to Azure Databricks Delta Migration
Finally, we are here to execute databricks scripts for migration. Launch Azure Databricks, and from workspace tab, right click and select import menu. On Import wizard, select Import from URL and copy paste the below URL https://github.com/VijaybabuNakkonda/AzureSQL_to_Databricks_Delta/blob/master/notebooks/sql_to_delta.scala
Notebook contains, 15 cell commands. Let me summarize the steps here,
- Connect and Mount ADLS Gen2 Storage account on Azure Databricks using scoped credentials via Azure Key Vault
- Create a connection to Azure SQL by using scoped credentials via Azure Key Vault
- Select and query Information Schema from Azure SQL, to get base table information such as schema and table names
- For each Schema available from SQL create the same on Databricks by executing SQL execute Create schema <schema_name>
- For each Table exist on SQL, create spark dataframe. Read data from SQL tables and assign them to dataframes
- Now, table data is available on spark dataframe. For each dataframe, write data to ADLS Gen2 location using delta format
- Now, for each location from ADLS Gen2 which has been written in the previous step, Create databricks table by referring the same storage location from ADLS Gen2.
Migration is completed: Query and Play with Tables 🙂
To drop the azure resources provisioned for this project use scripts from here. Delete Azure Databricks manually from portal.
Hope this information is useful for you..!! Let me know if you face any challenges..!!