Migrating AdventureWorks/DW from Azure SQL DB to Snowflake
For anyone who wants to compare cost and performance between Azure SQL DB and Snowflake, AdventureWorks and AdventureWorksDW are excellent reference databases for evaluating both platforms. Ideally, an organization would migrate its own database to Snowflake for a cost/performance PoC.
While Snowflake is not the best choice for OLTP applications, it is an excellent choice for analytics workloads. This article attempts to build a generic framework to migrate the AdventureWorks/DW databases. The same framework can be used to migrate any database from SQL Server into Snowflake using Azure Data Factory and Azure Blob Storage.
Data migration from a SQL database to Snowflake is not as simple as a backup and restore. Generally, there are two approaches to perform the migration:
- Set up Snowflake as a linked server within SQL Server (this is not possible directly from Azure SQL DB) and insert records from SQL Server into Snowflake. An ODBC linked-server setup guide is easy to find (a "snowflake linked server" search away on Google). This is not a performant way to load data into Snowflake.
- While searching the Snowflake forums (login required) for how to migrate large volumes of data, I found that COPY INTO <table> is the best way to load data from either AWS S3 or Azure Blob Storage.
This article explains the COPY INTO <table> from Blob Storage approach to migrate data from SQL Server into Snowflake. The migration design is a three-step approach.
Step 1: Using Azure Data Factory, connect to the SQL database and upload each table's data into Blob Storage. A generic CSV format with a pipe delimiter (|) and a text qualifier (") is used, with gzip compression to occupy less storage space. While the data could be exported into various formats such as Parquet, Avro, ORC, and JSON, I chose CSV + gzip after looking at Snowflake's load performance figures, which helped me understand and select the best file format for the migration. I created an ADF pipeline (source code checked into GitHub) to extract the data into Blob Storage; ADF pipeline versions are checked directly into GitHub for version management. Microsoft's bulk copy data tutorial helped me build this pipeline.
All other Blob Storage file-creation parameters, such as encoding, null treatment, and delimiter changes, can be modified as required in the ADF JSON configuration.
Step 2: Create the tables on Snowflake. This is a manual operation. To achieve it, we query information_schema on SQL Server to generate table scripts with Snowflake data types (scripts on GitHub), and also to identify primary- and foreign-key relations and apply them to the Snowflake objects. Get the required table-creation scripts from GitHub and execute them on Snowflake. There may be a few issues during table creation due to default constraints: the script copies default constraints from SQL Server, and these may not work on Snowflake, so pay attention during this exercise and check for errors. Snowflake's data types are documented on its site; there are slight deviations in type names between SQL Server and Snowflake, such as bit vs. boolean and datetimeoffset vs. timestamp_tz, so it is best to study and understand them from the Snowflake documentation. One important fact to know about Snowflake: it supports only case-sensitive collation.
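The script-generation step above can be sketched as a T-SQL query against the source database's INFORMATION_SCHEMA. This is only a minimal sketch of the idea (the GitHub scripts mentioned above are the full version): the type mappings shown are just the common ones, and a real script would also need to handle lengths, precision, nullability, and keys. STRING_AGG requires SQL Server 2017+ or Azure SQL DB.

```sql
-- Sketch: emit one Snowflake CREATE TABLE statement per source table,
-- mapping a few SQL Server types whose names differ in Snowflake.
SELECT 'CREATE TABLE ' + TABLE_SCHEMA + '.' + TABLE_NAME + ' (' +
       STRING_AGG(COLUMN_NAME + ' ' +
           CASE DATA_TYPE
               WHEN 'bit'              THEN 'boolean'
               WHEN 'datetimeoffset'   THEN 'timestamp_tz'
               WHEN 'datetime'         THEN 'timestamp_ntz'
               WHEN 'uniqueidentifier' THEN 'varchar(36)'
               WHEN 'nvarchar'         THEN 'varchar'
               ELSE DATA_TYPE
           END, ', ')
           WITHIN GROUP (ORDER BY ORDINAL_POSITION) + ');' AS create_script
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY TABLE_SCHEMA, TABLE_NAME;
```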
Step 3: Create an external stage in the Snowflake database pointing to the Azure Blob Storage container, and create a file format definition to read the files. Finally, load the data from Azure Blob Storage into the Snowflake tables.
// 1. Create an Azure stage on Snowflake
create or replace stage my_azure_stage
  url='azure://nklabstorageaccount.blob.core.windows.net/csvfiles'
  credentials=(azure_sas_token='?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-01-07T06:06:26Z&st=2019-01-05T22:06:26Z&spr=https&sig=JGclySQxmrAkw3YbCtD%2Bt4vbkys3B3nh8RScTOsCiGM%3D');

// 2. Create a file format definition
create or replace file format my_csv_format
  type = csv
  field_delimiter = '|'
  skip_header = 1
  null_if = ('NULL', 'null')
  empty_field_as_null = true
  compression = gzip
  field_optionally_enclosed_by = '"';

// 3. Copy data into Snowflake
copy into SALESLT.ADDRESS
  from @my_azure_stage
  pattern='SALESLT.ADDRESS.csv'
  file_format=my_csv_format
  force=true;

// 4. Generate copy commands for all tables created on Snowflake
select 'copy into ' || Table_Schema || '.' || Table_Name ||
       ' from @my_azure_stage pattern=\'' || Table_Schema || '.' || Table_Name ||
       '.csv\' file_format=my_csv_format force=true;' as Load_Script
from information_schema.tables;
There are multiple options for COPY INTO <table> within Snowflake. For example, Snowflake remembers which files it has loaded: if you try to reload the same file within a certain period, the load doesn't happen because that file was already loaded. To force the load, set the FORCE parameter to true. There are also options to skip files when issues occur during loading, to only validate the load without inserting, and many other Snowflake load options worth reading for a better understanding.
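For illustration, here is how two of those options look on the same ADDRESS load: VALIDATION_MODE checks the staged file without inserting anything, and ON_ERROR controls what happens when a record fails to parse.

```sql
-- Validate the staged file without actually loading any rows:
copy into SALESLT.ADDRESS
  from @my_azure_stage
  pattern='SALESLT.ADDRESS.csv'
  file_format=my_csv_format
  validation_mode = return_errors;

-- Skip an entire file if any record in it fails to parse:
copy into SALESLT.ADDRESS
  from @my_azure_stage
  pattern='SALESLT.ADDRESS.csv'
  file_format=my_csv_format
  on_error = skip_file;
```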
While loading multiple files, I could see metrics in the history window such as time taken, number of records loaded, and more.
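The same metrics can also be queried instead of read from the history window, using Snowflake's COPY_HISTORY table function (the table name below is just an example):

```sql
-- Per-file load metrics for the last hour: rows loaded/parsed, errors, timing.
select file_name, row_count, row_parsed, first_error_message, last_load_time
from table(information_schema.copy_history(
    table_name => 'SALESLT.ADDRESS',
    start_time => dateadd(hours, -1, current_timestamp())));
```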
Data Load Analysis
Using this approach, I was able to migrate most of the AdventureWorks and AdventureWorksDW tables from SQL Server to Snowflake.
- A few tables failed to load because of the binary data type. Snowflake does not recognize hexadecimal data starting with 0x, so the 0x prefix needs to be stripped during export.
- Uppercase GUID field values from SQL Server are converted to lowercase in Snowflake.
- No other major issues were observed in datetime or other fields within the limited time taken.
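One possible workaround for the binary issue above, assuming the export is driven by a query: convert binary columns to hex text on the SQL Server side using CONVERT with style 2, which omits the 0x prefix, so the value matches Snowflake's default HEX binary format. SalesLT.Product.ThumbNailPhoto is used here only as an example column.

```sql
-- Export the varbinary column as plain hex (style 2 drops the 0x prefix),
-- so Snowflake can load it into a BINARY column with the default HEX format.
SELECT ProductID,
       CONVERT(varchar(max), ThumbNailPhoto, 2) AS ThumbNailPhoto_hex
FROM SalesLT.Product;
```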
Let me know if you find any interesting results around this!