Get started with Azure Data Factory
Creating a resource group in Azure:
Select "Create" or "Create Resource" and search for "resource group".
Specify a name for the resource group (e.g., "project-ip").
Select a location (e.g., "UK West").
Optionally, add tags for the resource group.
Review the overview and create the resource group.
Create an Azure Data Lake Gen2 storage account:
Click on "Create Resource" and search for "storage account".
Select "Create Storage Account".
Specify the following details:
- Resource group: Choose the resource group you created in step 1.
- Name: Provide a unique name for the storage account.
- Location: Select the same location as the resource group.
- Performance: Choose "Standard" (or "Premium" for production environments).
- Redundancy: Select "Locally-redundant storage (LRS)" for basic configuration.
- Account kind: Select "StorageV2 (general purpose v2)".
- Advanced: Enable "Hierarchical namespace" to create a Data Lake Gen2.
- Networking: Enable public access (or configure as needed).
- Data protection: Leave encryption settings as default (Azure-managed key).
Review the overview and create the storage account.
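For reference, the storage account configured above corresponds roughly to the following ARM template resource. This is a sketch, not the exact template Azure generates; the account name is a placeholder, and "isHnsEnabled" is the hierarchical namespace setting that makes it a Data Lake Gen2 account.
{
  "type": "Microsoft.Storage/storageAccounts",
  "apiVersion": "2023-01-01",
  "name": "<your-storage-account-name>",
  "location": "ukwest",
  "sku": { "name": "Standard_LRS" },
  "kind": "StorageV2",
  "properties": { "isHnsEnabled": true }
}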
Create containers within the storage account:
Go to the resource group and open the storage account resource you created.
Navigate to the "Containers" section.
Create the following containers:
- "raw" to store raw data from GitHub.
- "processed" to store processed data.
- "final" to store final, ready-to-use data.
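If you ever script this step instead of using the portal, each container is a child resource of the storage account's blob service. A minimal sketch for the "raw" container (repeat for "processed" and "final"; the account name is a placeholder):
{
  "type": "Microsoft.Storage/storageAccounts/blobServices/containers",
  "apiVersion": "2023-01-01",
  "name": "<your-storage-account-name>/default/raw",
  "properties": { "publicAccess": "None" }
}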
Creating an Azure Data Factory account
1. Create an Azure Data Factory:
Go to the Azure portal and select "Create Resource".
Navigate to "Integration" and choose "Data Factory".
Select "Create Data Factory".
Provide the following details:
- Subscription: Choose your subscription.
- Resource group: Select the existing resource group (e.g., "project-ip").
- Name: Provide a unique name for the data factory (e.g., "adf-project-ip-5658").
- Location: Select the same location as the resource group.
- Version: Choose "V2".
Leave "Git configuration" for later. (If you are interested to know what it does, follow the one of these tutorials after you have a azure data factory resource set up. 1. Source control - Azure Data Factory | Microsoft Learn or 2. How to Use Git w/ Azure Data Factory - YouTube )
Review the settings and create the data factory.
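For reference, the data factory itself is a small ARM resource; a sketch using the example name above (a system-assigned managed identity is created for the factory by default):
{
  "type": "Microsoft.DataFactory/factories",
  "apiVersion": "2018-06-01",
  "name": "adf-project-ip-5658",
  "location": "ukwest",
  "identity": { "type": "SystemAssigned" }
}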
2. Launch Data Factory Studio:
Once the deployment is complete, go to the data factory resource in the Azure portal.
Click on "Launch Studio" to open Data Factory Studio in a new tab.
3. Explore Data Factory Studio:
Close any initial warning messages.
Familiarize yourself with the following tabs:
- Home: Provides an overview and quick actions.
- Author: Used for creating and managing pipelines, datasets, data flows, etc.
- Monitor: Allows monitoring of pipeline and trigger executions.
- Manage: Used for managing linked services, Git configuration, triggers, and data factory settings.
- Learning Center: Offers tutorials and learning resources.
- Other: Provides ready-made templates for basic transformations.
The upcoming videos will delve into using Data Factory to ingest data from GitHub into the data lake.
Creating a Linked Service
1. Create a linked service to GitHub:
Go to the "Manage" tab in Azure Data Factory Studio.
Select "Linked Services" and click "New".
Choose "HTTP" as the linked service type.
Provide a name (e.g., "LinkedService_HTTP_GitHub").
Select "AutoResolveIntegrationRuntime".
Enter the base URL of the GitHub repository (e.g., "https://github.com/...").
Choose "Anonymous" authentication for this example.
Test the connection and create the linked service.
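Once published, the linked service's JSON (visible through its code view) should look roughly like the sketch below; the base URL is the placeholder from the step above, and Anonymous authentication is assumed:
{
  "name": "LinkedService_HTTP_GitHub",
  "properties": {
    "type": "HttpServer",
    "typeProperties": {
      "url": "https://github.com/...",
      "authenticationType": "Anonymous",
      "enableServerCertificateValidation": true
    }
  }
}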
2. Create a linked service to Azure Data Lake Gen2:
Go to "Linked Services" in the "Manage" tab and click "New".
Search for "Azure Data Lake Gen2" and select it.
Provide a name (e.g., "LinkedService_AzureDataLakeGen2").
Select "AutoResolveIntegrationRuntime".
Choose "Account Key" as the authentication method.
Select the subscription and the Azure Data Lake Gen2 storage account you created earlier.
Test the connection and create the linked service.
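Its JSON counterpart is sketched below; ADLS Gen2 appears as the "AzureBlobFS" type, the storage account name is a placeholder, and the account key is stored as a secure string:
{
  "name": "LinkedService_AzureDataLakeGen2",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<your-storage-account-name>.dfs.core.windows.net",
      "accountKey": {
        "type": "SecureString",
        "value": "<storage-account-key>"
      }
    }
  }
}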
Creating Datasets in Azure Data Factory
1. Create a dataset for the source data:
Go to the "Author" tab in Azure Data Factory Studio.
Select "Datasets" and click "New dataset".
Choose "HTTP" as the data source type.
Select "DelimitedText" as the format.
Provide a name (e.g., "Dataset_IPL_2008_CSV").
Choose the linked service to GitHub that you created earlier.
Enter the relative URL of the CSV file (e.g., "IPL_2008.csv").
Check "First row as header" if applicable.
Set schema as "None" for now.
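The resulting dataset JSON should look roughly like this sketch, using the names from the steps above:
{
  "name": "Dataset_IPL_2008_CSV",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LinkedService_HTTP_GitHub",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": "IPL_2008.csv"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}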
2. Create a dataset for the destination:
Select "New dataset" in the "Datasets" section.
Choose "Azure Data Lake Gen2" as the data source type.
Select "JSON" as the format.
Provide a name (e.g., "Dataset_IPL_2008_JSON").
Choose the linked service to Azure Data Lake Gen2 that you created earlier.
Enter the path to the destination file in the "raw" container (e.g., "/raw/IPL_2008.json").
Select "Import schema" as "None".
3. Publish the changes:
Click on "Publish All" to save the new datasets and any unsaved linked services.
This is how the JSON dataset looks once you publish all the changes; a sketch of its code view is shown below.
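A sketch of the published JSON dataset, assuming the names and the "raw" container used above (the actual code view may include extra defaults):
{
  "name": "Dataset_IPL_2008_JSON",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LinkedService_AzureDataLakeGen2",
      "type": "LinkedServiceReference"
    },
    "type": "Json",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "raw",
        "fileName": "IPL_2008.json"
      }
    }
  }
}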
Summary:
Two datasets have been created: one for the source CSV file on GitHub and another for the destination JSON file in Azure Data Lake Gen2.
These datasets will be used in the next section to create a pipeline for copying data from source to destination.
Creating an Azure Data Factory Pipeline
1. Rename datasets for clarity:
Rename the source dataset to "Dataset_Source_IPL_2008".
Rename the sink dataset to "Dataset_Sink_IPL_2008_JSON".
Publish the changes.
2. Create a pipeline:
Go to the "Author" tab in Azure Data Factory Studio.
Click "Create" and select "Pipeline".
Name the pipeline "PL_Ingest_IPL_Data".
3. Add a copy activity:
From the activities list, drag a "Copy Data" activity onto the pipeline canvas.
Rename the activity to "Copy_IPL_2008_Data".
4. Configure the copy activity:
In the "Source" tab:
- Select the "Dataset_Source_IPL" dataset.
In the "Sink" tab:
- Select the "Dataset_Sink_IPL_2008_JSON" dataset.
Leave other configurations as default for now.
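Behind the scenes, the pipeline's code view should look roughly like this sketch, with the source and sink settings left at their defaults:
{
  "name": "PL_Ingest_IPL_Data",
  "properties": {
    "activities": [
      {
        "name": "Copy_IPL_2008_Data",
        "type": "Copy",
        "inputs": [
          { "referenceName": "Dataset_Source_IPL_2008", "type": "DatasetReference" }
        ],
        "outputs": [
          { "referenceName": "Dataset_Sink_IPL_2008_JSON", "type": "DatasetReference" }
        ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "JsonSink" }
        }
      }
    ]
  }
}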
5. Publish changes:
- Click "Publish All" to save the pipeline and dataset changes.
6. Execute the pipeline:
Click "Debug" to start the pipeline execution.
Monitor the execution progress and view input/output details if needed.
7. Verify results:
Navigate to the "raw" container in your Azure Data Lake Gen2 storage.
Confirm that the "IPL_2008.json" file has been copied successfully.
Open the file to view the data in JSON format.
Summary:
The transcript guided the creation of a simple pipeline in Azure Data Factory to copy data from a GitHub CSV file to a JSON file in Azure Data Lake Gen2 using a copy activity.
The steps involved creating datasets, configuring the copy activity, publishing changes, executing the pipeline, and verifying the results.
In a similar fashion, you can bring in other files from GitHub as well.
1. Create datasets for the second file:
Clone the existing dataset for IPL 2008 CSV and rename it to "2009_CSV".
Update the file path in the cloned dataset to point to "IPL_2009.csv".
Clone the existing dataset for IPL 2008 JSON and rename it to "2009_JSON".
Update the file path in the cloned dataset to point to "IPL_2009.json".
Publish the changes.
2. Add a second copy activity to the pipeline:
Drag another "Copy Data" activity onto the pipeline canvas.
Configure the source of the new activity to use the "2009_CSV" dataset.
Configure the sink of the new activity to use the "2009_JSON" dataset.
3. Connect activities with a success dependency:
Link the output of the first copy activity (2008 data) to the input of the second copy activity (2009 data) using a "Success" dependency.
This ensures the second activity executes only if the first one completes successfully.
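In the pipeline JSON, the "Success" dependency shows up as a "dependsOn" entry on the second copy activity, roughly as sketched below (the activity name "Copy_IPL_2009_Data" is only illustrative):
{
  "name": "Copy_IPL_2009_Data",
  "type": "Copy",
  "dependsOn": [
    {
      "activity": "Copy_IPL_2008_Data",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "inputs": [ { "referenceName": "2009_CSV", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "2009_JSON", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "DelimitedTextSource" },
    "sink": { "type": "JsonSink" }
  }
}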
4. Publish and execute the pipeline:
Publish all changes to save the pipeline modifications.
Execute the pipeline using the "Debug" option.
5. Verify results:
Monitor the pipeline execution and confirm both copy activities complete successfully.
Navigate to the "raw" container in Azure Data Lake Gen2 storage.
Verify that both "IPL_2008.json" and "IPL_2009.json" files have been copied successfully.
Summary:
The transcript guided the expansion of an existing pipeline in Azure Data Factory to copy a second file from GitHub to Azure Data Lake Gen2.
This involved cloning datasets, adding a new copy activity, configuring dependencies, and executing the updated pipeline.
Creating Parameterized Datasets and Pipelines
1. Creating Parameterized Datasets:
Clone the "Dataset_Source_2008_CSV" dataset and call it "Dataset_Source_2000_CSV_with_parameter".
Replace the filename with a parameter named "p_file_name".
Use dynamic content in the connection string referencing "p_file_name".
Clone the "Dataset_Sink_IPL_2008_JSON" dataset and call it "Dataset_Sink_IPL_2008_JSON_p_parameterized".
Replace the filename with a parameter named "p_sink_file_name".
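A sketch of the parameterized source dataset's JSON; the "relativeUrl" now takes its value from the dataset parameter through a dynamic-content expression:
{
  "name": "Dataset_Source_2008_CSV_with_parameter",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LinkedService_HTTP_GitHub",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "p_file_name": { "type": "String" }
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "HttpServerLocation",
        "relativeUrl": {
          "value": "@dataset().p_file_name",
          "type": "Expression"
        }
      },
      "firstRowAsHeader": true
    }
  }
}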
2. Creating a Parameterized Pipeline:
Create a new pipeline named "Pull_Data".
Add a copy activity.
In the source, select the "Dataset_Source_2008_CSV_with_parameter" dataset.
In the "p_file_name" parameter, specify the desired filename (e.g., "IPL_2008.csv").
In the sink, select the "Dataset_Sink_IPL_2008_JSON_p_parameterized" dataset.
In the "p_sync_file_name" parameter, specify the desired JSON filename (e.g., "IPL_2008.json").
3. Execution and Verification:
Publish the changes and execute the pipeline using debug.
Verify that the desired file is copied from source to sink in the "raw" container.
Creating a Lookup Activity
1. Create a File Listing Source Files:
Create a text file listing the source file names, one per row, under a single column header (e.g., "Files").
Save this file within your Azure Data Lake Gen2 container.
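A hypothetical example of such a listing file (the file name and exact contents are up to you; the ForEach step later assumes a column named "Files"):
Files
IPL_2008.csv
IPL_2009.csv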
2. Create a New Dataset in ADF:
In Azure Data Factory, create a new dataset.
Select Azure Data Lake Storage Gen2 as the type.
Connect to your Azure Data Lake Gen2 account.
Choose the container where you saved the file listing.
Select the file and import its schema.
Name the dataset (e.g., "ListOfFiles").
3. Add a Lookup Activity to Your Pipeline:
Drag a Lookup activity onto your pipeline canvas.
Name it (e.g., "LookupFileNames").
Configure it to use the "ListOfFiles" dataset you created.
Make sure you select the right file path type.
Ensure "First row only" is deselected to read all rows.
4. Publish and Execute:
Publish the pipeline changes.
Execute the pipeline.
Verify that the Lookup activity successfully reads all file names from the text file.
Creating a ForEach Activity and Finalizing the Pipeline
1. Add a ForEach Activity:
Drag a ForEach activity onto the pipeline canvas.
Connect it to the Lookup activity's output.
Rename it descriptively (e.g., "ForEachFileNames").
In its settings:
Select "Sequential" for execution mode.
Under "Items," use dynamic content to reference the Lookup activity's output:
2. Configure Copy Activity within ForEach:
Cut and paste the existing Copy activity into the ForEach activity's scope.
Delete any unnecessary wait activity.
In the Copy activity's source settings:
Use dynamic content to reference the current file name from the ForEach iteration:
In the picture below, you can see @item().Files specified instead of just @item(). This is because @item() returns the whole current row from the Lookup output, while appending .Files drills down to the value in the column named "Files" for that row, which is the file name we want to pass to the source.
In the Copy activity's sink settings:
Use dynamic content and the "replace" string function to change the file extension from CSV to JSON:
- Type @replace(item().Files, '.csv', '.json')
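Putting it all together, the ForEach activity's JSON should look roughly like this sketch (the inner activity name "Copy_Data" and the parameterized dataset names follow the earlier sections; yours may differ):
{
  "name": "ForEachFileNames",
  "type": "ForEach",
  "dependsOn": [
    { "activity": "LookupFileNames", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "isSequential": true,
    "items": {
      "value": "@activity('LookupFileNames').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Copy_Data",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "JsonSink" }
        },
        "inputs": [
          {
            "referenceName": "Dataset_Source_2008_CSV_with_parameter",
            "type": "DatasetReference",
            "parameters": {
              "p_file_name": { "value": "@item().Files", "type": "Expression" }
            }
          }
        ],
        "outputs": [
          {
            "referenceName": "Dataset_Sink_IPL_2008_JSON_p_parameterized",
            "type": "DatasetReference",
            "parameters": {
              "p_sink_file_name": {
                "value": "@replace(item().Files, '.csv', '.json')",
                "type": "Expression"
              }
            }
          }
        ]
      }
    ]
  }
}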
3. Publish and Execute:
Publish the pipeline changes.
Execute the pipeline.
Observe the successful execution of the Lookup activity, followed by multiple iterations of the Copy activity, each processing a different file.
Verify that all files have been copied to the sink container with the correct JSON format.