
If you're working with Azure Data Lake Storage Gen2 (ADLS Gen2) and Databricks, you might want to access your data without using mounts, for added flexibility and security. In this blog, I'll walk you through how to connect Databricks to ADLS Gen2 using access keys or Azure Active Directory (AAD) authentication in a simple and secure way.
Why Avoid Mounts?
Mounting ADLS Gen2 to Databricks is convenient but comes with some limitations:
- Mounts are scoped to a single workspace and cannot be shared across workspaces; within a workspace, any user can read a mounted path, regardless of their own permissions on the storage account.
- You need high-level access to the storage account to create a mount, which may not align with your organization's security policies.
Instead, directly accessing ADLS Gen2 provides better control and flexibility.
Step-by-Step Guide
Step 1: Prerequisites
Before starting, make sure you have:
- An ADLS Gen2 storage account (a storage account with the hierarchical namespace enabled).
- A Databricks workspace.
- Access credentials (either Storage Account Access Keys or AAD authentication details).
Step 2: Set Up Access in Databricks
Option 1: Using Access Keys
- Get the Access Key
- In the Azure Portal, navigate to your storage account.
- Under Security + Networking, select Access keys.
- Copy one of the access keys.
- Configure Access in Databricks
- Open your Databricks notebook and use the following code to set up your configurations:
python
spark.conf.set("fs.azure.account.key.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net", "<ACCESS_KEY>")
Replace <STORAGE_ACCOUNT_NAME> with the name of your storage account and <ACCESS_KEY> with your copied key.
- Access Data
Use the Spark API to access files in your storage account.
python
df = spark.read.csv("abfss://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<FILE_PATH>")
df.show()
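Before reading a full dataset, you can confirm the connection works by listing the container's contents with dbutils.fs.ls. This is a minimal sketch; the container and folder names are placeholders for your own values:
python
# List the files at the root of the container to verify access
# (replace the placeholders with your container and account names)
files = dbutils.fs.ls("abfss://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/")
for f in files:
    print(f.path, f.size)
If the access key is wrong or the account name has a typo, this call fails immediately with an authentication or "account not found" error, which is easier to debug than a failed read later in your pipeline.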
Option 2: Using Azure Active Directory (AAD) Authentication
- Create an Azure Service Principal
- In Azure Active Directory, register a new app.
- Note the Application (client) ID, Directory (tenant) ID, and generate a client secret.
- Grant Storage Permissions
- Navigate to your storage account in Azure.
- Under Access control (IAM), assign the Storage Blob Data Contributor role to your service principal.
- Configure Access in Databricks
- Set up the following configurations in your Databricks notebook:
python
spark.conf.set("fs.azure.account.auth.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net", "<CLIENT_ID>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net", "<CLIENT_SECRET>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<TENANT_ID>/oauth2/token")
Replace <STORAGE_ACCOUNT_NAME>, <CLIENT_ID>, <CLIENT_SECRET>, and <TENANT_ID> with your values.
- Access Data
Now, you can read or write files just as before:
python
df = spark.read.csv("abfss://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<FILE_PATH>")
df.show()
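Writing works the same way. Here is a minimal sketch of saving a DataFrame back to the lake as Parquet; <OUTPUT_PATH> is a placeholder for a folder of your choosing:
python
# Write the DataFrame back to ADLS Gen2 in Parquet format
# (<OUTPUT_PATH> is a placeholder for a folder of your choosing)
df.write.mode("overwrite").parquet(
    "abfss://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<OUTPUT_PATH>"
)
Note that writes require the Storage Blob Data Contributor role assigned earlier; the Reader role is not enough.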
Step 3: Secure Your Credentials
Never hardcode sensitive credentials in your notebooks. Instead:
- Use Databricks Secrets to store sensitive data.
- Access secrets programmatically in your notebooks:
python
client_id = dbutils.secrets.get(scope="my_scope", key="client_id")
client_secret = dbutils.secrets.get(scope="my_scope", key="client_secret")
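Putting it all together, here is a minimal sketch of the AAD setup with every sensitive value pulled from a Databricks secret scope. The scope name my_scope and the key names are assumptions; substitute the ones you created with the Databricks CLI:
python
# Pull all sensitive values from the secret scope instead of hardcoding them
# (scope and key names below are examples; use your own)
tenant_id = dbutils.secrets.get(scope="my_scope", key="tenant_id")
client_id = dbutils.secrets.get(scope="my_scope", key="client_id")
client_secret = dbutils.secrets.get(scope="my_scope", key="client_secret")

account = "<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
As a bonus, values fetched through dbutils.secrets.get are redacted in notebook output, so printing client_secret shows [REDACTED] rather than the actual value.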
Conclusion
By using these methods, you can connect Databricks to Azure Data Lake Storage Gen2 securely without mounts. Whether you choose access keys or AAD authentication, you'll get a more scalable and secure way to access your data.