
If you’ve ever had to transfer large amounts of data between systems, you know how time-consuming and resource-intensive it can be. Fortunately, Azure Data Factory (ADF) offers a feature that makes this process much faster and more efficient: Incremental Copy.
This blog will walk you through, in simple terms, what incremental copy is, why it’s useful, and how to set it up step-by-step using Azure Data Factory.
What is Azure Data Factory (ADF)?
Azure Data Factory (ADF) is a cloud-based data integration service that moves and transforms data between different storage systems. Think of it as a digital conveyor belt that carries data from one system (like a database) to another (like a data warehouse).
What is an Incremental Copy?
Imagine you have a database with thousands of records. If you had to copy the entire database every day, it would take a lot of time and system resources. Incremental Copy solves this problem by copying only the new or updated records, so there’s no need to copy everything.
This is done using a watermark column, which tracks the last-modified date of each record. Each time the copy process runs, ADF looks only for records that were added or updated after the previous run.
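For example, suppose the last successful run finished at 8:00 AM on January 1st. The next run only needs rows modified after that point. A minimal T-SQL sketch (the Orders table, LastUpdated column, and timestamp are all illustrative):

```sql
-- Illustrative only: assumes a source table named Orders
-- with a LastUpdated datetime column.
-- @LastWatermark holds the highest LastUpdated value
-- copied by the previous run.
DECLARE @LastWatermark DATETIME2 = '2024-01-01T08:00:00';

SELECT *
FROM Orders
WHERE LastUpdated > @LastWatermark;  -- only new or changed rows
```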
Why Should You Use Incremental Copy?
- Save Time: Instead of copying millions of records, only new or changed records are copied.
- Reduce Costs: Less data movement means lower costs.
- Increase Efficiency: Your data is ready faster for analysis or reporting.
How Does It Work?
- Watermark column: A date or timestamp column (like “LastUpdated”) that tells ADF which records are new or updated (see the control-table sketch after this list).
- Data source: The source system (like an Azure SQL Database) where the original data is stored.
- Data destination: The place where you want to store the copied data (like a Data Lake or SQL Database).
- Pipeline: The instructions that ADF follows to transfer the data.
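A common way to implement the watermark (and the pattern used in the official ADF incremental-copy tutorial) is a small control table that stores the last watermark value for each source table. A minimal sketch; the table and column names are illustrative:

```sql
-- Minimal watermark control table (names are illustrative).
-- One row per source table, storing the highest watermark
-- value copied so far.
CREATE TABLE WatermarkTable (
    TableName      VARCHAR(255) NOT NULL PRIMARY KEY,
    WatermarkValue DATETIME2    NOT NULL
);

-- Seed with a date older than any real data,
-- so the first run copies everything.
INSERT INTO WatermarkTable (TableName, WatermarkValue)
VALUES ('Orders', '1900-01-01');
```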
Step-by-Step Guide to Incrementally Copy Multiple Tables Using Azure Data Factory
Step 1: Create an Azure Data Factory
- Go to Azure Portal (https://portal.azure.com).
- Click on Create a Resource and search for Data Factory.
- Click Create.
- Enter details like the Data Factory name, subscription, and resource group.
- Click Review + Create and wait for deployment.
Step 2: Create Linked Services
Linked Services are connections to external data sources (like SQL databases, cloud storage, etc.).
- Go to ADF and click on Manage (gear icon on the left).
- Click Linked Services > + New.
- Choose your source system (like Azure SQL Database or Azure Blob Storage).
- Provide connection details (like server name, database name, username, and password).
- Click Test Connection to ensure it works.
Step 3: Create the Incremental Copy Pipeline
This is the “recipe” ADF follows to copy data.
- In ADF, go to Author (left sidebar) and click + New Pipeline.
- Drag and drop a Copy Data activity from the Activities pane onto the canvas.
- Source Configuration:
- Select the Linked Service (your source database).
- Choose the table you want to copy. A single Copy Data activity copies one table; to cover multiple tables, run the activity inside a ForEach loop (see the questions below).
- Use a filter to select only changed records, for example based on the LastUpdated or ModifiedDate column (see the query sketch after this list).
- Sink Configuration (Destination):
- Select the Linked Service where the data will be stored (like Azure Data Lake or SQL Database).
- Mapping: Ensure that the source and destination columns match (like LastUpdated in the source mapping to LastUpdated in the destination).
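In practice, the “filter” in the source is usually written as a query. A typical setup reads the old watermark with a Lookup activity and references it in the Copy activity’s source query through ADF dynamic content. A sketch, assuming a Lookup activity named LookupOldWatermark and the illustrative Orders table from earlier:

```sql
-- Source query typed into the Copy activity (not run as-is in SQL).
-- The @{...} tokens are ADF dynamic-content expressions that ADF
-- substitutes with real values before sending the query to the database.
-- Assumes a Lookup activity named 'LookupOldWatermark' that reads
-- the watermark control table before the copy runs.
SELECT *
FROM Orders
WHERE LastUpdated >  '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'
  AND LastUpdated <= '@{utcNow()}';  -- upper bound: the pipeline run time
```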
Step 4: Add a Watermark Column
This is the most important step for incremental copying. The watermark tells ADF which data is new.
- In the Copy Data Activity, go to the Source tab.
- Set a filter on the watermark column (for example, LastUpdated greater than the last stored watermark value).
- ADF doesn’t track the last run time by itself; the usual pattern is to store the watermark in a control table, read it with a Lookup activity before the copy, and advance it after each successful run (as sketched below).
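After a successful copy, the new watermark must be written back, otherwise the next run would re-copy the same rows. A common approach is a Stored Procedure activity that calls something like the sketch below (procedure and column names are illustrative, matching the control table from earlier):

```sql
-- Called by a Stored Procedure activity after the copy succeeds.
-- Advances the stored watermark so the next run starts where
-- this one finished.
CREATE PROCEDURE usp_UpdateWatermark
    @TableName    VARCHAR(255),
    @NewWatermark DATETIME2
AS
BEGIN
    UPDATE WatermarkTable
    SET WatermarkValue = @NewWatermark
    WHERE TableName = @TableName;
END;
```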
Step 5: Schedule the Pipeline to Run Automatically
Once your pipeline is ready, you can set it to run on a schedule (like every night).
- Click on Add Trigger at the top of the pipeline editor.
- Select New/Edit.
- Set the schedule (like “Run every day at 2:00 AM”).
- Save and activate the schedule.
Step 6: Monitor the Pipeline
After the pipeline runs, you can check if everything worked correctly.
- Go to the Monitor section in ADF.
- View logs, check success/failure messages, and see how many records were copied.
- If something goes wrong, you can view error details.
Example Scenario
Imagine you manage a sales database with a “LastUpdated” column that tracks changes to orders. Each day, new orders are added, and existing orders might be updated (like changes to customer info or order status).
Here’s how incremental copying works:
- Day 1: ADF copies all 100,000 sales records.
- Day 2: 10 new sales are added, and 20 records are updated. ADF only copies these 30 changes, not the full 100,000.
- Day 3: Another 15 new records and 5 updates are made. ADF only copies these 20 changes.
This process makes the data transfer faster and cheaper.
Pro Tips for Success
- Choose the right Watermark: If your table doesn’t have a LastUpdated column, consider adding one.
- Test with small datasets first: Before running it on large tables, try it with smaller test data.
- Error Monitoring: If something fails, check the logs to find issues with connection, mapping, or credentials.
- Optimize Frequency: Don’t run the pipeline too frequently unless absolutely necessary.
Common Questions
1. What if my table doesn’t have a “LastUpdated” column?
You can create a column like ModifiedDate and have it automatically update whenever a row changes. This can be done using triggers in SQL databases.
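A minimal T-SQL sketch of such a trigger, assuming an illustrative Orders table with an OrderID primary key and a ModifiedDate column:

```sql
-- Keeps ModifiedDate current whenever a row is inserted or updated.
-- Assumes RECURSIVE_TRIGGERS is off (the SQL Server default), so the
-- trigger's own UPDATE doesn't re-fire it.
CREATE TRIGGER trg_Orders_SetModifiedDate
ON Orders
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE o
    SET ModifiedDate = SYSUTCDATETIME()
    FROM Orders AS o
    INNER JOIN inserted AS i ON o.OrderID = i.OrderID;
END;
```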
2. How do I know if incremental copy is working?
You can view logs in the Monitor tab of Azure Data Factory. It shows how many records were copied and if any errors occurred.
3. Can I copy multiple tables at once?
Yes! The usual pattern is a Lookup activity that reads the list of tables from a control table, feeding a ForEach activity that runs the parameterized Copy activity once per table, either sequentially or in parallel.
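For example, the illustrative control table from earlier can drive the loop: a Lookup activity runs a query like the one below, and the ForEach iterates over the rows it returns, passing each table name and watermark into the parameterized Copy activity:

```sql
-- Query for the Lookup activity that feeds the ForEach loop.
-- Returns one row per table to copy, using the illustrative
-- WatermarkTable defined earlier.
SELECT TableName, WatermarkValue
FROM WatermarkTable;
```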
4. What is the difference between Full Copy and Incremental Copy?
- Full Copy: Copies everything from the source.
- Incremental Copy: Copies only the new or updated records.
Summary
Incremental copying in Azure Data Factory is a smart, efficient way to copy only changed data, saving time and costs. By using watermark columns, linked services, and pipelines, you can copy multiple tables incrementally and on a schedule.
If you’re working with large databases and want to keep your data warehouse up-to-date, incremental copy is your best friend. Try it out today and see how much time it can save you!
If you’d like more technical details, refer to the official Azure Data Factory documentation.
Happy Data Engineering! 🚀