What is ETL?
ETL stands for Extract, Transform, Load — three steps that together form the process of consolidating data from various sources into usable information. It's the backbone of virtually every business intelligence and data warehouse system.
A simple analogy: imagine cooking a meal with ingredients from three different markets. First, you collect the ingredients (Extract). Then you wash, cut, and prepare them (Transform). Finally, you put everything in the pan (Load). The result is a ready-to-serve meal — or in data terms: a clean, structured dataset ready for analysis.
Without ETL, you'd have to manually copy data between systems, stitch Excel files together, and repeat the same cleaning tasks over and over. ETL automates this entire process so you always have current, reliable data.
Extract — Pulling data from sources
The first step is Extract: pulling data from all kinds of sources. In a typical organization, data comes from dozens of systems:
- Databases — SQL Server, PostgreSQL, MySQL, Oracle
- Files — Excel sheets, CSV exports, XML feeds, JSON files
- Cloud applications — Salesforce, HubSpot, Google Analytics
- APIs — REST or GraphQL endpoints from web services
- Streaming data — IoT sensors, log files, event streams
Two key choices: full extract (fetch everything each time — simple but slow) vs. incremental extract (only fetch what changed — faster but more complex). A good ETL process ensures extraction doesn't burden the source systems.
Transform — Cleaning and enriching data
The Transform step is where the real magic happens. Raw source data is almost never directly usable. Typical transformations include:
- Cleaning — Removing duplicates, handling missing values, fixing typos
- Standardizing — Aligning date formats, converting currencies, normalizing units
- Joining — Combining data from different sources (e.g., CRM customer data with ERP order data)
- Enrichment — Adding derived information (age from birthdate, margins from revenue and costs)
- Aggregation — Summarizing detail data (daily transactions to monthly totals)
- Business rules — Organization-specific calculations
In practice, data engineers spend 60-80% of their time understanding and cleaning data. Building the dashboard is often the easy part.
Load — Loading data into the warehouse
The final step is Load: loading transformed data into the target database, usually a data warehouse or data lakehouse.
| Strategy | How it works | When to use |
|---|---|---|
| Full load | Target table is emptied and completely refilled | Small datasets, simple scenarios |
| Incremental load | Only new or changed records are added/updated | Large datasets, frequent refreshes |
Modern data warehouses (Snowflake, BigQuery, Microsoft Fabric) make loading increasingly easy with features like MERGE statements, automatic partitioning, and scalable storage.
ETL vs. ELT — What's the difference?
Traditionally, transformation happens before loading: ETL. But recently, ELT (Extract, Load, Transform) has become popular — you load raw data first, then transform inside the warehouse.
| Aspect | ETL | ELT |
|---|---|---|
| Where transformation happens | Separate ETL server | Inside the data warehouse |
| Speed | Slower for large volumes | Faster via parallel processing |
| Flexibility | Transformations fixed at design time | Raw data available for new transformations |
| Best for | On-premises, sensitive data, compliance | Cloud environments, large volumes, agile |
The trend is clearly toward ELT, driven by powerful cloud warehouses. Tools like dbt are designed specifically for the "T" in ELT. But ETL isn't dead — it's still the better choice when you need to filter sensitive data before loading (GDPR compliance).
Popular ETL tools
Key ETL/ELT tools to consider:
- Power Query — Built into Power BI and Excel. Great for beginners and self-service BI. Free.
- Azure Data Factory — Microsoft's cloud ETL/ELT platform. Enterprise-grade, pay-as-you-go.
- dbt — The standard for SQL-based transformations in modern data stacks. Open source core.
- Informatica — Enterprise ETL with broad connector support.
- Fivetran — Automated connectors for loading SaaS data. Volume-based pricing.
- Microsoft Fabric Dataflows — Cloud ETL/ELT integrated with Power BI and Fabric.
For most organizations already using Microsoft, the combination of Power Query and Azure Data Factory or Fabric Dataflows is a logical choice. Start with the simplest tool that meets your needs.