Databricks Data Loader

A comprehensive data loading module for Databricks that provides parallel file processing with multiple loading strategies, file tracking, and robust error handling. Now with enhanced cluster mode for optimized Databricks operations.

New users should start with the Quickstart guide, which explains installation using Poetry and walks through the configuration workflow with a small demo. See Pipeline State Management for details on restarting interrupted runs.

Features

File Monitoring: Automatically discovers and processes new files from configured locations
File Tracking: Tracks processing status to prevent duplicate processing using Delta tables
Parallel Processing: Configurable parallel execution for efficient file processing
Multiple Loading Strategies:
- SCD2 (Slowly Changing Dimensions Type 2): Maintains historical records with change tracking
- Append: Simple append operation for tables without primary keys
- Overwrite: Replace table contents (coming soon)
- Merge: Custom merge logic (coming soon)
Schema Evolution: Automatic schema evolution support
Error Handling: Robust error handling with configurable retry logic
Monitoring: Comprehensive logging and metrics collection
State Management: Pipeline progress saved to disk for safe restarts
Reset Command: Easily clear saved state with reset-state
Optimization: Automatic table optimization and vacuum operations
🆕 Cluster Mode: Enhanced Databricks cluster integration with:
- Automatic environment detection and optimization
- Resource monitoring and management
- Unity Catalog support
- Job dependency management
- Cluster-aware performance tuning

Quick Start

Standard Mode

# Install dependencies using Poetry
poetry install

# Run with configuration file
poetry run python -m data_loader.main run --config config.yaml

🆕 Cluster Mode (Recommended for Databricks)

# Run with cluster-specific optimizations
poetry run python -m data_loader.main run-cluster --config config.yaml

# Run with Unity Catalog
poetry run python -m data_loader.main run-cluster --config config.yaml --unity-catalog

# Check cluster status
poetry run python -m data_loader.main cluster-status --config config.yaml

Examples

Several example scripts are included in this repository. Execute them using poetry run to ensure all dependencies are loaded from the virtual environment.

# Basic configuration driven demo
poetry run python demo/run_demo.py

# Demonstrate configuration merging
poetry run python demo/config_merge_demo.py

# Showcase pipeline state management
poetry run python demo/state_management_demo.py

# Reset pipeline state
poetry run python demo/reset_state_demo.py

# Full example usage script
poetry run python example_usage.py

# Cluster mode demonstration
poetry run python cluster_demo.py

Architecture

data_loader/
├── config/                  # Configuration management
│   ├── table_config.py     # Table and loading strategy configuration
│   └── databricks_config.py # Databricks-specific settings
├── core/                   # Core processing components
│   ├── file_tracker.py     # File processing status tracking
│   ├── processor.py        # Main orchestrator
│   └── parallel_executor.py # Parallel processing framework
├── cluster/                # 🆕 Cluster mode components
│   ├── cluster_config.py   # Environment detection and configuration
│   ├── cluster_processor.py # Cluster-optimized processor
│   ├── resource_manager.py # Resource monitoring and optimization
│   └── job_orchestrator.py # Dependency and workflow management
├── strategies/             # Loading strategy implementations
│   ├── base_strategy.py    # Base strategy interface
│   ├── scd2_strategy.py    # SCD2 implementation
│   └── append_strategy.py  # Append strategy implementation
├── utils/                  # Utility functions
│   ├── logger.py          # Logging utilities
│   └── helpers.py         # Helper functions
└── main.py                # Entry point for Databricks jobs

Installation

Install dependencies with Poetry:

poetry install

(Optional) Activate the virtual environment created by Poetry:

poetry shell

Configuration

The data loader uses YAML configuration files to define tables, loading strategies, and processing options.

You can load a configuration file programmatically using load_config_from_file:

from data_loader.config import load_config_from_file

config = load_config_from_file("path/to/config.yaml")

Example Configuration

raw_data_path: /mnt/raw/
processed_data_path: /mnt/processed/
checkpoint_path: /mnt/checkpoints/
file_tracker_table: file_processing_tracker
file_tracker_database: metadata
max_parallel_jobs: 4
retry_attempts: 3
timeout_minutes: 60
log_level: INFO
enable_metrics: true
tables:
  - table_name: customers
    database_name: analytics
    source_path_pattern: /mnt/raw/customers/*.parquet
    loading_strategy: scd2
    primary_keys:
      - customer_id
    tracking_columns:
      - name
      - email
      - address
    file_format: parquet
    schema_evolution: true
    partition_columns:
      - date_partition
  - table_name: transactions
    database_name: analytics
    source_path_pattern: /mnt/raw/transactions/*.parquet
    loading_strategy: append
    file_format: parquet
    schema_evolution: true
    partition_columns:
      - transaction_date

Configuration Options

Global Settings

raw_data_path: Path to raw data location
processed_data_path: Path to processed data location
checkpoint_path: Path for checkpoints and metadata
max_parallel_jobs: Maximum number of concurrent processing jobs
retry_attempts: Number of retry attempts for failed files
timeout_minutes: Timeout for processing a single file

Table Configuration

table_name: Name of the target table
database_name: Target database/schema name
source_path_pattern: File path pattern to match source files (supports wildcards)
loading_strategy: Loading strategy (scd2 or append; overwrite and merge are listed under Features as coming soon)
file_format: Source file format (parquet, csv, json, delta)
schema_evolution: Enable automatic schema evolution
partition_columns: Columns to partition the target table by
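
To illustrate how a wildcard in source_path_pattern selects files, here is a minimal sketch using Python's standard fnmatch module. The loader's own matching code is not shown in this README, so the helper name match_source_files is hypothetical and the real behavior may differ.

```python
from fnmatch import fnmatchcase


def match_source_files(paths, pattern):
    """Return the paths that match a shell-style wildcard pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# Hypothetical file listing for illustration only.
candidates = [
    "/mnt/raw/customers/2024-01-01.parquet",
    "/mnt/raw/customers/_SUCCESS",
    "/mnt/raw/transactions/2024-01-01.parquet",
]
matched = match_source_files(candidates, "/mnt/raw/customers/*.parquet")
```

Note that with fnmatch-style matching, * also matches path separators, so the pattern above would match files in nested subdirectories as well.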

SCD2 Specific Options

primary_keys: Primary key columns for SCD2
tracking_columns: Columns to track for changes
scd2_effective_date_column: Effective date column name
scd2_end_date_column: End date column name
scd2_current_flag_column: Current flag column name
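
The relationship between loading_strategy and the SCD2 options can be sketched as a small validation step. This is an assumption about how such a check might look, not the code in table_config.py; the class name TableConfigSketch and its validate method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableConfigSketch:
    """Hypothetical stand-in for a table entry in the YAML configuration."""
    table_name: str
    database_name: str
    source_path_pattern: str
    loading_strategy: str
    file_format: str = "parquet"
    primary_keys: List[str] = field(default_factory=list)
    tracking_columns: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty means the entry looks valid."""
        errors = []
        if self.loading_strategy not in {"scd2", "append", "overwrite", "merge"}:
            errors.append(f"unknown loading_strategy: {self.loading_strategy}")
        if self.loading_strategy == "scd2":
            # SCD2 needs keys to match records and columns to compare for changes.
            if not self.primary_keys:
                errors.append("scd2 requires primary_keys")
            if not self.tracking_columns:
                errors.append("scd2 requires tracking_columns")
        return errors


customers = TableConfigSketch(
    table_name="customers",
    database_name="analytics",
    source_path_pattern="/mnt/raw/customers/*.parquet",
    loading_strategy="scd2",
    primary_keys=["customer_id"],
    tracking_columns=["name", "email", "address"],
)
problems = customers.validate()
```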

Usage

Command Line Interface

Standard Data Loading

# Run with configuration file
python -m data_loader.main run --config config.yaml

# Run with inline YAML configuration
python -m data_loader.main run --config-json 'raw_data_path: /mnt/raw/\n...'

# Run specific tables only
python -m data_loader.main run --config config.yaml --tables "customers,transactions"

# Dry run to see what would be processed
python -m data_loader.main run --config config.yaml --dry-run

# Run with optimization and vacuum
python -m data_loader.main run --config config.yaml --optimize --vacuum

🆕 Cluster Mode (Enhanced for Databricks)

# Run with cluster optimizations (recommended)
python -m data_loader.main run-cluster --config config.yaml

# Run with Unity Catalog support
python -m data_loader.main run-cluster --config config.yaml --unity-catalog

# Run with resource monitoring
python -m data_loader.main run-cluster --config config.yaml --monitoring

# Dry run with cluster status
python -m data_loader.main run-cluster --config config.yaml --dry-run

# Check cluster configuration and health
python -m data_loader.main cluster-status --config config.yaml

Check Processing Status

python -m data_loader.main status --config config.yaml

Create Example Configuration

python -m data_loader.main create-example-config --output my_config.yaml

Databricks Job Setup

Standard Mode

Upload the package to Databricks workspace or DBFS
Create a new job with the following configuration:
- Cluster: Use a cluster with Databricks Runtime 11.0+ and Delta Lake support
- Task Type: Python script
- Script path: Path to main.py in your uploaded package
- Parameters: ["run", "--config", "/path/to/config.yaml"]

🆕 Cluster Mode (Recommended)

Upload the package to Databricks workspace or DBFS
Create a new job with the following configuration:
- Cluster: Use a cluster with Databricks Runtime 11.0+ and Delta Lake support
- Task Type: Python script
- Script path: Path to main.py in your uploaded package
- Parameters: ["run-cluster", "--config", "/path/to/config.yaml", "--unity-catalog"]
Set up file trigger (if using file-based triggers):
- Configure the job to trigger on file arrival in your raw data location
- Use Databricks Auto Loader for streaming ingestion scenarios

Enhanced Job Configuration for Cluster Mode

{
  "job_clusters": [{
    "job_cluster_key": "data-loader-cluster",
    "new_cluster": {
      "spark_version": "11.3.x-scala2.12",
      "node_type_id": "i3.xlarge",
      "num_workers": 4,
      "spark_conf": {
        "spark.databricks.delta.optimizeWrite.enabled": "true",
        "spark.databricks.delta.autoCompact.enabled": "true"
      }
    }
  }],
  "tasks": [{
    "task_key": "data-loader",
    "job_cluster_key": "data-loader-cluster",
    "python_wheel_task": {
      "package_name": "databricks_data_loader",
      "entry_point": "main",
      "parameters": ["run-cluster", "--config", "/mnt/config/data_loader.json"]
    }
  }]
}

Programmatic Usage

Standard Mode

from data_loader.config.table_config import DataLoaderConfig
from data_loader.core.processor import DataProcessor

# Load configuration
config = DataLoaderConfig(**config_dict)

# Initialize processor
processor = DataProcessor(config)

# Process all tables
results = processor.process_all_tables()

# Process specific table
table_config = config.get_table_config("customers")
table_result = processor.process_table(table_config)

# Check status
status = processor.get_processing_status()

🆕 Cluster Mode (Enhanced)

from data_loader.config.table_config import DataLoaderConfig
from data_loader.cluster import ClusterConfig, ClusterDataProcessor, DatabricksEnvironment

# Load base configuration
base_config = DataLoaderConfig(**config_dict)

# Detect Databricks environment and create cluster configuration
environment = DatabricksEnvironment.detect_environment()
cluster_config = ClusterConfig.from_base_config(
    base_config=base_config,
    environment=environment,
    enable_cluster_optimizations=True,
    use_unity_catalog=True
)

# Initialize cluster processor
processor = ClusterDataProcessor(cluster_config)

# Validate cluster configuration
validation = processor.validate_cluster_configuration()
if not validation['valid']:
    raise ValueError(f"Configuration invalid: {validation['errors']}")

# Process with cluster optimizations
results = processor.process_all_tables()

# Get comprehensive cluster status
cluster_status = processor.get_cluster_status()

Loading Strategies

SCD2 (Slowly Changing Dimensions Type 2)

The SCD2 strategy maintains historical records by:

Comparing incoming records with current records
Identifying new and changed records
Marking changed records as inactive (setting end_date and is_current=false)
Inserting new/changed records as active

Requirements:

primary_keys: Columns that uniquely identify records
tracking_columns: Columns to monitor for changes
SCD2 metadata columns (effective_date, end_date, is_current)
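
The four steps above can be sketched in plain Python over lists of dictionaries. This is only an illustration of the SCD2 technique under simplifying assumptions (no deletes, one load date per batch); the actual implementation in scd2_strategy.py runs on Spark and Delta tables and is not reproduced here. The function name apply_scd2 is hypothetical.

```python
from datetime import date


def apply_scd2(existing, incoming, primary_keys, tracking_columns, load_date):
    """Return the table contents after one SCD2 load, without mutating the inputs."""
    def key_of(row):
        return tuple(row[k] for k in primary_keys)

    def tracked(row):
        return tuple(row[c] for c in tracking_columns)

    result = [dict(r) for r in existing]  # copy rows so callers keep their originals
    active = {key_of(r): r for r in result if r["is_current"]}

    for row in incoming:
        current = active.get(key_of(row))
        if current is not None and tracked(current) == tracked(row):
            continue  # unchanged record: keep the active version as is
        if current is not None:
            # Changed record: close the previously active version.
            current["end_date"] = load_date
            current["is_current"] = False
        new_version = {**row, "effective_date": load_date,
                       "end_date": None, "is_current": True}
        result.append(new_version)
        active[key_of(row)] = new_version
    return result


existing = [{"customer_id": 1, "name": "Ann", "email": "ann@old.example",
             "effective_date": date(2024, 1, 1), "end_date": None, "is_current": True}]
incoming = [{"customer_id": 1, "name": "Ann", "email": "ann@new.example"},
            {"customer_id": 2, "name": "Bob", "email": "bob@example.com"}]
loaded = apply_scd2(existing, incoming, ["customer_id"], ["name", "email"], date(2024, 2, 1))
```

Applying the same incoming batch a second time adds no rows, because every incoming record then matches its active version.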

Append Strategy

The append strategy adds new data to the target table without change detection; deduplication is optional (see Features below). Suitable for:

Event/transaction tables
Log tables
Tables without primary keys
Any scenario where all incoming data should be preserved

Features:

Automatic audit column addition (_load_timestamp, _source_file, _batch_id)
Optional deduplication
Late-arriving data handling
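
As a rough illustration of the audit columns, the sketch below stamps each record with _load_timestamp, _source_file, and _batch_id. The column names come from the list above; the helper add_audit_columns and the way the batch ID is generated are assumptions, not the loader's actual code.

```python
import uuid
from datetime import datetime, timezone


def add_audit_columns(records, source_file, batch_id=None, load_timestamp=None):
    """Return copies of the records with the three audit columns attached."""
    # One batch ID and one timestamp are shared by every record in the call.
    batch_id = batch_id or uuid.uuid4().hex
    load_timestamp = load_timestamp or datetime.now(timezone.utc)
    return [
        {**record,
         "_load_timestamp": load_timestamp,
         "_source_file": source_file,
         "_batch_id": batch_id}
        for record in records
    ]


rows = add_audit_columns(
    [{"transaction_id": 10, "amount": 25.0}],
    source_file="/mnt/raw/transactions/2024-01-01.parquet",
    batch_id="demo-batch",
    load_timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```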

🆕 Cluster Mode Features

Environment Detection

Automatically detects and optimizes for Databricks environments:

Cluster Type: Single User, Shared, or No Isolation Shared
Resource Allocation: Worker count, cores, memory configuration
Runtime Features: Unity Catalog availability, Delta Lake optimization
Optimal Parallelism: Calculates ideal parallel job count based on cluster size
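
The README does not state the formula used for the parallel job count. The sketch below shows one plausible heuristic, assuming a fixed number of cores per job and the configured max_parallel_jobs as an upper bound; the function name suggest_parallel_jobs and the cores_per_job parameter are hypothetical.

```python
def suggest_parallel_jobs(num_workers, cores_per_worker, configured_max, cores_per_job=4):
    """Pick a job count bounded by cluster capacity and by the configured maximum."""
    total_cores = max(1, num_workers) * max(1, cores_per_worker)
    by_capacity = max(1, total_cores // cores_per_job)  # always allow at least one job
    return min(configured_max, by_capacity)


small_cluster = suggest_parallel_jobs(num_workers=1, cores_per_worker=4, configured_max=8)
large_cluster = suggest_parallel_jobs(num_workers=8, cores_per_worker=8, configured_max=4)
```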

Resource Management

Real-time monitoring and optimization:

# Monitor cluster resources
resources = processor.resource_manager.get_cluster_resources()
health = processor.resource_manager.get_health_status()

# Get optimization recommendations
recommendations = processor.resource_manager.get_optimization_recommendations()

Unity Catalog Integration

Seamless integration with Unity Catalog:

# Enable Unity Catalog support
cluster_config = ClusterConfig.from_base_config(
    base_config=base_config,
    use_unity_catalog=True,
    default_catalog="production"
)

# Tables automatically use: catalog.schema.table format
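
The naming rule in the comment above can be sketched as follows. This is an assumption about how the name might be assembled, with a hypothetical helper qualified_table_name; it falls back to the two-part schema.table form when no catalog is given.

```python
def qualified_table_name(table_name, database_name, catalog=None):
    """Build catalog.schema.table when a catalog is set, otherwise schema.table."""
    parts = [database_name, table_name]
    if catalog:
        parts.insert(0, catalog)
    return ".".join(parts)


with_catalog = qualified_table_name("customers", "analytics", catalog="production")
without_catalog = qualified_table_name("customers", "analytics")
```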

Job Dependencies and Orchestration

Manage complex workflow dependencies:

cluster_config = ClusterConfig(
    base_config=base_config,
    enable_job_dependencies=True,
    upstream_dependencies=[
        "job:bronze-pipeline-job-id",
        "table:bronze.raw_events", 
        "file:/mnt/config/ready.flag"
    ],
    downstream_notifications=[
        "webhook:https://hooks.slack.com/...",
        "job:silver-transformation-job-id"
    ]
)
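
The dependency strings above follow a kind:target shape. A minimal sketch of parsing them is shown below; the Dependency type, the parse_dependency helper, and the set of accepted kinds are assumptions inferred from the example values, not the API of job_orchestrator.py.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    kind: str
    target: str


# Kinds that appear in the example configuration above.
KNOWN_KINDS = {"job", "table", "file", "webhook"}


def parse_dependency(spec):
    """Split 'kind:target' on the first colon only, so URLs keep their own colons."""
    kind, separator, target = spec.partition(":")
    if not separator or kind not in KNOWN_KINDS or not target:
        raise ValueError(f"invalid dependency spec: {spec!r}")
    return Dependency(kind, target)


flag = parse_dependency("file:/mnt/config/ready.flag")
hook = parse_dependency("webhook:https://example.com/hook")
```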

Cluster-Aware Optimizations

Automatic Spark configuration based on cluster characteristics:

Delta Lake optimizations: Auto-compaction, optimized writes
Adaptive query execution: Dynamic partition coalescing, skew join handling
Memory management: Optimal memory allocation and garbage collection
Shuffle optimization: Adaptive shuffle partitions based on data size

For detailed cluster mode documentation, see CLUSTER_MODE.md.

File Tracking

The data loader maintains a Delta table to track file processing status:

CREATE TABLE metadata.file_processing_tracker (
  file_path STRING,
  file_size INT,
  file_modified_time TIMESTAMP,
  table_name STRING,
  status STRING,  -- pending, processing, completed, failed, skipped
  processing_start_time TIMESTAMP,
  processing_end_time TIMESTAMP,
  error_message STRING,
  retry_count INT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

This ensures that:

Successfully processed files are not processed again
Failed files can be retried
Processing history is maintained
Status can be monitored and reported
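
How the status column could drive the decision to process a file is sketched below. The status values come from the table definition above; the should_process helper and its exact rules (for example, skipping files that are currently in the processing state) are assumptions rather than the behavior of file_tracker.py.

```python
def should_process(record, max_retries):
    """Decide whether a file needs processing.

    record is None for a file the tracker has never seen, otherwise a dict
    with at least the 'status' and 'retry_count' fields of the tracker table.
    """
    if record is None:
        return True  # new file
    status = record["status"]
    if status in ("completed", "skipped", "processing"):
        return False  # done, deliberately ignored, or already being handled
    if status == "failed":
        return record["retry_count"] < max_retries
    return status == "pending"


decisions = {
    "new": should_process(None, max_retries=3),
    "done": should_process({"status": "completed", "retry_count": 0}, max_retries=3),
    "retryable": should_process({"status": "failed", "retry_count": 1}, max_retries=3),
    "exhausted": should_process({"status": "failed", "retry_count": 3}, max_retries=3),
}
```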

Monitoring and Logging

Logging

Structured logging with configurable levels
JSON format support for log aggregation
File and console output options
Performance metrics and execution timing
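
For readers unfamiliar with JSON-formatted logs, the sketch below shows the general idea using only the standard logging module. It is not the implementation in utils/logger.py; the names JsonFormatter and build_logger and the chosen fields are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, level="INFO", stream=None):
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger("data_loader.demo", "DEBUG", stream=buffer)
log.info("processed %d files", 3)
entry = json.loads(buffer.getvalue())
```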

Metrics

File processing statistics
Table-level metrics
Success/failure rates
Execution times
Resource usage monitoring

Error Handling

The data loader provides robust error handling:

File-level errors: Individual file failures don't stop the entire process
Retry logic: Configurable retry attempts with exponential backoff
Error tracking: All errors are logged and tracked in the file tracker
Graceful degradation: Processing continues even if some files fail
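
Retry with exponential backoff can be sketched as follows. The delay schedule (doubling from a base delay) and the helper name run_with_retries are assumptions; the README only states that retries are configurable and use exponential backoff.

```python
import time


def run_with_retries(task, retry_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); after each failure wait base_delay * 2**attempt and try again.

    The original exception is re-raised once retry_attempts retries are used up.
    The sleep argument is injectable so the waits can be skipped in tests.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retry_attempts:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1


calls = []
delays = []


def flaky_load():
    """Hypothetical task that fails twice before succeeding."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"


outcome = run_with_retries(flaky_load, retry_attempts=3, base_delay=1.0, sleep=delays.append)
```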

Performance Optimization

Parallel Processing

Configurable number of concurrent workers
Thread-safe file status tracking
Resource usage monitoring
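
A minimal sketch of bounded parallel execution with thread-safe status tracking is shown below, using concurrent.futures from the standard library. It is an illustration only; parallel_executor.py may be structured differently, and the process_files helper is hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_files(paths, handler, max_parallel_jobs=4):
    """Run handler(path) for every path with at most max_parallel_jobs threads.

    A failure in one file is recorded and does not stop the others.
    """
    statuses = {}
    lock = threading.Lock()

    def run(path):
        try:
            handler(path)
            outcome = "completed"
        except Exception:
            outcome = "failed"
        with lock:  # guard the shared dict against concurrent writes
            statuses[path] = outcome

    with ThreadPoolExecutor(max_workers=max_parallel_jobs) as pool:
        futures = [pool.submit(run, path) for path in paths]
        for future in as_completed(futures):
            future.result()  # surfaces unexpected errors raised outside handler
    return statuses


def demo_handler(path):
    """Hypothetical handler that rejects one file to show failure isolation."""
    if "bad" in path:
        raise ValueError(f"cannot parse {path}")


summary = process_files(["a.parquet", "bad.parquet", "c.parquet"], demo_handler, max_parallel_jobs=2)
```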

Databricks Optimizations

Delta Lake optimizations enabled by default
Adaptive query execution
Auto-compaction and optimize write
Partitioning support

Best Practices

Partitioning: Use appropriate partition columns for large tables
File sizes: Aim for file sizes between 100 MB and 1 GB for optimal performance
Batch processing: Process files in batches rather than one-by-one
Resource allocation: Size your cluster appropriately for the workload

Testing

Run the test suite:

# Run all tests
poetry run pytest data_loader/tests/

# Run with coverage
poetry run pytest --cov=data_loader data_loader/tests/

# Run specific test file
poetry run pytest data_loader/tests/test_basic.py

Development

Setting up Development Environment

Clone the repository
Install dependencies with Poetry: poetry install
Activate the environment: poetry shell (optional)
Run tests to verify setup: poetry run pytest

Adding New Loading Strategies

Create a new strategy class inheriting from BaseLoadingStrategy
Implement required methods: load_data(), validate_config()
Add strategy to the factory in processor.py
Add configuration options to table_config.py
Add tests for the new strategy
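
The steps above can be sketched with a stand-in base class. The real BaseLoadingStrategy in base_strategy.py and the factory in processor.py are not shown in this README, so the method signatures, the in-memory target, and every name ending in Sketch below are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class BaseLoadingStrategySketch(ABC):
    """Hypothetical stand-in for BaseLoadingStrategy; real signatures may differ."""

    def __init__(self, table_config):
        self.table_config = table_config

    @abstractmethod
    def validate_config(self):
        """Return True when the table configuration is usable by this strategy."""

    @abstractmethod
    def load_data(self, records, target):
        """Load records into target and return the number of rows in target."""


class OverwriteStrategySketch(BaseLoadingStrategySketch):
    """Replaces the contents of an in-memory list that stands in for a table."""

    def validate_config(self):
        return bool(self.table_config.get("table_name"))

    def load_data(self, records, target):
        target.clear()
        target.extend(records)
        return len(target)


# A simple registry plays the role of the strategy factory.
STRATEGIES = {"overwrite": OverwriteStrategySketch}


def create_strategy(name, table_config):
    try:
        return STRATEGIES[name](table_config)
    except KeyError:
        raise ValueError(f"unknown loading strategy: {name}") from None


table = [{"id": 1}, {"id": 2}]
strategy = create_strategy("overwrite", {"table_name": "customers"})
row_count = strategy.load_data([{"id": 3}], table)
```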

Troubleshooting

Common Issues

Permission errors: Ensure the Databricks cluster has access to all specified paths
Schema conflicts: Enable schema evolution or ensure consistent schemas
Memory issues: Reduce batch sizes or increase cluster memory
Timeout errors: Increase timeout settings or optimize file processing

Debug Mode

Enable debug logging for detailed execution information:

python -m data_loader.main run --config config.yaml --log-level DEBUG

Databricks Job Execution

To run the loader as a Databricks job, use the data_loader.job_runner module. Configure the widgets config, log_level, optimize, and vacuum, or set the corresponding environment variables (DATALOADER_CONFIG_FILE etc.). See docs/databricks_job.md for details.

Future Development

See ROADMAP.md for detailed development plans, upcoming features, and long-term vision for the Databricks Data Loader framework.

Contributing

Fork the repository
Create a feature branch
Make your changes with tests
Run the test suite
Submit a pull request

For major features, please refer to the roadmap to ensure alignment with project direction.

License

This project is licensed under the MIT License - see the LICENSE file for details.
