Skip to content

Failed lookback interval runs create data holes #5843

@mday-io

Description

@mday-io

Core Problem

When an INCREMENTAL_BY_TIME_RANGE model with lookback configured fails during a lookback interval run, it creates a data hole (empty interval) rather than preserving the existing stale data. This occurs because the default materialization strategy for most engines is DELETE + INSERT, which is not atomic. You can find the materialization strategy documentation in docs/concepts/models/model_kinds.md.

Materialization Strategy

The DELETE + INSERT strategy works in two steps:

  1. DELETE existing data for the time interval
  2. INSERT new data for that interval

If step 1 succeeds but step 2 fails, the interval is left empty. This is fundamentally different from atomic operations like INSERT OVERWRITE (used by Spark/Databricks) which either fully succeed or fully fail. The strategy table is documented in docs/concepts/models/model_kinds.md.

Lookback Mechanism

The lookback parameter is designed to handle late-arriving data by reprocessing recent intervals on each run. The lookback parameter is documented in docs/concepts/models/overview.md.

For example, with lookback=2 and daily intervals:

  • On January 5th, SQLMesh processes January 5th, 4th, and 3rd
  • On January 6th, SQLMesh processes January 6th, 5th, and 4th

The lookback_start method calculates the effective start date by moving back lookback intervals. This method is implemented in sqlmesh/core/model/meta.py.

Interval Computation Logic

The compute_missing_intervals function determines which intervals need processing. This function is implemented in sqlmesh/core/snapshot/definition.py.

Key logic:

  1. Expands the time range into individual interval boundaries
  2. Compares against already-processed intervals
  3. If lookback > 0, adds additional intervals to the missing set based on parent interval availability

The critical section for lookback handling checks if parent intervals are missing and adds current intervals to the missing set accordingly. However, this doesn't create a mechanism for retrying failed lookback intervals specifically.

Scheduler Behavior

The scheduler's merged_missing_intervals function computes missing intervals across all snapshots. This function is implemented in sqlmesh/core/scheduler.py.

This function:

  1. Takes a collection of snapshots
  2. Calls compute_interval_params for each
  3. Returns a mapping of snapshots to their missing intervals

Crucially, it treats all missing intervals equally - there's no distinction between:

  • Current intervals that failed
  • Lookback intervals that failed
  • Intervals that were never processed

The Gap

The gap exists because:

  1. No Atomicity Guarantee: DELETE + INSERT is not atomic, so failures after DELETE leave data holes. This is documented in docs/concepts/models/model_kinds.md.

  2. No Special Retry Logic: The scheduler doesn't distinguish between failed lookback intervals and other missing intervals. The scheduler logic is in sqlmesh/core/scheduler.py.

  3. Assumption of Success: The lookback mechanism assumes intervals will either succeed or be naturally retried on the next scheduled run. However, if the failure is transient (e.g., network issue), the next scheduled run may have moved past the lookback window.

  4. Interval Tracking: While failed intervals should theoretically be included in missing intervals (since they're never added to the processed intervals list), there's no explicit mechanism to prioritize retrying them. The interval tracking logic is in sqlmesh/core/snapshot/definition.py.

Test Case Evidence

The test cases in tests/core/test_snapshot.py show the expected behavior of lookback. Specifically, the test_missing_intervals_past_end_date_with_lookback function demonstrates that lookback intervals ARE included in missing intervals when they're removed.

However, this doesn't address the case where a DELETE succeeded but INSERT failed - the interval would be missing from the processed list, but the data would also be gone from the target table.

Gap Detection Logic

SQLMesh does have gap detection logic in _ensure_no_gaps, but this is only used during plan application for version changes, not for runtime retry of failed intervals. This function is implemented in sqlmesh/core/state_sync/db/facade.py.

This prevents gaps during version changes but doesn't help with runtime failures.

Summary

The issue is a gap because:

  • The materialization strategy is non-atomic for most engines (documented in docs/concepts/models/model_kinds.md)
  • Lookback intervals that fail mid-execution leave data holes
  • There's no special retry mechanism for failed lookback intervals (scheduler logic in sqlmesh/core/scheduler.py)
  • The system assumes lookback intervals will succeed or be naturally retried, which may not happen for transient failures
  • Gap detection only applies to plan application, not runtime failures (logic in sqlmesh/core/state_sync/db/facade.py)

Notes

The end_bounded parameter in missing_intervals can be used to limit intervals to the target end date, but this doesn't address the retry issue for failed lookback intervals. This parameter is documented in the missing_intervals method in sqlmesh/core/snapshot/definition.py.

Relevant Files

  • docs/concepts/models/model_kinds.md - Materialization strategy documentation
  • docs/concepts/models/overview.md - Lookback parameter documentation
  • sqlmesh/core/model/meta.py - lookback_start method implementation
  • sqlmesh/core/snapshot/definition.py - compute_missing_intervals and missing_intervals implementation
  • sqlmesh/core/scheduler.py - merged_missing_intervals implementation
  • tests/core/test_snapshot.py - Test cases for lookback behavior
  • sqlmesh/core/state_sync/db/facade.py - Gap detection logic in _ensure_no_gaps

Wiki pages you might want to explore:

Metadata

Metadata

Assignees

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions