Failed lookback interval runs create data holes

## Core Problem

When an `INCREMENTAL_BY_TIME_RANGE` model with `lookback` configured fails during a lookback interval run, it creates a data hole (empty interval) rather than preserving the existing stale data. This occurs because the default materialization strategy for most engines is DELETE + INSERT, which is not atomic. You can find the materialization strategy documentation in `docs/concepts/models/model_kinds.md`.

## Materialization Strategy

The DELETE + INSERT strategy works in two steps:
1. DELETE existing data for the time interval
2. INSERT new data for that interval

If step 1 succeeds but step 2 fails, the interval is left empty. This is fundamentally different from atomic operations like INSERT OVERWRITE (used by Spark/Databricks) which either fully succeed or fully fail. The strategy table is documented in `docs/concepts/models/model_kinds.md`.

## Lookback Mechanism

The `lookback` parameter is designed to handle late-arriving data by reprocessing recent intervals on each run. The lookback parameter is documented in `docs/concepts/models/overview.md`.

For example, with `lookback=2` and daily intervals:
- On January 5th, SQLMesh processes January 5th, 4th, and 3rd
- On January 6th, SQLMesh processes January 6th, 5th, and 4th

The `lookback_start` method calculates the effective start date by moving back `lookback` intervals. This method is implemented in `sqlmesh/core/model/meta.py`.

## Interval Computation Logic

The `compute_missing_intervals` function determines which intervals need processing. This function is implemented in `sqlmesh/core/snapshot/definition.py`.

Key logic:
1. Expands the time range into individual interval boundaries
2. Compares against already-processed intervals
3. If `lookback > 0`, adds additional intervals to the missing set based on parent interval availability

The critical section for lookback handling checks if parent intervals are missing and adds current intervals to the missing set accordingly. However, this doesn't create a mechanism for retrying failed lookback intervals specifically.

## Scheduler Behavior

The scheduler's `merged_missing_intervals` function computes missing intervals across all snapshots. This function is implemented in `sqlmesh/core/scheduler.py`.

This function:
1. Takes a collection of snapshots
2. Calls `compute_interval_params` for each
3. Returns a mapping of snapshots to their missing intervals

Crucially, it treats all missing intervals equally - there's no distinction between:
- Current intervals that failed
- Lookback intervals that failed
- Intervals that were never processed

## The Gap

The gap exists because:

1. **No Atomicity Guarantee**: DELETE + INSERT is not atomic, so failures after DELETE leave data holes. This is documented in `docs/concepts/models/model_kinds.md`.

2. **No Special Retry Logic**: The scheduler doesn't distinguish between failed lookback intervals and other missing intervals. The scheduler logic is in `sqlmesh/core/scheduler.py`.

3. **Assumption of Success**: The lookback mechanism assumes intervals will either succeed or be naturally retried on the next scheduled run. However, if the failure is transient (e.g., network issue), the next scheduled run may have moved past the lookback window.

4. **Interval Tracking**: While failed intervals should theoretically be included in missing intervals (since they're never added to the processed intervals list), there's no explicit mechanism to prioritize retrying them. The interval tracking logic is in `sqlmesh/core/snapshot/definition.py`.

## Test Case Evidence

The test cases in `tests/core/test_snapshot.py` show the expected behavior of lookback. Specifically, the `test_missing_intervals_past_end_date_with_lookback` function demonstrates that lookback intervals ARE included in missing intervals when they're removed.

However, this doesn't address the case where a DELETE succeeded but INSERT failed - the interval would be missing from the processed list, but the data would also be gone from the target table.

## Gap Detection Logic

SQLMesh does have gap detection logic in `_ensure_no_gaps`, but this is only used during plan application for version changes, not for runtime retry of failed intervals. This function is implemented in `sqlmesh/core/state_sync/db/facade.py`.

This prevents gaps during version changes but doesn't help with runtime failures.

## Summary

The issue is a gap because:
- The materialization strategy is non-atomic for most engines (documented in `docs/concepts/models/model_kinds.md`)
- Lookback intervals that fail mid-execution leave data holes
- There's no special retry mechanism for failed lookback intervals (scheduler logic in `sqlmesh/core/scheduler.py`)
- The system assumes lookback intervals will succeed or be naturally retried, which may not happen for transient failures
- Gap detection only applies to plan application, not runtime failures (logic in `sqlmesh/core/state_sync/db/facade.py`)

## Notes

The `end_bounded` parameter in `missing_intervals` can be used to limit intervals to the target end date, but this doesn't address the retry issue for failed lookback intervals. This parameter is documented in the `missing_intervals` method in `sqlmesh/core/snapshot/definition.py`.

## Relevant Files

- `docs/concepts/models/model_kinds.md` - Materialization strategy documentation
- `docs/concepts/models/overview.md` - Lookback parameter documentation
- `sqlmesh/core/model/meta.py` - lookback_start method implementation
- `sqlmesh/core/snapshot/definition.py` - compute_missing_intervals and missing_intervals implementation
- `sqlmesh/core/scheduler.py` - merged_missing_intervals implementation
- `tests/core/test_snapshot.py` - Test cases for lookback behavior
- `sqlmesh/core/state_sync/db/facade.py` - Gap detection logic in _ensure_no_gaps

Wiki pages you might want to explore:
- [Incremental Models (SQLMesh/sqlmesh)](/wiki/SQLMesh/sqlmesh#3.3)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed lookback interval runs create data holes #5843

Core Problem

Materialization Strategy

Lookback Mechanism

Interval Computation Logic

Scheduler Behavior

The Gap

Test Case Evidence

Gap Detection Logic

Summary

Notes

Relevant Files

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Failed lookback interval runs create data holes #5843

Description

Core Problem

Materialization Strategy

Lookback Mechanism

Interval Computation Logic

Scheduler Behavior

The Gap

Test Case Evidence

Gap Detection Logic

Summary

Notes

Relevant Files

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions