fix(file-service): retry LakeFS health check on startup#5647
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5647 +/- ##
============================================
- Coverage 53.02% 53.00% -0.02%
Complexity 2542 2542
============================================
Files 1075 1075
Lines 41743 41758 +15
Branches 4513 4516 +3
============================================
Hits 22134 22134
- Misses 18301 18316 +15
Partials 1308 1308
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
✅ No material benchmark regressions detected🟢 13 better · 🔴 0 worse · ⚪ 2 noise (<±5%) · 0 without baseline
Baseline detailsLatest main
Raw CSVconfig_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,456.23,200,128000,438,0.268,20368.52,37736.36,37736.36
1,100,10,64,20,2037.53,2000,1280000,982,0.599,99153.64,131869.63,131869.63
2,1000,10,64,20,17761.75,20000,12800000,1126,0.687,882365.83,987051.46,987051.46 |
There was a problem hiding this comment.
Pull request overview
This PR improves file-service startup robustness by retrying the LakeFS health check instead of failing immediately on transient connection errors, addressing the reported startup race with LakeFS (issue #5646). It also introduces a new GitHub Actions workflow related to PR/issue template compliance, which expands the PR’s scope beyond the stated goal.
Changes:
- Add retry loop + warning logs to
LakeFSStorageClient.healthCheck()to tolerate transient LakeFS unavailability on startup. - Add
.github/workflows/template-compliance-warning.ymlto warn on non-compliant PR/issue templates (and clear the warning once fixed).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala | Adds retry-on-failure logic for LakeFS health checks during startup. |
| .github/workflows/template-compliance-warning.yml | Adds an Actions workflow that posts/removes a sticky template-compliance warning comment. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@Ma77Ball could you please add some tests? Currently the coverage is 0.. |
Yicong-Huang
left a comment
There was a problem hiding this comment.
mainly need to add tests. see inline comment as well
| s"LakeFS not reachable (attempt $attempt/$HealthCheckMaxAttempts): ${e.getMessage}. " + | ||
| s"Retrying in ${HealthCheckRetryDelayMillis}ms..." | ||
| ) | ||
| Thread.sleep(HealthCheckRetryDelayMillis) |
There was a problem hiding this comment.
it might be a good idea to do backoff waiting, like 200ms, 400ms, 800ms instead of 3 seconds constantly
and I feel 10 attempts is a bit too many. usually 3-5 times should be enough. and with backoff 5 attempts will be done in 5~6 seconds. while 3s * 10 attempts is 30s, do we need to wait that long?
What changes were proposed in this PR?
LakeFSStorageClient.healthCheck()now retries the LakeFS health check (10 attempts, 3s apart) before failing, so a transientConnection resetduring concurrent startup no longer crashes the file-service.Any related issues, documentation, discussions?
Closes: #5646
How was this PR tested?
:9092andcurl http://127.0.0.1:9092/api/healthcheckreturns HTTP 200.sbt "FileService/compile"and expect[success].Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF