Skip to content

fix(file-service): retry LakeFS health check on startup#5647

Open
Ma77Ball wants to merge 7 commits into
apache:mainfrom
Ma77Ball:fix/file-service-lakefs-startup-race
Open

fix(file-service): retry LakeFS health check on startup#5647
Ma77Ball wants to merge 7 commits into
apache:mainfrom
Ma77Ball:fix/file-service-lakefs-startup-race

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • LakeFSStorageClient.healthCheck() now retries the LakeFS health check (10 attempts, 3s apart) before failing, so a transient Connection reset during concurrent startup no longer crashes the file-service.
  • Each failed attempt logs a warning; if LakeFS is still unreachable after all attempts, startup fails with a clear aggregated error so a genuine outage is still surfaced.

Any related issues, documentation, discussions?

Closes: #5646

How was this PR tested?

  • Reproduce the race: stop LakeFS, start file-service, then start LakeFS within ~30s; confirm file-service logs retry warnings and then starts successfully instead of exiting.
  • Regression with LakeFS already up: start file-service, confirm it binds :9092 and curl http://127.0.0.1:9092/api/healthcheck returns HTTP 200.
  • Compile: run sbt "FileService/compile" and expect [success].

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions github-actions Bot added fix ci changes related to CI common labels Jun 12, 2026
@codecov-commenter

codecov-commenter commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.00%. Comparing base (7ae9b35) to head (c0a1e1a).

Files with missing lines Patch % Lines
.../amber/core/storage/util/LakeFSStorageClient.scala 0.00% 18 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5647      +/-   ##
============================================
- Coverage     53.02%   53.00%   -0.02%     
  Complexity     2542     2542              
============================================
  Files          1075     1075              
  Lines         41743    41758      +15     
  Branches       4513     4516       +3     
============================================
  Hits          22134    22134              
- Misses        18301    18316      +15     
  Partials       1308     1308              
Flag Coverage Δ *Carryforward flag
access-control-service 71.42% <ø> (ø)
agent-service 34.36% <ø> (ø) Carriedforward from 668e599
amber 53.86% <0.00%> (-0.06%) ⬇️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 56.71% <ø> (ø)
file-service 57.06% <ø> (ø)
frontend 47.38% <ø> (ø) Carriedforward from 668e599
pyamber 90.72% <ø> (ø) Carriedforward from 668e599
python 90.74% <ø> (ø) Carriedforward from 668e599
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

✅ No material benchmark regressions detected

🟢 13 better · 🔴 0 worse · ⚪ 2 noise (<±5%) · 0 without baseline

CI benchmark results are noisy; treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🟢 bs=10 sw=10 sl=64 438 0.268 20,369/37,736/37,736 us 🟢 -17.9% / 🟢 -15.8%
🟢 bs=100 sw=10 sl=64 982 0.599 99,154/131,870/131,870 us 🟢 +34.7% / 🟢 -13.7%
🟢 bs=1000 sw=10 sl=64 1,126 0.687 882,366/987,051/987,051 us 🟢 +19.0% / 🟢 +11.5%
Baseline details

Latest main 7ae9b35 from 2026-06-13T00:44:28.840Z

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 438 tuples/sec 394.18 tuples/sec 400.34 tuples/sec +11.1% +9.4%
bs=10 sw=10 sl=64 MB/s 0.268 MB/s 0.241 MB/s 0.244 MB/s +11.4% +9.7%
bs=10 sw=10 sl=64 p50 20,369 us 24,805 us 24,191 us -17.9% -15.8%
bs=10 sw=10 sl=64 p95 37,736 us 38,345 us 36,018 us -1.6% +4.8%
bs=10 sw=10 sl=64 p99 37,736 us 38,345 us 36,018 us -1.6% +4.8%
bs=100 sw=10 sl=64 throughput 982 tuples/sec 729.17 tuples/sec 867.19 tuples/sec +34.7% +13.2%
bs=100 sw=10 sl=64 MB/s 0.599 MB/s 0.445 MB/s 0.529 MB/s +34.6% +13.2%
bs=100 sw=10 sl=64 p50 99,154 us 133,219 us 114,874 us -25.6% -13.7%
bs=100 sw=10 sl=64 p95 131,870 us 173,178 us 142,264 us -23.9% -7.3%
bs=100 sw=10 sl=64 p99 131,870 us 173,178 us 142,264 us -23.9% -7.3%
bs=1000 sw=10 sl=64 throughput 1,126 tuples/sec 946.3 tuples/sec 1,010 tuples/sec +19.0% +11.5%
bs=1000 sw=10 sl=64 MB/s 0.687 MB/s 0.578 MB/s 0.616 MB/s +18.9% +11.5%
bs=1000 sw=10 sl=64 p50 882,366 us 1,059,718 us 995,071 us -16.7% -11.3%
bs=1000 sw=10 sl=64 p95 987,051 us 1,104,221 us 1,046,228 us -10.6% -5.7%
bs=1000 sw=10 sl=64 p99 987,051 us 1,104,221 us 1,046,228 us -10.6% -5.7%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,456.23,200,128000,438,0.268,20368.52,37736.36,37736.36
1,100,10,64,20,2037.53,2000,1280000,982,0.599,99153.64,131869.63,131869.63
2,1000,10,64,20,17761.75,20000,12800000,1126,0.687,882365.83,987051.46,987051.46

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves file-service startup robustness by retrying the LakeFS health check instead of failing immediately on transient connection errors, addressing the reported startup race with LakeFS (issue #5646). It also introduces a new GitHub Actions workflow related to PR/issue template compliance, which expands the PR’s scope beyond the stated goal.

Changes:

  • Add retry loop + warning logs to LakeFSStorageClient.healthCheck() to tolerate transient LakeFS unavailability on startup.
  • Add .github/workflows/template-compliance-warning.yml to warn on non-compliant PR/issue templates (and clear the warning once fixed).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala Adds retry-on-failure logic for LakeFS health checks during startup.
.github/workflows/template-compliance-warning.yml Adds an Actions workflow that posts/removes a sticky template-compliance warning comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/template-compliance-warning.yml Outdated
Comment thread .github/workflows/template-compliance-warning.yml Outdated
Comment thread .github/workflows/template-compliance-warning.yml Outdated
@github-actions github-actions Bot removed the ci changes related to CI label Jun 13, 2026
@Ma77Ball Ma77Ball requested a review from Yicong-Huang June 13, 2026 03:07
@Yicong-Huang

Copy link
Copy Markdown
Contributor

@Ma77Ball could you please add some tests? Currently the coverage is 0..

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly need to add tests. see inline comment as well

s"LakeFS not reachable (attempt $attempt/$HealthCheckMaxAttempts): ${e.getMessage}. " +
s"Retrying in ${HealthCheckRetryDelayMillis}ms..."
)
Thread.sleep(HealthCheckRetryDelayMillis)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be a good idea to do backoff waiting, like 200ms, 400ms, 800ms instead of 3 seconds constantly

and I feel 10 attempts is a bit too many. usually 3-5 times should be enough. and with backoff 5 attempts will be done in 5~6 seconds. while 3s * 10 attempts is 30s, do we need to wait that long?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

file-service crashes on startup when LakeFS health check races LakeFS boot

4 participants