Add --chunk-concurrent-size for parallel row-copy #1688

Open
dnovitski wants to merge 2 commits into github:master from dnovitski:pr-1398-rebased

dnovitski (Contributor) commented May 22, 2026

Summary

Port of #1398 by @shaohk to current master, with correctness improvements.

Adds a --chunk-concurrent-size flag that lets multiple row-copy chunks execute in parallel within each iteration, using errgroup. The default is 1 (no behavior change).

Motivation

On large tables with fast storage (NVMe/SSD), the single-threaded row-copy loop can become a bottleneck. This flag enables parallel chunk copying to improve migration throughput.

Performance Results

1M rows, ADD COLUMN extra_col INT DEFAULT 0, Docker MySQL 8.0, chunk-size=1000:

Concurrency    Wall-clock    Speedup
1 (default)    30.6s         baseline (1.00x)
4              23.9s         1.28x (22% less wall-clock time)
8              20.9s         1.46x (32% less wall-clock time)

Benefits scale with table size and storage throughput.

Key Design Decisions

  • Two execution paths: with concurrency=1 the retry semantics match master exactly (the range is calculated inside the retry loop, so hooks can reduce the chunk size); with concurrency>1 ranges are pre-calculated under a mutex so they can be executed safely in parallel
  • Thread-safe range calculation: CalculateNextIterationRangeEndValues(advanceCursor bool) is protected by a mutex and returns an *IterationRangeValues struct, so each goroutine gets its own Min/Max
  • No shared mutable state in the hot path: SQL warnings are returned as a value (eliminating the data race on the shared MigrationLastInsertSQLWarnings field)
  • errgroup with the real migration context: cancellation propagates properly when any chunk fails
  • DB pool auto-sizing: the connection pool is enlarged when --chunk-concurrent-size exceeds the default pool size
  • Backward compatible: Default concurrency=1 preserves existing single-threaded behavior exactly

Changes from original #1398

  • Adapted to current master API (builder pattern, receiver names, retryBatchCopyWithHooks)
  • Fixed thread-safety: ApplyIterationInsertQuery returns SQL warnings instead of writing to shared field
  • Correct retry behavior: single-threaded path recalculates range on retry (matches master); concurrent path retries same range (INSERT IGNORE is idempotent)
  • Proper IncludeMinValues handling for the first iteration
  • Uses real migration context (not context.Background())

Testing

  • All CI checks pass (build, lint, CodeQL, migration tests on MySQL 5.7/8.0/8.4/Percona)
  • Performance benchmarked (22-32% less wall-clock time at concurrency 4-8)
  • Data integrity verified (1M rows, checksum match)
  • TestRetryBatchCopyWithHooks passes (hook-based chunk size reduction works correctly)

Checklist

  • Tests pass
  • Documentation updated (doc/command-line-flags.md)
  • Backward compatible (default=1)

Based on work by @shaohk in #1398.

shaohk (Contributor) commented Jun 7, 2026

There's a problem with concurrent chunk copying: during the chunk operation, executing INSERT ... SELECT holds the auto-increment lock on the target table, which puts a ceiling on the concurrency gains you can achieve.

dnovitski (Contributor, Author) commented Jun 8, 2026

Thanks @shaohk, you're right to flag this. Since most tables do have an AUTO_INCREMENT column, it's worth being precise about when it bites.

The AUTO-INC ceiling depends on innodb_autoinc_lock_mode:

  • Mode 0/1: INSERT ... SELECT is classified as a "bulk insert", so InnoDB holds the table-level AUTO-INC lock for the whole statement. That is decided by statement type, not by whether we supply values: we do copy the source PK values, but that doesn't exempt us. Concurrent chunks serialize, so the ceiling is real here. Mode 1 was the default through MySQL 5.7.
  • Mode 2 (interleaved): no table-level AUTO-INC lock, just a lightweight mutex per allocation. This is the default on MySQL 8.0+, and it's safe for gh-ost specifically because gh-ost mandates binlog_format=ROW (statement-based replication is the only reason mode 2 is otherwise flagged "unsafe"). So on a typical 8.0 deployment, the AUTO-INC ceiling you describe is largely gone by default.

So I'd frame it as: the hard serialization you're describing applies under mode 0/1; under mode 2 it becomes soft contention.

That said, even under mode 2 parallel copy doesn't scale linearly, and the dominant cap is actually in gh-ost's own loop rather than in MySQL. The next-chunk boundary calculation (CalculateNextIterationRangeEndValues) runs under a global mutex and includes a round-trip plus an indexed scan of the source, because each chunk's start depends on the previous chunk's max cursor. So boundary computation is fully serialized and only the INSERT...SELECT runs in parallel: classic Amdahl's law. On top of that there's secondary-index/redo contention on the ghost table, plus gh-ost's replication-lag/load throttling, which usually paces the migration well before insert concurrency does.
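Reading the benchmark table in the PR description through Amdahl's law gives a rough sense of how much of the run behaved as serial. This is an interpretation, not a measurement from the PR: it treats the concurrency-1 run as the baseline and lumps every non-scaling cost into a single serial fraction s, using S(n) = 1 / (s + (1 - s)/n) and its inverse s = (n/S - 1)/(n - 1).

```go
package main

import "fmt"

// amdahlSpeedup is the ideal speedup on n workers when a fraction s of the
// work stays serial: S(n) = 1 / (s + (1-s)/n).
func amdahlSpeedup(s float64, n int) float64 {
	return 1 / (s + (1-s)/float64(n))
}

// impliedSerialFraction inverts Amdahl's law for an observed speedup S on n
// workers: s = (n/S - 1) / (n - 1).
func impliedSerialFraction(speedup float64, n int) float64 {
	fn := float64(n)
	return (fn/speedup - 1) / (fn - 1)
}

func main() {
	const baseline = 30.6 // seconds at concurrency 1, from the PR's table
	for _, run := range []struct {
		workers int
		seconds float64
	}{{4, 23.9}, {8, 20.9}} {
		s := baseline / run.seconds
		fmt.Printf("n=%d speedup=%.2fx implied serial fraction=%.2f\n",
			run.workers, s, impliedSerialFraction(s, run.workers))
	}
}
```

With the table's numbers, the implied serial fraction comes out near 0.71 at n=4 and 0.64 at n=8. If those timings are whole-run wall-clock, fixed costs outside the copy loop are counted in s too, so the figure overstates the serial share of the copy loop itself.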

I think the right things to do here are (1) document that --chunk-concurrent-size > 1 wants innodb_autoinc_lock_mode = 2, and ideally detect mode 0/1 on an auto-increment table at startup and warn that the speedup won't materialize; and (2) fix the gh-ost loop so that it isn't capped internally.
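The startup warning suggested in (1) could be sketched as follows. autoIncLockModeWarning and readAutoIncLockMode are hypothetical helpers, not existing gh-ost functions, and determining whether the migrated table has an AUTO_INCREMENT column is assumed to happen elsewhere.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// autoIncLockModeWarning returns a warning when --chunk-concurrent-size > 1
// is unlikely to pay off: the table has an AUTO_INCREMENT column and the
// server runs innodb_autoinc_lock_mode 0 (traditional) or 1 (consecutive),
// where INSERT ... SELECT holds the table-level AUTO-INC lock.
func autoIncLockModeWarning(concurrency int, hasAutoIncrement bool, lockMode int) string {
	if concurrency <= 1 || !hasAutoIncrement || lockMode == 2 {
		return ""
	}
	return fmt.Sprintf("--chunk-concurrent-size=%d with innodb_autoinc_lock_mode=%d: "+
		"concurrent INSERT ... SELECT chunks will serialize on the AUTO-INC lock; "+
		"expect little speedup unless the mode is 2", concurrency, lockMode)
}

// readAutoIncLockMode reads the global server variable; it is read-only at
// runtime, so a plain SELECT is enough.
func readAutoIncLockMode(ctx context.Context, db *sql.DB) (int, error) {
	var mode int
	err := db.QueryRowContext(ctx, "SELECT @@global.innodb_autoinc_lock_mode").Scan(&mode)
	return mode, err
}

func main() {
	fmt.Println(autoIncLockModeWarning(8, true, 1))
}
```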

Does that match what you were seeing?

olegkv commented Jun 12, 2026

MySQL 8 INSERTs scale much better than 20-30% when the number of parallel workers increases from 1 to 4-8.
In my experiments, parallel inserts into the same table alone gave a speedup of several times.
INSERTs are expected to be several times slower than range SELECTs, so selecting ranges should not be the bottleneck with 4 workers.
Auto-increment works well for parallel inserts by default; even with mode 1, parallel inserts still increase throughput several times.
I suspect something here is not fully parallel.

dnovitski and others added 2 commits June 14, 2026 00:54
Port of PR github#1398 by @shaohk: allows multiple row-copy chunks to execute
in parallel within each iteration using errgroup.

Key changes:
- Add IterationRangeValues struct for thread-safe range passing
- Serialize range calculation with CalculateNextIterationRangeEndValuesLock
- Rewrite iterateChunks to spawn N goroutines per queue item via errgroup
- Return SQL warnings from ApplyIterationInsertQuery (eliminates race on
  shared MigrationLastInsertSQLWarnings field)
- Increase DB connection pool when concurrency > default pool size
- Add --chunk-concurrent-size CLI flag (default 1, no behavior change)

Co-authored-by: shaohk <shaohk@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…r, cut per-chunk round-trips

The --chunk-concurrent-size parallel row-copy only ran the INSERTs in
parallel; the boundary calculation and the per-chunk transaction overhead
serialized work and capped the achievable speedup well below the hardware's
parallel-insert ceiling. This addresses three of those caps.

Prefetch range producer (overlap serialized boundary calc with INSERTs):
- A single dedicated producer goroutine is the sole caller of
  CalculateNextIterationRangeEndValues and streams pre-computed ranges into a
  buffered channel, so boundary scans now overlap the parallel INSERTs of
  earlier work instead of stalling between batches.
- Split iterateChunks into iterateChunksSingle (unchanged single-threaded
  semantics) and iterateChunksConcurrent.
- Size the applier pool for concurrentSize + producer + headroom.

#1 Per-chunk round-trips (applier.go):
- ApplyIterationInsertQuery sent BEGIN / SET SESSION / INSERT / COMMIT as four
  round-trips per chunk. It now sends "SET SESSION ...; INSERT ..." as a single
  autocommit, multi-statement round-trip on one pinned connection. The applier
  pool already enables multiStatements + interpolateParams + autocommit;
  RowsAffected() reports the INSERT (last statement), and the optional
  SHOW WARNINGS runs on the same pinned connection. 4 round-trips -> 1.
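A sketch of the single round-trip, using database/sql with a pinned connection (DB.Conn). It assumes a go-sql-driver/mysql pool configured as described above (multiStatements, interpolateParams, autocommit); buildCombinedInsert and applyChunk are hypothetical helpers, and the session settings and table names in main are placeholders rather than the values gh-ost actually sets.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// buildCombinedInsert joins session setup and the chunk INSERT into one
// multi-statement string, sent to the server in a single round-trip.
func buildCombinedInsert(sessionSettings []string, insert string) string {
	return "SET SESSION " + strings.Join(sessionSettings, ", ") + "; " + insert
}

// applyChunk pins one pool connection so the combined statement and the
// follow-up SHOW WARNINGS (which is per-session) run in the same session.
// interpolateParams is needed because a multi-statement string cannot be
// server-side prepared; autocommit removes the need for BEGIN/COMMIT.
func applyChunk(ctx context.Context, db *sql.DB, query string, args ...any) (int64, []string, error) {
	conn, err := db.Conn(ctx)
	if err != nil {
		return 0, nil, err
	}
	defer conn.Close() // returns the connection to the pool

	res, err := conn.ExecContext(ctx, query, args...)
	if err != nil {
		return 0, nil, err
	}
	affected, err := res.RowsAffected() // per the PR, this reflects the INSERT
	if err != nil {
		return 0, nil, err
	}

	rows, err := conn.QueryContext(ctx, "SHOW WARNINGS")
	if err != nil {
		return affected, nil, err
	}
	defer rows.Close()
	var warnings []string // returned as a value, not written to shared state
	for rows.Next() {
		var level, message string
		var code int
		if err := rows.Scan(&level, &code, &message); err != nil {
			return affected, nil, err
		}
		warnings = append(warnings, fmt.Sprintf("%s %d: %s", level, code, message))
	}
	return affected, warnings, rows.Err()
}

func main() {
	q := buildCombinedInsert(
		[]string{"time_zone = '+00:00'", "sql_mode = 'STRICT_ALL_TABLES'"},
		"INSERT IGNORE INTO _t_gho (id, a) SELECT id, a FROM t WHERE id BETWEEN ? AND ?",
	)
	fmt.Println(q)
}
```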

#2 Persistent worker pool (migrator.go):
- Replace the per-batch errgroup+g.Wait barrier (which stalled N workers on
  the slowest chunk every N chunks) with continuous dispatch to an errgroup
  bounded by SetLimit(concurrentSize) for a 200ms time quantum. Workers stay
  saturated; the only barrier is at the quantum boundary. The time bound keeps
  executeWriteFuncs returning to apply binlog events and re-check throttling,
  preserving row-copy/event mutual exclusion.

Checkpoints record the last contiguous completed range (not the producer's
prefetched cursor), so resume restarts from fully-copied data.
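Recording only the last contiguous completed range matters because chunks finish out of order. A minimal sketch of that bookkeeping, assuming the producer numbers ranges 0, 1, 2, ... in cursor order; contiguousCheckpoint and markDone are hypothetical names, not the PR's checkpoint code.

```go
package main

import (
	"fmt"
	"sync"
)

// contiguousCheckpoint tracks chunks that finish out of order and reports
// how many leading chunks are complete with no gaps; only those may be
// recorded in a checkpoint.
type contiguousCheckpoint struct {
	mu        sync.Mutex
	next      int          // lowest chunk number not yet known complete
	completed map[int]bool // finished chunks beyond the contiguous prefix
}

func newContiguousCheckpoint() *contiguousCheckpoint {
	return &contiguousCheckpoint{completed: make(map[int]bool)}
}

// markDone records chunk seq as copied and returns k such that chunks
// 0..k-1 are all complete; the checkpoint would store chunk k-1's Max.
func (c *contiguousCheckpoint) markDone(seq int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.completed[seq] = true
	for c.completed[c.next] {
		delete(c.completed, c.next)
		c.next++
	}
	return c.next
}

func main() {
	cp := newContiguousCheckpoint()
	for _, seq := range []int{2, 0, 1, 3} { // completion order from workers
		fmt.Printf("chunk %d done, contiguous prefix = %d\n", seq, cp.markDone(seq))
	}
}
```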

Benchmarked on MySQL 8.0.46 (innodb_autoinc_lock_mode=2), 2.1M rows: copy time
vs the prior parallel impl improved up to 32% (chunk=200, conc=4: 22s->15s;
chunk=1000, conc=8: 8s->6s). Data integrity verified by row count + checksum.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>