MySQL 異なるサーバー間でDBの差分をとり、同期する

背景
環境
sync-diff-inspectorのインストール

背景

CentOS7 をAlmaLinux9 に移管する際、DBも移植する必要があった。
しかし、dumpファイルが100GBもあり、エクスポート→リストアまでに10時間近くかかってしまう。
これだと、メンテの時間がかなりかかってしまうため、現時点でのdumpファイルを移植しておいて、リリース時には差分データだけを反映できないかと考えた。

検討したツール
MySQLWorkbench：構造の差分は取れるが、データの差分までは取れない。
mysqldbcompare：ツール自体が古く、公式のドキュメントが見れない。
sync-diff-inspector：今回はこれを使ってみる

【注意】
結局、DBの容量が大きすぎるという問題で、このツールもだめでした。

環境

サーバー１：CentOS7（旧サーバー）
MySQL：5.6.40

サーバー２：AlmaLinux9（新サーバー）
MySQL：5.7.44

sync-diff-inspectorのインストール

Dockerイメージのダウンロード

% docker pull pingcap/tidb-tools:latest

イメージの確認

% docker images

REPOSITORY              TAG       IMAGE ID       CREATED         SIZE
pingcap/tidb-tools      latest    9f268d39711a   12 months ago   394MB

コンテナの作成と起動

config.toml と output をマウントする

% docker run -it -v /Users/xxx/programs/sync-diff-inspector/config.toml:/config.toml -v /Users/xxx/programs/sync-diff-inspector/output:/output pingcap/tidb-tools

[root@0593355371dd /]#

config.toml は公式Documentを参考に調整

# Diff Configuration.

######################### Global config #########################
# The number of goroutines created to check data. The number of connections between sync-diff-inspector and upstream/downstream databases is slightly greater than this value.
check-thread-count = 4

# If enabled, SQL statements is exported to fix inconsistent tables.
export-fix-sql = true

# Only compares the table structure instead of the data.
check-struct-only = false

# If enabled, sync-diff-inspector skips checking tables that do not exist in the upstream or downstream.
skip-non-existing-table = false

######################### Datasource config #########################
[data-sources]
[data-sources.mysql1] # mysql1 is the only custom ID for the database instance. It is used for the following `task.source-instances/task.target-instance` configuration.
    host = "127.0.0.1"
    port = 3306
    user = "root"
    password = ""  # The password for connecting to the upstream database. It can be plain text or Base64-encoded.

    # (optional) Use mapping rules to match multiple upstream sharded tables. Rule1 and rule2 are configured in the following Routes section.
    route-rules = ["rule1", "rule2"]

[data-sources.tidb0]
    host = "127.0.0.1"
    port = 4000
    user = "root"
    password = ""  # The password for connecting to the downstream database. It can be plain text or Base64-encoded.

    # (optional) Use TLS to connect TiDB.
    # security.ca-path = ".../ca.crt"
    # security.cert-path = ".../cert.crt"
    # security.key-path = ".../key.crt"

    # (optional) Use the snapshot feature. If enabled, historical data is used for comparison.
    # snapshot = "386902609362944000"
    # When "snapshot" is set to "auto", the last syncpoints generated by TiCDC in the upstream and downstream are used for comparison. For details, see <https://github.com/pingcap/tidb-tools/issues/663>.
    # snapshot = "auto"

########################### Routes ##############################
# To compare the data of a large number of tables with different schema names or table names, or check the data of multiple upstream sharded tables and downstream table family, use the table-rule to configure the mapping relationship. You can configure the mapping rule only for the schema or table. Also, you can configure the mapping rules for both the schema and the table.
[routes]
[routes.rule1] # rule1 is the only custom ID for the configuration. It is used for the above `data-sources.route-rules` configuration.
schema-pattern = "test_*"      # Matches the schema name of the data source. Supports the wildcards "*" and "?"
table-pattern = "t_*"          # Matches the table name of the data source. Supports the wildcards "*" and "?"
target-schema = "test"         # The name of the schema in the target database
target-table = "t"             # The name of the target table
[routes.rule2]
schema-pattern = "test2_*"      # Matches the schema name of the data source. Supports the wildcards "*" and "?"
table-pattern = "t2_*"          # Matches the table name of the data source. Supports the wildcards "*" and "?"
target-schema = "test2"         # The name of the schema in the target database
target-table = "t2"             # The name of the target table

######################### task config #########################
# Configures the tables of the target database that need to be compared.
[task]
    # output-dir saves the following information:
    # 1 sql: The SQL file to fix tables that is generated after error is detected. One chunk corresponds to one SQL file.
    # 2 log: sync-diff.log
    # 3 summary: summary.txt
    # 4 checkpoint: a dir
    output-dir = "./output"
    # The upstream database. The value is the unique ID declared by data-sources.
    source-instances = ["mysql1"]
    # The downstream database. The value is the unique ID declared by data-sources.
    target-instance = "tidb0"
    # The tables of downstream databases to be compared. Each table needs to contain the schema name and the table name, separated by '.'
    # Use "?" to match any character and "*" to match characters of any length.
    # For detailed match rules, refer to golang regexp pkg: https://github.com/google/re2/wiki/Syntax.
    target-check-tables = ["schema*.table*", "!c.*", "test2.t2"]
    # (optional) Extra configurations for some tables, Config1 is defined in the following table config example.
    target-configs = ["config1"]

######################### Table config #########################
# Special configurations for specific tables. The tables to be configured must be in `task.target-check-tables`.
[table-configs.config1] # config1  is the only custom ID for this configuration. It is used for the above `task.target-configs` configuration.
# The name of the target table, you can use regular expressions to match multiple tables, but one table is not allowed to be matched by multiple special configurations at the same time.
target-tables = ["schema*.test*", "test2.t2"]
# (optional) Specifies the range of the data to be checked
# It needs to comply with the syntax of the WHERE clause in SQL.
range = "age > 10 AND age < 20"
# (optional) Specifies the column used to divide data into chunks. If you do not configure it,
# sync-diff-inspector chooses an appropriate column (primary key, unique key, or a field with index).
index-fields = ["col1","col2"]
# (optional) Ignores checking some columns such as some types (json, bit, blob, etc.)
# that sync-diff-inspector does not currently support.
# The floating-point data type behaves differently in TiDB and MySQL. You can use
# `ignore-columns` to skip checking these columns.
ignore-columns = ["",""]
# (optional) Specifies the size of the chunk for dividing the table. If not specified, this configuration can be deleted or be set as 0.
chunk-size = 0
# (optional) Specifies the "collation" for the table. If not specified, this configuration can be deleted or be set as an empty string.
collation = ""

sync-diff-inspector の実行

[root@0593355371dd /]# ./sync_diff_inspector --config=./config.toml

結果

A total of 1 tables need to be compared

Comparing the table structure of ``database_name`.`table_name`` ... equivalent
Comparing the table data of ``database_name`.`table_name`` ...
_____________________________________________________________________________
Progress [>------------------------------------------------------------] 0% 0/1
[mysql] 2024/04/18 03:49:46 packets.go:73: unexpected EOF
Comparing the table data of ``database_name`.`table_name`` ... failure
_____________________________________________________________________________
Progress [============================================================>] 100% 0/0
Error in comparison process:
invalid connection error occured in `database_name`.`table_name`
You can view the comparision details through './output/sync_diff.log'

どうやらDBのデータが大きすぎて、エラーを吐いてしまった。
このツールもだめ。。
結局 dump してリストアすることに。