MySQL并行复制的二个坑,MySQL的并行复制八线程复

2019-12-29 作者:数据库资讯   |   浏览(193)

早上巡检数据库,发现一个延迟从库的sql_thread中断了。

MySQL的并行复制多线程复制MTS(Multi-Threaded Slaves)

Last_SQL_Errno: 1755
Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Gtid, relay-log name ./oracle-relay-bin.000093, position 152912092 which prevents execution of this event group in parallel mode. Reason: The master event is logically timestamped incorrectly..

检查performance_schema下的replication_applier_status_by_worker表,除了GTID之外也没有更具体的信息:

 

"root@localhost:mysql3308.sock  [(none)]>select * from performance_schema.replication_applier_status_by_worker;
 -------------- ----------- ----------- --------------- ------------------------------------------------ ------------------- -------------------- ---------------------- 
| CHANNEL_NAME | WORKER_ID | THREAD_ID | SERVICE_STATE | LAST_SEEN_TRANSACTION                          | LAST_ERROR_NUMBER | LAST_ERROR_MESSAGE | LAST_ERROR_TIMESTAMP |
 -------------- ----------- ----------- --------------- ------------------------------------------------ ------------------- -------------------- ---------------------- 
|              |         1 |      NULL | OFF           | 0b961fcc-41c2-11e7-84fd-286ed488c7da:156369774 |                 0 |                    | 0000-00-00 00:00:00  |
|              |         2 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         3 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         4 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         5 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         6 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         7 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
|              |         8 |      NULL | OFF           |                                                |                 0 |                    | 0000-00-00 00:00:00  |
 -------------- ----------- ----------- --------------- ------------------------------------------------ ------------------- -------------------- ---------------------- 

姜承饶

既然relay_log的位置信息都有了,那就去日志里看看吧:

 

解析Binlog文件:

简称MTS:基于binlog组提交,mysql5.7默认开启binlog组提交

mysqlbinlog -v --base64-output=decode-rows oracle-relay-bin.000093 >1.sql

找到152912092位置点附近的日志:

 组提交(group commit)是MYSQL处理日志的一种优化方式,主要为了解决写日志时频繁刷磁盘的问题。组提交伴随着MYSQL的发展不断优化,从最初只支持redo log 组提交,到目前5.6官方版本同时支持redo log 和binlog组提交。组提交的实现大大提高了mysql的事务处理性能

www.yzc216.com 1

 

检查了一下数据库中这个表ID为14816035的数据确实是不存在的。

支持多线程复制(Multi-Threaded Slaves, 简称MTS:基于binlog组提交 不是redolog组提交,一个组提交的事务都是可以并行回放 ,因为这些事务都已进入到事务的prepare阶段,则说明事务之间没有任何冲突(否则就不可能提交)。
SQL线程就分裂为coordinator线程和worker线程,worker线程对组提交的事务进行并行回放

另外除了这条日志,其它日志的last_committed和sequence_number都为0,last_committed表示事务提交的时候,上次事务提交的编号。last_committed和sequence_number代表的就是所谓的LOGICAL_CLOCK。

为了兼容MySQL 5.6基于库的并行复制,5.7引入了新的变量slave-parallel-type,其可以配置的值有:

猜测如果手动把这条数据插入延迟从库,并且使用注入一个空事务跳过这个GTID的方法重启sql_thread,相信这个错误也能被解决。

DATABASE:默认值,基于库的并行复制方式
LOGICAL_CLOCK:基于组提交的并行复制方式

但既然带了LOGICAL_CLOCK的事务就会出错,跳过事务的方法很难保证以后不会出错。

支持并行复制的GTID
如何知道事务是否在一组中,又是一个问题,因为原版的MySQL并没有提供这样的信息。在MySQL 5.7版本中,其设计方式是将组提交的信息存放在GTID中。那么如果用户没有开启GTID功能,即将参数gtid_mode设置为OFF呢?故MySQL 5.7又引入了称之为Anonymous_Gtid的二进制日志event类型,如:

注意到这条日志的last_committed是一个异常大的值,且错误信息中有提到The master event is logically timestamped incorrectly。我怀疑是不是并行配置的问题。

mysql> SHOW BINLOG EVENTS in 'mysql-bin.000006';
------------------ ----- ---------------- ----------- ------------- -----------------------------------------------
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
------------------ ----- ---------------- ----------- ------------- -----------------------------------------------
| mysql-bin.000006 | 4 | Format_desc | 88 | 123 | Server ver: 5.7.7-rc-debug-log, Binlog ver: 4 |
| mysql-bin.000006 | 123 | Previous_gtids | 88 | 194 | f11232f7-ff07-11e4-8fbb-00ff55e152c6:1-2 |
| mysql-bin.000006 | 194 | Anonymous_Gtid | 88 | 259 | SET @@SESSION.GTID_NEXT= 'ANONYMOUS' |
| mysql-bin.000006 | 259 | Query | 88 | 330 | BEGIN |
| mysql-bin.000006 | 330 | Table_map | 88 | 373 | table_id: 108 (aaa.t) |
| mysql-bin.000006 | 373 | Write_rows | 88 | 413 | table_id: 108 flags: STMT_END_F

从库配置:

这意味着在 MySQL 5.7版本中即使不开启GTID,每个事务开始前也是会存在一个Anonymous_Gtid ,而这GTID中就存在着组提交的信息。

"root@localhost:mysql3308.sock  [(none)]>show variables like '%para%';
 ------------------------ --------------- 
| Variable_name          | Value         |
 ------------------------ --------------- 
| slave_parallel_type    | LOGICAL_CLOCK |
| slave_parallel_workers | 8             |
 ------------------------ --------------- 

LOGICAL_CLOCK

 再检查主库配置:

然而,通过上述的SHOW BINLOG EVENTS,我们并没有发现有关组提交的任何信息。但是通过mysqlbinlog工具,用户就能发现组提交的内部信息:

(root@localhost:mysql.sock) [(none)]>show variables like '%para%';
 ------------------------ ------- 
| Variable_name          | Value |
 ------------------------ ------- 
| slave_parallel_workers | 0     |
 ------------------------ ------- 

root@localhost:~# mysqlbinlog mysql-bin.0000006 | grep last_committed
#150520 14:23:11 server id 88 end_log_pos 259 CRC32 0x4ead9ad6 GTID last_committed=0 sequence_number=1
#150520 14:23:11 server id 88 end_log_pos 1483 CRC32 0xdf94bc85 GTID last_committed=0 sequence_number=2
#150520 14:23:11 server id 88 end_log_pos 2708 CRC32 0x0914697b GTID last_committed=0 sequence_number=3
#150520 14:23:11 server id 88 end_log_pos 3934 CRC32 0xd9cb4a43 GTID last_committed=0 sequence_number=4
#150520 14:23:11 server id 88 end_log_pos 5159 CRC32 0x06a6f531 GTID last_committed=0 sequence_number=5
#150520 14:23:11 server id 88 end_log_pos 6386 CRC32 0xd6cae930 GTID last_committed=0 sequence_number=6
#150520 14:23:11 server id 88 end_log_pos 7610 CRC32 0xa1ea531c GTID last_committed=6 sequence_number=7
#150520 14:23:11 server id 88 end_log_pos 8834 CRC32 0x96864e6b GTID last_committed=6 sequence_number=8
#150520 14:23:11 server id 88 end_log_pos 10057 CRC32 0x2de1ae55 GTID last_committed=6 sequence_number=9
#150520 14:23:11 server id 88 end_log_pos 11280 CRC32 0x5eb13091 GTID last_committed=6 sequence_number=10
#150520 14:23:11 server id 88 end_log_pos 12504 CRC32 0x16721011 GTID last_committed=6 sequence_number=11
#150520 14:23:11 server id 88 end_log_pos 13727 CRC32 0xe2210ab6 GTID last_committed=6 sequence_number=12
#150520 14:23:11 server id 88 end_log_pos 14952 CRC32 0xf41181d3 GTID last_committed=12 sequence_number=13
...
可以发现较之原来的二进制日志内容多了last_committed和sequence_number,last_committed表示事务提交的时候,上次事务提交的编号,如果事务具有相同的last_committed,表示这些事务都在一组内,可以进行并行的回放。

 发现主库根本就没有slave_parallel_type这项配置。想起来主库是mysql5.6了。

例如上述last_committed为0的事务有6个,表示组提交时提交了6个事务,而这6个事务在从机是可以进行并行回放的。

(root@localhost:mysql.sock) [(none)]>select version();
 ------------ 
| version()  |
 ------------ 
| 5.6.35-log |
 ------------ 

上述的last_committed和sequence_number代表的就是所谓的LOGICAL_CLOCK。先来看源码中对于LOGICAL_CLOCK的定义:

 那么问题基本上就知道了,主库5.6只支持基于DATABASE的并行复制,而5.7的从库配置成LOGICAL_CLOCK导致了异常。

class Logical_clock
{
  private:
  int64 state;
  /*
  Offset is subtracted from the actual "absolute time" value at
  logging a replication event. That is the event holds logical
  timestamps in the "relative" format. They are meaningful only in
  the context of the current binlog.
  The member is updated (incremented) per binary log rotation.
  */
  int64 offset;
  ......
state是一个自增的值,offset在每次二进制日志发生rotate时更新,记录发生rotate时的state值。其实state和offset记录的是全局的计数值,而存在二进制日志中的仅是当前文件的相对值。使用LOGICAL_CLOCK的场景如下:

明白了问题所在,那就好解决了,把从库的slave_parallel_type改为DATABASE,再起sql_thread问题应该就解决了:

class MYSQL_BIN_LOG: public TC_LOG
{
  ...
  public:
  /* Committed transactions timestamp */
  Logical_clock max_committed_transaction;
  /* "Prepared" transactions timestamp */
  Logical_clock transaction_counter;
  ...
可以看到在类MYSQL_BIN_LOG中定义了两个Logical_clock的变量:

"root@localhost:mysql3308.sock  [none]>set global slave_parallel_type='DATABASE';
Query OK, 0 rows affected (0.00 sec)

"root@localhost:mysql3308.sock  [none]>show global variables like '%slave_parallel_type%';
 --------------------- ---------- 
| Variable_name       | Value    |
 --------------------- ---------- 
| slave_parallel_type | DATABASE |
 --------------------- ---------- 
1 row in set (0.00 sec)

"root@localhost:mysql3308.sock  [none]>show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: master
                  Master_User: rep
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000104
          Read_Master_Log_Pos: 160115307
               Relay_Log_File: oracle-relay-bin.000093
                Relay_Log_Pos: 152912092
        Relay_Master_Log_File: binlog.000100
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1755
                   Last_Error: Cannot execute the current event group in the parallel mode. Encountered event Gtid, relay-log name ./oracle-relay-bin.000093, position 152912092 which prevents execution of this event group in parallel mode. Reason: The master event is logically timestamped incorrectly..
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 152911925
              Relay_Log_Space: 4455094667
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1755
               Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Gtid, relay-log name ./oracle-relay-bin.000093, position 152912092 which prevents execution of this event group in parallel mode. Reason: The master event is logically timestamped incorrectly..
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 50
                  Master_UUID: 0b961fcc-41c2-11e7-84fd-286ed488c7da
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 3600
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 180716 18:02:56
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 0b961fcc-41c2-11e7-84fd-286ed488c7da:111060115-163843604
            Executed_Gtid_Set: 0b961fcc-41c2-11e7-84fd-286ed488c7da:1-156369774
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

"root@localhost:mysql3308.sock  [none]>stop slave sql_thread;
Query OK, 0 rows affected, 1 warning (0.00 sec)

"root@localhost:mysql3308.sock  [none]>start slave sql_thread;
Query OK, 0 rows affected (0.01 sec)

"root@localhost:mysql3308.sock  [none]>show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: master
                  Master_User: rep
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: binlog.000104
          Read_Master_Log_Pos: 160161836
               Relay_Log_File: oracle-relay-bin.000093
                Relay_Log_Pos: 169205552
        Relay_Master_Log_File: binlog.000100
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 169205385
              Relay_Log_Space: 4455141196
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 5351
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 50
                  Master_UUID: 0b961fcc-41c2-11e7-84fd-286ed488c7da
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 3600
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for Slave Worker to release partition
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 0b961fcc-41c2-11e7-84fd-286ed488c7da:111060115-163843692
            Executed_Gtid_Set: 0b961fcc-41c2-11e7-84fd-286ed488c7da:1-156400100
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

max_committed_transaction:记录上次组提交时的logical_clock,代表上述mysqlbinlog中的last_committed
transaction_counter:记录当前组提交中各事务的logcial_clock,代表上述mysqlbinlog中的sequence_number

打完收工。

并行复制测试

转载请注明出处。

下图显示了开启MTS后,slave服务器的QPS。测试的工具是sysbench的单表全update测试,测试结果显示在16个线程下的性能最好,从机的QPS可以达到25000以上,进一步增加并行执行的线程至32并没有带来更高的提升。
而原单线程回放的QPS仅在4000左右,可见MySQL 5.7 MTS带来的性能提升,而由于测试的是单表,所以MySQL 5.6的MTS机制则完全无能为力了。

本文地址:

并行复制配置与调优

master_info_repository

开启MTS功能后,务必将参数master_info_repostitory设置为TABLE,这样性能可以有50%~80%的提升。这是因为并行复制开启后对于元master.info这个文件的更新将会大幅提升,资源的竞争也会变大。在之前 InnoSQL 的版本中,添加了参数来控制刷新master.info这个文件的频率,甚至可以不刷新这个文件。因为刷新这个文件是没有必要的,即根据master-info.log这个文件恢复本身就是不可靠的。在MySQL 5.7中,Inside君推荐将master_info_repository设置为TABLE,来减小这部分的开销。

slave_parallel_workers

若将slave_parallel_workers设置为0,则MySQL 5.7退化为原单线程复制,但将slave_parallel_workers设置为1,则SQL线程功能转化为coordinator线程,但是只有1个worker线程进行回放,也是单线程复制。
然而,这两种性能却又有一些的区别,因为多了一次coordinator线程的转发,因此slave_parallel_workers=1的性能反而比0还要差,在Inside君的测试下还有20%左右的性能下降,如下图所示:
这里其中引入了另一个问题,如果主机上的负载不大,那么组提交的效率就不高,很有可能发生每组提交的事务数量仅有1个,那么在从机的回放时, 虽然开启了并行复制,但会出现性能反而比原先的单线程还要差的现象,即延迟反而增大了 。聪明的小伙伴们,有想过对这个进行优化吗?

Enhanced Multi-Threaded Slave配置

说了这么多,要开启enhanced multi-threaded slave其实很简单,只需根据如下设置:

# slave
slave-parallel-type=LOGICAL_CLOCK
slave-parallel-workers=16
master_info_repository=TABLE
relay_log_info_repository=TABLE
relay_log_recovery=ON
并行复制监控

复制的监控依旧可以通过SHOW SLAVE STATUSG,但是MySQL 5.7在performance_schema架构下多了这些表,用户可以更细力度的进行监控:

mysql> show tables like 'replication%';
---------------------------------------------
| Tables_in_www.yzc216.com,performance_schema (replication%) |
---------------------------------------------
| replication_applier_configuration           |
| replication_applier_status                  |
| replication_applier_status_by_coordinator   |
| replication_applier_status_by_worker        |
| replication_connection_configuration        |
| replication_connection_status               |
| replication_group_member_stats              |
| replication_group_members                   |
---------------------------------------------
8 rows in set (0.00 sec)
总结

MySQL 5.7推出的Enhanced Multi-Threaded Slave解决了困扰MySQL长达数十年的复制延迟问题,再次提醒一些无知的PostgreSQL用户,不要再停留在之前对于MySQL的印象,物理复制也不一定肯定比逻辑复制有优势,而MySQL 5.7的MTS已经完全可以解决延迟问题。anyway,和Inside君一起见证划时代MySQL 5.7 GA版本的降临吧。

 

MySQL5.7并行复制中并行的真正含义

我们知道MySQL5.7并行复制引入了两个值last_committed和sequence_number。last_committed表示事务提交的时候,上次事务提交的编号,在主库上同时提交的事务设置成相同的last_committed。如果事务具有相同的last_committed,表示这些事务都在一组内,可以进行并行的回放。这个机制也是Commit-Parent-Based SchemeWL#6314中的实现方式。不过之后,官方对这种模式做了改进,所以最新的并行回放机制和WL#6314有了不同,详情见Lock-Based SchemeWL#7165

这个参数设置为yes是为了确保,在slave上事务的提交顺序与relay log中一致。
但是经过测试,这个参数在MySQL5.7.18中设置之后,也无法保证slave上事务提交的顺序与relay log一致。
在MySQL5.7.19设置后,slave上事务的提交顺序与relay log中一致。
For multi-threaded slaves, enabling this variable ensures that transactions are externalized on the slave in the same order as they appear in the slave's relay log. Setting this variable has no effect on slaves for which multi-threading is not enabled. All replication threads (for all replication channels if you are using multiple replication channels) must be stopped before changing this variable. --log-bin and --log-slave-updates must be enabled on the slave. In addition --slave-parallel-type must be set to LOGICAL_CLOCK.

本文由yzc216亚洲城发布于数据库资讯,转载请注明出处:MySQL并行复制的二个坑,MySQL的并行复制八线程复

关键词: yzc216亚洲城 www.yzc216.c