如何构建高性能MySQL？

测试场景MySQL 5.7.12主要测试差异刷盘参数对性能的影响, 使用以下三个场景:sync_binlog=1, innodb_flush_log_at_trx_commit=1, 简写为b1e1 (binlog-1-engine-1)sync_binlog=0, innodb_flush_log_at_trx_commit=1, 简写为b0e1sync_binlog=0, innodb_flush_log_at_trx_commit=0, 简写为b0e0MySQL 环境搭建使用 MySQL sandbox, 对应三个场景的启动参数如下:

1. ./start –sync-binlog=1 –log-bin=bin –server-id=5712 –gtid-mode=ON –enforce-gtid-consistency=1 –log-slave-updates=1

2. ./start –sync-binlog=0 –log-bin=bin –server-id=5712 –gtid-mode=ON –enforce-gtid-consistency=1 –log-slave-updates=1

3. ./start –sync-binlog=0 –log-bin=bin –server-id=5712 –gtid-mode=ON –enforce-gtid-consistency=1 –log-slave-updates=1 –innodb-flush-log-at-trx-commit=0

压力天生使用sysbench:

mysql -h127.0.0.1 -P5712 -uroot -pmsandbox -e”truncate table test.sbtest1″; /opt/sysbench-0.5/dist/bin/sysbench –test=/opt/sysbench-0.5/dist/db/insert.lua –mysql-table-engine=innodb –mysql-host=127.0.0.1 –mysql-user=root –mysql-password=msandbox –mysql-port=5712 –oltp-table-size=1000000 –mysql-db=test –oltp-table-name=stest –num-threads=1 –max-time=30 –max-requests=1000000000 –oltp-auto-inc=off –db-driver=mysql run

性能观察工具使用systemtap(简称stap), version 2.7/0.160

基准在没有观察压力的情形下, 对三种场景划分进行基准测试, 用以矫正之后测试的误差:

场景sysbench事务数b1e167546b0e1125699b0e0181612

火焰图与offcpu火焰图火焰图是Brendan Gregg首创的示意性能的图形方式, 其可以直观的看到压力的漫衍. Brendan提供了厚实的工具天生火焰图.

火焰图对照b0e1和b0e0使用stap剧本获取CPU profile, 并天生火焰图(火焰图天生的下令略, 参看Brendan的文档)

stap剧本

global tids probe process(“/home/huangyan/sandboxes/5.7.12/bin/mysqld”).function(“mysql_execute_command”) { if (pid() == target() && tids[tid()] == 0) { tids[tid()] = 1; } } global oncpu probe timer.profile { if (tids[tid()] == 1) { oncpu[ubacktrace()] <<< 1; } } probe timer.s(10) { exit(); } probe end { foreach (i in oncpu+) { print_stack(i); printf(“\t%d\n”, @count(oncpu[i])); } }

注重:

1. 剧本只抓取MySQL的用户线程的CPU profile, 不抓取后台历程.

2. 剧本只抓取10s, 相当于对整个sysbench的30s历程进行了短期抽样.

b0e1天生的火焰图

性能

在开启观察的情形下, 考察性能:

场景sysbench事务数b0e1119274b0e0166424

剖析

在天生的火焰图中, 可以看到:

在b0e1场景中, _ZL27log_write_flush_to_disk_lowv的占比是12.93%, 其绝大部门时间是用于将innodb的log刷盘.在b0e0场景中, _ZL27log_write_flush_to_disk_lowv的开销被节约掉, 理论上的事务数比例应是1-12.93%=87.07%, 现实事务数的比例是119274/166424=71.67%, 误差较大误差较大的问题, 要引入offcpu来解决.

offcpu在之前的剖析中我们看到理论和现实的事务数误差较大. 思考_ZL27log_write_flush_to_disk_lowv的主要操作是IO操作, IO操作最先, 历程就会被OS进行上下文切换换下台, 以守候IO操作竣事, 那么只剖析CPU profile就忽略了IO守候的时间, 也就是说_ZL27log_write_flush_to_disk_lowv的开销被低估了.

offcpu也是Brendan Gregg提出的看法. 对于IO操作的观察, 除了CPU profile(称为oncpu时间), 还需要观察其上下文切换的价值, 即offcpu时间.

修改一下stap剧本可以观察offcpu时间. 不外为了将oncpu和offcpu的时间显示在一张火焰图上作对比, 我对于Brendan的工具做了微量修改, 本文将不先容这些修改.

stap剧本

global tids probe process(“/home/huangyan/sandboxes/5.7.12/bin/mysqld”).function(“mysql_execute_command”) { if (pid() == target() && tids[tid()] == 0) { tids[tid()] = 1; } } global oncpu, offcpu, timer_count, first_cpu_id = -1; probe timer.profile { if (first_cpu_id == -1) { first_cpu_id = cpu(); } if (tids[tid()] == 1) { oncpu[ubacktrace()] <<< 1; } if (first_cpu_id == cpu()) { timer_count++; } } global switchout_ustack, switchout_timestamp probe scheduler.ctxswitch { if (tids[prev_tid] == 1) { switchout_ustack[prev_tid] = ubacktrace(); switchout_timestamp[prev_tid] = timer_count; } if (tids[next_tid] == 1 && switchout_ustack[next_tid] != “”) { offcpu[switchout_ustack[next_tid]] <<< timer_count – switchout_timestamp[next_tid]; switchout_ustack[next_tid] = “”; } } probe timer.s(10) { exit(); } probe end { foreach (i in oncpu+) { print_stack(i); printf(“\t%d\n”, @sum(oncpu[i])); } foreach (i in offcpu+) { printf(“—“); print_stack(i); printf(“\t%d\n”, @sum(offcpu[i])); } }

注重: timer.profile的说明中是这样写的:

Profiling timers are available to provide probes that execute on all CPUs at each system tick.

也就是说在一个时间片中, timer.profile会对每一个CPU挪用一次. 因此代码中使用了如下代码, 保证时间片手艺的准确性:

if (first_cpu_id == cpu()) { timer_count++; }

b0e1天生的带有offcpu的火焰图

性能

由于换取了观察剧本, 需要重新观察性能以减小误差:

场景sysbench事务数b0e1105980b0e0164739

剖析

在火焰图中, 可以看到:

1. _ZL27log_write_flush_to_disk_lowv的占比为31.23%

2. 理论上的事务数比例应是1-31.23%=68.77%, 现实事务数的比例是105980/164739=64.33%, 误差较小.

观察误差的矫正在对照b0e1和b0e0两个场景时, 获得了对照好的效果. 但同样的方式在对照b1e1和b0e1两个场景时, 泛起了一些误差.

误差征象b1e1的火焰图如图:

其中_ZN13MYSQL_BIN_LOG16sync_binlog_fileEb(sync_binlog的函数)占比为26.52%.

但性能差异为:

场景sysbench事务数b1e153752b0e1105980

理论的事务数比例为1-26.52%=73.48%, 现实事务数比例为53752/105980=50.71%, 误差较大.

剖析压力漫衍首先嫌疑压力转移, 即当sync_binlog的压力消除后, 服务器压力被转移到了其它的瓶颈上. 但若是压力发生了转移, 那么现实事务数比例应大于理论事务数比例, 即sync_binlog=0带来的性能提升更小.

不外我们照样可以权衡一下压力漫衍, 看看b1e1和b0e1的压力有什么差异, 步骤如下:

修改stap剧本, 在b1e1中不统计sync_binlog的价值. 天生的火焰图示意消除sync_binlog价值后, 理论上的服务器压力类型.与b0e1发生的火焰图做对照.stap剧本

只修改了probe end部门, 略过对my_sync客栈的统计:

probe end { foreach (i in oncpu+) { if (isinstr(sprint_stack(i), “my_sync”)) { continue; } print_stack(i); printf(“\t%d\n”, @sum(oncpu[i])); } foreach (i in offcpu+) { if (isinstr(sprint_stack(i), “my_sync”)) { continue; } printf(“—“); print_stack(i); printf(“\t%d\n”, @sum(offcpu[i])); } }

效果

b1e1, 理论的服务器压力争:

b0e1, 现实的服务器压力争:

可以看到, 压力漫衍是异常类似, 即没有发生压力漫衍.

BTW: 这两张图的类似, 具有一定随机性, 需要做多次试验才气发生这个效果.

剖析

既然理论和现实的压力漫衍类似, 那么可能发生的就是压力的整体等比缩小. 推测是两个场景上的观察成本差异, 导致观察影响到了所有压力的观察.

考察stap剧本, 其中成本较大的是ctxswitch, 即上下文切换时的观察成本.

上下文切换的观察成本若是 “上下文切换的观察成本影响场景观察的公正性” 这一结论确定, 那么我们需要注释两个征象:

1. b1e1和b0e1的对照, 受到了上下文切换的观察成本的影响

2. b0e1和b0e0的对照, 未受到上下文切换的观察成本的影响

假设上下文切换的观察成本正比于上下文切换的次数, 那么我们只需要: