Best unofficial Apache Server developers community
Username
Forgot password?
Sign in with Twitter account
Sign in with Facebook account

Performance schema, overhead tuning

Performance schema performance tuning

One of the most important question users ask before deciding to use the performance schema is: what is the overhead ? The underlying concern of course is to make sure deploying the performance schema does not negatively impact production ("First, do no harm").
The question is simple, and yet the answer is not so simple, as it depends on so many things.
As seen in some benchmarks done already, the overhead when using the performance schema can vary a lot, and in some cases really causes degradations.
A lot of time as been spent investigating the root causes (note the plural form) for the performance schema overhead in general, so I think it is a good time to share current findings.

Server workload

Analysing performances, and testing the effects of different code or configuration changes, can hardly be done in a vacuum. That's where benchmarks come in.
The problem with benchmarks in general is that different benchmarks will produce different loads, and that a given code may perform very well under some conditions but poorly under others ...
One particular property of the performance schema is that, since it instruments the server itself, the load and behavior of the server:
  • is there capacity left or is it saturated ?
  • is it CPU bound or disk io bound ?,
  • paging ?
  • swaping ?
  • contended in a specific place (query cache, table cache, innodb buffer pool, [insert your favorite here]) ?
directly influences the load put on the performance schema code:
  • how many mutexes / rwlocks / cond events are detected ?
  • how many file io events are detected ?
  • how many table io / table lock events are detected ?
  • how many stages / statements are detected ?
Given this load as input, the user configuration of the performance schema:
  • the sizing of internal performance schema buffers,
  • which instrument is enabled ?
  • which instrument is timed ?
  • which timer is used ?
  • which consumer is enabled ?
  • is there some filtering per user ? per host ? (table setup_actors)
  • is there some filtering per object? (table setup_objects)
will decide what to do with the events seen:
  • discard them,
  • count them,
  • time them,
  • aggregate them, possibly with fancy aggregations,
all of which affects overhead.
To summarize, the overall system can be modeled as giant pipe, which consists of, in order:
  • the client application and the generated application workload,
  • the mysql server and the generated instrumentation workload,
  • the performance schema and the generated statistics / overhead.
It is critical to understand how that pipe works to interpret results.
For example, if a client application is sending very few queries to the server, generating a very "small" load, a benchmark with or without the performance schema is not going to show any differences. To be meaningful, a benchmark (of the database) has to put a lot of pressure (on the database). This is hardly a surprise, but here comes the interesting part.
Next, if a server is poorly configured for the application workload received, and if the server itself does not perform well for that load, the instrumentation workload on the performance schema will be "small". A properly configured server will execute more queries, transactions, etc, which will generate more events to instrument. To be meaningful, a benchmark (of the performance schema) has to put a lot of pressure (on the performance schema).
An immediate consequence is that benchmarks of the performance schema should disable the query cache: there are currently 474 instrumentation points in the server. If most executions paths bounce off of the query cache, stress will be on "wait/synch/mutex/sql/Query_cache::structure_guard_mutex" only, cutting off the load generated from the lower layers on the server code, such as the table io instrumentation, or the storage engines instrumentation.
A more subtle consequence is that every time a performance bottleneck in the server code is removed, either by a bug fix or by a better server tuning, server efficiency goes up, which increases the number of statements / transactions / whatever the benchmark metric is counting, that are executed. While this is good news for the server, it actually increases the instrumentation workload for the performance schema.
In conclusion, and in my opinion, to interpret a performance schema benchmark, results should describe not only the overhead observed, but also the load seen that caused the overhead. By load seen, I mean here the instrumentation load seen at the server / performance schema interface, since this is the real stress put on the performance schema code. The more details are available for each stage of the pipe, the better.
Time to dive into numbers then ...

The trunk benchmark

The load used for this analysis is sysbench, and more specifically the sysbench oltp read-only benchmark.
The reason this particular case was picked is not because I like the results on this one better.
The reasons are:
  • sysbench oltp was the test used in previous benchmarks, and to reproduce the results (and analyse the reasons for overhead), it helps to be in similar conditions,
  • a read only, in memory only load eliminates a lot of noise from measurements, making analysis easier: the goal here is not to get a number (by "just" running a benchmark), it is to get an explanation for the number, so that the code can be improved,
  • the performance schema code itself is "CPU only, user land only" code, as it does not make any io or any system call. A benchmark that shares the same characteristics seems more likely to provide interesting insight.
Time permitting, I hope to get to do the same analysis for other cases (sysbench read-write, dbt2, etc).
The test machine used is a 8 CPU / 32 GB Linux box, so the test configuration is sized accordingly:
  • for sysbench: --oltp-table-size=1000000
  • for mysql my.cnf: innodb_buffer_pool_size = 12G
Historical note: testing done for this post started on a 8 CPU 16 GB machine, then moved to another lab. The original configuration was not changed, hence the 12G buffer pool only.

The trunk results

All results are for mysql-trunk (revno 3195, 2011-06-13), compiled in Release configuration. This corresponds to the current code for mysql 5.6, which is still in development.

Tests without the performance schema

With this build, the performance instrumentation is not compiled. All the computation of overhead expressed as percentages are relative to this reference, which is used as the ultimate baseline.
Connections TPS CPU User (%) CPU System (%) CPU Idle (%)
1 807 10 2 88
2 1439 20 5 75
4 2610 40 10 50
8 5086 84 14 2
12 5037 84 15 1
16 5038 85 15 0
32 4854 85 15 0
48 4762 85 15 0
64 4118 (*) 60 32 8
128 3538 (*) 57 35 8
256 1746 (*) 57 35 8
The overall CPU statistics are collected by vmstat, and helps to qualify in which mode the server is operating.
Looking at the overall shape of the performance curve, three different operating modes can be observed:
  • Ramping up (1 to 8 connections)
  • Peek efficiency (8 to 48 connections)
  • Server progressive degradation (48 to infinity connections)
For a 8 core machine, this looks reasonable.
(*) A special note about results for 64 / 128 / 256 connections: the performance measured in this cases is not stable and can vary a lot (50 percent) between runs. This is due to the machine saturation: at 256 connections, 256 threads are running in sysbench, plus 256+16 in mysqld, which makes 512+ active threads on a 8 core. Numbers are posted for this case as well, to include this mode of operation in the analysis.

Tests with the performance schema

Now that we have established what the baseline looks like, executing the same workload on the same hardware, under the same conditions gives the following results. The CPU user / system / idle statistics from vmstat are not repeated (the overall profile is the same) for brevity.
With performance schema disabled off global (timed) thread (timed)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 796 1.36 775 3.97 684 15.24 681 15.61
2 1409 2.08 1349 6.25 1198 16.75 1191 17.23
4 2544 2.53 4634 5.86 2115 18.97 2114 19.00
8 4844 4.76 4634 8.89 3974 21.86 3913 23.06
12 4850 3.71 4662 7.44 4020 20.19 3979 21.00
16 4847 3.79 4676 7.19 4051 19.59 4020 20.21
32 4676 3.61 4485 7.60 3851 20.66 3789 21.94
48 4582 3.78 4384 7.94 3759 21.06 3692 22.47
64 (*) 2548 38.13 3048 25.98 3392 17.63 3286 20.20
128 (*) 2035 42.48 2007 43.27 1938 45.22 1917 45.82
256 (*) 1745 0.06 1697 2.81 1662 4.81 1624 6.99
(*) The same comment about results stability applies here. Because the results are unstable with great variations, the overhead computed after normalization on a unstable baseline gives meaningless results.

Analysis

Overall, the overhead in the most basic configuration, with all instruments enabled and only the global instrumentation consumer enabled, is at 20 percent already.
This is a lot. Looking in more details at these results, it appears that:
  • Adding the thread instrumentation consumer (to get statistics per thread) does only marginally increase overhead: at 8 connections, the overhead goes from 21.86% to 23.06%.
  • The overhead depends on the number of connections. It is progressive in the 1-8 connections range, peaks at 8 connections, then is both stable and significant at 8-48 connections.
  • Peak TPS without the performance schema was at 8 connections. With the performance schema, it seems to move to 12 or even 16 connections.
To understand the root cause for this 20% toll, the following measurements are also instructive:
  • With the performance schema enabled but all instruments disabled (configuration "off"), the overhead is at ~8%. For code doing nothing, this is actually quiet high.
  • When the performance schema is disabled at startup time (configuration "disabled"), the overhead is at ~4%. This configuration is not very useful from a functional point of view, but from a technical point of view it executes a code path which is a subset of the configuration "off". That makes the numbers collected useful.
Based on this, overhead seems to originate from, roughly:
  • compiling with the performance schema (configuration "disabled"): ~4% cost,
  • starting with the performance schema (configuration "off"): additional ~4% cost,
  • enabling the global consumer only (configuration "global (timed)"): additional ~12% cost,
  • adding the thread consumer only (configuration "global (timed)"): additional ~1% cost
To understand better which code is executed, and how often, let's use the performance schema itself to get some insight on the server execution. After all, that is precisely what the performance schema is for.
Using the "thread (timed)" configuration, with the assumption that the instrumentation does not distort the workload, and looking at details for 8 connections, which is chosen because it corresponds to the peak performance without the performance schema, statistics collected are as follows below.

Total number of wait events.

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_waits_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|      1098658477 |     797822128662705 |
+-----------------+---------------------+
1 row in set (1.79 sec)

Top waits, by count

mysql> select * from performance_schema.events_waits_summary_global_by_event_name
    ->   where count_star>0 order by count_star desc limit 25;
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                                    | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler                     |  482492695 | 583215424951365 |          42210 |        1208680 |     7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex        |  265080324 | 142377797253015 |          30150 |         537005 |     1862766495 |
| wait/synch/mutex/sql/THD::LOCK_thd_data       |  122077261 |   6519294560295 |          30150 |          53265 |     1072076715 |
| wait/synch/rwlock/innodb/hash table locks     |   38409624 |   5420427797475 |          27135 |         141035 |      360901530 |
| wait/synch/mutex/innodb/read_view_mutex       |   32554195 |   6146001298890 |          24120 |         188605 |     1011083265 |
| wait/synch/mutex/sql/LOCK_open                |   32553870 |   9359193544965 |          30150 |         287430 |     1080446355 |
| wait/lock/table/sql/handler                   |   32553724 |  20576707870950 |          99495 |         631810 |     1083789990 |
| wait/synch/mutex/mysys/THR_LOCK::mutex        |   32553724 |   6964485402105 |          30150 |         213730 |     3022857090 |
| wait/synch/rwlock/innodb/trx_sys_lock         |   18592602 |   5270410352025 |          42210 |         283410 |     1622682045 |
| wait/synch/rwlock/sql/LOCK_grant              |   16276408 |   3502306526580 |          51255 |         215070 |      892225935 |
| wait/synch/rwlock/innodb/btr_search_latch     |   16198467 |   4144473779535 |          27135 |         255605 |     1048297410 |
| wait/synch/rwlock/sql/MDL_lock::rwlock        |    2325412 |    524010714480 |          54270 |         225120 |       36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex          |    2325268 |    409667867595 |          42210 |         175875 |      908214480 |
| wait/synch/mutex/innodb/trx_mutex             |    1163261 |     32488536510 |          24120 |          27805 |        7419915 |
| wait/synch/mutex/sql/MDL_map::mutex           |    1162780 |    274705866855 |          36180 |         236175 |      133772535 |
| wait/synch/mutex/sql/LOCK_plugin              |    1162673 |    212769433395 |          33165 |         182910 |      791328960 |
| wait/synch/mutex/innodb/lock_mutex            |    1162633 |    145945057815 |          39195 |         125290 |       14173515 |
| wait/synch/mutex/innodb/log_sys_mutex         |       7757 |      2441752020 |          27135 |         314565 |     1239502680 |
| wait/synch/mutex/innodb/os_mutex              |       2490 |       271684665 |          30150 |         108875 |       17423685 |
| wait/synch/mutex/innodb/fil_system_mutex      |       1402 |       146559150 |          24120 |         104520 |        5538555 |
| wait/synch/mutex/innodb/flush_list_mutex      |        607 |        96362415 |          33165 |         158455 |         346725 |
| wait/io/file/innodb/innodb_log_file           |        600 |   2693980326195 |       23094900 |     4489967145 |   819050374485 |
| wait/synch/mutex/innodb/buf_pool_mutex        |        516 |        40862295 |          27135 |          79060 |         521595 |
| wait/synch/mutex/innodb/log_flush_order_mutex |        473 |        49503285 |          33165 |         104520 |         397980 |
| wait/synch/mutex/innodb/lock_wait_mutex       |        428 |        71949960 |          63315 |         167835 |        1706490 |
+-----------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)

Top waits, by time

mysql> select * from performance_schema.events_waits_summary_global_by_event_name
    ->   where count_star>0 order by sum_timer_wait desc limit 25;
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                                | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
| wait/io/table/sql/handler                 |  482492695 | 583215424951365 |          42210 |        1208680 |     7913678535 |
| wait/synch/mutex/innodb/srv_conc_mutex    |  265080324 | 142377797253015 |          30150 |         537005 |     1862766495 |
| wait/lock/table/sql/handler               |   32553724 |  20576707870950 |          99495 |         631810 |     1083789990 |
| wait/synch/mutex/sql/LOCK_open            |   32553870 |   9359193544965 |          30150 |         287430 |     1080446355 |
| wait/synch/mutex/mysys/THR_LOCK::mutex    |   32553724 |   6964485402105 |          30150 |         213730 |     3022857090 |
| wait/synch/mutex/sql/THD::LOCK_thd_data   |  122077261 |   6519294560295 |          30150 |          53265 |     1072076715 |
| wait/synch/mutex/innodb/read_view_mutex   |   32554342 |   6146010024300 |          24120 |         188605 |     1011083265 |
| wait/synch/rwlock/innodb/hash table locks |   38409624 |   5420427797475 |          27135 |         141035 |      360901530 |
| wait/synch/rwlock/innodb/trx_sys_lock     |   18592651 |   5270414135850 |          42210 |         283410 |     1622682045 |
| wait/synch/rwlock/innodb/btr_search_latch |   16198467 |   4144473779535 |          27135 |         255605 |     1048297410 |
| wait/synch/rwlock/sql/LOCK_grant          |   16276408 |   3502306526580 |          51255 |         215070 |      892225935 |
| wait/io/file/innodb/innodb_log_file       |        600 |   2693980326195 |       23094900 |     4489967145 |   819050374485 |
| wait/synch/rwlock/sql/MDL_lock::rwlock    |    2325412 |    524010714480 |          54270 |         225120 |       36433260 |
| wait/synch/mutex/mysys/BITMAP::mutex      |    2325268 |    409667867595 |          42210 |         175875 |      908214480 |
| wait/synch/mutex/sql/MDL_map::mutex       |    1162780 |    274705866855 |          36180 |         236175 |      133772535 |
| wait/synch/mutex/sql/LOCK_plugin          |    1162673 |    212769433395 |          33165 |         182910 |      791328960 |
| wait/synch/mutex/innodb/lock_mutex        |    1162633 |    145945057815 |          39195 |         125290 |       14173515 |
| wait/synch/mutex/innodb/trx_mutex         |    1163457 |     32497638795 |          24120 |          27805 |        7419915 |
| wait/io/file/innodb/innodb_data_file      |          8 |     23031533745 |       54936315 |     2878941425 |     7430889600 |
| wait/synch/rwlock/innodb/checkpoint_lock  |          4 |      5048894880 |         250245 |     1262223720 |     2745088155 |
| wait/synch/mutex/innodb/log_sys_mutex     |       8095 |      2471021640 |          27135 |         305185 |     1239502680 |
| wait/synch/mutex/sql/LOCK_thread_count    |         31 |       830150100 |          30150 |       26778895 |      235748880 |
| wait/synch/mutex/innodb/os_mutex          |       2539 |       276339825 |          30150 |         108540 |       17423685 |
| wait/synch/mutex/innodb/fil_system_mutex  |       1451 |       149064615 |          24120 |         102510 |        5538555 |
| wait/synch/mutex/innodb/flush_list_mutex  |        704 |       111835395 |          33165 |         158790 |         346725 |
+-------------------------------------------+------------+-----------------+----------------+----------------+----------------+
25 rows in set (1.79 sec)

Total number of stage events

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_stages_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|       244153447 |    2263079088326000 |
+-----------------+---------------------+
1 row in set (0.00 sec)

Top stages, by count

mysql> select * from performance_schema.events_stages_summary_global_by_event_name
    ->   where count_star>0 order by count_star desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                     | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/init                 |   34879169 | 157347558824000 |        2374000 |        4511000 |     1685373000 |
| stage/sql/cleaning up          |   18602306 |  25263461317000 |         628000 |        1358000 |     2278265000 |
| stage/sql/closing tables       |   18602129 | 384870074281000 |         628000 |       20689000 |     4374172000 |
| stage/sql/query end            |   18602129 |  23815226118000 |         698000 |        1280000 |     1674268000 |
| stage/sql/checking permissions |   16276937 |  37278853417000 |        1327000 |        2290000 |     1734543000 |
| stage/sql/Opening tables       |   16276936 |  91344126412000 |        3352000 |        5611000 |     1641511000 |
| stage/sql/executing            |   16276863 |  15264844128000 |         628000 |         937000 |     1097577000 |
| stage/sql/statistics           |   16276863 | 325924010836000 |        6495000 |       20023000 |     3296152000 |
| stage/sql/Sending data         |   16276863 | 548279995436000 |        2794000 |       33684000 |     6805393000 |
| stage/sql/System lock          |   16276863 |  68298443425000 |        2444000 |        4196000 |     1102675000 |
| stage/sql/preparing            |   16276863 |  40997524468000 |        1676000 |        2518000 |     1090453000 |
| stage/sql/optimizing           |   16276863 |  31599629407000 |        1606000 |        1941000 |     1093176000 |
| stage/sql/end                  |   16276863 |  17257774496000 |         768000 |        1060000 |     1070408000 |
| stage/sql/Sorting result       |    2325266 | 270070871391000 |        7264000 |      116146000 |     3214855000 |
| stage/sql/removing tmp table   |    1162634 |  15392269553000 |        9359000 |       13239000 |      904741000 |
| stage/sql/freeing items        |    1162634 |  15041751826000 |        3143000 |       12937000 |     1106237000 |
| stage/sql/Creating tmp table   |    1162633 |  14681955665000 |        9568000 |       12628000 |     1092898000 |
| stage/sql/Copying to tmp table |    1162633 | 180350717326000 |      116008000 |      155122000 |     8074224000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)

Top stages, by time

mysql> select * from performance_schema.events_stages_summary_global_by_event_name
    ->   where count_star>0 order by sum_timer_wait desc limit 25;
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| EVENT_NAME                     | COUNT_STAR | SUM_TIMER_WAIT  | MIN_TIMER_WAIT | AVG_TIMER_WAIT | MAX_TIMER_WAIT |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
| stage/sql/Sending data         |   16276863 | 548279995436000 |        2794000 |       33684000 |     6805393000 |
| stage/sql/closing tables       |   18602129 | 384870074281000 |         628000 |       20689000 |     4374172000 |
| stage/sql/statistics           |   16276863 | 325924010836000 |        6495000 |       20023000 |     3296152000 |
| stage/sql/Sorting result       |    2325266 | 270070871391000 |        7264000 |      116146000 |     3214855000 |
| stage/sql/Copying to tmp table |    1162633 | 180350717326000 |      116008000 |      155122000 |     8074224000 |
| stage/sql/init                 |   34879169 | 157347558824000 |        2374000 |        4511000 |     1685373000 |
| stage/sql/Opening tables       |   16276936 |  91344126412000 |        3352000 |        5611000 |     1641511000 |
| stage/sql/System lock          |   16276863 |  68298443425000 |        2444000 |        4196000 |     1102675000 |
| stage/sql/preparing            |   16276863 |  40997524468000 |        1676000 |        2518000 |     1090453000 |
| stage/sql/checking permissions |   16276937 |  37278853417000 |        1327000 |        2290000 |     1734543000 |
| stage/sql/optimizing           |   16276863 |  31599629407000 |        1606000 |        1941000 |     1093176000 |
| stage/sql/cleaning up          |   18602306 |  25263461317000 |         628000 |        1358000 |     2278265000 |
| stage/sql/query end            |   18602129 |  23815226118000 |         698000 |        1280000 |     1674268000 |
| stage/sql/end                  |   16276863 |  17257774496000 |         768000 |        1060000 |     1070408000 |
| stage/sql/removing tmp table   |    1162634 |  15392269553000 |        9359000 |       13239000 |      904741000 |
| stage/sql/executing            |   16276863 |  15264844128000 |         628000 |         937000 |     1097577000 |
| stage/sql/freeing items        |    1162634 |  15041751826000 |        3143000 |       12937000 |     1106237000 |
| stage/sql/Creating tmp table   |    1162633 |  14681955665000 |        9568000 |       12628000 |     1092898000 |
+--------------------------------+------------+-----------------+----------------+----------------+----------------+
18 rows in set (0.00 sec)

Total number of statement events

mysql> select sum(count_star), sum(sum_timer_wait)
    ->   from performance_schema.events_statements_summary_global_by_event_name;
+-----------------+---------------------+
| sum(count_star) | sum(sum_timer_wait) |
+-----------------+---------------------+
|        18602306 |    2293686457711000 |
+-----------------+---------------------+
1 row in set (0.00 sec)

Top statements, by count

mysql> select event_name, count_star, sum_timer_wait
         from performance_schema.events_statements_summary_global_by_event_name
         where count_star>0 order by count_star desc limit 25;
+---------------------------------+------------+------------------+
| event_name                      | count_star | sum_timer_wait   |
+---------------------------------+------------+------------------+
| statement/com/Execute           |   17439495 | 2267764442472000 |
| statement/sql/begin             |    1162633 |   25914627564000 |
| statement/com/Prepare           |         88 |       3041856000 |
| statement/com/Close stmt        |         80 |        729649000 |
| statement/com/Quit              |          9 |         47910000 |
| statement/sql/show_table_status |          1 |       3568260000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)

Top statements, by time

mysql> select event_name, count_star, sum_timer_wait
         from performance_schema.events_statements_summary_global_by_event_name
         where count_star>0 order by sum_timer_wait desc limit 25;
+---------------------------------+------------+------------------+
| event_name                      | count_star | sum_timer_wait   |
+---------------------------------+------------+------------------+
| statement/com/Execute           |   17439495 | 2267764442472000 |
| statement/sql/begin             |    1162633 |   25914627564000 |
| statement/sql/show_table_status |          1 |       3568260000 |
| statement/com/Prepare           |         88 |       3041856000 |
| statement/com/Close stmt        |         80 |        729649000 |
| statement/com/Quit              |          9 |         47910000 |
+---------------------------------+------------+------------------+
6 rows in set (0.00 sec)

Timer statistics

mysql> select * from performance_schema.setup_timers;
+-----------+------------+
| NAME      | TIMER_NAME |
+-----------+------------+
| wait      | CYCLE      |
| stage     | NANOSECOND |
| statement | NANOSECOND |
+-----------+------------+
3 rows in set (0.00 sec)

mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2982179104 |                1 |             36 |
| NANOSECOND  |      1000000000 |                1 |           1404 |
| MICROSECOND |         1000000 |                1 |           1422 |
| MILLISECOND |            1034 |                1 |           1431 |
| TICK        |             104 |                1 |            774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)

Interpreting results

Let's do some basic math to understand what this means ...
The total number of wait events collected is "1098658477". That's a big number. In a more readable form, it reads "1,098,658,477". That still is a big number. In plain English, it is 1 Billion, with a B. This was for a 5 minutes (300 sec) run, and all the statistics were truncated just before the sysbench load, so only the load itself is counted here.
Events collection rates (in number or events per seconds) are then:
  • 3662194 waits/sec, or 3.6 M waits/sec
  • 813844 stages/sec, or 813 K stages/sec
  • 62007 statements/sec, or 62 K statements/sec
These numbers are actually mind blowing. Keep in mind that there is no sampling involved. When each and every one of these events is collected, the performance schema time the event, and add statistics to the proper counter for the object and the thread involved, so that statistics per object, per table, per thread, etc, are available.
For what it is worth, we also have 13 stages per statements, and 59 waits per statements, on average.
Based on the top wait events, we can verify that this workload is indeed CPU bound and read only, since no file io are involved
An important point to note is that "wait/io/table" here represents table IO, as seen at the SQL layer / storage engine interface. A table IO event will time the code executed overall inside the storage engine, so it basically is the sum of the CPU time spent inside the engine and the waits blocking the storage engine. Having a big number reported in table io is expected.
Given the rate of collection for this load, 3.6 M waits/sec, the overhead in the performance schema is not caused by a big, rarely invoked, slow function, but rather by very frequent calls to very short functions.
With such a number of events, timing events cause a huge number of timer invocations.
For waits, the timer used is the cycle timer (see table performance_schema.setup_timers). For the CYCLE timer, the timer overhead alone, expressed in cycles, is 36 for this machine (see table performance_schema.performance_timers).
With 1098658477 events, 2 timer invocation per event, that makes 79103410344 cycles spent timing waits.
Given a CPU frequency of 2982179104 or 2.9 GHz, this represents 26.5224 seconds of CPU in a 300 seconds benchmark.
During this 300 sec benchmark, top reports that mysqld (the server process) consumes 723 percents of CPU (this is a 8 core, but sysbench alone eats 70 percent), so the overall CPU used by the server is 2169 sec. An overhead of 26 sec / 2169 sec represents 1.2 percent of CPU power, just timing waits.
Let's do the same math for stages and statements. They both use the same timer, nanoseconds.
With 244153447 stages, 2 timers per stage, 1404 cycles per timer, we have 685582879176 cycles or 229.8933 sec of CPU.
With 18602306 statements, 2 timers per stage, 1404 cycles per timer, we have 52235275248 cycles or 17.5158 sec of CPU.
Ouch ! That makes 229.8933+17.5158 or 247.4091 seconds of CPU calling the nanosecond timer. In terms of overhead, 247 seconds taken from 2169 represents 11.4 percent of mysqld spent for the nanosecond timer alone.
The bottom line is, this usage of the nanosecond timer is killing the performance. Now, why ? Intuitively, one would think that collecting and timing waits at a rate of 3.6 M / sec is more consuming than doing the same thing for stages and statements, at a rate of 813 K / sec and 62 K / sec respectively.
The root cause for this bad performances is hidden in plain sight, exposed by the performance_timers table.
mysql> select * from performance_schema.performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2982179104 |                1 |             36 |
| NANOSECOND  |      1000000000 |                1 |           1404 | <-- Really ? This is HUGE !
| MICROSECOND |         1000000 |                1 |           1422 |
| MILLISECOND |            1034 |                1 |           1431 |
| TICK        |             104 |                1 |            774 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
This is actually the first time I find a machine where the overhead for the NANOSECOND, MICROSECOND and MILLISECOND timer is so high
Some details about the machine used ...
malff@tyr03:~>more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           E5450  @ 3.00GHz
stepping        : 6
cpu MHz         : 2992.227
cache size      : 6144 KB
...
There are 8 cores total, all identical.
Timer details
malff@tyr03:withpfs>./unittest/mysys/my_rdtsc-t
1..11
# ----- Routine ---------------
# myt.cycles.routine          :             2 <-- RDTSC
# myt.nanoseconds.routine     :            11 <-- CLOCK_GETTIME
# myt.microseconds.routine    :            13
# myt.milliseconds.routine    :            18
# myt.ticks.routine           :            17
# ----- Frequency -------------
# myt.cycles.frequency        :    2986029850
# myt.nanoseconds.frequency   :    1000000000
# myt.microseconds.frequency  :       1000000
# myt.milliseconds.frequency  :          1035
# myt.ticks.frequency         :           104
# ----- Resolution ------------
# myt.cycles.resolution       :             1
# myt.nanoseconds.resolution  :             1
# myt.microseconds.resolution :             1
# myt.milliseconds.resolution :             1
# myt.ticks.resolution        :             1
# ----- Overhead --------------
# myt.cycles.overhead         :            36
# myt.nanoseconds.overhead    :          1404 <-- Ouch
# myt.microseconds.overhead   :          1422
# myt.milliseconds.overhead   :          1431
# myt.ticks.overhead          :           765
...
malff@tyr03:~>uname -a
Linux tyr03 2.6.32-100.28.17.el6.x86_64 #1 SMP Mon May 23 04:37:16 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
To have a point of comparison, the following are taken from a 4 core 8 Gb machine (my now several years old desktop)
mysql> select * from performance_timers;
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2831810945 |                1 |             34 | <-- Marc desktop
| NANOSECOND  |      1000000000 |                1 |            102 | <-- Marc desktop
| MICROSECOND |         1000000 |                1 |            111 |
| MILLISECOND |            1037 |                1 |            127 |
| TICK        |             105 |                1 |            493 |
+-------------+-----------------+------------------+----------------+
5 rows in set (0.00 sec)
malff@linux-8edv:~> more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
stepping        : 10
cpu MHz         : 2833.000
cache size      : 6144 KB
...
malff@linux-8edv:~> uname -a
Linux linux-8edv 2.6.37.6-0.5-desktop #1 SMP PREEMPT 2011-04-25 21:48:33 +0200 x86_64 x86_64 x86_64 GNU/Linux
Both the lab machine (tyr03) and my desktop have an Intel processor, and they happen to even be of the same family (6) and model (23), only the model name differ (Xeon E5450 vs Core 2 Quad Q9550).
The CYCLE timer is implemented using RDTSC in both cases, and the overheads measured are very similar, with 36 and 34 cycles.
Implementation of the CYCLE timer is in my_timer_cycles() in mysys.
Details about the RDTSC instruction can be found in the processor spec, page C-32 (722). Family 6 model 23 is CPUID 06_17H.
For the nanoseconds timer, it is implemented in my_timer_nanoseconds() in both cases with a call to clock_gettime(CLOCK_REALTIME), in both cases on Linux 2.6.
The lab machine has a Linux 2.6.32 kernel, while the desktop has Linux 2.6.37.
The exact root cause of this huge difference in overhead (1404 vs 102) has not been explained yet.
In the mean time, given the cost of the nanosecond timer, maybe the performance schema overhead could be reduced by not timing events or timing them with the cycle timer.
This is worth a new set of measurements, still for mysql-trunk ...

The trunk benchmark, revisited

With performance schema global (not timed) thread (not timed) thread (all cycles)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 757 6.20 752 6.82 739 8.43
2 1362 5.35 1327 7.78 1293 10.15
4 2441 6.48 2385 8.62 2328 10.80
8 4490 11.72 4384 13.80 4165 18.11
12 4489 10.88 4411 12.43 4220 16.22
16 4489 10.90 4436 11.95 4284 14.97
32 4336 10.67 4232 12.81 3992 17.76
48 4244 10.88 4111 13.67 3876 18.61
64 (*) 3977 3.42 3040 26.18 2527 38.64
128 (*) 2830 20.01 1972 44.26 1897 46.38
256 (*) 1683 3.61 1671 4.30 1629 6.70
(*) Noise at 64/128/256 still makes results unstable.
These new results, using a different configuration but still the same code in mysql-trunk, are interesting for two reasons.
First, it confirms the effect of timing alone. Timing a lot or events takes time, and timing a lot of events with a bad timer makes things really worse. Using the CYCLE timer for waits, stages and statements here, in the "thread (all cycle)" configuration, helped to regain 5 percent.
Second, it shows that once all timers are removed from the code path, the overhead is still in the ~11% range for the global only consumer. The code probably has room for optimizations, see the next section.

The tuning benchmark

The current code in mysql-trunk (at time of writing) can be improved and optimized, in an effort to reduce the overhead even more. A long list of different tweaks, changes, fixes ... has been the topic of a lot of research and tuning lately. Instead of going through each change, the cumulated changes are available in a different branch, in mysql-trunk-pfs-tuning.
This branch is available on launchpad for the technically curious. A bit of warning: it is used only for prototyping and benchmarking to see what code does (or does not) work better. It may not even build for your platform. Do not use this in production.
mysql-trunk-pfs-tuning without pfs (baseline) disabled off global (not timed) global (timed) thread (not timed) thread (timed) thread (all cycles)
Connections TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%) TPS Overhead (%)
1 819 0 788 3.79 778 5.01 771 5.86 684 16.48 755 7.81 678 17.22 744 9.16
2 1431 0 1405 1.82 1383 3.35 1354 5.38 1212 15.30 1329 7.13 1199 16.21 1307 8.67
4 2598 0 2567 1.19 2522 2.93 2452 5.62 2163 16.74 2394 7.85 2141 17.59 2371 8.74
8 4949 0 4980 -0.63 4799 3.03 4635 6.34 4105 17.05 4445 10.18 4061 17.94 4367 11.76
12 4948 0 4936 0.24 4759 3.82 4616 6.71 4112 16.90 4478 9.50 4076 17.62 4390 11.28
16 4944 0 4931 0.26 4754 3.84 4601 6.94 4131 16.44 4476 9.53 4077 17.54 4384 11.33
32 4764 0 4753 0.23 4596 3.53 4455 6.49 3984 16.37 4286 10.03 3945 17.19 4200 11.84
48 4682 0 4681 0.02 4519 3.48 4377 6.51 3908 16.53 4202 10.25 3860 17.56 4111 12.20
64 (*) 4148 0 3881 6.44 2583 37.73 3653 11.93 2916 29.70 3082 25.70 2863 30.98 3496 15.72
128 (*) 3659 0 2034 44.41 2024 44.68 1983 45.80 1924 47.42 2011 45.04 1886 48.46 1939 47.01
256 (*) 2514 0 1746 30.55 1732 31.11 1690 32.78 1632 35.08 1724 31.42 1640 34.77 1646 34.53
We can see some progress:
  • When all consumers are off, the overhead is in the 3% to 4% range, which is much better compared to ~8%.
  • The "global (timed)" and "thread (timed)" configurations still have an important overhead because they use the nanosecond timer, but this overhead decreased, which shows the benefits of the code optimizations.
  • The "thread (all cycles)" overhead is in the ~11% range.
Now that's better. With every instruments enabled and timed, for all waits stages and statements, and with statistics collected globally and per thread, 11% for this wealth of information is not so bad. Especially knowing that the overhead can be changed back to 3-4% at anytime, since all this configuration can be changed dynamically.

Take aways

The most important facts found here:
  • Timer performances do vary between platforms, and have a great impact. Check the timer performance itself with table performance_schema.performance_timers before deciding to time events.
  • The performance schema code itselt can be improved. Current optimizations are available in mysql-trunk-pfs-tuning, as a preview.
  • Overall, the performance schema captures instrumentation events at a very high rate. Limiting the choice of consumers and instruments to use is a good way to cut down the overhead.
  • There is no magic formula that says "overhead is X". Overhead depends so much on the work load, the hardware, the configuration, etc, that the best answer is "it depends".
The real question about overhead, given that it depends so much, really boils down to "What is the overhead for my application ?". The best way to find that is to try it ...
-- Marc Alff

PlanetMySQL Voting: Vote UP / Vote DOWN
Source Article
Comments
0
Be the first to comment

Join with account you already have


Sign in with Twitter account
Sign in with Facebook account
Sign in with Google Friend Connect
avatar
Tags: place query, buffer pool, impact production, host table, internal performance, plural form, specific place, configuration changes, benchmarks, schema, many thin
compareStringObjects performance tuning
Nov 28, 2010
Hi, Is there a reason for converting the numbers to strings in compareStringObjects? If the both inputs are RAW, they are compared byte-by-byte, that's fine. If not then if (a->encoding != REDIS_ENCODING_RAW) { …

Service performance tuning.
Jun 12, 2011
Hi, Considering the fact that it is a critical service, what can be implemented to improve the postgreSQL service. Hailu

Tomcat performance tuning
Sep 24, 2010
Support, I Installed tomcat 5.5.20 version in my Windows 2008 server and the java version is Java 1.6.0.b105.We are using 2 Mbps Bandwidth Leased line.In tomcat we hosted around 5(five) application.When the no of seesion increases more than 1600…

Need help for performance tuning in apache
Jun 8, 2011
Hi All, I am using Apache-2.2.x as load balancer and tomcat as backend server. Can you please somebody tell what are the options to boost the apache load balancer's performance? Thanks, Narendra

Performance Tuning Tomcat 7...
Feb 12, 2011
Is there any performance tuning that can be done to Tomcat 7 besides just the standard jvm arguments? Regards, -Tony

Re: performance tuning - where does the slowness come from?
Jun 29, 2010
Weijunli, I also have an environment that has similar very large datasets with strict latency. Can you please elaborate on the custom changes you added to cassandra to meet these sla, either code or configuration. i am very interested in…

Need some help tuning for large file performance
Sep 16, 2010
Hi! I am running some performance test with large files. As mentioned in one of the earlier threads I am using curl-loader for testing with randomization in the URL to stress the cache. Version: 2.0.1 Config: CONFIG…

Performance tuning the MySQL persistence store
Mar 16, 2011
Hi, In my test Voldemort cluster (running version 0.81) I've been experimenting with getting better Voldemort performance when using MySQL for persistence, and a couple of questions have come up: Is there any way to configure the number of JDBC…

Scanning through 12k rows, performance tuning suggestions?
Aug 11, 2010
Hi list, I just uploaded a screen shot of some profiling result to flickr: http://www.flickr.com/photos/alexdong/4884059360/ This query will scan around 12k rows, my c++ code will read one column and parse the row key into 3 parts. I've already…

General Memory/Performance Tuning Guidelines
May 23, 2011
Greetings! We are working with a riaksearch cluster that uses innostore as the primary backend in tandem with merge_index that is required by search. From reading the Basho wiki it looks like the following are the most important factors affecting…