[Database] MySQL Tuning Basics (2): Linux Memory Management


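A process needs memory in order to run. The figure below shows how the memory of a Linux process is laid out:

The most important segments are the heap segment and the stack segment; the other segments are essentially fixed in size. Note that the stack grows toward lower addresses, the opposite of the heap. A process's addresses start at 0 because they are virtual addresses, so there has to be a mapping from virtual memory to physical memory. Servers today are generally 64-bit and 32-bit machines have become rare; a 32-bit address space severely limits how much memory can be used.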

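1. Linux virtual memory

Linux manages memory through virtual memory, and there is a mapping between virtual memory and physical memory. When a process runs on the CPU, the virtual pages it touches are mapped to physical memory so that the CPU can access them.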

Applications do not allocate physical memory, but request a memory map of a certain size at the Linux kernel and in exchange receive a map in virtual memory. As you can see, virtual memory does not necessarily have to be mapped into physical memory. If your application allocates a large amount of memory, some of it might be mapped to the swap file on the disk subsystem.

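Illustration: process virtual memory = process physical memory + process swap (paged-out pages):

The figure above is a screenshot of the top command. It shows mysqld using 735M of virtual memory while its resident physical memory is 430M, so the remaining 305M is not resident. That does not mean it was swapped out: most of it simply has not been allocated yet, because the kernel allocates pages lazily.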

VIRT:The total amount of virtual memory used by the task. It includes all code, data and shared libraries plus pages that have been swapped out.

RES: Resident size (kb). The non-swapped physical memory a task is using (resident memory).
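As a rough cross-check of the top figures, the same quantities can be read from /proc/<pid>/status, where VmSize corresponds to VIRT, VmRSS to RES, and VmSwap to the part that really sits in swap. The sketch below parses such text. The function name is hypothetical and the sample is illustrative, not captured output; it reuses the VSZ and RSS values that ps reports for mysqld later in this article.

```python
def parse_vm_fields(status_text):
    """Return {field: kB} for the Vm* lines of /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        if line.startswith("Vm"):
            name, value = line.split(":", 1)
            fields[name] = int(value.split()[0])  # values are reported in kB
    return fields


# Illustrative sample (assumption): about 735M virtual, 430M resident, no swap.
SAMPLE_STATUS = """\
Name:   mysqld
VmSize:   752744 kB
VmRSS:    441212 kB
VmSwap:        0 kB
"""

vm = parse_vm_fields(SAMPLE_STATUS)
not_resident = vm["VmSize"] - vm["VmRSS"]
print(not_resident, "kB of the address space is not resident")
print(vm["VmSwap"], "kB is actually in swap")
```

On a live system the text could come from reading /proc/<pid>/status for the mysqld pid. A large gap between VmSize and VmRSS with VmSwap at 0 points to lazy allocation rather than swapping.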

Linux handles the memory resource far more efficiently. The default configuration of the virtual memory manager allocates all available free memory space as disk cache. Hence it is not unusual to see productive Linux systems that boast gigabytes of memory but only have 20 MB of that memory free. In the same context, Linux also handles swap space very efficiently. Swap space being used does not indicate a memory bottleneck but proves how efficiently Linux handles system resources.

There is no need to be alarmed if you find the swap partition filled to 50%. The fact that swap space is being used does not indicate a memory bottleneck; instead it proves how efficiently Linux handles system resources.

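So a system that shows very little free memory does not necessarily have a memory bottleneck, and swap space being in use does not indicate one either.

Memory allocation

Linux manages memory in units of pages, typically 4 KB each. The kernel maintains lists of free memory from which it allocates, and it tries to keep free memory contiguous to prevent fragmentation. This mechanism is called the buddy system, and the allocation and management of memory rely on it.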
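The splitting and merging behaviour of a buddy system can be illustrated with a toy model. The sketch below is not the kernel's implementation; the class name, the use of Python sets as free lists, and the 16-page arena are all assumptions chosen to keep the example small.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order pages (illustration only)."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the start page of every free block of 2**k pages
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)
        self.allocated = {}  # start page -> order of the allocated block

    def alloc(self, order):
        """Allocate a block of 2**order pages; return its start page or None."""
        for k in range(order, self.max_order + 1):
            if self.free_lists[k]:
                start = min(self.free_lists[k])
                self.free_lists[k].remove(start)
                break
        else:
            return None  # no free block is large enough
        # Split the block down to the requested size, freeing each upper half.
        while k > order:
            k -= 1
            self.free_lists[k].add(start + (1 << k))
        self.allocated[start] = order
        return start

    def free(self, start):
        """Free a block and merge it with its buddy as long as the buddy is free."""
        order = self.allocated.pop(start)
        while order < self.max_order:
            buddy = start ^ (1 << order)  # buddy differs only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)


arena = BuddyAllocator(max_order=4)          # 16 pages in total
a = arena.alloc(0)                           # one page
b = arena.alloc(0)                           # its buddy
c = arena.alloc(1)                           # a two-page block
for block in (a, b, c):
    arena.free(block)
print(arena.free_lists[4])                   # the arena is one 16-page block again
```

Because freed blocks merge with their buddies, the arena ends up as a single contiguous block again, which is the anti-fragmentation property described above.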

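Memory reclaim (page frame reclaiming)

When free memory runs short, memory has to be reclaimed. There are two ways to do this: reclaim the page cache (disk cache) used for caching disk files, or swap out / page out memory that belongs to other, inactive processes. Memory used for the file cache (disk cache) is reclaimed first: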

When kswapd reclaims pages, it would rather shrink the page cache than page out (or swap out) the pages owned by processes.

The kernel then scans the active list and the inactive list: following an LRU policy it moves pages from the active list to the inactive list, and then swaps out pages from the inactive list.
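The interplay of the two lists can be sketched with a heavily simplified model. This is an assumption-laden illustration, not the kernel's algorithm: the class and method names are hypothetical, and the rule that a page is promoted on its second access is a simplification.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy model of an active list and an inactive list (oldest entry first)."""

    def __init__(self):
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def touch(self, page):
        """Record an access: new pages start inactive, a repeat access promotes."""
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True
        else:
            self.inactive[page] = True

    def age(self, count):
        """Demote the `count` least recently used active pages."""
        for _ in range(min(count, len(self.active))):
            page, _flag = self.active.popitem(last=False)
            self.inactive[page] = True

    def reclaim(self, count):
        """Evict up to `count` pages from the cold end of the inactive list."""
        evicted = []
        for _ in range(min(count, len(self.inactive))):
            page, _flag = self.inactive.popitem(last=False)
            evicted.append(page)
        return evicted


lru = TwoListLRU()
for page in "abca":          # page "a" is accessed twice and becomes active
    lru.touch(page)
first = lru.reclaim(1)       # only cold, inactive pages are candidates
lru.age(1)                   # memory pressure: demote the oldest active page
second = lru.reclaim(2)
print(first, second)
```

The point of the two lists is visible in the run: page "a" survives the first reclaim because it is active, and only becomes evictable after it has been aged onto the inactive list.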

The active list and the inactive list can be inspected with vmstat -a:

[root@localhost ~]# vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 462024  72680 471416    0    0    75     6  182  107  1 13 85  0  0

kswapd: kernel swap daemon

The two main uses of memory

The pages are used mainly for two purposes: page cache and process address space. The page cache is pages mapped to a file on disk. The pages that belong to a process address space (called anonymous memory because it is not mapped to any files, and it has no name) are used for heap and stack.

1) disk cache (page cache, file cache);

2) process memory (anonymous memory: heap and stack).

kswapd handles swapping pages out when memory has to be reclaimed, while pdflush flushes dirty disk cache pages to disk.
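To see how memory on a machine divides between these two uses, /proc/meminfo can be parsed. The sketch below treats Buffers + Cached as the disk cache and AnonPages as process memory, which is only an approximation (Cached also includes shared memory, for example). The function names are hypothetical and the AnonPages value in the sample is made up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def memory_split(info):
    """Return (file cache kB, anonymous kB) as a rough split of page usage."""
    file_cache = info["Buffers"] + info["Cached"]
    anonymous = info["AnonPages"]
    return file_cache, anonymous


# Sample text: AnonPages is an invented value for illustration.
SAMPLE_MEMINFO = """\
MemTotal:        1030548 kB
MemFree:          400264 kB
Buffers:           55388 kB
Cached:            82428 kB
AnonPages:        450000 kB
"""

cache_kb, anon_kb = memory_split(parse_meminfo(SAMPLE_MEMINFO))
print("disk cache:", cache_kb, "kB; anonymous:", anon_kb, "kB")
```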

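2. How to minimize the impact of swap on MySQL

The kernel has a dedicated parameter that controls how readily kswapd swaps:

[root@localhost ~]# cat /proc/sys/vm/swappiness
60

Setting vm.swappiness = 0 tells the system, when memory runs short, to avoid swapping as far as possible and to flush the disk cache instead. However, newer Linux kernels changed the interpretation of vm.swappiness=0: with a value of 0 an OOM may occur and mysqld may be killed. For newer kernels (2.6.32-303.el6 and later) the recommended approach is:

1) Make sure the operating system always has enough memory;

2) On the newest kernels, set vm.swappiness to 1;

3) Consider setting /proc/$(pidof -s mysqld)/oom_adj to a lower value, to make it less likely that MySQL is shut down when memory runs out.

For details, see: http://www.woqutech.com/?p=1397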
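A small check along these lines could be scripted. The sketch below only restates the guidance given above; the function names and the wording of the messages are assumptions, and the advice for 0 applies to the newer kernels mentioned in the text.

```python
def parse_swappiness(text):
    """Convert the content of /proc/sys/vm/swappiness to an int."""
    return int(text.strip())


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Read the current value from procfs (Linux only)."""
    with open(path) as handle:
        return parse_swappiness(handle.read())


def swappiness_advice(value):
    """Summarize the article's guidance for a given vm.swappiness value."""
    if value == 0:
        return "0: on newer kernels mysqld may be OOM-killed; consider 1"
    if value == 1:
        return "1: recommended on newer kernels"
    return f"{value}: consider lowering it to 1 on a dedicated MySQL host"


print(swappiness_advice(parse_swappiness("60\n")))
```

On a real host, swappiness_advice(read_swappiness()) would report on the live setting.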

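3. How to change the oom_adj value

Check the oom_adj value of mysqld:

[root@localhost ~]# cat /proc/`pidof -s mysqld`/oom_adj
0
[root@localhost ~]# cat /proc/$(pidof -s mysqld)/oom_adj
0

The default value is 0. When it is set to -17, the OOM killer is disabled for that process and the process will not be killed. To change it:

[root@localhost ~]# echo -17 > /proc/$(pidof mysqld)/oom_adj
[root@localhost ~]# cat /proc/$(pidof mysqld)/oom_adj
-17

Why -17? This comes from the Linux implementation. The kernel's oom.h file contains the following definitions:

/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

The value of oom_adj ranges from -16 to 15, with -17 as the special value that disables the OOM killer. The larger the value, the more likely the process is to be killed. oom_score is a value computed from it, and the kernel uses oom_score to choose which processes to kill.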
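Before writing a value to oom_adj it can help to validate it against these constants. The sketch below mirrors the three definitions quoted from oom.h; the helper names and message strings are hypothetical.

```python
# Constants mirrored from the oom.h excerpt above.
OOM_DISABLE = -17
OOM_ADJUST_MIN = -16
OOM_ADJUST_MAX = 15


def describe_oom_adj(value):
    """Classify an oom_adj value, rejecting anything outside the valid range."""
    if value == OOM_DISABLE:
        return "exempt from the OOM killer"
    if OOM_ADJUST_MIN <= value <= OOM_ADJUST_MAX:
        return "eligible; higher values are killed first"
    raise ValueError(
        f"oom_adj must be between {OOM_DISABLE} and {OOM_ADJUST_MAX}, got {value}"
    )


def oom_adj_path(pid):
    """Build the procfs path used in the shell commands above."""
    return f"/proc/{pid}/oom_adj"


for candidate in (-17, 0, 15):
    print(oom_adj_path(1959), candidate, describe_oom_adj(candidate))
```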

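In summary, the analysis above shows that the OOM mechanism is triggered when one of the following conditions holds:

1) The VM cannot allocate any more pages. (Note that the Linux kernel allocates pages lazily, that is, only when they are actually used, so memory is really consumed only by malloc followed by memset.)

2) The user address space is exhausted. This happens on 32-bit machines when user space exceeds 3 GB and is unlikely to happen on 64-bit machines.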
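The lazy allocation mentioned in condition 1 can be observed from user space. The sketch below is a rough experiment under several assumptions: it reads the resident page count from /proc/self/statm (Linux only, so it returns None elsewhere), it uses an anonymous private mmap in place of malloc, and the function names are hypothetical. Exact numbers vary from system to system.

```python
import mmap
import os


def resident_kb():
    """Resident set size of this process in kB, or None without /proc."""
    try:
        with open("/proc/self/statm") as handle:
            pages = int(handle.read().split()[1])  # second field: resident pages
    except OSError:
        return None
    return pages * os.sysconf("SC_PAGE_SIZE") // 1024


def lazy_allocation_demo(size_mb=32):
    """Return (RSS growth after mapping, RSS growth after touching), in kB."""
    before = resident_kb()
    if before is None:
        return None
    size = size_mb * 1024 * 1024
    # Reserving address space: comparable to malloc without memset.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = resident_kb()
    # Writing one byte per page forces the kernel to back each page.
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 1
    after_touch = resident_kb()
    buf.close()
    return after_map - before, after_touch - after_map


result = lazy_allocation_demo()
if result is not None:
    print("RSS growth after mmap: %d kB, after touching: %d kB" % result)
```

On a typical Linux system the first number stays close to zero and the second is close to the mapped size, which is why a large VIRT by itself says little about real memory use.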

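For details, see: http://blog.chinaunix.net/uid-20788636-id-4308527.html

In practice, setting oom_adj for mysqld is not the best choice: if mysqld cannot be killed, some other process will inevitably be killed instead. It is better to make sure there is enough memory, or to set vm.swappiness=1.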

4. Detecting memory bottlenecks

To find a memory bottleneck on Linux, the main thing to check is whether significant swapping (swap out / page out) is taking place. The amount of free memory and the fact that the swap partition is in use do not prove anything by themselves.

Distinguishing swap out from page out:

Page out moves individual pages to swap space on the disk; swapping is a bigger operation that moves the entire address space of a process to swap space in one operation.

Using vmstat to check swap out system-wide

[root@localhost ~]# vmstat 2
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 400776  55292  82416    0    0    33     5  103   87  0  6 94  0  0
 0  0      0 400768  55292  82416    0    0     0     0   54   65  0  2 98  0  0
 0  0      0 400768  55292  82416    0    0     0     0   69   72  0  3 97  0  0
 0  0      0 400644  55300  82416    0    0     0    18   67   79  0  3 97  0  0
 0  0      0 400644  55300  82416    0    0     0     0   51   61  0  2 98  0  0
 0  0      0 400644  55300  82416    0    0     0     0   64   69  0  2 98  0  0
 0  0      0 400644  55308  82416    0    0     0    20   58   73  0  2 98  0  0

In the swap columns, si is the amount swapped in per second and so is the amount swapped out per second:

  Swap
    si: Amount of memory swapped in from disk (/s).
    so: Amount of memory swapped to disk (/s).
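Watching si and so by eye works for a quick look; for longer captures the output can be parsed. The sketch below assumes the column layout shown above, and its function names are hypothetical. The sample rows are the first two from the vmstat run in this article.

```python
def parse_vmstat(text):
    """Parse `vmstat` output into a list of {column: int} rows."""
    rows, header = [], None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] == "procs":
            continue                      # blank line or the banner line
        if tokens[0] == "r":
            header = tokens               # the column-name line
        elif header and len(tokens) == len(header):
            rows.append(dict(zip(header, map(int, tokens))))
    return rows


def swapping_samples(rows):
    """Return the samples in which any swap-in or swap-out occurred."""
    return [row for row in rows if row["si"] > 0 or row["so"] > 0]


SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 400776  55292  82416    0    0    33     5  103   87  0  6 94  0  0
 0  0      0 400768  55292  82416    0    0     0     0   54   65  0  2 98  0  0
"""

samples = parse_vmstat(SAMPLE_VMSTAT)
print(len(samples), "samples,", len(swapping_samples(samples)), "with swap activity")
```

Sustained non-zero si/so values across many samples are the signal to look for; an occasional non-zero sample is usually not a concern.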

Using sar -B to check page out system-wide

[root@localhost ~]# sar -B
Linux 2.6.32-504.el6.i686 (localhost.localdomain)    10/01/2015   _i686_ (1 CPU)

10:57:33 AM    LINUX RESTART

11:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
11:10:01 AM     39.84      4.85    340.32      0.21     39.40      0.00      0.00      0.00      0.00
11:20:01 AM      0.06      2.76     10.69      0.00      3.21      0.00      0.00      0.00      0.00
11:30:01 AM      0.14      2.68     10.16      0.00      3.08      0.00      0.00      0.00      0.00
11:40:01 AM     69.58     13.07    154.16      0.01     47.29      0.00      0.00      0.00      0.00
11:50:01 AM      1.84      3.93     28.39      0.02      9.17      0.00      0.00      0.00      0.00
12:00:01 PM      0.00      3.20     19.70      0.00     10.87      0.00      0.00      0.00      0.00
12:10:01 PM      0.01      2.90     31.96      0.00      8.77      0.00      0.00      0.00      0.00
12:20:01 PM      0.06      3.06     40.04      0.00     10.98      0.00      0.00      0.00      0.00
12:30:02 PM      2.17      3.81     81.19      0.02     21.63      0.00      0.00      0.00      0.00
Average:        12.62      4.47     79.63      0.03     17.15      0.00      0.00      0.00      0.00

03:01:38 PM    LINUX RESTART

03:10:01 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:20:01 PM      6.22      3.99     93.05      0.04     22.89      0.00      0.00      0.00      0.00
Average:         6.22      3.99     93.05      0.04     22.89      0.00      0.00      0.00      0.00
[root@localhost ~]# sar -B 2 3
Linux 2.6.32-504.el6.i686 (localhost.localdomain)    10/01/2015   _i686_ (1 CPU)

03:24:05 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
03:24:07 PM      0.00      0.00     26.63      0.00     30.15      0.00      0.00      0.00      0.00
03:24:09 PM      0.00      0.00     19.70      0.00     30.30      0.00      0.00      0.00      0.00
03:24:11 PM      0.00      0.00     15.00      0.00     30.00      0.00      0.00      0.00      0.00
Average:         0.00      0.00     20.44      0.00     30.15      0.00      0.00      0.00      0.00

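Without arguments, sar -B prints the samples recorded so far today (here one every 10 minutes) followed by their average; sar -B 2 3 samples every 2 seconds, 3 times in total. The output fields have the following meaning: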

    -B   Report paging statistics. Some of the metrics below are available only with post 2.5 kernels. The following values are displayed:

       pgpgin/s
           Total number of kilobytes the system paged in from disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

       pgpgout/s
           Total number of kilobytes the system paged out to disk per second. Note: With old kernels (2.2.x) this value is a number of blocks per second (and not kilobytes).

       fault/s
           Number of page faults (major + minor) made by the system per second. This is not a count of page faults that generate I/O, because some page faults can be resolved without I/O.

       majflt/s
           Number of major faults the system has made per second, those which have required loading a memory page from disk.

       pgfree/s
           Number of pages placed on the free list by the system per second.

       pgscank/s
           Number of pages scanned by the kswapd daemon per second.

       pgscand/s
           Number of pages scanned directly per second.

       pgsteal/s
           Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.

       %vmeff
           Calculated as pgsteal / pgscan, this is a metric of the efficiency of page reclaim. If it is near 100% then almost every page coming off the tail of the inactive list is being reaped. If it gets too low (e.g. less than 30%) then the virtual memory is having some difficulty. This field is displayed as zero if no pages have been scanned during the interval of time.

pgpgout/s is the number of kilobytes paged out per second. majflt/s is also an extremely important metric; it is tied to the page fault mechanism of virtual memory.
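The %vmeff definition quoted above can be written out as a small worked formula. The sketch below takes pgscan to be pgscank + pgscand, which is an assumption about how the two scan counters combine; the function names are hypothetical and the 30% threshold is the one mentioned in the man page excerpt.

```python
def vmeff(pgsteal, pgscank, pgscand):
    """Reclaim efficiency in percent: pages reclaimed per page scanned."""
    scanned = pgscank + pgscand
    if scanned == 0:
        return 0.0                # sar also displays zero when nothing was scanned
    return 100.0 * pgsteal / scanned


def reclaim_in_trouble(pgsteal, pgscank, pgscand, threshold=30.0):
    """True when pages are being scanned but few of them can be reclaimed."""
    scanned = pgscank + pgscand
    return scanned > 0 and vmeff(pgsteal, pgscank, pgscand) < threshold


print(vmeff(0.0, 0.0, 0.0))        # the idle samples in the sar output above
print(vmeff(50.0, 80.0, 20.0))     # hypothetical: 50 reclaimed of 100 scanned
print(reclaim_in_trouble(10.0, 80.0, 20.0))
```

In the sar output shown earlier all scan counters are 0.00, so %vmeff is reported as zero simply because no reclaim was needed.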

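The page fault mechanism of virtual memory

Linux uses a virtual memory layer to map the physical address space. When a process starts running, the kernel maps only the parts the process actually needs. On an access, the kernel first looks in the CPU caches and in physical memory; if the data is not found there, it raises a major page fault (MPF). An MPF is a request to the disk subsystem that reads the pages from disk into RAM, where they are cached. Once pages are in the in-memory cache, the kernel tries to use them directly, which is called a minor page fault (MnPF). By reusing pages that are already in memory, an MnPF shortens the time spent in the kernel.

The file cache (disk cache) lets the kernel reduce the number of MPFs and MnPFs. As the system keeps doing I/O, the cache grows until free memory runs short and reclaim begins.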
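Per-process fault counts are exposed in /proc/<pid>/stat, where field 10 is minflt and field 12 is majflt. The sketch below extracts both; the function name is hypothetical and the sample line is invented for illustration.

```python
def parse_fault_counters(stat_line):
    """Return (minor faults, major faults) from a /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split only after the last closing parenthesis.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return int(rest[7]), int(rest[9])


# Invented sample: 120345 minor faults and 37 major faults.
SAMPLE_STAT = "1959 (mysqld) S 1 1958 1958 0 -1 4202752 120345 0 37 0 250 410 0 0 20 0 22 0"

minor, major = parse_fault_counters(SAMPLE_STAT)
print("minor faults:", minor, "major faults:", major)
```

Sampling these counters twice and taking the difference gives a per-process fault rate, which complements the system-wide majflt/s figure from sar.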

Using free to check free memory

[root@localhost ~]# free
             total       used       free     shared    buffers     cached
Mem:       1030548     630284     400264        220      55388      82428
-/+ buffers/cache:     492468     538080
Swap:      1048572          0    1048572
[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          1006        616        390          0         54         80
-/+ buffers/cache:        481        524
Swap:         1023          0       1023

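The machine has 1 GB of memory and a 1 GB swap partition; 616M is used and 390M is free. The swap partition is unused and entirely free.

A small free value proves nothing by itself, but a large free value does show that memory is plentiful.

If most or all of the swap space is in use, that can also indicate heavy swapping, although it is best to judge this together with vmstat.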
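The "-/+ buffers/cache" line of free is plain arithmetic on the Mem: line, and working it through shows why a small free value is not alarming. The sketch below reproduces it with the numbers from the output above; the function name is hypothetical.

```python
def buffers_cache_adjusted(used, free, buffers, cached):
    """Return (used, free) in kB with buffers and cache counted as available."""
    really_used = used - buffers - cached
    really_free = free + buffers + cached
    return really_used, really_free


# Values in kB from the Mem: line of the `free` output above.
adjusted = buffers_cache_adjusted(used=630284, free=400264,
                                  buffers=55388, cached=82428)
print(adjusted)  # (492468, 538080), matching the "-/+ buffers/cache" line
```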

Using ps -mp 1959 -o THREAD,pmem,rss,vsz,tid,pid to view the memory and CPU usage of mysqld

[root@localhost ~]# pidof -s mysqld
1959
[root@localhost ~]# ps -mp 1959 -o THREAD,pmem,rss,vsz,tid,pid
USER     %CPU PRI SCNT WCHAN  USER SYSTEM %MEM    RSS    VSZ   TID   PID
mysql     0.6   -    - -         -      - 42.8 441212 752744     -  1959
mysql     0.1  19    - -         -      -    -      -      -  1959     -
mysql     0.0  19    - -         -      -    -      -      -  1962     -
mysql     0.0  19    - -         -      -    -      -      -  1963     -
mysql     0.0  19    - -         -      -    -      -      -  1964     -
mysql     0.0  19    - -         -      -    -      -      -  1965     -
mysql     0.0  19    - -         -      -    -      -      -  1966     -
mysql     0.0  19    - -         -      -    -      -      -  1967     -
mysql     0.0  19    - -         -      -    -      -      -  1968     -
mysql     0.0  19    - -         -      -    -      -      -  1969     -
mysql     0.0  19    - -         -      -    -      -      -  1970     -
mysql     0.0  19    - -         -      -    -      -      -  1971     -
mysql     0.0  19    - -         -      -    -      -      -  1973     -
mysql     0.0  19    - -         -      -    -      -      -  1974     -
mysql     0.0  19    - -         -      -    -      -      -  1975     -
mysql     0.0  19    - -         -      -    -      -      -  1976     -
mysql     0.0  19    - -         -      -    -      -      -  1977     -
mysql     0.0  19    - -         -      -    -      -      -  1978     -
mysql     0.0  19    - -         -      -    -      -      -  1979     -
mysql     0.0  19    - -         -      -    -      -      -  1980     -
mysql     0.0  19    - -         -      -    -      -      -  1981     -
mysql     0.0  19    - -         -      -    -      -      -  1982     -

Using pmap to view the memory map of a process

The pmap command reports the memory map of a process or processes.

[root@localhost ~]# pmap -x 1959
1959:  /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/var/lib/mysql --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/mysql/mysqld.pid --socket=/var/lib/mysql/mysql.sock
Address   Kbytes     RSS   Dirty Mode   Mapping
00297000       4       4       0 r-x--    [ anon ]
002e0000      48      20       0 r-x--  libnss_files-2.12.so
002ec000       4       4       4 r----  libnss_files-2.12.so
002ed000       4       4       4 rw---  libnss_files-2.12.so
003fb000     116      60       0 r-x--  libgcc_s-4.4.7-20120601.so.1
00418000       4       4       4 rw---  libgcc_s-4.4.7-20120601.so.1
0041b000      28       8       0 r-x--  libcrypt-2.12.so
00422000       4       4       4 r----  libcrypt-2.12.so
00423000       4       4       4 rw---  libcrypt-2.12.so
00424000     156       0       0 rw---    [ anon ]
0044d000     368     148       0 r-x--  libfreebl3.so
004a9000       4       0       0 -----  libfreebl3.so
004aa000       4       4       4 r----  libfreebl3.so
004ab000       4       4       4 rw---  libfreebl3.so
004ac000      16      12      12 rw---    [ anon ]
0053e000     120     100       0 r-x--  ld-2.12.so
0055c000       4       4       4 r----  ld-2.12.so
0055d000       4       4       4 rw---  ld-2.12.so
00560000       4       4       0 r-x--  libaio.so.1.0.1
00561000       4       4       4 rw---  libaio.so.1.0.1
00564000    1600     680       0 r-x--  libc-2.12.so
006f4000       8       8       8 r----  libc-2.12.so
006f6000       4       4       4 rw---  libc-2.12.so
006f7000      12      12      12 rw---    [ anon ]
006fc000      92      84       0 r-x--  libpthread-2.12.so
00713000       4       4       4 r----  libpthread-2.12.so
00714000       4       4       4 rw---  libpthread-2.12.so
00715000       8       4       4 rw---    [ anon ]
00719000      12       8       0 r-x--  libdl-2.12.so
0071c000       4       4       4 r----  libdl-2.12.so
0071d000       4       4       4 rw---  libdl-2.12.so
00720000      28      16       0 r-x--  librt-2.12.so
00727000       4       4       4 r----  librt-2.12.so
00728000       4       4       4 rw---  librt-2.12.so
0072b000     160      28       0 r-x--  libm-2.12.so
00753000       4       4       4 r----  libm-2.12.so
00754000       4       4       4 rw---  libm-2.12.so
07b14000     900     400       0 r-x--  libstdc++.so.6.0.13
07bf5000      16      16      12 r----  libstdc++.so.6.0.13
07bf9000       8       8       8 rw---  libstdc++.so.6.0.13
07bfb000      24       8       8 rw---    [ anon ]
08048000   12096    4284       0 r-x--  mysqld
08c18000    1224     468     304 rw---  mysqld
08d4a000     256     252     252 rw---    [ anon ]
0a809000    5492    5396    5396 rw---    [ anon ]
8abfd000       4       0       0 -----    [ anon ]
8abfe000   10240       4       4 rw---    [ anon ]
8b5fe000       4       0       0 -----    [ anon ]
8b5ff000   10240       4       4 rw---    [ anon ]
8bfff000       4       0       0 -----    [ anon ]
8c000000   10240       8       8 rw---    [ anon ]
8ca00000    1024     436     436 rw---    [ anon ]
8cbf7000       4       0       0 -----    [ anon ]
8cbf8000   10240      16      16 rw---    [ anon ]
8d5f8000       4       0       0 -----    [ anon ]
8d5f9000   10240       8       8 rw---    [ anon ]
8dff9000       4       0       0 -----    [ anon ]
8dffa000   10240       4       4 rw---    [ anon ]
8e9fa000       4       0       0 -----    [ anon ]
8e9fb000   10240       4       4 rw---    [ anon ]
8f3fb000       4       0       0 -----    [ anon ]
8f3fc000   10240       4       4 rw---    [ anon ]
8fdfc000       4       0       0 -----    [ anon ]
8fdfd000   12720    2468    2468 rw---    [ anon ]
90c00000     132       4       4 rw---    [ anon ]
90c21000     892       0       0 -----    [ anon ]
90d04000       4       0       0 -----    [ anon ]
90d05000     192      12      12 rw---    [ anon ]
90d35000       4       0       0 -----    [ anon ]
90d36000   10240       4       4 rw---    [ anon ]
91736000       4       0       0 -----    [ anon ]
91737000   10240       4       4 rw---    [ anon ]
92137000       4       0       0 -----    [ anon ]
92138000   10240       4       4 rw---    [ anon ]
92b38000       4       0       0 -----    [ anon ]
92b39000   10240       4       4 rw---    [ anon ]
93539000       4       0       0 -----    [ anon ]
9353a000   10240       4       4 rw---    [ anon ]
93f3a000       4       0       0 -----    [ anon ]
93f3b000   10240       4       4 rw---    [ anon ]
9493b000       4       0       0 -----    [ anon ]
9493c000   10240       4       4 rw---    [ anon ]
9533c000       4       0       0 -----    [ anon ]
9533d000   10240       4       4 rw---    [ anon ]
95d3d000       4       0       0 -----    [ anon ]
95d3e000   10240       8       8 rw---    [ anon ]
9673e000       4       0       0 -----    [ anon ]
9673f000  133548   19940   19940 rw---    [ anon ]
9e9ab000  407108  406096  406096 rw---    [ anon ]
b774b000       4       4       4 rw---    [ anon ]
bfc28000      84      56      56 rw---    [ stack ]
-------- ------- ------- ------- -------
total kB  752740       -       -       -

The fields above have the following meaning:

EXTENDED AND DEVICE FORMAT FIELDS
    Address:  start address of map
    Kbytes:   size of map in kilobytes
    RSS:      resident set size in kilobytes
    Dirty:    dirty pages (both shared and private) in kilobytes
    Mode:     permissions on map: read, write, execute, shared, private (copy on write)
    Mapping:  file backing the map, or '[ anon ]' for allocated memory, or '[ stack ]' for the program stack
    Offset:   offset into the file
    Device:   device name (major:minor)

The Mapping field shows whether the memory is backed by a mapped file, is [ anon ] memory allocated by the process itself, or is [ stack ] memory used by the stack.

The total kB of 752740 on the last line is consistent with the VSZ of 752744 (virtual memory) reported by the previous command.
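With this many mappings it is easier to aggregate the pmap output than to read it row by row. The sketch below groups rows into file-backed, [ anon ] and [ stack ] mappings and sums their size and RSS. It assumes the six-column pmap -x layout shown above; the function names are hypothetical, and the sample contains only four of the rows from the listing.

```python
def parse_pmap_rows(text):
    """Parse `pmap -x` rows into (mapping, kbytes, rss) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 5)
        # Skip the header, the separator and the "total kB" line.
        if len(parts) < 6 or not parts[1].isdigit():
            continue
        _address, kbytes, rss, _dirty, _mode, mapping = parts
        rows.append((mapping.strip(), int(kbytes), int(rss)))
    return rows


def totals_by_kind(rows):
    """Sum (Kbytes, RSS) per kind: 'file', '[ anon ]' or '[ stack ]'."""
    totals = {}
    for mapping, kbytes, rss in rows:
        kind = mapping if mapping.startswith("[") else "file"
        size, resident = totals.get(kind, (0, 0))
        totals[kind] = (size + kbytes, resident + rss)
    return totals


SAMPLE_PMAP = """\
Address   Kbytes     RSS   Dirty Mode   Mapping
08048000   12096    4284       0 r-x--  mysqld
08c18000    1224     468     304 rw---  mysqld
9e9ab000  407108  406096  406096 rw---    [ anon ]
bfc28000      84      56      56 rw---    [ stack ]
total kB  752740       -       -       -
"""

summary = totals_by_kind(parse_pmap_rows(SAMPLE_PMAP))
for kind, (size, resident) in sorted(summary.items()):
    print(kind, size, "kB mapped,", resident, "kB resident")
```

In the full listing the single 407108 kB [ anon ] region accounts for most of the resident memory of mysqld, which this kind of summary makes obvious at a glance.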

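5. Memory tuning

As discussed above, the signs of a memory bottleneck are swap out, page out and major page faults. They hurt performance badly, swap out in particular, so memory tuning means reducing them and preventing them from happening.

1) Using hugepages avoids swap out, but hugepages come at a cost, so be sure to test beforehand;

2) Adjust vm.swappiness so that the disk cache is flushed first, which keeps page out and swap out to a minimum; note, however, that flushing the disk cache may in turn cause major page faults;

3) Add memory.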