Linux page allocation failures — reading the fine print

In this document we will see how to decipher the messages logged by the Linux kernel when it fails to honour a page allocation request. Armed with the right knowledge, one can glean enough information from the message to see what is going on, and possibly tweak some of the virtual memory knobs provided by the Linux kernel to prevent such failures. Note, though, that depending on your workload and your system memory it might not always be possible to avoid page allocation failures.

You are more likely to see such failures for atomic allocations (done using the GFP_ATOMIC allocator flag), because for non-atomic allocations the kernel tries very hard to satisfy the request, even if it means killing some process (OOM killing) :-(

Of course, before resorting to the big-hammer OOM killer, the kernel will try more humane methods of freeing memory: reclaiming pages that are not actively used, writing back dirty file pages, swapping out anonymous pages, and so on.

We will look at a complete message logged by the Linux kernel when it encounters a page allocation failure, and try to decipher it piece by piece. This particular message was logged by a 2.6.24 Linux kernel running on a machine with 44GB of RAM and 24 logical processors.

swapper: page allocation failure. order:1, mode:0x4020
Pid: 0, comm: swapper Not tainted 2.6.24 #1

Call Trace:
 <IRQ>  [<ffffffff8028ff6e>] __alloc_pages+0x2fe/0x3d0
 [<ffffffff802ab90a>] alloc_pages_current+0x8a/0xe0
 [<ffffffff802b3074>] new_slab+0x224/0x260
 [<ffffffff802b339e>] __slab_alloc+0x2ee/0x410
 [<ffffffff881f69b3>] :e1000e:_kc_netdev_alloc_skb_ip_align+0x23/0x50
 [<ffffffff802b45d6>] __kmalloc_node_track_caller+0xe6/0xf0
 [<ffffffff881f69b3>] :e1000e:_kc_netdev_alloc_skb_ip_align+0x23/0x50
 [<ffffffff803f3206>] __alloc_skb+0x76/0x150
 [<ffffffff881f69b3>] :e1000e:_kc_netdev_alloc_skb_ip_align+0x23/0x50
 [<ffffffff881e5a16>] :e1000e:e1000_alloc_rx_buffers+0x1c6/0x240
 [<ffffffff881e6c5b>] :e1000e:e1000_clean_rx_irq+0x2cb/0x340
 [<ffffffff881e4611>] :e1000e:e1000_poll+0x1f1/0x530
 [<ffffffff803fb41a>] net_rx_action+0x12a/0x230
 [<ffffffff80244dc4>] __do_softirq+0x74/0xf0
 [<ffffffff8020d58c>] call_softirq+0x1c/0x30
 [<ffffffff8020ed9d>] do_softirq+0x3d/0x90
 [<ffffffff80244d45>] irq_exit+0x85/0x90
 [<ffffffff8020f005>] do_IRQ+0x85/0x100
 [<ffffffff8020c911>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80239b83>] finish_task_switch+0x33/0xb0
 [<ffffffff8047b290>] thread_return+0x3a/0x59a
 [<ffffffff803e48b0>] cpuidle_idle_call+0x0/0xd0
 [<ffffffff8020b3c0>] default_idle+0x0/0x50
 [<ffffffff803e48b0>] cpuidle_idle_call+0x0/0xd0
 [<ffffffff8020b3c0>] default_idle+0x0/0x50
 [<ffffffff8020b4ed>] cpu_idle+0xdd/0xf0
 [<ffffffff8021fee6>] start_secondary+0x2f6/0x420

Mem-info:
Node 0 DMA per-cpu:
CPU    0: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    1: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    2: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    3: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    4: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    5: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    6: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    7: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    8: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU    9: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   10: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   11: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   12: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   13: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   14: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   15: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   16: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   17: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   18: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   19: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   20: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   21: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   22: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
CPU   23: Hot: hi:    0, btch:   1 usd:   0   Cold: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    1: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    2: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    3: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    4: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    5: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    6: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    7: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    8: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    9: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   10: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   11: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   12: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   13: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   14: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   15: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   16: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   17: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   18: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   19: Hot: hi:  186, btch:  31 usd:   1   Cold: hi:   62, btch:  15 usd:   0
CPU   20: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   21: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   22: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   23: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
Node 0 Normal per-cpu:
CPU    0: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    1: Hot: hi:  186, btch:  31 usd:  11   Cold: hi:   62, btch:  15 usd:   0
CPU    2: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU    3: Hot: hi:  186, btch:  31 usd:  48   Cold: hi:   62, btch:  15 usd:   0
CPU    4: Hot: hi:  186, btch:  31 usd:  54   Cold: hi:   62, btch:  15 usd:   0
CPU    5: Hot: hi:  186, btch:  31 usd:  27   Cold: hi:   62, btch:  15 usd:   0
CPU    6: Hot: hi:  186, btch:  31 usd:  56   Cold: hi:   62, btch:  15 usd:   0
CPU    7: Hot: hi:  186, btch:  31 usd:  17   Cold: hi:   62, btch:  15 usd:   0
CPU    8: Hot: hi:  186, btch:  31 usd:  31   Cold: hi:   62, btch:  15 usd:   0
CPU    9: Hot: hi:  186, btch:  31 usd:  23   Cold: hi:   62, btch:  15 usd:   0
CPU   10: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   11: Hot: hi:  186, btch:  31 usd:  23   Cold: hi:   62, btch:  15 usd:   0
CPU   12: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   13: Hot: hi:  186, btch:  31 usd:  25   Cold: hi:   62, btch:  15 usd:   0
CPU   14: Hot: hi:  186, btch:  31 usd:  30   Cold: hi:   62, btch:  15 usd:   0
CPU   15: Hot: hi:  186, btch:  31 usd:  20   Cold: hi:   62, btch:  15 usd:   0
CPU   16: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   17: Hot: hi:  186, btch:  31 usd:  16   Cold: hi:   62, btch:  15 usd:   0
CPU   18: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   19: Hot: hi:  186, btch:  31 usd:  34   Cold: hi:   62, btch:  15 usd:   0
CPU   20: Hot: hi:  186, btch:  31 usd:   5   Cold: hi:   62, btch:  15 usd:   0
CPU   21: Hot: hi:  186, btch:  31 usd:  22   Cold: hi:   62, btch:  15 usd:   0
CPU   22: Hot: hi:  186, btch:  31 usd:   0   Cold: hi:   62, btch:  15 usd:   0
CPU   23: Hot: hi:  186, btch:  31 usd:  30   Cold: hi:   62, btch:  15 usd:   0
Active:1489549 inactive:10111786 dirty:39867 writeback:2 unstable:0
 free:131385 slab:574338 mapped:15186 pagetables:3260 bounce:0
Node 0 DMA free:10652kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:10176kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2172 48380 48380
Node 0 DMA32 free:187020kB min:5884kB low:7352kB high:8824kB active:92816kB inactive:1642252kB present:2224956kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 46207 46207
Node 0 Normal free:327868kB min:125156kB low:156444kB high:187732kB active:5865380kB inactive:38804892kB present:47316480kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 2*8kB 4*16kB 2*32kB 4*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 10652kB
Node 0 DMA32: 8592*4kB 2955*8kB 1160*16kB 1839*32kB 576*64kB 83*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 189304kB
Node 0 Normal: 76636*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 2*1024kB 0*2048kB 5*4096kB = 330032kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 0kB
Total swap = 0kB
Free swap:            0kB
13041664 pages of RAM
686887 reserved pages
8267230 pages shared
0 pages swap cached
13041664 pages of RAM
686887 reserved pages
8268757 pages shared
0 pages swap cached

The header

swapper: page allocation failure. order:1, mode:0x4020
Pid: 0, comm: swapper Not tainted 2.6.24 #1

This tells us that the failure was hit while the swapper process (the idle task) was running on the CPU where the allocation request failed. It does not necessarily mean that this was the process doing the allocation: the process listed here could simply have been running on that CPU when an interrupt came in, and the actual allocation that failed could have been made in that interrupt context (either hardirq or bottom-half context). The mode value tells us about the allocation flags that were used for the failed allocation. It is a bitwise OR of the various __GFP_* flags defined in include/linux/gfp.h. The important thing to note is the 0x20 bit, which signifies that it was an atomic allocation, i.e. the GFP_ATOMIC flag was used.

The order value tells us the order of the allocation, i.e. how many physically contiguous pages the request asked for. The number of pages is 2^order, so in our case 2^1 = 2 physically contiguous pages were requested.
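For reference, decoding mode 0x4020 against the __GFP_* values in 2.6.24's include/linux/gfp.h (the exact bit values are worth cross-checking against your own kernel's gfp.h) gives roughly:

0x0020  __GFP_HIGH   (what GFP_ATOMIC expands to: may dip into emergency reserves, must not sleep)
0x4000  __GFP_COMP   (added by the slab allocator because this is a multi-page, i.e. compound, allocation)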

Coming back to the question of whether the swapper process was indeed responsible for the allocation.

Atomic allocations are generally done in interrupt or softirq context, so for atomic allocation failures the process id (and name) typically belong to whichever process happened to be running on the CPU when the interrupt (in whose context the allocation was being done) was received. Having said that, there is nothing to stop an atomic allocation from being made in process context. It is not good practice, but the kernel does allow it, and hence someone someday might do it. In this case the process is the idle task, and since the idle task is part of the core kernel it is extremely unlikely to break the "no atomic allocation in process context" rule, but if I had to bet my life on this, I would check further!

We have to look at the backtrace to see who requested the allocation. In our case, it looks like the e1000 network interface card’s driver is trying to allocate buffers for its use.

Network drivers will typically replenish their buffers from their Rx interrupt handler.

Looking at the call sequence do_IRQ()->irq_exit()->do_softirq()->…->e1000_alloc_rx_buffers(), I am pretty confident that the swapper process did not cause the allocation; rather, it was interrupted by an e1000 network interrupt, which caused the e1000 driver to request buffer allocations to replenish its Rx ring.

Before continuing with our forensics, let's take a moment to understand the (bad) effects of higher order allocations.

Higher order allocations are more likely to fail than an order-0 (single page) allocation. This is because memory fragmentation can lead to situations where many pages are free but they are scattered as isolated single pages, so a request for physically contiguous pages cannot be satisfied.

Whenever we see an order-1 or higher allocation, we should sit back for a while and question whether the higher order allocation is really justified. In this particular case, I wondered why the e1000 driver was doing a >4096 byte allocation for Rx buffers when we were not using jumbo frames (the regular 1500 byte MTU was in use). The reason we are seeing order-1 allocations is that the e1000 driver does a kmalloc() for ~1524 bytes. This comes from the kmalloc-2048 slab, which in this kernel uses order-1 slabs (2 pages per slab, to minimize wastage of memory). In our case the 8K allocation is done by the new_slab() function when it tries to replenish the kmalloc-2048 slab.

root@cheetah:~# cat /sys/slab/kmalloc-2048/order
1
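A rough back-of-the-envelope, assuming SLUB's usual power-of-two kmalloc size classes, shows how a ~1524 byte buffer turns into an order-1 request:

1524 bytes rounds up to the 2048-byte size class, i.e. the kmalloc-2048 cache
kmalloc-2048 uses order-1 slabs: 2 pages = 8192 bytes per slab, giving 8192/2048 = 4 objects per slab

So every time the kmalloc-2048 cache runs out of objects, new_slab() must find two physically contiguous free pages, and that is the order-1 allocation which failed here.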

After this small digression, let's get back to our forensics.

The first thing to catch our attention is the per-CPU page details: one such line per CPU, and one block of per-CPU information for each memory zone.

CPU    1: Hot: hi:  186, btch:  31 usd:  11   Cold: hi:   62, btch:  15 usd:   0

Before we get into the details, a small theory lesson.

The Linux kernel page allocator has a little optimization where it maintains a per-CPU list of free pages for each memory zone. This is apart from the buddy allocator lists inherent to each zone. Order-0 pages can be quickly allocated from this list without going through the (more costly) buddy allocator. There are two lists for each CPU, the "hot" list and the "cold" list. The "hot" list is supposed to contain pages which are likely to be in the CPU caches, whereas the "cold" list contains pages which are not likely to be in the CPU caches. Depending on the requirement, callers can explicitly ask for hot or cold pages. For example, someone needing a page for DMA is better off using a cold page, since it does not need the page contents to be in the CPU caches. On the other hand, someone needing to access data from the page should ask for a hot page. Each list has the following three parameters, which are dumped here:

hi

This is the high watermark for this per-CPU list of pages. Once the number of free pages in this list goes above this, 'btch' (see below) pages are freed back to the buddy allocator. Note that this is a per-CPU list and hence only this CPU has access to these pages; we do not want too many such pages, otherwise we would be unfair to other CPUs wanting to allocate memory.

btch

This is the batch size for this list. The batch size is significant both for allocation and freeing: when the allocator finds the per-CPU list empty, it fills it by allocating btch pages from the buddy allocator, and as mentioned above, when the free list grows beyond hi pages the kernel frees btch pages back to the buddy allocator.

usd

This is a bit of a misnomer. It does not contain the number of pages currently used from the list, but rather the number of free pages currently sitting in the list. Note that this should normally stay below hi.

Note that the per-cpu list is a simple list of single pages, hence only order-0 pages can be allocated from the per-cpu lists.
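For the curious, the data structure behind these numbers looks roughly like this in 2.6.24 (paraphrased from include/linux/mmzone.h; field names may differ slightly in other versions). The 'usd', 'hi' and 'btch' columns in the dump correspond to count, high and batch respectively:

struct per_cpu_pages {
    int count;              /* pages currently sitting in the list ("usd") */
    int high;               /* high watermark: drain to buddy when exceeded ("hi") */
    int batch;              /* chunk size for refilling from / draining to buddy ("btch") */
    struct list_head list;  /* the list of free pages itself */
};

struct per_cpu_pageset {
    struct per_cpu_pages pcp[2];    /* index 0: hot list, index 1: cold list */
    /* ... per-zone statistics fields ... */
};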

In our case an order-1 allocation has failed, hence we need not look at the per-cpu lists. No matter how many pages are there in these free lists, our request could not have been satisfied by these pages.

OK, let's look at some more evidence.

Active:1489549 inactive:10111786 dirty:39867 writeback:2 unstable:0
 free:131385 slab:574338 mapped:15186 pagetables:3260 bounce:0

Let's first understand the various terms used above.

Active

This is the count of pages which the kernel feels are being actively used. It should roughly comprise the working sets of all the processes. The Linux kernel keeps moving pages from the active to the inactive list when they are not actively used. Finding out whether a page is actively used typically needs processor support; on x86/x86_64 the Linux kernel uses the Accessed bit in the page table entries.

These pages are not considered for freeing, but under extreme memory pressure the kernel might move active pages to the inactive list and eventually free them. They can be both dirty (page more recent than the on-disk copy) and clean (page same as the on-disk copy).

Inactive

These are the pages which are not in recent use and hence are good candidates for freeing. Again, these can be both dirty and clean.

Dirty

These are the pages which are more recent than their on-disk copy and hence need to be written back to disk before they can even be considered for freeing. The Linux kernel always tries to keep dirty pages within limits, and it adopts various techniques for this depending on the extent of dirty pages.

There is a pdflush daemon which periodically writes back dirty pages. It is woken up once every few seconds (default 5 secs) to do the dirty page writeback, and once it gets to run it tries to write back the dirty pages. The periodicity can be controlled by writing to /proc/sys/vm/dirty_writeback_centisecs. Note that the value is not in secs but in centisecs, so if you want to change the periodicity to 10 secs, you would write 1000 to /proc/sys/vm/dirty_writeback_centisecs.
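That is, for a 10 sec period:

root@cheetah:~# echo 1000 > /proc/sys/vm/dirty_writeback_centisecs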

Writing back dirty pages every 5 secs works very well for normal, no-memory-pressure situations, but under memory pressure we cannot wait up to 5 secs for the pdflush daemon to be woken up and clean up dirty pages (making them suitable for freeing); we have to take immediate action.

For this, the Linux kernel has two different urgency levels. The first (less urgent) condition is when the number of dirty pages crosses /proc/sys/vm/dirty_background_ratio (as a percentage of the total pages). So if /proc/sys/vm/dirty_background_ratio is 10 and we have 1GB of RAM, then the dirty background threshold is ~100MB. Whenever the number of dirty pages crosses this, the kernel does not wait for the periodic pdflush handler to be invoked (which can be anytime within the next 5 secs), but instead wakes up the pdflush thread explicitly, asking it to write back dirty pages immediately. Once pdflush gets a chance to run, it writes back enough pages to bring the number of dirty pages back below dirty_background_ratio.

The second (more urgent) condition is when the number of dirty pages grows still further and crosses the /proc/sys/vm/dirty_ratio threshold. The default value of dirty_ratio in my kernel is 20, so taking the above example, when the dirty pages cross 200MB the kernel treats the memory pressure situation as critical. It can no longer depend on pdflush waking up and writing back data (note that from the time pdflush is woken up until it gets to run, a few msecs might elapse), let alone wait for the periodic pdflush run. Instead it blocks the writer process (the one generating dirty pages) itself and does the writeback in its context, writing back enough pages to bring the number of dirty pages back below dirty_ratio.
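Following the same simplification (percentages of total memory; strictly speaking the kernel computes these against dirtyable memory, but the ballpark is the same), on this 44GB machine the two thresholds work out to roughly:

dirty_background_ratio = 10  =>  background writeback kicks in at ~4.4GB of dirty pages
dirty_ratio            = 20  =>  writers get blocked and made to write back at ~8.8GB of dirty pages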

A few other things to note before we go back to the original discussion.

Dirty pages have to be written back to disk before they can be freed. This involves disk IO, so it takes time; hence we cannot wait until the last moment, and the kernel always tries to keep dirty pages to a minimum.

Once dirty pages are written back, they are not immediately freed: they cache the on-disk contents and hence might be used to (quickly) service some future read request. Slowly, the pages which are not used frequently move to the inactive clean list, from where they can be freed quickly (no disk IO needed). It is in the better interest of the kernel (and everyone) to keep as many pages as possible in this state, where they cache disk contents but at the same time are immediately available for freeing (and subsequent re-allocation).

So we have the following life cycle for a page that is written

free -> allocated -> dirty -> actively used -> writeback -> inactive-clean -> free

Armed with this freshly acquired knowledge, let's look at the various numbers dumped by our kernel. The important number to look out for is the dirty count. In our case this is 39867 pages, i.e. ~150MB, out of a total of 44GB, which looks good. 10111786 pages, i.e. ~39GB of memory, is inactive (and since it is not dirty, it is inactive-clean and hence ready for reclamation). Since the dirty pages are well under control we need not worry about tuning any of the dirty-page related knobs in /proc/sys/vm/.

Note: if you see a very high percentage (20% and above) of dirty pages, you might consider decreasing either or both of the dirty_background_ratio and dirty_ratio knobs.

OK, so we have enough inactive-clean pages but for some reason they could not be reclaimed in time to satisfy our (failed) request.

Let's see why.

To understand this we will have to look at the various zone related statistics (specifically the number of free pages of various orders) dumped as part of the allocation failure message, and see why none of the zones could satisfy the request.

Another small digression is due now.

Linux divides all the available memory pages into various zones; pages belonging to one zone can satisfy certain kinds of requests and may not be suitable for other kinds. For example, we have the following zones:

ZONE_DMA

This contains memory pages with physical address less than 16MB. These pages (and only these pages) can be used to satisfy an allocation request made with GFP_DMA flag. These are used for older ISA h/w which cannot DMA to/from memory above 16MB. Just for the sake of completeness, let me mention that such hardware is very rare to find nowadays and more recent Linux kernels have added a config option to compile out this zone.

ZONE_DMA32

This contains memory pages with physical address above 16MB but below 4GB. These pages (along with ZONE_DMA pages) can be used to satisfy an allocation request made with the GFP_DMA32 flag. Similar to the ISA hardware's 16MB limitation, some PCI devices have a 4GB DMA limitation; their drivers use these pages. Such hardware is more common than the ISA hardware discussed above.

ZONE_NORMAL

This contains memory pages that are always mapped. Some processors like x86 have a limited virtual address space and hence cannot keep all of the physical memory pages mapped at all times. They keep some pages (896MB for x86) always mapped, and the rest of the pages are mapped as needed (and then unmapped once we are done using them). The always-mapped pages have the advantage of not needing the map/unmap overhead, so they are preferred over ZONE_HIGHMEM pages in critical paths.

ZONE_HIGHMEM

Memory pages mapped on demand are contained in this zone. Note that with the proliferation of 64-bit architectures, a lot more memory can be permanently mapped, so ZONE_HIGHMEM has become vestigial (e.g. on the x86_64 architecture), at least until we have exabytes of RAM.

 

The Linux kernel will try to satisfy an allocation request from the highest zone meeting the allocation requirements, in order to keep the memory in the more specialised lower zones free. Note that if we need memory for a PCI device limited to 4GB DMA, we can give it only from ZONE_DMA32 (or ZONE_DMA). For example, for a normal request (GFP_ATOMIC or GFP_KERNEL) the kernel will first try to allocate from ZONE_NORMAL, failing which it looks for pages in ZONE_DMA32, failing which it tries ZONE_DMA as a last resort. Since the lower zones (ZONE_DMA and ZONE_DMA32) are the only ones containing the special pages that can satisfy such special requests, the kernel also ensures that it does not completely exhaust a lower zone while satisfying a request originally targeted at a higher zone.

So, while we may see some free pages in lower level zones, we can still experience page allocation failures. More on this as we go along.

Following is a summary of the per-zone free pages, dumped by the kernel.

Node 0 DMA free:10652kB min:24kB low:28kB high:36kB active:0kB inactive:0kB present:10176kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2172 48380 48380

Node 0 DMA32 free:187020kB min:5884kB low:7352kB high:8824kB active:92816kB inactive:1642252kB present:2224956kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 46207 46207

Node 0 Normal free:327868kB min:125156kB low:156444kB high:187732kB active:5865380kB inactive:38804892kB present:47316480kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0

Some more terminology.

min

This represents the emergency pool for this zone. Normal allocation requests are not allowed to eat into this pool. It is reserved for emergency allocators, e.g. code paths whose job is to free memory; they should have access to the last bit of free memory so that they can do their job and free up more memory for everyone else to use.

low

This is the point at which free memory is considered "low": when free memory falls below this, kswapd is woken up in an attempt to free up more memory. Atomic allocations can still continue to allocate memory until free memory falls below min. So we have (low - min) worth of memory that can be allocated after kswapd is woken up and before we start seeing allocation failures. In other words, kswapd has only so much time to reclaim more memory: if allocation requests come in fast enough to exhaust the (low - min) buffer before kswapd has reclaimed enough memory, we will see allocation failures such as the one discussed in this post. Keep this in mind, it is very important!

high

This many free pages are considered "good enough": once kswapd has been woken up, its goal is to free pages until the zone reaches this level, and no more.

 

The absolute values of these watermarks depend on /proc/sys/vm/min_free_kbytes. In our case it is set to 131072 (i.e. 128MB). Roughly speaking, this is an indication to the kernel to keep at least this much memory free at all times. The min_free_kbytes value is distributed among the various zones in proportion to the number of pages they contribute. Since ZONE_NORMAL contributes the most pages, its min is set to 125156kB; ZONE_DMA32 comes second with min set to 5884kB; and finally ZONE_DMA, with only ~10MB present, has min set to just 24kB. 125156kB + 5884kB + 24kB ≈ 128MB.

The corresponding low and high values for each zone are calculated from their respective min values as

low = min + min/4
high = min + min/2
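Plugging in the ZONE_NORMAL numbers (working in pages with integer division, as the kernel does), min = 125156kB = 31289 pages, so:

low  = 31289 + 31289/4 = 39111 pages = 156444kB
high = 31289 + 31289/2 = 46933 pages = 187732kB

which matches the low: and high: values printed in the dump above.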

Another variable that deserves mention here is lowmem_reserve. It is actually an array of integer values, different for each zone, containing the amount of memory to reserve in that zone while serving requests that were originally targeted at higher zones and fell back to this zone. lowmem_reserve[0] corresponds to ZONE_DMA, lowmem_reserve[1] to ZONE_DMA32, lowmem_reserve[2] to ZONE_NORMAL and lowmem_reserve[3] to ZONE_MOVABLE.

There is a different lowmem_reserve array for each zone. The index into the array is the original zone the request was targeted at (failing which the request has fallen back to the zone in question). So, for example, if a request targeted at ZONE_NORMAL falls back to ZONE_DMA32 because it could not be satisfied from ZONE_NORMAL, then ZONE_DMA32 will keep lowmem_reserve[2] pages reserved. But if the request was originally targeted at ZONE_DMA32 itself, then the number of pages to reserve is lowmem_reserve[1], i.e. 0. This makes sense, as ZONE_DMA32 exists to satisfy DMA32 requests, so for such requests it should open up all its doors. Note that in these examples the lowmem_reserve[] array belonging to ZONE_DMA32 is the one consulted.
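As a concrete example using the arrays dumped above: ZONE_DMA32's line reads "lowmem_reserve[]: 0 0 46207 46207", so an allocation that originally targeted ZONE_NORMAL (index 2) but fell back to ZONE_DMA32 must leave 46207 pages free in ZONE_DMA32, whereas an allocation that targeted ZONE_DMA32 itself (index 1) reserves nothing.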

With this knowledge, let's try to understand why this allocation failed.

We need to keep this last bit of data handy while we drill down into the problem:

Node 0 DMA: 3*4kB 2*8kB 4*16kB 2*32kB 4*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 10652kB
Node 0 DMA32: 8592*4kB 2955*8kB 1160*16kB 1839*32kB 576*64kB 83*128kB 1*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 189304kB
Node 0 Normal: 76636*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 2*1024kB 0*2048kB 5*4096kB = 330032kB

Since the mode value (0x4020) tells us that it was a GFP_ATOMIC allocation, the kernel allocator would have first looked for free pages in ZONE_NORMAL. There seem to be enough free pages in ZONE_NORMAL (76636 free 4kB pages, plus some higher order pages too). Recall that the failed request we are looking at here was an order-1 request, so order-0 pages cannot be used to satisfy it.

So, despite the availability of some order-1 and higher order pages, our allocation still failed. Why? Does this have something to do with the kernel's attempt to keep some pages free for emergency situations (e.g. for code paths responsible for freeing memory)?

One important function in the kernel, mm/page_alloc.c:zone_watermark_ok(), has the answers to all our questions. This function is called by the kernel page allocator to ask whether a zone can be used to satisfy a given request. Its job is to look at the available pages and the request size, and decide whether the request can be fulfilled and whether, after fulfilling it, the zone will still have enough free pages (of various orders) left for future emergency allocations (apart from any reserved-page requirement, as per lowmem_reserve). It takes a watermark value as an argument and ensures that the number of free pages after the allocation would still be above that watermark.

Before diving into the internals of zone_watermark_ok(), one more important bit of knowledge. The Linux kernel page allocator (the __alloc_pages() function) adopts a two-phase approach to allocating pages. In the first phase it sets the watermark (to be passed to zone_watermark_ok()) to "low"; if that fails, it tries again with the watermark reduced to "min". When the allocation fails with the "low" watermark, before retrying with the "min" watermark the allocator also wakes up kswapd so that it can free up more pages.

One more thing to note is that atomic allocations are allowed to dip further into the memory reserves, so zone_watermark_ok() will lower the watermark value (passed as an argument by __alloc_pages()) for atomic allocations. The watermark is lowered by 62.5%, i.e. zone_watermark_ok() will only consider 37.5% of the watermark value for atomic allocations.
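Here is roughly where the 37.5% figure comes from in the 2.6.24 source (paraphrased; ALLOC_HIGH is set for __GFP_HIGH, i.e. GFP_ATOMIC, callers and ALLOC_HARDER for callers that cannot sleep, so an atomic allocation gets both):

min = watermark;
if (alloc_flags & ALLOC_HIGH)
        min -= min / 2;         /* down to 50% of the watermark */
if (alloc_flags & ALLOC_HARDER)
        min -= min / 4;         /* a quarter of the remaining 50%, i.e. down to 37.5% */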

Lets look at the algorithm used by zone_watermark_ok(). Following is the algorithm in a C-like style.

zone_watermark_ok(zone, watermark) {
    total_free_pages_in_this_zone_after_this_alloc =
            current_free_pages_in_zone - 2^order_of_this_alloc;

    min_watermark = watermark;
    /* relax the watermark for atomic allocations */
    if (atomic_allocation)
            min_watermark = 0.375 * watermark;
    /*
     * number of free pages after this alloc should be above
     * the requested watermark, after leaving the reserved pages
     * for allocations targeted to this zone (and not the ones
     * that have fallen back to this zone due to unavailability
     * in higher zones)
     */
    if (total_free_pages_in_this_zone_after_this_alloc <=
                           min_watermark + lowmem_reserve[zoneid])
            goto dont_allow_alloc_from_this_zone;
    /*
     * also check for appropriate higher order pages
     */
    for (order = 1; order <= requested_order; order++) {
           /*
            * require fewer higher order pages to be free
            */
           min_watermark /= 2;
           if (free_pages_of_order_and_higher <= min_watermark)
                   goto dont_allow_alloc_from_this_zone;
    }
    /* allow allocation from this zone */
    return 1;
dont_allow_alloc_from_this_zone:
    return 0;
}

With this knowledge, let's get back to our analysis.

Looking at the free page data dumped by the kernel for the various zones, and keeping the above zone_watermark_ok() algorithm in mind, let's see why the allocation failed.

Since it was a GFP_ATOMIC allocation, the page allocator would have first looked at the ZONE_NORMAL.

Total free pages in ZONE_NORMAL = 330032kB/4 = 82508
Total free pages in ZONE_NORMAL after this allocation = 82508 - 2 = 82506
Min watermark = 0.375 * 31289 = 11733 pages     (125156kB = 31289 pages)

lowmem_reserve[ZONE_NORMAL] for ZONE_NORMAL is 0, so we just need to ensure that the free page count stays above the watermark.

Since the total free pages after allocation is greater than the min watermark we look good, but we also have to look at the higher order pages before we allow the allocation.

Total order-1 and higher pages = (330032KB - 76636*4kB) = 23488KB = 5872 pages
Min order-1 and higher pages required = 11733/2 = 5866 pages

Oops! It looks like we have more free pages than required, and hence the allocation should have succeeded. So why did it fail?

If we look at the zone summary line printed before the detailed free page counts, we see that the free memory reported there was 327868kB (and not 330032kB). Here is that bit of the message repeated for reference.

Node 0 Normal free:327868kB min:125156kB low:156444kB high:187732kB active:5865380kB inactive:38804892kB present:47316480kB pages_scanned:0 all_unreclaimable? no

With 327868KB of free memory, “Total order-1 and higher pages” becomes

327868kB - 76636*4kB = 21324kB = 5331 pages

This is less than the required 5866 pages, and hence ZONE_NORMAL is unable to satisfy this request.

The allocator will now fall back to ZONE_DMA32. Before looking at why ZONE_DMA32 also fails, let's take a moment to understand the discrepancy between the free page counts reported by the two different sections of the kernel message. This is because kswapd is working while the messages are being printed; by the time the second part of the message is printed, kswapd has managed to free some more memory, so the free memory reported later is higher.

Running the same algorithm for ZONE_DMA32, we have:

Total free pages in ZONE_DMA32 = 187020kB/4 = 46755
Total free pages in ZONE_DMA32 after this allocation = 46755 - 2 = 46753
Min watermark = 0.375 * 1471 = 551 pages     (5884kB = 1471 pages)

lowmem_reserve[ZONE_NORMAL] for ZONE_DMA32 is 46207 pages.

Since 46753 < (46207 + 551) = 46758, the total free pages after this allocation would be below the min watermark plus the reserved pages, so the allocator will not allow pages to be allocated from this zone. We need not even look at the order-1 and higher pages.

Extending this logic, we can see that even ZONE_DMA cannot fulfill this allocation, and hence the kernel is unable to satisfy the allocation request.
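For completeness, plugging the ZONE_DMA numbers into the same check:

Total free pages in ZONE_DMA = 10652kB/4 = 2663
Total free pages in ZONE_DMA after this allocation = 2663 - 2 = 2661
Min watermark = 0.375 * 6 = 2 pages     (24kB = 6 pages)
lowmem_reserve[ZONE_NORMAL] for ZONE_DMA is 48380 pages

Since 2661 < (48380 + 2), ZONE_DMA is ruled out straight away, and the order-1 GFP_ATOMIC allocation fails.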

Is there something we can do to help the situation ?

Note that we talked about the two-phase allocation strategy adopted by the kernel page allocator. It first sets the watermark to "low" and, if that fails, retries the allocation with the watermark set to "min", but before doing so it wakes up kswapd to free up some memory.

This is important: the allocator wakes up kswapd once an allocation with watermark=low fails.

As we saw before, kswapd has only a little time before allocations start failing: the time taken by all the allocations to eat up (low - min) worth of memory. In our case that is (156444kB - 125156kB) = ~30MB. Since the allocations in my case were happening at 1-gigabit network line rate (all the allocations were for network buffers), 30MB takes roughly 250-300 msecs to exhaust. That might not be enough time. So if we can increase the (low - min) difference we can buy kswapd more time, and hence increase the likelihood of kswapd being able to free memory and allocations not failing!

This can be done by increasing /proc/sys/vm/min_free_kbytes.
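For example, doubling it to 256MB widens the (low - min) gap proportionally, buying kswapd roughly twice as long (the exact value to use is workload dependent; this is just an illustration):

root@cheetah:~# echo 262144 > /proc/sys/vm/min_free_kbytes

To make the change persist across reboots, set vm.min_free_kbytes in /etc/sysctl.conf.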

 
