CVE-2023-52934 Information

Description

In the Linux kernel, the following vulnerability has been resolved:

mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

In commit 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") we made the following change to find_pmd_or_thp_or_none():

-       if (!pmd_present(pmde))
-               return SCAN_PMD_NULL;
+       if (pmd_none(pmde))
+               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE might identify a pte-mapped hugepage, only to have khugepaged race in, free the pte table, and clear the pmd. Such codepaths include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not just !present) pmd, and we want to identify that case separately from the case where no pmd is found or the pmd is a bad-pmd (of course, many things could happen once we drop mmap_lock, and the pmd could plausibly undergo multiple transitions due to an intervening fault, split, etc). Regardless, the code is prepared to install a huge-pmd only when the existing pmd entry is either a genuine pte-table-mapping-pmd or the none-pmd.

However, the commit introduces a logical hole; namely, that we’ve allowed !none- && !huge- && !bad-pmds to be classified as genuine pte-table-mapping-pmds. Swap entries are one example that could leak through. The pmd values aren’t checked again before use in pte_offset_map_lock(), which is expecting nothing less than a genuine pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check), but need to be careful to deal with subtleties in pmd transitions and their treatment by the various archs.
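The effect of restoring the check can be illustrated with a small userspace sketch. This is not the kernel code: struct toy_pmd and the toy_* functions are hypothetical stand-ins, each flag simply models the result of the kernel predicate of the same name, and only the checks named in the text above are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Result names follow the text above; the numeric values are arbitrary. */
enum toy_scan { SCAN_SUCCEED, SCAN_PMD_NULL, SCAN_PMD_NONE, SCAN_PMD_MAPPED };

/* Hypothetical stand-in for a pmd value: one flag per kernel predicate. */
struct toy_pmd {
	bool none;       /* stands in for pmd_none() */
	bool present;    /* stands in for pmd_present() */
	bool trans_huge; /* stands in for pmd_trans_huge() */
	bool bad;        /* stands in for pmd_bad() */
};

/* Lookup with only the pmd_none() filter, as in the diff quoted earlier. */
static enum toy_scan toy_lookup_none_only(const struct toy_pmd *p)
{
	if (p == NULL)
		return SCAN_PMD_NULL;   /* no pmd found at all */
	if (p->none)
		return SCAN_PMD_NONE;
	if (p->trans_huge)
		return SCAN_PMD_MAPPED;
	if (p->bad)
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;            /* treated as a pte-table-mapping pmd */
}

/* Same lookup with the !present check restored below the none check. */
static enum toy_scan toy_lookup_with_present(const struct toy_pmd *p)
{
	if (p == NULL)
		return SCAN_PMD_NULL;
	if (p->none)
		return SCAN_PMD_NONE;
	if (!p->present)
		return SCAN_PMD_NULL;   /* e.g. a swap entry */
	if (p->trans_huge)
		return SCAN_PMD_MAPPED;
	if (p->bad)
		return SCAN_PMD_NULL;
	return SCAN_SUCCEED;
}

/* Self-check on a !none, !present, !huge, !bad entry (swap-entry-like). */
static void toy_selfcheck(void)
{
	const struct toy_pmd swap_like = { false, false, false, false };

	/* Without the present check the entry leaks through as a pte table. */
	assert(toy_lookup_none_only(&swap_like) == SCAN_SUCCEED);
	/* With the check restored it takes the failure path. */
	assert(toy_lookup_with_present(&swap_like) == SCAN_PMD_NULL);
}
```

Calling toy_selfcheck() from a main() exercises both variants. The ordering matters in this model: the none case must be filtered first so that it is still reported as SCAN_PMD_NONE rather than being swallowed by the !present check.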

The issue is that __split_huge_pmd_locked() temporarily clears the present bit (or otherwise marks the entry as invalid), but pmd_present() and pmd_trans_huge() still need to return true while the pmd is in this transitory state. For example, x86’s pmd_present() also checks the _PAGE_PSE bit, riscv’s version also checks the _PAGE_LEAF bit, and arm64 also checks a PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

  1. pmd_present() && pmd_trans_huge()
     All we actually know here is that the PSE bit is set. Either:
     a) We aren’t racing with __split_huge_page(), and PRESENT or PROTNONE is set. => huge-pmd
     b) We are currently racing with __split_huge_page(). The danger here is that we proceed as if we have a huge-pmd, but really we are looking at a pte-mapping-pmd. So, what is the risk of this danger?

    The only relevant path is:

    madvise_collapse() -> collapse_pte_mapped_thp()

    Where we might just incorrectly report back "success", when really the memory isn’t pmd-backed. This is fine, since split could happen immediately after (actually) successful madvise_collapse(). So, it should be safe to just assume huge-pmd here.

  2. pmd_present() && !pmd_trans_huge()
     Either:
     a) PSE not set and either PRESENT or PROTNONE is. => pte-table-mapping pmd (or PROT_NONE)
     b) devmap. This routine can be called immediately after unlocking/locking mmap_lock, or called with no locks held (see khugepaged_scan_mm_slot()), so previous VMA checks have since been invalidated.

  3. !pmd_present() && pmd_trans_huge()
     Not possible.

  4. !pmd_present() && !pmd_trans_huge()
     Neither PRESENT nor PROTNONE set. => not present
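The four cases can be checked mechanically with a toy model of the x86 predicates. This is a sketch under stated assumptions: the TOY_* bit positions are invented for illustration and are not the real x86 layout, toy_pmd_present() follows the description above (PRESENT, PROTNONE or PSE set), and toy_pmd_trans_huge() assumes that huge means PSE set without the devmap bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy bit positions, chosen for this sketch only; not the real x86 layout. */
#define TOY_PRESENT  (1u << 0)
#define TOY_PROTNONE (1u << 1)
#define TOY_PSE      (1u << 2)
#define TOY_DEVMAP   (1u << 3)

/* Present if any of PRESENT, PROTNONE or PSE is set (per the text above). */
static bool toy_pmd_present(uint32_t v)
{
	return (v & (TOY_PRESENT | TOY_PROTNONE | TOY_PSE)) != 0;
}

/* Assumption: huge means PSE is set and the entry is not devmap. */
static bool toy_pmd_trans_huge(uint32_t v)
{
	return (v & (TOY_PSE | TOY_DEVMAP)) == TOY_PSE;
}

/* Map a toy pmd value to the case number (1 to 4) used in the list above. */
static int toy_case(uint32_t v)
{
	bool present = toy_pmd_present(v);
	bool huge = toy_pmd_trans_huge(v);

	if (present && huge)
		return 1;
	if (present)
		return 2;
	if (huge)
		return 3;
	return 4;
}

/* Count how many of the 16 possible bit patterns fall into a given case. */
static int toy_count_case(int which)
{
	int n = 0;

	for (uint32_t v = 0; v < 16; v++)
		if (toy_case(v) == which)
			n++;
	return n;
}

/* Self-check of the two properties the argument relies on. */
static void toy_x86_selfcheck(void)
{
	/* Case 3 never occurs: huge implies PSE, and PSE implies present. */
	assert(toy_count_case(3) == 0);
	/* Mid-split state (PRESENT cleared, PSE kept) still reads as case 1. */
	assert(toy_case(TOY_PSE) == 1);
}
```

In this model, a pmd with only the PSE bit set corresponds to the transitory state during a split, which is why case 1 cannot tell the two situations apart from the pmd value alone.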

I’ve checked all archs that implement pmd_trans_huge() (arm64, riscv, powerpc, loongarch, x86, mips, and s390) and this logic roughly translates (though devmap treatment is unique to x86 and powerpc, and (3) doesn’t necessarily hold in general, but that doesn’t matter since !pmd_present() always takes the failure path).

Also add a comment above find_pmd_or_thp_or_none()

truncated—

Reference

https://git.kernel.org/stable/c/96aaaf8666010a39430cecf8a65c7ce2908a030f
https://git.kernel.org/stable/c/edb5d0cf5525357652aff6eacd9850b8ced07143
