lineage-22.2
919 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
071b028ed3 |
Merge 4.19.45 into android-4.19-q
Changes in 4.19.45
locking/rwsem: Prevent decrement of reader count before increment
x86/speculation/mds: Revert CPU buffer clear on double fault exit
x86/speculation/mds: Improve CPU buffer clear documentation
objtool: Fix function fallthrough detection
arm64: dts: rockchip: Disable DCMDs on RK3399's eMMC controller.
ARM: dts: exynos: Fix interrupt for shared EINTs on Exynos5260
ARM: dts: exynos: Fix audio (microphone) routing on Odroid XU3
mmc: sdhci-of-arasan: Add DTS property to disable DCMDs.
ARM: exynos: Fix a leaked reference by adding missing of_node_put
power: supply: axp288_charger: Fix unchecked return value
power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist
arm64: mmap: Ensure file offset is treated as unsigned
arm64: arch_timer: Ensure counter register reads occur with seqlock held
arm64: compat: Reduce address limit
arm64: Clear OSDLR_EL1 on CPU boot
arm64: Save and restore OSDLR_EL1 across suspend/resume
sched/x86: Save [ER]FLAGS on context switch
crypto: crypto4xx - fix ctr-aes missing output IV
crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
crypto: salsa20 - don't access already-freed walk.iv
crypto: chacha20poly1305 - set cra_name correctly
crypto: ccp - Do not free psp_master when PLATFORM_INIT fails
crypto: vmx - fix copy-paste error in CTR mode
crypto: skcipher - don't WARN on unprocessed data after slow walk step
crypto: crct10dif-generic - fix use via crypto_shash_digest()
crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
crypto: arm64/gcm-aes-ce - fix no-NEON fallback code
crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
crypto: rockchip - update IV buffer to contain the next IV
crypto: arm/aes-neonbs - don't access already-freed walk.iv
crypto: arm64/aes-neonbs - don't access already-freed walk.iv
mmc: core: Fix tag set memory leak
ALSA: line6: toneport: Fix broken usage of timer for delayed execution
ALSA: usb-audio: Fix a memory leak bug
ALSA: hda/hdmi - Read the pin sense from register when repolling
ALSA: hda/hdmi - Consider eld_valid when reporting jack event
ALSA: hda/realtek - EAPD turn on later
ALSA: hdea/realtek - Headset fixup for System76 Gazelle (gaze14)
ASoC: max98090: Fix restore of DAPM Muxes
ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
ASoC: fsl_esai: Fix missing break in switch statement
ASoC: codec: hdac_hdmi add device_link to card device
bpf, arm64: remove prefetch insn in xadd mapping
crypto: ccree - remove special handling of chained sg
crypto: ccree - fix mem leak on error path
crypto: ccree - don't map MAC key on stack
crypto: ccree - use correct internal state sizes for export
crypto: ccree - don't map AEAD key and IV on stack
crypto: ccree - pm resume first enable the source clk
crypto: ccree - HOST_POWER_DOWN_EN should be the last CC access during suspend
crypto: ccree - add function to handle cryptocell tee fips error
crypto: ccree - handle tee fips error during power management resume
mm/mincore.c: make mincore() more conservative
mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses
mm/hugetlb.c: don't put_page in lock of hugetlb_lock
hugetlb: use same fault hash key for shared and private mappings
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
userfaultfd: use RCU to free the task struct when fork fails
ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle
mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L
mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values
mtd: spi-nor: intel-spi: Avoid crossing 4K address boundary on read/write
tty: vt.c: Fix TIOCL_BLANKSCREEN console blanking if blankinterval == 0
tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
jbd2: check superblock mapped prior to committing
ext4: make sanity check in mballoc more strict
ext4: ignore e_value_offs for xattrs with value-in-ea-inode
ext4: avoid drop reference to iloc.bh twice
ext4: fix use-after-free race with debug_want_extra_isize
ext4: actually request zeroing of inode table after grow
ext4: fix ext4_show_options for file systems w/o journal
btrfs: Check the first key and level for cached extent buffer
btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
btrfs: Honour FITRIM range constraints during free space trim
Btrfs: send, flush dellaloc in order to avoid data loss
Btrfs: do not start a transaction during fiemap
Btrfs: do not start a transaction at iterate_extent_inodes()
bcache: fix a race between cache register and cacheset unregister
bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
ipmi:ssif: compare block number correctly for multi-part return messages
crypto: ccm - fix incompatibility between "ccm" and "ccm_base"
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
tty: Don't force RISCV SBI console as preferred console
ext4: zero out the unused memory region in the extent tree block
ext4: fix data corruption caused by overlapping unaligned and aligned IO
ext4: fix use-after-free in dx_release()
ext4: avoid panic during forced reboot due to aborted journal
ALSA: hda/realtek - Corrected fixup for System76 Gazelle (gaze14)
ALSA: hda/realtek - Fixup headphone noise via runtime suspend
ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug
jbd2: fix potential double free
KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes
KVM: lapic: Busy wait for timer to expire when using hv_timer
kbuild: turn auto.conf.cmd into a mandatory include file
xen/pvh: set xen_domain_type to HVM in xen_pvh_init
libnvdimm/namespace: Fix label tracking error
iov_iter: optimize page_copy_sane()
pstore: Centralize init/exit routines
pstore: Allocate compression during late_initcall()
pstore: Refactor compression initialization
ext4: fix compile error when using BUFFER_TRACE
ext4: don't update s_rev_level if not required
Linux 4.19.45
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
|
||
|
|
50f91435a2 |
Merge 4.19.45 into android-4.19
Changes in 4.19.45
locking/rwsem: Prevent decrement of reader count before increment
x86/speculation/mds: Revert CPU buffer clear on double fault exit
x86/speculation/mds: Improve CPU buffer clear documentation
objtool: Fix function fallthrough detection
arm64: dts: rockchip: Disable DCMDs on RK3399's eMMC controller.
ARM: dts: exynos: Fix interrupt for shared EINTs on Exynos5260
ARM: dts: exynos: Fix audio (microphone) routing on Odroid XU3
mmc: sdhci-of-arasan: Add DTS property to disable DCMDs.
ARM: exynos: Fix a leaked reference by adding missing of_node_put
power: supply: axp288_charger: Fix unchecked return value
power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist
arm64: mmap: Ensure file offset is treated as unsigned
arm64: arch_timer: Ensure counter register reads occur with seqlock held
arm64: compat: Reduce address limit
arm64: Clear OSDLR_EL1 on CPU boot
arm64: Save and restore OSDLR_EL1 across suspend/resume
sched/x86: Save [ER]FLAGS on context switch
crypto: crypto4xx - fix ctr-aes missing output IV
crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
crypto: salsa20 - don't access already-freed walk.iv
crypto: chacha20poly1305 - set cra_name correctly
crypto: ccp - Do not free psp_master when PLATFORM_INIT fails
crypto: vmx - fix copy-paste error in CTR mode
crypto: skcipher - don't WARN on unprocessed data after slow walk step
crypto: crct10dif-generic - fix use via crypto_shash_digest()
crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest()
crypto: arm64/gcm-aes-ce - fix no-NEON fallback code
crypto: gcm - fix incompatibility between "gcm" and "gcm_base"
crypto: rockchip - update IV buffer to contain the next IV
crypto: arm/aes-neonbs - don't access already-freed walk.iv
crypto: arm64/aes-neonbs - don't access already-freed walk.iv
mmc: core: Fix tag set memory leak
ALSA: line6: toneport: Fix broken usage of timer for delayed execution
ALSA: usb-audio: Fix a memory leak bug
ALSA: hda/hdmi - Read the pin sense from register when repolling
ALSA: hda/hdmi - Consider eld_valid when reporting jack event
ALSA: hda/realtek - EAPD turn on later
ALSA: hdea/realtek - Headset fixup for System76 Gazelle (gaze14)
ASoC: max98090: Fix restore of DAPM Muxes
ASoC: RT5677-SPI: Disable 16Bit SPI Transfers
ASoC: fsl_esai: Fix missing break in switch statement
ASoC: codec: hdac_hdmi add device_link to card device
bpf, arm64: remove prefetch insn in xadd mapping
crypto: ccree - remove special handling of chained sg
crypto: ccree - fix mem leak on error path
crypto: ccree - don't map MAC key on stack
crypto: ccree - use correct internal state sizes for export
crypto: ccree - don't map AEAD key and IV on stack
crypto: ccree - pm resume first enable the source clk
crypto: ccree - HOST_POWER_DOWN_EN should be the last CC access during suspend
crypto: ccree - add function to handle cryptocell tee fips error
crypto: ccree - handle tee fips error during power management resume
mm/mincore.c: make mincore() more conservative
mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses
mm/hugetlb.c: don't put_page in lock of hugetlb_lock
hugetlb: use same fault hash key for shared and private mappings
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
userfaultfd: use RCU to free the task struct when fork fails
ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle
mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L
mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values
mtd: spi-nor: intel-spi: Avoid crossing 4K address boundary on read/write
tty: vt.c: Fix TIOCL_BLANKSCREEN console blanking if blankinterval == 0
tty/vt: fix write/write race in ioctl(KDSKBSENT) handler
jbd2: check superblock mapped prior to committing
ext4: make sanity check in mballoc more strict
ext4: ignore e_value_offs for xattrs with value-in-ea-inode
ext4: avoid drop reference to iloc.bh twice
ext4: fix use-after-free race with debug_want_extra_isize
ext4: actually request zeroing of inode table after grow
ext4: fix ext4_show_options for file systems w/o journal
btrfs: Check the first key and level for cached extent buffer
btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails
btrfs: Honour FITRIM range constraints during free space trim
Btrfs: send, flush dellaloc in order to avoid data loss
Btrfs: do not start a transaction during fiemap
Btrfs: do not start a transaction at iterate_extent_inodes()
bcache: fix a race between cache register and cacheset unregister
bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim()
ipmi:ssif: compare block number correctly for multi-part return messages
crypto: ccm - fix incompatibility between "ccm" and "ccm_base"
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
tty: Don't force RISCV SBI console as preferred console
ext4: zero out the unused memory region in the extent tree block
ext4: fix data corruption caused by overlapping unaligned and aligned IO
ext4: fix use-after-free in dx_release()
ext4: avoid panic during forced reboot due to aborted journal
ALSA: hda/realtek - Corrected fixup for System76 Gazelle (gaze14)
ALSA: hda/realtek - Fixup headphone noise via runtime suspend
ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug
jbd2: fix potential double free
KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes
KVM: lapic: Busy wait for timer to expire when using hv_timer
kbuild: turn auto.conf.cmd into a mandatory include file
xen/pvh: set xen_domain_type to HVM in xen_pvh_init
libnvdimm/namespace: Fix label tracking error
iov_iter: optimize page_copy_sane()
pstore: Centralize init/exit routines
pstore: Allocate compression during late_initcall()
pstore: Refactor compression initialization
ext4: fix compile error when using BUFFER_TRACE
ext4: don't update s_rev_level if not required
Linux 4.19.45
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
|
||
|
|
8bae439855 |
userfaultfd: use RCU to free the task struct when fork fails
commit c3f3ce049f7d97cc7ec9c01cb51d9ec74e0f37c2 upstream.
The task structure is freed while get_mem_cgroup_from_mm() holds
rcu_read_lock() and dereferences mm->owner.
get_mem_cgroup_from_mm() failing fork()
---- ---
task = mm->owner
mm->owner = NULL;
free(task)
if (task) *task; /* use after free */
The fix consists in freeing the task with RCU also in the fork failure
case, exactly like it always happens for the regular exit(2) path. That
is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm()
(left side above) effective to avoid a use after free when dereferencing
the task structure.
An alternate possible fix would be to defer the delivery of the
userfaultfd contexts to the monitor until after fork() is guaranteed to
succeed. Such a change would require more changes because it would
create a strict ordering dependency where the uffd methods would need to
be called beyond the last potentially failing branch in order to be
safe. This solution as opposed only adds the dependency to common code
to set mm->owner to NULL and to free the task struct that was pointed by
mm->owner with RCU, if fork ends up failing. The userfaultfd methods
can still be called anywhere during the fork runtime and the monitor
will keep discarding orphaned "mm" coming from failed forks in userland.
This race condition couldn't trigger if CONFIG_MEMCG was set =n at build
time.
[aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal]
Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com
Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com
Fixes:
|
||
|
|
0c8a35f8dd |
mm: protect against PTE changes done by dup_mmap()
Vinayak Menon and Ganesh Mahendran reported that the following scenario may
lead to thread being blocked due to data corruption:
CPU 1 CPU 2 CPU 3
Process 1, Process 1, Process 1,
Thread A Thread B Thread C
while (1) { while (1) { while(1) {
pthread_mutex_lock(l) pthread_mutex_lock(l) fork
pthread_mutex_unlock(l) pthread_mutex_unlock(l) }
} }
In the details this happens because :
CPU 1 CPU 2 CPU 3
fork()
copy_pte_range()
set PTE rdonly
got to next VMA...
. PTE is seen rdonly PTE still writable
. thread is writing to page
. -> page fault
. copy the page Thread writes to page
. . -> no page fault
. update the PTE
. flush TLB for that PTE
flush TLB PTE are now rdonly
So the write done by the CPU 3 is interfering with the page copy operation
done by CPU 2, leading to the data corruption.
To avoid this we mark all the VMA involved in the COW mechanism as changing
by calling vm_write_begin(). This ensures that the speculative page fault
handler will not try to handle a fault on these pages.
The marker is set until the TLB is flushed, ensuring that all the CPUs will
now see the PTE as not writable.
Once the TLB is flush, the marker is removed by calling vm_write_end().
The variable last is used to keep tracked of the latest VMA marked to
handle the error path where part of the VMA may have been marked.
Change-Id: I3fe07109e27d8f77c9b435053567fe5c287703aa
Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Link: https://www.spinics.net/lists/linux-mm/msg171207.html
Patch-mainline: linux-mm@ Fri, 18 Jan 2019 17:24:16
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
|
||
|
|
3f31f748a8 |
mm: protect mm_rb tree with a rwlock
This change is inspired by the Peter's proposal patch [1] which was protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in that particular case, and it is introducing major performance degradation due to excessive scheduling operations. To allow access to the mm_rb tree without grabbing the mmap_sem, this patch is protecting it access using a rwlock. As the mm_rb tree is a O(log n) search it is safe to protect it using such a lock. The VMA cache is not protected by the new rwlock and it should not be used without holding the mmap_sem. To allow the picked VMA structure to be used once the rwlock is released, a use count is added to the VMA structure. When the VMA is allocated it is set to 1. Each time the VMA is picked with the rwlock held its use count is incremented. Each time the VMA is released it is decremented. When the use count hits zero, this means that the VMA is no more used and should be freed. This patch is preparing for 2 kind of VMA access : - as usual, under the control of the mmap_sem, - without holding the mmap_sem for the speculative page fault handler. Access done under the control the mmap_sem doesn't require to grab the rwlock to protect read access to the mm_rb tree, but access in write must be done under the protection of the rwlock too. This affects inserting and removing of elements in the RB tree. The patch is introducing 2 new functions: - vma_get() to find a VMA based on an address by holding the new rwlock. - vma_put() to release the VMA when its no more used. These services are designed to be used when access are made to the RB tree without holding the mmap_sem. When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and we rely on the WMB done when releasing the rwlock to serialize the write with the RMB done in a later patch to check for the VMA's validity. When free_vma is called, the file associated with the VMA is closed immediately, but the policy and the file structure remained in used until the VMA's use count reach 0, which may happens later when exiting an in progress speculative page fault. [1] https://patchwork.kernel.org/patch/5108281/ Change-Id: I9ecc922b8efa4b28975cc6a8e9531284c24ac14e Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:23 [vinmenon@codeaurora.org: fix the return of put_vma] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> |
||
|
|
ead04c98fd |
mm: introduce INIT_VMA()
Some VMA struct fields need to be initialized once the VMA structure is allocated. Currently this only concerns anon_vma_chain field but some other will be added to support the speculative page fault. Instead of spreading the initialization calls all over the code, let's introduce a dedicated inline function. Change-Id: I9f6b29dc74055354318b548e2b6b22c37d4c61bb Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:13 [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> [charante@codeaurora.org: merge conflict fixes] Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> |
||
|
|
e550f94252 |
UPSTREAM: psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2)
Bug: 127712811
Test: lmkd in PSI mode
Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
|
||
|
|
25f0c86db1 |
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills. In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively. Stall states are aggregate versions of the per-task delay
accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit. They can also indicate when the system
is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states. Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime. A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change-Id: Ic4b6fddb7013719ffcba5d68abfda87ada90715a
Git-commit: eb414681d5a07d28d2ff90dc05f69ec6b232ebd2
Git-repo: https://source.codeaurora.org/quic/la/kernel/msm-4.19
[pdaly@codeaurora.org: Move PF_MEMSTALL flag]
Signed-off-by: Patrick Daly <pdaly@codeaurora.org>
|
||
|
|
406d53f0c7 |
ANDROID: cpufreq: track per-task time in state
Add time in state data to task structs, and create /proc/<pid>/time_in_state files to show how long each individual task has run at each frequency. Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking. Bug: 72339335 Bug: 127641090 Test: Read /proc/<pid>/time_in_state Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8 Signed-off-by: Connor O'Brien <connoro@google.com> [astrachan: Folded the following changes into this patch: a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES") b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")] Signed-off-by: Alistair Strachan <astrachan@google.com> |
||
|
|
ffb1bfc29b |
Merge android-4.19.15 (caf5433) into msm-4.19
* refs/heads/tmp-caf5433:
Linux 4.19.15
bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
drm/amd/display: Fix unintialized max_bpc state values
drm/rockchip: psr: do not dereference encoder before it is null checked.
drm/vc4: Set ->is_yuv to false when num_planes == 1
drm/nouveau/drm/nouveau: Check rc from drm_dp_mst_topology_mgr_resume()
lib: fix build failure in CONFIG_DEBUG_VIRTUAL test
of: __of_detach_node() - remove node from phandle cache
of: of_node_get()/of_node_put() nodes held in phandle cache
power: supply: olpc_battery: correct the temperature units
intel_th: msu: Fix an off-by-one in attribute store
genwqe: Fix size check
drivers/perf: hisi: Fixup one DDRC PMU register offset
video: fbdev: pxafb: Fix "WARNING: invalid free of devm_ allocated data"
ceph: don't update importing cap's mseq when handing cap export
sched/fair: Fix infinite loop in update_blocked_averages() by reverting
|
||
|
|
bc999b5099 |
fork: record start_time late
commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream. This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space. Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar). By recording the start_time late, it much closer reflects the point in time where the process becomes live and can be observed by other processes. Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination. Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
7ebdf76d85 |
sched: Add snapshot of Window Assisted Load Tracking (WALT)
This snapshot is taken from msm-4.14 as of
commit 871eac76e6be567 ("sched: Improve the scheduler").
Change-Id: Ib4e0b39526d3009cedebb626ece5a767d8247846
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
|
||
|
|
bfee6d7f04 |
Merge remote-tracking branch 'origin/tmp-11da3a7' into msm-kona
* origin/tmp-11da3a7:
Linux 4.19-rc3
kbuild: modules_install: warn when missing System.map file
x86/mm: Use WRITE_ONCE() when setting PTEs
x86/apic/vector: Make error return value negative
afs: Fix cell specification to permit an empty address list
KVM: LAPIC: Fix pv ipis out-of-bounds access
KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2
arm64: KVM: Remove pgd_lock
KVM: Remove obsolete kvm_unmap_hva notifier backend
arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD
KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW
i2c: xiic: Record xilinx i2c with Zynq fragment
clocksource: Revert "Remove kthread"
i2c: xiic: Make the start and the byte count write atomic
irqchip/gic-v3-its: Cap lpi_id_bits to reduce memory footprint
block: bfq: swap puts in bfqg_and_blkg_put
memory: ti-aemif: fix a potential NULL-pointer dereference
arm64: fix erroneous warnings in page freeing functions
firmware: arm_scmi: fix divide by zero when sustained_perf_level is zero
printk/tracing: Do not trace printk_nmi_enter()
rbd: support cloning across namespaces
rbd: factor out get_parent_info()
ceph: avoid a use-after-free in ceph_destroy_options()
cpu/hotplug: Prevent state corruption on error rollback
cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun()
x86/process: Don't mix user/kernel regs in 64bit __show_regs()
x86/tsc: Prevent result truncation on 32bit
ACPI / LPSS: Force LPSS quirks on boot
ACPI / bus: Only call dmi_check_system() on X86
block: don't warn when doing fsync on read-only devices
hwmon: rpi: add module alias to raspberrypi-hwmon
tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints
nds32: linker script: GCOV kernel may refers data in __exit
nilfs2: convert to SPDX license tags
drivers/dax/device.c: convert variable to vm_fault_t type
lib/Kconfig.debug: fix three typos in help text
checkpatch: add __ro_after_init to known $Attribute
mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal
uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name
memory_hotplug: fix kernel_panic on offline page processing
checkpatch: add optional static const to blank line declarations test
ipc/shm: properly return EIDRM in shm_lock()
mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.
mm/util.c: improve kvfree() kerneldoc
tools/vm/page-types.c: fix "defined but not used" warning
tools/vm/slabinfo.c: fix sign-compare warning
kmemleak: always register debugfs file
mm: respect arch_dup_mmap() return value
mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm().
mm: memcontrol: print proper OOM header when no eligible victim left
ARC: don't check for HIGHMEM pages in arch_dma_alloc
ARC: IOC: panic if both IOC and ZONE_HIGHMEM enabled
ARC: dma [IOC] Enable per device io coherency
net: phy: sfp: Handle unimplemented hwmon limits and alarms
net: sched: action_ife: take reference to meta module
act_ife: fix a potential use-after-free
net/mlx5: Fix SQ offset in QPs with small RQ
nbd: don't allow invalid blocksize settings
i2c: i801: fix DNV's SMBCTRL register offset
KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting
KVM: s390: vsie: copy wrapping keys to right place
KVM: s390: Fix pfmf and conditional skey emulation
nds32: fix build error because of wrong semicolon
nds32: Fix a kernel panic issue because of wrong frame pointer access.
nds32: Only print one page of stack when die to prevent printing too much information.
nds32: Add macro definition for offset of lp register on stack
nds32: Remove the deprecated ABI implementation
nds32/stack: Get real return address by using ftrace_graph_ret_addr
nds32/ftrace: Support dynamic function graph tracer
nds32/ftrace: Support dynamic function tracer
nds32/ftrace: Add RECORD_MCOUNT support
nds32/ftrace: Support static function graph tracer
nds32/ftrace: Support static function tracer
nds32: Extract the checking and getting pointer to a macro
nds32: Clean up the coding style
nds32: Fix get_user/put_user macro expand pointer problem
nds32: Fix empty call trace
nds32: add NULL entry to the end of_device_id array
nds32: fix logic for module
tipc: correct spelling errors for tipc_topsrv_queue_evt() comments
tipc: correct spelling errors for struct tipc_bc_base's comment
bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA.
bnxt_en: Clean up unused functions.
bnxt_en: Fix firmware signaled resource change logic in open.
sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel
sctp: fix invalid reference to the index variable of the iterator
net/ibm/emac: wrong emac_calc_base call was used by typo
net: sched: null actions array pointer before releasing action
drm/i915/dp_mst: Fix enabling pipe clock for all streams
drm/i915/dsc: Fix PPS register definition macros for 2nd VDSC engine
drm/i915: Re-apply "Perform link quality check, unconditionally during long pulse"
vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition
r8169: add support for NCube 8168 network card
ip6_tunnel: respect ttl inherit for ip6tnl
ALSA: hda: Fix several mismatch for register mask and value
apparmor: fix bad debug check in apparmor_secid_to_secctx()
ALSA: rawmidi: Initialize allocated buffers
fsnotify: fix ignore mask logic in fsnotify()
timekeeping: Fix declaration of read_persistent_wall_and_boot_offset()
x86: Fix kernel-doc atomic.h warnings
mac80211: shorten the IBSS debug messages
mac80211: don't Tx a deauth frame if the AP forbade Tx
mac80211: Fix station bandwidth setting after channel switch
mac80211: fix a race between restart and CSA flows
mac80211: fix WMM TXOP calculation
cfg80211: fix a type issue in ieee80211_chandef_to_operating_class()
mac80211: fix an off-by-one issue in A-MSDU max_subframe computation
drm/i915/gvt: Give new born vGPU higher scheduling chance
cifs: connect to servername instead of IP for IPC$ share
smb3: check for and properly advertise directory lease support
smb3: minor debugging clarifications in rfc1001 len processing
SMB3: Backup intent flag missing for directory opens with backupuid mounts
fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
m68k: fix early memory reservation for ColdFire MMU systems
uapi: Fix linux/rds.h userspace compilation errors.
net: cadence: Fix a sleep-in-atomic-context bug in macb_halt_tx()
i2c: imx-lpi2c: Remove mx8dv compatible entry
dt-bindings: imx-lpi2c: Remove mx8dv compatible entry
i2c: uniphier-f: issue STOP only for last message or I2C_M_STOP
i2c: uniphier: issue STOP only for last message or I2C_M_STOP
net/ipv6: Only update MTU metric if it set
net: ethernet: cpsw-phy-sel: prefer phandle for phy sel
dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
igmp: fix incorrect unsolicit report count after link down and up
igmp: fix incorrect unsolicit report count when join group
bpf: avoid misuse of psock when TCP_ULP_BPF collides with another ULP
tools/bpf: bpftool, add xskmap in map types
bpf: Fix bpf_msg_pull_data()
kbuild: make missing $DEPMOD a Warning instead of an Error
kconfig: do not require pkg-config on make {menu,n}config
x86/microcode: Update the new microcode revision unconditionally
x86/microcode: Make sure boot_cpu_data.microcode is up-to-date
of/platform: initialise AMBA default DMA masks
sparc: set a default 32-bit dma mask for OF devices
ipv6: don't get lwtstate twice in ip6_rt_copy_init()
random: make CPU trust a boot parameter
kernel/dma/direct: take DMA offset into account in dma_direct_supported
ibmvnic: Include missing return code checks in reset function
selftests: pmtu: detect correct binary to ping ipv6 addresses
selftests: pmtu: maximum MTU for vti4 is 2^16-1-20
tcp: do not restart timewait timer on rst reception
net/rds: RDS is not Radio Data System
hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe()
nfp: wait for posted reconfigs when disabling the device
Revert "packet: switch kvzalloc to allocate memory"
md-cluster: release RESYNC lock after the last resync message
RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0
md/raid5-cache: disable reshape completely
blkcg: use tryget logic when associating a blkg with a bio
blkcg: delay blkg destruction until after writeback has finished
Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
ARC: dma [IOC]: mark DMA devices connected as dma-coherent
ARC: atomics: unbork atomic_fetch_##op()
MIPS: VDSO: Match data page cache colouring when D$ aliases
kconfig: remove a spurious self-assignment
scripts/setlocalversion: git: Make -dirty check more robust
gpio: Fix crash due to registration race
arc: remove redundant GCC version checks
tools/kvm_stat: re-animate display of dead guests
tools/kvm_stat: indicate dead guests as such
tools/kvm_stat: handle guest removals more gracefully
tools/kvm_stat: don't reset stats when setting PID filter for debugfs
tools/kvm_stat: fix updates for dead guests
tools/kvm_stat: fix handling of invalid paths in debugfs provider
tools/kvm_stat: fix python3 issues
KVM: x86: Unexport x86_emulate_instruction()
KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction()
KVM: x86: Do not re-{try,execute} after failed emulation in L2
KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault
KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE
KVM: x86: Invert emulation re-execute behavior to make it opt-in
KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation
KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr
KVM: SVM: remove unused variable dst_vaddr_end
KVM: nVMX: avoid redundant double assignment of nested_run_pending
ALSA: hda - Fix cancel_work_sync() stall from jackpoll work
mac80211: always account for A-MSDU header changes
mac80211: do not convert to A-MSDU if frag/subframe limited
cfg80211: nl80211_update_ft_ies() to validate NL80211_ATTR_IE
tc-testing: add test-cases for numeric and invalid control action
net_sched: reject unknown tcfa_action values
net: mvpp2: initialize port of_node pointer
drm/i915/gvt: Fix drm_format_mod value for vGPU plane
drm/i915/gvt: move intel_runtime_pm_get out of spin_lock in stop_schedule
drm/i915/gvt: Handle GEN9_WM_CHICKEN3 with F_CMD_ACCESS.
drm/i915/gvt: Make correct handling to vreg BXT_PHY_CTL_FAMILY
drm/i915/gvt: emulate gen9 dbuf ctl register access
net: bcmgenet: use MAC link status for fixed phy
net: stmmac: build the dwmac-socfpga platform driver for Stratix10
net: rtnl: return early from rtnl_unregister_all when protocol isn't registered
ipv6: fix cleanup ordering for pingv6 registration
ipv6: fix cleanup ordering for ip6_mr failure
net/sched: act_pedit: fix dump of extended layered op
sh_eth: Add R7S9210 support
net: hns: add netif_carrier_off before change speed and duplex
net: hns: add the code for cleaning pkt in chip
r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices
tipc: switch to rhashtable iterator
Revert "net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit"
tipc: fix a missing rhashtable_walk_exit()
vti6: remove !skb->ignore_df check from vti6_xmit()
bpf: fix sg shift repair start offset in bpf_msg_pull_data
bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data
bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data
gpio: dwapb: Fix error handling in dwapb_gpio_probe()
gpiolib-acpi: Register GpioInt ACPI event handlers from a late_initcall
gpiolib: acpi: Switch to cansleep version of GPIO library call
mac80211: avoid kernel panic when building AMSDU from non-linear SKB
mac80211: mesh: fix HWMP sequence numbering to follow standard
gpio: adp5588: Fix sleep-in-atomic-context bug
bpf: fix several offset tests in bpf_msg_pull_data
nl80211: Pass center frequency in kHz instead of MHz
nl80211: Fix nla_put_u8 to u16 for NL80211_WMMR_TXOP
mac80211_hwsim: Fix possible Spectre-v1 for hwsim_world_regdom_custom
mac80211: don't update the PM state of a peer upon a multicast frame
cfg80211: make wmm_rule part of the reg_rule structure
mac80211_hwsim: correct use of IEEE80211_VHT_CAP_RXSTBC_X
mac80211: correct use of IEEE80211_VHT_CAP_RXSTBC_X
bpf: sockmap, decrement copied count correctly in redirect error case
bpf: fix build error with clang
bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg
bpf, sockmap: fix potential use after free in bpf_tcp_close
net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6
net: dsa: Drop GPIO includes
tipc: fix the big/little endian issue in tipc_dest
net: sched: return -ENOENT when trying to remove filter from non-existent chain
net: sched: fix extack error message when chain is failed to be created
erspan: set erspan_ver to 1 by default when adding an erspan dev
sctp: remove useless start_fail from sctp_ht_iter in proc
sctp: hold transport before accessing its asoc in sctp_transport_get_next
scsi: aacraid: fix a signedness bug
Revert "scsi: core: avoid host-wide host_busy counter for scsi_mq"
Revert "scsi: core: fix scsi_host_queue_ready"
scsi: libata: Add missing newline at end of file
scsi: target: iscsi: cxgbit: use pr_debug() instead of pr_info()
scsi: hpsa: limit transfer length to 1MB, not 512kB
scsi: lpfc: Correct MDS diag and nvmet configuration
scsi: lpfc: Default fdmi_on to on
scsi: csiostor: fix incorrect port capabilities
scsi: csiostor: add a check for NULL pointer after kmalloc()
scsi: documentation: add scsi_mod.use_blk_mq to scsi-parameters
scsi: core: Update SCSI_MQ_DEFAULT help text to match default
ARC: sort Kconfig
ARC: cleanup show_faulting_vma()
ARC: [plat-axs*]: Enable SWAP
ARC: [plat-axs*/plat-hsdk]: Allow U-Boot to pass MAC-address to the kernel
ARC: configs: cleanup
arm64: allwinner: dts: h6: fix Pine H64 MMC bus width
btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
btrfs: use after free in btrfs_quota_enable
btrfs: btrfs_shrink_device should call commit transaction at the end
btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
Btrfs: fix data corruption when deduplicating between different files
Btrfs: sync log after logging new name
cfg80211: remove division by size of sizeof(struct ieee80211_wmm_rule)
KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function
Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
mac80211_hwsim: require at least one channel
KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix()
mac80211: Run TXQ teardown code before de-registering interfaces
rfkill-gpio: include linux/mod_devicetable.h
Change-Id: Ic6d1654e67ece823a5fce6ae18d241ad350bfb08
Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>
|
||
|
|
1ed0cc5a01 |
mm: respect arch_dup_mmap() return value
Commit |
||
|
|
9c752f5721 |
ANDROID: Fix massive cpufreq_times memory leaks
Every time _cpu_up() is called for a CPU, idle_thread_get() is called which then re-initializes a CPU's idle thread that was already previously created and cached in a global variable in smpboot.c. idle_thread_get() calls init_idle() which then calls __sched_fork(). __sched_fork() is where cpufreq_task_times_init() is, and cpufreq_task_times_init() allocates memory for the task struct's time_in_state array. Since idle_thread_get() reuses a task struct instance that was already previously created, this means that every time it calls init_idle(), cpufreq_task_times_init() allocates this array again and overwrites the existing allocation that the idle thread already had. This causes memory to be leaked every time a CPU is onlined. In order to fix this, move allocation of time_in_state into _do_fork to avoid allocating it at all for idle threads. The cpufreq times interface is intended to be used for tracking userspace tasks, so we can safely remove it from the kernel's idle threads without killing any functionality. But that's not all! Task structs can be freed outside of release_task(), which creates another memory leak because a task struct can be freed without having its cpufreq times allocation freed. To fix this, free the cpufreq times allocation at the same time that task struct allocations are freed, in free_task(). Since free_task() can also be called in error paths of copy_process() after dup_task_struct(), set time_in_state to NULL immediately after calling dup_task_struct() to avoid possible double free. Bug description and fix adapted from patch submitted by Sultan Alsawaf <sultanxda@gmail.com> at https://android-review.googlesource.com/c/kernel/msm/+/700134 Bug: 110044919 Test: Hikey960 builds, boots & reports /proc/<pid>/time_in_state correctly Change-Id: I12fe7611fc88eb7f6c39f8f7629ad27b6ec4722c Signed-off-by: Connor O'Brien <connoro@google.com> |
||
|
|
cd9b44f907 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: - the rest of MM - procfs updates - various misc things - more y2038 fixes - get_maintainer updates - lib/ updates - checkpatch updates - various epoll updates - autofs updates - hfsplus - some reiserfs work - fatfs updates - signal.c cleanups - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits) ipc/util.c: update return value of ipc_getref from int to bool ipc/util.c: further variable name cleanups ipc: simplify ipc initialization ipc: get rid of ids->tables_initialized hack lib/rhashtable: guarantee initial hashtable allocation lib/rhashtable: simplify bucket_table_alloc() ipc: drop ipc_lock() ipc/util.c: correct comment in ipc_obtain_object_check ipc: rename ipcctl_pre_down_nolock() ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc: reorganize initialization of kern_ipc_perm.seq ipc: compute kern_ipc_perm.id under the ipc lock init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp adfs: use timespec64 for time conversion kernel/sysctl.c: fix typos in comments drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md fork: don't copy inconsistent signal handler state to child signal: make get_signal() return bool signal: make sigkill_pending() return bool ... |
||
|
|
06e62a46bb |
fork: don't copy inconsistent signal handler state to child
Before this change, if a multithreaded process forks while one of its threads is changing a signal handler using sigaction(), the memcpy() in copy_sighand() can race with the struct assignment in do_sigaction(). It isn't clear whether this can cause corruption of the userspace signal handler pointer, but it definitely can cause inconsistency between different fields of struct sigaction. Take the appropriate spinlock to avoid this. I have tested that this patch prevents inconsistency between sa_sigaction and sa_flags, which is possible before this patch. Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a2e5144538 |
kernel/hung_task.c: allow to set checking interval separately from timeout
Currently task hung checking interval is equal to timeout, as the result hung is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but this hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall is not detected as task hung and task hung is not detected as silent machine loss. The large variance in task hung detection timeout requires setting silent machine loss timeout to a very large value (e.g. if task hung is 3 mins, then silent loss need to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because usually we crash kernel within a minute, and this can add hours to bug localization process as it needs to do dozens of tests. Allow setting checking interval separately from timeout. This allows to set timeout to, say, 3 minutes, but checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: checking interval is equal to timeout. [akpm@linux-foundation.org: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
a670468f5e |
mm: zero out the vma in vma_init()
Rather than in vm_area_alloc(). To ensure that the various oddball stack-based vmas are in a good state. Some of the callers were zeroing them out, others were not. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
0214f46b3a |
Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
"It was observed that a periodic timer in combination with a
sufficiently expensive fork could prevent fork from every completing.
This contains the changes to remove the need for that restart.
This set of changes is split into several parts:
- The first part makes PIDTYPE_TGID a proper pid type instead
something only for very special cases. The part starts using
PIDTYPE_TGID enough so that in __send_signal where signals are
actually delivered we know if the signal is being sent to a a group
of processes or just a single process.
- With that prep work out of the way the logic in fork is modified so
that fork logically makes signals received while it is running
appear to be received after the fork completes"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
signal: Don't send signals to tasks that don't exist
signal: Don't restart fork when signals come in.
fork: Have new threads join on-going signal group stops
fork: Skip setting TIF_SIGPENDING in ptrace_init_task
signal: Add calculate_sigpending()
fork: Unconditionally exit if a fatal signal is pending
fork: Move and describe why the code examines PIDNS_ADDING
signal: Push pid type down into complete_signal.
signal: Push pid type down into __send_signal
signal: Push pid type down into send_signal
signal: Pass pid type into do_send_sig_info
signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
signal: Pass pid type into group_send_sig_info
signal: Pass pid and pid type into send_sigqueue
posix-timers: Noralize good_sigevent
signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
pid: Implement PIDTYPE_TGID
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
kvm: Don't open code task_pid in kvm_vcpu_ioctl
pids: Compute task_tgid using signal->leader_pid
...
|
||
|
|
d46eb14b73 |
fs: fsnotify: account fsnotify metadata to kmemcg
Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
73ba2fb33c |
Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"First pull request for this merge window, there will also be a
followup request with some stragglers.
This pull request contains:
- Fix for a thundering heard issue in the wbt block code (Anchal
Agarwal)
- A few NVMe pull requests:
* Improved tracepoints (Keith)
* Larger inline data support for RDMA (Steve Wise)
* RDMA setup/teardown fixes (Sagi)
* Effects log suppor for NVMe target (Chaitanya Kulkarni)
* Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
* TP4004 (ANA) support (Christoph)
* Various NVMe fixes
- Block io-latency controller support. Much needed support for
properly containing block devices. (Josef)
- Series improving how we handle sense information on the stack
(Kees)
- Lightnvm fixes and updates/improvements (Mathias/Javier et al)
- Zoned device support for null_blk (Matias)
- AIX partition fixes (Mauricio Faria de Oliveira)
- DIF checksum code made generic (Max Gurtovoy)
- Add support for discard in iostats (Michael Callahan / Tejun)
- Set of updates for BFQ (Paolo)
- Removal of async write support for bsg (Christoph)
- Bio page dirtying and clone fixups (Christoph)
- Set of bcache fix/changes (via Coly)
- Series improving blk-mq queue setup/teardown speed (Ming)
- Series improving merging performance on blk-mq (Ming)
- Lots of other fixes and cleanups from a slew of folks"
* tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
blkcg: Make blkg_root_lookup() work for queues in bypass mode
bcache: fix error setting writeback_rate through sysfs interface
null_blk: add lock drop/acquire annotation
Blk-throttle: reduce tail io latency when iops limit is enforced
block: paride: pd: mark expected switch fall-throughs
block: Ensure that a request queue is dissociated from the cgroup controller
block: Introduce blk_exit_queue()
blkcg: Introduce blkg_root_lookup()
block: Remove two superfluous #include directives
blk-mq: count the hctx as active before allocating tag
block: bvec_nr_vecs() returns value for wrong slab
bcache: trivial - remove tailing backslash in macro BTREE_FLAG
bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
bcache: set max writeback rate when I/O request is idle
bcache: add code comments for bset.c
bcache: fix mistaken comments in request.c
bcache: fix mistaken code comments in bcache.h
bcache: add a comment in super.c
bcache: avoid unncessary cache prefetch bch_btree_node_get()
bcache: display rate debug parameters to 0 when writeback is not running
...
|
||
|
|
203b4fc903 |
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Thomas Gleixner: - Make lazy TLB mode even lazier to avoid pointless switch_mm() operations, which reduces CPU load by 1-2% for memcache workloads - Small cleanups and improvements all over the place * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove redundant check for kmem_cache_create() arm/asm/tlb.h: Fix build error implicit func declaration x86/mm/tlb: Make clear_asid_other() static x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off() x86/mm/tlb: Always use lazy TLB mode x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Restructure switch_mm_irqs_off() x86/mm/tlb: Leave lazy TLB mode at page table free time mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids x86/mm: Add TLB purge to free pmd/pte page interfaces ioremap: Update pgtable free interfaces with addr x86/mm: Disable ioremap free page handling on x86-PAE |
||
|
|
c3ad2c3b02 |
signal: Don't restart fork when signals come in.
Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn>
report that a periodic signal received during fork can cause fork to
continually restart preventing an application from making progress.
The code was being overly pessimistic. Fork needs to guarantee that a
signal sent to multiple processes is logically delivered before the
fork and just to the forking process or logically delivered after the
fork to both the forking process and it's newly spawned child. For
signals like periodic timers that are always delivered to a single
process fork can safely complete and let them appear to logically
delivered after the fork().
While examining this issue I also discovered that fork today will miss
signals delivered to multiple processes during the fork and handled by
another thread. Similarly the current code will also miss blocked
signals that are delivered to multiple process, as those signals will
not appear pending during fork.
Add a list of each thread that is currently forking, and keep on that
list a signal set that records all of the signals sent to multiple
processes. When fork completes initialize the new processes
shared_pending signal set with it. The calculate_sigpending function
will see those signals and set TIF_SIGPENDING causing the new task to
take the slow path to userspace to handle those signals. Making it
appear as if those signals were received immediately after the fork.
It is not possible to send real time signals to multiple processes and
exceptions don't go to multiple processes, which means that that are
no signals sent to multiple processes that require siginfo. This
means it is safe to not bother collecting siginfo on signals sent
during fork.
The sigaction of a child of fork is initially the same as the
sigaction of the parent process. So a signal the parent ignores the
child will also initially ignore. Therefore it is safe to ignore
signals sent to multiple processes and ignored by the forking process.
Signals sent to only a single process or only a single thread and delivered
during fork are treated as if they are received after the fork, and generally
not dealt with. They won't cause any problems.
V2: Added removal from the multiprocess list on failure.
V3: Use -ERESTARTNOINTR directly
V4: - Don't queue both SIGCONT and SIGSTOP
- Initialize signal_struct.multiprocess in init_task
- Move setting of shared_pending to before the new task
is visible to signals. This prevents signals from comming
in before shared_pending.signal is set to delayed.signal
and being lost.
V5: - rework list add and delete to account for idle threads
v6: - Use sigdelsetmask when removing stop signals
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447
Reported-by: Wen Yang <wen.yang99@zte.com.cn> and
Reported-by: majiang <ma.jiang@zte.com.cn>
Fixes:
|
||
|
|
05b9ba4b55 |
Merge tag 'v4.18-rc6' into for-4.19/block2
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
924de3b8c9 |
fork: Have new threads join on-going signal group stops
There are only two signals that are delivered to every member of a
signal group: SIGSTOP and SIGKILL. Signal delivery requires every
signal appear to be delivered either before or after a clone syscall.
SIGKILL terminates the clone so does not need to be considered. Which
leaves only SIGSTOP that needs to be considered when creating new
threads.
Today in the event of a group stop TIF_SIGPENDING will get set and the
fork will restart ensuring the fork syscall participates in the group
stop.
A fork (especially of a process with a lot of memory) is one of the
most expensive system so we really only want to restart a fork when
necessary.
It is easy so check to see if a SIGSTOP is ongoing and have the new
thread join it immediate after the clone completes. Making it appear
the clone completed happened just before the SIGSTOP.
The calculate_sigpending function will see the bits set in jobctl and
set TIF_SIGPENDING to ensure the new task takes the slow path to userspace.
V2: The call to task_join_group_stop was moved before the new task is
added to the thread group list. This should not matter as
sighand->siglock is held over both the addition of the threads,
the call to task_join_group_stop and do_signal_stop. But the change
is trivial and it is one less thing to worry about when reading
the code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
||
|
|
2c323017e3 |
blk-cgroup: clear the throttle queue on fork
We were hitting a panic in production where we put too many times on the request queue. This is because we'd get the throttle_queue of the parent if we fork()'ed while we needed to be throttled, but we didn't have a reference on it. Instead just clear these flags on fork so the child doesn't pay for the sins of its father. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
027232da7c |
mm: introduce vma_init()
Not all VMAs allocated with vm_area_alloc(). Some of them allocated on stack or in data segment. The new helper can be use to initialize VMA properly regardless where it was allocated. Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
7673bf553b |
fork: Unconditionally exit if a fatal signal is pending
In practice this does not change anything as testing for fatal_signal_pending and exiting for with an error code duplicates the work of the next clause which recalculates pending signals and then exits fork if any are pending. In both cases the pending signal will trigger the slow path when existing to userspace, and the fatal signal will cause do_exit to be called. The advantage of making this a separate test is that it makes it clear processing the fatal signal will terminate the fork, and it allows the rest of the signal logic to be updated without fear that this important case will be lost. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
4ca1d3ee46 |
fork: Move and describe why the code examines PIDNS_ADDING
Normally this would be something that would be handled by handling signals that are sent to a group of processes but in this case the forking process is not a member of the group being signaled. Thus special code is needed to prevent a race with pid namespaces exiting, and fork adding new processes within them. Move this test up before the signal restart just in case signals are also pending. Fatal conditions should take presedence over restarts. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
490fc05386 |
mm: make vm_area_alloc() initialize core fields
Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
95faf6992d |
mm: make vm_area_dup() actually copy the old vma data
.. and re-initialize th eanon_vma_chain head. This removes some boiler-plate from the users, and also makes it clear why it didn't need use the 'zalloc()' version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3928d4f5ee |
mm: use helper functions for allocating and freeing vm_area structs
The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded evertwhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.
We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.
Right now those functions are literally just wrappers around the
kmem_cache_*() calls. This is a purely mechanical conversion:
# new vma:
kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
# copy old vma
kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
# free vma
kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
6883f81aac |
pid: Implement PIDTYPE_TGID
Everywhere except in the pid array we distinguish between a tasks pid and a tasks tgid (thread group id). Even in the enumeration we want that distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid we almost have an implementation of PIDTYPE_TGID in struct signal_struct. Add PIDTYPE_TGID as a first class member of the pid_type enumeration and into the pids array. Then remove the __PIDTYPE_TGID special case and the leader_pid in signal_struct. The net size increase is just an extra pointer added to struct pid and an extra pair of pointers of an hlist_node added to task_struct. The effect on code maintenance is the removal of a number of special cases today and the potential to remove many more special cases as PIDTYPE_TGID gets used to it's fullest. The long term potential is allowing zombie thread group leaders to exit, which will remove a lot more special cases in the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
2c4704756c |
pids: Move the pgrp and session pid pointers from task_struct to signal_struct
To access these fields the code always has to go to group leader so going to signal struct is no loss and is actually a fundamental simplification. This saves a little bit of memory by only allocating the pid pointer array once instead of once for every thread, and even better this removes a few potential races caused by the fact that group_leader can be changed by de_thread, while signal_struct can not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> |
||
|
|
c1a2f7f0c0 |
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
The mm_struct always contains a cpumask bitmap, regardless of CONFIG_CPUMASK_OFFSTACK. That means the first step can be to simplify things, and simply have one bitmask at the end of the mm_struct for the mm_cpumask. This does necessitate moving everything else in mm_struct into an anonymous sub-structure, which can be randomized when struct randomization is enabled. The second step is to determine the correct size for the mm_struct slab object from the size of the mm_struct (excluding the CPU bitmap) and the size the cpumask. For init_mm we can simply allocate the maximum size this kernel is compiled for, since we only have one init_mm in the system, anyway. Pointer magic by Mike Galbraith, to evade -Wstringop-overflow getting confused by the dynamically sized array. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
655c79bb40 |
mm: check for SIGKILL inside dup_mmap() loop
As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting exit_mmap() due to NULL mmap ***** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
050e9baa9d |
Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables
The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
d82991a868 |
Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull restartable sequence support from Thomas Gleixner: "The restartable sequences syscall (finally): After a lot of back and forth discussion and massive delays caused by the speculative distraction of maintainers, the core set of restartable sequences has finally reached a consensus. It comes with the basic non disputed core implementation along with support for arm, powerpc and x86 and a full set of selftests It was exposed to linux-next earlier this week, so it does not fully comply with the merge window requirements, but there is really no point to drag it out for yet another cycle" * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Provide Makefile, scripts, gitignore rseq/selftests: Provide parametrized tests rseq/selftests: Provide basic percpu ops test rseq/selftests: Provide basic test rseq/selftests: Provide rseq library selftests/lib.mk: Introduce OVERRIDE_TARGETS powerpc: Wire up restartable sequences system call powerpc: Add syscall detection for restartable sequences powerpc: Add support for restartable sequences x86: Wire up restartable sequence system call x86: Add support for restartable sequences arm: Wire up restartable sequences system call arm: Add syscall detection for restartable sequences arm: Add restartable sequences support rseq: Introduce restartable sequences system call uapi/headers: Provide types_32_64.h |
||
|
|
88aa7cc688 |
mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.
And, the mmap_sem contention may cause unexpected issue like below:
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
Call Trace:
schedule+0x36/0x80
rwsem_down_read_failed+0xf0/0x150
call_rwsem_down_read_failed+0x18/0x30
down_read+0x20/0x40
proc_pid_cmdline_read+0xd9/0x4e0
__vfs_read+0x37/0x150
vfs_read+0x96/0x130
SyS_read+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xc5
Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.
So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others, as well as replace
write map_sem to read to protect the race condition between prctl and
sys_brk which might break check_data_rlimit(), and makes prctl more
friendly to other VM operations.
This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely since the later
access_remote_vm() call needs acquire mmap_sem. The mmap_sem
scalability issue will be solved in the future.
[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
|
8b5c6a3a49 |
Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore: "Another reasonable chunk of audit changes for v4.18, thirteen patches in total. The thirteen patches can mostly be broken down into one of four categories: general bug fixes, accessor functions for audit state stored in the task_struct, negative filter matches on executable names, and extending the (relatively) new seccomp logging knobs to the audit subsystem. The main driver for the accessor functions from Richard are the changes we're working on to associate audit events with containers, but I think they have some standalone value too so I figured it would be good to get them in now. The seccomp/audit patches from Tyler apply the seccomp logging improvements from a few releases ago to audit's seccomp logging; starting with this patchset the changes in /proc/sys/kernel/seccomp/actions_logged should apply to both the standard kernel logging and audit. As usual, everything passes the audit-testsuite and it happens to merge cleanly with your tree" [ Heh, except it had trivial merge conflicts with the SELinux tree that also came in from Paul - Linus ] * tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Fix wrong task in comparison of session ID audit: use existing session info function audit: normalize loginuid read access audit: use new audit_context access funciton for seccomp_actions_logged audit: use inline function to set audit context audit: use inline function to get audit context audit: convert sessionid unset to a macro seccomp: Don't special case audited processes when logging seccomp: Audit attempts to modify the actions_logged sysctl seccomp: Configurable separator for the actions_logged string seccomp: Separate read and write code for actions_logged sysctl audit: allow not equal op for audit by executable audit: add syscall information to FEATURE_CHANGE records |
||
|
|
d7822b1e24 |
rseq: Introduce restartable sequences system call
Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Watson <davejwatson@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Chris Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Hunter <ahh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Maurer <bmaurer@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com |
||
|
|
c0b0ae8a87 |
audit: use inline function to set audit context
Recognizing that the audit context is an internal audit value, use an access function to set the audit context pointer for the task rather than reaching directly into the task struct to set it. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in audit.h] Signed-off-by: Paul Moore <paul@paul-moore.com> |
||
|
|
e01e80634e |
fork: unconditionally clear stack on fork
One of the classes of kernel stack content leaks[1] is exposing the contents of prior heap or stack contents when a new process stack is allocated. Normally, those stacks are not zeroed, and the old contents remain in place. In the face of stack content exposure flaws, those contents can leak to userspace. Fixing this will make the kernel no longer vulnerable to these flaws, as the stack will be wiped each time a stack is assigned to a new process. There's not a meaningful change in runtime performance; it almost looks like it provides a benefit. Performing back-to-back kernel builds before: Run times: 157.86 157.09 158.90 160.94 160.80 Mean: 159.12 Std Dev: 1.54 and after: Run times: 159.31 157.34 156.71 158.15 160.81 Mean: 158.46 Std Dev: 1.46 Instead of making this a build or runtime config, Andy Lutomirski recommended this just be enabled by default. [1] A noisy search for many kinds of stack content leaks can be seen here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak I did some more with perf and cycle counts on running 100,000 execs of /bin/true. before: Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841 Mean: 221015379122.60 Std Dev: 4662486552.47 after: Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348 Mean: 217745009865.40 Std Dev: 5935559279.99 It continues to look like it's faster, though the deviation is rather wide, but I'm not sure what I could do that would be less noisy. I'm open to ideas! Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
3eda69c92d |
kernel/fork.c: detect early free of a live mm
KASAN splats indicate that in some cases we free a live mm, then continue to access it, with potentially disastrous results. This is likely due to a mismatched mmdrop() somewhere in the kernel, but so far the culprit remains elusive. Let's have __mmdrop() verify that the mm isn't live for the current task, similar to the existing check for init_mm. This way, we can catch this class of issue earlier, and without requiring KASAN. Currently, idle_task_exit() leaves active_mm stale after it switches to init_mm. This isn't harmful, but will trigger the new assertions, so we must adjust idle_task_exit() to update active_mm. Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|
|
9b32105ec6 |
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
Using this helper allows us to avoid the in-kernel calls to the sys_unshare() syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_unshare(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> |
||
|
|
2de0db992d |
mm: use do_futex() instead of sys_futex() in mm_release()
sys_futex() is a wrapper to do_futex() which does not modify any values here: - uaddr, val and val3 are kept the same - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE. Therefore, val2 is always 0. - as utime is set to NULL, *timeout is NULL This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> |
||
|
|
d34bc48f82 |
include/linux/sched/mm.h: re-inline mmdrop()
As Peter points out, Doing a CALL+RET for just the decrement is a bit silly.
Fixes:
|
||
|
|
a2e5790d84 |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ... |
||
|
|
667b60946e |
kernel/fork.c: add comment about usage of CLONE_FS flags and namespaces
All other places that deals with namespaces have an explanation of why
the restriction is there.
The description added in this commit was based on commit
|