kernel_xiaomi_sm8250

Author	SHA1	Message	Date
Greg Kroah-Hartman	071b028ed3	Merge 4.19.45 into android-4.19-q Changes in 4.19.45 locking/rwsem: Prevent decrement of reader count before increment x86/speculation/mds: Revert CPU buffer clear on double fault exit x86/speculation/mds: Improve CPU buffer clear documentation objtool: Fix function fallthrough detection arm64: dts: rockchip: Disable DCMDs on RK3399's eMMC controller. ARM: dts: exynos: Fix interrupt for shared EINTs on Exynos5260 ARM: dts: exynos: Fix audio (microphone) routing on Odroid XU3 mmc: sdhci-of-arasan: Add DTS property to disable DCMDs. ARM: exynos: Fix a leaked reference by adding missing of_node_put power: supply: axp288_charger: Fix unchecked return value power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist arm64: mmap: Ensure file offset is treated as unsigned arm64: arch_timer: Ensure counter register reads occur with seqlock held arm64: compat: Reduce address limit arm64: Clear OSDLR_EL1 on CPU boot arm64: Save and restore OSDLR_EL1 across suspend/resume sched/x86: Save [ER]FLAGS on context switch crypto: crypto4xx - fix ctr-aes missing output IV crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues crypto: salsa20 - don't access already-freed walk.iv crypto: chacha20poly1305 - set cra_name correctly crypto: ccp - Do not free psp_master when PLATFORM_INIT fails crypto: vmx - fix copy-paste error in CTR mode crypto: skcipher - don't WARN on unprocessed data after slow walk step crypto: crct10dif-generic - fix use via crypto_shash_digest() crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest() crypto: arm64/gcm-aes-ce - fix no-NEON fallback code crypto: gcm - fix incompatibility between "gcm" and "gcm_base" crypto: rockchip - update IV buffer to contain the next IV crypto: arm/aes-neonbs - don't access already-freed walk.iv crypto: arm64/aes-neonbs - don't access already-freed walk.iv mmc: core: Fix tag set memory leak ALSA: line6: toneport: Fix broken usage of timer for delayed execution ALSA: usb-audio: Fix a memory leak bug ALSA: hda/hdmi - Read the pin sense from register when repolling ALSA: hda/hdmi - Consider eld_valid when reporting jack event ALSA: hda/realtek - EAPD turn on later ALSA: hdea/realtek - Headset fixup for System76 Gazelle (gaze14) ASoC: max98090: Fix restore of DAPM Muxes ASoC: RT5677-SPI: Disable 16Bit SPI Transfers ASoC: fsl_esai: Fix missing break in switch statement ASoC: codec: hdac_hdmi add device_link to card device bpf, arm64: remove prefetch insn in xadd mapping crypto: ccree - remove special handling of chained sg crypto: ccree - fix mem leak on error path crypto: ccree - don't map MAC key on stack crypto: ccree - use correct internal state sizes for export crypto: ccree - don't map AEAD key and IV on stack crypto: ccree - pm resume first enable the source clk crypto: ccree - HOST_POWER_DOWN_EN should be the last CC access during suspend crypto: ccree - add function to handle cryptocell tee fips error crypto: ccree - handle tee fips error during power management resume mm/mincore.c: make mincore() more conservative mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses mm/hugetlb.c: don't put_page in lock of hugetlb_lock hugetlb: use same fault hash key for shared and private mappings ocfs2: fix ocfs2 read inode data panic in ocfs2_iget userfaultfd: use RCU to free the task struct when fork fails ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values mtd: spi-nor: intel-spi: Avoid crossing 4K address boundary on read/write tty: vt.c: Fix TIOCL_BLANKSCREEN console blanking if blankinterval == 0 tty/vt: fix write/write race in ioctl(KDSKBSENT) handler jbd2: check superblock mapped prior to committing ext4: make sanity check in mballoc more strict ext4: ignore e_value_offs for xattrs with value-in-ea-inode ext4: avoid drop reference to iloc.bh twice ext4: fix use-after-free race with debug_want_extra_isize ext4: actually request zeroing of inode table after grow ext4: fix ext4_show_options for file systems w/o journal btrfs: Check the first key and level for cached extent buffer btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails btrfs: Honour FITRIM range constraints during free space trim Btrfs: send, flush dellaloc in order to avoid data loss Btrfs: do not start a transaction during fiemap Btrfs: do not start a transaction at iterate_extent_inodes() bcache: fix a race between cache register and cacheset unregister bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() ipmi:ssif: compare block number correctly for multi-part return messages crypto: ccm - fix incompatibility between "ccm" and "ccm_base" fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount tty: Don't force RISCV SBI console as preferred console ext4: zero out the unused memory region in the extent tree block ext4: fix data corruption caused by overlapping unaligned and aligned IO ext4: fix use-after-free in dx_release() ext4: avoid panic during forced reboot due to aborted journal ALSA: hda/realtek - Corrected fixup for System76 Gazelle (gaze14) ALSA: hda/realtek - Fixup headphone noise via runtime suspend ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug jbd2: fix potential double free KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes KVM: lapic: Busy wait for timer to expire when using hv_timer kbuild: turn auto.conf.cmd into a mandatory include file xen/pvh: set xen_domain_type to HVM in xen_pvh_init libnvdimm/namespace: Fix label tracking error iov_iter: optimize page_copy_sane() pstore: Centralize init/exit routines pstore: Allocate compression during late_initcall() pstore: Refactor compression initialization ext4: fix compile error when using BUFFER_TRACE ext4: don't update s_rev_level if not required Linux 4.19.45 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-05-22 08:01:49 +02:00
Greg Kroah-Hartman	50f91435a2	Merge 4.19.45 into android-4.19 Changes in 4.19.45 locking/rwsem: Prevent decrement of reader count before increment x86/speculation/mds: Revert CPU buffer clear on double fault exit x86/speculation/mds: Improve CPU buffer clear documentation objtool: Fix function fallthrough detection arm64: dts: rockchip: Disable DCMDs on RK3399's eMMC controller. ARM: dts: exynos: Fix interrupt for shared EINTs on Exynos5260 ARM: dts: exynos: Fix audio (microphone) routing on Odroid XU3 mmc: sdhci-of-arasan: Add DTS property to disable DCMDs. ARM: exynos: Fix a leaked reference by adding missing of_node_put power: supply: axp288_charger: Fix unchecked return value power: supply: axp288_fuel_gauge: Add ACEPC T8 and T11 mini PCs to the blacklist arm64: mmap: Ensure file offset is treated as unsigned arm64: arch_timer: Ensure counter register reads occur with seqlock held arm64: compat: Reduce address limit arm64: Clear OSDLR_EL1 on CPU boot arm64: Save and restore OSDLR_EL1 across suspend/resume sched/x86: Save [ER]FLAGS on context switch crypto: crypto4xx - fix ctr-aes missing output IV crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues crypto: salsa20 - don't access already-freed walk.iv crypto: chacha20poly1305 - set cra_name correctly crypto: ccp - Do not free psp_master when PLATFORM_INIT fails crypto: vmx - fix copy-paste error in CTR mode crypto: skcipher - don't WARN on unprocessed data after slow walk step crypto: crct10dif-generic - fix use via crypto_shash_digest() crypto: x86/crct10dif-pcl - fix use via crypto_shash_digest() crypto: arm64/gcm-aes-ce - fix no-NEON fallback code crypto: gcm - fix incompatibility between "gcm" and "gcm_base" crypto: rockchip - update IV buffer to contain the next IV crypto: arm/aes-neonbs - don't access already-freed walk.iv crypto: arm64/aes-neonbs - don't access already-freed walk.iv mmc: core: Fix tag set memory leak ALSA: line6: toneport: Fix broken usage of timer for delayed execution ALSA: usb-audio: Fix a memory leak bug ALSA: hda/hdmi - Read the pin sense from register when repolling ALSA: hda/hdmi - Consider eld_valid when reporting jack event ALSA: hda/realtek - EAPD turn on later ALSA: hdea/realtek - Headset fixup for System76 Gazelle (gaze14) ASoC: max98090: Fix restore of DAPM Muxes ASoC: RT5677-SPI: Disable 16Bit SPI Transfers ASoC: fsl_esai: Fix missing break in switch statement ASoC: codec: hdac_hdmi add device_link to card device bpf, arm64: remove prefetch insn in xadd mapping crypto: ccree - remove special handling of chained sg crypto: ccree - fix mem leak on error path crypto: ccree - don't map MAC key on stack crypto: ccree - use correct internal state sizes for export crypto: ccree - don't map AEAD key and IV on stack crypto: ccree - pm resume first enable the source clk crypto: ccree - HOST_POWER_DOWN_EN should be the last CC access during suspend crypto: ccree - add function to handle cryptocell tee fips error crypto: ccree - handle tee fips error during power management resume mm/mincore.c: make mincore() more conservative mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses mm/hugetlb.c: don't put_page in lock of hugetlb_lock hugetlb: use same fault hash key for shared and private mappings ocfs2: fix ocfs2 read inode data panic in ocfs2_iget userfaultfd: use RCU to free the task struct when fork fails ACPI: PM: Set enable_for_wake for wakeup GPEs during suspend-to-idle mfd: da9063: Fix OTP control register names to match datasheets for DA9063/63L mfd: max77620: Fix swapped FPS_PERIOD_MAX_US values mtd: spi-nor: intel-spi: Avoid crossing 4K address boundary on read/write tty: vt.c: Fix TIOCL_BLANKSCREEN console blanking if blankinterval == 0 tty/vt: fix write/write race in ioctl(KDSKBSENT) handler jbd2: check superblock mapped prior to committing ext4: make sanity check in mballoc more strict ext4: ignore e_value_offs for xattrs with value-in-ea-inode ext4: avoid drop reference to iloc.bh twice ext4: fix use-after-free race with debug_want_extra_isize ext4: actually request zeroing of inode table after grow ext4: fix ext4_show_options for file systems w/o journal btrfs: Check the first key and level for cached extent buffer btrfs: Correctly free extent buffer in case btree_read_extent_buffer_pages fails btrfs: Honour FITRIM range constraints during free space trim Btrfs: send, flush dellaloc in order to avoid data loss Btrfs: do not start a transaction during fiemap Btrfs: do not start a transaction at iterate_extent_inodes() bcache: fix a race between cache register and cacheset unregister bcache: never set KEY_PTRS of journal key to 0 in journal_reclaim() ipmi:ssif: compare block number correctly for multi-part return messages crypto: ccm - fix incompatibility between "ccm" and "ccm_base" fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount tty: Don't force RISCV SBI console as preferred console ext4: zero out the unused memory region in the extent tree block ext4: fix data corruption caused by overlapping unaligned and aligned IO ext4: fix use-after-free in dx_release() ext4: avoid panic during forced reboot due to aborted journal ALSA: hda/realtek - Corrected fixup for System76 Gazelle (gaze14) ALSA: hda/realtek - Fixup headphone noise via runtime suspend ALSA: hda/realtek - Fix for Lenovo B50-70 inverted internal microphone bug jbd2: fix potential double free KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes KVM: lapic: Busy wait for timer to expire when using hv_timer kbuild: turn auto.conf.cmd into a mandatory include file xen/pvh: set xen_domain_type to HVM in xen_pvh_init libnvdimm/namespace: Fix label tracking error iov_iter: optimize page_copy_sane() pstore: Centralize init/exit routines pstore: Allocate compression during late_initcall() pstore: Refactor compression initialization ext4: fix compile error when using BUFFER_TRACE ext4: don't update s_rev_level if not required Linux 4.19.45 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-05-22 08:00:39 +02:00
Andrea Arcangeli	8bae439855	userfaultfd: use RCU to free the task struct when fork fails commit c3f3ce049f7d97cc7ec9c01cb51d9ec74e0f37c2 upstream. The task structure is freed while get_mem_cgroup_from_mm() holds rcu_read_lock() and dereferences mm->owner. get_mem_cgroup_from_mm() failing fork() ---- --- task = mm->owner mm->owner = NULL; free(task) if (task) task; / use after free */ The fix consists in freeing the task with RCU also in the fork failure case, exactly like it always happens for the regular exit(2) path. That is enough to make the rcu_read_lock hold in get_mem_cgroup_from_mm() (left side above) effective to avoid a use after free when dereferencing the task structure. An alternate possible fix would be to defer the delivery of the userfaultfd contexts to the monitor until after fork() is guaranteed to succeed. Such a change would require more changes because it would create a strict ordering dependency where the uffd methods would need to be called beyond the last potentially failing branch in order to be safe. This solution as opposed only adds the dependency to common code to set mm->owner to NULL and to free the task struct that was pointed by mm->owner with RCU, if fork ends up failing. The userfaultfd methods can still be called anywhere during the fork runtime and the monitor will keep discarding orphaned "mm" coming from failed forks in userland. This race condition couldn't trigger if CONFIG_MEMCG was set =n at build time. [aarcange@redhat.com: improve changelog, reduce #ifdefs per Michal] Link: http://lkml.kernel.org/r/20190429035752.4508-1-aarcange@redhat.com Link: http://lkml.kernel.org/r/20190325225636.11635-2-aarcange@redhat.com Fixes: `893e26e61d` ("userfaultfd: non-cooperative: Add fork() event") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: zhong jiang <zhongjiang@huawei.com> Reported-by: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: zhong jiang <zhongjiang@huawei.com> Cc: syzbot+cbb52e396df3e565ab02@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-05-22 07:37:41 +02:00
Laurent Dufour	0c8a35f8dd	mm: protect against PTE changes done by dup_mmap() Vinayak Menon and Ganesh Mahendran reported that the following scenario may lead to thread being blocked due to data corruption: CPU 1 CPU 2 CPU 3 Process 1, Process 1, Process 1, Thread A Thread B Thread C while (1) { while (1) { while(1) { pthread_mutex_lock(l) pthread_mutex_lock(l) fork pthread_mutex_unlock(l) pthread_mutex_unlock(l) } } } In the details this happens because : CPU 1 CPU 2 CPU 3 fork() copy_pte_range() set PTE rdonly got to next VMA... . PTE is seen rdonly PTE still writable . thread is writing to page . -> page fault . copy the page Thread writes to page . . -> no page fault . update the PTE . flush TLB for that PTE flush TLB PTE are now rdonly So the write done by the CPU 3 is interfering with the page copy operation done by CPU 2, leading to the data corruption. To avoid this we mark all the VMA involved in the COW mechanism as changing by calling vm_write_begin(). This ensures that the speculative page fault handler will not try to handle a fault on these pages. The marker is set until the TLB is flushed, ensuring that all the CPUs will now see the PTE as not writable. Once the TLB is flush, the marker is removed by calling vm_write_end(). The variable last is used to keep tracked of the latest VMA marked to handle the error path where part of the VMA may have been marked. Change-Id: I3fe07109e27d8f77c9b435053567fe5c287703aa Reported-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Reported-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Link: https://www.spinics.net/lists/linux-mm/msg171207.html Patch-mainline: linux-mm@ Fri, 18 Jan 2019 17:24:16 Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2019-04-01 21:57:33 -07:00
Laurent Dufour	3f31f748a8	mm: protect mm_rb tree with a rwlock This change is inspired by the Peter's proposal patch [1] which was protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in that particular case, and it is introducing major performance degradation due to excessive scheduling operations. To allow access to the mm_rb tree without grabbing the mmap_sem, this patch is protecting it access using a rwlock. As the mm_rb tree is a O(log n) search it is safe to protect it using such a lock. The VMA cache is not protected by the new rwlock and it should not be used without holding the mmap_sem. To allow the picked VMA structure to be used once the rwlock is released, a use count is added to the VMA structure. When the VMA is allocated it is set to 1. Each time the VMA is picked with the rwlock held its use count is incremented. Each time the VMA is released it is decremented. When the use count hits zero, this means that the VMA is no more used and should be freed. This patch is preparing for 2 kind of VMA access : - as usual, under the control of the mmap_sem, - without holding the mmap_sem for the speculative page fault handler. Access done under the control the mmap_sem doesn't require to grab the rwlock to protect read access to the mm_rb tree, but access in write must be done under the protection of the rwlock too. This affects inserting and removing of elements in the RB tree. The patch is introducing 2 new functions: - vma_get() to find a VMA based on an address by holding the new rwlock. - vma_put() to release the VMA when its no more used. These services are designed to be used when access are made to the RB tree without holding the mmap_sem. When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and we rely on the WMB done when releasing the rwlock to serialize the write with the RMB done in a later patch to check for the VMA's validity. When free_vma is called, the file associated with the VMA is closed immediately, but the policy and the file structure remained in used until the VMA's use count reach 0, which may happens later when exiting an in progress speculative page fault. [1] https://patchwork.kernel.org/patch/5108281/ Change-Id: I9ecc922b8efa4b28975cc6a8e9531284c24ac14e Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:23 [vinmenon@codeaurora.org: fix the return of put_vma] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2019-04-01 12:48:55 +05:30
Laurent Dufour	ead04c98fd	mm: introduce INIT_VMA() Some VMA struct fields need to be initialized once the VMA structure is allocated. Currently this only concerns anon_vma_chain field but some other will be added to support the speculative page fault. Instead of spreading the initialization calls all over the code, let's introduce a dedicated inline function. Change-Id: I9f6b29dc74055354318b548e2b6b22c37d4c61bb Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:13 [vinmenon@codeaurora.org: trivial merge conflict fixes] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> [charante@codeaurora.org: merge conflict fixes] Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2019-03-29 03:08:28 -07:00
Johannes Weiner	e550f94252	UPSTREAM: psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:25:27 -07:00
Johannes Weiner	25f0c86db1	psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Change-Id: Ic4b6fddb7013719ffcba5d68abfda87ada90715a Git-commit: eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 Git-repo: https://source.codeaurora.org/quic/la/kernel/msm-4.19 [pdaly@codeaurora.org: Move PF_MEMSTALL flag] Signed-off-by: Patrick Daly <pdaly@codeaurora.org>	2019-03-08 19:33:14 -08:00
Connor O'Brien	406d53f0c7	ANDROID: cpufreq: track per-task time in state Add time in state data to task structs, and create /proc/<pid>/time_in_state files to show how long each individual task has run at each frequency. Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking. Bug: 72339335 Bug: 127641090 Test: Read /proc/<pid>/time_in_state Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8 Signed-off-by: Connor O'Brien <connoro@google.com> [astrachan: Folded the following changes into this patch: a6d3de6a7fba ("ANDROID: Reduce use of #ifdef CONFIG_CPU_FREQ_TIMES") b89ada5d9c09 ("ANDROID: Fix massive cpufreq_times memory leaks")] Signed-off-by: Alistair Strachan <astrachan@google.com>	2019-03-06 15:57:25 +00:00
Ivaylo Georgiev	ffb1bfc29b	Merge android-4.19.15 (`caf5433`) into msm-4.19 * refs/heads/tmp-caf5433: Linux 4.19.15 bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw drm/amd/display: Fix unintialized max_bpc state values drm/rockchip: psr: do not dereference encoder before it is null checked. drm/vc4: Set ->is_yuv to false when num_planes == 1 drm/nouveau/drm/nouveau: Check rc from drm_dp_mst_topology_mgr_resume() lib: fix build failure in CONFIG_DEBUG_VIRTUAL test of: __of_detach_node() - remove node from phandle cache of: of_node_get()/of_node_put() nodes held in phandle cache power: supply: olpc_battery: correct the temperature units intel_th: msu: Fix an off-by-one in attribute store genwqe: Fix size check drivers/perf: hisi: Fixup one DDRC PMU register offset video: fbdev: pxafb: Fix "WARNING: invalid free of devm_ allocated data" ceph: don't update importing cap's mseq when handing cap export sched/fair: Fix infinite loop in update_blocked_averages() by reverting `a9e7f6544b` iommu/vt-d: Handle domain agaw being less than iommu agaw RDMA/srpt: Fix a use-after-free in the channel release code rxe: fix error completion wr_id and qp_num 9p/net: put a lower bound on msize iio: dac: ad5686: fix bit shift read register powerpc/tm: Set MSR[TS] just prior to recheckpoint Revert "powerpc/tm: Unset MSR[TS] if not recheckpointing" leds: pwm: silently error out on EPROBE_DEFER arm64: relocatable: fix inconsistencies in linker script and options arm64: drop linker script hack to hide __efistub_ symbols nfsd4: zero-length WRITE should succeed lockd: Show pid of lockd for remote locks PCI / PM: Allow runtime PM without callback functions selinux: policydb - fix byte order and alignment issues b43: Fix error in cordic routine gfs2: Fix loop in gfs2_rbm_find gfs2: Get rid of potential double-freeing in gfs2_create_inode dlm: memory leaks on error path in dlm_user_request() dlm: lost put_lkb on error path in receive_convert() and receive_unlock() dlm: possible memory leak on error path in create_lkb() dlm: fixed memory leaks after failed ls_remove_names allocation block: mq-deadline: Fix write completion handling block: deactivate blk_stat timer in wbt_disable_default() Fix failure path in alloc_pid() driver core: Add missing dev->bus->need_parent_lock checks srcu: Lock srcu_data structure in srcu_gp_start() ALSA: usb-audio: Always check descriptor sizes in parser code ALSA: usb-audio: Fix an out-of-bound read in create_composite_quirks ALSA: usb-audio: Check mixer unit descriptors more strictly ALSA: usb-audio: Avoid access before bLength check in build_audio_procunit() ALSA: cs46xx: Potential NULL dereference in probe media: cx23885: only reset DMA on problematic CPUs mt76x0: init hw capabilities dma-direct: do not include SME mask in the DMA supported check raid6/ppc: Fix build for clang powerpc/boot: Set target when cross-compiling for clang Makefile: Export clang toolchain variables kbuild: consolidate Clang compiler flags kbuild: add -no-integrated-as Clang option unconditionally powerpc: Disable -Wbuiltin-requires-header when setjmp is used powerpc: avoid -mno-sched-epilog on GCC 4.9 and newer powerpc: consolidate -mno-sched-epilog into FTRACE flags powerpc: remove old GCC version checks sunrpc: use SVC_NET() in svcauth_gss_* functions sunrpc: fix cache_head leak due to queued request memcg, oom: notify on oom killer invocation from the charge path mm, swap: fix swapoff with KSM pages mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL mm, hmm: use devm semantics for hmm_devmem_{add, remove} mm, devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support mm, devm_memremap_pages: fix shutdown handling mm, devm_memremap_pages: kill mapping "System RAM" support mm, devm_memremap_pages: mark devm_memremap_pages() EXPORT_SYMBOL_GPL hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined zram: fix double free backing device fork: record start_time late scsi: lpfc: do not set queue->page_count to 0 if pc_sli4_params.wqpcnt is invalid scsi: zfcp: fix posting too many status read buffers leading to adapter shutdown auxdisplay: charlcd: fix x/y command parsing serial/sunsu: fix refcount leak qmi_wwan: Fix qmap header retrieval in qmimux_rx_fixup net: netxen: fix a missing check and an uninitialized use Input: synaptics - enable SMBus for HP EliteBook 840 G4 gpio: mvebu: only fail on missing clk if pwm is actually to be used lan743x: Remove MAC Reset from initialization virtio: fix test build after uio.h change m68k: Fix memblock-related crashes kbuild: fix false positive warning/error about missing libelf mac80211: free skb fraglist before freeing the skb nl80211: fix memory leak if validate_pae_over_nl80211() fails vxge: ensure data0 is initialized in when fetching firmware version information lan78xx: Resolve issue with changing MAC address lan743x: Expand phy search for LAN7431 net: macb: add missing barriers when reading descriptors net: macb: fix dropped RX frames due to a race net: macb: fix random memory corruption on RX with 64-bit DMA qed: Fix an error code qed_ll2_start_xmit() SUNRPC: Fix a race with XPRT_CONNECTING mac80211: fix a kernel panic when TXing after TXQ teardown net: hns: Fix ping failed when use net bridge and send multicast net: hns: Add mac pcs config when enable\|disable mac net: hns: Fix ntuple-filters status error. net: hns: Avoid net reset caused by pause frames storm net: hns: Free irq when exit from abnormal branch net: hns: Clean rx fbd when ae stopped. net: hns: Fixed bug that netdev was opened twice net: hns: Some registers use wrong address according to the datasheet. net: hns: All ports can not work when insmod hns ko after rmmod. net: hns: Incorrect offset address used for some registers. w90p910_ether: remove incorrect __init annotation net/tls: Init routines in create_ctx drivers: net: xgene: Remove unnecessary forward declarations x86, hyperv: remove PCI dependency mt76: fix potential NULL pointer dereference in mt76_stop_tx_queues scsi: target: iscsi: cxgbit: add missing spin_lock_init() scsi: target: iscsi: cxgbit: fix csk leak bnx2x: Send update-svid ramrod with retry/poll flags enabled bnx2x: Remove configured vlans as part of unload sequence. bnx2x: Clear fip MAC when fcoe offload support is disabled netfilter: nf_conncount: use rb_link_node_rcu() instead of rb_link_node() netfilter: nat: can't use dst_hold on noref dst netfilter: ipset: do not call ipset_nest_end after nla_nest_cancel ixgbe: Fix race when the VF driver does a reset i40e: fix mac filter delete when setting mac address x86/dump_pagetables: Fix LDT remap address marker x86/mm: Fix guard hole handling ieee802154: ca8210: fix possible u8 overflow in ca8210_rx_done ibmvnic: Fix non-atomic memory allocation in IRQ context ibmvnic: Convert reset work item mutex to spin lock Input: synaptics - enable RMI on ThinkPad T560 Input: omap-keypad - fix idle configuration to not block SoC idle states scsi: bnx2fc: Fix NULL dereference in error handling Revert "scsi: qla2xxx: Fix NVMe Target discovery" netfilter: seqadj: re-load tcp header pointer after possible head reallocation netfilter: nf_tables: fix suspicious RCU usage in nft_chain_stats_replace() ieee802154: hwsim: fix off-by-one in parse nested xfrm: Fix NULL pointer dereference in xfrm_input when skb_dst_force clears the dst_entry. xfrm: Fix bucket count reported to userspace xfrm: Fix error return code in xfrm_output_one() checkstack.pl: fix for aarch64 IB/core: Fix oops in netdev_next_upper_dev_rcu() drm/amdgpu: Fix DEBUG_LOCKS_WARN_ON(depth <= 0) in amdgpu_ctx.lock powerpc/mm: Fallback to RAM if the altmap is unusable Input: restore EV_ABS ABS_RESERVED IB/mlx5: Block DEVX umem from the non applicable cases ARM: dts: imx7d-nitrogen7: Fix the description of the Wifi clock ARM: imx: update the cpu power up timing setting on i.mx6sx ARM: dts: imx7d-pico: Describe the Wifi clock HID: ite: Add USB id match for another ITE based keyboard rfkill key quirk powerpc/mm: Fix linux page tables build with some configs powerpc: Fix COFF zImage booting on old powermacs arm64: dts: mt7622: fix no more console output on rfb1 pinctrl: meson: fix pull enable register calculation ARM: dts: sun8i: a83t: bananapi-m3: increase vcc-pd voltage to 3.3V f2fs: don't access node/meta inode mapping after iput f2fs: wait on atomic writes to count F2FS_CP_WB_DATA f2fs: sanity check of xattr entry size f2fs: fix use-after-free issue when accessing sbi->stat_info f2fs: check PageWriteback flag for ordered case f2fs: fix validation of the block count in sanity_check_raw_super f2fs: fix missing unlock(sbi->gc_mutex) f2fs: fix to dirty inode synchronously f2fs: clean up structure extent_node f2fs: fix block address for __check_sit_bitmap f2fs: fix sbi->extent_list corruption issue f2fs: clean up checkpoint flow f2fs: flush stale issued discard candidates f2fs: correct wrong spelling, issing_* f2fs: use kvmalloc, if kmalloc is failed f2fs: remove redundant comment of unused wio_mutex f2fs: fix to reorder set_page_dirty and wait_on_page_writeback f2fs: clear PG_writeback if IPU failed f2fs: add an ioctl() to explicitly trigger fsck later f2fs: avoid frequent costly fsck triggers f2fs: fix m_may_create to make OPU DIO write correctly f2fs: fix to update new block address correctly for OPU f2fs: adjust trace print in f2fs_get_victim() to cover all paths f2fs: fix to allow node segment for GC by ioctl path f2fs: make "f2fs_fault_name[]" const char * f2fs: read page index before freeing f2fs: fix wrong return value of f2fs_acl_create f2fs: avoid build warn of fall_through f2fs: fix race between write_checkpoint and write_begin f2fs: check memory boundary by insane namelen f2fs: only flush the single temp bio cache which owns the target page f2fs: fix out-place-update DIO write f2fs: fix to be aware discard/preflush/dio command in is_idle() f2fs: add to account direct IO f2fs: move dir data flush to write checkpoint process f2fs: Change to use DEFINE_SHOW_ATTRIBUTE macro f2fs: change segment to section in f2fs_ioc_gc_range f2fs: export migration_granularity sysfs entry f2fs: support subsectional garbage collection f2fs: introduce __is_large_section() for cleanup f2fs: clean up f2fs_sb_has_##feature_name f2fs: remove codes of unused wio_mutex f2fs: fix count of seg_freed to make sec_freed correct f2fs: fix to account preflush command for noflush_merge mode f2fs: avoid GC causing encrypted file corrupted Conflicts: mm/memory_hotplug.c Change-Id: I26d2fbdddfa882fe9aae568a84a9269725ffb5ea Signed-off-by: Ivaylo Georgiev <irgeorgiev@codeaurora.org>	2019-02-27 02:23:14 -08:00
David Herrmann	bc999b5099	fork: record start_time late commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream. This changes the fork(2) syscall to record the process start_time after initializing the basic task structure but still before making the new process visible to user-space. Technically, we could record the start_time anytime during fork(2). But this might lead to scenarios where a start_time is recorded long before a process becomes visible to user-space. For instance, with userfaultfd(2) and TLS, user-space can delay the execution of fork(2) for an indefinite amount of time (and will, if this causes network access, or similar). By recording the start_time late, it much closer reflects the point in time where the process becomes live and can be observed by other processes. Lastly, this makes it much harder for user-space to predict and control the start_time they get assigned. Previously, user-space could fork a process and stall it in copy_thread_tls() before its pid is allocated, but after its start_time is recorded. This can be misused to later-on cycle through PIDs and resume the stalled fork(2) yielding a process that has the same pid and start_time as a process that existed before. This can be used to circumvent security systems that identify processes by their pid+start_time combination. Even though user-space was always aware that start_time recording is flaky (but several projects are known to still rely on start_time-based identification), changing the start_time to be recorded late will help mitigate existing attacks and make it much harder for user-space to control the start_time a process gets assigned. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2019-01-13 09:51:04 +01:00
Satya Durga Srinivasu Prabhala	7ebdf76d85	sched: Add snapshot of Window Assisted Load Tracking (WALT) This snapshot is taken from msm-4.14 as of commit 871eac76e6be567 ("sched: Improve the scheduler"). Change-Id: Ib4e0b39526d3009cedebb626ece5a767d8247846 Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>	2019-01-02 10:56:07 -08:00
Rishabh Bhatnagar	bfee6d7f04	Merge remote-tracking branch 'origin/tmp-11da3a7' into msm-kona * origin/tmp-11da3a7: Linux 4.19-rc3 kbuild: modules_install: warn when missing System.map file x86/mm: Use WRITE_ONCE() when setting PTEs x86/apic/vector: Make error return value negative afs: Fix cell specification to permit an empty address list KVM: LAPIC: Fix pv ipis out-of-bounds access KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 arm64: KVM: Remove pgd_lock KVM: Remove obsolete kvm_unmap_hva notifier backend arm64: KVM: Only force FPEXC32_EL2.EN if trapping FPSIMD KVM: arm/arm64: Clean dcache to PoC when changing PTE due to CoW i2c: xiic: Record xilinx i2c with Zynq fragment clocksource: Revert "Remove kthread" i2c: xiic: Make the start and the byte count write atomic irqchip/gic-v3-its: Cap lpi_id_bits to reduce memory footprint block: bfq: swap puts in bfqg_and_blkg_put memory: ti-aemif: fix a potential NULL-pointer dereference arm64: fix erroneous warnings in page freeing functions firmware: arm_scmi: fix divide by zero when sustained_perf_level is zero printk/tracing: Do not trace printk_nmi_enter() rbd: support cloning across namespaces rbd: factor out get_parent_info() ceph: avoid a use-after-free in ceph_destroy_options() cpu/hotplug: Prevent state corruption on error rollback cpu/hotplug: Adjust misplaced smb() in cpuhp_thread_fun() x86/process: Don't mix user/kernel regs in 64bit __show_regs() x86/tsc: Prevent result truncation on 32bit ACPI / LPSS: Force LPSS quirks on boot ACPI / bus: Only call dmi_check_system() on X86 block: don't warn when doing fsync on read-only devices hwmon: rpi: add module alias to raspberrypi-hwmon tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints nds32: linker script: GCOV kernel may refers data in __exit nilfs2: convert to SPDX license tags drivers/dax/device.c: convert variable to vm_fault_t type lib/Kconfig.debug: fix three typos in help text checkpatch: add __ro_after_init to known $Attribute mm: fix BUG_ON() in vmf_insert_pfn_pud() from VM_MIXEDMAP removal uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name memory_hotplug: fix kernel_panic on offline page processing checkpatch: add optional static const to blank line declarations test ipc/shm: properly return EIDRM in shm_lock() mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported. mm/util.c: improve kvfree() kerneldoc tools/vm/page-types.c: fix "defined but not used" warning tools/vm/slabinfo.c: fix sign-compare warning kmemleak: always register debugfs file mm: respect arch_dup_mmap() return value mm, oom: fix missing tlb_finish_mmu() in __oom_reap_task_mm(). mm: memcontrol: print proper OOM header when no eligible victim left ARC: don't check for HIGHMEM pages in arch_dma_alloc ARC: IOC: panic if both IOC and ZONE_HIGHMEM enabled ARC: dma [IOC] Enable per device io coherency net: phy: sfp: Handle unimplemented hwmon limits and alarms net: sched: action_ife: take reference to meta module act_ife: fix a potential use-after-free net/mlx5: Fix SQ offset in QPs with small RQ nbd: don't allow invalid blocksize settings i2c: i801: fix DNV's SMBCTRL register offset KVM: s390: Properly lock mm context allow_gmap_hpage_1m setting KVM: s390: vsie: copy wrapping keys to right place KVM: s390: Fix pfmf and conditional skey emulation nds32: fix build error because of wrong semicolon nds32: Fix a kernel panic issue because of wrong frame pointer access. nds32: Only print one page of stack when die to prevent printing too much information. nds32: Add macro definition for offset of lp register on stack nds32: Remove the deprecated ABI implementation nds32/stack: Get real return address by using ftrace_graph_ret_addr nds32/ftrace: Support dynamic function graph tracer nds32/ftrace: Support dynamic function tracer nds32/ftrace: Add RECORD_MCOUNT support nds32/ftrace: Support static function graph tracer nds32/ftrace: Support static function tracer nds32: Extract the checking and getting pointer to a macro nds32: Clean up the coding style nds32: Fix get_user/put_user macro expand pointer problem nds32: Fix empty call trace nds32: add NULL entry to the end of_device_id array nds32: fix logic for module tipc: correct spelling errors for tipc_topsrv_queue_evt() comments tipc: correct spelling errors for struct tipc_bc_base's comment bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA. bnxt_en: Clean up unused functions. bnxt_en: Fix firmware signaled resource change logic in open. sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel sctp: fix invalid reference to the index variable of the iterator net/ibm/emac: wrong emac_calc_base call was used by typo net: sched: null actions array pointer before releasing action drm/i915/dp_mst: Fix enabling pipe clock for all streams drm/i915/dsc: Fix PPS register definition macros for 2nd VDSC engine drm/i915: Re-apply "Perform link quality check, unconditionally during long pulse" vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition r8169: add support for NCube 8168 network card ip6_tunnel: respect ttl inherit for ip6tnl ALSA: hda: Fix several mismatch for register mask and value apparmor: fix bad debug check in apparmor_secid_to_secctx() ALSA: rawmidi: Initialize allocated buffers fsnotify: fix ignore mask logic in fsnotify() timekeeping: Fix declaration of read_persistent_wall_and_boot_offset() x86: Fix kernel-doc atomic.h warnings mac80211: shorten the IBSS debug messages mac80211: don't Tx a deauth frame if the AP forbade Tx mac80211: Fix station bandwidth setting after channel switch mac80211: fix a race between restart and CSA flows mac80211: fix WMM TXOP calculation cfg80211: fix a type issue in ieee80211_chandef_to_operating_class() mac80211: fix an off-by-one issue in A-MSDU max_subframe computation drm/i915/gvt: Give new born vGPU higher scheduling chance cifs: connect to servername instead of IP for IPC$ share smb3: check for and properly advertise directory lease support smb3: minor debugging clarifications in rfc1001 len processing SMB3: Backup intent flag missing for directory opens with backupuid mounts fs/cifs: don't translate SFM_SLASH (U+F026) to backslash m68k: fix early memory reservation for ColdFire MMU systems uapi: Fix linux/rds.h userspace compilation errors. net: cadence: Fix a sleep-in-atomic-context bug in macb_halt_tx() i2c: imx-lpi2c: Remove mx8dv compatible entry dt-bindings: imx-lpi2c: Remove mx8dv compatible entry i2c: uniphier-f: issue STOP only for last message or I2C_M_STOP i2c: uniphier: issue STOP only for last message or I2C_M_STOP net/ipv6: Only update MTU metric if it set net: ethernet: cpsw-phy-sel: prefer phandle for phy sel dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle igmp: fix incorrect unsolicit report count after link down and up igmp: fix incorrect unsolicit report count when join group bpf: avoid misuse of psock when TCP_ULP_BPF collides with another ULP tools/bpf: bpftool, add xskmap in map types bpf: Fix bpf_msg_pull_data() kbuild: make missing $DEPMOD a Warning instead of an Error kconfig: do not require pkg-config on make {menu,n}config x86/microcode: Update the new microcode revision unconditionally x86/microcode: Make sure boot_cpu_data.microcode is up-to-date of/platform: initialise AMBA default DMA masks sparc: set a default 32-bit dma mask for OF devices ipv6: don't get lwtstate twice in ip6_rt_copy_init() random: make CPU trust a boot parameter kernel/dma/direct: take DMA offset into account in dma_direct_supported ibmvnic: Include missing return code checks in reset function selftests: pmtu: detect correct binary to ping ipv6 addresses selftests: pmtu: maximum MTU for vti4 is 2^16-1-20 tcp: do not restart timewait timer on rst reception net/rds: RDS is not Radio Data System hv_netvsc: Fix a deadlock by getting rtnl lock earlier in netvsc_probe() nfp: wait for posted reconfigs when disabling the device Revert "packet: switch kvzalloc to allocate memory" md-cluster: release RESYNC lock after the last resync message RAID10 BUG_ON in raise_barrier when force is true and conf->barrier is 0 md/raid5-cache: disable reshape completely blkcg: use tryget logic when associating a blkg with a bio blkcg: delay blkg destruction until after writeback has finished Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" ARC: dma [IOC]: mark DMA devices connected as dma-coherent ARC: atomics: unbork atomic_fetch_##op() MIPS: VDSO: Match data page cache colouring when D$ aliases kconfig: remove a spurious self-assignment scripts/setlocalversion: git: Make -dirty check more robust gpio: Fix crash due to registration race arc: remove redundant GCC version checks tools/kvm_stat: re-animate display of dead guests tools/kvm_stat: indicate dead guests as such tools/kvm_stat: handle guest removals more gracefully tools/kvm_stat: don't reset stats when setting PID filter for debugfs tools/kvm_stat: fix updates for dead guests tools/kvm_stat: fix handling of invalid paths in debugfs provider tools/kvm_stat: fix python3 issues KVM: x86: Unexport x86_emulate_instruction() KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() KVM: x86: Do not re-{try,execute} after failed emulation in L2 KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE KVM: x86: Invert emulation re-execute behavior to make it opt-in KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr KVM: SVM: remove unused variable dst_vaddr_end KVM: nVMX: avoid redundant double assignment of nested_run_pending ALSA: hda - Fix cancel_work_sync() stall from jackpoll work mac80211: always account for A-MSDU header changes mac80211: do not convert to A-MSDU if frag/subframe limited cfg80211: nl80211_update_ft_ies() to validate NL80211_ATTR_IE tc-testing: add test-cases for numeric and invalid control action net_sched: reject unknown tcfa_action values net: mvpp2: initialize port of_node pointer drm/i915/gvt: Fix drm_format_mod value for vGPU plane drm/i915/gvt: move intel_runtime_pm_get out of spin_lock in stop_schedule drm/i915/gvt: Handle GEN9_WM_CHICKEN3 with F_CMD_ACCESS. drm/i915/gvt: Make correct handling to vreg BXT_PHY_CTL_FAMILY drm/i915/gvt: emulate gen9 dbuf ctl register access net: bcmgenet: use MAC link status for fixed phy net: stmmac: build the dwmac-socfpga platform driver for Stratix10 net: rtnl: return early from rtnl_unregister_all when protocol isn't registered ipv6: fix cleanup ordering for pingv6 registration ipv6: fix cleanup ordering for ip6_mr failure net/sched: act_pedit: fix dump of extended layered op sh_eth: Add R7S9210 support net: hns: add netif_carrier_off before change speed and duplex net: hns: add the code for cleaning pkt in chip r8169: set RxConfig after tx/rx is enabled for RTL8169sb/8110sb devices tipc: switch to rhashtable iterator Revert "net: stmmac: Do not keep rearming the coalesce timer in stmmac_xmit" tipc: fix a missing rhashtable_walk_exit() vti6: remove !skb->ignore_df check from vti6_xmit() bpf: fix sg shift repair start offset in bpf_msg_pull_data bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data gpio: dwapb: Fix error handling in dwapb_gpio_probe() gpiolib-acpi: Register GpioInt ACPI event handlers from a late_initcall gpiolib: acpi: Switch to cansleep version of GPIO library call mac80211: avoid kernel panic when building AMSDU from non-linear SKB mac80211: mesh: fix HWMP sequence numbering to follow standard gpio: adp5588: Fix sleep-in-atomic-context bug bpf: fix several offset tests in bpf_msg_pull_data nl80211: Pass center frequency in kHz instead of MHz nl80211: Fix nla_put_u8 to u16 for NL80211_WMMR_TXOP mac80211_hwsim: Fix possible Spectre-v1 for hwsim_world_regdom_custom mac80211: don't update the PM state of a peer upon a multicast frame cfg80211: make wmm_rule part of the reg_rule structure mac80211_hwsim: correct use of IEEE80211_VHT_CAP_RXSTBC_X mac80211: correct use of IEEE80211_VHT_CAP_RXSTBC_X bpf: sockmap, decrement copied count correctly in redirect error case bpf: fix build error with clang bpf, sockmap: fix psock refcount leak in bpf_tcp_recvmsg bpf, sockmap: fix potential use after free in bpf_tcp_close net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6 net: dsa: Drop GPIO includes tipc: fix the big/little endian issue in tipc_dest net: sched: return -ENOENT when trying to remove filter from non-existent chain net: sched: fix extack error message when chain is failed to be created erspan: set erspan_ver to 1 by default when adding an erspan dev sctp: remove useless start_fail from sctp_ht_iter in proc sctp: hold transport before accessing its asoc in sctp_transport_get_next scsi: aacraid: fix a signedness bug Revert "scsi: core: avoid host-wide host_busy counter for scsi_mq" Revert "scsi: core: fix scsi_host_queue_ready" scsi: libata: Add missing newline at end of file scsi: target: iscsi: cxgbit: use pr_debug() instead of pr_info() scsi: hpsa: limit transfer length to 1MB, not 512kB scsi: lpfc: Correct MDS diag and nvmet configuration scsi: lpfc: Default fdmi_on to on scsi: csiostor: fix incorrect port capabilities scsi: csiostor: add a check for NULL pointer after kmalloc() scsi: documentation: add scsi_mod.use_blk_mq to scsi-parameters scsi: core: Update SCSI_MQ_DEFAULT help text to match default ARC: sort Kconfig ARC: cleanup show_faulting_vma() ARC: [plat-axs]: Enable SWAP ARC: [plat-axs/plat-hsdk]: Allow U-Boot to pass MAC-address to the kernel ARC: configs: cleanup arm64: allwinner: dts: h6: fix Pine H64 MMC bus width btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu btrfs: use after free in btrfs_quota_enable btrfs: btrfs_shrink_device should call commit transaction at the end btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata Btrfs: fix data corruption when deduplicating between different files Btrfs: sync log after logging new name cfg80211: remove division by size of sizeof(struct ieee80211_wmm_rule) KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space mac80211_hwsim: require at least one channel KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix() mac80211: Run TXQ teardown code before de-registering interfaces rfkill-gpio: include linux/mod_devicetable.h Change-Id: Ic6d1654e67ece823a5fce6ae18d241ad350bfb08 Signed-off-by: Rishabh Bhatnagar <rishabhb@codeaurora.org>	2018-09-14 11:24:01 -07:00
Nadav Amit	1ed0cc5a01	mm: respect arch_dup_mmap() return value Commit `d70f2a14b7` ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") ignored the return value of arch_dup_mmap(). As a result, on x86, a failure to duplicate the LDT (e.g. due to memory allocation error) would leave the duplicated memory mapping in an inconsistent state. Fix by using the return value, as it was before the change. Link: http://lkml.kernel.org/r/20180823051229.211856-1-namit@vmware.com Fixes: `d70f2a14b7` ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-09-04 16:45:02 -07:00
Sultan Alsawaf	9c752f5721	ANDROID: Fix massive cpufreq_times memory leaks Every time _cpu_up() is called for a CPU, idle_thread_get() is called which then re-initializes a CPU's idle thread that was already previously created and cached in a global variable in smpboot.c. idle_thread_get() calls init_idle() which then calls __sched_fork(). __sched_fork() is where cpufreq_task_times_init() is, and cpufreq_task_times_init() allocates memory for the task struct's time_in_state array. Since idle_thread_get() reuses a task struct instance that was already previously created, this means that every time it calls init_idle(), cpufreq_task_times_init() allocates this array again and overwrites the existing allocation that the idle thread already had. This causes memory to be leaked every time a CPU is onlined. In order to fix this, move allocation of time_in_state into _do_fork to avoid allocating it at all for idle threads. The cpufreq times interface is intended to be used for tracking userspace tasks, so we can safely remove it from the kernel's idle threads without killing any functionality. But that's not all! Task structs can be freed outside of release_task(), which creates another memory leak because a task struct can be freed without having its cpufreq times allocation freed. To fix this, free the cpufreq times allocation at the same time that task struct allocations are freed, in free_task(). Since free_task() can also be called in error paths of copy_process() after dup_task_struct(), set time_in_state to NULL immediately after calling dup_task_struct() to avoid possible double free. Bug description and fix adapted from patch submitted by Sultan Alsawaf <sultanxda@gmail.com> at https://android-review.googlesource.com/c/kernel/msm/+/700134 Bug: 110044919 Test: Hikey960 builds, boots & reports /proc/<pid>/time_in_state correctly Change-Id: I12fe7611fc88eb7f6c39f8f7629ad27b6ec4722c Signed-off-by: Connor O'Brien <connoro@google.com>	2018-08-28 17:10:42 +05:30
Linus Torvalds	cd9b44f907	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: - the rest of MM - procfs updates - various misc things - more y2038 fixes - get_maintainer updates - lib/ updates - checkpatch updates - various epoll updates - autofs updates - hfsplus - some reiserfs work - fatfs updates - signal.c cleanups - ipc/ updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits) ipc/util.c: update return value of ipc_getref from int to bool ipc/util.c: further variable name cleanups ipc: simplify ipc initialization ipc: get rid of ids->tables_initialized hack lib/rhashtable: guarantee initial hashtable allocation lib/rhashtable: simplify bucket_table_alloc() ipc: drop ipc_lock() ipc/util.c: correct comment in ipc_obtain_object_check ipc: rename ipcctl_pre_down_nolock() ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid() ipc: reorganize initialization of kern_ipc_perm.seq ipc: compute kern_ipc_perm.id under the ipc lock init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp adfs: use timespec64 for time conversion kernel/sysctl.c: fix typos in comments drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md fork: don't copy inconsistent signal handler state to child signal: make get_signal() return bool signal: make sigkill_pending() return bool ...	2018-08-22 12:34:08 -07:00
Jann Horn	06e62a46bb	fork: don't copy inconsistent signal handler state to child Before this change, if a multithreaded process forks while one of its threads is changing a signal handler using sigaction(), the memcpy() in copy_sighand() can race with the struct assignment in do_sigaction(). It isn't clear whether this can cause corruption of the userspace signal handler pointer, but it definitely can cause inconsistency between different fields of struct sigaction. Take the appropriate spinlock to avoid this. I have tested that this patch prevents inconsistency between sa_sigaction and sa_flags, which is possible before this patch. Link: http://lkml.kernel.org/r/20180702145108.73189-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 10:52:51 -07:00
Dmitry Vyukov	a2e5144538	kernel/hung_task.c: allow to set checking interval separately from timeout Currently task hung checking interval is equal to timeout, as the result hung is detected anywhere between timeout and 2*timeout. This is fine for most interactive environments, but this hurts automated testing setups (syzbot). In an automated setup we need to strictly order CPU lockup < RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall is not detected as task hung and task hung is not detected as silent machine loss. The large variance in task hung detection timeout requires setting silent machine loss timeout to a very large value (e.g. if task hung is 3 mins, then silent loss need to be set to ~7 mins). The additional 3 minutes significantly reduce testing efficiency because usually we crash kernel within a minute, and this can add hours to bug localization process as it needs to do dozens of tests. Allow setting checking interval separately from timeout. This allows to set timeout to, say, 3 minutes, but checking interval to 10 secs. The interval is controlled via a new hung_task_check_interval_secs sysctl, similar to the existing hung_task_timeout_secs sysctl. The default value of 0 results in the current behavior: checking interval is equal to timeout. [akpm@linux-foundation.org: update hung_task_timeout_max's comment] Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 10:52:47 -07:00
Andrew Morton	a670468f5e	mm: zero out the vma in vma_init() Rather than in vm_area_alloc(). To ensure that the various oddball stack-based vmas are in a good state. Some of the callers were zeroing them out, others were not. Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-22 10:52:44 -07:00
Linus Torvalds	0214f46b3a	Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull core signal handling updates from Eric Biederman: "It was observed that a periodic timer in combination with a sufficiently expensive fork could prevent fork from every completing. This contains the changes to remove the need for that restart. This set of changes is split into several parts: - The first part makes PIDTYPE_TGID a proper pid type instead something only for very special cases. The part starts using PIDTYPE_TGID enough so that in __send_signal where signals are actually delivered we know if the signal is being sent to a a group of processes or just a single process. - With that prep work out of the way the logic in fork is modified so that fork logically makes signals received while it is running appear to be received after the fork completes" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits) signal: Don't send signals to tasks that don't exist signal: Don't restart fork when signals come in. fork: Have new threads join on-going signal group stops fork: Skip setting TIF_SIGPENDING in ptrace_init_task signal: Add calculate_sigpending() fork: Unconditionally exit if a fatal signal is pending fork: Move and describe why the code examines PIDNS_ADDING signal: Push pid type down into complete_signal. signal: Push pid type down into __send_signal signal: Push pid type down into send_signal signal: Pass pid type into do_send_sig_info signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task signal: Pass pid type into group_send_sig_info signal: Pass pid and pid type into send_sigqueue posix-timers: Noralize good_sigevent signal: Use PIDTYPE_TGID to clearly store where file signals will be sent pid: Implement PIDTYPE_TGID pids: Move the pgrp and session pid pointers from task_struct to signal_struct kvm: Don't open code task_pid in kvm_vcpu_ioctl pids: Compute task_tgid using signal->leader_pid ...	2018-08-21 13:47:29 -07:00
Shakeel Butt	d46eb14b73	fs: fsnotify: account fsnotify metadata to kmemcg Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-08-17 16:20:30 -07:00
Linus Torvalds	73ba2fb33c	Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...	2018-08-14 10:23:25 -07:00
Linus Torvalds	203b4fc903	Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Thomas Gleixner: - Make lazy TLB mode even lazier to avoid pointless switch_mm() operations, which reduces CPU load by 1-2% for memcache workloads - Small cleanups and improvements all over the place * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Remove redundant check for kmem_cache_create() arm/asm/tlb.h: Fix build error implicit func declaration x86/mm/tlb: Make clear_asid_other() static x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off() x86/mm/tlb: Always use lazy TLB mode x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Restructure switch_mm_irqs_off() x86/mm/tlb: Leave lazy TLB mode at page table free time mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids x86/mm: Add TLB purge to free pmd/pte page interfaces ioremap: Update pgtable free interfaces with addr x86/mm: Disable ioremap free page handling on x86-PAE	2018-08-13 16:29:35 -07:00
Eric W. Biederman	c3ad2c3b02	signal: Don't restart fork when signals come in. Wen Yang <wen.yang99@zte.com.cn> and majiang <ma.jiang@zte.com.cn> report that a periodic signal received during fork can cause fork to continually restart preventing an application from making progress. The code was being overly pessimistic. Fork needs to guarantee that a signal sent to multiple processes is logically delivered before the fork and just to the forking process or logically delivered after the fork to both the forking process and it's newly spawned child. For signals like periodic timers that are always delivered to a single process fork can safely complete and let them appear to logically delivered after the fork(). While examining this issue I also discovered that fork today will miss signals delivered to multiple processes during the fork and handled by another thread. Similarly the current code will also miss blocked signals that are delivered to multiple process, as those signals will not appear pending during fork. Add a list of each thread that is currently forking, and keep on that list a signal set that records all of the signals sent to multiple processes. When fork completes initialize the new processes shared_pending signal set with it. The calculate_sigpending function will see those signals and set TIF_SIGPENDING causing the new task to take the slow path to userspace to handle those signals. Making it appear as if those signals were received immediately after the fork. It is not possible to send real time signals to multiple processes and exceptions don't go to multiple processes, which means that that are no signals sent to multiple processes that require siginfo. This means it is safe to not bother collecting siginfo on signals sent during fork. The sigaction of a child of fork is initially the same as the sigaction of the parent process. So a signal the parent ignores the child will also initially ignore. Therefore it is safe to ignore signals sent to multiple processes and ignored by the forking process. Signals sent to only a single process or only a single thread and delivered during fork are treated as if they are received after the fork, and generally not dealt with. They won't cause any problems. V2: Added removal from the multiprocess list on failure. V3: Use -ERESTARTNOINTR directly V4: - Don't queue both SIGCONT and SIGSTOP - Initialize signal_struct.multiprocess in init_task - Move setting of shared_pending to before the new task is visible to signals. This prevents signals from comming in before shared_pending.signal is set to delayed.signal and being lost. V5: - rework list add and delete to account for idle threads v6: - Use sigdelsetmask when removing stop signals Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200447 Reported-by: Wen Yang <wen.yang99@zte.com.cn> and Reported-by: majiang <ma.jiang@zte.com.cn> Fixes: `4a2c7a7837` ("[PATCH] make fork() atomic wrt pgrp/session signals") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-08-09 13:07:01 -05:00
Jens Axboe	05b9ba4b55	Merge tag 'v4.18-rc6' into for-4.19/block2 Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>	2018-08-05 19:32:09 -06:00
Eric W. Biederman	924de3b8c9	fork: Have new threads join on-going signal group stops There are only two signals that are delivered to every member of a signal group: SIGSTOP and SIGKILL. Signal delivery requires every signal appear to be delivered either before or after a clone syscall. SIGKILL terminates the clone so does not need to be considered. Which leaves only SIGSTOP that needs to be considered when creating new threads. Today in the event of a group stop TIF_SIGPENDING will get set and the fork will restart ensuring the fork syscall participates in the group stop. A fork (especially of a process with a lot of memory) is one of the most expensive system so we really only want to restart a fork when necessary. It is easy so check to see if a SIGSTOP is ongoing and have the new thread join it immediate after the clone completes. Making it appear the clone completed happened just before the SIGSTOP. The calculate_sigpending function will see the bits set in jobctl and set TIF_SIGPENDING to ensure the new task takes the slow path to userspace. V2: The call to task_join_group_stop was moved before the new task is added to the thread group list. This should not matter as sighand->siglock is held over both the addition of the threads, the call to task_join_group_stop and do_signal_stop. But the change is trivial and it is one less thing to worry about when reading the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-08-03 20:20:14 -05:00
Josef Bacik	2c323017e3	blk-cgroup: clear the throttle queue on fork We were hitting a panic in production where we put too many times on the request queue. This is because we'd get the throttle_queue of the parent if we fork()'ed while we needed to be throttled, but we didn't have a reference on it. Instead just clear these flags on fork so the child doesn't pay for the sins of its father. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2018-08-01 09:16:04 -06:00
Kirill A. Shutemov	027232da7c	mm: introduce vma_init() Not all VMAs allocated with vm_area_alloc(). Some of them allocated on stack or in data segment. The new helper can be use to initialize VMA properly regardless where it was allocated. Link: http://lkml.kernel.org/r/20180724121139.62570-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-07-26 19:38:03 -07:00
Eric W. Biederman	7673bf553b	fork: Unconditionally exit if a fatal signal is pending In practice this does not change anything as testing for fatal_signal_pending and exiting for with an error code duplicates the work of the next clause which recalculates pending signals and then exits fork if any are pending. In both cases the pending signal will trigger the slow path when existing to userspace, and the fatal signal will cause do_exit to be called. The advantage of making this a separate test is that it makes it clear processing the fatal signal will terminate the fork, and it allows the rest of the signal logic to be updated without fear that this important case will be lost. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-07-23 08:01:10 -05:00
Eric W. Biederman	4ca1d3ee46	fork: Move and describe why the code examines PIDNS_ADDING Normally this would be something that would be handled by handling signals that are sent to a group of processes but in this case the forking process is not a member of the group being signaled. Thus special code is needed to prevent a race with pid namespaces exiting, and fork adding new processes within them. Move this test up before the signal restart just in case signals are also pending. Fatal conditions should take presedence over restarts. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-07-23 07:57:12 -05:00
Linus Torvalds	490fc05386	mm: make vm_area_alloc() initialize core fields Like vm_area_dup(), it initializes the anon_vma_chain head, and the basic mm pointer. The rest of the fields end up being different for different users, although the plan is to also initialize the 'vm_ops' field to a dummy entry. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-07-21 15:24:03 -07:00
Linus Torvalds	95faf6992d	mm: make vm_area_dup() actually copy the old vma data .. and re-initialize th eanon_vma_chain head. This removes some boiler-plate from the users, and also makes it clear why it didn't need use the 'zalloc()' version. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-07-21 14:48:45 -07:00
Linus Torvalds	3928d4f5ee	mm: use helper functions for allocating and freeing vm_area structs The vm_area_struct is one of the most fundamental memory management objects, but the management of it is entirely open-coded evertwhere, ranging from allocation and freeing (using kmem_cache_[z]alloc and kmem_cache_free) to initializing all the fields. We want to unify this in order to end up having some unified initialization of the vmas, and the first step to this is to at least have basic allocation functions. Right now those functions are literally just wrappers around the kmem_cache_*() calls. This is a purely mechanical conversion: # new vma: kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc() # copy old vma kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old) # free vma kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma) to the point where the old vma passed in to the vm_area_dup() function isn't even used yet (because I've left all the old manual initialization alone). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-07-21 13:48:51 -07:00
Eric W. Biederman	6883f81aac	pid: Implement PIDTYPE_TGID Everywhere except in the pid array we distinguish between a tasks pid and a tasks tgid (thread group id). Even in the enumeration we want that distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid we almost have an implementation of PIDTYPE_TGID in struct signal_struct. Add PIDTYPE_TGID as a first class member of the pid_type enumeration and into the pids array. Then remove the __PIDTYPE_TGID special case and the leader_pid in signal_struct. The net size increase is just an extra pointer added to struct pid and an extra pair of pointers of an hlist_node added to task_struct. The effect on code maintenance is the removal of a number of special cases today and the potential to remove many more special cases as PIDTYPE_TGID gets used to it's fullest. The long term potential is allowing zombie thread group leaders to exit, which will remove a lot more special cases in the code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-07-21 10:43:12 -05:00
Eric W. Biederman	2c4704756c	pids: Move the pgrp and session pid pointers from task_struct to signal_struct To access these fields the code always has to go to group leader so going to signal struct is no loss and is actually a fundamental simplification. This saves a little bit of memory by only allocating the pid pointer array once instead of once for every thread, and even better this removes a few potential races caused by the fact that group_leader can be changed by de_thread, while signal_struct can not. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2018-07-21 10:43:12 -05:00
Rik van Riel	c1a2f7f0c0	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids The mm_struct always contains a cpumask bitmap, regardless of CONFIG_CPUMASK_OFFSTACK. That means the first step can be to simplify things, and simply have one bitmask at the end of the mm_struct for the mm_cpumask. This does necessitate moving everything else in mm_struct into an anonymous sub-structure, which can be randomized when struct randomization is enabled. The second step is to determine the correct size for the mm_struct slab object from the size of the mm_struct (excluding the CPU bitmap) and the size the cpumask. For init_mm we can simply allocate the maximum size this kernel is compiled for, since we only have one init_mm in the system, anyway. Pointer magic by Mike Galbraith, to evade -Wstringop-overflow getting confused by the dynamically sized array. Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: luto@kernel.org Link: http://lkml.kernel.org/r/20180716190337.26133-2-riel@surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2018-07-17 09:35:30 +02:00
Tetsuo Handa	655c79bb40	mm: check for SIGKILL inside dup_mmap() loop As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. *** Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting dup_mmap() due to SIGKILL * * Aborting exit_mmap() due to NULL mmap *** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-15 07:55:24 +09:00
Linus Torvalds	050e9baa9d	Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables The changes to automatically test for working stack protector compiler support in the Kconfig files removed the special STACKPROTECTOR_AUTO option that picked the strongest stack protector that the compiler supported. That was all a nice cleanup - it makes no sense to have the AUTO case now that the Kconfig phase can just determine the compiler support directly. HOWEVER. It also meant that doing "make oldconfig" would now _disable_ the strong stackprotector if you had AUTO enabled, because in a legacy config file, the sane stack protector configuration would look like CONFIG_HAVE_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_NONE is not set # CONFIG_CC_STACKPROTECTOR_REGULAR is not set # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_STACKPROTECTOR_AUTO=y and when you ran this through "make oldconfig" with the Kbuild changes, it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version used to be disabled (because it was really enabled by AUTO), and would disable it in the new config, resulting in: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_CC_STACKPROTECTOR=y # CONFIG_CC_STACKPROTECTOR_STRONG is not set CONFIG_CC_HAS_SANE_STACKPROTECTOR=y That's dangerously subtle - people could suddenly find themselves with the weaker stack protector setup without even realizing. The solution here is to just rename not just the old RECULAR stack protector option, but also the strong one. This does that by just removing the CC_ prefix entirely for the user choices, because it really is not about the compiler support (the compiler support now instead automatially impacts _visibility_ of the options to users). This results in "make oldconfig" actually asking the user for their choice, so that we don't have any silent subtle security model changes. The end result would generally look like this: CONFIG_HAVE_CC_STACKPROTECTOR=y CONFIG_CC_HAS_STACKPROTECTOR_NONE=y CONFIG_STACKPROTECTOR=y CONFIG_STACKPROTECTOR_STRONG=y CONFIG_CC_HAS_SANE_STACKPROTECTOR=y where the "CC_" versions really are about internal compiler infrastructure, not the user selections. Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-14 12:21:18 +09:00
Linus Torvalds	d82991a868	Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull restartable sequence support from Thomas Gleixner: "The restartable sequences syscall (finally): After a lot of back and forth discussion and massive delays caused by the speculative distraction of maintainers, the core set of restartable sequences has finally reached a consensus. It comes with the basic non disputed core implementation along with support for arm, powerpc and x86 and a full set of selftests It was exposed to linux-next earlier this week, so it does not fully comply with the merge window requirements, but there is really no point to drag it out for yet another cycle" * 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq/selftests: Provide Makefile, scripts, gitignore rseq/selftests: Provide parametrized tests rseq/selftests: Provide basic percpu ops test rseq/selftests: Provide basic test rseq/selftests: Provide rseq library selftests/lib.mk: Introduce OVERRIDE_TARGETS powerpc: Wire up restartable sequences system call powerpc: Add syscall detection for restartable sequences powerpc: Add support for restartable sequences x86: Wire up restartable sequence system call x86: Add support for restartable sequences arm: Wire up restartable sequences system call arm: Add syscall detection for restartable sequences arm: Add restartable sequences support rseq: Introduce restartable sequences system call uapi/headers: Provide types_32_64.h	2018-06-10 10:17:09 -07:00
Yang Shi	88aa7cc688	mm: introduce arg_lock to protect arg_start\|end and env_start\|end in mm_struct mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start\|end and evn_start\|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start\|end, env_start\|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-06-07 17:34:34 -07:00
Linus Torvalds	8b5c6a3a49	Merge tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Another reasonable chunk of audit changes for v4.18, thirteen patches in total. The thirteen patches can mostly be broken down into one of four categories: general bug fixes, accessor functions for audit state stored in the task_struct, negative filter matches on executable names, and extending the (relatively) new seccomp logging knobs to the audit subsystem. The main driver for the accessor functions from Richard are the changes we're working on to associate audit events with containers, but I think they have some standalone value too so I figured it would be good to get them in now. The seccomp/audit patches from Tyler apply the seccomp logging improvements from a few releases ago to audit's seccomp logging; starting with this patchset the changes in /proc/sys/kernel/seccomp/actions_logged should apply to both the standard kernel logging and audit. As usual, everything passes the audit-testsuite and it happens to merge cleanly with your tree" [ Heh, except it had trivial merge conflicts with the SELinux tree that also came in from Paul - Linus ] * tag 'audit-pr-20180605' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Fix wrong task in comparison of session ID audit: use existing session info function audit: normalize loginuid read access audit: use new audit_context access funciton for seccomp_actions_logged audit: use inline function to set audit context audit: use inline function to get audit context audit: convert sessionid unset to a macro seccomp: Don't special case audited processes when logging seccomp: Audit attempts to modify the actions_logged sysctl seccomp: Configurable separator for the actions_logged string seccomp: Separate read and write code for actions_logged sysctl audit: allow not equal op for audit by executable audit: add syscall information to FEATURE_CHANGE records	2018-06-06 16:34:00 -07:00
Mathieu Desnoyers	d7822b1e24	rseq: Introduce restartable sequences system call Expose a new system call allowing each thread to register one userspace memory area to be used as an ABI between kernel and user-space for two purposes: user-space restartable sequences and quick access to read the current CPU number value from user-space. * Restartable sequences (per-cpu atomics) Restartables sequences allow user-space to perform update operations on per-cpu data without requiring heavy-weight atomic operations. The restartable critical sections (percpu atomics) work has been started by Paul Turner and Andrew Hunter. It lets the kernel handle restart of critical sections. [1] [2] The re-implementation proposed here brings a few simplifications to the ABI which facilitates porting to other architectures and speeds up the user-space fast path. Here are benchmarks of various rseq use-cases. Test hardware: arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading The following benchmarks were all performed on a single thread. * Per-CPU statistic counter increment getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 344.0 31.4 11.0 x86-64: 15.3 2.0 7.7 * LTTng-UST: write event 32-bit header, 32-bit payload into tracer per-cpu buffer getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 2502.0 2250.0 1.1 x86-64: 117.4 98.0 1.2 * liburcu percpu: lock-unlock pair, dereference, read/compare word getcpu+atomic (ns/op) rseq (ns/op) speedup arm32: 751.0 128.5 5.8 x86-64: 53.4 28.6 1.9 * jemalloc memory allocator adapted to use rseq Using rseq with per-cpu memory pools in jemalloc at Facebook (based on rseq 2016 implementation): The production workload response-time has 1-2% gain avg. latency, and the P99 overall latency drops by 2-3%. * Reading the current CPU number Speeding up reading the current CPU number on which the caller thread is running is done by keeping the current CPU number up do date within the cpu_id field of the memory area registered by the thread. This is done by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the current thread. Upon return to user-space, a notify-resume handler updates the current CPU value within the registered user-space memory area. User-space can then read the current CPU number directly from memory. Keeping the current cpu id in a memory area shared between kernel and user-space is an improvement over current mechanisms available to read the current CPU number, which has the following benefits over alternative approaches: - 35x speedup on ARM vs system call through glibc - 20x speedup on x86 compared to calling glibc, which calls vdso executing a "lsl" instruction, - 14x speedup on x86 compared to inlined "lsl" instruction, - Unlike vdso approaches, this cpu_id value can be read from an inline assembly, which makes it a useful building block for restartable sequences. - The approach of reading the cpu id through memory mapping shared between kernel and user-space is portable (e.g. ARM), which is not the case for the lsl-based x86 vdso. On x86, yet another possible approach would be to use the gs segment selector to point to user-space per-cpu data. This approach performs similarly to the cpu id cache, but it has two disadvantages: it is not portable, and it is incompatible with existing applications already using the gs segment selector for other purposes. Benchmarking various approaches for reading the current CPU number: ARMv7 Processor rev 4 (v7l) Machine model: Cubietruck - Baseline (empty loop): 8.4 ns - Read CPU from rseq cpu_id: 16.7 ns - Read CPU from rseq cpu_id (lazy register): 19.8 ns - glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns - getcpu system call: 234.9 ns x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz: - Baseline (empty loop): 0.8 ns - Read CPU from rseq cpu_id: 0.8 ns - Read CPU from rseq cpu_id (lazy register): 0.8 ns - Read using gs segment selector: 0.8 ns - "lsl" inline assembly: 13.0 ns - glibc 2.19-0ubuntu6 getcpu: 16.6 ns - getcpu system call: 53.9 ns - Speed (benchmark taken on v8 of patchset) Running 10 runs of hackbench -l 100000 seems to indicate, contrary to expectations, that enabling CONFIG_RSEQ slightly accelerates the scheduler: Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1 kernel parameter), with a Linux v4.6 defconfig+localyesconfig, restartable sequences series applied. * CONFIG_RSEQ=n avg.: 41.37 s std.dev.: 0.36 s * CONFIG_RSEQ=y avg.: 40.46 s std.dev.: 0.33 s - Size On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is 567 bytes, and the data size increase of vmlinux is 5696 bytes. [1] https://lwn.net/Articles/650333/ [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Watson <davejwatson@fb.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Chris Lameter <cl@linux.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Hunter <ahh@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Maurer <bmaurer@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-api@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com	2018-06-06 11:58:31 +02:00
Richard Guy Briggs	c0b0ae8a87	audit: use inline function to set audit context Recognizing that the audit context is an internal audit value, use an access function to set the audit context pointer for the task rather than reaching directly into the task struct to set it. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: merge fuzz in audit.h] Signed-off-by: Paul Moore <paul@paul-moore.com>	2018-05-14 17:45:21 -04:00
Kees Cook	e01e80634e	fork: unconditionally clear stack on fork One of the classes of kernel stack content leaks[1] is exposing the contents of prior heap or stack contents when a new process stack is allocated. Normally, those stacks are not zeroed, and the old contents remain in place. In the face of stack content exposure flaws, those contents can leak to userspace. Fixing this will make the kernel no longer vulnerable to these flaws, as the stack will be wiped each time a stack is assigned to a new process. There's not a meaningful change in runtime performance; it almost looks like it provides a benefit. Performing back-to-back kernel builds before: Run times: 157.86 157.09 158.90 160.94 160.80 Mean: 159.12 Std Dev: 1.54 and after: Run times: 159.31 157.34 156.71 158.15 160.81 Mean: 158.46 Std Dev: 1.46 Instead of making this a build or runtime config, Andy Lutomirski recommended this just be enabled by default. [1] A noisy search for many kinds of stack content leaks can be seen here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak I did some more with perf and cycle counts on running 100,000 execs of /bin/true. before: Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841 Mean: 221015379122.60 Std Dev: 4662486552.47 after: Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348 Mean: 217745009865.40 Std Dev: 5935559279.99 It continues to look like it's faster, though the deviation is rather wide, but I'm not sure what I could do that would be less noisy. I'm open to ideas! Link: http://lkml.kernel.org/r/20180221021659.GA37073@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-20 17:18:35 -07:00
Mark Rutland	3eda69c92d	kernel/fork.c: detect early free of a live mm KASAN splats indicate that in some cases we free a live mm, then continue to access it, with potentially disastrous results. This is likely due to a mismatched mmdrop() somewhere in the kernel, but so far the culprit remains elusive. Let's have __mmdrop() verify that the mm isn't live for the current task, similar to the existing check for init_mm. This way, we can catch this class of issue earlier, and without requiring KASAN. Currently, idle_task_exit() leaves active_mm stale after it switches to init_mm. This isn't harmful, but will trigger the new assertions, so we must adjust idle_task_exit() to update active_mm. Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-04-05 21:36:27 -07:00
Dominik Brodowski	9b32105ec6	kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare() Using this helper allows us to avoid the in-kernel calls to the sys_unshare() syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_unshare(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:16:06 +02:00
Dominik Brodowski	2de0db992d	mm: use do_futex() instead of sys_futex() in mm_release() sys_futex() is a wrapper to do_futex() which does not modify any values here: - uaddr, val and val3 are kept the same - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE. Therefore, val2 is always 0. - as utime is set to NULL, *timeout is NULL This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>	2018-04-02 20:15:02 +02:00
Andrew Morton	d34bc48f82	include/linux/sched/mm.h: re-inline mmdrop() As Peter points out, Doing a CALL+RET for just the decrement is a bit silly. Fixes: `d70f2a14b7` ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") Acked-by: Peter Zijlstra (Intel) <peterz@infraded.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-21 15:35:42 -08:00
Linus Torvalds	a2e5790d84	Merge branch 'akpm' (patches from Andrew) Merge misc updates from Andrew Morton: - kasan updates - procfs - lib/bitmap updates - other lib/ updates - checkpatch tweaks - rapidio - ubsan - pipe fixes and cleanups - lots of other misc bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits) Documentation/sysctl/user.txt: fix typo MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns MAINTAINERS: update various PALM patterns MAINTAINERS: update "ARM/OXNAS platform support" patterns MAINTAINERS: update Cortina/Gemini patterns MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern MAINTAINERS: remove ANDROID ION pattern mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors mm: docs: fix parameter names mismatch mm: docs: fixup punctuation pipe: read buffer limits atomically pipe: simplify round_pipe_size() pipe: reject F_SETPIPE_SZ with size over UINT_MAX pipe: fix off-by-one error when checking buffer limits pipe: actually allow root to exceed the pipe buffer limits pipe, sysctl: remove pipe_proc_fn() pipe, sysctl: drop 'min' parameter from pipe-max-size converter kasan: rework Kconfig settings crash_dump: is_kdump_kernel can be boolean kernel/mutex: mutex_is_locked can be boolean ...	2018-02-06 22:15:42 -08:00
Marcos Paulo de Souza	667b60946e	kernel/fork.c: add comment about usage of CLONE_FS flags and namespaces All other places that deals with namespaces have an explanation of why the restriction is there. The description added in this commit was based on commit `e66eded830` ("userns: Don't allow CLONE_NEWUSER \| CLONE_FS"). Link: http://lkml.kernel.org/r/20171112151637.13258-1-marcos.souza.org@gmail.com Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2018-02-06 18:32:45 -08:00

1 2 3 4 5 ...

919 Commits